CHAPTER 3

Prerequisites

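This chapter describes Lustre installation prerequisites and includes the following sections:

Preparing to Install Lustre
Using a Pre-Packaged Lustre Release
Environmental Requirements
Memory Requirements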


3.1 Preparing to Install Lustre

This section describes how to obtain Lustre and lists the supported configurations.

3.1.1 How to Get Lustre

The most recent versions of Lustre are available at the Sun Lustre download page.

The Lustre software is released under the GNU General Public License (GPL). We strongly recommend that you read the complete GPL and the release notes before downloading Lustre (if you have not done so already). The GPL and release notes can also be found at the aforementioned download page.

3.1.2 Supported Configurations

Lustre supports the following configurations:


Configuration Aspect   Supported Type

Operating systems      Red Hat Enterprise Linux 4 and 5
                       SuSE Linux Enterprise Server 9 and 10
                       Linux 2.4, and a higher kernel than 2.6.9

Platforms              IA-32, IA-64, x86-64
                       PowerPC architectures and mixed-endian clusters

Interconnect           TCP/IP
                       Quadrics Elan 3 and 4
                       Myri-10G and Myrinet-2000
                       Mellanox
                       InfiniBand (Voltaire, OpenIB and Silverstorm)


Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 clients with large pages (up to 64 KB pages) can run with i386 servers (4 KB pages). If you are running i386 clients with ia64 servers, you must compile the ia64 kernel with a 4 KB PAGE_SIZE (so the server page size is not larger than the client page size).
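The page-size rule above can be checked with a short shell sketch. The helper name page_size_ok is hypothetical (it is not a Lustre tool), and the 16 KB ia64 server page size in the second case is an assumed value chosen only to show a failing combination.

```shell
#!/bin/sh
# Sketch: check the rule "client PAGE_SIZE >= server PAGE_SIZE".
# page_size_ok is a hypothetical helper, not part of Lustre.
# Arguments: client page size in bytes, server page size in bytes.
page_size_ok() {
    client_bytes=$1
    server_bytes=$2
    [ "$client_bytes" -ge "$server_bytes" ]
}

# Page size of the local node in bytes; falls back to "unknown"
# if getconf is not available on this system.
local_page_size=$(getconf PAGESIZE 2>/dev/null || echo unknown)
echo "local page size: ${local_page_size} bytes"

# ia64 client with 64 KB pages, i386 server with 4 KB pages.
if page_size_ok 65536 4096; then
    echo "ia64 client (64 KB) with i386 server (4 KB): supported"
fi

# i386 client with 4 KB pages, ia64 server assumed to use 16 KB pages.
if ! page_size_ok 4096 16384; then
    echo "i386 client (4 KB) with ia64 server (16 KB): not supported"
fi
```

Running getconf PAGESIZE on one client and one server, then passing both values to the helper, is one way to confirm a mixed-architecture pairing before installation.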


3.2 Using a Pre-Packaged Lustre Release

Because building and installing Lustre is complex, we offer pre-packaged releases that cover several of the most common configurations.
A pre-packaged release consists of five different RPM packages (described below). Install these packages in the following order:

The source package is only required if you need to build your own modules (networking, for example) against the kernel source.



Caution - Lustre contains kernel modifications, which interact with your storage devices and may introduce security issues and data loss if not installed, configured or administered properly. Before using this software, exercise caution and back up ALL data.


3.2.1 Choosing a Pre-Packaged Kernel

The most suitable pre-packaged kernel depends largely on the combination of hardware and software on the systems where Lustre will be installed. Pre-packaged kernel releases are available at the Lustre download website.

3.2.2 Lustre Tools

The lustre-<release-ver>.rpm package, required for proper Lustre setup and monitoring, contains many tools. The most important tools are:

Another tool is LNET self-test, which helps site administrators confirm that LNET has been properly installed and configured. The self-test also confirms that LNET and the network software and hardware underlying it are performing according to expectations.

3.2.3 Other Required Software

Although we provide some tools and utilities, Lustre also requires several separate software tools to be installed.

3.2.3.1 Core-Required Tools

http://downloads.clusterfs.com/public/tools/e2fsprogs/latest



Note - This directory contains both SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL) versions of e2fsprogs.




Note - You may need to install e2fsprogs with rpm -ivh --force to override any dependency issues of your distribution. Lustre-patched e2fsprogs only needs to be installed on machines that mount backend (ldiskfs) filesystems, such as OSS, MDS and MGS nodes.


Another option is to:

1. Install db4-devel for your distribution (if it is not already installed).

2. Download e2fsprogs-1.40.2.cfs1-0redhat.src.rpm

3. Run:

# rpmbuild --rebuild e2fsprogs-1.40.2.cfs1-0redhat.src.rpm

4. Install the resulting RPMs.

3.2.3.2 High-Availability Software

If you plan to enable failover server functionality with Lustre (either on an OSS or an MDS), high-availability software must be added to your cluster software. Heartbeat is one of the better-known high-availability software packages.

Linux-HA (Heartbeat) supports redundant systems that access shared (common) storage over dedicated connectivity, and it can determine the general state of each system. For more information, see Failover.

3.2.3.3 Debugging Tools

Things inevitably go wrong: disks fail, packets get dropped, and software has bugs. When they do, it is useful to have debugging tools on hand to help figure out how and why a problem occurred.

In this regard, the most useful tool is GDB, coupled with crash. You can use these tools to investigate live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network.

For more information about these tools, see the following websites:


Tool         URL

GDB          http://www.gnu.org/software/gdb/gdb.html
crash        http://oss.missioncriticallinux.com/projects/crash/
netconsole   http://lwn.net/2001/0927/a/netconsole.php3
netdump      http://www.redhat.com/support/wpapers/redhat/netdump/



3.3 Environmental Requirements

When preparing to install Lustre, make sure the following environmental requirements are met.

3.3.1 SSH Access

Although not strictly required, it is often helpful to have remote SSH[1] access to all nodes in a cluster. Some Lustre configuration and monitoring scripts depend on SSH (or Pdsh[2]) access, but these scripts are not required to run Lustre.

3.3.2 Consistent Clocks

Lustre always uses the client clock for timestamps. If the machine clocks across the cluster are not in sync, Lustre should not break. However, unsynchronized clocks make it very difficult to debug multi-node issues or correlate logs. For this reason, we recommend that you keep machine clocks in sync as much as possible. The standard way to accomplish this is with the Network Time Protocol (NTP). All machines in your cluster should synchronize their time from a local time server (or servers) at a suitable interval. For more information about NTP, see:

http://www.ntp.org/
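As a rough way to see how far node clocks have drifted apart, the following sketch computes the spread between a set of epoch timestamps. The helper name max_skew is hypothetical, and the sample timestamps are invented for illustration; how the timestamps are collected from the nodes is left to the site.

```shell
#!/bin/sh
# Sketch: report the spread (newest minus oldest) of a list of epoch
# timestamps, one per node. max_skew is a hypothetical helper.
# At least one timestamp must be given.
max_skew() {
    min=$1
    max=$1
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then
            min=$t
        fi
        if [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    echo $((max - min))
}

# Invented sample values standing in for "date +%s" output from three nodes.
echo "clock skew: $(max_skew 1200000000 1200000003 1199999999) seconds"
```

The timestamps could be gathered by running date +%s on each node over SSH. Because the commands do not run at exactly the same instant, a spread of a second or two is not by itself evidence of a problem.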

3.3.3 Universal UID/GID

To maintain uniform file access permissions on all nodes in your cluster, use the same user IDs (UID) and group IDs (GID) on all clients. Like most cluster software, Lustre expects a common UID/GID on all cluster nodes.
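One way to spot inconsistent IDs is to compare passwd-format listings taken from two nodes. The sketch below assumes each listing has been saved to a file; the helper name uid_gid_mismatches is hypothetical.

```shell
#!/bin/sh
# Sketch: compare two passwd-format files (name:passwd:UID:GID:...) and
# print every user present in both whose UID or GID differs.
# uid_gid_mismatches is a hypothetical helper, not a Lustre tool.
uid_gid_mismatches() {
    awk -F: '
        # First file: remember UID:GID for each user name.
        NR == FNR { ids[$1] = $3 ":" $4; next }
        # Second file: report users whose UID:GID does not match.
        ($1 in ids) && (ids[$1] != ($3 ":" $4)) {
            print $1 ": " ids[$1] " vs " $3 ":" $4
        }
    ' "$1" "$2"
}
```

For example, save the output of getent passwd on two nodes as node1.passwd and node2.passwd (hypothetical file names) and run uid_gid_mismatches node1.passwd node2.passwd. Empty output means every shared account has the same UID and GID on both nodes.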

3.3.4 Choosing a Proper Kernel I/O Scheduler

One of the many functions of the Linux kernel (indeed, of any OS kernel) is to provide access to disk storage. The algorithm that decides how the kernel provides disk access is known as the "I/O scheduler" or "elevator." In the 2.6 kernel series, there are four interchangeable schedulers:


Scheduler   Description

cfq         "Completely Fair Queuing" makes a good default for most workloads
            on general-purpose servers. It is not a good choice for Lustre OSS
            nodes, however, as it introduces overhead and I/O latency.

as          "Anticipatory Scheduler" is best for workstations and other systems
            with slow, single-spindle storage. It is not at all good for OSS
            nodes, as it attempts to aggregate or batch requests in order to
            improve performance for slow disks.

deadline    "Deadline" is a relatively simple scheduler which tries to minimize
            I/O latency by re-ordering requests to improve performance. Best
            for OSS nodes with "simple" storage, that is, software RAID, JBOD,
            LVM, and so on.

noop        "NOOP" is the simplest scheduler of all, and is really just a
            single FIFO queue. It does not attempt to optimize I/O at all, and
            is best for OSS nodes that have high-performance storage, that is,
            DDN, Engenio, and so on. This scheduler may yield the best I/O
            performance if the storage controller has been carefully tuned for
            the I/O patterns of Lustre.


The above observations on the schedulers are only our best advice; we strongly suggest that you conduct local testing to ensure high performance with Lustre. Note that most distributions ship with either "cfq" or "as" configured as the default scheduler, so choosing an alternate scheduler is a necessary step in configuring Lustre for the best performance. The "cfq" and "as" schedulers should never be used for server platforms.

For more in-depth discussion on choosing an I/O scheduler algorithm for Linux, see:

3.3.5 Changing the I/O Scheduler

There are two ways to change the I/O scheduler: at boot time or, with newer kernels, at runtime. For all Linux kernels, appending elevator={noop|deadline} to the kernel boot string sets the I/O elevator.

With LILO, you can use the append keyword:

image=/boot/vmlinuz-2.6.14.2
label=14.2
append="elevator=deadline"
read-only
optional

With GRUB, append the string to the end of the kernel command:

title Fedora Core (2.6.9-5.0.3.EL_lustre.1.4.2custom)
root (hd0,0)
kernel /vmlinuz-2.6.9-5.0.3.EL_lustre.1.4.2custom ro root=/dev/VolGroup00/LogVol00 rhgb noapic quiet elevator=deadline

With newer Linux kernels, you can change the scheduler while the system is running[3]. If the file /sys/block/<DEVICE>/queue/scheduler exists (where <DEVICE> is the block device you wish to affect), it contains a list of available schedulers and can be used to switch schedulers.

In the following example, hda is the <DEVICE>:

[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq
[root@cfs2 ~]# echo deadline > /sys/block/hda/queue/scheduler
[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop anticipatory [deadline] cfq

For desktop use, the other schedulers (anticipatory and cfq) are better suited.
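The runtime procedure above can be wrapped in a small script when several devices need the same scheduler. This is only a sketch: active_scheduler and set_scheduler are hypothetical helper names, and the script assumes it is run as root on a kernel that provides the sysfs scheduler file.

```shell
#!/bin/sh
# Sketch: helpers around /sys/block/<DEVICE>/queue/scheduler.
# Both function names are hypothetical, not part of Lustre.

# Print the active scheduler, that is, the entry shown in brackets in a
# list such as "noop [anticipatory] deadline cfq".
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Switch each named block device to the given scheduler, skipping any
# device whose scheduler file is missing or not writable.
# Usage: set_scheduler <scheduler> <device>...
set_scheduler() {
    sched=$1
    shift
    for dev in "$@"; do
        f=/sys/block/$dev/queue/scheduler
        if [ -w "$f" ]; then
            echo "$sched" > "$f"
            echo "$dev: now using $(active_scheduler "$(cat "$f")")"
        else
            echo "$dev: cannot change scheduler (no writable $f)" >&2
        fi
    done
}

# Parsing the two listings from the example above:
active_scheduler "noop [anticipatory] deadline cfq"
active_scheduler "noop anticipatory [deadline] cfq"
```

On an OSS node, set_scheduler deadline hda would repeat the change shown in the transcript for the hda device. A change made this way does not persist across reboots, so the boot-string method is still needed for a permanent setting.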


3.4 Memory Requirements

This section describes the memory requirements of Lustre.

3.4.1 Determining the MDS’s Memory

Use the following factors to determine the MDS’s memory:

The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The default maximum number of locks for a compute node is 100*num_cores, and interactive clients can hold in excess of 10,000 locks at times. For the MDS, this works out to approximately 2 KB per file, including the Lustre DLM lock and kernel data structures for it, just for the current working set.

By default, 400 MB is used for the filesystem journal. Additional RAM is used for caching file data for the larger working set that is not actively in use by clients, but should be kept "hot" for improved access times. Having file data in cache can improve metadata performance by a factor of 10 or more compared to reading it from disk. Approximately 1.5 KB/file is needed to keep a file in cache.

For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients):

filesystem journal = 400 MB

1,000 * 4-core clients * 100 files/core * 2 KB = 800 MB

16 interactive clients * 10,000 files * 2 KB = 320 MB

1,600,000 file extra working set * 1.5 KB/file = 2,400 MB

These components total 3,920 MB, which suggests a minimum RAM size of 4 GB. Having more RAM is always prudent, given the relatively low cost of this single component compared to the total system cost.
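The arithmetic in this example can be reproduced for other cluster sizes with a short shell sketch. The function name mds_memory_mb is hypothetical. Like the example, it assumes 100 files per core on compute clients, 10,000 files per interactive client, 2 KB per file in the working set, 1.5 KB per cached file, 1 MB = 1,000 KB, and it subtracts only the compute clients' files from the total working set.

```shell
#!/bin/sh
# Sketch: estimate MDS RAM in MB using the per-file figures from the text.
# mds_memory_mb is a hypothetical helper, not a Lustre tool.
# Arguments: compute clients, cores per client, interactive clients,
#            total files in the working set.
mds_memory_mb() {
    clients=$1
    cores=$2
    interactive=$3
    working_set=$4

    journal_mb=400                                 # default journal size
    compute_files=$((clients * cores * 100))       # 100 files per core
    interactive_files=$((interactive * 10000))     # 10,000 files per node
    extra_files=$((working_set - compute_files))   # cached, not locked

    locks_kb=$(( (compute_files + interactive_files) * 2 ))   # 2 KB/file
    cache_kb=$((extra_files * 3 / 2))                         # 1.5 KB/file

    echo $((journal_mb + (locks_kb + cache_kb) / 1000))
}

# The example from the text: 1,000 4-core clients, 16 interactive
# nodes, and a 2 million file working set.
mds_memory_mb 1000 4 16 2000000
```

For the example values this prints 3920, matching the sum of the four components listed above.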

If there are directories containing 1 million or more files, you may benefit significantly from having more memory. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.

3.4.2 OSS Memory Requirements

When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. Although Lustre versions 1.4 and 1.6 do not cache file data in memory on the OSS node, there are a number of large memory consumers that need to be taken into account. Also consider that future Lustre versions will cache file data on the OSS node, so these calculations should only be taken as a minimum requirement.

By default, each Lustre ldiskfs filesystem has a 400 MB journal. This can pin up to an equal amount of RAM on the OSS node per filesystem. In addition, the service threads on the OSS node pre-allocate a 1 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request. Also, a reasonable amount of RAM needs to be available for filesystem metadata. While no hard limit can be placed on the amount of filesystem metadata, if more RAM is available, disk I/O is needed less often to retrieve the metadata. Finally, if you are using TCP or another network transport that uses system memory for send/receive buffers, this must also be taken into consideration.

Also, if the OSS nodes are to be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.

OSS Memory Usage for a 2 OST server (major consumers):

This consumes over 1,300 MB just for the pre-allocated buffers, and does not include memory for the OS or filesystem metadata. For a non-failover configuration, 2 GB of RAM would be the minimum. For a failover configuration, 3 GB of RAM would be the minimum.
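A rough estimate of the journal and I/O buffer memory can be sketched from the figures given earlier in this section: 400 MB of journal per OST, doubled for failover, plus 1 MB per ost_io service thread. The function name oss_buffer_mb is hypothetical, and the count of 512 ost_io threads is an assumed value chosen for illustration; it is not taken from the text.

```shell
#!/bin/sh
# Sketch: estimate journal plus I/O buffer memory on an OSS, in MB.
# oss_buffer_mb is a hypothetical helper, not a Lustre tool.
# Arguments: number of OSTs, number of ost_io threads,
#            failover flag (1 = node also serves as a failover target).
oss_buffer_mb() {
    osts=$1
    io_threads=$2
    failover=$3

    journal_mb=$((osts * 400))          # 400 MB journal per OST
    if [ "$failover" -eq 1 ]; then
        journal_mb=$((journal_mb * 2))  # double the journal RAM for failover
    fi

    # 1 MB pre-allocated I/O buffer per ost_io service thread.
    echo $((journal_mb + io_threads))
}

# 2 OSTs with an assumed 512 ost_io threads:
oss_buffer_mb 2 512 0    # non-failover
oss_buffer_mb 2 512 1    # failover
```

With these assumed inputs the sketch prints 1312 and 2112. Neither figure includes memory for the OS, filesystem metadata, or network buffers, which is why the recommended minimums are higher.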


[1] Secure SHell (SSH)
[2] Parallel Distributed SHell (Pdsh)
[3] Red Hat Enterprise Linux v3 Update 3 does not have this feature. It is present in the main Linux tree as of 2.6.15.