CHAPTER 3
Prerequisites
This chapter describes the prerequisites for installing Lustre.
The most recent versions of Lustre are available at the Sun Lustre download page.
The Lustre software is released under the GNU General Public License (GPL). We strongly recommend that you read the complete GPL and release notes before downloading Lustre (if you have not done so already). The GPL and release notes can also be found at the aforementioned websites.
Lustre supports the following configurations:
Red Hat Enterprise Linux 4 and 5
Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 clients with large pages (up to 64 kB pages) can run with i386 servers (4 kB pages). If you are running i386 clients with ia64 servers, you must compile the ia64 kernel with a 4 kB PAGE_SIZE (so the server page size is not larger than the client page size).
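The page-size rule above can be checked with a short shell sketch. The function names page_size_ok and local_page_size are illustrative and are not part of Lustre; the page sizes passed in the examples are the ia64 and i386 values mentioned above.

```shell
#!/bin/sh
# Succeed (exit status 0) when the client page size is at least as large
# as the server page size. Both arguments are page sizes in bytes.
# The body runs in a subshell so its variables stay local.
page_size_ok() (
    client_bytes=$1
    server_bytes=$2
    [ "$client_bytes" -ge "$server_bytes" ]
)

# Print the page size of the local node, in bytes.
# Run this on each client and server to collect the values to compare.
local_page_size() {
    getconf PAGE_SIZE
}

# ia64 client with 64 kB pages, i386 server with 4 kB pages.
if page_size_ok 65536 4096; then
    echo "client 65536 / server 4096: supported"
fi

# i386 client with 4 kB pages, ia64 server built with 64 kB pages.
if ! page_size_ok 4096 65536; then
    echo "client 4096 / server 65536: not supported"
fi
```

Collecting the value of local_page_size from every node before installation makes it easy to spot a server whose kernel needs to be rebuilt with a smaller PAGE_SIZE.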
Due to the complexity involved in building and installing Lustre, we offer pre-packaged releases that cover several of the most common configurations.
A pre-packaged release consists of five different RPM packages (described below). Install these packages in the following order:
The source package is only required if you need to build your own modules (networking, for example) against the kernel source.
Choosing the most suitable pre-packaged kernel depends largely on the combination of hardware and software being used where Lustre will be installed. Pre-packaged kernel releases are available at the Lustre download website.
The lustre-<release-ver>.rpm package, required for proper Lustre setup and monitoring, contains many tools. The most important tools are:
Another tool is LNET self-test, which helps site administrators confirm that LNET has been properly installed and configured. The self-test also confirms that LNET and the network software and hardware underlying it are performing according to expectations.
Although we provide some tools and utilities, Lustre also requires several separate software tools to be installed.
http://downloads.clusterfs.com/public/tools/e2fsprogs/latest
Note - This directory contains both SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL) versions of e2fsprogs.
1. Install db4-devel for your distribution (if it is not already installed).
2. Download e2fsprogs-1.40.2-cfs1-0redhat.src.rpm
3. Rebuild the source RPM:
# rpmbuild --rebuild e2fsprogs-1.40.2-cfs1-0redhat.src.rpm
4. Install the resulting RPMs.
If you plan to enable failover server functionality with Lustre (either on an OSS or an MDS), high-availability software must be added to your cluster software. Heartbeat is one of the better known high-availability software packages.
Linux-HA (Heartbeat) supports a redundant system with access to shared (common) storage over dedicated connectivity, and it can determine the general state of the system. For more information, see Failover.
Things inevitably go wrong: disks fail, packets get dropped, and software has bugs. When they do, it is useful to have debugging tools on hand to help figure out how and why a problem occurred.
In this regard, the most useful tool is GDB, coupled with crash. You can use these tools to investigate live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network.
For more information about these tools, see the following websites:
When preparing to install Lustre, make sure the following environmental requirements are met.
Although not strictly required, in many cases it is helpful to have remote SSH[1] access to all nodes in a cluster. Some Lustre configuration and monitoring scripts depend on SSH (or Pdsh[2]) access, although these are not required to run Lustre.
Lustre always uses the client clock for timestamps. If the machine clocks across the cluster are not in sync, Lustre should not break. However, unsynchronized clocks make it very difficult to debug multi-node issues or correlate logs. For this reason, we recommend that you keep machine clocks in sync as much as possible. The standard way to accomplish this is by using the Network Time Protocol (NTP). All machines in your cluster should synchronize their time from a local time server (or servers) at a suitable time interval. For more information about NTP, see:
To maintain uniform file access permissions on all nodes in your cluster, use the same user IDs (UID) and group IDs (GID) on all clients. As with most cluster software, Lustre expects common UIDs and GIDs on all cluster nodes.
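One way to check this requirement is to compare account listings taken from two nodes. The following is a minimal sketch, assuming each node's accounts have been saved in passwd format (for example with getent passwd); the function name uid_gid_mismatches and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Print accounts whose UID or GID differs between two passwd-format files
# (name:password:UID:GID:...). Accounts present in only one file are ignored.
uid_gid_mismatches() {
    awk -F: '
        # First file: remember UID:GID for each account name.
        NR == FNR { ids[$1] = $3 ":" $4; next }
        # Second file: report accounts whose UID:GID pair is different.
        ($1 in ids) && (ids[$1] != ($3 ":" $4)) {
            print $1 ": " ids[$1] " vs " $3 ":" $4
        }
    ' "$1" "$2"
}
```

For example, after saving the output of getent passwd on two nodes as passwd.node1 and passwd.node2, running uid_gid_mismatches passwd.node1 passwd.node2 prints one line per conflicting account and prints nothing when the nodes agree.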
One of the many functions of the Linux kernel (indeed of any OS kernel) is to provide access to disk storage. The algorithm that decides how the kernel provides disk access is known as the "I/O Scheduler," or "Elevator." In the 2.6 kernel series, there are four interchangeable schedulers:
The above observations on the schedulers are just our best advice. We strongly suggest that you conduct local testing to ensure high performance with Lustre. Also, note that most distributions ship with either “cfq” or “as” configured as the default scheduler. Choosing an alternate scheduler is a necessary step in configuring Lustre for the best performance. The “cfq” and “as” schedulers should never be used for server platforms.
For more in-depth discussion on choosing an I/O scheduler algorithm for Linux, see:
There are two ways to change the I/O scheduler: at boot time or, with newer kernels, at runtime. For all Linux kernels, appending elevator={noop|deadline} to the kernel boot string sets the I/O elevator.
With LILO, you can use the append keyword:
image=/boot/vmlinuz-2.6.14.2
    label=14.2
    append="elevator=deadline"
    read-only
    optional
With GRUB, append the string to the end of the kernel command:
title Fedora Core (2.6.9-5.0.3.EL_lustre.1.4.2custom)
    root (hd0,0)
    kernel /vmlinuz-2.6.9-5.0.3.EL_lustre.1.4.2custom ro root=/dev/VolGroup00/LogVol00 rhgb noapic quiet elevator=deadline
With newer Linux kernels, you can change the scheduler while running[3]. If the file /sys/block/<DEVICE>/queue/scheduler exists (where <DEVICE> is the block device you wish to affect), it contains a list of available schedulers and can be used to switch the schedulers.
[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq
[root@cfs2 ~]# echo deadline > /sys/block/hda/queue/scheduler
[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop anticipatory [deadline] cfq
For desktop use, the other schedulers (anticipatory and cfq) are better suited.
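The runtime switch shown above can be wrapped in a small script when several devices need the same setting. This is a sketch only: the function names active_scheduler and set_scheduler are illustrative, set_scheduler must be run as root, and it relies only on the /sys/block/<DEVICE>/queue/scheduler file described above.

```shell
#!/bin/sh
# Print the active scheduler from a line in the sysfs format, where the
# active entry is enclosed in square brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Switch one block device to the requested scheduler, but only if the
# sysfs file is writable and the scheduler is listed as available.
# The body runs in a subshell, so "exit" only ends the function.
set_scheduler() (
    dev=$1
    sched=$2
    file=/sys/block/$dev/queue/scheduler
    [ -w "$file" ] || { echo "$dev: no writable scheduler file" >&2; exit 1; }
    # Strip the brackets so the active entry can be matched like the others.
    case " $(tr -d '[]' < "$file") " in
        *" $sched "*) echo "$sched" > "$file" ;;
        *) echo "$dev: $sched is not available" >&2; exit 1 ;;
    esac
)
```

For example, set_scheduler hda deadline repeats the session above for the hda device, and active_scheduler "$(cat /sys/block/hda/queue/scheduler)" confirms the result. A setting made this way does not persist across reboots, so the boot-time method is still needed for a permanent change.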
This section describes the memory requirements of Lustre.
Use the following factors to determine the MDS’s memory:
The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The default maximum number of locks for a compute node is 100*num_cores, and interactive clients can hold in excess of 10,000 locks at times. For the MDS, this works out to approximately 2 KB per file, including the Lustre DLM lock and kernel data structures for it, just for the current working set.
By default, 400 MB is used for the filesystem journal. Additional RAM is used to cache file data for the larger working set that is not actively in use by clients but should be kept "hot" for improved access times. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk. Approximately 1.5 KB/file is needed to keep a file in cache.
For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients):
1000 * 4-core clients * 100 files/core * 2kB = 800 MB
16 interactive clients * 10,000 files * 2kB = 320 MB
1,600,000 file extra working set * 1.5kB/file = 2400 MB
This suggests a minimum RAM size of 4 GB, but having more RAM is always prudent given the relatively low cost of this single component compared to the total system cost.
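The arithmetic in this example can be sketched in shell so that it is easy to repeat for other cluster sizes. The sketch assumes that the 400 MB default journal is added to the three terms listed in the example and that 1 MB is counted as 1000 kB, as the example does; the function name mds_min_ram_mb and its argument order are illustrative.

```shell
#!/bin/sh
# Estimate MDS RAM in MB from the factors described in this section.
# Arguments: compute clients, cores per client, interactive clients,
#            number of extra working-set files to keep cached.
# The body runs in a subshell so its variables stay local.
mds_min_ram_mb() (
    journal_mb=400      # default filesystem journal size
    lock_kb=2           # about 2 kB per file in the current working set
    cache_kb_x10=15     # 1.5 kB per cached file, scaled by 10 for integer math
    # Compute clients hold up to 100 locks per core.
    compute_mb=$(( $1 * $2 * 100 * lock_kb / 1000 ))
    # Interactive clients are assumed to hold 10,000 locks each.
    interactive_mb=$(( $3 * 10000 * lock_kb / 1000 ))
    # Extra working set kept in cache but not locked by clients.
    cache_mb=$(( $4 * cache_kb_x10 / 10 / 1000 ))
    echo $(( journal_mb + compute_mb + interactive_mb + cache_mb ))
)

# The example cluster: 1,000 four-core clients, 16 interactive nodes,
# and 1,600,000 files in the extra working set.
mds_min_ram_mb 1000 4 16 1600000
```

With the example values, the three client terms come to 3,520 MB and the journal brings the total to 3,920 MB, which is in line with the 4 GB minimum suggested above.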
If there are directories containing 1 million or more files, you may benefit significantly from having more memory. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.
When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. Although Lustre versions 1.4 and 1.6 do not cache file data in memory on the OSS node, there are a number of large memory consumers that need to be taken into account. Also consider that future Lustre versions will cache file data on the OSS node, so these calculations should only be taken as a minimum requirement.
By default, each Lustre ldiskfs filesystem has 400 MB for the journal size. This can pin up to an equal amount of RAM on the OSS node per filesystem. In addition, the service threads on the OSS node pre-allocate a 1 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request. Also, a reasonable amount of RAM needs to be available for filesystem metadata. While no hard limit can be placed on the amount of filesystem metadata, if more RAM is available, disk I/O is needed less often to retrieve the metadata. Finally, if you are using TCP or another network transport that uses system memory for send/receive buffers, this must also be taken into consideration.
Also, if the OSS nodes are to be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.
OSS Memory Usage for a 2 OST server (major consumers):
This consumes over 1,300 MB just for the pre-allocated buffers, and does not include memory for the OS or filesystem metadata. For a non-failover configuration, 2 GB of RAM would be the minimum. For a failover configuration, 3 GB of RAM would be the minimum.
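The journal and I/O buffer terms described in this section can be combined in a rough shell sketch. It covers only those two consumers, so it does not reproduce the figure quoted above, which also includes other buffers. The function name oss_buffer_mb and the thread count of 256 used in the example are assumptions chosen for illustration.

```shell
#!/bin/sh
# Estimate the RAM (in MB) pinned by journals and ost_io buffers on an OSS.
# Arguments: number of OSTs, number of ost_io threads, failover (0 or 1).
# Network buffers, filesystem metadata, and the OS are NOT included.
# The body runs in a subshell so its variables stay local.
oss_buffer_mb() (
    journal_mb=400      # default journal size per ldiskfs filesystem
    io_buffer_mb=1      # pre-allocated I/O buffer per ost_io thread
    journals=$(( $1 * journal_mb ))
    # A failover node must also hold the journals of the OSTs it may take over.
    if [ "$3" -eq 1 ]; then
        journals=$(( journals * 2 ))
    fi
    echo $(( journals + $2 * io_buffer_mb ))
)

# Two OSTs with a hypothetical 256 ost_io threads, without and with failover.
oss_buffer_mb 2 256 0
oss_buffer_mb 2 256 1
```

Treat the result as a lower bound and add memory for network buffers, filesystem metadata, and the operating system before choosing a RAM size.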
Copyright © 2008 Sun Microsystems, Inc. All Rights Reserved.