CHAPTER 5
Configuring the Lustre Network
This chapter describes how to configure the Lustre network.
Before you configure Lustre, it is essential to have a clear understanding of the Lustre network topologies.
A network is a group of nodes that communicate directly with one another. As previously mentioned in this manual, Lustre supports a variety of network types and hardware, including TCP/IP, Elan, varieties of InfiniBand, Myrinet and others. The normal rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) would be considered two different Lustre networks.
Any node with appropriate interfaces can route LNET between different networks; the node may be a server, a client, or a standalone router. LNET can route across different network types (such as TCP-to-Elan) or across different topologies (such as bridging two InfiniBand or TCP/IP networks).
By default, LNET uses all interfaces for a given network type. If there are interfaces it should not use (such as administrative networks or IP over IB), then the interfaces it should use must be listed explicitly.
The LNET configuration is managed via module options, typically specified in /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). To help ease the maintenance of large clusters, it is possible to configure the networking setup for all nodes through a single, unified set of options in the modprobe.conf file on each node. For more information, see the ip2nets option in Modprobe.conf.
Users of liblustre should set the accept=all parameter. For details, see Module Parameters.
In their mount commands, clients use the NID of the MDS host to retrieve their configuration information. Since an MDS may have more than one NID, a client should use the appropriate NID for its local network. If you are unsure which NID to use, the following lctl commands can help.
lctl list_nids
This displays the server's NIDs.
lctl which_nid <NID list>
This displays the closest NID for the client.
From a client with SSH access to the MDS, run:
mds_nids=`ssh the_mds lctl list_nids`
lctl which_nid $mds_nids
This generally displays the correct NID to use for the MDS in the mount command.
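As a rough illustration of what "the appropriate NID for its local network" means, the Python sketch below splits NIDs of the form address@network and keeps those on a network the client is also attached to. It is a simplification for illustration only and does not reproduce the logic of lctl which_nid. The function names and the sample NID 192.168.0.2@tcp0 are hypothetical.

```python
def parse_nid(nid):
    """Split a NID written as 'address@network' into (address, network)."""
    address, sep, network = nid.rpartition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    return address, network


def nids_on_local_networks(remote_nids, local_networks):
    """Keep only the remote NIDs whose network this node is also on.

    Assumption: network names are compared as plain strings.
    """
    return [nid for nid in remote_nids if parse_nid(nid)[1] in local_networks]


# Hypothetical MDS with one TCP NID and one Elan NID.
mds_nids = ["192.168.0.2@tcp0", "2@elan0"]

# A client attached only to elan0 would mount using the Elan NID.
print(nids_on_local_networks(mds_nids, {"elan0"}))
```

On a real system, prefer the lctl commands above; the sketch only shows why a client on one network should not use a NID that belongs to another.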
This section describes how to configure your Lustre network.
LNET network hardware and routing are configured via module parameters of the LNET and LND-specific modules. Parameters should be specified in the /etc/modprobe.conf or /etc/modules.conf file, for example:
options lnet networks=tcp0,elan0
This specifies that this node should use all available TCP and Elan interfaces.
All LNET routers that bridge two networks are equivalent: none is configured as primary or secondary, and load is balanced across all available routers. Router fault tolerance only works from Linux nodes. For this, LNET routing must correspond exactly with the Linux nodes' map of alive routers. There is no hard limit on the number of LNET routers.
Note - By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. In this case, specify all Lustre networks.
The liblustre network parameters may be set by exporting the environment variables LNET_NETWORKS, LNET_IP2NETS and LNET_ROUTES. Each of these variables uses the same parameters as the corresponding modprobe option.
It is very important that a liblustre client includes ALL the routers in its setting of LNET_ROUTES. A liblustre client cannot accept connections; it can only create them. If a server sends remote procedure call (RPC) replies via a router to which the liblustre client has not already connected, then these RPC replies are lost.
Note - liblustre is not for general use. It was created to work with specific hardware (Cray) and should never be used with other hardware.
For the SilverStorm/Infinicon InfiniBand LND (iiblnd), the network and HCA may be specified, as in this example:
options lnet networks="iib3(2)"
This specifies that this node is on iib network number 3, using HCA[2] == ib3.
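To make the shape of the networks value concrete, here is a small Python sketch that splits values such as tcp0,elan0 or iib3(2) into network names and the optional parenthesized interface or HCA list. It is an illustrative parser written for this chapter's examples, not the parser LNET uses; the function name parse_networks and the comma-separated handling inside the parentheses are assumptions.

```python
import re

# A network name, optionally followed by "(...)" naming interfaces or an HCA.
_NET = re.compile(r"([A-Za-z][A-Za-z0-9]*)(?:\(([^)]*)\))?")


def parse_networks(value):
    """Parse a 'networks=' value into a list of (network, [interfaces])."""
    nets = []
    pos = 0
    while pos < len(value):
        match = _NET.match(value, pos)
        if match is None:
            raise ValueError(f"cannot parse {value!r} at offset {pos}")
        name, inner = match.group(1), match.group(2)
        # Assumption: several interfaces may be listed, separated by commas.
        nets.append((name, inner.split(",") if inner else []))
        pos = match.end()
        if pos < len(value):
            if value[pos] != ",":
                raise ValueError(f"expected ',' at offset {pos} in {value!r}")
            pos += 1
    return nets


print(parse_networks("tcp0,elan0"))  # [('tcp0', []), ('elan0', [])]
print(parse_networks("iib3(2)"))     # [('iib3', ['2'])]
```

An empty interface list corresponds to the default described earlier, where LNET uses all interfaces of that network type.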
If you are using zeroconf (mount -t lustre), add a line to your modules.conf as follows:
post-install portals sysctl -w lnet.debug=0x3f0400
This sets the debug level to the value you specify, whenever the portals module is loaded.
Note - The above value is the default value in Lustre. It provides useful information for diagnosing problems without materially impairing performance.
The following parameter specifies a semicolon-separated list of router definitions. Each route is defined as a network type, followed by a list of routers.
routes=<net type> <router NID(s)>
options lnet 'networks="o2ib0"' 'routes="tcp0 192.168.10.[1-8]@o2ib0"'
This is an example for IB clients to access TCP servers via 8 IB-TCP routers.
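The bracketed range in a router NID stands for several routers. The Python sketch below expands a routes value of the form shown above into an explicit list of router NIDs per destination network. It handles only the "<net type> <router NID(s)>" form with [low-high] ranges used in this chapter; the function names are hypothetical, and this is not LNET's own parser.

```python
import re

_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_nid(pattern):
    """Expand every [low-high] range in a NID pattern into concrete NIDs."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    low, high = int(match.group(1)), int(match.group(2))
    expanded = []
    for number in range(low, high + 1):
        candidate = pattern[:match.start()] + str(number) + pattern[match.end():]
        expanded.extend(expand_nid(candidate))  # handle any further ranges
    return expanded


def parse_routes(value):
    """Map each destination network to the list of router NIDs serving it."""
    routes = {}
    for entry in value.split(";"):
        fields = entry.split()
        if not fields:
            continue
        net, router_patterns = fields[0], fields[1:]
        for pattern in router_patterns:
            routes.setdefault(net, []).extend(expand_nid(pattern))
    return routes


routes = parse_routes("tcp0 192.168.10.[1-8]@o2ib0")
print(len(routes["tcp0"]))  # 8
print(routes["tcp0"][0])    # 192.168.10.1@o2ib0
```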
This is a more complicated example:
options lnet 'ip2nets="tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"' \
  'routes="tcp 192.168.10.[1-8]@o2ib0; o2ib 10.10.0.[1-8]@tcp0"'
This specifies bi-directional routing: TCP clients can reach Lustre resources on the IB networks, and IB servers can access the TCP networks. For more information on ip2nets, see Modprobe.conf.
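The idea behind ip2nets is that each node selects its networks by matching its own IP addresses against the listed patterns, which is what allows one setting to serve many nodes. The Python sketch below illustrates that matching for the two pattern forms used in the example (a * wildcard and a [low-high] range). It is a simplified model, not LNET's implementation; the function names and the sample IP addresses are hypothetical.

```python
def octet_matches(pattern, octet):
    """Match one dotted-quad field against '*', '[low-high]' or a literal."""
    if pattern == "*":
        return True
    if pattern.startswith("[") and pattern.endswith("]"):
        low, high = pattern[1:-1].split("-")
        return int(low) <= int(octet) <= int(high)
    return pattern == octet


def ip_matches(pattern, ip):
    """True if every field of the IPv4 address matches the pattern."""
    pattern_fields, ip_fields = pattern.split("."), ip.split(".")
    return (len(pattern_fields) == len(ip_fields) == 4
            and all(octet_matches(p, o) for p, o in zip(pattern_fields, ip_fields)))


def networks_for(ip2nets, local_ips):
    """Return the network specs whose patterns match any local IP address."""
    selected = []
    for rule in ip2nets.split(";"):
        fields = rule.split()
        if len(fields) < 2:
            continue
        net, patterns = fields[0], fields[1:]
        if any(ip_matches(p, ip) for p in patterns for ip in local_ips):
            selected.append(net)
    return selected


IP2NETS = "tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"

print(networks_for(IP2NETS, ["10.10.0.50"]))                 # a TCP-only node
print(networks_for(IP2NETS, ["10.10.0.1", "192.168.10.1"]))  # a node on both
```

With the same IP2NETS string, a node holding only a 10.10.0.x address ends up on tcp0, while a node with addresses on both subnets ends up on both networks.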
To ease Lustre administration, the same routing parameters can be used across different parts of a routed cluster. For example, the bi-directional routing example above can be used on an entire cluster (TCP clients, TCP-IB routers, and IB servers).
live_router_check_interval, dead_router_check_interval, auto_down, check_routers_before_use and router_ping_timeout
In a routed Lustre setup with nodes on different networks such as TCP/IP and Elan, the router checker checks the status of a router. Currently, only the clients using the sock LND and Elan LND avoid failed routers. We are working on extending this behavior to include all types of LNDs. The auto_down parameter enables/disables (1/0) the automatic marking of router state.
The live_router_check_interval parameter specifies a time interval in seconds after which the router checker will ping the live routers.
In the same way, you can set the dead_router_check_interval parameter for checking dead routers.
You can set the timeout for the router checker to check the live or dead routers by setting the router_ping_timeout parameter. The router pinger sends a ping message to a dead/live router once every dead/live_router_check_interval seconds, and if it does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.
The last parameter is check_routers_before_use, which is off by default. If it is turned on, you must also give dead_router_check_interval a positive integer value.
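The relationships among these parameters can be summarized in a short Python sketch. It checks the one dependency stated above (check_routers_before_use needs a positive dead_router_check_interval) and models the ping timeout rule. The function names and the numeric values are hypothetical, and treating an unset dead_router_check_interval as 0 is an assumption made for this sketch.

```python
def validate_router_checker(options):
    """Return a list of problems found in a dict of router checker options."""
    problems = []
    before_use = options.get("check_routers_before_use", 0)  # off by default
    dead_interval = options.get("dead_router_check_interval", 0)  # assumed unset = 0
    if before_use and dead_interval <= 0:
        problems.append(
            "check_routers_before_use requires a positive "
            "dead_router_check_interval"
        )
    if options.get("auto_down", 1) not in (0, 1):
        problems.append("auto_down must be 0 (disabled) or 1 (enabled)")
    return problems


def router_is_down(seconds_until_reply, router_ping_timeout):
    """Apply the ping rule: no reply within the timeout means 'down'.

    seconds_until_reply is None when no reply arrived at all.
    """
    return seconds_until_reply is None or seconds_until_reply > router_ping_timeout


# Hypothetical settings, for illustration only.
print(validate_router_checker({"check_routers_before_use": 1}))
print(router_is_down(seconds_until_reply=None, router_ping_timeout=50))
```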
The router checker gets the following variables for each router:
The initial time to disable a router should be one minute (enough to plug in a cable after removing it). If the router is administratively marked as "up", then the router checker clears the timeout. When a route is disabled (and possibly new), the "sent packets" counter is set to 0. When the route is first re-used (that is, once the disable time has elapsed), the sent-packets counter is incremented to 1, and it is incremented again for each further use of the route. Once the route has been used successfully for 100 packets, the sent-packets counter has a value of 100; at that point, set the timeout to 0 (zero) so that future errors no longer double the timeout.
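One possible reading of this bookkeeping is sketched below in Python: a route is disabled for one minute on its first error, the disable time doubles on repeated errors, and 100 successful packets (or an administrative "up") reset the timeout so the next error starts again at one minute. This is an interpretation of the paragraph above, not the LNET source; the class, method names, and the exact doubling rule are assumptions.

```python
INITIAL_DISABLE_SECONDS = 60   # "one minute" from the text
PACKETS_TO_RESET = 100         # successful packets before the timeout is cleared


class RouteState:
    """Per-route bookkeeping: disable time, disable duration, packet counter."""

    def __init__(self):
        self.timeout = 0          # current disable duration in seconds
        self.disabled_at = None   # time the route was last disabled
        self.sent_packets = 0

    def record_error(self, now):
        # Assumed rule: start at one minute, double on each further error.
        self.timeout = self.timeout * 2 if self.timeout else INITIAL_DISABLE_SECONDS
        self.disabled_at = now
        self.sent_packets = 0

    def usable(self, now):
        """A route is usable once its disable time has elapsed."""
        return self.disabled_at is None or now - self.disabled_at >= self.timeout

    def record_success(self):
        self.sent_packets += 1
        if self.sent_packets >= PACKETS_TO_RESET:
            self.timeout = 0      # future errors no longer double the timeout

    def mark_up(self):
        """Administrative 'up': clear the timeout."""
        self.timeout = 0
        self.disabled_at = None


route = RouteState()
route.record_error(now=0)
print(route.timeout, route.usable(now=30), route.usable(now=60))
```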
All LNET routers that bridge two networks are equivalent. They are not configured as primary or secondary, and load is balanced across all available routers.
Router fault tolerance only works from Linux nodes, that is, service nodes and application nodes if they are running Compute Node Linux (CNL). For this, LNET routing must correspond exactly with the Linux nodes’ map of alive routers.
There are no hard requirements regarding the number of LNET routers, although there should be enough to handle the required file serving bandwidth (and a 25% margin for headroom).
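The sizing rule can be written as a one-line calculation. The Python sketch below computes a router count from a required bandwidth plus 25% headroom; the function name and the bandwidth figures are hypothetical and should be replaced with measured values for the actual routers.

```python
import math


def routers_needed(required_bandwidth, per_router_bandwidth, headroom=0.25):
    """Smallest router count covering the required bandwidth plus headroom.

    Both bandwidths must use the same unit (for example MB/s).
    """
    if required_bandwidth <= 0 or per_router_bandwidth <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(required_bandwidth * (1 + headroom) / per_router_bandwidth)


# Hypothetical figures: 4000 MB/s required, 900 MB/s sustained per router.
print(routers_needed(4000, 900))  # 6
```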
By default, at startup, LNET routers allocate 544M (i.e. 139264 4K pages) of memory as router buffers. The buffers can only come from low system memory (i.e. ZONE_DMA and ZONE_NORMAL).
On 32-bit systems, low system memory is, at most, 896M no matter how much RAM is installed. The default router buffer size puts considerable pressure on the low memory zones, making an out-of-memory (OOM) situation more likely. This is a known cause of router hangs. Lowering the value of the large_router_buffers parameter can circumvent this problem, but at the cost of router performance, because large messages must wait longer for buffers.
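The figures above can be checked with simple arithmetic. The Python sketch below assumes that "4K" means 4096 bytes and "M" means 2^20 bytes, and shows how much of the 896M low-memory ceiling the default router buffers would occupy on a 32-bit system.

```python
PAGE_SIZE_BYTES = 4096                 # assumption: "4K" pages are 4096 bytes
DEFAULT_ROUTER_BUFFER_PAGES = 139264   # from the text
LOW_MEMORY_32BIT_MIB = 896             # from the text
MIB = 1024 * 1024

buffer_mib = DEFAULT_ROUTER_BUFFER_PAGES * PAGE_SIZE_BYTES / MIB
share_of_low_memory = buffer_mib / LOW_MEMORY_32BIT_MIB

print(f"router buffers: {buffer_mib:.0f} MiB")               # 544 MiB
print(f"share of low memory: {share_of_low_memory:.1%}")     # 60.7%
```

Roughly three fifths of low memory going to router buffers at startup is what makes the out-of-memory condition likely on 32-bit routers.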
On 64-bit architectures, the ZONE_HIGHMEM zone is always empty. Router buffers can come from all available memory and out-of-memory hangs do not occur. Therefore, we recommend using 64-bit routers.
There are two mechanisms to update the health status of a peer or a router:
Several key differences in both mechanisms:
Lustre automatically starts and stops LNET, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start Lustre.
$ modprobe lnet
$ lctl network up
To see the list of local NIDs, run:
$ lctl list_nids
This command tells you if the local node's networks are set up correctly.
If the networks are not set up correctly, check the "networks=" line in modules.conf and make sure the network layer modules are correctly installed and configured.
To get the best remote NID, run:
$ lctl which_nid <NID list>
where <NID list> is the list of available NIDs.
This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.
mount -t lustre mdsnode:/mdsA/client /mnt/lustre/
mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
Before the LNET modules can be removed, LNET references must be removed. In general, these references are removed automatically during Lustre shutdown, but for standalone routers, an explicit step is necessary to stop LNET. Run this command:
lctl network unconfigure
After the network is unconfigured, remove the LND and LNET modules by running:
modprobe -r <any lnd and the lnet modules>
Tip - To remove all Lustre modules, run:
$ lctl modules | awk '{print $2}' | xargs rmmod
Copyright © 2008 Sun Microsystems, Inc. All Rights Reserved.