CHAPTER 2
Understanding Lustre Networking
This chapter describes Lustre Networking (LNET) and supported networks, and includes the following sections:
In a Lustre network, servers and clients communicate with one another using LNET, a custom networking API that abstracts away all transport-specific interaction. In turn, LNET operates with a variety of network transports through Lustre Network Drivers (LNDs).
The following terms are important to understanding LNET.
LNET is designed for complex topologies, superior routing capabilities and simplified configuration.
LNET supports the following network types:
Before you configure Lustre, it is essential to have a clear understanding of the Lustre network topologies.
A network is a group of nodes that communicate directly with one another. As previously mentioned in this manual, Lustre supports a variety of network types and hardware, including TCP/IP, Elan, varieties of InfiniBand, Myrinet and others. The normal rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) would be considered two different Lustre networks.
Any node with appropriate interfaces can route LNET between different networks--the node may be a server, a client, or a standalone router. LNET can route across different network types (such as TCP-to-Elan) or across different topologies (such as bridging two InfiniBand or TCP/IP networks).
If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. If there are interfaces that LNET should not use (such as administrative networks, IP over IB, and so on), then the interfaces that LNET should use must be listed explicitly.
The LNET configuration is managed via module options, typically specified in /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). To ease the maintenance of large clusters, you can configure the networking setup for all nodes using a single, unified set of options in the modprobe.conf file on each node. For more information, see the ip2nets option in Setting Up modprobe.conf for Load Balancing.
Users of liblustre should set the accept=all parameter. For details, see Module Parameters.
In mount commands, clients use the NID of the MDS host to retrieve their configuration information. Since an MDS may have more than one NID, a client should use the appropriate NID for its local network. If you are unsure which NID to use, there is an lctl command that can help.
lctl list_nids
This displays the server's NIDs (networks configured to work with Lustre).
lctl which_nid <NID list>
This displays the closest NID for the client.
From a client with SSH access to the MDS, run:
mds_nids=`ssh the_mds lctl list_nids`
lctl which_nid $mds_nids
This displays, generally, the correct NID to use for the MDS in the mount command.
| Note - In the mds_nids command above, be sure to use backquotes (`), not straight single quotation marks ('). Otherwise, the command will not work. |
This section describes how to configure LNET, including entries in the modprobe.conf file which tell LNET which NIC(s) will be configured to work with Lustre, and parameters that specify the routing that will be used with Lustre.
LNET network hardware and routing are configured via module parameters of the LNET and LND-specific modules. Parameters should be specified in the /etc/modprobe.conf or /etc/modules.conf file. This example specifies that the node should use a TCP interface and an Elan interface:
options lnet networks=tcp0,elan0
Depending on the LNDs used, it may be necessary to specify explicit interfaces. For example, if you want to use two TCP interfaces (tcp0 and tcp1, for example), it is necessary to specify the module parameters and ethX interfaces like this:
options lnet networks=tcp0(eth0),tcp1(eth1)
This modprobe.conf entry specifies:
All LNET routers that bridge two networks are equivalent; their configuration is not primary or secondary. All available routers balance their overall load. With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures. To do this, LNET routing must correspond exactly with the Linux nodes' map of alive routers. There is no hard limit on the number of LNET routers.
Under Linux 2.6, the LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under lnet and LND-specific parameters under the corresponding LND name.
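As a rough illustration of reading those values, the sketch below prints each parameter of a module as name=value. The helper name show_module_params is hypothetical (it is not a Lustre tool), and the sketch assumes the usual Linux sysfs layout in which a module's parameters appear as files under /sys/module/<module>/parameters.

```shell
# Hypothetical helper (not part of Lustre): print every readable parameter
# of a kernel module as name=value.
# Assumption: parameters are exposed as files under <root>/<module>/parameters.
# The optional second argument overrides the sysfs root, which is handy for testing.
show_module_params() {
    module=$1
    root=${2:-/sys/module}
    for f in "$root/$module/parameters"/*; do
        # Skip anything that is not a readable regular file
        # (this also covers the case where the directory does not exist).
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f" 2>/dev/null)"
    done
}

# Generic and acceptor parameters live under the lnet module.
show_module_params lnet
```

The same function can be pointed at an LND module name to list the LND-specific parameters.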
| Note - By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. In this case, specify all Lustre networks. |
The liblustre network parameters may be set by exporting the environment variables LNET_NETWORKS, LNET_IP2NETS and LNET_ROUTES. Each of these variables uses the same parameters as the corresponding modprobe option.
It is very important that a liblustre client includes ALL the routers in its LNET_ROUTES setting. A liblustre client cannot accept connections; it can only create them. If a server sends remote procedure call (RPC) replies via a router to which the liblustre client has not already connected, those replies are lost.
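A minimal sketch of such an environment follows. The addresses are illustrative assumptions (a liblustre client on tcp0 that reaches an o2ib network through eight routers at 10.10.0.1 through 10.10.0.8); substitute the values for your own site, and make sure the route list names every router.

```shell
# Illustrative values only; they are assumptions, not site defaults.
# The client itself sits on the tcp0 network.
export LNET_NETWORKS="tcp0"

# Route to the o2ib network through ALL eight routers on tcp0.
# Leaving a router out risks losing RPC replies sent through it.
export LNET_ROUTES="o2ib 10.10.0.[1-8]@tcp0"

echo "LNET_NETWORKS=$LNET_NETWORKS"
echo "LNET_ROUTES=$LNET_ROUTES"
```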
| Note - Liblustre is not widely tested as part of Lustre release testing, and is currently maintained only as a courtesy to the Lustre community. |
Lustre now offers usocklnd, a socket-based LND that uses TCP in userspace. By default, liblustre is compiled with usocklnd as the transport, so there is no need to specially enable it.
Use the following environment variables to tune usocklnd’s behavior.
For the SilverStorm/Infinicon InfiniBand LND (iiblnd), the network and HCA may be specified, as in this example:
options lnet networks="o2ib3(ib3)"
This specifies that the node is on o2ib network number 3, using HCA ib3.
The following parameter specifies a semicolon-separated list of router definitions. Each route is defined as a network type, followed by a list of routers.
routes=<net type> <router NID(s)>
options lnet 'networks="o2ib0"' 'routes="tcp0 192.168.10.[1-8]@o2ib0"'
This is an example for IB clients to access TCP servers via 8 IB-TCP routers.
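The bracketed range in the routes value is shorthand for a list of router NIDs. As a rough illustration, the hypothetical shell function below (not a Lustre tool) expands such an expression so that the eight routers it names can be seen individually. It is a sketch that assumes a single numeric range in the last octet of the address.

```shell
# Hypothetical helper: expand A.B.C.[lo-hi]@net into one NID per line.
# Assumption: exactly one bracketed numeric range, as in the example above.
expand_nid_range() {
    nid_expr=$1
    prefix=${nid_expr%%\[*}       # text before "[", e.g. "192.168.10."
    rest=${nid_expr#*\[}          # text after "[", e.g. "1-8]@o2ib0"
    range=${rest%%\]*}            # the range itself, e.g. "1-8"
    suffix=${rest#*\]}            # text after "]", e.g. "@o2ib0"
    lo=${range%-*}
    hi=${range#*-}
    i=$lo
    while [ "$i" -le "$hi" ]; do
        printf '%s%s%s\n' "$prefix" "$i" "$suffix"
        i=$((i + 1))
    done
}

expand_nid_range '192.168.10.[1-8]@o2ib0'
```

The output runs from 192.168.10.1@o2ib0 to 192.168.10.8@o2ib0, one NID per router.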
options lnet 'ip2nets="tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"' \
    'routes="tcp 192.168.10.[1-8]@o2ib0; o2ib 10.10.0.[1-8]@tcp0"'
This specifies bi-directional routing; TCP clients can reach Lustre resources on the IB networks and IB servers can access the TCP networks. For more information on ip2nets, see Modprobe.conf.
| Note - Configure IB network interfaces on a different subnet than LAN interfaces. |
To ease Lustre administration, the same routing parameters can be used across different parts of a routed cluster. For example, the bi-directional routing example above can be used on an entire cluster (TCP clients, TCP-IB routers, and IB servers):
live_router_check_interval, dead_router_check_interval, auto_down, check_routers_before_use and router_ping_timeout
In a routed Lustre setup with nodes on different networks such as TCP/IP and Elan, the router checker checks the status of a router. The auto_down parameter enables/disables (1/0) the automatic marking of router state.
The live_router_check_interval parameter specifies a time interval in seconds after which the router checker will ping the live routers.
In the same way, you can set the dead_router_check_interval parameter for checking dead routers.
You can set the timeout for the router checker to check the live or dead routers by setting the router_ping_timeout parameter. The router pinger sends a ping message to a dead/live router once every dead/live_router_check_interval seconds, and if it does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.
The last parameter is check_routers_before_use, which is off by default. If it is turned on, you must also give dead_router_check_interval a positive integer value.
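The dependency between these two parameters can be checked before a configuration is rolled out. The sketch below is a hypothetical helper, not part of Lustre; the function name and its two arguments are assumptions made for illustration.

```shell
# Hypothetical pre-flight check (not a Lustre tool).
# $1 = intended check_routers_before_use value (0 or 1)
# $2 = intended dead_router_check_interval value, in seconds
# Returns 0 when the pair is consistent with the rule above, 1 otherwise.
check_router_params() {
    check_before_use=$1
    dead_interval=$2
    if [ "$check_before_use" -eq 1 ] && [ "$dead_interval" -le 0 ]; then
        echo "check_routers_before_use=1 requires dead_router_check_interval > 0" >&2
        return 1
    fi
    return 0
}

# Example: a consistent pair of settings (values are illustrative).
check_router_params 1 60 && echo "settings are consistent"
```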
The router checker gets the following variables for each router:
The initial time to disable a router should be one minute (enough to plug in a cable after removing it). If the router is administratively marked as "up", the router checker clears the timeout. When a route is disabled (and possibly new), its "sent packets" counter is set to 0. When the route is first reused (that is, once the disable time has elapsed), the counter is incremented to 1, and it is incremented again on every further use of the route. Once the route has carried 100 packets successfully, the counter reaches 100 and the timeout is set to 0 (zero), so that future errors no longer double the timeout.
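The bookkeeping described above can be modeled in a few lines of shell. This is only one reading of the paragraph, not the LNET implementation: the variable and function names are hypothetical, and the sketch assumes that an error on a route with no timeout starts the timeout at 60 seconds, that each later error doubles it, and that 100 successful packets reset it to zero.

```shell
# Toy model of the per-route state (names are hypothetical).
timeout=0     # current disable timeout in seconds; 0 means none
sent=0        # "sent packets" counter

# An error disables the route: start at one minute, otherwise double.
on_error() {
    if [ "$timeout" -eq 0 ]; then
        timeout=60
    else
        timeout=$((timeout * 2))
    fi
    sent=0
}

# A successful send counts toward clearing the timeout.
on_success() {
    sent=$((sent + 1))
    if [ "$sent" -ge 100 ]; then
        timeout=0
    fi
}

on_error; on_error
echo "after two errors: timeout=${timeout}s"
```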
All LNET routers that bridge two networks are equivalent. They are not configured as primary or secondary, and load is balanced across all available routers.
With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures.
There are no hard requirements regarding the number of LNET routers, although there should be enough to handle the required file serving bandwidth (and a 25% margin for headroom).
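As a back-of-the-envelope sketch of that sizing rule, the function below adds 25% to a required bandwidth and divides by what a single router can forward, rounding up. The function name and the example figures (8000 MB/s required, 1200 MB/s per router) are assumptions chosen for illustration; measure your own routers before relying on any number.

```shell
# Hypothetical sizing helper using integer arithmetic only.
# $1 = required aggregate bandwidth, $2 = bandwidth of one router (same unit).
routers_needed() {
    required=$1
    per_router=$2
    # Add the 25% headroom, rounding up to a whole unit.
    with_margin=$(( (required * 125 + 99) / 100 ))
    # Ceiling division: routers needed to carry the padded bandwidth.
    echo $(( (with_margin + per_router - 1) / per_router ))
}

routers_needed 8000 1200
```

With these example figures the padded requirement is 10000 MB/s, which needs nine routers.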
By default, at startup, LNET routers allocate 544M (i.e. 139264 4K pages) of memory as router buffers. The buffers can only come from low system memory (i.e. ZONE_DMA and ZONE_NORMAL).
On 32-bit systems, low system memory is, at most, 896M no matter how much RAM is installed. The default router buffer size puts heavy pressure on the low memory zones, making an out-of-memory (OOM) situation more likely. This is a known cause of router hangs. Lowering the value of the large_router_buffers parameter can circumvent this problem, but at the cost of router performance, because large messages must wait longer for buffers.
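The figures quoted above can be checked with a little shell arithmetic. The sketch uses only the numbers given in the text (139264 pages of 4K each, and an 896M ceiling on low memory); the variable names are illustrative.

```shell
pages=139264        # default router buffer pages, from the text
page_kib=4          # size of one page in KiB
low_mem_mib=896     # upper bound on low memory for 32-bit systems

# Total buffer size in MiB, and its share of low memory (integer percent).
buffer_mib=$(( pages * page_kib / 1024 ))
percent=$(( buffer_mib * 100 / low_mem_mib ))

echo "router buffers: ${buffer_mib} MiB, about ${percent}% of low memory"
```

The default buffers therefore take 544 MiB, roughly 60% of the low memory a 32-bit router can ever have.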
On 64-bit architectures, the ZONE_HIGHMEM zone is always empty. Router buffers can come from all available memory and out-of-memory hangs do not occur. Therefore, we recommend using 64-bit routers.
There are two mechanisms to update the health status of a peer or a router:
There are several key differences between the two mechanisms:
Lustre automatically starts and stops LNET, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start Lustre.
$ modprobe lnet
$ lctl network up
To see the list of local NIDs, run:
$ lctl list_nids
This command tells you the network(s) configured to work with Lustre.
If the networks are not set up correctly, see the modules.conf "networks=" line and make sure the network layer modules are correctly installed and configured.
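When checking the result against the "networks=" line, it can help to reduce the NID list to the network names alone. The sketch below is a hypothetical filter, not a Lustre command; it assumes that NIDs arrive one per line in the address@network form, and it uses sample input so that it runs without LNET loaded.

```shell
# Hypothetical filter: read NIDs (one per line, address@network) on stdin
# and print each distinct network name once, sorted.
nid_networks() {
    sed -n 's/^.*@//p' | sort -u
}

# Sample input standing in for the output of "lctl list_nids".
printf '%s\n' '192.168.0.10@tcp' '10.10.0.10@o2ib' | nid_networks
```

On a node with LNET running, the same filter could be fed from lctl list_nids directly.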
To get the best remote NID, run:
$ lctl which_nid <NID list>
where <NID list> is the list of available NIDs.
This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.
mount -t lustre mdsnode:/mdsA/client /mnt/lustre/
mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
Before the LNET modules can be removed, LNET references must be removed. In general, these references are removed automatically when Lustre is shut down, but for standalone routers, an explicit step is needed to stop LNET. Run:
lctl network unconfigure
To remove the LND and LNET modules once the LNET network is unconfigured, run:
modprobe -r <any lnd and the lnet modules>
| Tip - To remove all Lustre modules, run:
$ lctl modules | awk '{print $2}' | xargs rmmod |
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.