CHAPTER 2

Understanding Lustre Networking

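This chapter describes Lustre Networking (LNET) and supported networks, and includes the following sections: Introduction to LNET, Supported Network Types, Designing Your Lustre Network, Configuring LNET, and Starting and Stopping LNET.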


2.1 Introduction to LNET

In a Lustre network, servers and clients communicate with one another using LNET, a custom networking API that abstracts away all transport-specific interaction. In turn, LNET operates with a variety of network transports through Lustre Network Drivers (LNDs).

The following terms are important to understanding LNET.

Key features of LNET include:

LNET is designed for complex topologies, superior routing capabilities and simplified configuration.


2.2 Supported Network Types

LNET supports the following network types:


2.3 Designing Your Lustre Network

Before you configure Lustre, it is essential to have a clear understanding of the Lustre network topologies.

2.3.1 Identify All Lustre Networks

A network is a group of nodes that communicate directly with one another. As previously mentioned in this manual, Lustre supports a variety of network types and hardware, including TCP/IP, Elan, varieties of InfiniBand, Myrinet and others. The normal rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) would be considered two different Lustre networks.

2.3.2 Identify Nodes to Route Between Networks

Any node with appropriate interfaces can route LNET between different networks; the node may be a server, a client, or a standalone router. LNET can route across different network types (such as TCP-to-Elan) or across different topologies (such as bridging two InfiniBand or TCP/IP networks).

2.3.3 Identify Network Interfaces to Include/Exclude from LNET

If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. If there are interfaces that LNET should not use (such as administrative networks, IP over IB, and so on), then the included interfaces should be explicitly listed.

2.3.4 Determine Cluster-wide Module Configuration

The LNET configuration is managed via module options, typically specified in /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). To ease the maintenance of large clusters, you can configure the networking setup for all nodes using a single, unified set of options in the modprobe.conf file on each node. For more information, see the ip2nets option in Setting Up modprobe.conf for Load Balancing.

Users of liblustre should set the accept=all parameter. For details, see Module Parameters.

2.3.5 Determine Appropriate Mount Parameters for Clients

In mount commands, clients use the NID of the MDS host to retrieve their configuration information. Since an MDS may have more than one NID, a client should use the appropriate NID for its local network. If you are unsure which NID to use, the lctl utility can help.

MDS

On the MDS, run:

lctl list_nids

This displays the server's NIDs (networks configured to work with Lustre).

Client

On a client, run:

lctl which_nid <NID list>

This displays the closest NID for the client.

Client with SSH Access

From a client with SSH access to the MDS, run:

mds_nids=`ssh the_mds lctl list_nids`
lctl which_nid $mds_nids

This displays, generally, the correct NID to use for the MDS in the mount command.



Note - In the mds_nids command above, be sure to use the correct mark (`), not a straight quotation mark ('). Otherwise, the command will not work.
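The selection step can also be approximated offline. The following Python sketch is a simplification and is not the algorithm that lctl which_nid implements: it assumes a NID has the form address@network (as in 2@elan0 elsewhere in this chapter) and returns the first MDS NID whose network is also configured on the client. The function name and the sample NIDs are hypothetical.

```python
def pick_mds_nid(mds_nids, client_networks):
    """Return the first MDS NID on a network the client also uses, else None."""
    for nid in mds_nids:
        address, separator, network = nid.partition("@")
        if separator and network in client_networks:
            return nid
    return None


# Hypothetical MDS with one TCP NID and one Elan NID.
mds_nids = ["192.168.0.2@tcp0", "2@elan0"]
elan_client_choice = pick_mds_nid(mds_nids, {"elan0"})
tcp_client_choice = pick_mds_nid(mds_nids, {"tcp0"})
```

On a live system, prefer the output of lctl which_nid, which reflects the actual LNET configuration of the node.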



2.4 Configuring LNET

This section describes how to configure LNET, including entries in the modprobe.conf file which tell LNET which NIC(s) will be configured to work with Lustre, and parameters that specify the routing that will be used with Lustre.



Note - We recommend that you use dotted-quad IP addressing rather than host names. We have found this aids in reading debug logs, and helps greatly when debugging configurations with multiple interfaces.


2.4.1 Module Parameters

LNET network hardware and routing are configured via module parameters of the LNET and LND-specific modules. Parameters should be specified in the /etc/modprobe.conf or /etc/modules.conf file. This example specifies that the node should use a TCP interface and an Elan interface:

options lnet networks=tcp0,elan0

Depending on the LNDs used, it may be necessary to specify explicit interfaces. For example, if you want to use two TCP interfaces (tcp0 and tcp1, for example), it is necessary to specify the module parameters and ethX interfaces like this:

options lnet networks=tcp0(eth0),tcp1(eth1)

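This modprobe.conf entry specifies that the first Lustre network, tcp0, is configured on interface eth0, and that the second Lustre network, tcp1, is configured on interface eth1.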



Note - The requirement to specify explicit interfaces is not consistent across all LNDs used with Lustre, and LND behavior may change over time. If your multi-homed settings do not work, we recommend specifying the ethX interfaces in the options lnet networks line.
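To make the shape of the networks value concrete, here is a minimal Python sketch that splits it into network names and their optional explicit interfaces. It only mirrors the two examples shown in this section (a bare network name, or a name followed by an interface in parentheses); the function name is hypothetical and the real LNET parser may accept more forms.

```python
import re

# One entry looks like "tcp0" or "tcp1(eth1)".
_ENTRY = re.compile(r"^(?P<net>[a-z0-9]+)(?:\((?P<ifaces>[^()]*)\))?$")


def parse_networks(value):
    """Map each network name to its list of explicit interfaces (may be empty)."""
    result = {}
    # Split on commas that are not inside parentheses.
    for entry in re.split(r",(?![^()]*\))", value):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"cannot parse network entry: {entry!r}")
        ifaces = match.group("ifaces")
        result[match.group("net")] = ifaces.split(",") if ifaces else []
    return result


two_tcp = parse_networks("tcp0(eth0),tcp1(eth1)")
tcp_and_elan = parse_networks("tcp0,elan0")
```

An empty interface list in the result corresponds to the case where LNET picks the interface itself.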


All LNET routers that bridge two networks are equivalent; their configuration is not primary or secondary. All available routers balance their overall load. With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures. To do this, LNET routing must correspond exactly with the Linux nodes' map of alive routers. There is no hard limit on the number of LNET routers.



Note - When multiple interfaces are available during the network setup, Lustre chooses the 'best' route. Once the network connection is established, Lustre expects the network to stay connected. In a Lustre network, connections do not fail over to the other interface, even if multiple interfaces are available on the same node.


Under Linux 2.6, the LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under lnet and LND-specific parameters under the corresponding LND name.
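The parameters can also be collected with a short script. This Python sketch assumes the standard sysfs layout, in which each loaded module exposes its parameters as files under /sys/module/<module>/parameters/; the function name is hypothetical.

```python
import os


def read_module_parameters(module, sysfs_root="/sys/module"):
    """Return {parameter name: value} for a loaded kernel module."""
    params_dir = os.path.join(sysfs_root, module, "parameters")
    values = {}
    if not os.path.isdir(params_dir):
        return values  # module not loaded, or it exposes no parameters
    for name in sorted(os.listdir(params_dir)):
        path = os.path.join(params_dir, name)
        try:
            with open(path) as handle:
                values[name] = handle.read().strip()
        except OSError:
            values[name] = None  # parameter exists but is not readable
    return values
```

For example, read_module_parameters("lnet") returns the generic and acceptor parameters, and passing an LND module name returns the parameters of that LND.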



Note - Depending on the Linux distribution, options with included commas may need to be escaped using single and/or double quotes. Worst-case quotes look like:

options lnet 'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"'

Additional quotes may confuse some distributions. Check for messages such as:

lnet: Unknown parameter ‘'networks'

If such a message appears when you run modprobe lnet, remove the additional single quotes (from modprobe.conf in this case). Additionally, the refusing connection - no matching NID message generally points to an error in the LNET module configuration.




Note - By default, Lustre ignores the loopback (lo0) interface. However, Lustre does not ignore IP addresses aliased to the loopback. If you use such aliases, specify all Lustre networks explicitly.


The liblustre network parameters may be set by exporting the environment variables LNET_NETWORKS, LNET_IP2NETS and LNET_ROUTES. Each of these variables uses the same parameters as the corresponding modprobe option.

It is very important that a liblustre client includes ALL the routers in its setting of LNET_ROUTES. A liblustre client cannot accept connections; it can only create them. If a server sends remote procedure call (RPC) replies via a router to which the liblustre client has not already connected, those RPC replies are lost.



Note - Liblustre is not required or even recommended for running Lustre on Linux. Most users will not use liblustre. Instead, you should use the Lustre (VFS) client file system to mount Lustre directly. Liblustre does NOT support multi-threaded applications.




Note - Liblustre is not widely tested as part of Lustre release testing, and is currently maintained only as a courtesy to the Lustre community.


2.4.1.1 Using Usocklnd

Lustre now offers usocklnd, a socket-based LND that uses TCP in userspace. By default, liblustre is compiled with usocklnd as the transport, so there is no need to specially enable it.

Use the following environment variables to tune usocklnd’s behavior.


Variable

Description

USOCK_SOCKNAGLE=N

Turns the TCP Nagle algorithm on or off. Setting N to 0 (the default value) turns the algorithm off. Setting N to 1 turns the algorithm on.

USOCK_SOCKBUFSIZ=N

Changes the socket buffer size. Setting N to 0 (the default value) specifies the default socket buffer size. Setting N to another value (must be a positive integer) causes usocklnd to try to set the socket buffer size to the specified value.

USOCK_TXCREDITS=N

Specifies the maximum number of concurrent sends. The default value is 256. N should be set to a positive value.

USOCK_PEERTXCREDITS=N

Specifies the maximum number of concurrent sends per peer. The default value is 8. N should be set to a positive value and should not be greater than the value of the USOCK_TXCREDITS parameter.

USOCK_NPOLLTHREADS=N

Defines the degree of parallelism of usocklnd, that is, the number of threads devoted to processing network events. The default value is the number of CPUs in the system. N should be set to a positive value.

USOCK_FAIR_LIMIT=N

The maximum number of times that usocklnd loops processing events before the next polling occurs. The default value is 1, meaning that every network event has only one chance to be processed before polling occurs the next time. N should be set to a positive value.

USOCK_TIMEOUT=N

Specifies the network timeout (measured in seconds). Network operations that are not completed in N seconds time out and are canceled. The default value is 50 seconds. N should be a positive value.

USOCK_POLL_TIMEOUT=N

Specifies the polling timeout, that is, how long usocklnd ‘sleeps’ if no network events occur. A larger N results in slightly lower overhead for checking network timeouts and a longer delay in evicting timed-out events. The default value is 1 second. N should be set to a positive value.

USOCK_MIN_BULK=N

This tunable is only used for typed network connections. Currently, liblustre clients do not use this usocklnd facility.
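The constraints in the table can be checked before a liblustre application is launched. The following Python sketch builds an environment with the documented defaults, applies overrides, and rejects combinations that the table rules out. The function name is hypothetical. USOCK_NPOLLTHREADS is left unset unless overridden, because its default depends on the number of CPUs, and USOCK_MIN_BULK is omitted because liblustre clients do not use it.

```python
import os

# Defaults as listed in the table above.
USOCK_DEFAULTS = {
    "USOCK_SOCKNAGLE": 0,
    "USOCK_SOCKBUFSIZ": 0,
    "USOCK_TXCREDITS": 256,
    "USOCK_PEERTXCREDITS": 8,
    "USOCK_FAIR_LIMIT": 1,
    "USOCK_TIMEOUT": 50,
    "USOCK_POLL_TIMEOUT": 1,
}


def usocklnd_env(overrides=None, base_env=None):
    """Return a copy of the environment with validated USOCK_* settings."""
    settings = dict(USOCK_DEFAULTS)
    settings.update(overrides or {})

    if settings["USOCK_SOCKNAGLE"] not in (0, 1):
        raise ValueError("USOCK_SOCKNAGLE must be 0 or 1")
    if settings["USOCK_SOCKBUFSIZ"] < 0:
        raise ValueError("USOCK_SOCKBUFSIZ must be 0 or a positive integer")
    for name, value in settings.items():
        if name not in ("USOCK_SOCKNAGLE", "USOCK_SOCKBUFSIZ") and value <= 0:
            raise ValueError(f"{name} must be a positive value")
    if settings["USOCK_PEERTXCREDITS"] > settings["USOCK_TXCREDITS"]:
        raise ValueError("USOCK_PEERTXCREDITS must not exceed USOCK_TXCREDITS")

    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in settings.items()})
    return env


tuned = usocklnd_env({"USOCK_TXCREDITS": 128, "USOCK_PEERTXCREDITS": 16}, base_env={})
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the application.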


2.4.1.2 OFED InfiniBand Options

For the OFED InfiniBand LND (o2iblnd), the network and HCA may be specified, as in this example:

options lnet networks="o2ib3(ib3)"

This specifies that the node is on o2ib network number 3, using HCA ib3.

2.4.2 Module Parameters - Routing

The following parameter specifies a semicolon-separated list of route definitions. Each route is defined as a target network, followed by a list of routers (given as NIDs) through which that network can be reached.

routes=<net type> <router NID(s)>

Examples:

options lnet 'networks="o2ib0"' 'routes="tcp0 192.168.10.[1-8]@o2ib0"'

This is an example for IB clients to access TCP servers via 8 IB-TCP routers.
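To see which router NIDs a routes value stands for, the bracket notation can be expanded by hand or with a short script. The Python sketch below handles only the forms that appear in this chapter, ranges such as [1-8] and lists such as [2,10]; the function names are hypothetical, and any other syntax the real parser accepts is not covered.

```python
import itertools
import re


def expand_brackets(pattern):
    """Expand each [..] group: '10.0.0.[1-3]@tcp0' gives three NIDs."""
    parts = re.split(r"\[([^\]]*)\]", pattern)  # odd indexes hold bracket contents
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])  # literal text between bracket groups
            continue
        values = []
        for item in part.split(","):
            low, _, high = item.partition("-")
            values.extend(str(n) for n in range(int(low), int(high or low) + 1))
        choices.append(values)
    return ["".join(combo) for combo in itertools.product(*choices)]


def expand_routes(routes):
    """Map each target network to the router NIDs listed for it."""
    table = {}
    for route in routes.split(";"):
        fields = route.split()
        if not fields:
            continue
        target, router_specs = fields[0], fields[1:]
        for spec in router_specs:
            table.setdefault(target, []).extend(expand_brackets(spec))
    return table


ib_client_routes = expand_routes("tcp0 192.168.10.[1-8]@o2ib0")
```

Here ib_client_routes maps tcp0 to the eight router NIDs 192.168.10.1@o2ib0 through 192.168.10.8@o2ib0.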

options lnet 'ip2nets="tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"' \
'routes="tcp 192.168.10.[1-8]@o2ib0; o2ib 10.10.0.[1-8]@tcp0"'

This specifies bi-directional routing; TCP clients can reach Lustre resources on the IB networks and IB servers can access the TCP networks. For more information on ip2nets, see Modprobe.conf.



Note - Configure IB network interfaces on a different subnet than LAN interfaces.




Caution - For options ip2nets, routes and networks, several best practices must be followed or configuration errors occur:

Best Practice 1: If you add a comment to any of the options mentioned above, position the semicolon after the comment. If you fail to do so, some nodes are not properly initialized because LNET silently ignores everything following the '#' character (which begins the comment), until it reaches the next semicolon. This is subtle; no error message is generated to alert you to the problem.

This example shows the correct syntax:

options lnet ip2nets="pt10 192.168.0.[89,93] # comment with semicolon AFTER comment; \
pt11 192.168.0.[92,96] # comment"

In this example, the following is ignored: comment with semicolon AFTER comment

This example shows the wrong syntax:

options lnet ip2nets="pt10 192.168.0.[89,93]; # comment with semicolon BEFORE comment \
pt11 192.168.0.[92,96];"

In this example, the following is ignored: comment with semicolon BEFORE comment pt11 192.168.0.[92,96]. Because LNET silently ignores pt11 192.168.0.[92,96], these nodes are not properly initialized.
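The effect of semicolon placement can be reproduced with a small Python sketch. It models only the rule stated above (everything from '#' up to the next semicolon is ignored) and is not the LNET parser; the function name is hypothetical, and the two strings are the examples above with the line continuation already joined.

```python
import re


def effective_entries(option_value):
    """Drop each '#...' comment up to (not including) the next ';', then split."""
    without_comments = re.sub(r"#[^;]*", "", option_value)
    return [entry.strip() for entry in without_comments.split(";") if entry.strip()]


correct = ("pt10 192.168.0.[89,93] # comment with semicolon AFTER comment; "
           "pt11 192.168.0.[92,96] # comment")
wrong = ("pt10 192.168.0.[89,93]; # comment with semicolon BEFORE comment "
         "pt11 192.168.0.[92,96];")

entries_correct = effective_entries(correct)  # both pt10 and pt11 survive
entries_wrong = effective_entries(wrong)      # only pt10 survives
```

Running a check like this over a long ip2nets value before deploying it is one way to catch an entry that would be dropped without an error message.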

Best Practice 2: Do not add an excessive number of comments to these options. The Linux kernel has a limit on the length of string module options; it is usually 1KB, but may differ in vendor kernels. If you exceed this limit, errors result and the configuration specified by the user is not processed properly.


Using Routing Parameters Across a Cluster

To ease Lustre administration, the same routing parameters can be used across different parts of a routed cluster. For example, the bi-directional routing example above can be used on an entire cluster (TCP clients, TCP-IB routers, and IB servers).
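One reason a single line can serve the whole cluster is that each node selects its networks from ip2nets according to its own IP addresses. The Python sketch below illustrates that idea for the ip2nets value from the bi-directional routing example. It is a simplified model with hypothetical function names and sample addresses; the matching rules of the real parser may differ in detail.

```python
import re


def _octet_matches(pattern, octet):
    """Match one dotted-quad field against '*', a number, or a '[a-b,c]' list."""
    if pattern == "*":
        return True
    for item in pattern.strip("[]").split(","):
        low, _, high = item.partition("-")
        if int(low) <= octet <= int(high or low):
            return True
    return False


def ip_matches(pattern, ip):
    """Return True if the dotted-quad ip matches the ip2nets address pattern."""
    pattern_fields = re.findall(r"\[[^\]]*\]|[^.]+", pattern)
    ip_fields = [int(field) for field in ip.split(".")]
    return (len(pattern_fields) == len(ip_fields) == 4 and
            all(_octet_matches(p, o) for p, o in zip(pattern_fields, ip_fields)))


def networks_for_node(ip2nets, local_ips):
    """List the network entries whose address patterns match a local IP."""
    selected = []
    for rule in ip2nets.split(";"):
        fields = rule.split()
        if len(fields) < 2:
            continue
        network, patterns = fields[0], fields[1:]
        if any(ip_matches(p, ip) for p in patterns for ip in local_ips):
            selected.append(network)
    return selected


rules = "tcp0 10.10.0.*; o2ib0(ib0) 192.168.10.[1-128]"
tcp_client = networks_for_node(rules, ["10.10.0.5"])
ib_server = networks_for_node(rules, ["192.168.10.20"])
router = networks_for_node(rules, ["10.10.0.1", "192.168.10.1"])
```

In this model the TCP client joins only tcp0, the IB server joins only o2ib0, and a router with an address on each subnet joins both.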

live_router_check_interval, dead_router_check_interval, auto_down, check_routers_before_use and router_ping_timeout

In a routed Lustre setup with nodes on different networks such as TCP/IP and Elan, the router checker checks the status of a router. The auto_down parameter enables/disables (1/0) the automatic marking of router state.

The live_router_check_interval parameter specifies a time interval in seconds after which the router checker will ping the live routers.

In the same way, you can set the dead_router_check_interval parameter for checking dead routers.

You can set the timeout for the router checker to check the live or dead routers by setting the router_ping_timeout parameter. The router pinger sends a ping message to a dead/live router once every dead/live_router_check_interval seconds, and if it does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.

The last parameter is check_routers_before_use, which is off by default. If it is turned on, you must also give dead_router_check_interval a positive integer value.
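The relationship between these parameters can be summarized in a Python sketch. It illustrates the rules stated above and is not LNET's implementation; the class and method names are hypothetical, and the numeric values in the usage lines are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class RouterCheckerConfig:
    live_router_check_interval: int   # seconds between pings to live routers
    dead_router_check_interval: int   # seconds between pings to dead routers
    router_ping_timeout: int          # seconds to wait for a ping reply
    auto_down: bool                   # automatic marking of router state (1/0)
    check_routers_before_use: bool = False  # off by default

    def validate(self):
        """Enforce the dependency described for check_routers_before_use."""
        if self.check_routers_before_use and self.dead_router_check_interval <= 0:
            raise ValueError("check_routers_before_use requires a positive "
                             "dead_router_check_interval")

    def next_ping_delay(self, router_alive):
        """Seconds until the next ping, depending on the router's current state."""
        if router_alive:
            return self.live_router_check_interval
        return self.dead_router_check_interval

    def considers_down(self, seconds_since_ping, reply_received):
        """A router is considered down if no reply arrived within the timeout."""
        return (not reply_received) and seconds_since_ping >= self.router_ping_timeout


# Arbitrary example values, not recommended settings.
config = RouterCheckerConfig(60, 60, 50, True)
config.validate()
```

Calling validate() on a configuration with check_routers_before_use enabled and dead_router_check_interval set to 0 raises an error, which mirrors the requirement above.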

The router checker gets the following variables for each router:

The initial time to disable a router should be one minute (enough to plug in a cable after removing it). If the router is administratively marked as "up", then the router checker clears the timeout. When a route is disabled (and possibly new), the "sent packets" counter is set to 0. When the route is first reused (that is, when an elapsed disable time is found), the sent packets counter is incremented to 1, and it is incremented again for every further use of the route. Once the route has been used successfully for 100 packets, the sent packets counter has a value of 100; at that point, the timeout is set to 0 (zero), so future errors no longer double the timeout.



Note - The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval.


2.4.2.1 LNET Routers

All LNET routers that bridge two networks are equivalent. They are not configured as primary or secondary, and load is balanced across all available routers.

With the router checker configured, Lustre nodes can detect router health status, avoid those that appear dead, and reuse the ones that restore service after failures.

There are no hard requirements regarding the number of LNET routers, although there should be enough to handle the required file serving bandwidth (and a 25% margin for headroom).
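The sizing guideline above amounts to a short calculation. In this Python sketch the function name and the bandwidth figures are hypothetical; only the 25% headroom comes from the text, and the per-router bandwidth must be measured for your own hardware.

```python
import math


def routers_needed(required_bandwidth, per_router_bandwidth, headroom=0.25):
    """Routers needed to carry required_bandwidth plus the given headroom."""
    if required_bandwidth < 0 or per_router_bandwidth <= 0:
        raise ValueError("bandwidth values must be positive")
    return math.ceil(required_bandwidth * (1 + headroom) / per_router_bandwidth)


# Hypothetical figures in GB/s: 10 required in total, 1.5 sustained per router.
example_count = routers_needed(10, 1.5)
```

With these figures, 10 GB/s plus 25% is 12.5 GB/s, which needs nine routers at 1.5 GB/s each.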

Comparing 32-bit and 64-bit LNET Routers

By default, at startup, LNET routers allocate 544M (i.e. 139264 4K pages) of memory as router buffers. The buffers can only come from low system memory (i.e. ZONE_DMA and ZONE_NORMAL).

On 32-bit systems, low system memory is, at most, 896M no matter how much RAM is installed. The default router buffer size puts heavy pressure on the low memory zones, making an out-of-memory (OOM) situation more likely. This is a known cause of router hangs. Lowering the value of the large_router_buffers parameter can circumvent this problem, but at the cost of router performance, because large messages wait longer for buffers.

On 64-bit architectures, the ZONE_HIGHMEM zone is always empty. Router buffers can come from all available memory and out-of-memory hangs do not occur. Therefore, we recommend using 64-bit routers.
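The 32-bit figures above can be verified with a few lines of Python. The constant names are hypothetical; the values are the ones quoted in this section.

```python
PAGE_SIZE_KIB = 4                      # 4K pages
DEFAULT_ROUTER_BUFFER_PAGES = 139264   # default router buffer allocation
LOW_MEMORY_32BIT_MIB = 896             # upper bound on low memory, 32-bit systems

router_buffer_mib = DEFAULT_ROUTER_BUFFER_PAGES * PAGE_SIZE_KIB / 1024
low_memory_share = router_buffer_mib / LOW_MEMORY_32BIT_MIB

print(f"router buffers: {router_buffer_mib:.0f} MiB")            # 544 MiB
print(f"share of 32-bit low memory: {low_memory_share:.0%}")     # 61%
```

In other words, the default router buffers alone claim roughly three fifths of the low memory that a 32-bit router can ever have, which is why the pressure on ZONE_DMA and ZONE_NORMAL is so high.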

2.4.3 Downed Routers

There are two mechanisms to update the health status of a peer or a router:

Several key differences in both mechanisms:


2.5 Starting and Stopping LNET

Lustre automatically starts and stops LNET, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start Lustre.

2.5.1 Starting LNET

To start LNET, run:

$ modprobe lnet
$ lctl network up

To see the list of local NIDs, run:

$ lctl list_nids

This command tells you the network(s) configured to work with Lustre.

If the networks are not set up correctly, check the "networks=" line in modules.conf and make sure the network layer modules are correctly installed and configured.

To get the best remote NID, run:

$ lctl which_nid <NID list>

where <NID list> is the list of available NIDs.

This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.

2.5.1.1 Starting Clients

To start a TCP client, run:

mount -t lustre mdsnode:/mdsA/client /mnt/lustre/

To start an Elan client, run:

mount -t lustre 2@elan0:/mdsA/client /mnt/lustre

2.5.2 Stopping LNET

Before the LNET modules can be removed, LNET references must be removed. In general, these references are removed automatically when Lustre is shut down, but for standalone routers, an explicit step is needed to stop LNET. Run:

lctl network unconfigure


Note - Attempting to remove Lustre modules prior to stopping the network may result in a crash or an LNET hang. If this occurs, the node must be rebooted (in most cases). Make sure that the Lustre network and Lustre are stopped prior to unloading the modules. Be extremely careful using rmmod -f.


After the LNET network has been unconfigured, remove the LND and LNET modules by running:

modprobe -r <any lnd and the lnet modules>


Tip - To remove all Lustre modules, run:

$ lctl modules | awk '{print $2}' | xargs rmmod