Lustre 1.8 Operations Manual

821-0035-11



Contents

Preface

Part I Lustre Architecture

1. Introduction to Lustre

1.1 Introducing the Lustre File System

1.1.1 Lustre Key Features

1.2 Lustre Components

1.2.1 Lustre Networking (LNET)

1.2.2 Management Server (MGS)

1.3 Lustre Systems

1.4 Files in the Lustre File System

1.4.1 Lustre File System and Striping

1.4.2 Lustre Storage

1.4.2.1 OSS Storage

1.4.2.2 MDS Storage

1.4.3 Lustre System Capacity

1.5 Lustre Configurations

1.6 Lustre Networking

1.7 Lustre Failover and Rolling Upgrades

2. Understanding Lustre Networking

2.1 Introduction to LNET

2.2 Supported Network Types

2.3 Designing Your Lustre Network

2.3.1 Identify All Lustre Networks

2.3.2 Identify Nodes to Route Between Networks

2.3.3 Identify Network Interfaces to Include/Exclude from LNET

2.3.4 Determine Cluster-wide Module Configuration

2.3.5 Determine Appropriate Mount Parameters for Clients

2.4 Configuring LNET

2.4.1 Module Parameters

2.4.1.1 Using Usocklnd

2.4.1.2 OFED InfiniBand Options

2.4.2 Module Parameters - Routing

2.4.2.1 LNET Routers

2.4.3 Downed Routers

2.5 Starting and Stopping LNET

2.5.1 Starting LNET

2.5.1.1 Starting Clients

2.5.2 Stopping LNET

Part II Lustre Administration

3. Installing Lustre

3.1 Preparing to Install Lustre

3.1.1 Supported Operating System, Platform and Interconnect

3.1.2 Required Lustre Software

3.1.3 Required Tools and Utilities

3.1.4 (Optional) High-Availability Software

3.1.5 Debugging Tools

3.1.6 Environmental Requirements

3.1.7 Memory Requirements

3.1.7.1 MDS Memory Requirements

3.1.7.2 OSS Memory Requirements

3.2 Installing Lustre from RPMs

3.3 Installing Lustre from Source Code

3.3.1 Patching the Kernel

3.3.1.1 Introducing the Quilt Utility

3.3.1.2 Get the Lustre Source and Unpatched Kernel

3.3.1.3 Patch the Kernel

3.3.2 Create and Install the Lustre Packages

3.3.3 Installing Lustre with a Third-Party Network Stack

4. Configuring Lustre

4.1 Configuring the Lustre File System

4.1.0.1 Simple Lustre Configuration Example

4.1.0.2 Module Setup

4.1.1 Scaling the Lustre File System

4.2 Additional Lustre Configuration

4.3 Basic Lustre Administration

4.3.1 Specifying the File System Name

4.3.2 Starting up Lustre

4.3.3 Mounting a Server

4.3.4 Unmounting a Server

4.3.5 Working with Inactive OSTs

4.3.6 Finding Nodes in the Lustre File System

4.3.7 Mounting a Server Without Lustre Service

4.3.8 Specifying Failout/Failover Mode for OSTs

4.3.9 Running Multiple Lustre File Systems

4.3.10 Setting and Retrieving Lustre Parameters

4.3.10.1 Setting Parameters with mkfs.lustre

4.3.10.2 Setting Parameters with tunefs.lustre

4.3.10.3 Setting Parameters with lctl

4.3.10.4 Reporting Current Parameter Values

4.3.11 Regenerating the Lustre Configuration Logs

4.3.12 Changing a Server NID

4.3.13 Removing and Restoring OSTs

4.3.13.1 Removing an OST from the File System

4.3.13.2 Restoring an OST in the File System

4.3.14 Aborting Recovery

4.3.15 Determining Which Machine is Serving an OST

4.4 More Complex Configurations

4.4.1 Failover

4.5 Operational Scenarios

4.5.1 Unmounting a Server (without Failover)

4.5.2 Unmounting a Server (with Failover)

4.5.3 Changing the Address of a Failover Node

5. Service Tags

5.1 Introduction to Service Tags

5.2 Using Service Tags

5.2.1 Installing Service Tags

5.2.2 Discovering and Registering Lustre Components

5.2.3 Information Registered with Sun

6. Configuring Lustre - Examples

6.1 Simple TCP Network

6.1.1 Lustre with Combined MGS/MDT

6.1.1.1 Installation Summary

6.1.1.2 Configuration Generation and Application

6.1.2 Lustre with Separate MGS and MDT

6.1.2.1 Installation Summary

6.1.2.2 Configuration Generation and Application

6.1.2.3 Configuring Lustre with a CSV File

7. More Complicated Configurations

7.1 Multihomed Servers

7.1.1 Modprobe.conf

7.1.2 Start Servers

7.1.3 Start Clients

7.2 Elan to TCP Routing

7.2.1 Modprobe.conf

7.2.2 Start Servers

7.2.3 Start Clients

7.3 Load Balancing with InfiniBand

7.3.1 Setting Up modprobe.conf for Load Balancing

7.4 Multi-Rail Configurations with LNET

8. Failover

8.1 What is Failover?

8.1.1 Failover Capabilities

8.1.2 Types of Failover Configurations

8.2 Failover Functionality in Lustre

8.2.1 MDT Failover Configuration (Active/Passive)

8.2.2 OST Failover Configuration (Active/Active)

8.2.3 Lustre Failover and MMP

8.2.3.1 Working with MMP

8.3 Configuring and Using Heartbeat with Lustre Failover

8.3.1 Creating a Failover Environment

8.3.1.1 Power Management Software

8.3.1.2 Power Equipment

8.3.2 Setting up the Heartbeat Software

8.3.2.1 Installing Heartbeat

8.3.2.2 Configuring Heartbeat

8.3.2.3 (Optional) Migrating a Heartbeat Configuration (v1 to v2)

8.3.3 Working with Heartbeat

8.3.3.1 Starting Heartbeat

8.3.3.2 Switching Resources Between Nodes

9. Configuring Quotas

9.1 Working with Quotas

9.1.1 Enabling Disk Quotas

9.1.1.1 Administrative and Operational Quotas

9.1.2 Creating Quota Files and Quota Administration

9.1.3 Quota Allocation

9.1.4 Known Issues with Quotas

9.1.4.1 Granted Cache and Quota Limits

9.1.4.2 Quota Limits

9.1.4.3 Quota File Formats

9.1.5 Lustre Quota Statistics

9.1.5.1 Interpreting Quota Statistics

10. RAID

10.1 Considerations for Backend Storage

10.1.1 Selecting Storage for the MDS or OSTs

10.1.2 Reliability Best Practices

10.1.3 Understanding Double Failures with Hardware and Software RAID5

10.1.4 Performance Tradeoffs

10.1.5 Formatting Options for RAID Devices

10.1.5.1 Creating an External Journal

10.1.6 Handling Degraded RAID Arrays

10.2 Insights into Disk Performance Measurement

10.3 Lustre Software RAID Support

10.3.0.1 Enabling Software RAID on Lustre

11. Kerberos

11.1 What is Kerberos?

11.2 Lustre Setup with Kerberos

11.2.1 Configuring Kerberos for Lustre

11.2.1.1 Kerberos Distributions Supported on Lustre

11.2.1.2 Preparing to Set Up Lustre with Kerberos

11.2.1.3 Configuring Lustre for Kerberos

11.2.1.4 Configuring Kerberos

11.2.1.5 Setting the Environment

11.2.1.6 Building Lustre

11.2.1.7 Running GSS Daemons

11.2.2 Types of Lustre-Kerberos Flavors

11.2.2.1 Basic Flavors

11.2.2.2 Security Flavor

11.2.2.3 Customized Flavor

11.2.2.4 Specifying Security Flavors

11.2.2.5 Mounting Clients

11.2.2.6 Rules, Syntax and Examples

11.2.2.7 Authenticating Normal Users

12. Bonding

12.1 Network Bonding

12.2 Requirements

12.3 Using Lustre with Multiple NICs versus Bonding NICs

12.4 Bonding Module Parameters

12.5 Setting Up Bonding

12.5.1 Examples

12.6 Configuring Lustre with Bonding

12.6.1 Bonding References

13. Upgrading and Downgrading Lustre

13.1 Supported Upgrades

13.2 Lustre Interoperability

13.3 Upgrading Lustre 1.6.x to 1.8.x

13.3.1 Performing a Complete File System Upgrade

13.3.2 Performing a Rolling Upgrade

13.4 Upgrading Lustre 1.8.x to the Next Minor Version

13.5 Downgrading from Lustre 1.8.x to 1.6.x

13.5.1 Performing a Complete File System Downgrade

13.5.2 Performing a Rolling Downgrade

14. Lustre SNMP Module

14.1 Installing the Lustre SNMP Module

14.2 Building the Lustre SNMP Module

14.3 Using the Lustre SNMP Module

15. Backup and Restore

15.1 Backing up a File System

15.2 Backing up a Device (MDS or OST)

15.2.1 Backing Up the MDS

15.2.2 Backing Up an OST

15.3 Backing up Files

15.3.1 Backing up Extended Attributes

15.4 Restoring from a File-level Backup

15.5 Using LVM Snapshots with Lustre

15.5.1 Creating an LVM-based Backup File System

15.5.2 Backing up New/Changed Files to the Backup File System

15.5.3 Creating Snapshot Volumes

15.5.4 Restoring the File System From a Snapshot

15.5.5 Deleting Old Snapshots

15.5.6 Changing Snapshot Volume Size

16. POSIX

16.1 Introduction to POSIX

16.2 Installing POSIX

16.2.1 POSIX Installation Using a Quick Start Version

16.3 Building and Running a POSIX Compliance Test Suite on Lustre

16.3.1 Building the Test Suite from Scratch

16.3.2 Running the Test Suite Against Lustre

16.4 Isolating and Debugging Failures

17. Benchmarking

17.1 Bonnie++ Benchmark

17.2 IOR Benchmark

17.3 IOzone Benchmark

18. Lustre I/O Kit

18.1 Lustre I/O Kit Description and Prerequisites

18.1.1 Downloading an I/O Kit

18.1.2 Prerequisites to Using an I/O Kit

18.2 Running I/O Kit Tests

18.2.1 sgpdd_survey

18.2.2 obdfilter_survey

18.2.2.1 Running obdfilter_survey Against a Local Disk

18.2.2.2 Running obdfilter_survey Against a Network

18.2.2.3 Running obdfilter_survey Against a Network Disk

18.2.2.4 Output Files

18.2.2.5 Script Output

18.2.2.6 Visualizing Results

18.2.3 ost_survey

18.2.4 stats-collect

18.3 PIOS Test Tool

18.3.1 Synopsis

18.3.2 PIOS I/O Modes

18.3.3 PIOS Parameters

18.3.4 PIOS Examples

18.4 LNET Self-Test

18.4.1 Basic Concepts of LNET Self-Test

18.4.1.1 Modules

18.4.1.2 Utilities

18.4.1.3 Session

18.4.1.4 Console

18.4.1.5 Group

18.4.1.6 Test

18.4.1.7 Batch

18.4.1.8 Sample Script

18.4.2 LNET Self-Test Commands

18.4.2.1 Session

18.4.2.2 Group

18.4.2.3 Batch and Test

18.4.2.4 Other Commands

19. Lustre Recovery

19.1 Recovery Overview

19.1.1 Client Failure

19.1.2 Client Eviction

19.1.3 MDS Failure (Failover)

19.1.4 OST Failure (Failover)

19.1.5 Network Partition

19.1.6 Failed Recovery

19.2 Metadata Replay

19.2.1 XID Numbers

19.2.2 Transaction Numbers

19.2.3 Replay and Resend

19.2.4 Client Replay List

19.2.5 Server Recovery

19.2.6 Request Replay

19.2.7 Gaps in the Replay Sequence

19.2.8 Lock Recovery

19.2.9 Request Resend

19.3 Reply Reconstruction

19.3.1 Required State

19.3.2 Reconstruction of Open Replies

19.4 Version-based Recovery

19.4.1 Delayed Recovery

19.4.2 Working with VBR

19.4.3 Tips for Using VBR

19.5 Recovering from Errors or Corruption on a Backing File System

19.6 Recovering from Corruption in the Lustre File System

19.6.1 Working with Orphaned Objects

Part III Lustre Tuning, Monitoring and Troubleshooting

20. Lustre Tuning

20.1 Module Options

20.1.1 OSS Service Thread Count

20.1.1.1 Optimizing the Number of Service Threads

20.1.2 MDS Service Thread Count

20.1.2.1 I/O Scheduler

20.2 LNET Tunables

20.2.0.1 Transmit and Receive Buffer Size

20.2.0.2 irq_affinity

20.3 Options for Formatting the MDT and OSTs

20.3.1 Planning for Inodes

20.3.2 Sizing the MDT

20.4 Overriding Default Formatting Options

20.4.1 Number of Inodes for the MDT

20.4.2 Inode Size for the MDT

20.4.3 Number of Inodes for an OST

20.5 Large-Scale Tuning for Cray XT and Equivalents

20.5.1 Network Tunables

20.6 Lockless I/O Tunables

20.7 Data Checksums

21. LustreProc

21.1 Proc Entries for Lustre

21.1.1 Locating Lustre File Systems and Servers

21.1.2 Lustre Timeouts

21.1.3 Adaptive Timeouts

21.1.3.1 Configuring Adaptive Timeouts

21.1.3.2 Interpreting Adaptive Timeouts Information

21.1.4 LNET Information

21.1.5 Free Space Distribution

21.1.5.1 Managing Stripe Allocation

21.2 Lustre I/O Tunables

21.2.1 Client I/O RPC Stream Tunables

21.2.2 Watching the Client RPC Stream

21.2.3 Client Read-Write Offset Survey

21.2.4 Client Read-Write Extents Survey

21.2.5 Watching the OST Block I/O Stream

21.2.6 Using File Readahead and Directory Statahead

21.2.6.1 Tuning File Readahead

21.2.6.2 Tuning Directory Statahead

21.2.7 OSS Read Cache

21.2.7.1 Using OSS Read Cache

21.2.8 mballoc History

21.2.9 mballoc3 Tunables

21.2.10 Locking

21.2.11 Setting MDS and OSS Thread Counts

21.3 Debug Support

21.3.1 RPC Information for Other OBD Devices

21.3.1.1 Interpreting OST Statistics

21.3.1.2 llobdstat

21.3.1.3 Interpreting MDT Statistics

22. Lustre Monitoring and Troubleshooting

22.1 Monitoring Lustre

22.2 Troubleshooting Lustre

22.2.1 Error Numbers

22.2.2 Error Messages

22.2.3 Lustre Logs

22.3 Reporting a Lustre Bug

22.4 Common Lustre Problems and Performance Tips

22.4.1 Recovering from an Unavailable OST

22.4.2 Write Performance Better Than Read Performance

22.4.3 OST Object is Missing or Damaged

22.4.4 OSTs Become Read-Only

22.4.5 Identifying a Missing OST

22.4.6 Improving Lustre Performance When Working with Small Files

22.4.7 Default Striping

22.4.8 Erasing a File System

22.4.9 Reclaiming Reserved Disk Space

22.4.10 Considerations in Connecting a SAN with Lustre

22.4.11 Handling/Debugging "Bind: Address already in use" Error

22.4.12 Replacing An Existing OST or MDS

22.4.13 Handling/Debugging Error "- 28"

22.4.14 Triggering Watchdog for PID NNN

22.4.15 Handling Timeouts on Initial Lustre Setup

22.4.16 Handling/Debugging "LustreError: xxx went back in time"

22.4.17 Lustre Error: "Slow Start_Page_Write"

22.4.18 Drawbacks in Doing Multi-client O_APPEND Writes

22.4.19 Slowdown Occurs During Lustre Startup

22.4.20 Log Message ‘Out of Memory’ on OST

22.4.21 Number of OSTs Needed for Sustained Throughput

22.4.22 Setting SCSI I/O Sizes

22.4.23 Identifying Which Lustre File an OST Object Belongs To

23. Lustre Debugging

23.1 Lustre Debug Messages

23.1.1 Format of Lustre Debug Messages

23.2 Tools for Lustre Debugging

23.2.1 Debug Daemon Option to lctl

23.2.1.1 lctl Debug Daemon Commands

23.2.2 Controlling the Kernel Debug Log

23.2.3 The lctl Tool

23.2.4 Finding Memory Leaks

23.2.5 Printing to /var/log/messages

23.2.6 Tracing Lock Traffic

23.2.7 Sample lctl Run

23.2.8 Adding Debugging to the Lustre Source Code

23.3 Troubleshooting with strace

23.4 Looking at Disk Content

23.4.1 Determine the Lustre UUID of an OST

23.4.2 Tcpdump

23.5 Ptlrpc Request History

23.6 Using LWT Tracing

Part IV Lustre for Users

24. Striping and I/O Options

24.1 File Striping

24.1.1 Advantages of Striping

24.1.1.1 Bandwidth

24.1.2 Disadvantages of Striping

24.1.2.1 Increased Overhead

24.1.2.2 Increased Risk

24.1.3 Stripe Size

24.2 Displaying Files and Directories with lfs getstripe

24.3 lfs setstripe - Setting File Layouts

24.3.1 Changing Striping for a Subdirectory

24.3.2 Using a Specific Striping Pattern/File Layout for a Single File

24.3.3 Creating a File on a Specific OST

24.4 Managing Free Space

24.4.1 Checking File System Free Space

24.4.2 Using Stripe Allocations

24.4.3 Round-Robin Allocator

24.4.4 Weighted Allocator

24.4.5 Adjusting the Weighting Between Free Space and Location

24.5 Handling Full OSTs

24.5.1 Checking File System Usage

24.5.2 Taking a Full OST Offline

24.5.3 Migrating Data within a File System

24.6 Creating and Managing OST Pools

24.6.1 Working with OST Pools

24.6.1.1 Using the lfs Command with OST Pools

24.6.2 Tips for Using OST Pools

24.7 Performing Direct I/O

24.7.1 Making File System Objects Immutable

24.8 Other I/O Options

24.8.1 Lustre Checksums

24.8.1.1 Changing Checksum Algorithms

24.9 Striping Using llapi

25. Lustre Security

25.1 Using ACLs

25.1.1 How ACLs Work

25.1.2 Using ACLs with Lustre

25.1.3 Examples

25.2 Using Root Squash

25.2.1 Configuring Root Squash

25.2.2 Enabling and Tuning Root Squash

25.2.3 Syntax Error Handling

26. Lustre Operating Tips

26.1 Adding an OST to a Lustre File System

26.2 A Simple Data Migration Script

26.3 Adding Multiple SCSI LUNs on Single HBA

26.4 Failures Running a Client and OST on the Same Machine

26.5 Improving Lustre Metadata Performance While Using Large Directories

Part V Reference

27. User Utilities (man1)

27.1 lfs

27.2 lfsck

27.3 Filefrag

27.4 Mount

27.5 Handling Timeouts

28. Lustre Programming Interfaces (man2)

28.1 User/Group Cache Upcall

28.1.1 Name

28.1.2 Description

28.1.2.1 Primary and Secondary Groups

28.1.3 Parameters

28.1.4 Data Structures

29. Setting Lustre Properties (man3)

29.1 Using llapi

29.1.1 llapi_file_create

29.1.2 llapi_file_get_stripe

29.1.3 llapi_file_open

29.1.4 llapi_quotactl

29.1.5 llapi_path2fid

30. Configuration Files and Module Parameters (man5)

30.1 Introduction

30.2 Module Options

30.2.1 LNET Options

30.2.1.1 Network Topology

30.2.1.2 networks ("tcp")

30.2.1.3 routes ("")

30.2.1.4 forwarding ("")

30.2.2 SOCKLND Kernel TCP/IP LND

30.2.3 QSW LND

30.2.4 RapidArray LND

30.2.5 VIB LND

30.2.6 OpenIB LND

30.2.7 Portals LND (Linux)

30.2.8 Portals LND (Catamount)

30.2.9 MX LND

31. System Configuration Utilities (man8)

31.1 mkfs.lustre

31.2 tunefs.lustre

31.3 lctl

31.4 mount.lustre

31.5 Additional System Configuration Utilities

31.5.1 lustre_rmmod.sh

31.5.2 e2scan

31.5.3 Utilities to Manage Large Clusters

31.5.4 Application Profiling Utilities

31.5.5 More /proc Statistics for Application Profiling

31.5.6 Testing / Debugging Utilities

31.5.7 Flock Feature

31.5.7.1 Example

31.5.8 l_getgroups

31.5.9 llobdstat

31.5.10 llstat

31.5.11 lst

31.5.12 plot-llstat

31.5.13 routerstat

31.5.14 ll_recover_lost_found_objs

32. System Limits

32.1 Maximum Stripe Count

32.2 Maximum Stripe Size

32.3 Minimum Stripe Size

32.4 Maximum Number of OSTs and MDTs

32.5 Maximum Number of Clients

32.6 Maximum Size of a File System

32.7 Maximum File Size

32.8 Maximum Number of Files or Subdirectories in a Single Directory

32.9 MDS Space Consumption

32.10 Maximum Length of a Filename and Pathname

32.11 Maximum Number of Open Files for Lustre File Systems

32.12 OSS RAM Size

Glossary

Index