Lustre (file system)
|Stable release||2.3 / October 21, 2012|
|Type||Distributed file system|
|Website||Lustre Home Page (1.8.x) Lustre 2.x wiki|
Lustre is a parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file systems are available under the GNU GPL (v2 only) and provide a high performance file system for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters.
Because Lustre file systems offer high performance and open licensing, they are often used in supercomputers. At the present time, six of the top 10 and more than 60 of the top 100 supercomputers in the world use Lustre file systems, including the world's fastest TOP500 supercomputer, Titan.
Lustre file systems are scalable and can support tens of thousands of client systems, tens of petabytes (PB) of storage, and more than a terabyte per second (TB/s) of aggregate I/O throughput. This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.
The Lustre file system architecture was started as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company Cluster File Systems and developed Lustre under the ASCI Path Forward project, which released Lustre 1.0 in 2003. In 2007, Sun Microsystems acquired Cluster File Systems Inc. Sun included Lustre with its HPC hardware offerings, with the intent to bring the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system. In November 2008, Braam left Sun Microsystems and the group's most senior leaders took control of the project.
In 2010, Oracle Corporation began to manage and release Lustre following its acquisition of Sun that year.
In December 2010, Oracle announced that it would cease Lustre 2.x development and place Lustre 1.8 into maintenance-only support, creating uncertainty around the future development of the file system. Following this announcement, several new organizations sprang up to provide support and development in an open community development model, including Whamcloud, Xyratex, Open Scalable File Systems (OpenSFS), European Open File Systems (EOFS) and others. In the same year, many Lustre alumni left Oracle to continue working on Lustre elsewhere. Braam joined the solutions-oriented Xyratex, while others joined the software startup Whamcloud, where they continue to work on Lustre.
In 2011, OpenSFS awarded a substantial contract for Lustre feature development to Whamcloud. This contract covers the completion of several long-standing features, including improved Single Server Metadata Performance scaling, which allows Lustre to take better advantage of many-core metadata servers; online Lustre distributed filesystem checking (LFSCK), which allows verification of the distributed filesystem state between data and metadata servers while the filesystem is mounted and in use; and Distributed Namespace, formerly Clustered Metadata (CMD), which allows the Lustre metadata to be distributed across multiple servers. Development also continued on ZFS-based back-end object storage at Lawrence Livermore National Laboratory. These features form the backbone of the Lustre 2.2 to 2.4 community release roadmap. In late 2011, a separate contract was awarded to Whamcloud for the maintenance of the Lustre 2.x source code to ensure that the Lustre code would receive sufficient testing and bug fixing while new features were being developed.
In July 2012 Whamcloud was acquired by Intel in order to bolster its supercomputing infrastructure capabilities and ramp up for exascale storage development after Whamcloud won the FastForward DOE contract to extend Lustre for exascale computing systems in the 2018 timeframe.
In February 2013, Xyratex Ltd. announced it acquired the original Lustre trademark, logo, website and associated intellectual property from Oracle.
Release history 
Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side data write-back cache accounting (grant).
Lustre 1.6.0, released in April 2007, supported mount configuration (“mountconf”) allowing servers to be configured with "mkfs" and "mount", supported dynamic addition of object storage targets (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and supported free space management for object allocations.
Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, and added basic heterogeneous storage management via OST Pools, adaptive network timeouts, and version-based recovery. It also serves as a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0.
Lustre 2.0, released in August 2010, was based on significantly restructured internal code to prepare for major architectural advancements. Lustre 2.x clients cannot interoperate with 1.8 or earlier servers. However, Lustre 1.8.6 and later clients can interoperate with Lustre 2.0 servers. The MDT and OST on-disk format from 1.8 can be upgraded to 2.0 without the need to reformat the filesystem.
Lustre 2.1, released in September 2011, was a community-wide initiative in response to Oracle suspending development on Lustre 2.x releases. It added Red Hat Linux 6 server support, increased the maximum ext4-based OST size from 24 TB to 128 TB, and included a number of performance and stability improvements. Lustre 2.1 servers remain interoperable with 1.8.6 and later clients, and Lustre 2.1 is the new long-term maintenance release for Lustre.
Lustre 2.2, released in March 2012, focused on providing metadata performance improvements and new features. It added parallel directory operations allowing multiple clients to traverse and modify a single large directory concurrently, faster recovery from server failures, increased stripe counts for a single file (across up to 2000 OSTs), and improved single-client directory traversal (ls -l, find, du) performance.
Lustre 2.3, released in October 2012, is the newest Lustre feature release. The server code was optimized to remove internal locking bottlenecks on nodes with many CPU cores (over 16). The OST gained preliminary support for using ZFS as the backing filesystem. The MDS LFSCK can verify and repair the Object Index (OI) file while the filesystem is in use. The server-side I/O statistics were enhanced to allow integration with batch job schedulers such as SLURM to track per-job statistics. Client-side support was updated for Linux 3.0 kernels.
A Lustre file system has three major functional units:
- A single metadata server (MDS) that has a single metadata target (MDT) per Lustre filesystem that stores namespace metadata, such as filenames, directories, access permissions, and file layout. The MDT data is stored in a single local disk filesystem, which may be a bottleneck under some metadata intensive workloads. However, unlike block-based distributed filesystems, such as GPFS and PanFS, where the metadata server controls all of the block allocation, the Lustre metadata server is only involved in pathname and permission checks, and is not involved in any file I/O operations, avoiding I/O scalability bottlenecks on the metadata server.
- One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs). Depending on the server’s hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
- Client(s) that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
The MDT, OST, and client can be on the same node, but in typical installations these functions are on separate nodes communicating over a network. The Lustre Network (LNET) layer supports several network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet and other networks, Myrinet, Quadrics, and other proprietary network technologies. Lustre takes advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and normally formatted as ext4 file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use an enhanced version of ext4 called ldiskfs to store data. Work started in 2008 at Sun to port Lustre to Sun's ZFS/DMU for back-end data storage and continues as an open source project.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then interprets the layout in the logical object volume (LOV) layer, which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
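The mapping of an offset and size onto objects can be illustrated with a minimal sketch. It assumes a simple round-robin layout with a fixed stripe size; the function name `split_range` and its parameters are hypothetical and are not taken from the Lustre source code.

```python
from typing import List, Tuple


def split_range(offset: int, size: int,
                stripe_size: int, stripe_count: int) -> List[Tuple[int, int, int]]:
    """Split a file byte range into (object_index, object_offset, length) chunks.

    Assumes a round-robin layout: stripe 0 goes to object 0, stripe 1 to
    object 1, and so on, wrapping around after stripe_count objects.
    """
    chunks = []
    end = offset + size
    while offset < end:
        stripe_number = offset // stripe_size      # which stripe of the file
        within = offset % stripe_size              # position inside that stripe
        obj_index = stripe_number % stripe_count   # object holding this stripe
        # Each object stores its stripes back to back.
        obj_offset = (stripe_number // stripe_count) * stripe_size + within
        length = min(stripe_size - within, end - offset)
        chunks.append((obj_index, obj_offset, length))
        offset += length
    return chunks
```

Each resulting chunk targets a different object whenever the range spans several stripes, which is why a client could issue the corresponding reads or writes in parallel.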
Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem, which increases the risk of filesystem corruption from misbehaving or defective clients.
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the Blue Gene installation at Lawrence Livermore National Laboratory.
Another approach used in the past is the liblustre library, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allowed computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched was not a Lustre client. Liblustre allowed data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low-latency, high-bandwidth access from computational processors to the Lustre file system.
Data objects and file striping 
In a traditional Unix disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks. These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file without further communication with the MDS.
If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is “striped” across the objects similar to RAID 0. Striping a file over multiple objects provides significant performance benefits. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (one per OST) scales the file I/O locking capability of the filesystem proportionately. Each file in the filesystem can have a different striping layout, so that performance and capacity can be tuned optimally for each file.
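As a rough illustration of why striping lifts the single-target size limit, the sketch below computes how many bytes of a file would land on each object. It assumes the same RAID 0 style round-robin placement described above with no holes in the file; `object_sizes` is a hypothetical helper, not a Lustre API.

```python
from typing import List


def object_sizes(file_size: int, stripe_size: int, stripe_count: int) -> List[int]:
    """Return the number of bytes stored in each object of a striped file."""
    full_stripes, remainder = divmod(file_size, stripe_size)
    full_rounds, extra = divmod(full_stripes, stripe_count)
    # Every object receives one stripe per complete round.
    sizes = [full_rounds * stripe_size] * stripe_count
    # The first `extra` objects receive one more full stripe.
    for i in range(extra):
        sizes[i] += stripe_size
    # A trailing partial stripe goes to the next object in the rotation.
    if remainder:
        sizes[extra] += remainder
    return sizes
```

Under these assumptions, no object holds more than roughly the file size divided by the stripe count (plus at most one stripe), so a file can exceed the capacity of any single OST.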
Lustre has a distributed lock manager in the OpenVMS style to protect the integrity of each file's data and metadata. Access and modification of a Lustre file is completely cache coherent among all of the clients. Metadata locks are managed by the MDT that stores the inode for the file, using the 128-bit Lustre File Identifier (FID, composed of the Sequence number and Object ID) as the resource name. The metadata locks are split into multiple bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), and layout (file striping). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently it is only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
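The idea of splitting a metadata lock into independent bits can be sketched with a flag type. This is only an illustration: the names `MetaLockBits`, `LOOKUP`, `UPDATE`, `LAYOUT`, and `missing_bits` are hypothetical labels for the three groups described above, not the identifiers used in the Lustre code.

```python
from enum import Flag, auto


class MetaLockBits(Flag):
    """Hypothetical labels for the three groups of protected metadata."""
    LOOKUP = auto()   # owner, group, permission, mode, ACL
    UPDATE = auto()   # inode state: size, contents, link count, timestamps
    LAYOUT = auto()   # file striping


def missing_bits(held: MetaLockBits, wanted: MetaLockBits) -> MetaLockBits:
    """Return the bits a client would still have to request in one RPC."""
    return wanted & ~held


# A client holding only the lookup bit that also needs the layout
# would ask for the layout bit alone.
needed = missing_bits(MetaLockBits.LOOKUP,
                      MetaLockBits.LOOKUP | MetaLockBits.LAYOUT)
```

Because the bits combine with `|`, several of them can be carried in a single request, which mirrors the single-RPC behaviour described above.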
File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted both overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, and/or non-overlapping write extent locks for regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file I/O. In practice, because Linux clients manage their data cache in units of pages, the clients will request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client is requesting an extent lock the OST may grant a lock for a larger extent than requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently-granted locks, whether there are conflicting write locks, and the number of outstanding lock requests. The granted lock is never smaller than the originally-requested extent. OST extent locks use the Lustre FID as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.
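The extent-lock rules above can be sketched as follows. The sketch assumes half-open byte ranges and a 4096-byte page; `Extent`, `page_align`, and `conflicts` are hypothetical names, and the code does not model the OST's policy for growing a granted lock.

```python
from dataclasses import dataclass
from typing import Tuple

PAGE_SIZE = 4096  # page size on most clients, per the text above


@dataclass(frozen=True)
class Extent:
    start: int   # first byte covered (inclusive)
    end: int     # one past the last byte covered (exclusive)
    write: bool  # True for a write lock, False for a read lock


def page_align(start: int, end: int, page_size: int = PAGE_SIZE) -> Tuple[int, int]:
    """Widen [start, end) outward to page boundaries."""
    aligned_start = (start // page_size) * page_size
    aligned_end = -(-end // page_size) * page_size  # ceiling division
    return aligned_start, aligned_end


def conflicts(a: Extent, b: Extent) -> bool:
    """Two locks conflict only if they overlap and at least one is a write."""
    overlap = a.start < b.end and b.start < a.end
    return overlap and (a.write or b.write)
```

For example, `page_align(100, 5000)` yields `(0, 8192)`, and two read extents over the same bytes do not conflict, while a write extent conflicts with any overlapping extent.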
In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.
LNET supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when supported by underlying networks such as Quadrics Elan, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.
LNET provides end-to-end throughput over Gigabit Ethernet (GigE) networks in excess of 100 MB/s, throughput up to 3 GB/s using InfiniBand quad data rate (QDR) links, and throughput over 1 GB/s across 10GigE interfaces.
High availability 
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay while the backup server takes over the storage.
Lustre MDSes are configured as an active/passive pair, while OSSes are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.
Lustre is used by many of the TOP500 supercomputers and multi-site clusters. Six of the top 10 and more than 60 of the top 100 supercomputers in the world use Lustre file systems. In Asia, deployments include the K computer at the RIKEN Advanced Institute for Computational Science, the Tianhe-1A at the National Supercomputing Center in Tianjin, China, and systems at Tokyo Institute of Technology. In North America, they include the Jaguar supercomputer at Oak Ridge National Laboratory (ORNL), systems at the National Energy Research Scientific Computing Center located at Lawrence Berkeley National Laboratory (LBNL), Blue Waters at the University of Illinois, and systems at Lawrence Livermore National Laboratory (LLNL), Pacific Northwest National Laboratory, Texas Advanced Computing Center and NASA. In Europe, Lustre is deployed at CEA.
Commercial support 
Commercial support for Lustre is available from a wide array of vendors. In most cases, this support is bundled with the computing system and/or storage hardware sold by the vendor. Vendors selling bundled computing and Lustre storage systems include Cray, Dell, Hewlett-Packard, Bull, SGI, and others. Oracle no longer sells either systems or storage that include Lustre. Major vendors selling storage hardware with bundled Lustre support include DataDirect Networks (DDN), Dell, NetApp, Terascala, Xyratex, and many others.
See also 
- Distributed file system
- List of file systems, the distributed parallel fault-tolerant file system section
Lustre Hardware/Software Vendors 
- Cray Storage Products
- NetApp High Performance Computing Solution for Lustre
- SGI Professional Services
- Terascala Products
- Xyratex Lustre