Lustre (file system)
|Stable release||2.4.1 (maintenance), 2.5.0 (feature) / October 23, 2013|
|Type||Distributed file system|
|Website||Lustre Community Portal (2.1 and newer)
lustre.org (1.8.7 and older)
Lustre (Intel) wiki
Lustre (OpenSFS) wiki
Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file system software is available under the GNU General Public License (version 2 only) and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters.
Because Lustre file systems offer high performance and open licensing, they are often used in supercomputers. At one time, six of the top 10 and more than 60 of the top 100 supercomputers in the world used Lustre file systems, including Titan, the #2 ranked TOP500 supercomputer in 2013.
Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per second (TB/s) of aggregate I/O throughput. This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.
- 1 History
- 2 Release history
- 3 Architecture
- 4 Implementation
- 5 Data objects and file striping
- 6 Locking
- 7 Networking
- 8 High availability
- 9 Deployments
- 10 Commercial technical support
- 11 See also
- 12 References
- 13 External links
The Lustre file system architecture was started as a research project in 1999 by Peter Braam, who was on the staff of Carnegie Mellon University (CMU) at the time. Braam went on to found his own company Cluster File Systems in 2001, starting from work on the InterMezzo file system in the Coda project at CMU. Lustre was developed under the Accelerated Strategic Computing Initiative Path Forward project funded by the United States Department of Energy, which included Hewlett-Packard and Intel. In September 2007, Sun Microsystems acquired the assets of Cluster File Systems Inc. including its intellectual property. Sun included Lustre with its high-performance computing hardware offerings, with the intent to bring Lustre technologies to Sun's ZFS file system and the Solaris operating system. In November 2008, Braam left Sun Microsystems, and Eric Barton and Andreas Dilger took control of the project. In 2010 Oracle Corporation, by way of its acquisition of Sun, began to manage and release Lustre.
In December 2010, Oracle announced that it would cease Lustre 2.x development and place Lustre 1.8 into maintenance-only support, creating uncertainty around the future development of the file system. Following this announcement, several new organizations sprang up to provide support and development in an open community development model, including Whamcloud, Open Scalable File Systems, Inc. (OpenSFS), European Open File Systems (EOFS) and others. By the end of 2010, most Lustre developers had left Oracle. Braam and several associates joined the hardware-oriented Xyratex when it acquired the assets of ClusterStor, while Barton, Dilger, and others formed software startup Whamcloud, where they continued to work on Lustre.
In August 2011, OpenSFS awarded a contract for Lustre feature development to Whamcloud. This contract covered the completion of several features: improved Single Server Metadata Performance scaling, which allows Lustre to take better advantage of many-core metadata servers; online Lustre distributed filesystem checking (LFSCK), which allows verification of the distributed filesystem state between data and metadata servers while the filesystem is mounted and in use; and Distributed Namespace (DNE), formerly Clustered Metadata (CMD), which allows the Lustre metadata to be distributed across multiple servers. Development also continued on ZFS-based back-end object storage at Lawrence Livermore National Laboratory. These features were in the Lustre 2.2 through 2.4 community release roadmap. In November 2011, a separate contract was awarded to Whamcloud for the maintenance of the Lustre 2.x source code to ensure that the Lustre code would receive sufficient testing and bug fixing while new features were being developed.
In July 2012, Whamcloud was acquired by Intel, after Whamcloud won the FastForward DOE contract to extend Lustre for exascale computing systems in the 2018 timeframe. OpenSFS then transitioned contracts for Lustre development to Intel.
In February 2013, Xyratex Ltd. announced that it had acquired the original Lustre trademark, logo, website and associated intellectual property from Oracle. In June 2013, Intel began positioning Lustre for commercial uses, such as within Hadoop. During 2013, OpenSFS announced requests for proposals (RFPs) to cover Lustre feature development, parallel file system tools, addressing Lustre technical debt, and parallel file system incubators. OpenSFS also established the Lustre Community Portal, a technical site that gathers information and documentation in one place for reference and guidance to support the Lustre open source community.
Lustre 1.2.0, released in March 2004, worked on Linux kernel 2.6, and had a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side data write-back cache accounting (grant).
Lustre 1.6.0, released in April 2007, introduced mount configuration (“mountconf”), which allows servers to be configured with "mkfs" and "mount"; allowed dynamic addition of object storage targets (OSTs); enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers; and provided free space management for object allocations.
Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, added basic heterogeneous storage management via OST Pools, adaptive network timeouts, and version-based recovery. It was a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0.
Lustre 2.0, released in August 2010, was based on significant internally restructured code to prepare for major architectural advancements. Lustre 2.x clients cannot interoperate with 1.8 or earlier servers. However, Lustre 1.8.6 and later clients can interoperate with Lustre 2.0 and later servers. The MDT and OST on-disk format from 1.8 can be upgraded to 2.0 and later without the need to reformat the filesystem.
Lustre 2.1, released in September 2011, was a community-wide initiative in response to Oracle suspending development on Lustre 2.x releases. It adds the ability to run servers on Red Hat Linux 6, increases the maximum ext4-based OST size from 24 TB to 128 TB, and includes a number of performance and stability improvements. Lustre 2.1 servers remain interoperable with 1.8.6 and later clients, and the release is the new long-term maintenance release for Lustre.
Lustre 2.2, released in March 2012, focused on providing metadata performance improvements and new features. It adds parallel directory operations allowing multiple clients to traverse and modify a single large directory concurrently, faster recovery from server failures, increased stripe counts for a single file (across up to 2000 OSTs), and improved single-client directory traversal (ls -l, find, du) performance.
Lustre 2.3, released in October 2012, continued to optimize the metadata server code to remove internal locking bottlenecks on nodes with many CPU cores (over 16). The object store added a preliminary ability to use ZFS as the backing file system. The MDS LFSCK feature can verify and repair the Object Index (OI) file while the file system is in use, after a file-level backup/restore or corruption. The server-side IO statistics were enhanced to allow integration with batch job schedulers such as SLURM to track per-job statistics. Client-side software was updated to work with Linux kernels up to version 3.0.
Lustre 2.4, released in May 2013, added a considerable number of major features, many funded directly through OpenSFS. Distributed Namespace (DNE) allows horizontal metadata capacity and performance scaling for 2.4 clients, by allowing subdirectory trees of a single namespace to be located on separate MDTs. ZFS can now be used as the backing filesystem for both MDT and OST storage. The LFSCK feature allows scanning and verifying the internal consistency of the MDT FID and LinkEA attributes. The Network Request Scheduler (NRS) adds policies to optimize client request processing for disk ordering or fairness. Clients can optionally send bulk RPCs up to 4 MB in size. Client-side software was updated to work with Linux kernels up to version 3.6, and is still interoperable with 1.8 clients.
Lustre 2.5, released in October 2013, added Hierarchical Storage Management (HSM), a long-anticipated feature. HSM is a core requirement in enterprise environments and allows sites to implement tiered storage. This release also begins the next OpenSFS-designated Maintenance Release branch of Lustre.
A Lustre file system has three major functional units:
- One or more metadata servers (MDSes) that have one or more metadata targets (MDTs) per Lustre filesystem. The MDTs store namespace metadata, such as filenames, directories, access permissions, and file layout. The MDT data is stored in a local disk filesystem. However, unlike block-based distributed filesystems, such as GPFS and PanFS, where the metadata server controls all of the block allocation, the Lustre metadata server is only involved in pathname and permission checks, and is not involved in any file I/O operations, avoiding I/O scalability bottlenecks on the metadata server. The ability to have multiple MDTs in a single filesystem is a new feature in Lustre 2.4, and only allows directory subtrees to reside on the secondary MDTs in this version.
- One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs). Depending on the server’s hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
- Client(s) that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
The MDT, OST, and client may be on the same node (usually for testing purposes), but in typical production installations these functions are on separate nodes communicating over a network. The Lustre Network (LNET) layer can use several types of network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet and other networks, Myrinet, Quadrics, and other proprietary network technologies such as the Cray SeaStar and Gemini interconnects. Lustre will take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is normally provided by hardware RAID devices, though Lustre will work with any block device. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by the backing filesystem and return this data to the clients. Clients do not have any direct access to the underlying storage.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients the layout of the object(s) that make up each file. MDTs and OSTs currently use an enhanced version of ext4 called ldiskfs to store data. Starting in Lustre 2.4 it is also possible to use Sun/Oracle ZFS/DMU for MDT and OST back-end data storage using the open source ZFS-on-Linux port.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, either the layout of an existing file is returned to the client or a new file is created on behalf of the client. For read or write operations, the client then interprets the layout in the logical object volume (LOV) layer, which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem. After the initial lookup of the file layout, the MDS is not involved in file IO.
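The mapping of an offset and size to objects can be illustrated with a minimal sketch. It assumes a plain round-robin layout with a fixed stripe size; the function name `split_extent` and the parameters `stripe_size` and `stripe_count` are illustrative choices for this example and are not Lustre's actual interface.

```python
def split_extent(offset, length, stripe_size, stripe_count):
    """Split a file byte range into per-object pieces.

    Returns a list of (object_index, object_offset, piece_length) tuples,
    assuming stripes are laid out round-robin across `stripe_count` objects.
    """
    pieces = []
    end = offset + length
    while offset < end:
        stripe_number = offset // stripe_size        # which stripe of the file
        object_index = stripe_number % stripe_count  # which object holds it
        # Offset inside the object: the full stripes already stored there,
        # plus the position inside the current stripe.
        object_offset = (
            (stripe_number // stripe_count) * stripe_size + offset % stripe_size
        )
        # A piece never crosses a stripe boundary.
        piece_length = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((object_index, object_offset, piece_length))
        offset += piece_length
    return pieces
```

Under these assumptions, a 2 MiB read starting at 1.5 MiB in a file with 1 MiB stripes over four objects touches three objects, and each piece could be issued to its OST in parallel.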
Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and increase the risk of filesystem corruption from misbehaving/defective clients.
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the Blue Gene installation at Lawrence Livermore National Laboratory.
Another approach used in the past was the liblustre library, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allowed computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched was not a Linux client. Liblustre allowed data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing access from computational processors to the Lustre file system directly in a constrained operating environment.
Data objects and file striping
In a traditional Unix disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks. These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored. This allows the client to perform I/O in parallel across all of the OST objects in the file without further communication with the MDS.
If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is "striped" across the objects similar to RAID 0. Striping a file over multiple OST objects provides significant performance benefits if there is a need for high bandwidth access to a single large file. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (one per OST) scales the file I/O locking capacity of the file proportionately. Each file in the filesystem can have a different striping layout, so that performance and capacity can be tuned optimally for each file.
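To make the capacity point concrete, the following sketch computes how many bytes of a file end up in each object under a RAID 0 style layout. It is a simplified model, not Lustre code: the function name `object_sizes` and its parameters are hypothetical, and it assumes a fully written (non-sparse) file with a fixed stripe size.

```python
def object_sizes(file_size, stripe_size, stripe_count):
    """Return the number of bytes stored in each object of a striped file."""
    full_stripes, tail = divmod(file_size, stripe_size)
    sizes = []
    for index in range(stripe_count):
        # Every object gets an equal share of the complete rounds...
        count = full_stripes // stripe_count
        # ...plus one more stripe if it falls in the final partial round.
        if index < full_stripes % stripe_count:
            count += 1
        size = count * stripe_size
        # The trailing partial stripe lands on the next object in order.
        if tail and index == full_stripes % stripe_count:
            size += tail
        sizes.append(size)
    return sizes
```

In this model no single object holds much more than `file_size / stripe_count` bytes, which is why a striped file can grow beyond the capacity of any one target.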
Lustre has a distributed lock manager in the OpenVMS style to protect the integrity of each file's data and metadata. Access and modification of a Lustre file is completely cache coherent among all of the clients. Metadata locks are managed by the MDT that stores the inode for the file, using the 128-bit Lustre File Identifier (FID, composed of the Sequence number and Object ID) as the resource name. The metadata locks are split into multiple bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), and layout (file striping). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently they are only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
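The idea of splitting a metadata lock into independent bits can be sketched with a bit mask. The class name, flag names, and numeric values below are illustrative assumptions rather than Lustre's actual constants; the grouping of attributes under each bit follows the description above.

```python
from enum import IntFlag


class InodeLockBits(IntFlag):
    """Hypothetical lock bits for one inode (names and values are illustrative)."""
    LOOKUP = 1  # owner and group, permission and mode, ACL
    UPDATE = 2  # directory size and contents, link count, timestamps
    LAYOUT = 4  # file striping


def bits_to_request(held, wanted):
    """Return the bits a client still lacks, so they can go in a single request."""
    return wanted & ~held
```

A client that already holds the lookup bit and needs both lookup and layout would, in this model, ask only for the layout bit, and it could combine several missing bits into one request.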
File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted both overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, and/or non-overlapping write extent locks for regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file I/O. In practice, because Linux clients manage their data cache in units of pages, the clients will request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client is requesting an extent lock the OST may grant a lock for a larger extent than requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently-granted locks, whether there are conflicting write locks, and the number of outstanding lock requests. The granted lock is never smaller than the originally-requested extent. OST extent locks use the Lustre FID as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.
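The extent lock rules described here can be sketched in a few lines. This is a simplified model under stated assumptions: ranges are half-open `[start, end)` byte ranges, lock modes are the strings `"r"` and `"w"`, and the names `page_align` and `conflicts` are hypothetical. It does not model how an OST chooses to grow a granted extent.

```python
PAGE_SIZE = 4096  # page size on most clients, as stated in the text


def page_align(start, end):
    """Expand the half-open byte range [start, end) to whole pages."""
    aligned_start = (start // PAGE_SIZE) * PAGE_SIZE
    aligned_end = -(-end // PAGE_SIZE) * PAGE_SIZE  # round up to a page boundary
    return aligned_start, aligned_end


def conflicts(lock_a, lock_b):
    """Check two (mode, start, end) locks for a conflict.

    Two locks conflict only if their ranges overlap and at least one of
    them is a write lock; overlapping read locks are compatible.
    """
    mode_a, start_a, end_a = lock_a
    mode_b, start_b, end_b = lock_b
    overlap = start_a < end_b and start_b < end_a
    return overlap and "w" in (mode_a, mode_b)
```

For example, a client writing bytes 100 to 5000 would request a lock on the first two pages, and that request would conflict with another client's read lock on the first page but not with a write lock on the third page.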
In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSes and OSSes using traditional storage area network (SAN) technologies.
LNET can use many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when available on the underlying networks such as InfiniBand, Quadrics Elan, and Myrinet. High availability and recovery features enable transparent recovery in conjunction with failover servers.
LNET provides end-to-end throughput over Gigabit Ethernet networks in excess of 100 MB/s, throughput up to 3 GB/s using InfiniBand quad data rate (QDR) links, and throughput over 1 GB/s across 10-gigabit Ethernet interfaces.
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, experiencing a delay while the backup server takes over the storage.
Lustre MDSes are configured as an active/passive pair, while OSSes are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS for one filesystem is the active MDS for another file system, so no nodes are idle in the cluster.
Lustre is used by many of the TOP500 supercomputers and large multi-cluster sites. Six of the top 10 and more than 60 of the top 100 supercomputers use Lustre file systems. These include the K computer at the RIKEN Advanced Institute for Computational Science, Tianhe-1A at the National Supercomputing Center in Tianjin, China, Jaguar and Titan at Oak Ridge National Laboratory (ORNL), Blue Waters at the University of Illinois, and Sequoia and Blue Gene/L at Lawrence Livermore National Laboratory (LLNL).
There are also large Lustre filesystems at the National Energy Research Scientific Computing Center, Pacific Northwest National Laboratory, Texas Advanced Computing Center, and NASA in North America, in Asia at Tokyo Institute of Technology, in Europe at CEA, and others.
Commercial technical support
Commercial technical support for Lustre is available. In most cases, this support is bundled along with the computing system or storage hardware sold by the vendor. Vendors selling bundled computing and Lustre storage systems include Cray, Dell, Hewlett-Packard, Groupe Bull, and Silicon Graphics International. Oracle no longer sells either systems or storage that include Lustre. Vendors selling storage hardware with bundled Lustre support include Hitachi Data Systems, Data Direct Networks (DDN), Dell, NetApp, Terascala, Xyratex, and many others.
- Distributed file system
- List of file systems, the distributed parallel fault-tolerant file system section