
Lustre 1.0 released

Linux-based clusters would appear to be the future of high-performance computing. No other approach can combine the power and flexibility of the Linux system with the economic advantages of using commercial, mass-market hardware. For many kinds of problems, a room full of racks of Linux systems is by far the most cost-effective way of obtaining high-end computing power. For other sorts of tasks, ad-hoc "grid" computing networks promise the ability to offer computing power on demand from otherwise idle systems.

Making these clusters work and scale well is more than a simple matter of plugging them all into a network switch, however. Distributing data around a cluster can be a hard task; often, data transfer, rather than computing power, is the limiting factor in system performance. Faster networking technology can help, but what is really needed is a reliable way of making tremendous amounts of data available to any node in the cluster on demand.

With the announcement of Lustre 1.0, the Linux community just got a new tool for use in the creation of high-performance clusters. Lustre is a cluster filesystem which is intended to scale to tens of thousands of nodes and more stored data than anybody would ever want to have to back up. It offers high-bandwidth file I/O throughout the cluster, with no single point of failure that could bring your expensive cluster to a halt. Lustre 1.0 is licensed under the GPL, and is currently available for 2.4 kernels; a 2.6 version should be coming out before too long.

The Lustre filesystem is implemented with three high-level modules (a sketch of how they cooperate follows the list):

  • Metadata servers keep track of what files exist in the cluster, along with various attributes (such as where the files are to be found). These servers also handle file locking tasks. A cluster can have many metadata servers, and can perform load balancing between them. Large directories can be split across multiple servers, so no single server should ever become a bottleneck for the system as a whole.

    Lustre supports failover for the metadata servers, but only if the backup servers are working from shared storage.

  • Object storage targets store the actual files within a cluster. They are essentially large boxes full of bits which can be accessed via unique file ID tags. Linux systems can serve as object storage targets, using the ext3 filesystem as the underlying storage, but someday specialized OST appliance boxes may become available from the usual vendors. Object storage targets are stackable, allowing the creation of virtual targets which provide high-level volume management and RAID services.

    The object storage targets are also responsible for implementing access control and security. Once again, failover targets can be set up, as long as the underlying storage is shared.

  • The client filesystem is charged with talking to the metadata servers and object storage targets and presenting something that looks like a Unix filesystem to the host system. Typical requests will be handled by asking one or more metadata servers to look up a file of interest, followed by I/O requests to the object storage target(s) which hold the data contained by that file.
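
To make the division of labor concrete, here is a minimal, purely illustrative sketch in Python. The class names, the lookup()/read() methods, and the round-robin striping scheme are all invented for exposition; they are not Lustre's actual interfaces. The point is simply that a client does one metadata lookup, then moves file data directly to and from the object storage targets.

    # Illustrative only: invented classes, not Lustre's real interfaces.

    class MetadataServer:
        """Maps path names to file layout: an object ID plus the OST
        holding each stripe. No file data passes through this server."""
        def __init__(self):
            self.namespace = {}      # path -> (object_id, [ost index per stripe])

        def lookup(self, path):
            return self.namespace[path]

    class ObjectStorageTarget:
        """A 'large box full of bits', addressed by object ID and stripe."""
        def __init__(self):
            self.objects = {}        # (object_id, stripe number) -> bytes

        def read(self, object_id, stripe):
            return self.objects[(object_id, stripe)]

        def write(self, object_id, stripe, data):
            self.objects[(object_id, stripe)] = data

    def client_read(mds, osts, path):
        """One metadata lookup, then I/O directly against the OSTs."""
        object_id, layout = mds.lookup(path)
        return b"".join(osts[ost].read(object_id, stripe)
                        for stripe, ost in enumerate(layout))

    # Stripe a small file round-robin across three OSTs, then read it back.
    mds, osts = MetadataServer(), [ObjectStorageTarget() for _ in range(3)]
    data, size, oid = b"hello, cluster filesystem!", 8, 42
    layout = []
    for stripe, off in enumerate(range(0, len(data), size)):
        ost = stripe % len(osts)
        osts[ost].write(oid, stripe, data[off:off + size])
        layout.append(ost)
    mds.namespace["/scratch/result.dat"] = (oid, layout)
    assert client_read(mds, osts, "/scratch/result.dat") == data

Because the metadata server is consulted only for the layout, the bulk data path scales with the number of object storage targets rather than being funneled through a central server.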

A key part of the Lustre design is failure recovery. Each component keeps a log of actions that it has committed - or attempted to commit. If a server (metadata or object storage) falls off the net, the other nodes which were working with that server remember the operations which were not known to be complete. When the server comes back up, it implements a "recovery period" where other nodes can reestablish locks, replay operations, and so on, so that it can return to a state which is consistent with the rest of the cluster. New requests will be accepted only after the recovery period is complete.
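
As a rough sketch of that protocol (all names here are invented, not taken from the Lustre code): clients log each operation before sending it, and a restarted server refuses ordinary requests until every previously connected client has replayed its outstanding operations.

    # Illustrative only: a toy version of the recovery period described above.

    class Server:
        def __init__(self):
            self.state = []              # committed operations
            self.recovering = set()      # clients still expected to replay

        def restart(self, known_clients):
            # After a crash, open a recovery period: only replays accepted.
            self.recovering = set(known_clients)

        def replay(self, client_id, pending_ops):
            self.state.extend(pending_ops)        # re-apply what was lost
            self.recovering.discard(client_id)    # this client is consistent
            # When the set empties, the recovery period is over.

        def request(self, client_id, op):
            if self.recovering:
                raise RuntimeError("in recovery; new requests not yet accepted")
            self.state.append(op)

    class Client:
        def __init__(self, client_id):
            self.client_id, self.pending = client_id, []

        def send(self, server, op):
            self.pending.append(op)      # logged until known to be committed
            server.request(self.client_id, op)

    # A client's pending log lets it replay after the server restarts.
    s, c = Server(), Client("node17")
    c.send(s, "mkdir /scratch")
    s.restart(["node17"])                # crash and come back
    s.replay("node17", c.pending)        # recovery period ends here
    c.send(s, "create /scratch/a")       # new requests accepted again

Real Lustre recovery also covers lock reestablishment and distinguishes committed from merely attempted operations; this sketch glosses over both.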

Lustre uses the Sandia Portals system to handle communications between the nodes. A full Lustre deployment will also likely involve LDAP and/or Kerberos servers to handle authentication tasks.

The 1.0 release may have just happened, but Lustre has been handling real loads for some time. According to a press release from Cluster File Systems, four of the top five Linux supercomputers are running Lustre. The press release also claims that a Lustre deployment achieved a sustained throughput of 11.1 GB/second, which is rather better than most of us can get with NFS.

The 2.6 version of Lustre has not yet been released, but should be available soon. Apparently there have already been talks with Linus about getting Lustre merged into the 2.6 kernel. Before too long, that shrink-wrapped Linux box in the local computer store may come with a high-end cluster filesystem included.



Only one active MDS is possible (for now)

Posted Dec 18, 2003 5:48 UTC (Thu) by snitm (guest, #4031)

Lustre 1.0 only supports one active metadata server; support for multiple active MDSs is on the roadmap.

Fault tolerance

Posted Dec 18, 2003 7:20 UTC (Thu) by rmacaulay (guest, #7971)

Can a piece of data that would be stored on an object storage target be stored on multiple targets, to enable fault tolerance on the storage nodes without the need for shared storage?

Fault tolerance

Posted Dec 18, 2003 15:19 UTC (Thu) by snitm (guest, #4031)

Client-level RAID1 or RAID5 isn't possible yet; however, CFS has been working on the infrastructure for adding such support to Lustre. From a Lustre CVS commit message:

Add new LOV EA format which gives us fields for features needed in the post
1.0 stage (e.g. RAID, OST migration/replacement) as well as more efficient
storage than the current layout when there are few stripes on lots of OSTs.

Lustre/Intermezzo/P2P

Posted Dec 18, 2003 14:23 UTC (Thu) by brugolsky (guest, #28)

Lustre is perhaps a bit too tightly coupled for home use, and especially for mobile wireless use. Additionally, it is designed for configurations where there is a clear distinction between clients and servers.

The typical home network consists of a bunch of machines with an OS install that is an ever-shrinking fraction of the available disk space. It is not uncommon for a home with a few machines to have a few tens to hundreds of gigabytes of disk space scattered across several machines. It would be nice to unify that under one namespace.

Laptop/PDA users want the aggressive caching and disconnection/reintegration of Intermezzo, but in a more distributed P2P fashion, with replication. Something like Ivy (http://www.pdos.lcs.mit.edu/ivy/) is a bit closer to meeting those needs.

Happily, Peter Braam and Andreas Dilger have suggested that once Lustre has met all of its requirements, they are inclined to revisit Intermezzo and apply some of the lessons learned in constructing Lustre.

Lustre/Intermezzo/P2P

Posted Dec 19, 2003 17:12 UTC (Fri) by giraffedata (guest, #1954)

It is not uncommon for a home with a few machines to have a few tens to hundreds of gigabytes of disk space scattered across several machines. It would be nice to unify that under one namespace.

Agreed, but that has little to do with Lustre. You're alluding to a distributed filesystem, which is fundamentally different from what Lustre is: a shared filesystem. The Lustre approach to unifying the storage is to remove the storage from all those systems and stick it into a central filesystem to which all the systems have access. It's specially designed to have that logically centralized system be physically spread out enough that it isn't a bottleneck even for huge numbers of storage users.

The separate storage approach is what makes Lustre impractical for small applications -- who wants to pay for a bunch of dedicated metadata and object storage servers just to serve a few home computers?

I don't doubt that features could theoretically be added to Lustre to allow it to be a distributed filesystem and/or be practical in homes and mobile situations, but those features would have little to do with Lustre technology as it is known today.

