Lustre 1.0 released
[Posted December 17, 2003 by corbet]
Linux-based clusters would appear to be the future of high-performance
computing. No other approach can combine the power and flexibility of the
Linux system with the economic advantages of using commercial, mass-market
hardware. For many kinds of problems, a room full of racks of Linux
systems is by far the most cost-effective way of obtaining high-end
computing power. For other sorts of tasks, ad-hoc "grid" computing networks
promise the ability to offer computing power on demand from otherwise idle
systems.
Making these clusters work and scale well is more than a simple matter of
plugging them all into a network switch, however. Distributing data around
a cluster can be a hard task; often, data transfer, rather than computing
power, is the limiting factor in system performance. Faster networking
technology can help, but what is really needed is a reliable way of making
tremendous amounts of data available to any node in the cluster on demand.
With the announcement of Lustre 1.0,
the Linux community just got a new tool for use in the creation of
high-performance clusters. Lustre is a cluster filesystem which is
intended to scale to tens of thousands of nodes and more stored data than
anybody would ever want to have to back up. It offers high-bandwidth file
I/O throughout the cluster while suffering from no single points of failure
that could bring your expensive cluster to a halt. Lustre 1.0 is
licensed under the GPL, and is currently available for 2.4 kernels; a 2.6
version should be coming out before too long.
The Lustre filesystem is implemented with three high-level modules:
- Metadata servers keep track of what files exist in the cluster,
along with various attributes (such as where the files are to be
found). These servers also handle file locking tasks. A cluster can
have many metadata servers, and can perform load balancing between
them. Large directories can be split across multiple servers, so no
single server should ever become a bottleneck for the system as a
whole.
Lustre supports failover for the metadata servers, but only if the
backup servers are working from shared storage.
- Object storage targets store the actual files within a
cluster. They are essentially large boxes full of bits which can be
accessed via unique file ID tags. Linux systems can serve as object
storage targets, using the ext3 filesystem as the underlying storage,
but someday specialized OST appliance boxes may become available from
the usual vendors. Object storage targets are stackable, allowing the
creation of virtual targets which provide high-level volume management
and RAID services.
The object storage targets are also responsible for implementing
access control and security. Once again, failover targets can be set
up, as long as the underlying storage is shared.
- The client filesystem is charged with talking to the metadata
servers and object storage targets and presenting something that looks
like a Unix filesystem to the host system. Typical requests will be
handled by asking one or more metadata servers to look up a file of
interest, followed by I/O requests to the object storage target(s)
which hold the data contained by that file.
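The division of labor among the three modules can be illustrated with a toy model. The sketch below is not Lustre code and does not use any Lustre API; every class and method name is hypothetical, and the round-robin placement of fixed-size chunks is an assumption made only to show a file spread over more than one target. It shows the two-step pattern described above: ask the metadata server where a file lives, then go to the object storage targets for the data.

```python
# Toy model of the lookup-then-I/O flow. All names are hypothetical;
# nothing here is part of Lustre itself.
from dataclasses import dataclass, field


@dataclass
class ObjectStorageTarget:
    """A 'box full of bits': opaque data addressed by object ID."""
    objects: dict = field(default_factory=dict)

    def write(self, object_id, data):
        self.objects[object_id] = data

    def read(self, object_id):
        return self.objects[object_id]


@dataclass
class MetadataServer:
    """Maps a path to the (target index, object ID) pairs holding its data."""
    layouts: dict = field(default_factory=dict)

    def lookup(self, path):
        return self.layouts[path]


class Client:
    """Presents whole-file reads and writes on top of the two server types."""

    def __init__(self, mds, targets):
        self.mds = mds
        self.targets = targets
        self.next_id = 0  # simplification: the client hands out object IDs

    def write_file(self, path, data, chunk_size=4):
        layout = []
        for offset in range(0, len(data), chunk_size):
            # Assumed placement policy: round-robin across the targets.
            target_index = (offset // chunk_size) % len(self.targets)
            object_id = self.next_id
            self.next_id += 1
            chunk = data[offset:offset + chunk_size]
            self.targets[target_index].write(object_id, chunk)
            layout.append((target_index, object_id))
        # Only the layout goes to the metadata server, never the file data.
        self.mds.layouts[path] = layout

    def read_file(self, path):
        # Step 1: one metadata lookup. Step 2: I/O directly to the targets.
        layout = self.mds.lookup(path)
        return b"".join(self.targets[t].read(o) for t, o in layout)
```

With two targets, writing an eleven-byte file in four-byte chunks leaves two objects on the first target and one on the second, while the metadata server holds only the three-entry layout. The point of the split is that bulk data never passes through the metadata server.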
A key part of the Lustre design is failure recovery. Each component keeps
a log of actions that it has committed - or attempted to commit. If a
server (metadata or object storage) falls off the net, the other nodes
which were working with that server remember the operations which were not
known to be complete. When the server comes back up, it implements a
"recovery period" where other nodes can reestablish locks, replay
operations, and so on, so that it can return to a state which is consistent
with the rest of the cluster. New requests will be accepted only after the
recovery period is complete.
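The recovery scheme can be sketched in a few lines. This is a simplified model under stated assumptions, not the Lustre protocol: the class and method names are invented for illustration, the "log" is a plain in-memory list on the client side, and lock reestablishment is left out. It shows a client remembering operations not known to be complete, a restarted server refusing new requests during its recovery period, and the client replaying its unconfirmed operations.

```python
# Toy model of replay-based recovery. All names are hypothetical.
class Server:
    def __init__(self):
        self.committed = {}      # durable state; survives a crash
        self.pending = {}        # in-memory state; lost on a crash
        self.recovering = False

    def apply(self, key, value, replay=False):
        # During the recovery period only replayed operations are accepted.
        if self.recovering and not replay:
            raise RuntimeError("server is in recovery; retry later")
        self.pending[key] = value

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def crash_and_restart(self):
        self.pending.clear()     # uncommitted work is gone
        self.recovering = True   # open the recovery period

    def end_recovery(self):
        self.recovering = False


class ClientNode:
    def __init__(self, server):
        self.server = server
        self.unconfirmed = []    # operations not yet known to be committed

    def write(self, key, value):
        self.server.apply(key, value)
        self.unconfirmed.append((key, value))

    def on_commit(self):
        # The server has confirmed a commit, so the log can be dropped.
        self.unconfirmed.clear()

    def replay(self):
        for key, value in self.unconfirmed:
            self.server.apply(key, value, replay=True)


server = Server()
node = ClientNode(server)
node.write("a", 1)
server.commit()
node.on_commit()
node.write("b", 2)           # applied, but never committed
server.crash_and_restart()   # "b" is lost on the server side
node.replay()                # the client restores it from its own log
server.end_recovery()
server.commit()
```

After the final commit, the server's durable state contains both "a" and "b", even though "b" was lost in the crash. A new write attempted between the restart and the end of the recovery period would have been refused.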
Lustre uses the Sandia Portals system to handle communications between
the nodes. A full
Lustre deployment will also likely involve LDAP and/or Kerberos servers to
handle authentication tasks.
The 1.0 release may have just happened, but Lustre has been handling real
loads for some time. According to this press release from Cluster File
Systems, four of the top five Linux
supercomputers are running Lustre. The press release also claims that a
Lustre deployment achieved a sustained throughput of 11.1 GB/second,
which is rather better than most of us can get with NFS.
The 2.6 version of Lustre has not yet been released, but should be
available soon. Apparently there have already been talks with Linus about
getting Lustre merged into the 2.6 kernel. Before too long, that
shrink-wrapped Linux box in the local computer store may come with a
high-end cluster filesystem included.