Kernel development

Brief items

Kernel Release Status

The current development kernel is 2.6.0-test11; there has been no development kernel release since November 26. Linus continues to accumulate small, critical patches in his BitKeeper repository, but appears to be waiting for Andrew Morton to return to the scene for the preparation of the next release, be it another -test kernel or the real 2.6.0.

Andrew did release 2.6.0-test11-mm1 on December 17. The -mm tree now contains a full 300 patches, ranging from small fixes to new drivers and major subsystem work. Andrew has indicated that at least some of the patches in -mm will find their way into the mainline after 2.6.0 comes out.

The current stable kernel is 2.4.23; Marcelo has not released any 2.4.24 prepatches since 2.4.24-pre1 on December 10.

Kernel development news

Lustre 1.0 released

Linux-based clusters would appear to be the future of high-performance computing. No other approach can combine the power and flexibility of the Linux system with the economic advantages of using commercial, mass-market hardware. For many kinds of problems, a room full of racks of Linux systems is by far the most cost-effective way of obtaining high-end computing power. For other sorts of tasks, ad-hoc "grid" computing networks promise the ability to offer computing power on demand from otherwise idle systems.

Making these clusters work and scale well is more than a simple matter of plugging them all into a network switch, however. Distributing data around a cluster can be a hard task; often, data transfer, rather than computing power, is the limiting factor in system performance. Faster networking technology can help, but what is really needed is a reliable way of making tremendous amounts of data available to any node in the cluster on demand.

With the announcement of Lustre 1.0, the Linux community just got a new tool for use in the creation of high-performance clusters. Lustre is a cluster filesystem which is intended to scale to tens of thousands of nodes and more stored data than anybody would ever want to have to back up. It offers high-bandwidth file I/O throughout the cluster while suffering from no single points of failure that could bring your expensive cluster to a halt. Lustre 1.0 is licensed under the GPL, and is currently available for 2.4 kernels; a 2.6 version should be coming out before too long.

The Lustre filesystem is implemented with three high-level modules:

  • Metadata servers keep track of what files exist in the cluster, along with various attributes (such as where the files are to be found). These servers also handle file locking tasks. A cluster can have many metadata servers, and can perform load balancing between them. Large directories can be split across multiple servers, so no single server should ever become a bottleneck for the system as a whole.

    Lustre supports failover for the metadata servers, but only if the backup servers are working from shared storage.

  • Object storage targets store the actual files within a cluster. They are essentially large boxes full of bits which can be accessed via unique file ID tags. Linux systems can serve as object storage targets, using the ext3 filesystem as the underlying storage, but someday specialized OST appliance boxes may become available from the usual vendors. Object storage targets are stackable, allowing the creation of virtual targets which provide high-level volume management and RAID services.

    The object storage targets are also responsible for implementing access control and security. Once again, failover targets can be set up, as long as the underlying storage is shared.

  • The client filesystem is charged with talking to the metadata servers and object storage targets and presenting something that looks like a Unix filesystem to the host system. Typical requests will be handled by asking one or more metadata servers to look up a file of interest, followed by I/O requests to the object storage target(s) which hold the data contained by that file.
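The division of labor between the three modules can be illustrated with a toy model. What follows is a minimal sketch, not Lustre code: every class and method name is hypothetical, and the round-robin splitting of a file across targets is an assumption made only to show a file whose data lives on more than one object storage target.

```python
# Toy model of the request flow described above. Every name here is
# hypothetical and unrelated to Lustre's real interfaces.

class MetadataServer:
    """Knows which files exist and where their objects are to be found."""
    def __init__(self):
        self.layouts = {}          # path -> list of (ost_index, object_id)

    def lookup(self, path):
        return self.layouts[path]  # raises KeyError for unknown files


class ObjectStorageTarget:
    """A box full of bits, addressed by object ID."""
    def __init__(self):
        self.objects = {}          # object_id -> bytes

    def write(self, object_id, data):
        self.objects[object_id] = data

    def read(self, object_id):
        return self.objects[object_id]


class Client:
    """Presents a file-like view built from the two server types."""
    def __init__(self, mds, osts):
        self.mds = mds
        self.osts = osts
        self.next_id = 0           # source of unique object IDs

    def write_file(self, path, data, chunk=4):
        # Assumption: spread fixed-size chunks over the targets in turn.
        layout = []
        for offset in range(0, len(data), chunk):
            ost_index = (offset // chunk) % len(self.osts)
            object_id = self.next_id
            self.next_id += 1
            self.osts[ost_index].write(object_id, data[offset:offset + chunk])
            layout.append((ost_index, object_id))
        self.mds.layouts[path] = layout

    def read_file(self, path):
        # Step 1: ask the metadata server where the file lives.
        layout = self.mds.lookup(path)
        # Step 2: fetch the data from the object storage target(s).
        return b"".join(self.osts[i].read(oid) for i, oid in layout)


mds = MetadataServer()
osts = [ObjectStorageTarget(), ObjectStorageTarget()]
client = Client(mds, osts)
client.write_file("/data/results", b"hello cluster")
print(client.read_file("/data/results"))   # b'hello cluster'
```

Note that the metadata server never touches file contents in this model; it only hands out locations, which is what lets the bulk I/O go directly between clients and storage targets.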

A key part of the Lustre design is failure recovery. Each component keeps a log of actions that it has committed - or attempted to commit. If a server (metadata or object storage) falls off the net, the other nodes which were working with that server remember the operations which were not known to be complete. When the server comes back up, it implements a "recovery period" where other nodes can reestablish locks, replay operations, and so on, so that it can return to a state which is consistent with the rest of the cluster. New requests will be accepted only after the recovery period is complete.
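The recovery scheme can be sketched in a few lines as well. This is a simplified illustration under stated assumptions, not Lustre's actual protocol: the transaction numbers, the class names, and the `replay` flag are all invented for the example. It shows a client remembering operations not known to be complete, a restarted server refusing new requests during its recovery period, and replay bringing the server back in line.

```python
# Simplified sketch of log-and-replay recovery. All names and the use of
# integer transaction numbers are assumptions made for illustration.

class Server:
    def __init__(self):
        self.state = {}            # stands in for durable storage
        self.last_committed = 0    # highest transaction number committed
        self.recovering = False

    def apply(self, txn, key, value, replay=False):
        if self.recovering and not replay:
            raise RuntimeError("in recovery: new requests refused")
        if txn <= self.last_committed:
            return                 # already committed; replay is a no-op
        self.state[key] = value
        self.last_committed = txn

    def restart(self):
        self.recovering = True     # recovery period begins

    def end_recovery(self):
        self.recovering = False    # new requests accepted again


class Client:
    def __init__(self, server):
        self.server = server
        self.pending = []          # operations not known to be complete

    def send(self, txn, key, value, delivered=True):
        self.pending.append((txn, key, value))
        if delivered:
            self.server.apply(txn, key, value)

    def replay(self):
        # Resend remembered operations in transaction order.
        for txn, key, value in sorted(self.pending):
            self.server.apply(txn, key, value, replay=True)
        self.pending.clear()


server = Server()
client = Client(server)
client.send(1, "a", "x")                    # reaches the server
client.send(2, "b", "y", delivered=False)   # lost: server fell off the net

server.restart()
try:
    server.apply(3, "c", "z")               # a new request during recovery
    refused = False
except RuntimeError:
    refused = True

client.replay()                             # txn 1 is skipped, txn 2 applied
server.end_recovery()
server.apply(3, "c", "z")                   # now accepted
print(refused, server.state)
```

A real implementation also has to deal with locks, timeouts for clients that never return, and several clients replaying at once; the sketch leaves all of that out.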

Lustre uses the Sandia Portals system to handle communications between the nodes. A full Lustre deployment will also likely involve LDAP and/or Kerberos servers to handle authentication tasks.

The 1.0 release may have just happened, but Lustre has been handling real loads for some time. According to a press release from Cluster File Systems, four of the top five Linux supercomputers are running Lustre. The press release also claims that a Lustre deployment achieved a sustained throughput of 11.1 GB/second, which is rather better than most of us can get with NFS.

The 2.6 version of Lustre has not yet been released, but should be available soon. Apparently there have already been talks with Linus about getting Lustre merged into the 2.6 kernel. Before too long, that shrink-wrapped Linux box in the local computer store may come with a high-end cluster filesystem included.

Linux for little systems

Matt Mackall has picked up a new project: making the 2.6 kernel work on very small systems. This is, he says, "an area Linux mainstream has been moving away from since Linus got a real job." To this end, he has released a tree called 2.6.0-test11-tiny which incorporates a large set of patches aimed at slimming down the kernel. It's worth a look as an expression of just what needs to be done if you want to run Linux on small systems.

So what's required? The -tiny patch includes, among others, the following:

  • Building the kernel with the -Os compiler option, which instructs gcc to optimize for size. This option results in a smaller kernel; interestingly, there have also been reports that -Os yields better performance on large systems, since the resulting executable has better cache behavior.

  • The 4k kernel stack patch cuts the runtime per-process memory use significantly.

  • Various patches shrink the size of internal data structures to their minimum values. Target structures include the block and char device names hash tables, the maximum number of swapfiles, the maximum number of processes, the futex hash table, CRC lookup tables, and many others.

  • For truly daring users, the -tiny kernel has an option to remove printk() from the kernel entirely, along with its associated buffers and most of the strings passed to printk(). The space savings will be considerable; you just have to hope that the kernel has nothing important to tell you. Strings for BUG() and panic() calls can also be removed.

  • Various subsystems which are not normally optional become so. With the -tiny kernel, it is possible to configure out sysfs (which can take a lot of run-time memory), asynchronous I/O, /proc/kcore, ethtool support, core dump support, etc.

  • Inline functions are heavily used in the kernel; they can improve performance, and, in some situations, the use of inline code is mandatory. Excessive use of inline functions can bloat the size of the kernel considerably, however. The -tiny kernel includes a patch which makes the compiler complain about the use of inline functions, allowing a size-conscious developer to find which ones are invoked most often.
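To put the 4k stack item from the list above into numbers, here is a back-of-the-envelope calculation. It assumes the usual x86 default of an 8KB kernel stack per task, and the count of 50 tasks is purely hypothetical; neither figure comes from Matt's announcement.

```python
# Rough estimate of memory saved by 4k kernel stacks.
# Assumptions: 8KB default stack (x86), 50 tasks, a 4MB machine.

OLD_STACK = 8 * 1024
NEW_STACK = 4 * 1024
TOTAL_RAM = 4 * 1024 * 1024


def stack_savings(tasks, old=OLD_STACK, new=NEW_STACK):
    """Bytes saved when every task's kernel stack shrinks from old to new."""
    return tasks * (old - new)


saved = stack_savings(50)
print(saved // 1024, "KB saved")
print(round(100 * saved / TOTAL_RAM, 1), "percent of a 4MB system")
```

The savings scale linearly with the number of tasks, which is why the change matters more on a small box running many processes than the per-task figure alone might suggest.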

There are almost 80 separate patches in all. Matt claims that his kernel, when configured with a full networking stack, fits "comfortably" on a 4MB box, which is, indeed, considered small these days. Matt has some ambitious future plans, including cutting functionality out of the console subsystem and (an idea that is sure to raise some eyebrows) making parts of the kernel pageable. It remains to be seen whether things will get that far, but there is no doubt that making Linux work on small systems is a worthy goal.

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds