The current development kernel is 2.6.0-test11; there has been no
development kernel release since November 26. Linus continues to
accumulate small, critical patches in his BitKeeper repository, but appears
to be waiting for Andrew Morton to return to the scene for the preparation
of the next release, be it another -test kernel or the real 2.6.0.
Andrew did release 2.6.0-test11-mm1 on
December 17. The -mm tree now contains a full 300 patches, ranging
from small fixes to new drivers and major subsystem work. Andrew has
indicated that at least some of the patches in -mm will find their way into
the mainline after 2.6.0 comes out.
The current stable kernel is 2.4.23; Marcelo has not released any
2.4.24 prepatches since 2.4.24-pre1.
Kernel development news
Linux-based clusters would appear to be the future of high-performance
computing. No other approach can combine the power and flexibility of the
Linux system with the economic advantages of using commercial, mass-market
hardware. For many kinds of problems, a room full of racks of Linux
systems is by far the most cost-effective way of obtaining high-end
computing power. For other sorts of tasks, ad-hoc "grid" computing networks
promise the ability to offer computing power on demand from otherwise
idle systems.
Making these clusters work and scale well is more than a simple matter of
plugging them all into a network switch, however. Distributing data around
a cluster can be a hard task; often, data transfer, rather than computing
power, is the limiting factor
in system performance. Faster networking
technology can help, but what is really needed is a reliable way of making
tremendous amounts of data available to any node in the cluster on demand.
With the announcement of Lustre 1.0,
the Linux community just got a new tool for use in the creation of
high-performance clusters. Lustre is a cluster filesystem which is
intended to scale to tens of thousands of nodes and more stored data than
anybody would ever want to have to back up. It offers high-bandwidth file
I/O throughout the cluster while suffering from no single points of failure
that could bring your expensive cluster to a halt. Lustre 1.0 is
licensed under the GPL, and is currently available for 2.4 kernels; a 2.6
version should be coming out before too long.
The Lustre filesystem is implemented with three high-level modules:
- Metadata servers keep track of what files exist in the cluster,
along with various attributes (such as where the files are to be
found). These servers also handle file locking tasks. A cluster can
have many metadata servers, and can perform load balancing between
them. Large directories can be split across multiple servers, so no
single server should ever become a bottleneck for the system as a whole.
Lustre supports failover for the metadata servers, but only if the
backup servers are working from shared storage.
- Object storage targets store the actual files within a
cluster. They are essentially large boxes full of bits which can be
accessed via unique file ID tags. Linux systems can serve as object
storage targets, using the ext3 filesystem as the underlying storage,
but someday specialized OST appliance boxes may become available from
the usual vendors. Object storage targets are stackable, allowing the
creation of virtual targets which provide high-level volume management
and RAID services.
The object storage targets are also responsible for implementing
access control and security. Once again, failover targets can be set
up, as long as the underlying storage is shared.
- The client filesystem is charged with talking to the metadata
servers and object storage targets and presenting something that looks
like a Unix filesystem to the host system. Typical requests will be
handled by asking one or more metadata servers to look up a file of
interest, followed by I/O requests to the object storage target(s)
which hold the data contained by that file; this flow is sketched
below.
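The request flow just described can be illustrated with a minimal,
self-contained C sketch. All of the names here (struct file_layout,
mds_lookup(), ost_read()) are hypothetical stand-ins rather than the
real Lustre client interfaces, and the "RPCs" are faked locally; the
sketch only shows the shape of the lookup-then-I/O sequence.

    /*
     * Conceptual sketch only: ask a (fake) metadata server where a
     * file's objects live, then read the data from the (fake) object
     * storage targets which hold them.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAX_OBJECTS 4

    /* What a metadata server lookup returns: which object storage
     * targets hold which objects for the file. */
    struct file_layout {
        int  nobjects;
        int  ost_id[MAX_OBJECTS];      /* which object storage target */
        long object_id[MAX_OBJECTS];   /* unique file ID tag on that OST */
    };

    /* Stand-in for a lookup request to a metadata server. */
    static int mds_lookup(const char *path, struct file_layout *layout)
    {
        /* A real client would pick a server, take the needed locks,
         * and fill the layout from the reply; here the file is simply
         * pretended to be striped over two targets. */
        printf("looking up %s on a metadata server\n", path);
        layout->nobjects = 2;
        layout->ost_id[0] = 0;  layout->object_id[0] = 1001;
        layout->ost_id[1] = 1;  layout->object_id[1] = 1002;
        return 0;
    }

    /* Stand-in for an I/O request to one object storage target. */
    static long ost_read(int ost, long object, char *buf, long len)
    {
        snprintf(buf, len, "[data from object %ld on OST %d]", object, ost);
        return strlen(buf);
    }

    int main(void)
    {
        struct file_layout layout;
        char buf[64];
        int i;

        /* Step 1: the metadata lookup tells us where the data lives. */
        if (mds_lookup("/lustre/results.dat", &layout))
            return 1;

        /* Step 2: read each object from the target which holds it. */
        for (i = 0; i < layout.nobjects; i++) {
            ost_read(layout.ost_id[i], layout.object_id[i],
                     buf, sizeof(buf));
            printf("%s\n", buf);
        }
        return 0;
    }

The point of the two-step structure is that bulk data moves directly
between clients and the object storage targets which hold it, rather
than being funneled through any single server.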
A key part of the Lustre design is failure recovery. Each component keeps
a log of actions that it has committed - or attempted to commit. If a
server (metadata or object storage) falls off the net, the other nodes
which were working with that server remember the operations which were not
known to be complete. When the server comes back up, it implements a
"recovery period" where other nodes can reestablish locks, replay
operations, and so on, so that it can return to a state which is consistent
with the rest of the cluster. New requests will be accepted only after the
recovery period is complete.
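A minimal sketch of that idea, assuming a simple in-memory operation
log, might look like the following; none of these names correspond to
actual Lustre code, and a real client would persist the log and
coordinate replay with the server's recovery window.

    /*
     * Conceptual sketch: operations are logged when sent, marked off
     * when the server confirms them, and any that were never confirmed
     * get replayed during the server's recovery period.
     */
    #include <stdio.h>

    #define LOG_SIZE 8

    struct op_log {
        int ops[LOG_SIZE];        /* operation IDs sent to the server */
        int committed[LOG_SIZE];  /* set once the server confirms one */
        int count;
    };

    static void log_op(struct op_log *log, int op)
    {
        log->ops[log->count] = op;
        log->committed[log->count] = 0;
        log->count++;
    }

    static void server_committed(struct op_log *log, int op)
    {
        int i;

        for (i = 0; i < log->count; i++)
            if (log->ops[i] == op)
                log->committed[i] = 1;
    }

    /* During the recovery period, replay everything the server never
     * confirmed; only then does normal service resume. */
    static void replay_uncommitted(struct op_log *log)
    {
        int i;

        for (i = 0; i < log->count; i++)
            if (!log->committed[i])
                printf("replaying operation %d\n", log->ops[i]);
    }

    int main(void)
    {
        struct op_log log = { .count = 0 };

        log_op(&log, 1);            /* e.g. create a file */
        log_op(&log, 2);            /* e.g. write some blocks */
        server_committed(&log, 1);  /* op 1 confirmed, then server dies */

        replay_uncommitted(&log);   /* op 2 is replayed on recovery */
        return 0;
    }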
Lustre uses the Sandia
Portals system to handle communications between the nodes. A full
Lustre deployment will also likely involve LDAP and/or Kerberos servers to
handle authentication tasks.
The 1.0 release may have just happened, but Lustre has been handling real
loads for some time. According to this press
release from Cluster File Systems, four of the top five Linux
supercomputers are running Lustre. The press release also claims that a
Lustre deployment achieved a sustained throughput of 11.1 GB/second,
which is rather better than most of us can get with NFS.
The 2.6 version of Lustre has not yet been released, but should be
available soon. Apparently there have already been talks with Linus about
getting Lustre merged into the 2.6 kernel. Before too long, that
shrink-wrapped Linux box in the local computer store may come with a
high-end cluster filesystem included.
Matt Mackall has picked up a new project: making the 2.6 kernel work on
very small systems. This is, he says, "an area Linux mainstream has
been moving away from since Linus got a real job." To this end, he
has released a tree called 2.6.0-test11-tiny
which incorporates a large set of patches aimed at slimming down the
kernel. It's worth a look as an expression of just what needs to be done
if you want to run Linux on small systems.
So what's required? The -tiny patch includes, among others, the following:
- Building the kernel with the -Os compiler option, which
instructs gcc to optimize for size. This option results in a smaller
kernel; interestingly, there have also been reports that -Os
yields better performance on large systems as well, since the
resulting executable has better cache behavior.
- The 4k kernel stack patch cuts per-process kernel memory use by
halving the size of each process's kernel stack from 8KB to 4KB.
- Various patches shrink the size of internal data structures to their
minimum values. Targets include the block and char device name hash
tables, the maximum number of swapfiles, the maximum number of
processes, the futex hash table, CRC lookup tables, and many others.
- For truly daring users, the -tiny kernel has an option to remove
printk() from the kernel entirely, along with its associated
buffers and most of the strings passed to printk(). The
space savings will be considerable; you just have to hope that the
kernel has nothing important to tell you. Strings for BUG()
and panic() calls can also be removed; a sketch of the general
compile-out technique appears after this list.
- Various subsystems which are not normally optional become so. With
the -tiny kernel, it is possible to configure out sysfs (which can
take a lot of run-time memory), asynchronous I/O,
/proc/kcore, ethtool support, core dump support, etc.
- Inline functions are heavily used in the kernel; they can improve
performance, and, in some situations, the use of inline code is
mandatory. Excessive use of inline functions can bloat the size of
the kernel considerably, however. The -tiny kernel includes a patch
which makes the compiler complain about the use of inline functions,
allowing a size-conscious developer to find which ones are invoked and
decide whether the speed benefit justifies the added space.
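As mentioned in the printk() item above, the general compile-out
technique looks roughly like the following. The CONFIG_TINY_NO_PRINTK
symbol is invented for this example (it is not necessarily the option
name used in the -tiny tree); the point is that a no-op macro makes
both the calls and the format strings they reference vanish from the
compiled output.

    /*
     * Sketch of compiling logging out entirely: with the (hypothetical)
     * CONFIG_TINY_NO_PRINTK option set, printk() becomes a no-op macro,
     * so calls and their string literals never reach the object code.
     */
    #include <stdarg.h>
    #include <stdio.h>

    #ifdef CONFIG_TINY_NO_PRINTK
    /* Arguments are swallowed at compile time; the unused string
     * literals are dropped, which is where much of the saving comes
     * from. */
    #define printk(fmt, ...) do { } while (0)
    #else
    /* Stand-in for the real printk(), which logs to the kernel buffer. */
    static int printk(const char *fmt, ...)
    {
        va_list args;
        int ret;

        va_start(args, fmt);
        ret = vprintf(fmt, args);
        va_end(args);
        return ret;
    }
    #endif

    int main(void)
    {
        /* When CONFIG_TINY_NO_PRINTK is defined, this line (and its
         * string) simply disappears from the compiled output. */
        printk("device %d initialized\n", 3);
        return 0;
    }

Building the sketch with and without -DCONFIG_TINY_NO_PRINTK shows the
difference in the size of the resulting object file.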
There are almost 80 separate patches in all. Matt claims that his kernel,
when configured with a full networking stack, fits "comfortably" on a 4MB
box, which is, indeed, considered small these days. Matt has some
ambitious future plans, including cutting functionality out of the console
subsystem and (an idea that is sure to raise some eyebrows) making parts of
the kernel be pageable. It remains to be seen whether things will get that
far, but there is no doubt that making Linux work on small systems is a
worthwhile goal.
Page editor: Jonathan Corbet