The current development kernel is 2.5.45, released by Linus on October 30,
just in time to make your editor go back and rewrite this section.
Linus has been busy, having merged over 500 patches since returning from
the cruise. The most significant changes include another set of block layer
fixes, an ia-64 update, many fixes from the -ac series, the device mapper
(LVM2) code, the new cryptographic API (see below), the beginnings of an
IPSec implementation, an ISDN update, Roman Zippel's new kernel
configuration system, the sys_epoll
patch (see below), much device
model work, and many other fixes and updates. The long-format changelog
is longer than usual, and
has all the details.
Many open issues still need to be resolved before the
feature freeze. For varying perspectives on what remains to be merged, see
Guillaume Boissiere's 2.5 status summary for
October 30, Rob Landley's merge candidate
list, or Rusty Russell's Remarkably
Unreliable 2.6 list.
For a view of what's in the kernel now, see Dave Jones's post-Halloween document which serves as a sort
of preliminary release notes for people interested in testing the new
kernel.
The current stable kernel is still 2.4.19, but the next stable
release got a little
closer with the announcement of the first
2.4.20 release candidate on October 29.
Kernel development news
One of the first things Linus merged once he got back home was the brand
new cryptographic API written by James Morris, David S. Miller, and
Jean-François Dive, with ideas and code taken from many other places. This
patch is interesting for a couple of reasons: it is a brand-new, previously
unseen kernel crypto implementation, and it is the first time that serious
cryptographic code has been included in the mainline kernel tree. With
luck, worldwide crypto regulations will remain sane enough that this code
can stay there.
This API's purpose is to provide fast, general-purpose cryptographic
operations for the rest of the kernel. The driving need in the short term
is IPSec (which has also been partially merged), but other applications,
such as cryptographic filesystems, can also make use of this facility.
Needless to say, use in the networking and filesystem layers places some
strong performance demands on the cryptographic layer.
The new crypto API is based on the scatterlist structure, which is
used in many other parts of the I/O subsystem. Scatterlists give direct
access to the page structures describing the memory to be operated
on, below the level of the virtual memory system. Among other things, this
architecture means that data can be encrypted or decrypted "in place" in
the buffers that are used for I/O operations. It should, in other words,
be fast (if the crypto algorithms themselves are fast).
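The scatterlist idea can be illustrated outside the kernel. The sketch below is a hypothetical userspace analogue (the `struct seg` name and `xor_transform` function are made up for illustration, and the XOR "cipher" is a toy, not a real algorithm): a transform walks a list of (pointer, length) segments and modifies the data in place, wherever those buffers happen to live.

```c
#include <assert.h>
#include <stddef.h>

/* A userspace stand-in for a scatterlist entry: one segment of the
   data to be transformed. (Hypothetical type, for illustration only;
   the kernel's struct scatterlist refers to page structures instead.) */
struct seg {
    unsigned char *buf;
    size_t len;
};

/* Toy "cipher": XOR every byte with a one-byte key, in place.
   Applying it twice with the same key restores the original data,
   which is the property the in-place I/O buffer usage relies on. */
void xor_transform(struct seg *sg, int nsg, unsigned char key)
{
    for (int i = 0; i < nsg; i++)
        for (size_t j = 0; j < sg[i].len; j++)
            sg[i].buf[j] ^= key;
}
```

The point of the exercise: no data is copied into a contiguous staging buffer; the transform visits each segment where it already sits, which is what lets the kernel encrypt or decrypt directly in I/O buffers.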
Three basic types of "transforms" are supported by this API: ciphers,
digests, and compressors. The kernel currently has implementations for DES
(including triple DES), MD4, MD5, and SHA. See Documentation/crypto/api-intro.txt
for a quick overview of how this API works.
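Digest transforms follow the familiar init/update/final sequence (in the kernel API, via crypto_digest_init(), crypto_digest_update(), and crypto_digest_final() operating on scatterlists, per api-intro.txt). As a sketch of that calling pattern only, here is a toy checksum; the `toy_digest` names are invented for illustration, and a real digest like MD5 mixes its state far more thoroughly than this simple sum.

```c
#include <assert.h>
#include <stddef.h>

/* Toy digest context (hypothetical; stands in for a crypto_tfm). */
struct toy_digest {
    unsigned int sum;
};

/* init: reset the running state */
void toy_init(struct toy_digest *d)
{
    d->sum = 0;
}

/* update: fold in another chunk of data; may be called many times */
void toy_update(struct toy_digest *d, const unsigned char *p, size_t n)
{
    while (n--)
        d->sum += *p++;   /* a real digest would mix, not just add */
}

/* final: produce the result */
unsigned int toy_final(const struct toy_digest *d)
{
    return d->sum;
}
```

Because the state lives in the context between update calls, data can be fed in segment by segment, which is exactly what a scatterlist-based caller needs.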
(As a postscript: a few people have asked if this API would be made
available to user space. It would not be hard to write a system call or
pseudo-device which would export this functionality, but it is hard to
imagine why that would be useful. It would be better just to run the
crypto algorithms in user space directly).
One of the remaining issues to be resolved before the Halloween feature
freeze is whether the Reiser4 filesystem will be included in the 2.5
kernel. This has been a hard question to answer, however, given that
almost nobody had actually seen the Reiser4 source. That situation, at
least, has come to an end with the announcement
of the first public Reiser4 snapshot.
Reiser4 is the latest incarnation of the ReiserFS filesystem. It is not
simply an upgrade; Reiser4 has been redesigned and reimplemented from the
beginning. It is a completely different filesystem than the ReiserFS (also
known as "Reiser3") found in the 2.4 kernel; should it be included, the
next stable kernel will contain both Reiser3 and Reiser4, as separate
filesystems.
There is a fair amount of online information available on Reiser4, though
some of it makes for a bit of a challenging read. This lengthy document provides
an in-depth discussion of many of the Reiser4 features (not all of which are
implemented yet), along with an explanation of Hans Reiser's long-term
vision for filesystems, a polemic on free software, and some of the
weirdest imagery to be found in software documentation anywhere. The
document entitled The
Infrastructure for Security Attributes in Reiser4 is actually a
relatively straightforward discussion of many of the technical details
behind the Reiser4 design, and might be a better starting point.
For those wanting a shorter summary, here are a few of the features to be
found in Reiser4:
- The filesystem maintains many of the basic features of Reiser3 -
it is based on (mostly) balanced trees, with file data incorporated in
the tree along with names. Reiser4 thus remains well suited to the
handling of large numbers of very small files.
- It is smarter about block allocation and data placement. Block
allocation is delayed until file data is actually written to disk,
leading to more efficient layouts. On-disk layout is done with
extents. The result of these optimizations is that the filesystem's
read performance is greatly improved over Reiser3.
- "Wandering logfiles" take some techniques from log-structured
filesystems to provide journaling without (always) writing data to the
disk twice. In many cases, Reiser4 can write "journal" data to a disk
block, then atomically swap the journal block into the file itself.
The journaling code can overwrite or replace blocks, depending on
which technique would provide better layout on the disk.
- Most filesystem semantics are implemented with plugins. The normal
Unix directory behavior, for example, is implemented with the "Unix
directory plugin." Plugins can be used to implement security features
(access control lists and such), encryption, maintenance of audit
trails, and no end of strange, non-POSIX
semantics. Hans Reiser remains determined to implement a lot of
interesting features in his filesystem, and plugins are the mechanism
by which those features will be included.
- Reiser4 is heavily transaction-oriented, and is able to provide
guarantees that operations will be performed atomically. Future plans
call for the ability to perform multi-file operations in an atomic
manner.
- The Reiser4 design includes a reiser4() system call
"to support applications that don't have to be fooled into
thinking that they are using POSIX." This system call will
accept (and parse) command strings that can describe complex
operations. The reiser4() system call is not implemented in
the current snapshot.
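The plugin idea amounts to dispatching filesystem operations through per-object tables of function pointers. The following is a minimal userspace sketch of that design (all names here, `name_plugin`, `check_name`, and the rules themselves, are invented for illustration; Reiser4's actual plugin interfaces are more elaborate): two name-validation "plugins" share one interface, and the caller works through whichever table an object carries.

```c
#include <assert.h>
#include <string.h>

/* A plugin supplies its operations through a table of function
   pointers; the filesystem core calls through the table without
   knowing which behavior it gets. (Hypothetical interface.) */
struct name_plugin {
    const char *label;
    int (*is_valid)(const char *name);  /* accept or reject a name */
};

/* Classic Unix rules: non-empty, no '/' allowed. */
int unix_is_valid(const char *name)
{
    return name[0] != '\0' && strchr(name, '/') == NULL;
}

/* A stricter variant plugin: additionally forbid leading '.' names. */
int strict_is_valid(const char *name)
{
    return unix_is_valid(name) && name[0] != '.';
}

const struct name_plugin unix_plugin   = { "unix",   unix_is_valid };
const struct name_plugin strict_plugin = { "strict", strict_is_valid };

/* The core's side of the interface: dispatch through the table. */
int check_name(const struct name_plugin *p, const char *name)
{
    return p->is_valid(name);
}
```

Swapping one table for another changes the semantics without touching the caller, which is how a plugin architecture can host ACLs, encryption, audit trails, or non-POSIX behavior behind one interface.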
As an example of the sort of uses that the Reiser4 developers eventually
would like to see, consider the classic Unix password file. Each line in
the file describes one account, and contains several colon-separated fields
with information like the account name, user and group IDs, the user's home
directory and shell, etc. In Reiser4, each field in the password file
would become a file in its own right; one could obtain the home directory
of a given user via a path like /etc/passwd/user/home.
A special-purpose plugin would aggregate the various files, so that a
process reading /etc/passwd would see the same information as
always. But each field file could be protected differently; a user could
have write access to the file describing his or her full name, but not to the
one containing the user ID value.
In the Reiser4 vision, file attributes would also be stored as files. For
a given file, something like file/owner would contain
the UID of the
user who owns that file.
Needless to say, in the long-term Reiser vision, Linux systems will behave
rather differently than they do now. In the shorter term, Reiser4 promises
a high-performance journaling filesystem with highly efficient handling of
small files and a plugin architecture which encourages experiments with
interesting new semantics.
Will it be merged? The Reiser4 team plans to submit a patch for merging at
the last second, sometime before midnight on Halloween. Some developers
have argued that it is too late to propose a major new feature that nobody
has had a chance to look at. Hans sees things differently:
I'm the last straggler coming back from the hunt, and I've got what
looks like it might be a wooly mammoth on my shoulders, and my
tribesmen are complaining that I'm late for dinner. How about
helping me by cutting down a tree for the roasting spit instead?
Linus has not offered any public opinions on the matter. The Reiser4 patch
is apparently unintrusive, however, so there is probably no real reason not
to include it.
The classic Unix way to wait for I/O events on multiple file descriptors is
with the select() and poll()
system calls. When a
process invokes one of those calls, the kernel goes through the list of
interesting file descriptors, checks to see if non-blocking I/O is
available on any of them, and adds the calling process to a wait queue for
each file descriptor that would block. This implementation works
reasonably well when the number of file descriptors is small. But if a
process is managing thousands of file descriptors, the select()
calls must check every single one of them, and add the
calling process to thousands of wait queues. For every single call.
Needless to say, this approach does not scale very well.
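The per-call overhead is visible even in a minimal example. In the sketch below (the `wait_readable` helper is invented for illustration), the fd_set and timeout must be rebuilt from scratch before every select() call, because the kernel keeps no state between calls; with thousands of descriptors, that rebuilding and rescanning happens every time.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_sec seconds for fd to become readable.
   Returns 1 if readable, 0 on timeout. Note that the fd_set and
   timeout are set up anew on every call -- select() keeps nothing
   in the kernel between invocations. */
int wait_readable(int fd, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv = { timeout_sec, 0 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);          /* re-register interest, every time */
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

With one descriptor this is cheap; the article's point is that the same setup, scan, and wait-queue work is repeated across all watched descriptors on every call.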
Davide Libenzi and others have been working for some time on a new approach
to polling that would work for thousands of files. It was originally
implemented as a special device (/dev/epoll), but, on request from
Linus, the new scheme was turned into a new set of epoll() system
calls.
These calls work in a very different way. Every call to select()
or poll() is a separate event; the data structures must be set up
and torn down every time. epoll, instead, requires the
application to build a persistent (across calls) data structure in kernel
space first. The application starts by creating a special epoll
file descriptor:
int epfd = epoll_create(int maxfds);
The maxfds parameter is the maximum number of file descriptors
that the process expects to manage. The return value is a file descriptor
to be used with the other epoll calls; it should be shut down with
close() when it is no longer needed.
Each file descriptor to be managed must be added to the special
epoll descriptor with:
int epoll_ctl(int epfd, int op, int fd, unsigned int events);
The op parameter specifies the operation to be performed (add,
change, or remove the given file descriptor fd), and
events is a mask of events of interest to the process.
Once everything has been set up, the process can sit back and wait until
there is something for it to do:
int epoll_wait(int epfd, struct pollfd const **events, int timeout);
The return value is the number of events (i.e. readable or writeable file
descriptors) that epoll_wait() has found.
These system calls have been shown, through heavy benchmarking, to scale in
constant time up to unbelievable numbers of file descriptors (some graphs
can be found on this
page). The persistent data structure built around the epoll
file descriptor is one of the reasons for this scalability: there is no
need to set it up and tear it down for every epoll_wait() call.
The other half of the story is in how epoll_wait() finds the
readable or writeable file descriptors. Rather than polling each file
descriptor (and adding itself to wait queues), the epoll mechanism
adds a callback structure onto the struct file associated
with each file descriptor. When a file descriptor becomes readable or
writeable, its callback(s) are called, and processes using
epoll_wait() can be notified directly. So an
epoll_wait() call never needs to make a pass over the list of file
descriptors it is watching.
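The create/register/wait sequence can be sketched in userspace. A caveat: the interface was still settling at the time, so the sketch below uses the sys/epoll.h form with a struct epoll_event passed to epoll_ctl() and epoll_wait(), which differs in detail from the prototypes quoted in the article; the `epoll_demo` wrapper itself is invented for illustration.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register a pipe's read end once, make it readable, collect the
   event. Returns 0 on success, -1 on any failure. */
int epoll_demo(void)
{
    int p[2];
    struct epoll_event ev, out;

    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create(1);     /* the size argument is a hint */
    if (epfd < 0)
        return -1;

    /* Persistent registration: done once, then reused by every
       subsequent epoll_wait() call on epfd. */
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != 0)
        return -1;

    if (write(p[1], "x", 1) != 1)   /* make the pipe readable */
        return -1;

    /* No per-descriptor scan here: the callback attached at
       registration time reports readiness directly. */
    int n = epoll_wait(epfd, &out, 1, 1000);   /* 1000 ms timeout */
    int ok = (n == 1) && (out.events & EPOLLIN) && (out.data.fd == p[0]);

    close(epfd);
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

The structural difference from the select() pattern is that the registration step happens once, outside the wait loop; only the wait is repeated.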
The epoll patch is ready, and Linus has indicated that he wants to
merge it. For now, epoll only works for pipes and sockets (its
initial use is likely to be network services that manage large numbers of
connections). Expanding its scope to other types of I/O should just be a
matter of doing the work, however.
Page editor: Jonathan Corbet