Brief items
The current development kernel is 2.5.73, which was
released by Linus on June 22. Changes
this time around include some big ext3 and journaling changes (see
last week's LWN Kernel Page), an ACPI update, a
big ia64 merge, some networking fixes, a new PCI device locking scheme, the
new request_firmware() interface (see
the May 21 LWN Kernel Page), an NFS server
update, more driver model work, an ARM update, and various other fixes and
tweaks.
The long-format changelog has all
the details.
Linus's BitKeeper tree, as of this writing, contains an MTD driver cleanup,
the beginning of work on the loop driver (see below), and some patches to
make the network block device driver work again.
The current stable kernel is 2.4.21. Marcelo has started the 2.4.22
process (which is supposed to last only a couple of months) with the release
of 2.4.22-pre1; it is a large patch with a lot of
USB work, the long-awaited ACPI update, some network fixes, and quite a few
other repairs and updates.
Kernel development news
One longstanding goal in kernel development has been to eliminate the
differences between loadable modules and monolithic (linked-in) code. The
fewer differences there are, the easier it is to write code which works in
either mode - and to maintain that code. In 2.5, this process is almost
complete; there is very little code which is unique to either modules or
monolithic code.
One remaining difference, however, has to do with initialization and exit
code. It is possible to use the module_init() macro to designate an
initialization function, and that function will be called properly at
module load time or at boot
time if the module is built directly into the kernel. (Exit
functions for monolithic code are, of course, simply discarded.) The
catch is that monolithic code can have multiple
initialization calls, while modules can only have one. Monolithic
code initialization calls can even be given priorities (via macros like
core_initcall() or late_initcall()) which control when
each function is called.
One would think this wouldn't matter a whole lot for loadable modules,
since every initialization function would be called at the same time
(when the module is loaded) anyway. But this difference forces module and
monolithic code to be different. It also prevents the creation of nice,
initialization-time macros which ease the process of setting up
/proc files or sysfs entries.
With a new patch (since revised) from Rusty Russell, things
will change. Rusty notes the real reason why modules can only have a
single set of initialization and exit functions: the kernel simply does not
know what to do if one of a series of initialization functions fails. In
that case, the module load process must fail, and some sort of cleanup must
be performed. The problem is knowing what that cleanup is.
The solution is to associate pairs of initialization and exit functions.
That is done with a new macro:
module_init_exit(priority, init_fn, exit_fn);
This call designates a new initialization and exit function pair, and
associates a priority with that pair.
Each exit function cleans up (only) the work done by its associated
initialization function. At module load time, the initialization functions
are called in increasing priority order. Should one fail, the exit
functions corresponding to the initialization functions that succeeded will
be called, in reverse priority order. Thus, a properly-written module
should be able to clean up after itself correctly after a failure in any
part of the initialization process.
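The load-and-unwind behavior described above can be sketched in ordinary user-space C. This is only a model of the semantics, not code from Rusty's patch: the structure init_exit_pair, the function load_module_model(), and the demo functions are all hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of one init/exit pair; not the kernel's own structure. */
struct init_exit_pair {
    int priority;           /* lower values are initialized first */
    int (*init_fn)(void);   /* returns 0 on success, nonzero on failure */
    void (*exit_fn)(void);  /* undoes only the work done by init_fn */
};

static int cmp_priority(const void *a, const void *b)
{
    const struct init_exit_pair *pa = a, *pb = b;

    return (pa->priority > pb->priority) - (pa->priority < pb->priority);
}

/*
 * Call every init function in increasing priority order. If one fails,
 * call the exit functions of the pairs that already succeeded, in
 * reverse order, and return the error. (The order of pairs with equal
 * priority is left unspecified in this sketch.)
 */
int load_module_model(struct init_exit_pair *pairs, size_t n)
{
    size_t i;
    int err;

    assert(pairs != NULL);
    qsort(pairs, n, sizeof(*pairs), cmp_priority);
    for (i = 0; i < n; i++) {
        err = pairs[i].init_fn();
        if (err) {
            while (i-- > 0)
                pairs[i].exit_fn();
            return err;
        }
    }
    return 0;
}

/* Demo: record each call as one character so the order can be inspected. */
static char trace[64];
static size_t trace_len;

static void note(char c)
{
    if (trace_len + 1 < sizeof(trace))
        trace[trace_len++] = c;
}

static int init_a(void) { note('A'); return 0; }
static void exit_a(void) { note('a'); }
static int init_b(void) { note('B'); return 0; }
static void exit_b(void) { note('b'); }
static int init_c(void) { note('C'); return -1; }   /* always fails */
static void exit_c(void) { note('c'); }

/* Runs a load in which the last (priority 30) init function fails. */
const char *demo_failed_load(void)
{
    struct init_exit_pair pairs[] = {
        { 30, init_c, exit_c },
        { 10, init_a, exit_a },
        { 20, init_b, exit_b },
    };

    memset(trace, 0, sizeof(trace));
    trace_len = 0;
    load_module_model(pairs, sizeof(pairs) / sizeof(pairs[0]));
    return trace;
}
```

In this model, demo_failed_load() yields the trace "ABCba": the three init functions run in priority order, the third fails, and only the two that succeeded are undone, newest first. The exit function paired with the failing init function is never called, since that init function is expected to clean up after itself before returning an error.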
An early version of this patch broke modules using the long-deprecated
technique of calling their initialization and exit functions
init_module() and cleanup_module(), respectively. That
has since been patched up - this stage of the kernel development process is
not the time to be making such changes. But the writing is on the wall,
and that particular technique is not likely to survive past 2.7.
The Linux loop driver is a virtual disk driver which loops block I/O
requests back to a file or partition on a local drive. It has a number of
uses, such as mounting ISO images contained within a file on another
filesystem. The loop driver is also well positioned to apply
transformations to block data as it passes through, however. It is thus
a logical place for the implementation of encrypted filesystems.
By adding a cryptographic
transformation to the loop driver, encryption can be added to any standard
Linux filesystem without having to worry about the filesystem code itself.
An actual encrypted loop driver has never been packaged with the Linux kernel,
but implementations have long been available through sites like
kerneli.org.
In 2.5 the mainline kernel was opened up to cryptographic code. Numerous
ciphers and other algorithms have been added as part of the new crypto API,
but, so far, encryption has not been hooked into the loop driver.
Connecting up the components is not that hard at this point, but there is
one slightly thorny issue which still needs to be resolved.
Many ciphers can take an "initial vector" argument, along with the
encryption key and the data to encrypt (or decrypt). The initial vector
influences the encryption of the data;
the same initial vector must be supplied when that data is decrypted. For
filesystems, the initial vector is often derived from the position of
the data block within the filesystem, with the result that how data is
encrypted depends on its position on the (virtual) disk.
The Linux loop driver, while not performing encryption itself, has long had
a number of hooks to make it easier for others to plug in encryption
algorithms. One of the things the loop driver does is calculate and
provide an initial vector value for data transformations. This seems like
a useful service for the loop driver to provide, except that
nobody likes how that initial vector is calculated.
The problem is that the initial vector is derived from the logical
block number of the data in the filesystem holding the loopback image.
This method works until the block size of that filesystem changes; at that
point the initial vectors change and the filesystem becomes unreadable.
The loopback driver does, by behaving this way, achieve the objective of
protecting the data from prying eyes. But users can be hard to satisfy,
and they complain anyway.
The fix, as posted by Fruhwirth Clemens (or,
as part of a bigger loop patch by Andries Brouwer), is
simple. Rather than using block numbers to generate the initial vector,
the loop driver should simply use 512-byte sector offsets. With that
change, initial vectors are independent of the blocksize of the underlying
filesystem and all is well.
Except, of course, for those users who created filesystems using the
older initial vector calculation. A change in the initial vector will lock
all of those users out of their data, an act which is seen as being in poor
taste. As a result, some developers have argued that this change cannot be merged as it
is.
The real question, however, is whether anybody actually has filesystems
encrypted with block-based initial vectors. The kernel itself has never had
support for a cryptographic loop driver, so there is no compatibility with
older mainline kernels to break. The external projects which have provided
this support - loop-AES and kerneli - also noticed the initial vector
problem a long time ago and fixed it in their code. So it would seem
that, in fact, there are no users dependent on the older algorithm. In
that case, it makes a great deal of sense to fix it now, before somebody
does start using it in 2.6. If, on the other hand, somebody, somewhere
really has used the old initial vector calculation to encrypt data, they
may want to speak up fairly soon.
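The difference between the two initial vector calculations discussed above can be illustrated with a small sketch. It is a simplified model rather than the loop driver's actual code: the function names iv_from_block() and iv_from_sector() are hypothetical, and the initial vector is treated as a plain integer derived from a byte offset within the image.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Older scheme, as described in the text: the initial vector is a block
 * number, so its value depends on the block size of the filesystem
 * holding the loopback image.
 */
uint64_t iv_from_block(uint64_t byte_offset, uint32_t block_size)
{
    assert(block_size != 0);
    return byte_offset / block_size;
}

/*
 * Proposed scheme: the initial vector is a 512-byte sector number,
 * which does not depend on any filesystem block size.
 */
uint64_t iv_from_sector(uint64_t byte_offset)
{
    return byte_offset / SECTOR_SIZE;
}
```

For data at byte offset 8192, the block-based calculation gives 8 with 1024-byte blocks but 2 with 4096-byte blocks, so data encrypted under one block size cannot be decrypted under the other. The sector-based calculation gives 16 in both cases.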
The bulk of the development effort on the kernel is currently aimed at
stabilizing things for the 2.6 release. Chances are that things will stay
that way for the better part of a year - remember that a fair amount of
stabilization work has to happen
after 2.6.0 is released. Even so,
we're starting to see hints (and even code) showing where some things might
go in 2.7.
A number of people maintain their own special-purpose kernel trees. Most
of them are aimed at adding features to the 2.4 or 2.5 kernels; many serve
as staging areas for patches which, it is hoped, will be merged into the
mainline soon. Those of you who find 2.5.x to be overly stable and boring,
though, may want to have a look at William Lee Irwin's -wli patch series,
which is full of stuff that no rational person would consider putting into
2.5 at this point. Some of the work to be found there includes:
- Single-page kernel stacks and interrupt stacks. This work, discussed here last December, increases
the number of processes a system can support by reducing the
per-process memory usage for stacks.
- Object-based reverse mapping (covered in
February). This technique cuts down on virtual memory management
overhead in most cases. In 2.5.73-wli-1, object-based reverse mapping
for anonymous objects (i.e. user-space memory) was added as well.
- High-memory page mid-level directories. The PMD is the middle tier
on systems which use three-level page table schemes - such as x86
systems with massive amounts of memory. The "highpmd" patch moves
these page directories into high memory, thus reducing the amount of
low memory required by each process on the system. Low memory (the
memory, usually below 1GB, which is directly addressable by the
kernel) tends to be scarce on truly huge systems, so any change which
shifts data structures to high memory can be helpful.
As a result of these (and numerous other) patches, William claims a
five-fold increase in the number of processes which can be supported by a
massive system. This work certainly improves scalability, and may well
make it into the mainline - but not in 2.5. (The -wli patches do not
currently include his page clustering work,
which is even more bleeding-edge. Page clustering, too, may well become a
2.7 feature.)
More in the realm of vaporware currently is Daniel Phillips's 2.7 agenda. Daniel has been
the source of numerous interesting ideas in the past (though somewhat fewer
completed implementations). Among other things, the shared page table
patch (which could also be a 2.7 candidate) was originally written by
Daniel. Looking forward to 2.7, Daniel has a few topics of interest:
- Memory defragmentation. Once a Linux system has been running for a
bit, it can get hard for kernel code to allocate blocks of two or more
physically contiguous pages. In most cases, kernel hackers don't even
try. Daniel suggests the creation of a defragmentation daemon which
would move pages around in an attempt to create larger contiguous
blocks of free memory. Additions made to the kernel in 2.5 (such as
the reverse-mapping VM) will help in this regard, since pages cannot
be moved unless the kernel knows where all the pointers to the page
are.
- Variable-size pages. This idea includes page clustering to create
large pages along with "sub-pages" which are smaller than the physical
page size. Daniel claims to have a prototype implementation which
makes the kernel smaller and faster, and which simplifies a number of
things.
- A physical block cache. This would be a separate address space which
tracks physical blocks on a given volume. There are various
performance benefits which would come from such a structure.
It is far too soon to say with any kind of certainty where the 2.7
development series will go. Linus explicitly resists creating any sort of
explicit plan, preferring to see what sorts of developments prove
interesting enough to actually get implemented and used. Still, one can
read from these early hints that the developers expect to remain interested
in virtual memory topics for a while yet.
Patches and updates
Filesystems and block I/O
- Andries.Brouwer@cwi.nl: loop.c.
(June 21, 2003)
Page editor: Jonathan Corbet