Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.73, which was released by Linus on June 22. Changes this time around include some big ext3 and journaling changes (see last week's LWN Kernel Page), an ACPI update, a big ia64 merge, some networking fixes, a new PCI device locking scheme, the new request_firmware() interface (see the May 21 LWN Kernel Page), an NFS server update, more driver model work, an ARM update, and various other fixes and tweaks. The long-format changelog has all the details.

Linus's BitKeeper tree, as of this writing, contains an MTD driver cleanup, the beginning of work on the loop driver (see below), and some patches to make the network block device driver work again.

The current stable kernel is 2.4.21. Marcelo has started the 2.4.22 process (which is supposed to last only a couple of months) with the release of 2.4.22-pre1; it is a large patch with a lot of USB work, the long-awaited ACPI update, some network fixes, and quite a few other repairs and updates.

Kernel development news

Supporting multiple module initialization functions

One longstanding goal in kernel development has been to eliminate the differences between loadable modules and monolithic (linked-in) code. The fewer differences there are, the easier it is to write code which works in either mode - and to maintain that code. In 2.5, this process is almost complete; there is very little code which is unique to either modules or monolithic code.

One remaining difference, however, has to do with initialization and exit code. The module_init() macro designates an initialization function, and that function will be called at module load time or, if the module is built directly into the kernel, at boot time. (Exit functions for monolithic code are, of course, simply discarded.) But monolithic code can have multiple initialization calls, while modules can only have one. Monolithic code initialization calls can even be given priorities (via macros like core_initcall() or late_initcall()) which control when each function is called.

One would think this wouldn't matter a whole lot for loadable modules, since every initialization function would be called at the same time (when the module is loaded) anyway. But the restriction forces code to be written differently depending on whether it is to be built as a module or linked into the kernel. It also prevents the creation of nice, initialization-time macros which ease the process of setting up /proc files or sysfs entries.

With a new patch (since revised) from Rusty Russell, things will change. Rusty notes the real reason why modules can only have a single set of initialization and exit functions: the kernel simply does not know what to do if one of a series of initialization functions fails. In that case, the module load process must fail, and some sort of cleanup must be performed. The problem is knowing what that cleanup is.

The solution is to associate pairs of initialization and exit functions. That is done with a new macro:

    module_init_exit(priority, init_fn, exit_fn);

This call designates a new initialization and exit function pair, and associates a priority with that pair. Each exit function cleans up (only) the work done by its associated initialization function. At module load time, the initialization functions are called in increasing priority order. Should one fail, the exit functions corresponding to the initialization functions that succeeded will be called, in reverse priority order. Thus, a properly-written module should be able to clean up after itself correctly after a failure in any part of the initialization process.
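The load-time behavior described above can be modeled in ordinary user-space C. What follows is only a sketch of the ordering and unwinding rules, not code from Rusty's patch: the init_exit_pair structure, load_pairs(), and the tracing helpers are all hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of one init/exit pair; not a kernel type. */
struct init_exit_pair {
    int priority;            /* lower values are initialized first */
    int (*init_fn)(void);    /* returns 0 on success, nonzero on failure */
    void (*exit_fn)(void);   /* undoes only the work done by init_fn */
};

/*
 * Call each init function in turn. The array must already be sorted by
 * increasing priority. If one fails, call the exit functions of the pairs
 * that succeeded, in reverse order, and return the error code.
 */
static int load_pairs(const struct init_exit_pair *pairs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        assert(i == 0 || pairs[i - 1].priority <= pairs[i].priority);

        int err = pairs[i].init_fn();
        if (err != 0) {
            /* Pair i failed, so only pairs 0..i-1 need to be undone. */
            while (i-- > 0)
                pairs[i].exit_fn();
            return err;
        }
    }
    return 0;
}

/* Small trace buffer so the call order can be inspected. */
static char trace[16];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof(trace))
        trace[trace_len++] = c;
}

static int init_a(void) { note('A'); return 0; }
static void exit_a(void) { note('a'); }
static int init_b(void) { note('B'); return 0; }
static void exit_b(void) { note('b'); }
static int init_c(void) { note('C'); return -1; }  /* always fails */
static void exit_c(void) { note('c'); }
```

Loading the three pairs (a, b, c) at priorities 1, 2 and 3 makes load_pairs() return -1 and records the calls A, B, C, b, a. Note that exit_c() is never called: its initialization function failed, so there is nothing for it to undo.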

An early version of this patch broke modules using the long-deprecated technique of calling their initialization and exit functions init_module() and cleanup_module(), respectively. That has since been patched up - this stage of the kernel development process is not the time to be making such changes. But the writing is on the wall, and that particular technique is not likely to survive past 2.7.

Fixing the cryptoloop driver

The Linux loop driver is a virtual disk driver which loops block I/O requests back to a file or partition on a local drive. It has a number of uses, such as mounting ISO images contained within a file on another filesystem. The loop driver is also well positioned to apply transformations to block data as it passes through, however. It is thus a logical place for the implementation of encrypted filesystems. By adding a cryptographic transformation to the loop driver, encryption can be added to any standard Linux filesystem without having to worry about the filesystem code itself. An actual encrypted loop driver has never been packaged with the Linux kernel, but implementations have long been available through sites like kerneli.org.

In 2.5 the mainline kernel was opened up to cryptographic code. Numerous ciphers and other algorithms have been added as part of the new crypto API, but, so far, encryption has not been hooked into the loop driver. Connecting up the components is not that hard at this point, but there is one slightly thorny issue which still needs to be resolved.

Many ciphers can take an "initial vector" argument, along with the encryption key and the data to encrypt (or decrypt). The initial vector influences the encryption of the data; the same initial vector must be supplied when that data is decrypted. For filesystems, the initial vector is often derived from the position of the data block within the filesystem, with the result that how data is encrypted depends on its position on the (virtual) disk.
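To see why the initial vector matters, consider a deliberately simplified chaining scheme. The sketch below is not the kernel crypto API and is in no way secure: the one-byte toy_encrypt() "cipher" and the chain_encrypt() and chain_decrypt() helpers are invented for this illustration, assuming a CBC-style mode in which each block is mixed with the previous ciphertext block and the first block is mixed with the initial vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy, insecure one-byte "cipher"; it exists only to show the chaining. */
static uint8_t toy_encrypt(uint8_t x, uint8_t key)
{
    x ^= key;
    return (uint8_t)((x << 3) | (x >> 5));   /* rotate left by 3 bits */
}

static uint8_t toy_decrypt(uint8_t y, uint8_t key)
{
    y = (uint8_t)((y >> 3) | (y << 5));      /* rotate right by 3 bits */
    return (uint8_t)(y ^ key);
}

/* CBC-style chaining: the first byte is mixed with the initial vector. */
static void chain_encrypt(const uint8_t *in, uint8_t *out, size_t n,
                          uint8_t key, uint8_t iv)
{
    assert(in != NULL && out != NULL);
    uint8_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        out[i] = toy_encrypt((uint8_t)(in[i] ^ prev), key);
        prev = out[i];
    }
}

/* Decryption only recovers the data if the same initial vector is used. */
static void chain_decrypt(const uint8_t *in, uint8_t *out, size_t n,
                          uint8_t key, uint8_t iv)
{
    assert(in != NULL && out != NULL);
    uint8_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        uint8_t c = in[i];
        out[i] = (uint8_t)(toy_decrypt(c, key) ^ prev);
        prev = c;
    }
}
```

Encrypting the same data with the same key but two different initial vectors yields different ciphertext, and decrypting with the wrong initial vector corrupts the start of the data. A driver which derives the initial vector from the data's position must therefore compute that position in exactly the same way at encryption and decryption time.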

The Linux loop driver, while not performing encryption itself, has long had a number of hooks to make it easier for others to plug in encryption algorithms. One of the things the loop driver does is calculate and provide an initial vector value for data transformations. This seems like a useful service for the loop driver to provide, except that nobody likes how that initial vector is calculated.

The problem is that the initial vector is derived from the logical block number of the data in the filesystem holding the loopback image. This method works until the block size of that filesystem changes; at that point the initial vectors change and the filesystem becomes unreadable. The loopback driver does, by behaving this way, achieve the objective of protecting the data from prying eyes. But users can be hard to satisfy, and they complain anyway.

The fix, as posted by Fruhwirth Clemens (or, as part of a bigger loop patch, by Andries Brouwer), is simple. Rather than using block numbers to generate the initial vector, the loop driver should simply use 512-byte sector offsets. With that change, initial vectors are independent of the blocksize of the underlying filesystem and all is well. Except, of course, for those users who created filesystems using the older initial vector calculation. A change in the initial vector will lock all of those users out of their data, an act which is seen as being in poor taste. As a result, some developers have argued that this change cannot be merged as it is.
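The difference between the two calculations can be sketched in a few lines of C. This is a simplified model rather than the code in loop.c or in either patch; the function names are hypothetical, and the initial vector is assumed to be nothing more than the position of the data divided by the unit size.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u   /* fixed unit, independent of any filesystem */

/* Old scheme: the initial vector is a block number, so it changes
 * whenever the block size of the backing filesystem changes. */
static uint64_t iv_from_block(uint64_t byte_offset, uint32_t block_size)
{
    assert(block_size != 0);
    return byte_offset / block_size;
}

/* Proposed scheme: the initial vector is the 512-byte sector number. */
static uint64_t iv_from_sector(uint64_t byte_offset)
{
    return byte_offset / SECTOR_SIZE;
}
```

For data sitting 8192 bytes into the image, iv_from_block() yields 8 with 1024-byte blocks but 2 with 4096-byte blocks, so data encrypted under one block size cannot be decrypted under the other. iv_from_sector() yields 16 in both cases.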

The real question, however, is whether anybody actually has filesystems encrypted with block-based initial vectors. The kernel itself has never had support for a cryptographic loop driver, so there is no compatibility with older mainline kernels to break. The external projects which have provided this support - loop-AES and kerneli - also noticed the initial vector problem a long time ago and fixed it in their code. So it would seem that, in fact, there are no users dependent on the older algorithm. In that case, it makes a great deal of sense to fix it now, before somebody does start using it in 2.6. If, on the other hand, somebody, somewhere really has used the old initial vector calculation to encrypt data, they may want to speak up fairly soon.

Looking forward to 2.7

The bulk of the development effort on the kernel is currently aimed at stabilizing things for the 2.6 release. Chances are that things will stay that way for the better part of a year - remember that a fair amount of stabilization work has to happen after 2.6.0 is released. Even so, we're starting to see hints (and even code) showing where some things might go in 2.7.

A number of people maintain their own special-purpose kernel trees. Most of them are aimed at adding features to the 2.4 or 2.5 kernels; many serve as staging areas for patches which, it is hoped, will be merged into the mainline soon. Those of you who find 2.5.x to be overly stable and boring, though, may want to have a look at William Lee Irwin's -wli patch series, which is full of stuff that no rational person would consider putting into 2.5 at this point. Some of the work to be found there includes:

  • Single-page kernel stacks and interrupt stacks. This work, discussed here last December, increases the number of processes a system can support by reducing the per-process memory usage for stacks.

  • Object-based reverse mapping (covered in February). This technique cuts down on virtual memory management overhead in most cases. In 2.5.73-wli-1, object-based reverse mapping for anonymous objects (i.e. user-space memory) was added as well.

  • High-memory page mid-level directories. The PMD is the middle tier on systems which use three-level page table schemes - such as x86 systems with massive amounts of memory. The "highpmd" patch moves these page directories into high memory, thus reducing the amount of low memory required by each process on the system. Low memory (the memory, usually below 1GB, which is directly addressable by the kernel) tends to be scarce on truly huge systems, so any change which shifts data structures to high memory can be helpful.

As a result of these (and numerous other) patches, William claims a five-fold increase in the number of processes which can be supported by a massive system. This work certainly improves scalability, and may well make it into the mainline - but not in 2.5. (The -wli patches do not currently include his page clustering work, which is even more bleeding-edge. Page clustering, too, may well become a 2.7 feature.)

More in the realm of vaporware currently is Daniel Phillips's 2.7 agenda. Daniel has been the source of numerous interesting ideas in the past (though somewhat fewer completed implementations). Among other things, the shared page table patch (which could also be a 2.7 candidate) was originally written by Daniel. Looking forward to 2.7, Daniel has a few topics of interest:

  • Memory defragmentation. Once a Linux system has been running for a bit, it can get hard for kernel code to allocate blocks of two or more physically contiguous pages. In most cases, kernel hackers don't even try. Daniel suggests the creation of a defragmentation daemon which would move pages around in an attempt to create larger contiguous blocks of free memory. Additions made to the kernel in 2.5 (such as the reverse-mapping VM) will help in this regard, since pages cannot be moved unless the kernel knows where all the pointers to the page are.

  • Variable-size pages. This idea includes page clustering to create large pages along with "sub-pages" which are smaller than the physical page size. Daniel claims to have a prototype implementation which makes the kernel smaller and faster, and which simplifies a number of things.

  • A physical block cache. This would be a separate address space which tracks physical blocks on a given volume. There are various performance benefits which would come from such a structure.

It is far too soon to say with any kind of certainty where the 2.7 development series will go. Linus explicitly resists creating any sort of formal plan, preferring to see what sorts of developments prove interesting enough to actually get implemented and used. Still, one can read from these early hints that the developers expect to remain interested in virtual memory topics for a while yet.

Patches and updates

Filesystems and block I/O

  • Andries.Brouwer@cwi.nl: loop.c. (June 21, 2003)

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds