Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.12, which was released on June 17. Quite a few fixes - but no substantial changes - were merged after the last release candidate. For those who might not remember back to last March: 2.6.12 contains, among other things, a driver for the "trusted computing" (TPM) chip found in Thinkpads (and elsewhere), SuperHyway bus support, a multilevel security implementation for SELinux, device mapper multipath support, the address space randomization patches, a restored Philips webcam driver (still lacking full functionality), full I/O barrier support for serial ATA drives, resource limits which can be used to allow unprivileged users to run tasks with realtime priority, and a huge pile of fixes. See the long-format changelog for the details back to 2.6.12-rc2. For details prior to that, see the long-format changelogs for 2.6.12-rc2 and 2.6.12-rc1.

No 2.6.13 prepatches have yet been released. There are, however, a few hundred patches in Linus's git repository, including a big SCSI subsystem update (the venerable SCSI changer driver has finally been merged), version 18 of the wireless extensions (with WPA2 security support), a new SysKonnect ethernet driver, some audit subsystem improvements, some networking updates, a set of device model updates (see below), a number of virtual memory improvements, some Rock Ridge filesystem improvements, a new set of framebuffer fonts, some RAID (MD) improvements, and a number of fixes.

The current -mm tree is 2.6.12-mm1. Recent changes to -mm include a new version of the completely fair queueing (CFQ) I/O scheduler, some VFS scalability work, and lots of fixes.

Kernel development news

What to merge for 2.6.13?

Andrew Morton, looking forward to 2.6.13, has posted a list of major patches which, in his opinion, will (or will not) be merged soon. Reviewing the list, along with the subsequent discussion, gives a good sense for what the next 2.6 kernel might look like. Of course, the final product is still likely to contain a few surprises.

Some of the decisions are not particularly controversial. Andrew is likely to merge the OCFS2 filesystem, some Xen precursor patches, execute in place support, software suspend support for SMP systems, some kernel timer performance improvements, various KProbes updates, the RapidIO subsystem, some scheduler tweaks, and some memory management work. Nobody has really complained about the inclusion of any of these patches (yet), so their path into the kernel might be relatively smooth.

One patch which has gotten surprising support is kexec, which was first covered here in November, 2002. The ability to quickly boot a new kernel without going through the system firmware is nice, but the real payoff for kexec comes when it is combined with kernel crash dumps. Crash dumps can be a useful diagnostic tool, especially for vendors who are trying to track down a bizarre crash which only occurs at a customer's site. So various distributors have included some sort of crash dump capability in their kernels for some time. These patches will typically write kernel memory to a disk or network device, then reboot the system.

The approaches taken to crash dumps so far share one significant problem: they all rely on the kernel to create its own dump. But this is a kernel which has just gone into panic mode; it is not in a stable state. The chances of an oopsing kernel completing a satisfactory crash dump are not all that high (Arjan van de Ven estimates that it works about 10% of the time). The real problem, however, is the risk involved in allowing an unstable kernel to continue performing I/O; there is a very real possibility that a (corrupted) crash dump could end up being written on top of something that the owner would have preferred to keep.

The kexec approach gets around this problem by rebooting the system before performing the dump. The normal, production kernel is configured to set aside a small range of memory, which it never uses. Instead, a different kernel is loaded into that memory; this kernel will be small, and configured to do little other than performing crash dumps. If the system should panic, kexec is used to immediately boot into the crash dump kernel. This kernel, which will be starting fresh and in a known state, can then write the contents of memory to some sort of permanent store before rebooting into a new production kernel. This approach is safer and more reliable; the mailing list discussion has been favorable enough that kexec/kdump appears likely to be merged.

The reiser4 filesystem has sat in the -mm tree for some time, and Andrew indicated that he might merge it this time around. Reiser4 has run into trouble in the past, mostly as a result of its "file as a directory" semantics, which change how Linux works, can confuse tools, and, crucially, can lead to system deadlocks. This feature has been disabled for now, but there is still opposition to merging reiser4 into the mainline.

The main issue this time around would appear to be the plugin architecture used by reiser4. Plugins can be used to change the behavior of the filesystem in many ways, from adding compression to completely changing how the file is laid out on disk. The plugin mechanism is a key part of Hans Reiser's longer-term vision of how filesystems should work; he hopes to eventually move all kinds of functionality into the filesystem level. The kernel developers, however, do not think that this sort of mechanism should be built into a filesystem; instead, much of what plugins do belongs in the VFS layer. So they would like to see reiser4 slimmed down into a much smaller, dumber system, with the plugin capability added on top of it and made available for other filesystems as well.

Hans is resisting making this (large) change; he asks that the review process take a different tack:

How about review by benchmark instead? It works, it runs faster than the competition, users like it, we addressed the core kernel patch complaints, it should go in and receive the exposure that will result in lots of useful improvements and suggestions. It seems like we are getting an unusual review process.

Things appear to be at a standoff which could block the inclusion of reiser4 for some time.

Yet another change under consideration is configurable clock frequencies for the i386 and ia-64 architectures. The current value (1000Hz) turns out not to be optimal for all users; lower clock frequencies can improve throughput on some systems at the cost of coarser timer resolution and possibly increased latencies. There have been complaints about the new default (250Hz) and the fact that the patch is going in at all when more sweeping changes to the timer system (such as the dynamic tick patch) are waiting in the wings. Your editor's guess is that the patch will be merged, but the default may be changed to keep the current HZ value.
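The resolution tradeoff is simple arithmetic, which the following userspace sketch spells out. The helper names tick_usec() and ticks_in() are made up for illustration and are not kernel functions; the only inputs are the HZ values mentioned above (1000Hz, also written 1KHz, and 250Hz) plus the older 100Hz value for comparison.

```c
#include <assert.h>

/* Hypothetical helper: length of one timer tick, in microseconds,
 * for a given clock frequency (HZ). */
static unsigned long tick_usec(unsigned int hz)
{
    assert(hz > 0);
    return 1000000UL / hz;
}

/* Hypothetical helper: number of timer interrupts taken over a span
 * of wall-clock seconds at a given HZ. */
static unsigned long ticks_in(unsigned int hz, unsigned int seconds)
{
    assert(hz > 0);
    return (unsigned long)hz * seconds;
}
```

At 1000Hz, tick_usec() yields a 1000-microsecond tick; at 250Hz the tick stretches to 4000 microseconds, but ticks_in() shows the system fielding a quarter as many timer interrupts over the same period. That is the throughput-versus-resolution trade being argued over.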

FUSE (user-space filesystems) is being discussed again. FUSE has run into opposition due to the way it overrides the file permissions checking done at the VFS level. There does not appear to be any solution to this issue that pleases everybody, so it is hard to say where this one might go. It is possible that FUSE will be merged, but without its particular permissions behavior - a solution which would leave a number of FUSE users still needing to apply a patch to get the behavior they want.

It didn't appear on Andrew's list, but the removal of devfs has also been a discussion item. Andrew didn't entirely like the full patch set which completely removed devfs from the kernel; he wondered what would happen if enough people complained and devfs had to be restored at some point in the future. So the current approach is to simply remove the devfs configuration option, making the functionality inaccessible. Eventually, if no major problems turn up, the code can be removed for real.

A big set of driver core changes

Greg Kroah-Hartman has gotten 2.6.13 off to a good start with a massive set of driver core patches. There are a fair number of API changes that come with this patch set, so the whole thing is worth a look. In-tree code has been fixed to use the new API, but, as always, maintainers of external code are on their own.

Two of the more significant changes were covered here last March. The interfaces have not changed since then, so that coverage will not be duplicated. The first of these changes is the complete rework of the "class" API. The interface known as "class_simple" turned out to be the best way to work with classes, so Greg reworked it as the class API, changing everything as he went. The interface known as class_simple is no more, but the new class API looks much like class_simple used to. The other change is the addition of the "klist" type: an extension to the kernel linked list type which includes its own, built-in reference counting and locking.

The next change is in the prototypes of the store() and show() callbacks for device attributes. These callbacks now look like:

    ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                    char *buf);
    ssize_t (*store)(struct device *dev, struct device_attribute *attr,
                     const char *buf, size_t count);

In each case, the callbacks have picked up a pointer to the actual attribute being accessed, allowing one callback to handle multiple attributes.
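To see why the extra pointer is useful, consider the following userspace mock-up. The struct definitions are simplified stand-ins rather than the kernel's own, and the sensor device, its fields, and the attribute names are all invented for the example. What it tries to show is one show() callback, with the prototype given above, serving two attributes by comparing the attr pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Userspace stand-ins for the kernel types; NOT the real definitions. */
struct device {
    int temp_min;
    int temp_max;
};

struct device_attribute {
    const char *name;
};

/* Hypothetical attributes for an imaginary sensor driver. */
static struct device_attribute attr_temp_min = { "temp_min" };
static struct device_attribute attr_temp_max = { "temp_max" };

/* A single show() callback that serves both attributes; the attr
 * pointer tells it which one is being read. */
static ssize_t sensor_show(struct device *dev, struct device_attribute *attr,
                           char *buf)
{
    int value;

    assert(dev && attr && buf);
    if (attr == &attr_temp_min)
        value = dev->temp_min;
    else if (attr == &attr_temp_max)
        value = dev->temp_max;
    else
        return -1;  /* unknown attribute; kernel code would return -errno */

    /* Return the number of bytes placed in buf. */
    return sprintf(buf, "%d\n", value);
}
```

Under the older prototypes, which lacked the attribute pointer, a driver in this position would have needed a separate callback for each attribute.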

There have been a number of internal changes to device model data structures which really shouldn't affect other code, but which might anyway. Various internal lists have been removed; in some cases, they have been replaced with klists. And a number of character pointers are now explicitly const pointers.

Code wanting to look through the devices bound to a driver can use a new function to iterate through the list:

    int driver_for_each_device(struct device_driver *driver,
                               struct device *start,
                               void *data,
                               int (*fn)(struct device *, void *));

This function will call fn() for each device bound to the given driver, stopping at the end of the list or when fn() returns a non-zero value.
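The calling convention can be illustrated with a userspace mock-up. Everything here is a stand-in: the structures are simplified, the device list is a plain singly linked list rather than whatever the driver core uses internally, and mock_driver_for_each_device() is named differently to make clear that it is not the kernel function. Two details are assumptions rather than documented behavior: a non-NULL start is treated as the first device to visit, and the return value is the last value returned by fn().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins; NOT the kernel's definitions. */
struct device {
    const char *name;
    struct device *next;
};

struct device_driver {
    struct device *devices;   /* head of the bound-device list */
};

/* Mock iterator: calls fn() on each device, stopping at the end of
 * the list or when fn() returns non-zero. */
static int mock_driver_for_each_device(struct device_driver *driver,
                                       struct device *start, void *data,
                                       int (*fn)(struct device *, void *))
{
    struct device *dev;
    int ret = 0;

    assert(driver && fn);
    /* Assumption: a non-NULL start is the first device visited. */
    for (dev = start ? start : driver->devices; dev; dev = dev->next) {
        ret = fn(dev, data);
        if (ret)
            break;
    }
    return ret;
}

/* Hypothetical callback state: the name to find, and a visit counter. */
struct find_ctx {
    const char *wanted;
    int visited;
};

/* Callback: returns non-zero on a name match, which ends the walk. */
static int match_name(struct device *dev, void *data)
{
    struct find_ctx *ctx = data;

    ctx->visited++;
    return strcmp(dev->name, ctx->wanted) == 0;
}
```

The data pointer is the only channel between caller and callback, so any search key or accumulated result travels in a caller-defined structure such as find_ctx here.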

Inodes in sysfs now have an i_op->setattr() function, meaning that their permissions can be changed and those changes will last for as long as the system runs. Changing of sysfs permissions was never really supported in the past; it would work for a bit, but the permissions could be reverted at seemingly random times. This is not really an API change, but creators of sysfs attributes should bear in mind that the permissions on those attributes might be changed from their original values.

Dealing with disk I/O problems

Filesystem authors try hard to avoid losing data. Many of them have discovered, the hard way, that failure to return a user's bits in exactly the same condition as when they were entrusted to the filesystem can lead to serious disgruntlement down the road. There are limits to what a filesystem can do, however, when the hardware starts to fail. If a disk drive begins to go bad, or somebody yanks out a hotpluggable device, problems are simply going to happen.

So what should a filesystem do in such a case? The behavior shown by most Linux filesystems (and partially enforced by the VFS layer) is to return an I/O error status (EIO) when things start to fail, then remount the filesystem in a read-only mode in an attempt to avoid any further damage. The end result is that a user-space application might see an EIO error return once - or it might not, since not all in-kernel error codes make it all the way back to user space. After that, the returned error will be EROFS (read-only filesystem), which is not entirely illuminating.
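From the application's side, about the best that can be done today is to interpret the errno value left behind by a failed write() or fsync(). The sketch below shows one way to do that; io_error_hint() is a hypothetical helper and the message strings are invented. Only the EIO and EROFS constants and strerror() come from the standard C and POSIX interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn the errno from a failed write()/fsync()
 * into a hint for the user. The strings are illustrative only. */
static const char *io_error_hint(int err)
{
    assert(err >= 0);
    switch (err) {
    case EIO:
        return "I/O error: the device may be failing";
    case EROFS:
        return "read-only filesystem: possibly remounted after an earlier error";
    default:
        /* Fall back to the standard description for everything else. */
        return strerror(err);
    }
}
```

Even with such a helper, the application can only guess: EROFS looks the same whether the filesystem was mounted read-only on purpose or was remounted after a hardware failure.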

Back in the good old days, we would just look in the system log file to see what was really going on. The new crowd of Linux users would rather not have to do that, however; they expect the system to tell them, politely, that their hardware is on fire and that they are about to deeply regret not having run any backups since sometime last winter. The problem is that the POSIX API is simply not set up to return that sort of detailed error information. Breaking compatibility with POSIX is not an option, so something complicated would have to be done to return error information within the bounds of the current API. Beyond that, however, is the simple fact that the application which is currently beating its head against disk errors might not be the right one to be having a pleasant conversation with the user about those errors.

These issues have led Ted Ts'o to suggest that a different mechanism should be used. Rather than try to shove additional information through the existing API, the kernel should simply report events like disk disasters via an out-of-band mechanism. For example, errors could be reported with the user notification mechanism and fed into DBus for distribution. The user could then be informed of the trouble and given the opportunity to panic in a desktop-specific manner.

There seems to be a high level of agreement that the out-of-band notification is the right way of doing things. All that is needed is for somebody to do the hacking to actually make it happen.

Patches and updates

Development tools

  • Marco Costalba: qgit-0.6. (June 20, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds