Kernel development

Kernel release status

The current development kernel is 2.5.72, which was released by Linus on June 16. This relatively small patch contains an x86-64 merge, a partial reversion of the IDE taskfile switchover, a PA-RISC update, and various fixes and cleanups. The long-format changelog has the details.

Linus had released the 2.5.71 ("sticky turtle") kernel only two days before. This long-awaited patch included a fair amount of driver model work, some extensive PCI bus cleanups (dealing with potential race conditions there), the big IDE changeover to taskfile I/O, a new /proc/kallsyms file, support for per-CPU variables in modules, a change to the kmalloc_percpu() interface, an Atmel at76c50x wireless driver, a long-sought fix for hanging TCP sessions, an improved slab allocator which performs better in busy, multi-processor situations, some kbuild tweaks, an ALSA update, a set of hash function changes to deal with algorithmic complexity attacks, a FAT filesystem rework (if you have been waiting to be able to create FAT partitions greater than 128GB, this patch is for you), a v850 subarchitecture merge, a RAID update, the removal of the long-deprecated callout TTY device (/dev/cua) support, numerous architecture updates, and several other fixes and updates. As always, the long-format changelog has the gory details.

Linus's BitKeeper tree contains an extensive ext3 and JBD rework (see below), an OProfile update, some NFS server fixes, and a few other fixes and updates.

With the 2.5.72 announcement, Linus announced that he is taking a leave of absence from Transmeta to go work at the Open Source Development Lab. "Transmeta has always been very good at letting me spend even an inordinate amount of time on Linux, but as a result I've been feeling a little guilty at just how little 'real work' I got done lately. To fix that, I'll instead be working at OSDL, finally actually doing Linux as my main job."

The current stable kernel is 2.4.21, released, at last, on June 13. There were no changes since -rc8.

No 2.4.22 prepatches have come out yet. Marcelo's plan, at this point, is to have 2.4.22 contain an updated aic7xxx driver and the current ACPI tree (both items that people had wanted in 2.4.21), along with some interactivity and memory management fixes.

Kernel development news

What's needed to fix user-space device enumeration?

Back in April, LWN looked at udev, a simple user-space daemon which handles the dynamic creation and removal of device nodes. Udev is an answer to devfs which uses hotplug events and sysfs to manage the device tree in user space. Things have been fairly quiet on the udev front - at least, on the public lists. That changed, however, when Steven Dake posted a patch aimed at fixing some problems he sees with how udev works. At that point, it became clear that an off-list discussion had been going on for some time.

Mr. Dake has a list of four problems that he is trying to fix with his patch, which creates an event queue within the kernel and a virtual device for retrieving events from that queue. These problems are:

  • The current implementation (which invokes /sbin/hotplug for each device event) has performance problems when the number of devices is large.

  • There is no policy controlling how many /sbin/hotplug processes can be created simultaneously, a shortcoming which can lead to out-of-memory situations.

  • /sbin/hotplug is not available during the early part of the system initialization process, so early device enumeration is not possible.

  • Hotplug events can be processed out of order, leading to device directory corruption.

The posting elicited some strongly-worded responses. The general view is that the first three of the problems listed above do not actually exist. The cost of /sbin/hotplug is small relative to the cost of device probing and initialization, so, in the real world, system load and performance are not problems. Early initialization can be handled with initramfs or by reconstructing things in user space from the sysfs tree. The hotplug developers thus feel no pressure to "fix" any of those problems. Linus also chimed in with a condemnation of event daemon schemes.

When the dust settled, however, the problem of event reordering remained. Device events can come quickly, and the vagaries of scheduling, page faults, etc. can cause them to be processed in an order different from that in which they were generated. Some fairly complicated schemes were presented for dealing with this problem, but they were set aside when Andrew Morton suggested the (in retrospect) obvious: add a sequence number to hotplug events. With a unique, increasing sequence number, it is simple for a user-space process to detect (and fix) misordered events. Problem solved.
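To see why a sequence number is sufficient, consider how a user-space consumer might use one. What follows is only a sketch under stated assumptions, not the code of any real hotplug agent: the names (reorder_buf, reorder_receive(), deliver()) and the window size are invented for illustration, sequence numbers are assumed to increase by one per event, and wraparound is ignored. Events that arrive early are held back until the gap before them has been filled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WINDOW	16	/* assumed limit on how far ahead an event may arrive */
#define LOG_MAX	64	/* size of the demonstration log */

/* Hypothetical user-space state; none of these names come from the kernel. */
struct reorder_buf {
	unsigned long	next_seq;		/* next sequence number to deliver */
	bool		pending[WINDOW];	/* pending[s % WINDOW]: event s is waiting */
	unsigned long	delivered[LOG_MAX];	/* delivery order, recorded for inspection */
	size_t		n_delivered;
};

void reorder_init(struct reorder_buf *rb, unsigned long first_seq)
{
	rb->next_seq = first_seq;
	rb->n_delivered = 0;
	for (size_t i = 0; i < WINDOW; i++)
		rb->pending[i] = false;
}

/* Stand-in for real event handling: just record the order of delivery. */
void deliver(struct reorder_buf *rb, unsigned long seq)
{
	if (rb->n_delivered < LOG_MAX)
		rb->delivered[rb->n_delivered++] = seq;
}

/*
 * Accept one event.  Returns false for a stale or duplicate event, or for
 * one too far ahead to be buffered; otherwise the event is queued and every
 * event that is now in order gets delivered.
 */
bool reorder_receive(struct reorder_buf *rb, unsigned long seq)
{
	if (seq < rb->next_seq || seq - rb->next_seq >= WINDOW)
		return false;
	if (rb->pending[seq % WINDOW])
		return false;
	rb->pending[seq % WINDOW] = true;

	/* Drain the run of consecutive events starting at next_seq. */
	while (rb->pending[rb->next_seq % WINDOW]) {
		rb->pending[rb->next_seq % WINDOW] = false;
		deliver(rb, rb->next_seq);
		rb->next_seq++;
	}
	return true;
}
```

If events 2 and 3 arrive before event 1, nothing is delivered until 1 shows up; at that point all three go out in order. A real agent would also need a policy (a timeout, perhaps) for a gap that never fills.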

Avoiding sysfs surprises

One of the nice (and increasingly important) features of the 2.5 device model is sysfs. This virtual filesystem exports a view of the system's structure to user space; it also provides a nice control interface - and /proc replacement - by allowing attributes to be attached to sysfs entries. Sysfs is not without its traps, however, and many kernel developers are just now beginning to realize the sort of care that is necessary to avoid making mistakes.

The hardware supported by Linux is increasingly dynamic; devices can appear and disappear at any time. The sysfs filesystem adjusts itself in response to hardware events by creating and removing directories associated with devices, classes, and other objects. Kernel code typically implements this functionality by allocating (and registering) device structures and other objects when a device is plugged in, and deleting those structures when the device is removed. It tends to work quite well.

But consider the following possible sequence of events:

  1. A user plugs in a shiny new hotplug PCI frobnicator.

  2. The driver creates a device structure and registers it; as a result, the directory /sys/devices/pci0/00:11.0/ (or some such) gets created and filled with attributes.

  3. A user process moves into that directory, opens one of the attribute files, but doesn't get around to reading it yet.

  4. The user, having done enough frobnication for one day, unplugs the device.

  5. The driver unregisters and frees the device structures.

All seems well, except for the small problem of that user process. By sitting in the directory, it maintains a reference there. The open attribute file is yet another reference. If the driver has truly cleaned up and freed the devices, the user process will be holding structures with pointers into freed memory. An attempt to read the (already open) attribute file at this point is almost certain to crash the system.

The above scenario is not hypothetical; a fair number of such conditions exist in the 2.5 kernel now. That is why this issue (titled "kobject refcounting") appears in the 2.6 must-fix list. It truly must be fixed.

The infrastructure exists to handle these problems, but it must be used properly to be effective. The solution lies in the same place as the problem - the kobject structure. The 2.5.72 version of this structure looks like:

struct kobject {
	char			name[KOBJ_NAME_LEN];
	atomic_t		refcount;
	struct list_head	entry;
	struct kobject		*parent;
	struct kset		*kset;
	struct kobj_type	*ktype;
	struct dentry		*dentry;
};

Entries in sysfs are closely tied to kobjects; there is a kobject associated with each directory in the filesystem. When a process moves into a sysfs directory or opens a sysfs file, the associated kobject has its refcount field incremented. As long as the reference count is above zero, the kobject cannot be deleted.

The same kobjects, of course, are embedded deeply within the structures used to represent devices and other system objects. So a nonzero reference count in a kobject means that the entire device structure (and, perhaps, the module infrastructure supporting it) is still in use. Safely putting things into sysfs is really just a matter of not deleting objects until their reference counts hit zero.

Of course, that is easily said, but the current mechanism for implementing such a policy is not entirely obvious. An example might help, so we'll look at the block subsystem, which does things right. Disks, within the kernel, are represented by the gendisk structure. The function used to create a gendisk is alloc_disk(), which, after allocating and initializing a gendisk structure (which contains a kobject), executes this mysterious line of code:

    kobj_set_kset_s(disk,block_subsys);

This line tweaks the kobject within disk (the gendisk structure) to make it a part of block_subsys. The block subsystem structure, in turn, contains a pointer to a kobj_type structure, which, in this case, looks like:

static struct kobj_type ktype_block = {
	.release	= disk_release,
	.sysfs_ops	= &disk_sysfs_ops,
	.default_attrs	= default_attrs,
};

We'll come back to this structure in a moment. For now, suffice it to say that it identifies the kobject (and the gendisk structure that contains it) as something belonging to the block code, and provides some methods implementing the object's operations.

The function which puts a new disk into the system is add_disk(); it creates the associated sysfs structure, and increments the disk's reference count. The disk then goes through its lifecycle, with the reference count going up and down as it is mounted and unmounted, and as its sysfs files are accessed. Should the disk disappear, the driver will do some cleanup and call del_gendisk() to return the gendisk structure to the system.

del_gendisk() does not actually free the structures, however. It removes the sysfs entries and generally shuts things down; it then finishes by decrementing the reference count. That operation releases the reference which was first obtained in add_disk(). The driver also must release its own reference with put_disk(). These operations may drop the reference count to zero - if nobody else is holding a reference to the disk. But there is no way to know ahead of time.

Sooner or later, however, the last reference will go away. The function which actually decrements the count (kobject_put()) tests that count for zero. If no references remain, kobject_put() will go back to the kobj_type structure associated with the kobject (the ktype_block we saw above, in the case of a gendisk) and call the release() method found there. That method, knowing that nobody is referring to the object, can actually remove it from the system.

That is how sysfs objects must be managed. They must have a destructor associated with them, by way of the kobj_type structure, and that destructor must understand the higher-level objects that it is dealing with. With this mechanism in place, objects will continue to exist as long as references to them are held.
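The pattern is easier to see in isolation. The following is a user-space model, not kernel code: every name ending in _model is invented for illustration, the counter is a plain int where the kernel uses an atomic_t, and there is no locking. It shows a reference-counted object embedded in a larger structure, with a release() method, reached through a type structure, that runs only when the last reference is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct kobj_model;

/* Plays the part of kobj_type: it carries the destructor. */
struct ktype_model {
	void (*release)(struct kobj_model *kobj);
};

/* Plays the part of the kobject. */
struct kobj_model {
	int				refcount;
	const struct ktype_model	*ktype;
};

/* A higher-level object with the counted object embedded in it. */
struct disk_model {
	int			minors;
	struct kobj_model	kobj;
};

/* Recover the containing structure from a pointer to its member. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int disks_released;	/* counts destructor calls, for the demonstration */

/* The destructor knows what the embedded object really belongs to. */
void disk_release_model(struct kobj_model *kobj)
{
	struct disk_model *disk =
		container_of_model(kobj, struct disk_model, kobj);

	disks_released++;
	free(disk);
}

const struct ktype_model ktype_disk_model = {
	.release = disk_release_model,
};

struct kobj_model *kobj_get_model(struct kobj_model *kobj)
{
	kobj->refcount++;
	return kobj;
}

/* Drop one reference; the last one out calls release(). */
void kobj_put_model(struct kobj_model *kobj)
{
	if (--kobj->refcount == 0)
		kobj->ktype->release(kobj);
}

/* Allocate a disk holding one reference, owned by the caller. */
struct disk_model *disk_alloc_model(int minors)
{
	struct disk_model *disk = calloc(1, sizeof(*disk));

	if (!disk)
		return NULL;
	disk->minors = minors;
	disk->kobj.ktype = &ktype_disk_model;
	disk->kobj.refcount = 1;
	return disk;
}
```

In this model, a "driver" that allocates a disk, takes a second reference when registering it, and then sees a "user" take a third can drop its own two references without freeing anything; the structure goes away only when the user's put brings the count to zero. No caller ever calls free() directly.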

Of course, things can get more complicated than that. If, for example, a module adds attributes to sysfs entries, that module cannot be removed until it is certain that all of the relevant references have gone away. It gets even worse if kernel code tries to attach attributes to objects which it does not own; in that case it can be very hard to get everything right. It may eventually prove necessary to rework some of the sysfs interfaces to make it easier to avoid mistakes, but that seems unlikely for 2.5 at this point. In the meantime, connecting the pieces together correctly can be an intimidating task the first time around, but the alternative is to put denial of service vulnerabilities into the kernel.

Big changes to ext3 and journaling

The ext3 filesystem is, for many, the standard journaling filesystem for the Linux kernel. So it has been somewhat embarrassing that ext3 still uses a number of deprecated interfaces, including the big kernel lock and sleep_on(). The big kernel lock (BKL) is a holdover from the initial Linux symmetric multiprocessing implementation, when it was not safe for more than one processor to run in the kernel at the same time. Its presence in ext3 is not just considered archaic and inelegant; it is also a serious performance constraint on larger SMP systems.

As of 2.5.73, the BKL has been abolished from ext3, thanks to a lengthy series of patches by Andrew Morton and Alex Tomas. These patches never did show up on linux-kernel, but they have been part of the -mm kernel tree for some time. Says Andrew:

"My gut feeling is that there should be one, maybe two bugs left in it, but no problems have been discovered..."

So, as with all development kernels, a bit of caution is called for.

Removing the BKL from ext3 was actually a simple thing to do. That filesystem, itself, had no need for the BKL - it is the generic journaled block device (JBD) layer that required that protection. So the first step was to push the BKL down a layer, and ext3 was BKL-free. Of course, that didn't solve the real problem, but it was a start. While ext3 was being worked on, a few other patches went in:

  • Concurrent block and inode allocation, much like ext2 has had for some time. This patch puts a separate spinlock on each cylinder group in a filesystem, allowing allocation to happen in multiple groups simultaneously.

  • "Fuzzy counters," which implement approximate counters for free blocks and inodes using per-CPU variables.

  • The ext3 "data=journal" mode has been fixed. This mode, which journals all data written to the disk (rather than just the metadata) has been broken for a long time.
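The idea behind an approximate counter can be shown in a few lines. This is a sketch of one common way to build such a thing, not the code from the patch: the names, the slot count, and the batch size are all assumptions, and a plain array indexed by CPU number stands in for real per-CPU variables. Each CPU accumulates changes in its own slot and folds them into the shared total only when the local delta grows large, so the shared value is touched rarely and reading it is cheap, at the cost of being slightly out of date.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_SLOTS	4	/* assumed number of CPUs */
#define BATCH		8	/* assumed size at which a local delta is folded in */

/* Hypothetical structure; not the kernel's definition. */
struct fuzzy_counter {
	long	global;			/* approximate total, cheap to read */
	long	local[NR_SLOTS];	/* per-CPU changes not yet folded in */
};

/* Add (or, with a negative amount, subtract) on behalf of one CPU. */
void fuzzy_add(struct fuzzy_counter *fc, int cpu, long amount)
{
	long delta = fc->local[cpu] + amount;

	if (labs(delta) >= BATCH) {
		/* A real implementation would take a lock around this update. */
		fc->global += delta;
		fc->local[cpu] = 0;
	} else {
		fc->local[cpu] = delta;
	}
}

/* Fast, approximate read: ignores whatever is still sitting in the slots. */
long fuzzy_read(const struct fuzzy_counter *fc)
{
	return fc->global;
}

/* Slow, exact read: must visit every CPU's slot. */
long fuzzy_read_exact(const struct fuzzy_counter *fc)
{
	long sum = fc->global;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += fc->local[i];
	return sum;
}
```

Since each slot always holds a delta smaller in magnitude than BATCH, the fast read is off by less than NR_SLOTS * BATCH. For something like a free-block count, where a rough answer is usually good enough, that is a reasonable trade.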

With ext3 done, it was time to fix up the JBD layer. This job was not done halfway - a lengthy series of patches adds several locks and a whole, complicated, fine-grained scheme. Each transaction gets two separate locks (t_handle_lock and t_jcb_lock) controlling access to various data structures. There is another set for the journal: j_state_lock for scalar state information, j_list_lock for lists and buffers, and j_revoke_lock for the list of revoked blocks. Two more locks protect aspects of the buffer head/journal head combination. And, of course, there is a whole set of ordering rules to control which locks must be taken before which others. Believe it or not, there is even a certain amount of documentation in the code comments describing which locks protect which data structures.
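Ordering rules of this kind are often expressed as ranks: a lock may be taken only if it ranks higher than every lock already held. The sketch below shows how such a rule could be checked mechanically. It is purely illustrative: the rank names are hypothetical and do not reflect the actual JBD lock order, the tracker performs no locking itself, and it assumes that locks are released in the reverse of the order in which they were taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks; the real JBD ordering is documented in the code. */
enum lock_rank {
	RANK_OUTER	= 1,
	RANK_MIDDLE	= 2,
	RANK_INNER	= 3,
};

#define MAX_HELD	8

/* Per-thread record of the locks currently held, innermost last. */
struct lock_tracker {
	enum lock_rank	held[MAX_HELD];
	int		depth;
};

/*
 * Record an acquisition.  Returns false if taking this lock now would
 * violate the ordering, or if the tracker is full.
 */
bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	if (t->depth == MAX_HELD)
		return false;
	/* The stack is strictly increasing, so the top is the highest rank held. */
	if (t->depth > 0 && t->held[t->depth - 1] >= rank)
		return false;
	t->held[t->depth++] = rank;
	return true;
}

/* Record a release; only the most recently taken lock may be dropped. */
bool tracker_release(struct lock_tracker *t, enum lock_rank rank)
{
	if (t->depth == 0 || t->held[t->depth - 1] != rank)
		return false;
	t->depth--;
	return true;
}
```

With this in place, taking RANK_OUTER and then RANK_INNER is accepted, but a subsequent attempt to take RANK_MIDDLE is refused. If every code path obeys the same ranking, no two paths can end up waiting on each other's locks.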

The whole body of work clearly needs wider testing (and benchmarking), so it's probably a good time for it to go into the mainline kernel. Hopefully there won't be too many surprises lurking for the unwary (or unbacked-up). As this work stabilizes, however, another big item can be scratched off the "must-fix" list.

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds