Kernel development [LWN.net]

Kernel release status

The current 2.6 release is 2.6.2-rc2, which Linus announced on January 25. Changes since -rc1 include a number of architecture updates, an IrDA update, and various fixes. See the long-format changelog for the details.

The latest patch set from Andrew Morton is 2.6.2-rc2-mm1. Changes in recent -mm kernels include compilation fixes for gcc 3.5, more scheduler tweaks, a new "_relaxed" API for unordered I/O memory accesses, some code for finding dangerous sleep_on() calls (see below), x86_64 kgdb support, and many other fixes.

The current 2.4 kernel is 2.4.24; Marcelo released 2.4.25-pre7, which includes a set of architecture and filesystem updates, on January 23. Marcelo also notes that 2.4 development will not freeze before 2.4.27; there is already a set of important patches that will need to go into 2.4.26.

Cooperative Linux 0.51

Cooperative Linux is a project to make a kernel which can run cooperatively in kernel mode with other operating systems. The goal, in particular, is to run Linux as an application under Windows XP. That goal has now been achieved for "some common hardware configurations."

FUSE - implementing filesystems in user space

Last week we looked at implementing device drivers in user space. Drivers are not the only kernel functionality which can be moved across the divide, however; it is also possible to implement filesystems with user-space code. Linux has a long tradition of user-space filesystems, actually; NFS was implemented that way for quite some time. Even so, user-space filesystems are not widely used, for a number of obvious reasons (performance, security, ...). But there are situations where a user-space filesystem can be a nice thing to have.

For those situations, there is a project called FUSE. Its associated SourceForge page is not particularly enlightening; one really has to look at the project's code to understand what FUSE has to offer. Since the second FUSE 1.1 release candidate has just been announced, this seems like a good time for such an examination.

FUSE is a three-part system. The first of those parts is a kernel module which hooks into the VFS code and looks like a filesystem module. It also implements a special-purpose device which can be opened by a user-space process. It then spends its time accepting filesystem requests, translating them into its own protocol, and sending them out via the device interface. Responses to requests come back from user space via the FUSE device, and are translated back into the form expected by the kernel.

In user space, FUSE implements a library which manages communications with the kernel module. It accepts filesystem requests from the FUSE device and translates them into a set of function calls which look similar (but not identical) to the kernel's VFS interface. These functions have names like open(), read(), write(), rename(), symlink(), etc.

Finally, there is a user-supplied component which actually implements the filesystem of interest. It fills a fuse_operations structure with pointers to its functions which implement the required operations in whatever way makes sense. This interface is not well documented, but the example filesystem provided with FUSE (which implements a simple sort of loopback filesystem) is reasonably easy to follow.
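The split between the library and the user-supplied component can be pictured as a table of function pointers which the library calls into as requests arrive. What follows is only a minimal sketch of that idea: demo_operations, demo_request, demo_dispatch(), and the "hello" filesystem are hypothetical names invented here, and they do not reproduce the actual layout of fuse_operations or the FUSE library's interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a FUSE-style operations table. */
struct demo_operations {
    int (*open)(const char *path);
    int (*read)(const char *path, char *buf, size_t size);
};

/* Hypothetical request, as a library might decode it from the device. */
enum demo_opcode { DEMO_OPEN, DEMO_READ };

struct demo_request {
    enum demo_opcode opcode;
    const char *path;
    char *buf;
    size_t size;
};

/* Library side: turn a decoded request into a call through the table. */
static int demo_dispatch(const struct demo_operations *ops,
                         const struct demo_request *req)
{
    assert(ops != NULL && req != NULL);

    switch (req->opcode) {
    case DEMO_OPEN:
        return ops->open ? ops->open(req->path) : -ENOSYS;
    case DEMO_READ:
        return ops->read ? ops->read(req->path, req->buf, req->size)
                         : -ENOSYS;
    }
    return -EINVAL;
}

/* User-supplied side: a filesystem containing the single file "/hello". */
static const char hello_text[] = "hello from user space\n";

static int hello_open(const char *path)
{
    return strcmp(path, "/hello") == 0 ? 0 : -ENOENT;
}

static int hello_read(const char *path, char *buf, size_t size)
{
    size_t len = sizeof(hello_text) - 1;   /* drop the trailing NUL */

    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if (size < len)
        len = size;
    memcpy(buf, hello_text, len);
    return (int)len;
}

/* The filesystem fills the table with pointers to its own functions. */
static const struct demo_operations hello_ops = {
    .open = hello_open,
    .read = hello_read,
};
```

A caller would build a demo_request and hand it to demo_dispatch(&hello_ops, &req); operations the filesystem leaves unset come back as -ENOSYS in this sketch.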

An old filesystem module (AVFS) uses FUSE to make filesystems out of tar and zip files, but one could imagine any number of other possibilities. It would not be that hard to make filesystems which mirror a web site (in read-only mode, at least), provide access to an object database, or provide a file-per-user view of the password file, for example. FUSE could be an ideal platform for experimenters who want to take the "everything is a file" idea to its limit.

sleep_on() in 2.6.

One of the many goals for the 2.5 development series was the removal of the sleep_on() function (and its variants). The purpose of sleep_on() is to cause a process to block until some condition comes true; unfortunately, it is almost impossible to use safely. Almost every call to sleep_on() looks something like the following:

    while (we_have_to_wait)
        sleep_on(&some_wait_queue);

The problem is that the situation can change between the test (in the while loop) and when the process actually goes to sleep. If the wakeup event happens between the two, the process will miss it and may sleep forevermore. Given that 2.6 was intended to be a more responsive kernel than its predecessors, this behavior is considered undesirable. The only way to avoid it, however, is to hold the Big Kernel Lock (BKL) in the code which calls sleep_on() - and the code which performs the wakeup. Since elimination of the BKL was also on the to-do list, there is little enthusiasm for fixing sleep_on() race conditions that way.

The 2.4 kernel provided a couple of safer ways to sleep: the wait_event() macro or a full "manual sleep" calling schedule() directly (though the latter can be hard to do correctly). In 2.5, the prepare_to_wait() function was added as an easier (and better performing) way of doing manual sleeps. Even so, the 2.6.2-rc2 kernel still has over 400 calls to the various forms of sleep_on(). Clearly, the goal of getting rid of that function was not achieved.
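The difference between the two approaches comes down to ordering: sleep_on() tests the condition before the process is on the wait queue, while a manual sleep queues the process and marks it as sleeping before testing. The following user-space toy model walks through both orderings with the wakeup injected at the worst possible moment. It is a single-threaded simulation, not kernel code, and every name in it (demo_task, demo_wake_up(), and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one task and one wait queue; all names are hypothetical. */
enum demo_state { DEMO_RUNNING, DEMO_SLEEPING };

struct demo_task {
    enum demo_state state;
    bool on_queue;              /* is the task on the wait queue? */
};

static bool condition_ready;    /* the event the task is waiting for */

/* Waker: make the condition true, then wake the task if it is queued. */
static void demo_wake_up(struct demo_task *t)
{
    assert(t != NULL);
    condition_ready = true;
    if (t->on_queue)
        t->state = DEMO_RUNNING;
}

/* Scheduler: the task actually blocks only if still marked as sleeping. */
static bool demo_schedule_blocks(const struct demo_task *t)
{
    return t->state == DEMO_SLEEPING;
}

/*
 * sleep_on()-style ordering: test first, queue and sleep afterward.
 * The wakeup fires between the test and the sleep.
 * Returns true if the task blocks with no wakeup left to rescue it.
 */
static bool racy_wait_blocks_forever(void)
{
    struct demo_task t = { DEMO_RUNNING, false };

    condition_ready = false;
    if (!condition_ready) {          /* test: we have to wait */
        demo_wake_up(&t);            /* event fires; the queue is empty */
        t.on_queue = true;           /* now the task is queued... */
        t.state = DEMO_SLEEPING;     /* ...and marked as sleeping */
        return demo_schedule_blocks(&t);
    }
    return false;
}

/*
 * Manual-sleep ordering: queue and mark sleeping first, then test.
 * The wakeup fires either before the test or between test and schedule.
 */
static bool safe_wait_blocks_forever(bool wake_after_test)
{
    struct demo_task t = { DEMO_RUNNING, false };
    bool blocked = false;

    condition_ready = false;
    t.on_queue = true;
    t.state = DEMO_SLEEPING;
    if (!wake_after_test)
        demo_wake_up(&t);
    if (!condition_ready) {
        if (wake_after_test)
            demo_wake_up(&t);        /* resets the state to running */
        blocked = demo_schedule_blocks(&t);
    }
    t.on_queue = false;              /* clean up after the wait */
    t.state = DEMO_RUNNING;
    return blocked;
}
```

In this model racy_wait_blocks_forever() reports a lost wakeup, while safe_wait_blocks_forever() does not for either placement of the wakeup: the waker always finds the task on the queue and puts it back into the running state before the scheduler is consulted.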

At this point, many people will have concluded that the effort to remove sleep_on() has been put on hold until 2.7 opens up. It seems, however, that most users of sleep_on() may yet get fixed in 2.6. In response to some discussion on the topic, Al Viro stated:

We need to remove racy uses anyway - that can't wait for 2.7. And I really wonder if there will be anything left after that - right now only reiserfs uses look like something that might be not immediately broken.

He also noted that any use of sleep_on() within device drivers is inherently broken.

Andrew Morton took the next step in 2.6.2-rc1-mm2; that kernel includes a patch which dumps out a bunch of debugging information whenever sleep_on() is called without the BKL held. That code has already turned up a few bad calls which have been duly reported to the kernel list. Fixes for those calls have been somewhat slower in coming. They will likely arrive, however, and as Al speculated, by the time all the bad calls are fixed there may not be a whole lot left. sleep_on() will probably still exist when the 2.7.0 kernel is released, but there may be very few callers of it by then.

Module unloading in a reference counted world

Increasingly, the kernel uses reference counts to know when data structures are no longer needed and can be reclaimed. This reference counting tends to be managed by the kobject type, though other mechanisms are used as well. When properly used, this mechanism works well. Interesting issues can come up, however, when reference-counted objects are maintained by code in loadable modules. In many situations, the module cannot be unloaded until all objects it has created have seen their reference counts go to zero and have been returned to the system. Otherwise, the system can be left with objects containing invalid references to module code which no longer exists. Bad things usually result from that situation.

Alan Stern recently ran into this sort of situation; his module registers various structures with the device model, and must be sure not to allow itself to be unloaded until those structures have been released. To that end, he wrote a patch adding two functions (class_device_unregister_wait() and platform_device_unregister_wait()) which unregister those structures and explicitly wait until they have been released. This patch did not get very far, however; it was quickly pointed out that, with this code, it is relatively easy to deadlock the kernel. If the process trying to remove the module also has an open file descriptor to one of that module's sysfs entries, everything comes to a halt. The suggested solution, instead, is to simply not allow the module to be unloaded if it still has unreclaimed objects outstanding.

That approach is taken in some other contexts. The cdev structure used to represent char devices uses a kobject for its reference count. The cdev_get() function does more than just increment the count in the kobject, however; it also increments the reference count for the module which drives that device. If any cdev structure owned by a module has references, the module, too, will have a non-zero reference count and will not be unloadable.
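The coupling between the two counts can be modeled in a few lines of user-space C. This is a deliberately simplified sketch, not the kernel's implementation: demo_module and demo_cdev are hypothetical stand-ins for struct module and struct cdev, a plain integer plays the role of the embedded kobject, and only references taken through demo_cdev_get() are counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for struct module and struct cdev. */
struct demo_module {
    int refcount;
};

struct demo_cdev {
    int refcount;               /* plays the role of the embedded kobject */
    struct demo_module *owner;  /* the module which drives this device */
};

/* Taking a reference to the device also pins the owning module. */
static void demo_cdev_get(struct demo_cdev *cdev)
{
    cdev->owner->refcount++;
    cdev->refcount++;
}

/* Dropping the device reference releases the pin on the module. */
static void demo_cdev_put(struct demo_cdev *cdev)
{
    assert(cdev->refcount > 0);
    cdev->refcount--;
    cdev->owner->refcount--;
}

/* The module may be removed only when nothing holds a reference to it. */
static bool demo_module_can_unload(const struct demo_module *mod)
{
    return mod->refcount == 0;
}
```

As long as any demo_cdev owned by the module has an outstanding reference, demo_module_can_unload() returns false; once the last reference is dropped, the module becomes removable again.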

Another approach has been taken in the network subsystem. The net_device structure represents a network device; its rules say that it must be allocated dynamically, with alloc_netdev(). When the network driver is done with the structure, it calls free_netdev() to get rid of it. The net_device structure has its own reference count, but it is not tied to the underlying module's reference count. Instead, the networking system guarantees that, once free_netdev() has been called, it will not call into the module again for that device. The release function for the net_device structure, which returns its memory to the system, lives in the networking code, rather than in any loadable module. As a result, the module can be removed even while some of its net_device structures continue to exist, and all will be well. Those structures have been detached from the module which created them, and will be freed by core kernel code.
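The essential points of this scheme are that the final release happens in core code and that nothing calls back into the driver once it has let go of the structure. A rough user-space model of those two rules might look like the following; all of the names (demo_netdev, demo_alloc(), demo_free(), and the rest) are hypothetical, and the real networking code is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified model of a core-owned device structure. */
struct demo_netdev {
    int refcount;
    bool detached;                            /* driver has let go */
    void (*driver_op)(struct demo_netdev *);  /* lives in module code */
};

static int demo_released;   /* number of structures freed by core code */
static int demo_op_calls;   /* number of calls made into the "module" */

/* Stand-in for a function living in the loadable module. */
static void demo_driver_op(struct demo_netdev *dev)
{
    (void)dev;
    demo_op_calls++;
}

/* Core code: the final release never touches module code. */
static void demo_core_release(struct demo_netdev *dev)
{
    free(dev);
    demo_released++;
}

/* Core code: drop one reference; the last one frees the structure. */
static void demo_dev_put(struct demo_netdev *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0)
        demo_core_release(dev);
}

/* Core code: allocate dynamically, with one reference for the driver. */
static struct demo_netdev *demo_alloc(void (*op)(struct demo_netdev *))
{
    struct demo_netdev *dev = calloc(1, sizeof(*dev));

    if (dev == NULL)
        return NULL;
    dev->refcount = 1;
    dev->driver_op = op;
    return dev;
}

/* Driver is done: sever the link to module code, drop its reference. */
static void demo_free(struct demo_netdev *dev)
{
    dev->detached = true;
    dev->driver_op = NULL;
    demo_dev_put(dev);
}

/* Core code: call into the driver only while it is still attached. */
static bool demo_call_driver(struct demo_netdev *dev)
{
    if (dev->detached || dev->driver_op == NULL)
        return false;
    dev->driver_op(dev);
    return true;
}
```

In this model, a structure which still has outside references after demo_free() simply lingers until the last demo_dev_put(); at no point after demo_free() is a pointer into the module followed, so the module itself can safely go away.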

The real lesson from all this, perhaps, is that the kernel developers are still figuring out the implications of the lifetime rules of the objects they create. The addition of sysfs in 2.5 has tended to force this issue; sysfs exposes a great many internal kernel objects to user space, which can keep references to those objects for an indeterminate period of time. Making everything work safely in this environment has proved to be a challenge at times.

And module unloading, of course, will always be a challenge. There will likely always be issues involved with removing code from a live kernel. As Linus put it:

The proper thing to do (and what we _have_ done) is to say "unloading of modules is not supported". It's a debugging feature, and you literally shouldn't do it unless you are actively developing that module.

Experience shows that many users are not happy with a kernel which cannot unload modules, however. So the kernel developers are likely to be wrestling with these issues for some time yet.

Patches and updates

Linus Torvalds Linux v2.6.2-rc2

Andrew Morton 2.6.2-rc2-mm1

Randy.Dunlap 2.6.2-rc2-kj1 patchset

Andrew Morton 2.6.2-rc1-mm1

Andrew Morton 2.6.2-rc1-mm2

Andrew Morton 2.6.2-rc1-mm3

Marcelo Tosatti Linux 2.4.25-pre7

Bernhard Rosenkraenzer 2.4.25-pre7-pac1

Benjamin Herrenschmidt Big powermac update

Tim Hockin NGROUPS 2.6.2rc2

Mikael Pettersson perfctr-2.6.5 released

Carl-Daniel Hailfinger [2.4] forcedeth network driver

Heilmann, Oliver AGPGART preliminary SiS648 support - fixed and shrunk

Len Brown ACPI for 2.6

Jaroslav Kysela ALSA 1.0.2 release

jd ANNOUNCE: UNH ISCSI drivers for 2.6.1

Moore, Eric Dean 2.6.2-rc2 - MPT Fusion driver 3.00.02 update

Greg KH USB update for 2.6.2-rc2

Karim Yaghmour relayfs patches for 2.6.1

Miklos Szeredi FUSE 1.1-pre2

raven@themaw.net autofs4-2.6 - to support autofs 4.1.x

Rik van Riel RSS limit enforcement for 2.6

Marcel Sebek IMQ port to 2.6

Patrick McHardy : altq HFSC port

James Morris Security mount data & sb_copy_data hook.

Greg KH udev 014 release

Mariusz Mazur Userland headers available

Neil Brown ANNOUNCE: mdadm 1.5.0 - A tool for managing Soft RAID under Linux
