Kernel development [LWN.net]

Kernel release status

The current stable 2.6 kernel is 2.6.14.6, released on January 7. It contains a small number of fixes, a couple of which address potential security issues. Chances are this will be the last update for the 2.6.14 kernel.

There is no 2.6.16 prepatch yet. Well over 2000 patches have been merged into the mainline git repository, however. See the separate article (below) for a list of the most significant changes.

The current -mm tree is 2.6.15-mm3. Recent changes to -mm include a big x86-64 update, sysfs support in the parallel port driver, John Stultz's core time subsystem patches, the removal of several old USB audio drivers, the openat() system call and friends, a new direct migration patch set, and multi-block allocation for the ext3 filesystem. Despite all that new stuff, -mm has thinned considerably over the last week as patches have moved into the mainline.

Quotes of the week

This kernel seems to have been a bit of a disaster - too much eggnog or something

-- Andrew Morton

It's things like this which make me consider a career in carpentry.

-- Andrew Morton

Looking forward to 2.6.16

As of this writing, well over 2000 patches have been merged for the upcoming 2.6.16 kernel. The following list covers some of the more important or user-visible patches; it is not exhaustive by any means. Links to LWN articles describing the patches have been provided where available.

The 2.6.16 merge window will remain open for some time yet, so expect some more big changes before it is done.

User-visible changes

OCFS2, Oracle's clustered filesystem.
Networking changes include per-packet access control tied into the IPSec subsystem, an implementation of the "CUBIC" congestion control algorithm for TCP, an initial implementation of the DCCP protocol over IPv6, and a sysfs interface to the network bonding module, allowing runtime reconfiguration without the need to reload the module. There is also an obscure "intermediate functional block" network device option which can be used for configuration flexibility and resource sharing.
Module versioning (storing version information to help binary modules work with more than one kernel release) is no longer considered experimental.
The hotplug helper /sbin/hotplug is now officially deprecated. The control file /proc/sys/kernel/hotplug has moved to /sys/kernel/uevent_helper, but it is expected to be disabled on most systems in favor of udev and the netlink interface.
Copy-on-write support and NUMA awareness for "hugetlb" pages.
The software suspend code has seen some work. The encryption option has been removed; it was little used and offered little protection in the first place. A few steps have been taken toward moving parts of the suspend process to user space.
The swap migration code, allowing a process's pages to follow it from one processor to another. As of this writing, the direct migration patches have not been merged.
The "SLOB allocator" has been added; it is a replacement for the Linux slab code which is suited for very small-memory systems.
The oldest supported version of gcc for kernel building is now 3.2.
The ext3 filesystem has a new mount option allowing the location of the journal device to be specified.
The module loader now explicitly checks for the ndiswrapper and driverloader modules, and will mark the kernel tainted if they are found.
V9fs (the Plan 9 filesystem) is now capable of performing zero-copy operations. Various other v9fs improvements have been added as well.
Support for the Cell architecture has been significantly filled out.
New drivers for ADI Eagle-based USB ADSL modems, ATI and Philips USB remote control units, the Marvell Yukon2 Ethernet chipset, the network interface in the Intel ixp2000 (ARM) CPU, the CS5535 audio device, Digigram PCXHR boards, and the SyncLink GT and AC serial adaptor families.

Internal API changes

Ingo Molnar's mutex code has been added. A few patches converting subsystems over to mutexes have gone in, but most of that work remains to be done.
The usb_driver structure has a new field (no_dynamic_id) which lets a driver disable the addition of dynamic device IDs. The owner field has also been removed from this structure.
Some significant changes to the SCSI subsystem aimed at eliminating the use of the old scsi_request structure. The SCSI software IRQ is no longer used; postprocessing happens via the generic block software IRQ instead.
Vast numbers of typedefs have been removed from the ALSA code, bringing that subsystem more in line with kernel coding standards. Power management support has also been added to a number of ALSA drivers.
A new workqueue function schedule_on_each_cpu() will cause a function to be called on every running processor on the system.
Much of the core device model code has been reeducated to use the term "uevent" instead of "hotplug." Some changes which are visible outside of the core code include:
- kobject_hotplug() becomes kobject_uevent()
- struct kset_hotplug_ops becomes struct kset_uevent_ops, and its hotplug() member is now uevent()
- add_hotplug_env_var() becomes add_uevent_var()
An atomic type the size of a long (and thus 64 bits wide on 64-bit architectures), atomic_long_t, has been added. Supported functions are:
- long atomic_long_read(atomic_long_t *l);
- void atomic_long_set(atomic_long_t *l, long i);
- void atomic_long_inc(atomic_long_t *l);
- void atomic_long_dec(atomic_long_t *l);
- void atomic_long_add(long i, atomic_long_t *l);
- void atomic_long_sub(long i, atomic_long_t *l);
The block I/O barrier code has been rewritten. This patch changes the barrier API and also adds a new parameter to end_that_request_last().
The block_device_operations structure has a new method getgeo(); its job is to fill in an hd_geometry structure with information about the drive. With this operation in place, many block drivers will not need an ioctl() function at all.
The dentry structure has been changed: the d_child and d_rcu fields are now overlaid in a union. This change shrinks this heavily-used structure and improves its cache behavior.
struct page has also been changed; it is now smaller on large SMP systems.
Linas Vepstas's PCI error recovery patch has been merged.
A new list function, list_for_each_entry_safe_reverse(), does just what one would expect.
The high-resolution kernel timer code has been merged. Much of the core works as described in this LWN article, but there have also been changes and most of the names are different. The new high-resolution timer interface will be discussed in the January 19 Kernel Page.
Buffering for the TTY layer has been completely redone.
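
The atomic_long_t operations listed above can be tried out in user space with a small shim. What follows is a sketch only: the kernel's real definitions are architecture-specific, and here C11 atomics are assumed as a stand-in. The demo_atomic_long() function is hypothetical and exists just to exercise each call. The ordering guarantees of the kernel functions are not addressed by this sketch.

```c
#include <stdatomic.h>

/* User-space stand-in (an assumption): wraps a C11 atomic_long. */
typedef struct { atomic_long counter; } atomic_long_t;

static inline long atomic_long_read(atomic_long_t *l)
{
    return atomic_load(&l->counter);
}

static inline void atomic_long_set(atomic_long_t *l, long i)
{
    atomic_store(&l->counter, i);
}

static inline void atomic_long_inc(atomic_long_t *l)
{
    atomic_fetch_add(&l->counter, 1);
}

static inline void atomic_long_dec(atomic_long_t *l)
{
    atomic_fetch_sub(&l->counter, 1);
}

/* Note the argument order: the value comes first, then the atomic. */
static inline void atomic_long_add(long i, atomic_long_t *l)
{
    atomic_fetch_add(&l->counter, i);
}

static inline void atomic_long_sub(long i, atomic_long_t *l)
{
    atomic_fetch_sub(&l->counter, i);
}

/* Hypothetical demo: exercise every operation once. */
static long demo_atomic_long(void)
{
    atomic_long_t pages = { 0 };

    atomic_long_set(&pages, 100);
    atomic_long_inc(&pages);        /* 101 */
    atomic_long_add(40, &pages);    /* 141 */
    atomic_long_sub(1, &pages);     /* 140 */
    atomic_long_dec(&pages);        /* 139 */
    return atomic_long_read(&pages);
}
```

The one detail worth remembering from the prototypes is that atomic_long_add() and atomic_long_sub() take the value first and the atomic variable second, the reverse of what one might guess from atomic_long_set().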

As noted above, more changes are likely; stay tuned. Remember that API changes will eventually find their way onto the LWN 2.6 API Changes Page.

The mutex API

The mutex code may well have set a record for the shortest time spent in -mm for such a fundamental patch. It would not have been surprising for mutexes to sit in -mm through at least one kernel cycle, which would have had them being merged in or after 2.6.17. But the mutex code appeared in exactly one -mm release (2.6.15-mm2, released on January 7) before being merged into the mainline on January 9.

The actual mutex type (minus debugging fields) is quite simple:

    struct mutex {
        atomic_t            count;
        spinlock_t          wait_lock;
        struct list_head    wait_list;
    };

Unlike semaphores, mutexes have one definition which is used on all architectures. Some of the actual locking and unlocking code can be overridden if it can be made to perform better on a specific architecture, but the core data structure remains the same. The count field contains the state of the mutex. A value of one indicates that it is available, zero means locked, and a negative value means that it is locked and processes might be waiting. Separating the two "locked" cases is worthwhile: in the (usual) case where nobody is waiting for the mutex, there is no need to go through the process of seeing if anybody needs to be woken up. The wait_lock spinlock controls access to wait_list, which is a simple list of processes waiting on the mutex.
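
The meaning of the count field can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel's implementation: the names model_mutex, model_mutex_init(), model_lock_fastpath() and model_unlock() are invented for this illustration, C11 atomics stand in for atomic_t, and the wait list and all sleeping are left out entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: only the count field, no wait_lock or wait_list. */
struct model_mutex {
    atomic_int count;   /* 1: unlocked, 0: locked, <0: locked, maybe waiters */
};

static void model_mutex_init(struct model_mutex *m)
{
    atomic_init(&m->count, 1);
}

/*
 * Uncontended acquisition: decrement the count. Only the caller that
 * moves it from 1 to 0 gets the lock. Any other caller leaves the count
 * negative, recording that somebody may be waiting; a real implementation
 * would then queue itself on the wait list and sleep.
 */
static bool model_lock_fastpath(struct model_mutex *m)
{
    return atomic_fetch_sub(&m->count, 1) == 1;
}

/*
 * Release: mark the mutex available again. Returns true when the old
 * value was negative, the only case in which the wait list would need
 * to be examined.
 */
static bool model_unlock(struct model_mutex *m)
{
    return atomic_exchange(&m->count, 1) < 0;
}
```

In this model, an uncontended lock/unlock pair takes count from one to zero and back, and model_unlock() reports that no wakeup is needed. A failed attempt while the mutex is held pushes count below zero, so the next model_unlock() reports that waiters may exist.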

The mutex API (obtained through <linux/mutex.h>) is simple. Every mutex must first be initialized, either at declaration time with:

    DEFINE_MUTEX(name);

or at run time with:

    mutex_init(struct mutex *lock);

Once a mutex has been initialized, it can be locked with any of:

    void mutex_lock(struct mutex *lock);
    int mutex_lock_interruptible(struct mutex *lock);
    int mutex_trylock(struct mutex *lock);

A call to mutex_lock() will lock the mutex, putting the calling process into an uninterruptible wait if need be. mutex_lock_interruptible() uses an interruptible sleep; if the lock is obtained, it will return zero. A return value of -EINTR means that the locking attempt was interrupted by a signal and the caller should act accordingly. Finally, mutex_trylock() will attempt to obtain the lock, but will not sleep; unlike mutex_lock_interruptible(), it returns zero on failure (the lock was unavailable) and one if the lock is acquired.

In all cases, the mutex must eventually be freed (by the same process which acquired it) through a call to:

    void mutex_unlock(struct mutex *lock);

Note that mutex_unlock() cannot be called from interrupt context. This restriction appears to have more to do with keeping mutexes from ever being used as completions than a fundamental restriction caused by the mutex design itself. Note also that a mutex can only be locked once - locking calls do not nest.
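
A typical calling pattern might look like the sketch below. Since kernel code cannot be built outside of a kernel tree, the first half of the example is a set of user-space stand-ins carrying the same names and return conventions as the calls described above; they are assumptions made for illustration and neither sleep nor handle signals. In particular, the stand-in mutex_lock_interruptible() simply returns -EINTR whenever the lock is held, as if a signal had arrived during the wait. The counter structure, the counter_add() and counter_try_add() helpers, and the choice of -EBUSY for a failed trylock are hypothetical as well.

```c
#include <errno.h>
#include <stdatomic.h>

/* ---- User-space stand-ins (assumptions, NOT the kernel code) ---- */

struct mutex { atomic_int count; };     /* 1: unlocked, 0: locked */

static void mutex_init(struct mutex *lock)
{
    atomic_init(&lock->count, 1);
}

/* Same convention as described above: 1 on success, 0 if unavailable. */
static int mutex_trylock(struct mutex *lock)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&lock->count, &expected, 0) ? 1 : 0;
}

/* Cannot sleep here, so a held lock is reported as an interrupted wait. */
static int mutex_lock_interruptible(struct mutex *lock)
{
    return mutex_trylock(lock) ? 0 : -EINTR;
}

static void mutex_unlock(struct mutex *lock)
{
    atomic_store(&lock->count, 1);
}

/* ---- Caller-side pattern (hypothetical example) ---- */

struct counter {
    struct mutex lock;
    long value;
};

/* May wait for the lock; gives up if the wait is interrupted. */
static int counter_add(struct counter *c, long delta)
{
    int ret = mutex_lock_interruptible(&c->lock);

    if (ret)
        return ret;             /* -EINTR: nothing was changed */
    c->value += delta;
    mutex_unlock(&c->lock);     /* released by the path that locked it */
    return 0;
}

/* Never waits; note the inverted sense of the trylock return value. */
static int counter_try_add(struct counter *c, long delta)
{
    if (!mutex_trylock(&c->lock))
        return -EBUSY;
    c->value += delta;
    mutex_unlock(&c->lock);
    return 0;
}
```

The point to take away is the difference in return conventions: a zero from mutex_lock_interruptible() means the lock is held, while a zero from mutex_trylock() means it is not. Every successful acquisition is paired with exactly one mutex_unlock() call on the same path.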

Finally, there is a function for querying the state of a mutex:

    int mutex_is_locked(struct mutex *lock);

This function will return a boolean value indicating whether the mutex is locked or not, but will not change the state of the lock.

Now that this code has been merged, the semaphore type can officially be considered to be on its way out. New code should not use semaphores, and old code which uses semaphores as mutexes should be converted over when an opportunity presents itself. The reader/writer semaphore type (rwsem) is a different beast, and is not affected by this patch. There is a debugging option which can be configured into development kernels which may help with the transition; with this option enabled, quite a few types of errors will be detected.

At this point, code which uses the counting feature of semaphores lacks a migration path. There is evidently a plan to introduce a new, architecture-independent type for these users, but that code has not yet put in an appearance. Once that step has been taken, the path will be clear for the eventual removal of semaphores from the kernel entirely.

Linux and wireless networking

Jeff Garzik's recent State of the Union: Wireless posting came right to the point:

Another banner year has passed, with Linux once again proving its superiority in the area of crappy wireless (WiFi) support. Linux oldsters love the current state of wireless, because it hearkens back to the heady days of Yuri Gagarin, Sputnik and Linux kernel 0.99, when getting hardware to work under Linux required either engineering knowledge or luck (or both).

Jeff went on to discuss a few of the challenges facing the Linux wireless implementation. This is, indeed, one area where some real progress is needed. Proprietary chipsets are just the beginning of the issues which must be dealt with - free software developers are actually beginning to catch up in that area. But before all the resulting drivers can be merged into a coherent whole, a few other things will have to be worked out.

One of those has to do with the 802.11 stack used by the kernel. As was discussed here last December, there is a fair amount of unhappiness with the in-kernel stack, which, among other things, has no "softmac" support, needed for adapters which do not perform MAC functions in hardware. A number of out-of-tree wireless stacks do provide that support, and there have been a lot of suggestions that one of those (usually the DeviceScape stack) be merged.

Those suggestions have been strongly resisted by the networking maintainers. They would rather see work go into fixing up the stack which is in the kernel now than replace it wholesale or - even worse - have two independent 802.11 stacks to maintain. Replacing the current stack would involve significant disruption in the networking subsystem, and would be hard to do without breaking the drivers which use the old stack. The two-stack solution, for its part, would bloat the kernel and increase the amount of work required to maintain the networking subsystem into the future. So it is not surprising that there is a strong interest in evolving the current stack toward the desired functionality rather than bringing in a whole new implementation.

Still, the pressure to switch over to the DeviceScape stack appears to be growing. Jeff's posting seems to recognize this fact, and asks that, in the end, the developers at least pick a single stack which they can live with. And, says Jeff, regardless of which stack is chosen in the end:

It is currently fashionable to laud DeviceScape and trash in-kernel ieee80211, but outside of the cheerleading, BOTH have real technical issues that need addressing. IOW, no matter what code is chosen, _somebody_ is on the hook for a fair amount of work. A switch is not without its costs.

Another issue has to do with the management interface for wireless adapters. Wired network adapters are relatively simple; set a few options on media access, give them an address, and they are ready to go. The wireless world is rather more complicated. To deal with the extra configuration required by wireless adapters, the "wireless extensions" interface - essentially a big set of ioctl() commands for querying and setting adapter parameters - was developed.

There seems to be a consensus that the wireless extensions have reached their expiration date, and need to be replaced with something else. Most developers would appear to favor a new (not yet specified) interface built on the netlink mechanism. User-space management code could then be rewritten to speak the new management protocol over netlink sockets.

This approach may seem strange, given the emphasis which has been placed on sysfs and the creation of scriptable, plain-text interfaces. Sysfs does seem like a poor match for wireless configuration, however. Wireless adapters have a large number of parameters, and it is often necessary to change several of them simultaneously. Sysfs, with its one-value-per-file rules, provides no means for this sort of atomic, multi-parameter update; a netlink interface could, instead, be designed with these needs in mind from the beginning.

Of the other issues mentioned, perhaps this one is the most significant: there is no wireless maintainer. The lack of a developer who is specifically interested in this area of networking and who will work to push it forward has clearly hurt. Fortunately, it appears that this era may be at an end: John Linville has stepped forward to take on this responsibility.

John has a fair amount of work ahead of him; quite a few developers have to be brought together and made to agree on the way forward. To that end, a wireless networking summit has been scheduled for early April in Portland. If the attendees at that meeting (which looks to include both kernel and user space developers) can produce a viable plan, Linux may just lose its "superiority in the area of crappy wireless support" before too long.

Patches and updates

Andrew Morton: 2.6.15-mm1
Andrew Morton: 2.6.15-mm2
Andrew Morton: 2.6.15-mm3
Steven Rostedt: 2.6.15-rt4-sr1
Chris Wright: Linux 2.6.14.6
Yasunori Goto: Simple memory hot-add for ia64.
Greg Ungerer: linux-2.6.15-uc0 (MMU-less support)
Mike D. Day: [PATCH] sysfs support for Xen attributes
Rafael J. Wysocki: swsusp: userland interface (rev. 2)
Rafael J. Wysocki: swsusp: separate swap-writing/reading code
Ingo Molnar: mutex subsystem, -V15
Ingo Molnar: mutex subsystem, -V16
John Stultz: Time: Generic Timeofday Subsystem (v B15-mm)
Dipankar Sarma: RCU tuning for latency/OOM
Davide Libenzi: POLLHUP tinkering ...
Arjan van de Ven: Series to allow a "const" file_operations struct
David Woodhouse: [1/6] Add pselect/ppoll system call implementation
Dave Jones: oops pauser.
Marty Ridgeway: January Release of LTP Available
Junio C Hamano: GIT 1.0.7
Junio C Hamano: GIT 1.1.0
Junio C Hamano: GIT 1.1.1
Mathieu Desnoyers: Linux Trace Toolkit Viewer/Next Generation announcement (LTTV/LTTng)
Greg KH: Driver Core patches for 2.6.15
Greg KH: PCI patches for 2.6.15
Greg KH: I2C and hwmon patches for 2.6.15
Jeff Garzik: 2.6.x net driver updates
Alessandro Zummo: RTC subsystem
Ben Collins: How to be a kernel driver maintainer
Latchesar Ionkov: v9fs: new multiplexer implementation
Latchesar Ionkov: v9fs: zero copy implementation
Eric Van Hensbergen: v9fs: add readpage support
Trond Myklebust: NFS client updates against 2.6.15 available
Andreas Gruenbacher: Generic infrastructure for acls
Andreas Gruenbacher: Access Control Lists for tmpfs
Mingming Cao: multiple block allocation to current ext3
Adrian Bunk: the scheduled removal of obsolete OSS drivers (v2)
Greg KH: devfs going away, last chance to complain
Dave McCracken: Shared page tables
Benjamin LaHaise: use local_t for page statistics
Christoph Lameter: Direct Migration V9: Overview
Jeff Garzik: State of the Union: Wireless
Harald Welte: x_tables, take 5 (Final Review)
Len Brown: Linux/ACPI mailing list moved
Stephen Hemminger: iproute2 2.6.15-060110
