Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.16-rc4, announced by Linus on February 17. Things are settling down, and this prepatch contains "only" 100 fixes or so, many concentrated in the SCSI subsystem. Details can be found in the long-format changelog.

As of this writing, the mainline git repository contains about 75 post-rc4 patches, including one reverting a change which broke systems running non-current versions of HAL (see below).

The current -mm tree is 2.6.16-rc4-mm1. Recent changes to -mm include the addition of Al Viro's "bird" tree, a big x86-64 update, some memory management tweaks, some software suspend patches, a big "generic bit operations" patch set, and the lightweight robust futex patch.

For 2.4 users, Marcelo has released the second 2.4.33 prepatch with several fixes, some of which are security-related.

Quote of the week

Please stop CC'ing me on this pointless thread! Dunno who put me back, but I have absolutely ZERO interesting in reading any of it anymore. I'd rather get a root canal while listening to Michael Bolton and getting my right leg sawed off

-- Jens Axboe gets tired of the cdrecord discussion (going strong into its second month).

The kevent interface

The Linux asynchronous I/O implementation is notoriously incomplete; among the many things on the "to do" list is asynchronous network I/O. Network writes are already, to some extent, asynchronous, but only if the kernel is able to copy user data into a kernel buffer. The current interface cannot be simultaneously zero-copy and asynchronous. There is also no way to set up asynchronous, zero-copy reads. Evgeniy Polyakov has recently posted a patch which tries to fill that gap - and quite a bit more besides - through the addition of three new system calls and a completely new kernel event subsystem.

Evgeniy's patch adds a new "kevent" type. The kernel can generate and report kevents for a number of possible situations, including:

The arrival of network data or connections.
Any situation which can be reported by the poll() system call.
Events which can be returned by inotify(), such as the creation or removal of files.
Network asynchronous I/O events.
Timer events.

All of this becomes possible through the addition of a complex system call:

    struct kevent_user_control
    {
        unsigned int cmd;
        unsigned int num;
        unsigned int timeout;
    };

    long kevent_ctl(int fd, struct kevent_user_control *ctl);

The file descriptor argument to kevent_ctl() has little to do with any requested events; it is, instead, mostly used as a place for the kevent subsystem to stash some of its own housekeeping information. That file descriptor must be allocated, however, with a call like:

    ctl.cmd = KEVENT_CTL_INIT;
    int kevent_fd = kevent_ctl(0, &ctl);

The returned file descriptor can be used to add, remove, modify, and wait for events. Event requests are passed from user space in a structure like:

    struct kevent_id
    {
        __u32 raw[2];
    };

    struct ukevent
    {
        struct kevent_id id;
        __u32 type;
        __u32 event;
        __u32 req_flags;
        /* ... */
    };

Here, the embedded id structure usually holds a file descriptor number for which associated events are desired. For timer events, instead, it holds the timeout period. The type and event fields describe what sorts of events are desired; type can be one of: KEVENT_SOCKET (data and/or connections on sockets), KEVENT_INODE (file creation and removal), KEVENT_POLL (any poll() event), KEVENT_TIMER (timer events), or KEVENT_NAIO (network asynchronous I/O). The event field is a bitmask which depends on type; as an example, for inode events, it can contain KEVENT_INODE_CREATE and/or KEVENT_INODE_REMOVE. The main thing seen in req_flags is KEVENT_REQ_ONESHOT, indicating that only one event should be returned.

The attentive reader may have noticed that the kevent_ctl() interface has no parameter for the ukevent structure. Instead, the user-space process is expected to place one or more ukevent structures immediately after the kevent_user_control structure in memory, and to set the num field to how many of those structures are present. A process interested in events should create this set of structures and pass them to kevent_ctl() with a cmd value of KEVENT_CTL_ADD. After that, the kernel will start generating events at the appropriate times. Other possible cmd values are KEVENT_CTL_REMOVE and KEVENT_CTL_MODIFY, which have the obvious effect.
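The memory layout described above can be sketched in user space. What follows is a minimal illustration and not code from the patch: the numeric values of the constants, the helper names, and the use of raw[0] to hold the file descriptor are all assumptions, and the ukevent structure carries only the fields shown in the article. The sketch builds the buffer that would be handed to kevent_ctl() but does not make the call, since that system call exists only in the posted patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Structures as listed in the article; the elided ukevent fields are omitted. */
struct kevent_user_control {
    unsigned int cmd;
    unsigned int num;
    unsigned int timeout;
};

struct kevent_id {
    uint32_t raw[2];
};

struct ukevent {
    struct kevent_id id;
    uint32_t type;
    uint32_t event;
    uint32_t req_flags;
};

/* Placeholder values only; the real constants are defined by the patch. */
enum { KEVENT_CTL_ADD = 1 };
enum { KEVENT_INODE = 1 };
enum { KEVENT_INODE_CREATE = 0x1, KEVENT_INODE_REMOVE = 0x2 };
enum { KEVENT_REQ_ONESHOT = 0x1 };

/* Allocate one control header followed directly by num ukevent slots. */
static struct kevent_user_control *kevent_request_alloc(unsigned int cmd,
                                                        unsigned int num)
{
    size_t len = sizeof(struct kevent_user_control)
               + (size_t)num * sizeof(struct ukevent);
    struct kevent_user_control *ctl = calloc(1, len);

    if (ctl == NULL)
        return NULL;
    ctl->cmd = cmd;
    ctl->num = num;
    return ctl;
}

/* Return the i-th ukevent stored immediately after the header. */
static struct ukevent *kevent_request_event(struct kevent_user_control *ctl,
                                            unsigned int i)
{
    assert(i < ctl->num);
    return (struct ukevent *)(ctl + 1) + i;
}

/* Fill one slot with a one-shot request for create/remove events on fd.
 * Storing fd in raw[0] is an assumption made for this sketch. */
static void kevent_watch_inode(struct ukevent *uk, int fd)
{
    memset(uk, 0, sizeof(*uk));
    uk->id.raw[0] = (uint32_t)fd;
    uk->type = KEVENT_INODE;
    uk->event = KEVENT_INODE_CREATE | KEVENT_INODE_REMOVE;
    uk->req_flags = KEVENT_REQ_ONESHOT;
}
```

A caller would allocate a request with kevent_request_alloc(KEVENT_CTL_ADD, n), fill each slot, and pass the header pointer to kevent_ctl() along with the kevent file descriptor.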

The final supported command is KEVENT_CTL_WAIT, which will wait for the number of events specified in the num field. An optional timeout value can also be provided. The returned events will, once again, go into memory just after the kevent_user_control structure. It is also possible to pass the kevent file descriptor to poll() or select().
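Reading the results of a wait follows the same layout in reverse. The sketch below is only an illustration under stated assumptions: the numbering of the type constants is invented, kevent_tally() is a hypothetical helper, and the count of returned events is supplied by the caller because the article does not say how the kernel reports it. It walks the ukevent structures found just after the control structure and tallies them by type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Structures as listed in the article; the elided ukevent fields are omitted. */
struct kevent_user_control {
    unsigned int cmd;
    unsigned int num;
    unsigned int timeout;
};

struct kevent_id {
    uint32_t raw[2];
};

struct ukevent {
    struct kevent_id id;
    uint32_t type;
    uint32_t event;
    uint32_t req_flags;
};

/* Placeholder numbering; the real values are defined by the patch. */
enum kevent_type {
    KEVENT_SOCKET,
    KEVENT_INODE,
    KEVENT_POLL,
    KEVENT_TIMER,
    KEVENT_NAIO,
    KEVENT_TYPE_MAX
};

/* Tally the events found after the control header, by type.
 * `returned` is the number of events the wait delivered (assumed known
 * to the caller). Returns how many events had a recognized type. */
static unsigned int kevent_tally(const struct kevent_user_control *ctl,
                                 unsigned int returned,
                                 unsigned int counts[KEVENT_TYPE_MAX])
{
    const struct ukevent *uk = (const struct ukevent *)(ctl + 1);
    unsigned int known = 0;

    memset(counts, 0, KEVENT_TYPE_MAX * sizeof(counts[0]));
    for (unsigned int i = 0; i < returned; i++) {
        if (uk[i].type < KEVENT_TYPE_MAX) {
            counts[uk[i].type]++;
            known++;
        }
    }
    return known;
}
```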

Extending this mechanism to asynchronous network I/O requires the addition of two more system calls:

    long aio_send(int kevent_fd, int socket_fd, void *buffer, size_t size,
                  unsigned flags);
    long aio_recv(int kevent_fd, int socket_fd, void *buffer, size_t size,
                  unsigned flags);

Either one of these calls will put together and enqueue a special kevent request on the given kevent_fd file descriptor. The I/O will remain outstanding; once it completes, the associated event will be returned to the process. Until the completion event, the buffer should not be touched. There is also a provision for an aio_sendfile() system call, though it has not yet been implemented.

At the lower levels, enabling asynchronous I/O for a protocol requires the addition of two new methods to the proto structure:

    int (*async_recv) (struct sock *sk, void *dst, size_t size);
    int (*async_send) (struct sock *sk, struct page **pages,
                       unsigned int poffset, size_t size);

In Evgeniy's patch, only the TCP protocol has been extended in this manner.
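Since only one protocol fills in these methods, code calling them would presumably have to check for their presence first. The following user-space mock is a sketch of that idea and nothing more: struct proto_sketch, the demo_* stubs, and proto_supports_naio() are hypothetical names, and struct sock and struct page are opaque stand-ins for the kernel types.

```c
#include <assert.h>
#include <stddef.h>

struct sock;   /* opaque stand-in for the kernel's socket structure */
struct page;   /* opaque stand-in for the kernel's page structure */

/* Hypothetical cut-down version of the proto structure, holding only
 * the two methods named in the article. */
struct proto_sketch {
    const char *name;
    int (*async_recv)(struct sock *sk, void *dst, size_t size);
    int (*async_send)(struct sock *sk, struct page **pages,
                      unsigned int poffset, size_t size);
};

/* Stub methods that pretend the whole request was queued. */
static int demo_async_recv(struct sock *sk, void *dst, size_t size)
{
    (void)sk;
    (void)dst;
    return (int)size;
}

static int demo_async_send(struct sock *sk, struct page **pages,
                           unsigned int poffset, size_t size)
{
    (void)sk;
    (void)pages;
    (void)poffset;
    return (int)size;
}

/* A protocol that provides both methods, as TCP does in the patch. */
static const struct proto_sketch tcp_sketch = {
    .name = "TCP",
    .async_recv = demo_async_recv,
    .async_send = demo_async_send,
};

/* A protocol that has not been extended; both methods stay NULL. */
static const struct proto_sketch udp_sketch = {
    .name = "UDP",
};

/* Returns 1 when both asynchronous methods are present, 0 otherwise. */
static int proto_supports_naio(const struct proto_sketch *p)
{
    assert(p != NULL);
    return p->async_recv != NULL && p->async_send != NULL;
}
```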

There has been very little discussion of this patch on the netdev mailing list (where it was posted). Your editor suspects that, while the functionality provided by the patch is welcome, the user-space interface, perhaps, needs a little bit of work before it will be ready for inclusion into the mainline kernel.

Sysfs and a stable kernel ABI

Some things are fairly predictable. There is a long list of regressions in the 2.6.16 kernel, and some of those do not appear to be getting a whole lot of developer attention. But when one of those bugs causes a developer's iPod to stop working with Linux, it will get fixed in a timely manner. This time around, it also set off a discussion on what it really means to have a stable application interface to the kernel.

Back in the dim and distant past (last year), the "user events" mechanism was added to the kernel. One of the first events to be implemented was block device mount and unmount operations. Over time, however, it was concluded that user events were not the right way to communicate this information. So a new interface - allowing interested user-space processes to call poll() on /proc/mounts - was added to the kernel. Then, a patch was merged for 2.6.16 which removes the mount and unmount events.
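A process using the poll() interface might look something like the sketch below. The helper names are invented for illustration, and the choice of POLLERR and POLLPRI as the signal that the mount table has changed reflects how this interface is commonly used rather than anything stated in the article; treat it as an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Open the mount table for watching; returns a descriptor or -1. */
static int open_mount_table(const char *path)
{
    return open(path, O_RDONLY);
}

/* Wait up to timeout_ms milliseconds (-1 means forever) for a change.
 * Returns 1 if a change was signalled, 0 on timeout, -1 on error.
 * The descriptor should stay open between calls so that changes
 * happening between two waits are not missed. */
static int mount_table_changed(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLERR | POLLPRI };
    int rc;

    assert(timeout_ms >= -1);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & (POLLERR | POLLPRI))) ? 1 : 0;
}
```

A watcher would open /proc/mounts once, loop on mount_table_changed(fd, -1), and re-read the file from the start each time a change is reported.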

When Pekka Enberg (the iPod user) fingered this patch as the cause of the problem, the author of that patch (Kay Sievers) responded: "Upgrade HAL, it's too old for that kernel." This response didn't sit well with Andrew Morton:

You took a kernel interface which was present in 2.6.10, 2.6.11, 2.6.12, 2.6.13, 2.6.14 and 2.6.15 and changed it in a non-compatible way, without telling us that it was non-compatible and without even notifying people that we'd gone and broken existing userspace.

We. Don't. Do. That.

Linus, too, was unimpressed:

Guys: you now have two choices: fix it by sending me a patch and an explanation of what went wrong, or see the patch that broke things be reverted.... I'm fed up with hearing how "breaking user space is ok because it's HAL or hotplug". IT IS NOT OK. Get your damn act together, and stop blaming other people.

For now, the issue has been resolved by reverting the patch in question. The feature removal schedule has been updated to note that the mount and unmount events will disappear in February of 2007. iPod owners can rest easy for now.

But this episode drives home a point which is worth noting. Longstanding kernel policy has been that, while kernel internals can change at any time, the user-space interface must remain absolutely stable. Even when an interface turns out to have been badly designed, it must continue to work. Interfaces can be augmented or superseded, but they cannot be broken.

Not that long ago, the kernel ABI consisted entirely of the system call interface and a few files in /proc. While regressions were not unknown, the fact is that keeping a couple hundred system calls in a stable state is a relatively straightforward task. People notice when a system call interface is changed. In more recent times, the interface to the kernel has gotten much wider; it includes several netlink-based protocols and a number of kernel-based virtual filesystems like configfs and sysfs. It can be easy for kernel developers to lose track of the fact that, when they work on one of those interfaces, they risk breaking the user-space ABI. And it can be easy for patches which alter the user-space interface to slip past the review process.

This risk is especially acute with sysfs. The directory tree exported via sysfs matches, in a very close way, the data structures maintained within the kernel. Every sysfs directory corresponds to a kobject embedded within some kernel structure, and every sysfs attribute is tied, somehow, to an attribute of the associated structure within the kernel. There are some advantages to this arrangement; sysfs has become a clear window into the organization of the system as seen by the kernel. And, because sysfs is so closely tied to the kernel's data structures, most developers need not even think about it. When a new type of device, for example, is added to the kernel, the associated sysfs entries will generally just happen by themselves.

But every entry in sysfs - 3400 attributes in 1175 directories on your editor's relatively simple system - is part of the kernel ABI. That's 3400 attributes tied to 1175 kernel internal data structures which cannot be changed without the risk of breaking user-space code. Sysfs has evolved into a highly complex - and, to a great extent, undocumented - binary interface to the kernel. In the short term, that makes sysfs susceptible to inadvertent regressions as developers make changes without thinking about the possible user-space effects.

In the longer term, a different problem might arise. The kernel developers have always been willing to make incompatible changes to the internal API if the end result is a better, more capable, or safer interface. This freedom to change things is widely exploited; see the LWN 2.6 API changes page to see just how widely. As kernel data structures get tied into sysfs, however, they become part of an ABI which cannot be broken. In a few years, the kernel hackers may find themselves in the position of wanting to make significant internal structural changes, only to be thwarted by the inability to change the associated sysfs structure. At that point, the choice will be to either (1) not make the changes, or (2) interpose some sort of compatibility translation layer between sysfs and the kernel structures it represents. Neither looks like a whole lot of fun.

Wasabi white paper on kernel modules

The folks at Wasabi Systems have published a white paper on the legal status of loadable kernel modules. "As attorneys ourselves, we cannot find a coherent legal argument for excluding LKMs from [GPL] coverage. So why does the Free Software Foundation tolerate them? Because of its dual interests. On the one hand, it seeks to enforce the GPL. On the other hand, it seeks to promote the use of free software such as Linux." Or, perhaps, because the FSF has little copyright interest in the kernel.

Patches and updates

Linus Torvalds Linux 2.6.16-rc4
Andrew Morton 2.6.16-rc4-mm1
Alexey Dobriyan 2.6.16-rc4-kj
Ingo Molnar 2.6.15-rt17
Marcelo Tosatti Linux 2.4.33-pre2
Paul Mackerras Provide an interface for getting the current tick length
Ingo Molnar lightweight robust futexes: -V3
Ingo Molnar lightweight robust futexes: -V4
john stultz Time: Generic Timeofday Subsystem (v B19)
john stultz [RFC][PATCH] Time: Experimental improved ntp error accounting for TOD work
Junio C Hamano GIT 1.2.1
Junio C Hamano GIT 1.2.2
Marco Costalba qgit 1.1
Roland Dreier [RFC] IBM eHCA InfiniBand adapter driver
jayakumar.acpi@gmail.com ACPI: Atlas ACPI driver
Randy Dunlap ACPI objects for SATA/PATA
Carlos Aguiar mmc: add OMAP driver
Alessandro Zummo RTC subsystem
Martin Michlmayr [PATCH 1/2] Driver to remember ethernet MAC values: maclist
Mike Christie scsi tgt: scsi target netlink interface
Kenji Kaneshige PCI legacy I/O port free driver (take2)
Michael Kerrisk man-pages-2.24 is released
Ian Kent autofs4 - add direct mount functionality to autofs
Jun'ichi Nomura [PATCH 0/3] sysfs representation of stacked devices (dm/md)
Paul Mundt relay: Migrate from relayfs to a generic relay API.
David Gibson RFC: Block reservation for hugetlbfs
Badari Pulavarty [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
David Howells [PATCH 1/5] NFS: Permit filesystem to override root dentry on mount
Rusty Russell Remove MODULE_PARM
Con Kolivas mm: implement swap prefetching (v26)
Con Kolivas mm: Implement swap prefetching v27
Mel Gorman Reducing fragmentation using zones v5
Evgeniy Polyakov Kevent, network AIO system calls implementation.
Patrick McHardy Netfilter patches for 2.6.17
Heinz Mauelshagen *** Announcement: dmraid 1.0.0.rc10 ***
Kay Sievers udev 085 release
Douglas Gilbert lsscsi-0.17 released
Douglas Gilbert sdparm 0.97
Theodore Ts'o Kernel Summit 2006 planning kickoff
