Kernel development [LWN.net]

Kernel release status

The current 2.6 kernel remains 2.6.20; no 2.6.21 prepatches have been released. Patches are flowing into the mainline git repository, however - see below for the highlights.

For older kernels: 2.6.16.40 was released on February 10 with a relatively small number of patches.

The first 2.4.35 prepatch is now available; it contains a few fixes and a backport of the 2.6 "sky2" network driver.


Quotes of the week

I'm sorry, but could we please not mix the kernel with Vogon poetry contest?

-- Al Viro

I have an email sitting in my drafts folder stating that I'll no longer accept any features unless they've been publicly reviewed in detail and run-time tested by a third party. The idea being to force people to spend more time reviewing and testing each other's stuff and less time writing new stuff. Maybe on a sufficiently gloomy day I'll actually send it.

-- Andrew Morton


What came in through the merge window

As of this writing, the 2.6.21 merge window is wide open. Something over 2,300 changesets have been merged, making changes all over the tree. This article summarizes the major changes merged so far for the 2.6.21 release.

User-visible changes include:

A big ACPI update with sysfs support for backlight devices, a simplified table manager which adds more functionality with less code, the removal of 16-bit support, experimental support for removable drive bays, and more.
New device drivers add support for Silan SC92031 network interface chips, Qlogic 4032 NIC chips, PA Semi PWRficient Ethernet chips, Avocent PC300/RSV and PC300/X21 WAN cards, Atmel MACB network controllers, Yukon Extreme Ethernet chips, several USB-attached NCR printers, Chelsio T3 10G Ethernet adapters, GTCO CalComp tablets, Delkin compact flash adapters, Attansic L1 Gigabit Ethernet adapters, VIA VT1708(a) HD audio codecs, several auxiliary LCD display devices, PC-style CMOS real-time clocks, SNI RM 53c710 SCSI controllers, Gigaset M101 wireless RS232 adapters, and S3 Trio/Virge video chips (fbdev). Also, the long-broken SKMC and Oaknet drivers have been removed.
Sysfs shadow directory support - allowing different namespaces to have different views of sysfs - has been added.
USBmon has a new binary API which promises to be somewhat faster and more complete than the older, text-based interface.
A big PowerPC/Cell/PS3 update, including support for the Toshiba "Celleb" architecture, serial ports accessed through OpenFirmware, and AMCC Taishan 440GX evaluation boards.
Netfilter now has a connection tracking helper for the SANE network scanner protocol.
Encryption modules for the FCrypt and Camellia cipher algorithms have been added.
The ASoC (ALSA System on Chip) layer has been added to the ALSA sound system. It provides improved support for sound processors on embedded systems; it includes a dynamic power management subsystem. A number of platform and codec drivers for ASoC have been merged as well.
Tainting the kernel from user space is now supported.
Minix V3 filesystems can now be mounted on Linux systems.
eCryptfs now has public-key encryption support.
A long set of patches has made the kernel able to support boot-time command lines of arbitrary length.

Changes visible to kernel developers include:

Quite a few kobject functions - kobject_init(), kobject_del(), kobject_unregister(), kset_register(), kset_unregister(), subsystem_register(), subsystem_unregister(), and subsys_create_file() - now return harmlessly if passed a NULL pointer.
Many kernel subsystems which once used class_device structures have been changed to use struct device instead; this work is toward a long-term goal of getting rid of the class tree and having a single device tree in sysfs.
Significant changes have been made to the crypto support interface.
The device resource management patches, making a lot of driver code easier to write, have been merged.
The DMA memory zone (ZONE_DMA) is now optional and may not be present in all kernels.
The local_t type has been made consistent across architectures and has gained some documentation.
The nopfn() address space operation can now return NOPFN_REFAULT to indicate that the faulting instruction should be re-executed.
A new function, vm_insert_pfn(), enables the insertion of a new page into a process's address space by page-frame number.
A new driver API for general-purpose I/O signals has been added.
The sysctl code has been heavily reworked, leading to a number of internal API changes.

A number of patches are still waiting to be merged, and some decisions are yet to be made. Come back next week for what should be the final list of major new features in 2.6.21.


Alternatives to fibrils

Since last week's article on fibrils was written, there has been relatively little discussion of that set of patches. That silence does not mean that interest in the idea has faded, however; instead, a couple of different approaches have been posted for consideration.

Linus Torvalds got inspired to create an asynchronous system call patch of his own. Simplicity is the word to describe this patch: it adds less than 200 lines of code to the kernel ("I even added comments, so a lot of the few new added lines aren't even code!"). It works like this:

The new async() system call takes a system call number, arguments for the system call, and a pointer to a location for the final status code.
The process's register set is saved, then the system call is executed as usual.
Should the kernel call schedule(), meaning that the system call is about to block, the process will fork before blocking.
The new child process returns to user space and continues executing there. Meanwhile, the original process will finish out the asynchronous system call.

The largest claimed advantage to this patch, beyond its simplicity, is that there is almost no overhead if the asynchronous system call can be completed without blocking. The fibril patch, instead, always runs asynchronous calls in independent fibrils. Linus claims that almost all asynchronous system calls can, in fact, be completed synchronously without blocking, so he would really rather see little or no up-front cost in that case.

There are various issues with Linus's patch. If the asynchronous call blocks, for example, the return to user space will happen in a different process - a change which could prove confusing to user space. Only one asynchronous operation can be outstanding at any given time. There is also no way to wait for an asynchronous operation to complete except to poll the exit status. But this patch was never meant to be a complete solution; as a proof of concept it is interesting.

For a rather more elaborate approach, Ingo Molnar's syslet patchset is worth a look. With syslets, a user-space program can run system calls asynchronously. Beyond that, however, it can load little programs into the kernel and let them run independently.

To use syslets, the application starts by filling in one of these structures:

    struct syslet_uatom {
        unsigned long        flags;
        unsigned long        nr;
        long                *ret_ptr;
        struct syslet_uatom *next;
        unsigned long       *arg_ptr[6];
        void                *private;
    };

Here, nr is the number of the system call to run, arg_ptr holds pointers to the arguments, and ret_ptr tells the kernel where to put the final status from the call. The private field is not used by the kernel at all. We'll get to the other fields shortly.
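As a rough illustration of how an application might fill in an atom, here is a minimal sketch. The structure layout is the one shown above; the atom_init() helper is hypothetical and is not part of the syslet patch set. Note that each argument must live in an unsigned long somewhere in user memory, since arg_ptr holds pointers to the arguments rather than the arguments themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Atom layout as given in the article. */
struct syslet_uatom {
    unsigned long        flags;
    unsigned long        nr;
    long                *ret_ptr;
    struct syslet_uatom *next;
    unsigned long       *arg_ptr[6];
    void                *private;
};

/*
 * Hypothetical helper (not part of the patch set): prepare one atom.
 * "args" is an array of pointers to the argument values; unused
 * argument slots, flags, next and private are left zeroed.
 */
static void atom_init(struct syslet_uatom *atom, unsigned long nr,
                      long *ret_ptr, unsigned long *args[], size_t nargs)
{
    assert(nargs <= 6);

    *atom = (struct syslet_uatom){0};
    atom->nr = nr;
    atom->ret_ptr = ret_ptr;
    for (size_t i = 0; i < nargs; i++)
        atom->arg_ptr[i] = args[i];
}
```

An application wanting an asynchronous write(), for example, would store the file descriptor, buffer address, and length in three unsigned long variables, pass their addresses in args, and use the system call number for write() as nr.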

Once the syslet_uatom structure is ready, the application can run it with:

    long async_exec(struct syslet_uatom *atom);

This call will start on the requested system call immediately. If that system call never blocks, it will be run synchronously and the address of the atom will be returned from async_exec(). Otherwise the kernel will grab a thread from a pool and use that thread to return to user space, continuing the system call in the original thread. The application can then go off and do whatever makes sense - including running more syslets - while the system call runs to completion.

What actually happens when the system call completes is a little more complex and interesting, however. Unless user space has requested otherwise, the kernel does not just complete the syslet after the first system call runs; instead, it looks at the next field of the syslet_uatom structure. If that field is non-NULL, it is taken as the user-space address of the next syslet to be run by the kernel. In other words, an application is not restricted to running individual asynchronous system calls; it can chain up a whole series of them to run without ever exiting the kernel. The cost of fetching a new syslet atom is far less than a transition to user space and back, so there is a significant performance improvement to be had just by chaining two system calls together.

The final field in struct syslet_uatom is flags, which controls how syslets are executed. Four of the flags (SYSLET_STOP_ON_NONZERO, SYSLET_STOP_ON_ZERO, SYSLET_STOP_ON_NEGATIVE, and SYSLET_STOP_ON_NON_POSITIVE) test the result of the current atom's system call and, possibly, terminate execution of the syslet. In this way, for example, a chain of system calls can be stopped early if one of them fails. It is also possible to create a kernel-space loop which reads a file until no more data is available.

The SYSLET_SKIP_TO_NEXT_ON_STOP flag modifies the above flags so that, rather than terminating the syslet, the kernel skips to an atom found immediately after the current one in the process's address space. This flag allows a syslet to terminate a loop and move on to further processing within the syslet. If an application knows that a syslet will block, it can request asynchronous execution from the outset with SYSLET_ASYNC. There is also a SYSLET_SYNC flag which causes the whole thing to run synchronously.
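The chaining and stop-flag rules described above can be modeled in user space. The sketch below is not the kernel's implementation: the flag names come from the patch set, but the numeric values are placeholders, and run_syslet(), call_fn, and demo_call() are hypothetical names standing in for the kernel's atom walker and system call dispatch. It also assumes, as Linux does, that a pointer fits in an unsigned long.

```c
#include <assert.h>
#include <stddef.h>

/* Atom layout as given in the article. */
struct syslet_uatom {
    unsigned long        flags;
    unsigned long        nr;
    long                *ret_ptr;
    struct syslet_uatom *next;
    unsigned long       *arg_ptr[6];
    void                *private;
};

/* Flag names are from the article; the values are placeholders. */
#define SYSLET_STOP_ON_NONZERO      0x01UL
#define SYSLET_STOP_ON_ZERO         0x02UL
#define SYSLET_STOP_ON_NEGATIVE     0x04UL
#define SYSLET_STOP_ON_NON_POSITIVE 0x08UL
#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x10UL

/* Hypothetical stand-in for the kernel's system call dispatch. */
typedef long (*call_fn)(unsigned long nr, const unsigned long args[6]);

/* Does the result of this atom's call trigger one of its stop flags? */
static int should_stop(unsigned long flags, long ret)
{
    return ((flags & SYSLET_STOP_ON_NONZERO) && ret != 0) ||
           ((flags & SYSLET_STOP_ON_ZERO) && ret == 0) ||
           ((flags & SYSLET_STOP_ON_NEGATIVE) && ret < 0) ||
           ((flags & SYSLET_STOP_ON_NON_POSITIVE) && ret <= 0);
}

/* Walk a chain of atoms; returns the last atom that was executed. */
static struct syslet_uatom *run_syslet(struct syslet_uatom *atom, call_fn call)
{
    struct syslet_uatom *last = NULL;

    assert(call != NULL);
    while (atom) {
        unsigned long args[6];
        long ret;

        /* Arguments are fetched through the pointers on every pass. */
        for (int i = 0; i < 6; i++)
            args[i] = atom->arg_ptr[i] ? *atom->arg_ptr[i] : 0;

        ret = call(atom->nr, args);
        if (atom->ret_ptr)
            *atom->ret_ptr = ret;
        last = atom;

        if (should_stop(atom->flags, ret)) {
            if (!(atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP))
                break;              /* terminate the syslet */
            atom = atom + 1;        /* atom placed right after this one */
        } else {
            atom = atom->next;      /* NULL ends the chain */
        }
    }
    return last;
}

/* Toy dispatcher: call 0 acts like a read() that hits end-of-file. */
static long demo_call(unsigned long nr, const unsigned long args[6])
{
    if (nr == 0) {
        unsigned long *remaining = (unsigned long *)args[0];

        if (*remaining == 0)
            return 0;               /* no more data */
        (*remaining)--;
        return 1;                   /* one unit "read" */
    }
    return -1;                      /* every other call fails */
}
```

A loop is built by pointing an atom's next field back at itself and setting SYSLET_STOP_ON_NON_POSITIVE together with SYSLET_SKIP_TO_NEXT_ON_STOP; when the toy read returns zero, the walker falls through to the atom stored immediately after it in the array.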

Syslets do not have any variables of their own. To help with the writing of useful programs, Ingo has added a new system call:

    long umem_add(unsigned long *pointer, unsigned long increment);

This call simply adds the given increment to *pointer, returning the resulting value.
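The described behavior amounts to very little code. The following user-space model shows the arithmetic only; the name umem_add_model() is made up here to avoid any suggestion that it is the real system call, and the overflow behavior (ordinary unsigned wraparound) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space model of the described semantics: add "increment" to the
 * value at "pointer" and return the new value. Unsigned arithmetic
 * wraps, so an increment of (unsigned long)-1 counts down by one.
 */
static long umem_add_model(unsigned long *pointer, unsigned long increment)
{
    assert(pointer != NULL);

    *pointer += increment;
    return (long)*pointer;
}
```

Since an atom's arguments are fetched through pointers, an atom like this placed inside a loop can adjust a counter or file offset that a later atom in the same loop then uses as an argument.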

The application can register a ring buffer with the kernel using the async_register() system call. Whenever an atom completes, its address will be stored in the next ring buffer entry; the application can then use that address to find the system call status. The kernel will not overwrite non-NULL ring buffer entries, so the application must reset them as it consumes them. If the application needs to wait for syslet completion, it can call:

    long async_wait(unsigned long min_events);

This call will block the process until at least min_events have been stored into the ring buffer.
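The ring buffer protocol can be sketched as a small user-space model. Everything below other than the atom layout is hypothetical: the article does not give the arguments to async_register() or say what the kernel does when it finds an occupied slot, so struct completion_ring, ring_complete() (standing in for the kernel side), and ring_consume() are illustrative names, and returning -1 on a full slot is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Atom layout as given in the article. */
struct syslet_uatom {
    unsigned long        flags;
    unsigned long        nr;
    long                *ret_ptr;
    struct syslet_uatom *next;
    unsigned long       *arg_ptr[6];
    void                *private;
};

/* Hypothetical model of a registered completion ring. */
struct completion_ring {
    struct syslet_uatom **slots;
    size_t                nslots;
    size_t                kernel_idx;  /* next slot the kernel would fill */
    size_t                user_idx;    /* next slot the application reads */
};

/*
 * Models the kernel side: record a completed atom. A non-NULL slot is
 * never overwritten; this model reports that case with -1.
 */
static int ring_complete(struct completion_ring *ring, struct syslet_uatom *atom)
{
    assert(atom != NULL);

    if (ring->slots[ring->kernel_idx] != NULL)
        return -1;
    ring->slots[ring->kernel_idx] = atom;
    ring->kernel_idx = (ring->kernel_idx + 1) % ring->nslots;
    return 0;
}

/*
 * Application side: take the next completed atom, if any, and reset
 * the slot to NULL so that it can be reused.
 */
static struct syslet_uatom *ring_consume(struct completion_ring *ring)
{
    struct syslet_uatom *atom = ring->slots[ring->user_idx];

    if (atom == NULL)
        return NULL;
    ring->slots[ring->user_idx] = NULL;
    ring->user_idx = (ring->user_idx + 1) % ring->nslots;
    return atom;
}
```

After ring_consume() returns an atom, the application would read the call's status through that atom's ret_ptr; a NULL return means nothing has completed, which is the point where a real application might call async_wait().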

This patch set, too, presents a number of unanswered questions. Once again, signal handling has been punted for now. There's no end of security implications which must be thought out; in the end, a number of system calls will probably be marked as being off-limits for asynchronous execution. There has still been no discussion on how this sort of interface would play with the kevent patches - kevents seem to be a concept that nobody wants to talk about at the moment. 64/32-bit compatibility could present interesting challenges of its own. And so on. But the initial reaction to syslets appears to be positive (though Linus hates it); syslets might just point to the form of the fibril idea which eventually makes it into the mainline kernel.


Patches and updates

Adrian Bunk: Linux 2.6.16.40
Willy Tarreau: Linux 2.4.35-pre1
Andi Kleen: What will be in the x86-64/x86 2.6.21 merge
Greg Ungerer: linux-2.6.20-uc0 (MMU-less updates)
Mathieu Desnoyers: atomic.h : standardizing atomic primitives
Mathieu Desnoyers: local_t : adding and standardising local atomic primitives
Ingo Molnar: ANNOUNCE: "Syslets", generic asynchronous system call support
Paul E. McKenney: QRCU fastpath optimization
Venkatesh Pallipadi: Introducing cpuidle: core cpuidle infrastructure
Evgeniy Polyakov: kevent: Description.
Gautham R Shenoy: Freezer based Cpu-hotplug
Christopher Li: sparse-0.2-cl2 is now available
Josef Sipek: Guilt v0.19
Mathieu Desnoyers: Linux Kernel Markers - kernel 2.6.20
Junio C Hamano: GIT 1.5.0
Alan: pata_acpi: take two
Bartlomiej Zolnierkiewicz: IDE updates for 2.6.20
Jeff Garzik: net driver updates
Dave Airlie: drm tree for 2.6.21-rc1
Greg KH: USB patches for 2.6.20
Greg KH: PCI patches for 2.6.20
Greg KH: Driver core patches for 2.6.20
Jaroslav Kysela: alsa-git merge request
Wim Van Sebroeck: v2.6.20 watchdog patches
James Ketrenos: d80211 based driver for Intel PRO/Wireless 3945ABG
Marcelo Tosatti: Marvell Libertas 8388 802.11b/g USB driver (v3)
Roland Dreier: please pull infiniband.git
Len Brown: ACPI patches for 2.6.21 - part II
Jeff Garzik: libata updates 1 of 3
Ben Dooks: fb: SM501 framebuffer driver
Arnd Bergmann: Open Firmware serial port driver
Jean Delvare: i2c updates for 2.6.21
Jean Delvare: hwmon updates for 2.6.21
Linas Vepstas: lpfc: add PCI error recovery support
Randy Dunlap: phy layer: add kernel-doc + DocBook
Rafael J. Wysocki: PM: Document requirements for basic PM support in drivers
NeilBrown: [PATCH 000 of 9] knfsd: NFSv4 ACL improvements and a couple of bug fixes.
Artem Bityutskiy: Pull UBI tree
Miklos Szeredi: add filesystem subtype support
Dave Hansen: filesystem helpers for custom 'struct file's
Takashi Sato: ext4 online defrag (ver 0.3)
Sorin Faibish: DualFS: File System with Meta-data and Data Separation
Nick Piggin: mm: NUMA replicated pagecache
Kok, Auke: Multiple transmit/receive queue kernel
Samir Bellabes: [PATCH] Network Events Connector
David Howells: [RFC] AF_RXRPC socket family implementation
Patrick McHardy: Netfilter update/fixes
YOSHIFUJI Hideaki: IPv6 Updates
Herbert Xu: Crypto Update for 2.6.21
Tetsuo Handa: TOMOYO Linux 1.3.2 released
David Howells: MODSIGN: Kernel module signing
Rusty Russell: lguest
menage@google.com: containers (V7): Generic Process Containers
Jeremy Fitzhardinge: [patch 00/21] Xen-paravirt: Xen guest implementation for paravirt_ops interface
Stephen Hemminger: sk-drivers mailing list
Rik van Riel: page replacement requirements
