Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.8-rc3, which was announced by Linus on August 3. Most of the additions this time around are relatively small fixes; they include some kbuild work, a great many "sparse" annotations, the removal of the (non-functional) "fastroute" networking option, some crypto-API work (including an x86-optimized AES implementation which may be yanked out due to licensing concerns), and several architecture updates. The long-format changelog has all the details.

Linus's BitKeeper repository contains no patches after 2.6.8-rc3 as of this writing.

The current patch set from Andrew Morton is 2.6.8-rc2-mm2. Recent additions to -mm include some read-copy-update work (to address more latency issues), performance improvements for O_SYNC disk I/O, the staircase CPU scheduler (see below), token-based thrashing control (see below again), a change to /dev/mem allowing architectures to block access to kernel memory, a new vprintk() function, and large numbers of fixes and updates.

The current 2.4 prepatch is 2.4.27-rc5, released by Marcelo on August 3. This one contains a fix for a new security issue (which could allow unprivileged processes to read kernel memory), takes out DVD-RW support for now (they will try again in 2.4.28), and adds a few fixes.


Scheduler tweaks get serious

Con Kolivas has been working on his staircase scheduler patch for a while; it was covered here in the beginning of June. That scheduler found its way into the 2.6.8-rc2-mm2 patch, along with this comment from Andrew Morton:

This will probably have to come out again because various people are still fiddling with the CPU scheduler. But my feeling here is that the current 1st-gen CPU scheduler has been tweaked as far as it can go and is still not 100% right. It is time to start thinking about a new design which addresses the requirements and current problems by algorithmic means rather than by tweaking.

So it would seem that it is now open season for scheduler work.

Initial reports on the staircase scheduler are generally - but not uniformly - good. Martin Bligh posted some benchmark results showing some significant performance improvements for the 2.6.8-rc2-mm2 kernel, especially for "low to mid loads." Ingo Molnar, instead, has found a workload which performs poorly with this scheduler; it involves running multiple processes each of which wants most, but not all, of the CPU.

Con, meanwhile, has posted a couple of further patches implementing additional policies in the staircase scheduler. SCHED_BATCH is another attempt at an "idle process" mode, where batch processes only run if nothing else wants the processor. This patch attempts to avoid priority inversion problems by scheduling SCHED_BATCH processes at normal priority when they are running in kernel mode.

SCHED_ISO, instead, is an "isochronous" mode intended for applications which need soft real-time response. Putting a process into SCHED_ISO is an unprivileged operation; any user can do it. Isochronous tasks start out with a relatively high priority, and should get scheduled quickly. Their allocated time slices are half of what they would otherwise be, however, and their priority drops especially quickly with CPU usage. So this mode is suitable for I/O bound processes which need to respond quickly (audio recording, CD burning, etc.), but it should not allow a hostile user to take over the system.

Peter Williams has been working on a different set of scheduler patches. His approach is to get rid of the "expired" array (where processes go to languish when they have used up their time slices) and move everything to a single array. The patch offers two modes: the traditional priority-based mode and a new "entitlement" mode, which tries to work out how much processor time each task is entitled to, then works to ensure that each is given at least that much time. His patches are available in a dizzying number of varieties; they seem to have seen less testing so far, but Andrew has said that one of them might get a turn in -mm for a while.

Nick Piggin's -np trees also contain a new scheduler. Nick's work tries to simplify many of the scheduler calculations while retaining logic which tries to evaluate the "interactivity" of each process. Unlike some implementations, this scheduler gives longer time slices to higher-priority processes. All slices are scaled depending on the job mix, however; low-priority processes will get longer slices if there are no high-priority processes around.

Ingo Molnar has continued his work on voluntary preemption; his voluntary-preempt-2.6.8-rc2-O2 patch features a new implementation of the interrupt threads feature. The available reports indicate that, with this patch, latency problems in the 2.6 kernel are becoming few and far between.

There is no way to tell, at this point, which of these scheduler approaches - if any - will find its way into the mainline kernel. Evaluating schedulers takes a long time, and, for any given scheduler, there always seems to be some strange workload out there which makes it fall apart. The approaches described above (with the exception of voluntary preemption) share one nice feature, however, which is likely to argue in favor of including one of them: they all remove a significant amount of code and make the scheduler simpler and easier to understand. That, in and of itself, may be a worthwhile step toward the implementation of a top-quality Linux scheduler.


Token-based thrashing control

A system which is in the throes of VM thrashing is no fun to work with. The kernel is forever throwing out pages which it will need in the near future in favor of pages needed right now, and little work actually gets done. It seems like there has to be a better way.

Rik van Riel has put together a patch based on the work of Song Jiang which might help. The basic idea is that a process which is currently bringing in pages should, for a short period, not have its other pages booted out to swap. With luck, that process will actually make some progress during that grace period before the VM grim reaper swoops down and consigns it, once again, to the swap ghetto.

Clearly, not all processes which are bringing in pages can be sheltered from page reclamation at the same time; if they could, the system would not be thrashing in the first place. This problem is addressed through the creation of a "swap token." A process holding the swap token will be allowed to bring in pages without having its current working set molested for a period of time. After a while, the token is passed on to the next needy process.

In Rik's patch, the (single, system-wide) token is implemented through swap_token_mm, a pointer to the mm structure of the process holding the token. If the kernel, on behalf of a process incurring a page fault, decides that the token is available, swap_token_mm will be set and the faulting process will get its breathing space for a while. The token is deemed to be available if (1) it has been held for longer than the maximum period, which is set to a surprisingly long 300 seconds, or (2) the process holding the token has not incurred any page faults recently. Once the token becomes available, the first process which comes looking for it will grab it - unless it has held the token in the recent past.
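The availability rules described above can be modeled in a few lines of user-space Python. This is only a sketch of the logic as the text describes it, not the code from Rik's patch: the `Proc` and `SwapToken` names are hypothetical, and the `FAULT_IDLE` and `REHOLD_DELAY` thresholds are assumptions standing in for "recently" and "the recent past." Only the 300-second maximum comes from the article.

```python
from dataclasses import dataclass
from typing import Optional

MAX_HOLD = 300.0      # seconds; the maximum hold period mentioned in the text
FAULT_IDLE = 1.0      # assumption: what "no page faults recently" means
REHOLD_DELAY = 10.0   # assumption: what "held the token in the recent past" means


@dataclass
class Proc:
    """Hypothetical stand-in for a process (the patch tracks an mm structure)."""
    name: str
    last_fault: float = float("-inf")
    last_released: float = float("-inf")


class SwapToken:
    """Single, system-wide token; `holder` plays the role of swap_token_mm."""

    def __init__(self) -> None:
        self.holder: Optional[Proc] = None
        self.acquired_at = 0.0

    def available(self, now: float) -> bool:
        if self.holder is None:
            return True
        held_too_long = now - self.acquired_at > MAX_HOLD
        holder_idle = now - self.holder.last_fault > FAULT_IDLE
        return held_too_long or holder_idle

    def page_fault(self, proc: Proc, now: float) -> bool:
        """Handle a fault by proc; return True if proc holds the token afterwards."""
        holder = self.holder
        if holder is proc and now - self.acquired_at <= MAX_HOLD:
            proc.last_fault = now          # the holder keeps faulting pages in
            return True
        free = self.available(now)         # judged before recording this fault
        proc.last_fault = now
        if not free:
            return False
        if holder is not None:             # take the token from the old holder
            holder.last_released = now
            self.holder = None
        if now - proc.last_released < REHOLD_DELAY:
            return False                   # recent holders may not grab it again
        self.holder = proc
        self.acquired_at = now
        return True
```

In this model, a process that faults while another process is actively using the token simply gets nothing; it can only pick up the token once the holder has gone quiet or has exceeded the maximum hold time.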

Rik's tests show some performance improvements with this patch applied. There are potential improvements which could be made, such as trying to add some intelligence to the decision of which process gets the token. A huge process may hold the token for some time, grow to fill much of memory, and still not have enough to get any real work done. Meanwhile, small processes which could have benefited from a few extra pages continue to thrash. Some tweaks could be made to the patch to address this issue, but there are limits to how much code and complexity should be added to the kernel to deal with a (hopefully) rare situation.


Recent patches of interest

A number of interesting kernel patches have been posted in recent times. Since your editor is pressed for time, a few of those patches will be quickly covered here.

Nigel Cunningham has been working at getting some small pieces of his software suspend implementation into the kernel. One of those pieces is this patch, which has to do with the "freezing" of kernel threads prior to suspending the system. As processes are put on hold, the kernel risks stopping a process which is needed later on in the suspend process; think about a process handling NFS service or software interrupts, for example. To avoid this situation, kernel threads are simply not frozen. But many of them can be, and that would make the suspend process more robust. So Nigel's patch goes through and tries to set up each thread with the appropriate flags, so that only truly necessary kernel threads continue to run while the system is being suspended.

A number of these threads, it turns out, are part of a workqueue. As a way of setting up every workqueue process with the right flags, Nigel changed the interface to create_workqueue() and create_singlethread_workqueue(), thus breaking all code which creates its own workqueues. Andrew Morton expressed some discomfort at the API change, but acknowledged that it was useful in that it forces people to think about whether every workqueue needs to run during a system suspend operation or not. This patch has not yet appeared in -mm, as of this writing.

Rik van Riel and Arjan van de Ven have put together a new patch which allows normal users to lock memory into physical RAM without root privilege. The RLIMIT_MEMLOCK resource limit puts an upper bound on how much memory can be locked, and its default value is zero. By raising this limit, system administrators can enable users to lock a single page (useful for cryptographic applications which do not want to see passphrases and clear text swapped to disk) or larger amounts (for CD writing tasks, for example). Various issues were raised regarding the security of this patch, but the latest version appears to have resolved them. This code should eventually replace the magic "mlock group" hack that was covered here last May.
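From user space, the limit in question can be inspected before an application tries to lock anything. The following is a minimal sketch using Python's standard `resource` module on a Linux system; the helper names `fits_within_limit()` and `current_memlock_limit()` are hypothetical, and the sketch only checks the limit rather than calling mlock() itself.

```python
import resource


def fits_within_limit(nbytes: int, soft_limit: int) -> bool:
    """Would locking nbytes stay within the given RLIMIT_MEMLOCK soft limit?"""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return nbytes <= soft_limit


def current_memlock_limit() -> int:
    """Return this process's soft RLIMIT_MEMLOCK value, in bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft
```

A cryptographic tool wanting to pin a single page could, for example, call `fits_within_limit(resource.getpagesize(), current_memlock_limit())` and fall back to unlocked memory when the answer is no.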

Fistgen 0.1 has been released; this is the first version for the 2.6 kernel. The announcement describes fistgen as "a package of stackable templates," which may not be particularly illuminating to many readers. More information can be found at filesystems.org; one developer calls it "a yacc for filesystems." Using fistgen and a small amount of code, a set of filters can be set up to create a filesystem with a given set of characteristics. For example, this template describes a filesystem which encrypts data using the sophisticated "rot13" algorithm. The fistgen parser reads the template file and generates C code implementing the filesystem, which can then be loaded into the kernel.
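For readers who have not met it, the rot13 transformation itself fits in a few lines. The sketch below is plain Python, not fistgen template syntax, and the `rot13()` function name is chosen here purely for illustration; it shows the sort of per-buffer filter such a template would ask the generated filesystem to apply.

```python
import string

_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase

# Map each ASCII letter to the letter 13 places further on, wrapping around.
_ROT13 = bytes.maketrans(
    (_LOWER + _UPPER).encode("ascii"),
    (_LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13]).encode("ascii"),
)


def rot13(data: bytes) -> bytes:
    """Rotate ASCII letters by 13 places; all other bytes pass through."""
    return data.translate(_ROT13)
```

Since the alphabet has 26 letters, applying the function twice returns the original data, so the same filter serves for both the write and the read path.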

John McCutchan has been working on his inotify patch for some time. Inotify is meant to be a replacement for the dnotify mechanism, used by processes which wish to be alerted when files are changed. The inotify patch takes a different approach; it creates a char device which supports a small set of ioctl() operations. After opening this device and using ioctl() to express interest in a particular set of files, a process need only read the device to get the change events for those files.

OpenSSI 1.0 is out. OpenSSI is a "single system image" clustering environment based on the 2.4 kernel; it includes membership functions, the CFS and Lustre Lite filesystems, process management, and a cluster-wide device mechanism built on devfs. See the OpenSSI web page for more information.

The sysfs directory /sys/module contains, among other things, attributes for parameters exported by loaded modules. Dominik Brodowski noticed that, if these modules are built directly into the kernel, those parameters are not available via sysfs. Even if they were, they would not belong under /sys/module, since the code in question is not part of a module. So he has posted a patch creating a new directory (/sys/parameters) and putting attributes there, for both modules and built-in code. This is a user-space API change, but it is unlikely that anything of any consequence depends on parameters under /sys/module at this point.

Jens Axboe has posted a new SCSI generic ("sg") implementation (called "bsg") which works through the block layer. This driver implements the SG_IO ioctl() call, and also allows communication through regular reads and writes. The latter functionality caused some complaints; when structures are passed between user and kernel space with read() and write() calls, it becomes very hard to convert them when the process is running in 32-bit mode on a 64-bit platform. For all that the developers dislike ioctl(), that interface does, at least, make it clear when and where a structure is being transferred across the user-kernel boundary. To address these complaints, the bsg driver may be restricted to the ioctl() mode only.
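The conversion problem can be illustrated with Python's `struct` module. The sketch below assumes a hypothetical two-field command header, `{ long length; void *buffer; }`, which is not taken from the bsg driver; it simply shows that the same C declaration produces different byte layouts for a 32-bit process and a 64-bit kernel.

```python
import struct

# Hypothetical header: { long length; void *buffer; }
LAYOUT_32 = "<iI"   # 32-bit process: 4-byte long, 4-byte pointer
LAYOUT_64 = "<qQ"   # 64-bit kernel: 8-byte long, 8-byte pointer


def sizes():
    """Return (size as seen by a 32-bit process, size as seen by a 64-bit kernel)."""
    return struct.calcsize(LAYOUT_32), struct.calcsize(LAYOUT_64)


def kernel_parse(raw: bytes):
    """Interpret raw bytes using the 64-bit layout; None if they do not fit."""
    try:
        return struct.unpack(LAYOUT_64, raw)
    except struct.error:
        return None


# What each kind of process would hand to write():
from_32bit = struct.pack(LAYOUT_32, 512, 0x08001000)
from_64bit = struct.pack(LAYOUT_64, 512, 0x7F0000001000)
```

Here `kernel_parse(from_32bit)` fails because the buffer is only half the expected size. With write(), the kernel sees nothing but a byte count and has to work out for itself which layout it was given, whereas an ioctl() command identifies the structure being passed, giving the compatibility code an obvious place to translate it.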


Patches and updates

Linus Torvalds Linux 2.6.8-rc3

Matt Mackall 2.6.8-rc3-tiny1 for small systems

Andrew Morton 2.6.8-rc2-mm2

Con Kolivas 2.6.7-ck6

Marcelo Tosatti Linux 2.4.27-rc5

Marcelo Tosatti Linux 2.4.27-rc4

Andi Kleen x86_64-2.6.8rc2-3 released

Sam Ravnborg kbuild: Various updates for 2.6.8

Ingo Molnar voluntary-preempt-2.6.8-rc2-M5

Ingo Molnar voluntary-preempt-2.6.8-rc2-O2

Con Kolivas Staircase cpu scheduler 2.6.8-rc2-mm1

Con Kolivas Scheduler policies for staircase

Con Kolivas Schedrange

Con Kolivas Sched batch for staircase

Con Kolivas Isochronous scheduling for staircase scheduler

Peter Williams V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation

Perez-Gonzalez, Inaky FUSYN Realtime & robust mutexes for Linux, v2.3.1

Marty Ridgeway August release of LTP available

Pavel Machek Solving suspend-level confusion

Jesse Barnes add PCI ROMs to sysfs

Alex Williamson dev_acpi: device driver for userspace access to ACPI

Jean Tourrilhes Wireless drivers update for WE-17

Jens Axboe block layer sg, bsg

Angelo Dell'Aera TCP Westwood+ references

Erez Zadok fistgen-0.1 released (linux-2.6 support)

Maneesh Soni sysfs backing store (Re-splitted)

John McCutchan inotify 0.8

Suparna Bhattacharya Concurrent O_SYNC write support

Ravikiran G Thirumalai Lockfree fd lookup 0 of 5

Ravikiran G Thirumalai Lockfree fd lookup 1 of 5

Ravikiran G Thirumalai Lockfree fd lookup 2 of 5

Ravikiran G Thirumalai Lockfree fd lookup 3 of 5

Ravikiran G Thirumalai Lockfree fd lookup 4 of 5

Jeff Garzik fastroute dead code...

Dave Hansen don't pass mem_map into init functions

Arjan van de Ven mlock-as-nonroot revisted

Rik van Riel mlock-as-user for 2.6.8-rc2-mm2

Rik van Riel token based thrashing control

Paul Jackson subset zonelists and big numa friendly mempolicy MPOL_MBIND

Jean Tourrilhes Wireless Extension v17 for Linus

Jon Smirl OLS and console rearchitecture

Aneesh Kumar K.V OpenSSI 1.0.0 released!!

Stephen Hemminger iproute2 update

Dominik Brodowski export module parameters in sysfs for modules _and_ built-in code

Dominik Brodowski export module parameters in sysfs for modules _and_ built-in code: remove /sys/module/*parameters*
