Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.40, released by Linus on October 1. Among the usual fixes and updates, it includes high memory support for User-mode Linux, the CPU frequency (power management) patches, more disk management thrashups from Al Viro, the in-kernel NUMA topology API, the removal of the task queue subsystem, an ISDN update, and an ARM update. Here's the long-format changelog with the details.

Linus announced 2.5.39 on September 27. The biggest change, perhaps, was the inclusion of the deadline I/O scheduler (covered in last week's LWN Kernel Page); this kernel also contained a bunch of XFS fixes, an SCTP update, a pile of memory management work by Andrew Morton, Ingo Molnar's in-kernel symbolic oops dumper, some driver model work, and numerous other fixes and updates. The long-format changelog is available.
Linus's pre-2.5.41 BitKeeper tree contains a big ALSA update (the source of some grumbling from Linus), Ingo Molnar's "workqueue" implementation (see below), and a relatively small number (as of this writing) of other fixes and updates.
Dave Jones jumped back into the prepatch business with 2.5.39-dj1, which contained a number of fixes from his tree. Dave evidently still has a substantial pile of fixes to push on to Linus, but has been busy.
After a long absence, Alan Cox has also started putting out development kernel prepatches again. 2.5.40-ac1 includes support for the Voyager architecture, a merge of the uClinux distribution, and a number of fixes.
The latest 2.5 status summary from Guillaume Boissiere is dated October 2.
The current stable kernel is 2.4.19. Marcelo released 2.4.20-pre8 seconds after last week's Kernel Page was posted; it included an IBM hotplug driver update, a couple of security fixes, an x86-64 update, and a number of other fixes.
The current prepatch from Alan Cox is 2.4.20-pre8-ac3. Alan's recent releases have contained quite a few fixes, but no major new work.
Kernel development news
The feature freeze is coming
As part of the 2.5.40 announcement, Linus reminded the world that the feature freeze is coming soon.
Linus also let it be known that he's "perfectly happy with the kernel" and feels no need to deal with last-minute code submissions.
In fact, the list of outstanding features is getting smaller. A couple of big changes that are pending, and which could be disruptive, are:
- Changing the sector_t type to 64 bits, allowing block devices to be larger than 2TB. One would think that 2TB would last for a while, even by the standards of modern disks, but large RAID arrays are already pushing that boundary. (See the arithmetic sketch after this list.) A patch (by Peter Chubb) is being prepared, but it's taking him a while; among other things, he points out that testing is a slow process because it takes a full day just to write 4TB to a device.
- Turning dev_t into a 32-bit value. Increasing the number of devices has been on the list since long before the 2.5 series began, but the change has not yet been made. This is not a trivial change, since the major device number is still used to index into static arrays within the kernel. Drastically increasing the number of devices requires dealing with those arrays. Alexander Viro has a plan to that end, but a lot of work remains to be done.
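Where does the 2TB figure come from? The block layer addresses devices in 512-byte sectors, and a 32-bit sector_t runs out after 2^32 of them. Here is a quick plain-C sketch of the arithmetic; it simply restates the limit, and nothing in it comes from Peter Chubb's patch:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* The block layer addresses devices in 512-byte sectors. */
        const uint64_t sector_size = 512;

        /* A 32-bit sector_t can address 2^32 sectors... */
        const uint64_t sectors_32 = (uint64_t)1 << 32;

        /* ...for a limit of 2^41 bytes: 2TB. */
        printf("32-bit sector_t limit: %llu bytes (%llu TB)\n",
               (unsigned long long)(sectors_32 * sector_size),
               (unsigned long long)((sectors_32 * sector_size) >> 40));

        /* A 64-bit sector_t pushes the limit out to 2^64 sectors -
         * far beyond anything a RAID array will reach soon. */
        return 0;
    }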
Beyond that, quite a few other developments are pending, and they won't all get in. Some outstanding items include the completion of the Linux Security Module and asynchronous I/O merges, ext3 indexed directory support, Rusty Russell's new module loader, a new kernel configuration and build system, a whole pile of memory management work, etc.
What also remains to be seen is how serious Linus is about the feature freeze. Past kernel freezes have tended to be slushy at best. Some substantial work will have to be integrated after the freeze; it will be interesting to see what gets in as "stabilization" or "feature completion."
Then, there is the much-publicized debate over whether the next stable series should be 2.6 or 3.0. Linus started by saying that there was nothing all that revolutionary in this kernel, and that it should be called 2.6. Numerous other developers disagreed, however, and Linus appears to have relented. It seems likely that the next major stable kernel will be called Linux 3.0.
The end of task queues
Kernel code often needs to set aside a task to be performed "a little later." The classic example is that of an interrupt handler, which must perform its task quickly, without blocking. Typically interrupt handlers simply acknowledge the interrupt, then arrange for the real work to be done outside of interrupt context. That work, which can include starting new I/O operations, delivering data to user space, or performing cleanup actions, gets done when the kernel gets around to it - and, usually, when it's safe to sleep.

In the good old days, the "bottom half" mechanism was used to set aside tasks in this manner. Linux bottom halves were quite inflexible, being identified by globally-unique, compile-time numbers. There could be no more than 32 of them - the number that could be tracked in a single-word bitmask. And bottom halves were not safe places for extended processing or tasks that needed to sleep.
More recent kernels moved much of the bottom half work to "task queues." A task queue is a simple linked list of functions to call (and data to pass to them). Certain predefined task queues were run at well-defined times; one was executed whenever the scheduler was called, and another was run out of the timer interrupt handler. Task queues cleaned things up significantly, but they were not particularly transparent and, fundamentally, they were still bottom halves. Their removal has been on numerous peoples' "todo" lists for some time.
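For reference, deferring work with the task queue interface looks roughly like this. This is a sketch in the style of the 2.4-era API (queue_task() on the predefined tq_immediate queue), not code from any particular driver:

    #include <linux/tqueue.h>
    #include <linux/interrupt.h>

    static void my_deferred_work(void *data)
    {
        /* The real work happens here, outside of interrupt context. */
    }

    static struct tq_struct my_task = {
        .routine = my_deferred_work,
    };

    static void my_interrupt_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* Acknowledge the hardware, then defer the rest; tasks queued
           on tq_immediate also require marking the bottom half. */
        queue_task(&my_task, &tq_immediate);
        mark_bh(IMMEDIATE_BH);
    }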
One replacement for task queues is the "tasklet" interface, which was introduced in the 2.3 development series. Tasklets provide a high-performance interface for quick tasks that do not sleep; they are thus suitable for certain sorts of operations, but they do not replace task queues in all situations.
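The tasklet equivalent is compact: declare the tasklet once, schedule it when there is work to do, and the function runs soon after - quickly, and in atomic context. A minimal sketch:

    #include <linux/interrupt.h>

    static void my_tasklet_func(unsigned long data)
    {
        /* Runs in atomic context: must be quick, and must not sleep. */
    }

    DECLARE_TASKLET(my_tasklet, my_tasklet_func, 0);

    static void my_interrupt_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* The tasklet may run as soon as the handler returns. */
        tasklet_schedule(&my_tasklet);
    }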
More recently, an attempt was made to address other deferred processing needs by wrapping a new interface (schedule_task()) around (what was) the scheduler task queue, and creating a special kernel thread (keventd) to run that queue. keventd provided a well-defined process context for tasks that need it (in particular, those which can sleep). But keventd still suffered the limitations of task queues, plus one other: all tasks were executed by a single thread. One very slow task could thus hold up everything else in the queue, creating unpredictable latencies.
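Handing a task to keventd was a one-call affair; a sketch, assuming a tq_struct like the one shown earlier:

    #include <linux/tqueue.h>

    static void my_slow_work(void *data)
    {
        /* Runs in keventd's process context, so sleeping is allowed -
           but anything slow here delays every other queued task too. */
    }

    static struct tq_struct my_task = {
        .routine = my_slow_work,
    };

    static void defer_it(void)
    {
        schedule_task(&my_task);    /* keventd will call my_slow_work() */
    }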
A couple of patches recently posted by Ingo Molnar address these problems and clean up deferred processing substantially. The first patch removes the task queue interface and converts its remaining users over to schedule_task(); this patch was included in 2.5.40. The more interesting work is contained in the workqueue patch (since updated), which has not yet (as of this writing) been merged by Linus. This patch replaces the task queue mechanism (and schedule_task()) entirely with a mechanism which is simpler to use and which yields better-defined results.
With the workqueue patch, task queues are replaced with the new "workqueue" concept. The basic idea is the same: a workqueue is a linked list of structures containing functions to call and data to pass to them. But the internals of workqueues are better hidden so that users need not worry about what is really going on. Workqueues are executed in process context, so tasks executed from those queues may sleep. Each workqueue, however, has its own worker threads (one per CPU), so one subsystem's workqueue entries will not block others from running. There is a default workqueue (analogous to the old schedule_task() functionality) for relatively simple tasks that do not justify their own queue.
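In workqueue terms, the keventd example above might look like the following. This sketch uses the names from Ingo's patch as posted (create_workqueue(), INIT_WORK(), queue_work(), and schedule_work() for the default queue); as noted below, the interface is still evolving, so treat the details as provisional:

    #include <linux/workqueue.h>

    static void my_work_func(void *data)
    {
        /* Runs in process context in a dedicated worker thread,
           so it may sleep without holding up other subsystems. */
    }

    static struct workqueue_struct *my_queue;
    static struct work_struct my_work;

    static int my_init(void)
    {
        /* Creates one worker thread per CPU for this queue. */
        my_queue = create_workqueue("mydrvr");
        if (!my_queue)
            return -ENOMEM;
        INIT_WORK(&my_work, my_work_func, NULL);
        return 0;
    }

    static void defer_it(void)
    {
        queue_work(my_queue, &my_work);

        /* Simple tasks that don't justify their own queue can use
           the default queue instead: schedule_work(&my_work); */
    }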
For those who are interested, we have written up a separate article with reasonably complete documentation of the workqueue interface.
There has been a bit of discussion over the details of this interface. It has been through one set of modifications already, and will likely evolve more in the near future. The basic idea, however, appears to have been well received; some version of this patch will probably go in before too long.
Some security hooks get the hook
The Linux Security Module code works by allowing security-related code to hook into almost every access decision the kernel makes. Security modules can only tighten restrictions by vetoing access that would otherwise have been allowed. A number of security regimes - most notably the NSA's SELinux - have been built on the LSM structure. The LSM patch has been partially merged into the kernel; many of the LSM hooks are not yet there, however.

Recently some developers have been questioning some of the specific hooks. In response, LSM maintainer Greg Kroah-Hartman has posted a patch removing a few LSM hooks: those for creating, initializing, and deleting modules. Nobody seems to have an issue with the ability to control those operations - it's just that no code is currently using those hooks.
That is, in fact, Greg's stated policy with LSM: any hooks that are not actually being used by an available, open source security module will be removed.
The idea, of course, is that there is no point in trying to maintain code that is not in use. By the time somebody actually tries to make use of it, chances are it will be broken anyway. And, it is said, it is easy to reintroduce a hook should the need develop.
Of course, given the LSM design, it's not that easy to put in a new hook. LSM requires security modules to provide an explicit implementation for every available hook, with the result that security modules accumulate a lot of stub "no-op" hooks. Adding a hook will break every security module out there until each grows a stub for it. Given that, security module authors who see a use for some of the more obscure hooks might want to document that use before too long.
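To make the cost concrete, here is a simplified sketch of the shape of a security module; the hook names and signatures are illustrative rather than an exact copy of the LSM interface:

    #include <linux/init.h>
    #include <linux/security.h>

    /* A hook this module actually cares about (illustrative signature). */
    static int my_file_permission(struct file *file, int mask)
    {
        return 0;    /* 0 means "no objection"; nonzero vetoes the access */
    }

    /* A no-op stub, standing in for the many hooks this module
       has no opinion about. */
    static int my_unused_hook(void)
    {
        return 0;
    }

    static struct security_operations my_ops = {
        .file_permission = my_file_permission,
        /* ...plus a stub for every other hook in the structure.
           A new hook upstream means a new stub here, or the
           module no longer works. */
    };

    static int __init my_lsm_init(void)
    {
        return register_security(&my_ops);
    }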
Catching code which sleeps on the job
The kernel is full of code which is not allowed to sleep. Anything which is handling an interrupt or otherwise running out of process context, for example, should not try to go to sleep. This particular case is easy to catch in the scheduler, but others are not. For example, any code which is holding a spinlock cannot sleep either. Sleeping in this situation can lead to deadlocks (some other process spinning on the lock can prevent the holder from running again and releasing the lock), mutual exclusion failures (on uniprocessor systems where spinlocks are optimized out), or, at a minimum, excessive lock hold time and lock contention.

The problem is that it can be easy to sleep in the wrong places. Sleeps are often not done directly; instead, a piece of atomic code calls a function which calls some other function which sleeps. The "sleep tendency" of functions is not always documented, and, in any case, kernel hackers, being human, can make mistakes. Even if it seems, at times, that they don't sleep.

Until recently, these mistakes have been hard to catch. There was no "I'm running in an atomic section" flag, and thus no way for the kernel to know that it was sleeping in a bad place - until something went badly wrong. The preemptible kernel patch changed all that, however. Any place where the code cannot sleep is also certainly a bad place for that code to be preempted. So the functions which mark atomic sections (such as spinlock operations) now set a "don't preempt me" flag.
But once you have that flag, why not use it to detect sleeps in the wrong place? Andrew Morton posted a patch which does exactly that, and Linus merged it on the spot. The patch was titled "increase traffic on linux-kernel," and it has done exactly that. There are, it turns out, quite a few places where sleeping functions are called within code that is supposed to be atomic. These mistakes are being fixed almost as quickly as they are found. A small patch has done a lot to eliminate a whole class of kernel programming errors.
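The underlying check is easy to picture. The following is a simplified illustration of the idea, not Andrew's actual patch: with the preemption bookkeeping in place, a function that is about to sleep can verify that it is allowed to:

    /* Simplified illustration of the concept - not the 2.5 patch itself. */
    #include <linux/kernel.h>
    #include <linux/preempt.h>
    #include <linux/interrupt.h>

    static inline void check_sleep_allowed(void)
    {
        /* preempt_count() is nonzero whenever preemption is disabled:
           inside spinlocks, interrupt handlers, and the like. A sleep
           from such a context risks deadlock, so complain loudly. */
        if (preempt_count() != 0 || in_interrupt())
            printk(KERN_ERR "sleeping function called from atomic context\n");
    }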
The new CPU frequency code
A new CPU frequency subsystem, written by Dominik Brodowski and others, was integrated by Linus into the 2.5.40 release. This code provides user-space control over the clock frequency of the CPU(s) in the system - at least, for processors which provide that capability.

One might wonder why it would be desirable to run a processor at anything below its rated speed. The reasons, of course, are power consumption and thermal control. A faster CPU requires more power to run. If you're using your laptop on an airplane, and you're not trying to crack any encryption keys or set kernel build time records before you land, you might just want to slow down the processor a little to avoid draining your battery. Meanwhile, the processor may decide to slow down on its own if it's getting too warm.
In fact, some modern processors can take a fairly smart approach to frequency control. If the processor notices that it is spending a lot of time idle, it can slow itself down. If it's constantly busy, it can turn up the speed a bit. If a particular processor supports setting its frequency in a "dumb" mode only, it might be nice for the operating system to provide the automatic adjustment in software.
For this reason, a simple "set the frequency" interface was deemed to be insufficient. The CPU frequency code merged into 2.5.40 reflects the new understanding of the problem: it allows the user to set a range of acceptable frequencies and the desired policy. If the user selects "performance" as the policy, the processor will be instructed to run at the upper end of its range; if it slows down, it does so gradually. With the "powersave" policy, speeds will be kept lower even in the face of sustained work to do. Overall, the new interface gives the user a great deal of control over how the system operates. Of course, this interface is just a cryptic /proc file (see Documentation/cpufreq in the 2.5.40 tree for details); look for the KDE and GNOME applications to show up in the near future.
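For the curious, driving such an interface from user space amounts to writing a policy line to the /proc file. The sketch below is hypothetical - the exact file name and record format here are assumptions, and Documentation/cpufreq has the real syntax - but it conveys the flavor: pick a CPU, a frequency range, and a policy:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical example: ask CPU 0 to stay between 300MHz and
           600MHz under the "powersave" policy. The actual path and
           format are documented in Documentation/cpufreq. */
        FILE *f = fopen("/proc/cpufreq", "w");
        if (!f) {
            perror("/proc/cpufreq");
            return 1;
        }
        fprintf(f, "0:300000:600000:powersave\n");
        fclose(f);
        return 0;
    }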
For now, the code that has been merged into the kernel supports only the i386 architecture. Code for a number of other processors exists and will show up in the proper, architecture-specific trees.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet