Kernel development [LWN.net]

Current kernel release status

The current development kernel is 2.5.28, released on July 24. It contains major changes to the interrupt handling subsystem (see below), large m68k and PPC64 updates, Russell King's long-awaited new serial driver, numerous filesystem and block device changes from Alexander Viro, and more. Those wanting the details can see the long format changelog.

2.5.27 was announced by Linus on July 20 (the long format changelog is also available). The truly significant changes in this release included Rik van Riel's reverse-mapping VM and the beginning of the Linux Security Module merge. The LSM patch includes hooks mostly relating to process control; the rest should find their way in with later releases. This kernel also contains a lot of USB and RAID changes, some NFS tweaks, and various other fixes and updates.

2.5.27 also included Martin Dalecki's IDE 99 and IDE 100 patches which, for some reason, were not posted to the public list. Unfortunately, IDE 99 contains a bug which can lead to system lockups and file corruption; thus 2.5.27 gave some users more than they had bargained for. The discussion of the 2.5 IDE problems continues on linux-kernel. The latest development is that IDE hacker Bartlomiej Zolnierkiewicz, who until recently had been one of Martin Dalecki's defenders, has stated his intention to create his own IDE subsystem based on the 2.4 implementation.

The current prepatch from Dave Jones is 2.5.27-dj1. "Mostly resyncing with the various trees that have sprouted in the last week, and applying obvious stuff that didn't take much thinking."

Guillaume Boissiere's latest 2.5 status summary is dated July 23. Guillaume has also posted a 2.5 TODO list with the best available guesses as to what will happen between now and the Halloween feature freeze.

The current stable kernel is 2.4.18. Marcelo posted the third 2.4.19 release candidate on July 19. It is, he says, the last release candidate unless something really serious comes up.

Alan Cox's current prepatch is 2.4.19-rc3-ac2; in addition to numerous fixes it includes the new disk quota code from 2.5.

Thrashing the interrupt code

Dig into the source of an old Unix system, and you will almost certainly find calls to cli() and sti(), which disable and enable interrupts, respectively. The Linux kernel, too, has these calls. In the Good Old Days, when Linux did not run on SMP systems, a call to cli() was sufficient to guarantee exclusive access to any resource of interest. Kernel code was not preemptable, so, in the absence of interrupts, no other kernel code had any possibility of running.

SMP changed all that, of course. The cli() call remained, however, for the few places that really needed it - and to avoid having to change a great deal of code which relied on cli() for mutual exclusion. The cli() call became global, in that it disabled the handling of interrupts on all processors in the system. Note that it did not disable the interrupts themselves, just the processing of those interrupts. This was accomplished by way of the "big IRQ lock" (global_irq_lock); once cli() was called, any processor attempting to handle an interrupt would spin on that lock until things were released with sti(). Needless to say, spending a lot of time with interrupts globally disabled in this way is not good for performance; thus the use of cli() and sti() has been discouraged for a long time.

As of 2.5.28, these functions are no longer discouraged - they are gone. Ingo Molnar sent out a patch (since revised an unbelievable number of times) which removes the global_irq_lock, the cli() and sti() primitives, and more. The result is the removal of a bunch of legacy code, a faster IRQ handling subsystem, and a great many broken drivers. Said drivers are being fixed, but building Linux kernels for SMP systems could be a bit challenging for the next release or two.

This patch also merges three different counters that the kernel used to maintain:

The hard IRQ counter (__local_irq_count), which tracked the number of hardware interrupts currently being serviced by each processor;
The soft IRQ counter (__local_bh_count), which tracked software interrupts (bottom halves, tasklets, etc.); and
The preemption counter (preempt_count, in the task structure) which noted whether the process had been preempted in kernel space.

The soft IRQ and preemption counters could also be used to disable software IRQs and kernel preemption by setting them to a nonzero value. The two IRQ counters, taken together, indicate whether the processor is currently responding to an interrupt. In other words, all of these counters are related to each other - they describe what kind of code is running at the moment and what sorts of diversions the processor is allowed to take. So, with Ingo's patch, all three have been merged into the per-process preemption counter. This change results in some simplified code; it should be mostly transparent to the rest of the kernel.
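
To make the idea of one merged counter concrete, here is a small userspace sketch of how three counts can share a single word, with one bit field per former counter. The field widths, the sketch_ names, and the helper functions are all assumptions made up for illustration; they are not taken from Ingo's patch, and the real kernel definitions may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for illustration: one 8-bit field per former counter. */
#define SKETCH_PREEMPT_SHIFT   0
#define SKETCH_SOFTIRQ_SHIFT   8
#define SKETCH_HARDIRQ_SHIFT   16

#define SKETCH_PREEMPT_MASK    (0xffu << SKETCH_PREEMPT_SHIFT)
#define SKETCH_SOFTIRQ_MASK    (0xffu << SKETCH_SOFTIRQ_SHIFT)
#define SKETCH_HARDIRQ_MASK    (0xffu << SKETCH_HARDIRQ_SHIFT)

#define SKETCH_PREEMPT_OFFSET  (1u << SKETCH_PREEMPT_SHIFT)
#define SKETCH_SOFTIRQ_OFFSET  (1u << SKETCH_SOFTIRQ_SHIFT)
#define SKETCH_HARDIRQ_OFFSET  (1u << SKETCH_HARDIRQ_SHIFT)

/* Stand-in for the single counter kept for the running task. */
static unsigned int sketch_count;

/* Preemption field: nonzero means "do not preempt". */
static void sketch_preempt_disable(void)
{
    sketch_count += SKETCH_PREEMPT_OFFSET;
}

static void sketch_preempt_enable(void)
{
    assert(sketch_count & SKETCH_PREEMPT_MASK);  /* catch unbalanced calls */
    sketch_count -= SKETCH_PREEMPT_OFFSET;
}

/* Soft IRQ field: nonzero means software interrupts are held off. */
static void sketch_bh_disable(void)
{
    sketch_count += SKETCH_SOFTIRQ_OFFSET;
}

static void sketch_bh_enable(void)
{
    assert(sketch_count & SKETCH_SOFTIRQ_MASK);
    sketch_count -= SKETCH_SOFTIRQ_OFFSET;
}

/* Hard IRQ field: number of hardware interrupts being serviced. */
static void sketch_irq_enter(void)
{
    sketch_count += SKETCH_HARDIRQ_OFFSET;
}

static void sketch_irq_exit(void)
{
    assert(sketch_count & SKETCH_HARDIRQ_MASK);
    sketch_count -= SKETCH_HARDIRQ_OFFSET;
}

/* The two IRQ fields together say whether we are in interrupt context. */
static bool sketch_in_interrupt(void)
{
    return (sketch_count & (SKETCH_HARDIRQ_MASK | SKETCH_SOFTIRQ_MASK)) != 0;
}

/* Preemption is only allowed when every field is zero. */
static bool sketch_preemptible(void)
{
    return sketch_count == 0;
}
```

With a layout like this, "may this code be preempted?" becomes a single comparison against zero, and entering an interrupt implicitly disables preemption without touching a second variable.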

The cli() change is not transparent, though. People maintaining or writing drivers will now need to bear in mind that there is no longer any way to globally disable interrupts. You can still disable interrupts for the current processor (with local_irq_save() and friends), but other processors will still accept and handle interrupts. The only really safe way of protecting resources in most situations is with spin_lock_irq(); a number of drivers will need to be (finally) converted over to real locking before they will work again. Ingo has included a document (cli-sti-removal.txt) in the kernel source to help driver maintainers who are wondering how to handle this change.
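
As a rough idea of what such a conversion looks like, here is a sketch of a driver routine moved from cli() to a per-device spinlock. It uses spin_lock_irqsave() and spin_unlock_irqrestore(), the flag-saving relatives of spin_lock_irq(). The device structure and function are hypothetical, and the spinlock_t type and lock macros below are simplified userspace stand-ins so that the example compiles outside the kernel; a real driver would get them from <linux/spinlock.h>.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace stand-ins, NOT the kernel's implementation: they only model
 * "take the lock and turn off interrupts on this CPU" on a single thread.
 */
typedef struct { bool locked; } spinlock_t;

static bool local_irqs_enabled = true;  /* models this CPU's interrupt flag */

#define spin_lock_irqsave(lock, flags)                                \
    do {                                                              \
        (flags) = local_irqs_enabled;   /* remember previous state */ \
        local_irqs_enabled = false;                                   \
        assert(!(lock)->locked);        /* would spin in the kernel */\
        (lock)->locked = true;                                        \
    } while (0)

#define spin_unlock_irqrestore(lock, flags)                           \
    do {                                                              \
        (lock)->locked = false;                                       \
        local_irqs_enabled = (flags);   /* put the state back */      \
    } while (0)

/* Hypothetical driver state; the lock protects 'pending'. */
struct mydev {
    spinlock_t lock;
    unsigned int pending;
};

/*
 * Old style, which relied on cli() stopping interrupt handling on every
 * processor and no longer builds as of 2.5.28:
 *
 *     save_flags(flags);
 *     cli();
 *     dev->pending++;
 *     restore_flags(flags);
 */
static void mydev_queue_request(struct mydev *dev)
{
    unsigned long flags;

    /* Local interrupts off, and other processors excluded by the lock. */
    spin_lock_irqsave(&dev->lock, flags);
    dev->pending++;
    spin_unlock_irqrestore(&dev->lock, flags);
}
```

The important difference is that the interrupt handler and every other path touching the same data must take the same lock; disabling local interrupts alone says nothing to the other processors.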

On the initialization of structures

The kernel source contains a great many structures which are initialized at compile time. Back in the 2.3 development series, substantial effort went into converting all of those initializations into the gcc designated initialization format:

    struct something my_struct = {
        field_1:    value,
        field_2:    value,
        ...
    };

The advantage of this format, of course, is that it is possible to clearly initialize a subset of the structure's fields and not have things break if the declaration of the structure changes. It was a good change which cleaned up a lot of code.

There's only one problem: the C99 standard chose a different format. Standard-compliant C should instead contain initializations that look like:

    struct something my_struct = {
        .field_1 = value,
        .field_2 = value,
        ...
    };

After a bit of discussion, the kernel hackers have decided to, you guessed it, convert all of the structure initializations in the kernel to the new format. Those changes are starting to find their way into the mainline; all new code should certainly be done the standard way.
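
For those who have not used the standard form, here is a small, self-contained sketch. The structure and function names are invented for the example and do not come from the kernel source; the point is only that fields may be named in any subset, and that anything left out is zero-initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, in the style of many kernel structures. */
struct sample_ops {
    const char *name;
    int (*open)(void);
    int (*release)(void);
    long (*ioctl)(unsigned int cmd);
};

static int sample_open(void)
{
    return 0;
}

/* C99 designated initializers: only the fields of interest are named. */
static const struct sample_ops my_ops = {
    .name = "sample",
    .open = sample_open,
    /* .release and .ioctl are not mentioned, so they are NULL. */
};

/* Callers check for a missing method before using it. */
static int sample_call_open(const struct sample_ops *ops)
{
    assert(ops != NULL);
    return ops->open ? ops->open() : -1;
}
```

If a new field is later added to the middle of struct sample_ops, this initializer keeps working unchanged, which a positional initializer would not.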

Implementing SMP clusters

Larry McVoy's cache-coherent cluster (or SMP cluster) idea was discussed (briefly) on this page two weeks ago. Now Karim Yaghmour has posted a white paper describing how such clusters might be implemented. The design uses a modified version of Adeos to run multiple Linux kernels, each of which has control over a subset of the whole system. The result is a path toward SMP clusters that requires only minimal changes to the Linux code itself. There is still the little matter of actually doing the work, of course, but this design is a promising start.

Those interested in Adeos may also want to look at the milestone 2 release which, among other things, adds SMP support.

Patches and updates

J.A. Magallon: Linux 2.4.19-rc1-jam1 "BEWARE: this kernel probably will eat your disk and your dog, but anyways..."

J.A. Magallon: Linux 2.4.19-rc3-jam1

Andrea Arcangeli: 2.4.19rc3aa1

Jeff Dike: UML - part 1 of 2. Contains the generic code changes needed to support User-mode Linux.

Jeff Dike: UML 2.5.27

Greg Ungerer: linux-2.5.27uc1 MMU-less patches

Brad Hards: Mass storage device config re-org

Matthew Dobson: config option for irq_balance()

Greg Banks: ANNOUNCE gcml 0.6

Rusty Russell: Hotplug CPU Boot Change 1/2 "Breaks everything. But is cool."

Rusty Russell: Hotplug CPU Boot Change 2/2

Dipankar Sarma: Read-Copy Update 2.5.26

Dipankar Sarma: read_barrier_depends 2.5.26

Ingo Molnar: scheduler bits for 2.5.27

Ingo Molnar: scheduler bits for 2.5.27, -C0

Ingo Molnar: scheduler bits for 2.5.27, -D4

Christoph Hellwig: vmap_pages() for 2.5.27

Rusty Russell: Initcall depends

Erich Focht: node affine NUMA scheduler

Dave Hansen: reduce code in generic spinlock.h

Dominik Brodowski: cpufreq core for 2.5

Ingo Molnar: irqlock patch 2.5.27-H4

Jesse Barnes: spinlock assertion macros for 2.5.26

Rusty Russell: Kernel probes for i386 2.5.26

Keith Owens: Announce: ksymoops 2.4.6 is available

Adam G Litke: lockmeter for 2.5.[25,26]

Marc Boucher: New hsflinmodem-5.03.03.L3mbsibeta02072100 release

Marcel Holtmann: Bluetooth Subsystem PC Card drivers for 2.5.27

Marcel Holtmann: 2.5.28 Bluetooth Subsystem PC Card drivers update

Tom Rini: A generic RTC driver

Patrick Mochel: driverfs updates 3

Kevin Corry: EVMS Release 1.1.0-pre5

Anton Altaparmakov: NTFS: 2.0.22 - Cleanups, mainly ntfs_readdir(), and use C99 initializers

Christoph Hellwig: vmap_pages() "The vmap_pages() functions allows to map an array of virtually non-continguos pages into the kernel virtual memory."

Craig Kulesa: "not nearly so minimal" rmap for 2.5.26

Craig Kulesa: VM statistics for full rmap

Craig Kulesa: move slab pages to the lru, for rmap

Craig Kulesa: Full rmap VM for 2.5.27

Craig Kulesa: move slab pages to the lru, for 2.5.27

Robert Love: VM strict overcommit, again

William Lee Irwin III: pte_chain_mempool-2.5.27-2

William Lee Irwin III: pte_chain_slab-2.5.27-1

Rik van Riel: rlimit rss enforcement

Rusty Russell: TODO list before feature freeze (for netfilter).

Chris Wright: 2.4.19-rc3-lsm1

Greg KH: LSM changes for 2.5.27

James Morris: security_ops locking

Matthias Andree: lk-changelog.pl v0.29 released

Keith Owens: Announce: modutils 2.4.17 is available

Keith Owens: Announce: modutils 2.4.18 is available

Jes Rahbek Klinke: core file names

Andrew Rodland: Panicking in morse code v3

Neil Brown: PATCH - mark 2: type safe(r) list_entry repacement: container_of

Pavel Machek: ANNOUNCE: bitkeeper to CVS gateway

Roman Zippel: new module interface
