
Kernel development

Current kernel release status

The current development kernel is 2.5.21, which was announced by Linus on June 8. Changes include a big S/390 patch, a number of networking fixups, more kernel build changes (see last week's LWN Kernel Page), more driver model work, an NTFS update, some USB updates, and more. The long format changelog is available for those wanting the details.

Note that the IDE reworking process left a bug in 2.5.21 which can, apparently, send "format" commands to IDE drives. Said commands do not actually get run; nobody's drive has been formatted. But this is a good reminder that development kernels can always be a little hazardous, especially when fundamental layers (like IDE) are in a state of constant flux.

Linus's in-progress 2.5.22 patch (in BitKeeper) includes a big X86-64 update, a fix for a potential X86 security bug, an ACPI update, a new set of VFS and block device cleanups from Alexander Viro, a number of fixes for problems found by the Stanford Checker (see below), more IDE reworking, another set of kbuild fixes (not from kbuild-2.5), and more.

The latest prepatch from Dave Jones is 2.5.20-dj4; it brings in some fixes from the 2.4.19-pre series and the new CPU "frequency scaling" code ("Handle with care, still experimental").

The current 2.5 kernel status summary from Guillaume Boissiere was posted on June 12.

The current stable kernel remains 2.4.18. There have been no 2.4.19 prepatches or -ac patches released in the last week.

For followers of ancient kernels, David Weinehall has released 2.0.40-rc5, the fifth 2.0.40 release candidate.


The return of the Stanford Checker

We first looked at the "Stanford Checker" back in March, 2001. The Checker is a system built on top of gcc which analyzes large amounts of source code and looks for obscure errors. In the past, it has been responsible for many kernel bug fixes. The Checker team has been quiet for a while; now, perhaps with the end of the academic year, the group has returned with a new set of error reports.

So what has the checker group found this time?

  • Missing unlocks. Here, the Checker looked for situations where kernel code could either take out a lock or disable interrupts, then fail to undo the action before returning. 18 possible errors were found.

  • Memory leaks. The Checker looked for failure paths which failed to return allocated memory. "while we only include 24 errors, there were lots in general."

  • Failure to check return codes. Numerous places were found where kernel code does not look at the return status from a function which can fail.

  • Missing null pointer checks (54 errors). Most of the errors seem to be with calls to kmalloc.

  • Large stack variables (37). Allocating a variable of size greater than 1KB may not be, strictly, an error, but it can lead to problems quickly when the stack runs out of space.
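Several of these bug classes come down to failure paths that skip cleanup or ignore a result. The following is a minimal userspace sketch of the usual defensive pattern, not code from the Checker reports: `toy_lock`, `copy_payload()`, `do_transfer()` and the failure flags are hypothetical names invented for illustration, and malloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a kernel lock; it only counts holders. */
struct toy_lock {
    int held;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held++; }

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held > 0);    /* releasing an unheld lock is a bug */
    l->held--;
}

/* Hypothetical helper that can fail; callers must check its result. */
static int copy_payload(char *dst, size_t len, int fail)
{
    if (fail)
        return -2;
    memset(dst, 0xab, len);
    return 0;
}

/*
 * Every exit path runs through the labels at the bottom, so the buffer
 * is freed and the lock is dropped wherever the failure happens.
 */
static int do_transfer(struct toy_lock *l, size_t len,
                       int fail_alloc, int fail_copy)
{
    char *buf;
    int ret = 0;

    toy_lock_acquire(l);

    buf = fail_alloc ? NULL : malloc(len);
    if (buf == NULL) {          /* the NULL check the reports found missing */
        ret = -1;
        goto out_unlock;
    }

    ret = copy_payload(buf, len, fail_copy);    /* return code is checked */
    if (ret != 0)
        goto out_free;

    /* ... a real driver would use buf here ... */

out_free:
    free(buf);
out_unlock:
    toy_lock_release(l);
    return ret;
}
```

Calling do_transfer() with either failure flag set returns a negative value and still leaves the lock released, which is the property the "missing unlock" and "memory leak" checks look for.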

The Checker code itself remains unreleased, unfortunately. The Checker group does the kernel a great service by performing this testing and passing on the problems for fixing. But there is no end of other development projects out there that could benefit from this code. One can only hope that, someday, the Checker code will be more widely available.


DMA, small buffers, and cache incoherence

Roland Dreier reported on an interesting class of bugs which can affect drivers on some architectures. This particular source of subtle bugs is worth a look as an example of how hard it can be to really make things work on modern hardware.

All modern systems, of course, employ one or more levels of cache in the processor to cut down on slow accesses to main memory. One challenge with in-processor caching has always been to avoid doing the wrong thing when something other than the processor changes memory. On SMP systems, for example, any processor can write anywhere in memory, and the other processors have to adjust immediately. For that reason, SMP systems have elaborate schemes for moving "ownership" of cached data between processors. This "cache line bouncing" is effective but expensive; modern operating system kernels try to minimize the need for such bouncing.

Another possible source of cache confusion is DMA I/O. Peripheral devices doing DMA can change memory directly and leave the processor cache in an incorrect state. Some processors (e.g. the x86) have a coherent cache which notices changes made by peripherals and automatically updates itself. Other processors have incoherent caches which can be fooled by DMA I/O operations.

The Linux DMA support code has been very carefully written to hide cache coherence issues from driver code. If you use the primitives provided and follow the rules regarding processor access to DMA buffers, you will not be bitten by cache problems. The DMA code takes care of invalidating cache contents as needed so that caches never contain incorrect copies of main memory.

That is the idea, anyway. Roland has found a situation where this protection does not quite work. Consider a driver which is using a structure like this:

    struct iostruct {
        ...
        int ifield;
        char dma_buffer[SMALL_SIZE];
        ...
    };

If this structure is allocated properly (with kmalloc, for example), then using the dma_buffer field in DMA operations is a legal thing to do. The problem is that other fields in the structure (such as ifield in the example above) may share a cache line with part of the buffer. Consider, then, a sequence of things that can happen:

  1. The driver starts a DMA read into dma_buffer. As part of this operation, the kernel will invalidate the cache data containing both dma_buffer and ifield.

  2. While the operation is outstanding, the driver accesses the ifield member, bringing the invalidated cache line back into the cache.

  3. The I/O operation completes, changing memory underneath the cached data.

At this point, the data in the processor cache does not match what is in memory. If the driver accesses the data in dma_buffer, it may well find old data that was in memory before the I/O operation took place. If the driver changes ifield, the processor could write back the (incorrect) cache data, corrupting the data in main memory. If the kernel simply invalidates the cache again at the end of the operation, it could lose changes made to ifield. There really is no correct thing to do at this point.
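The three steps can be replayed with a toy model of a single cache line. This is only a sketch under stated assumptions: the 32-byte line size, the offsets, and every function name here are invented for illustration, and real hardware and the kernel DMA code are far more involved.

```c
#include <assert.h>
#include <string.h>

#define LINE_SIZE  32   /* assumed cache line size for the model */
#define IFIELD_OFF 0    /* ifield lives at the start of the line */
#define BUFFER_OFF 4    /* dma_buffer starts in the same line */

/* Toy model: one line of main memory and the CPU's cached copy of it. */
static unsigned char memory[LINE_SIZE];
static unsigned char cache[LINE_SIZE];
static int cache_valid;

static void cache_invalidate(void) { cache_valid = 0; }

/* CPU read: a miss pulls the whole line in from memory. */
static unsigned char cpu_read(int off)
{
    assert(off >= 0 && off < LINE_SIZE);
    if (!cache_valid) {
        memcpy(cache, memory, LINE_SIZE);
        cache_valid = 1;
    }
    return cache[off];
}

/* Device write: goes straight to memory, bypassing the cache. */
static void dma_write(int off, unsigned char val)
{
    memory[off] = val;
}

/* Replays steps 1-3 and returns what the CPU then sees in dma_buffer[0]. */
static unsigned char replay_race(void)
{
    memset(memory, 0, LINE_SIZE);
    memory[BUFFER_OFF] = 0x11;      /* old buffer contents */
    cache_invalidate();             /* step 1: starting DMA invalidates the line */
    (void)cpu_read(IFIELD_OFF);     /* step 2: touching ifield refills the line */
    dma_write(BUFFER_OFF, 0x22);    /* step 3: the device stores new data */
    return cpu_read(BUFFER_OFF);    /* served from the stale cached line */
}
```

In this model replay_race() returns the old value 0x11 even though memory now holds 0x22, which is the stale read described above.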

The only way to deal with this problem is to not let it happen in the first place. A number of possibilities are being considered. One way, suggested by Roland, is to create a __dma_buffer attribute which can be used in the declaration of small buffers; on non-cache-coherent systems, this attribute would force the size and alignment of the buffer such that it would not share cache lines with any other data. Another approach is to require that all DMA buffers be allocated separately; the kernel memory allocation primitives already ensure that even the smallest buffers are properly aligned and padded. Yet another approach could be to simply disable caching for the page(s) in question while the operation is in progress; most architectures support this in their page tables. This approach could create performance problems, however (if the page in question has heavily-used data), and it could be complex.
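The first suggestion can be approximated with the GCC/Clang `aligned` attribute. The sketch below is not Roland's patch: the 64-byte line size, the wrapper type `dma_small_buf`, and the helper names are assumptions made for illustration. Wrapping the buffer in an aligned struct forces both its starting offset and its size to whole cache lines.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; real values are per-architecture */
#define SMALL_SIZE 16

/* Layout from the example above: ifield and the buffer are neighbours. */
struct iostruct_plain {
    int ifield;
    char dma_buffer[SMALL_SIZE];
    int other;
};

/* Hypothetical wrapper: aligned start, and sizeof rounded up to a full line. */
struct dma_small_buf {
    char data[SMALL_SIZE];
} __attribute__((aligned(CACHE_LINE)));

struct iostruct_padded {
    int ifield;
    struct dma_small_buf dma_buffer;
    int other;
};

/* Nonzero if byte ranges [a, a+alen) and [b, b+blen) touch a common line. */
static int ranges_share_line(size_t a, size_t alen, size_t b, size_t blen)
{
    assert(alen > 0 && blen > 0);
    return a / CACHE_LINE <= (b + blen - 1) / CACHE_LINE &&
           b / CACHE_LINE <= (a + alen - 1) / CACHE_LINE;
}

/* Offsets are relative to the struct start, assumed to be line-aligned. */
static int plain_shares(void)
{
    return ranges_share_line(offsetof(struct iostruct_plain, ifield), sizeof(int),
                             offsetof(struct iostruct_plain, dma_buffer), SMALL_SIZE);
}

static int padded_shares(void)
{
    size_t buf = offsetof(struct iostruct_padded, dma_buffer);
    size_t len = sizeof(struct dma_small_buf);

    return ranges_share_line(offsetof(struct iostruct_padded, ifield),
                             sizeof(int), buf, len) ||
           ranges_share_line(offsetof(struct iostruct_padded, other),
                             sizeof(int), buf, len);
}
```

The cost is visible in the sizes: the padded structure spends a full line on a 16-byte buffer, which is one reason the attribute would only take effect on non-cache-coherent systems.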

David Miller, who wrote much of the current DMA code, has a different approach. He thinks that this kind of subtle cache issue is a trap for driver writers that should be simply avoided altogether. Rather than come up with new ways of working around incoherent caches, it's better to just change the rules and tell driver writers to allocate their small DMA buffers using the "PCI pool" interface. This interface, which was added in 2.4.4, was designed for just this purpose: allocating small buffers for DMA. Rather than make driver writers deal with this sort of cache coherence issue (and watch some of them get it wrong), David would bury it in the PCI pool code. While no real resolution has been proclaimed, this last option appears to be the likely outcome.
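The appeal of a pool is that the alignment rule lives in one place. The toy allocator below illustrates only that idea in userspace; it is not the PCI pool interface (the kernel's pci_pool_create() family), and `toy_pool`, its functions, and the 64-byte line size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE  64  /* assumed line size */
#define POOL_BLOCKS 8
#define MAX_BLOCK   (2 * CACHE_LINE)

struct toy_pool {
    size_t block_size;                  /* request rounded up to whole lines */
    unsigned char in_use[POOL_BLOCKS];
    /* Backing store starts on a line boundary (GCC/Clang attribute). */
    unsigned char arena[POOL_BLOCKS * MAX_BLOCK]
        __attribute__((aligned(CACHE_LINE)));
};

static int toy_pool_init(struct toy_pool *pool, size_t size)
{
    size_t i;

    if (size == 0 || size > MAX_BLOCK)
        return -1;
    pool->block_size = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    for (i = 0; i < POOL_BLOCKS; i++)
        pool->in_use[i] = 0;
    return 0;
}

/* Hands out a block that starts and ends on cache line boundaries. */
static void *toy_pool_alloc(struct toy_pool *pool)
{
    size_t i;

    for (i = 0; i < POOL_BLOCKS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->arena + i * pool->block_size;
        }
    }
    return NULL;    /* pool exhausted; callers must check */
}

static void toy_pool_free(struct toy_pool *pool, void *block)
{
    size_t i = (size_t)((unsigned char *)block - pool->arena) / pool->block_size;

    assert(i < POOL_BLOCKS && pool->in_use[i]);
    pool->in_use[i] = 0;
}
```

A driver using such a pool never sees the padding: it asks for a 16-byte buffer and gets a block that, by construction, shares no cache line with any other allocation.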


A new way of ordering kernel initialization

The Linux kernel is made up of a very large number of mostly independent modules. In general, these modules can be linked together and initialized (at boot time) in any order. There are cases, however, where initialization order matters. The memory management system generally needs to be set up early in the process, filesystems generally need a functioning block system to be ready first, etc. Some years ago, initialization order was handled with a big set of explicit calls in a single source file. This big file inhibited modularization and created a clash point for patches, and it was (mostly) eliminated some time ago.

The current scheme involves marking initialization functions with variants of the initcall attribute. At link time, these functions are marshalled together into a special section of the kernel executable; the kernel finds them there at boot time and calls them all. As an added bonus, the initialization calls can generally be flushed out of memory once initialization is complete.

This scheme is far more modular and easy to maintain, but the initialization order problem remains. In recent times that problem has been handled through a combination of hardwired calls and variants on the initcall macro. So, subsystems whose initialization calls are marked with core_initcall are initialized before those using, say, fs_initcall. These macros give a coarse solution to the problem, but initialization order problems can still show up.
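The section-gathering trick can be tried in userspace. The sketch below is modeled loosely on the scheme described above and is not the kernel's actual macros: `TOY_INITCALL`, the section names, and the two init functions are invented for illustration. It assumes GCC or Clang with an ELF linker such as GNU ld, which defines `__start_`/`__stop_` symbols for any section whose name is a valid C identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*initcall_t)(void);

/* Place a pointer to fn in a per-level section (GCC/Clang extension). */
#define TOY_INITCALL(level, fn) \
    static initcall_t toy_initcall_##fn \
    __attribute__((used, section("toy_initcall_" #level))) = fn

/* Section bounds supplied by the linker (assumes GNU ld or compatible). */
extern initcall_t __start_toy_initcall_core[], __stop_toy_initcall_core[];
extern initcall_t __start_toy_initcall_fs[], __stop_toy_initcall_fs[];

static char boot_log[64];

static int myfs_init(void)   { strcat(boot_log, "fs ");   return 0; }
static int mycore_init(void) { strcat(boot_log, "core "); return 0; }

/* Registration order in the source does not matter; the level does. */
TOY_INITCALL(fs, myfs_init);
TOY_INITCALL(core, mycore_init);

/* Call every function pointer the linker gathered into one section. */
static void run_level(initcall_t *start, initcall_t *stop)
{
    initcall_t *call;

    for (call = start; call < stop; call++) {
        assert(*call != NULL);
        (*call)();
    }
}

/* Levels run in a fixed order: everything in "core" precedes "fs". */
static void toy_do_initcalls(void)
{
    boot_log[0] = '\0';
    run_level(__start_toy_initcall_core, __stop_toy_initcall_core);
    run_level(__start_toy_initcall_fs, __stop_toy_initcall_fs);
}
```

Within a single level the call order is whatever the linker produces, which is the coarseness noted above: two subsystems at the same level have no way to say which of them must go first.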

Now Rusty Russell has posted a new mechanism which allows kernel hackers to make initialization dependencies explicit. If driver1 must be set up before driver2 can be initialized, driver2 can simply mark its initialization call as:

    initcall (driver2_init, driver2, init_after(driver1));

There is also an init_before marker, of course, along with init_as_part_of for complicated subsystems. A new build_initcalls script has the job of sorting out the dependencies and creating an ordered list at kernel build time. The patch looks simple and straightforward; initialization order problems could soon be a thing of the past.
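Sorting out the dependencies amounts to a topological sort. The sketch below shows one simple way to do it and makes no claim about how build_initcalls works: `struct toy_initcall` and `order_initcalls()` are hypothetical, and only a single init_after-style dependency per entry is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CALLS 8

/* Hypothetical record: a name plus the name it must run after (or NULL). */
struct toy_initcall {
    const char *name;
    const char *after;
};

static int find_call(const struct toy_initcall *calls, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(calls[i].name, name) == 0)
            return (int)i;
    return -1;
}

/*
 * Fill order[] with indices into calls[] so that every entry appears after
 * its prerequisite. Returns 0 on success, -1 on a cycle or an unknown name.
 */
static int order_initcalls(const struct toy_initcall *calls, size_t n, int *order)
{
    int done[MAX_CALLS] = { 0 };
    size_t placed = 0, i;

    assert(n <= MAX_CALLS);
    while (placed < n) {
        int progress = 0;

        for (i = 0; i < n; i++) {
            int dep = -2;   /* -2 means "no prerequisite" */

            if (done[i])
                continue;
            if (calls[i].after != NULL)
                dep = find_call(calls, n, calls[i].after);
            if (dep == -1)
                return -1;  /* depends on something that does not exist */
            if (dep == -2 || done[dep]) {
                order[placed++] = (int)i;
                done[i] = 1;
                progress = 1;
            }
        }
        if (!progress)
            return -1;      /* the remaining entries form a cycle */
    }
    return 0;
}
```

Doing this at build time, as the patch does, means an impossible ordering (a cycle, or a dependency on a nonexistent subsystem) can be reported before the kernel is ever booted.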


Patches and updates

The LWN.net kernel patch ticker

Since it was easy to do with the new site, there is now a new page where you can see the latest kernel patches as they get fed into our system. It is currently just an unorganized stream. We would like to hear if this feature is useful to anybody; if so, we may develop it further.


Kernel trees

Core kernel code

  • Rusty Russell: initcall dependency solution. A mechanism for ensuring that kernel subsystems get initialized in the proper order. (June 11, 2002)

Development tools

Device drivers

  • Jeff Garzik: ANN: Linux 2.2 driver compatibility toolkit. "Don't load your drivers up with 2.2.x compatibility junk. Write a 2.4.x driver... and use this toolkit to make it work under 2.2." (June 10, 2002)

Documentation

  • Dan Aloni: On the use of typedefs. A change to the CodingStyle document laying down Linus's approach to typedefs. (June 11, 2002)

Filesystems and block I/O

Janitorial

Kernel building

  • Andrew Morton: CONFIG_NR_CPUS. Trims 240KB from the kernel on a 2-processor system. (June 9, 2002)

Networking

Architecture-specific

Miscellaneous

  • Pavel Machek: S4bios support. Suspend/resume support for the S4 BIOS. (June 12, 2002)

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds