Current kernel release status
The current development kernel is 2.5.21, which was
announced by Linus on June 8. Changes
include a big S/390 patch, a number of networking fixups, more kernel build
changes (see
last week's LWN Kernel Page),
more driver model work, an NTFS update, some USB updates, and more. The
long format changelog is available for those
wanting the details.
Note that the IDE reworking process left a bug in 2.5.21 which can,
apparently, send "format" commands to IDE drives. Said commands do not
get run, however; nobody's drive has actually been formatted. But this is
a good reminder that development kernels can always be a little hazardous,
especially when fundamental layers (like IDE) are in a state of constant
flux.
Linus's in-progress 2.5.22 patch (in BitKeeper) includes a big X86-64
update, a fix for a potential X86 security bug, an ACPI update, a new set
of VFS and block device cleanups from Alexander Viro, a number of fixes for
problems found by the Stanford Checker (see below), more IDE reworking,
another set of kbuild fixes (not from kbuild-2.5), and more.
The latest prepatch from Dave Jones is 2.5.20-dj4; it brings in some fixes from the
2.4.19-pre series and the new CPU "frequency scaling" code ("Handle
with care, still experimental").
The current 2.5 kernel status
summary from Guillaume Boissiere was posted on June 12.
The current stable kernel remains 2.4.18. There have been no 2.4.19
prepatches or -ac patches released in the last week.
For followers of ancient kernels, David Weinehall has released 2.0.40-rc5, the fifth 2.0.40 release candidate.
The return of the Stanford Checker
We first looked at the "Stanford Checker"
back in March, 2001. The Checker
is a system built on top of gcc which analyzes large amounts of source code
and looks for obscure errors. In the past, it has been responsible for
many kernel bug fixes. The Checker team has been quiet for a while; now,
perhaps with the end of the academic year, the group has returned with a
new set of error reports.
So what has the Checker group found this time?
- Missing unlocks. Here, the Checker
looked for situations where kernel code could either take out a lock
or disable interrupts, then fail to undo the action before returning.
18 possible errors were found.
- Memory leaks. The Checker looked for
failure paths which neglect to free allocated memory. "while
we only include 24 errors, there were lots in general."
- Failure to check return codes. Numerous
places were found where kernel code does not look at the return status
from a function which can fail.
- Missing null pointer checks (54
errors). Most of the errors seem to be with calls to
kmalloc.
- Large stack variables (37). Allocating
a variable of size greater than 1KB may not be, strictly, an error,
but it can lead to problems quickly when the stack runs out of space.
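To make the first two classes of error concrete, here is a small user-space sketch of the pattern the Checker looks for. Nothing here is kernel code or Checker output: the toy lock, the store_name_* functions, and the fail_alloc switch are all invented for illustration. The buggy variant returns early on an allocation failure and leaves the lock held; the fixed variant funnels every exit through a single unlock.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy lock (hypothetical): counts how many times it is held. */
struct toy_lock { int held; };

void toy_acquire(struct toy_lock *l) { l->held++; }
void toy_release(struct toy_lock *l) { assert(l->held > 0); l->held--; }

/* Buggy: the failure path returns with the lock still held.
 * fail_alloc simulates an allocation failure for demonstration. */
int store_name_buggy(struct toy_lock *l, char **slot,
                     const char *name, int fail_alloc)
{
    toy_acquire(l);
    char *copy = fail_alloc ? NULL : malloc(strlen(name) + 1);
    if (!copy)
        return -1;              /* BUG: missing toy_release() */
    strcpy(copy, name);
    free(*slot);
    *slot = copy;
    toy_release(l);
    return 0;
}

/* Fixed: the allocation result is checked and every path unlocks. */
int store_name_fixed(struct toy_lock *l, char **slot,
                     const char *name, int fail_alloc)
{
    int ret = -1;

    toy_acquire(l);
    char *copy = fail_alloc ? NULL : malloc(strlen(name) + 1);
    if (!copy)
        goto out;
    strcpy(copy, name);
    free(*slot);                /* drop any old value so it is not leaked */
    *slot = copy;
    ret = 0;
out:
    toy_release(l);
    return ret;
}
```

The single-exit "goto out" shape is common in kernel code precisely because it makes this kind of mistake harder to commit.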
The Checker code itself remains unreleased, unfortunately. The Checker
group does the kernel a great service by performing this testing and
passing on the problems for fixing. But there is no end of other
development projects out there that could benefit from this code. One can
only hope that, someday, the Checker code will be more widely available.
DMA, small buffers, and cache incoherence
Roland Dreier reported on an interesting class of bugs which can affect
drivers on some architectures. This particular source of subtle bugs is
worth a look as an example of how hard it can be to really make
things work on modern hardware.
All modern systems, of course, employ one or more levels of cache in the
processor to cut down on slow accesses to main memory. One challenge with
in-processor caching has always been to avoid doing the wrong thing when
something other than the processor changes memory. On SMP systems, for
example, any processor can write anywhere in memory, and the other
processors have to adjust immediately. For that reason, SMP systems have
elaborate schemes for moving "ownership" of cached data between
processors. This "cache line bouncing" is effective but expensive; modern
operating system kernels try to minimize the need for such bouncing.
Another possible source of cache confusion is DMA I/O. Peripheral devices
doing DMA can change memory directly and leave the processor cache in an
incorrect state. Some processors (e.g. the x86) have a coherent
cache which notices changes made by peripherals and automatically updates
itself. Other processors have incoherent caches which can be fooled by DMA
I/O operations.
The Linux DMA support code has been very carefully written to hide cache
coherence issues from driver code. If you use the primitives provided and
follow the rules regarding processor access to DMA buffers, you will not be
bitten by cache problems. The DMA code takes care of invalidating cache
contents as needed so that caches never contain incorrect copies of main
memory.
That is the idea, anyway. Roland has found a
situation where this protection does not quite work. Consider a driver
which is using a structure like this:
struct iostruct {
    ...
    int ifield;
    char dma_buffer[SMALL_SIZE];
    ...
};
If this structure is allocated properly (with kmalloc, for example),
then using the dma_buffer field in DMA operations is a legal thing to do. The
problem is that other fields in the structure (such as ifield in
the example above) may share a cache line with part of the buffer.
Consider, then, a sequence of things that can happen:
- The driver starts a DMA read into dma_buffer. As part of
this operation, the kernel will invalidate the cache data containing
both dma_buffer and ifield.
- While the operation is outstanding, the driver accesses the
ifield member, bringing the invalidated cache line back into
the processor cache.
- The I/O operation completes, changing memory underneath the cached
data.
At this point, the data in the processor cache does not match what is in
memory. If the driver accesses the data in dma_buffer, it may
well find old data that was in memory before the I/O operation took place.
If the driver changes ifield, the processor could write back the
(incorrect) cache data, corrupting the data in main memory. If the kernel
simply invalidates the cache again at the end of the operation, it could
lose changes made to ifield. There really is no correct thing to
do at this point.
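Whether a given structure is exposed to this problem comes down to simple arithmetic on field offsets. The sketch below runs in user space and makes several assumptions that are not from Roland's report: a 32-byte cache line, a SMALL_SIZE of 16, a structure that starts on a cache line boundary, and helper names (line_of, shares_cache_line) invented here.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 32   /* assumed line size; real hardware varies */
#define SMALL_SIZE 16   /* assumed buffer size */

struct iostruct {
    int  ifield;
    char dma_buffer[SMALL_SIZE];
};

/* Cache line index of a byte offset, assuming the structure itself
 * begins on a cache line boundary. */
size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

/* Nonzero if byte ranges [a, a+alen) and [b, b+blen) touch at least
 * one common cache line. */
int shares_cache_line(size_t a, size_t alen, size_t b, size_t blen)
{
    assert(alen > 0 && blen > 0);
    size_t a_first = line_of(a), a_last = line_of(a + alen - 1);
    size_t b_first = line_of(b), b_last = line_of(b + blen - 1);
    return a_first <= b_last && b_first <= a_last;
}

/* Does ifield live on a line that the DMA buffer also occupies? */
int iostruct_is_hazardous(void)
{
    return shares_cache_line(offsetof(struct iostruct, ifield),
                             sizeof(int),
                             offsetof(struct iostruct, dma_buffer),
                             SMALL_SIZE);
}
```

With these numbers both fields land in the first cache line, so iostruct_is_hazardous() reports the overlap described above.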
The only way to deal with this problem is to not let it happen in the first
place. A number of possibilities are being considered. One way, suggested by Roland, is to create a
__dma_buffer attribute which can be used in the declaration of
small buffers; on non-cache-coherent systems, this attribute would force
the size and alignment of the buffer such that it would not share cache
lines with any other data. Another approach is to require that all DMA
buffers be allocated separately; the kernel memory allocation primitives
already ensure that even the smallest buffers are properly aligned and
padded. Yet another approach could be to simply disable caching for the
page(s) in question while the operation is in progress; most architectures
support this in their page tables. This approach could create performance
problems, however (if the page in question has heavily-used data), and it
could be complex.
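The first of these ideas, forcing the size and alignment of the buffer, might look something like the following in plain C11. This is only a sketch: the definition of the proposed __dma_buffer attribute is not given here, so _Alignas and a ROUND_UP macro stand in for it, and the 32-byte line size, the structure name, and the "other" field are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 32   /* assumed line size */
#define SMALL_SIZE 16   /* assumed buffer size */

/* Round n up to the next multiple of a. */
#define ROUND_UP(n, a) ((((n) + (a) - 1) / (a)) * (a))

struct iostruct_padded {
    int ifield;
    /* Start the buffer on its own cache line and pad it to a whole
     * number of lines, so that no neighbor can share a line with it. */
    _Alignas(CACHE_LINE) char dma_buffer[ROUND_UP(SMALL_SIZE, CACHE_LINE)];
    int other;
};

/* True when the buffer begins and ends on cache line boundaries,
 * provided the structure itself is allocated with CACHE_LINE alignment. */
int dma_buffer_is_isolated(void)
{
    size_t start = offsetof(struct iostruct_padded, dma_buffer);
    size_t len = sizeof(((struct iostruct_padded *)0)->dma_buffer);

    return start % CACHE_LINE == 0 && len % CACHE_LINE == 0;
}
```

The cost is visible in the layout: a 16-byte buffer now occupies a full line, and the padding before it is wasted as well.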
David Miller, who wrote much of the current DMA code, has a different approach. He thinks that this
kind of subtle cache issue is a trap for driver writers that should be
simply avoided altogether. Rather than come up with new ways of working
around incoherent caches, it's better to just change the rules and tell
driver writers to allocate their small DMA buffers using the "PCI pool"
interface. This interface, which was added in 2.4.4, was designed for just
this purpose: allocating small buffers for DMA. Rather than make driver
writers deal with this sort of cache coherence issue (and watch some of
them get it wrong), David would bury it in the PCI pool code. While no real
resolution has been proclaimed, this last option appears to be the likely
outcome.
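The appeal of burying the rule in an allocator can be shown with a toy, user-space pool. To be clear, this is not the kernel's PCI pool interface and does not mimic its calls; the toy_pool names, the 32-byte line size, and the fixed slot count are all made up. The point is only that callers ask for small buffers and never see the alignment and padding logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32                 /* assumed line size */
#define POOL_SLOTS 8                  /* arbitrary capacity */
#define SLOT_MAX   (2 * CACHE_LINE)   /* largest buffer this toy serves */

struct toy_pool {
    size_t stride;                    /* buffer size rounded to whole lines */
    unsigned char in_use[POOL_SLOTS];
    _Alignas(CACHE_LINE) unsigned char mem[POOL_SLOTS * SLOT_MAX];
};

/* Prepare a pool that hands out buffers of at least `size` bytes. */
int toy_pool_init(struct toy_pool *p, size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;

    if (size == 0 || rounded > SLOT_MAX)
        return -1;
    p->stride = rounded;
    for (int i = 0; i < POOL_SLOTS; i++)
        p->in_use[i] = 0;
    return 0;
}

/* Return a free buffer that shares no cache line with any other
 * buffer from this pool, or NULL if the pool is exhausted. */
void *toy_pool_alloc(struct toy_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->mem + (size_t)i * p->stride;
        }
    }
    return NULL;
}

/* Give a buffer back to the pool. */
void toy_pool_free(struct toy_pool *p, void *buf)
{
    size_t i = (size_t)((unsigned char *)buf - p->mem) / p->stride;

    assert(i < POOL_SLOTS && p->in_use[i]);
    p->in_use[i] = 0;
}
```

A driver written against an interface of this shape cannot place unrelated fields next to a DMA buffer, because it never declares the buffer inside its own structures in the first place.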
A new way of ordering kernel initialization
The Linux kernel is made up of a very large number of mostly independent
modules. In general, these modules can be linked together and initialized
(at boot time) in any order. There are cases, however, where
initialization order matters. The memory management system generally needs
to be set up early in the process, filesystems generally need a functioning
block system to be ready first, etc. Some years ago, initialization order
was handled with a big set of explicit calls in a single source file.
This big file inhibited modularization and created a clash point for
patches, and it was (mostly) eliminated some time ago.
The current scheme involves marking initialization functions with variants
of the initcall attribute. At link time, these functions are
marshalled together into a special section of the kernel executable; the
kernel finds them there at boot time and calls them all. As an added
bonus, the initialization calls can generally be flushed out of memory once
initialization is complete.
This scheme is far more modular and easy to maintain, but the
initialization order problem remains. In recent times that problem has
been handled through a combination of hardwired calls and variants on the
initcall macro. So, subsystems whose initialization calls are
marked with core_initcall are initialized before those using, say,
fs_initcall. These macros give a coarse solution to the problem,
but initialization order problems can still show up.
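The level-based scheme can be modeled in a few lines of ordinary C. The kernel gathers its initcalls through linker sections; the sketch below substitutes an explicit table so that it can run in user space, and the level names, the three init functions, and the init_log array are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Coarse ordering classes, in the spirit of core_initcall and
 * fs_initcall (the names and the set of levels are illustrative). */
enum init_level { LEVEL_CORE, LEVEL_FS, LEVEL_DEVICE, LEVEL_COUNT };

struct initcall_entry {
    enum init_level level;
    initcall_t fn;
};

#define MAX_CALLS 16
char init_log[MAX_CALLS + 1];   /* one letter per call, in call order */
size_t init_log_len;

static int note(char c)
{
    assert(init_log_len < MAX_CALLS);
    init_log[init_log_len++] = c;
    init_log[init_log_len] = '\0';
    return 0;
}

int mm_init(void)   { return note('m'); }
int vfs_init(void)  { return note('v'); }
int disk_init(void) { return note('d'); }

/* Stands in for the special section; the order is deliberately scrambled. */
const struct initcall_entry initcalls[] = {
    { LEVEL_DEVICE, disk_init },
    { LEVEL_FS,     vfs_init  },
    { LEVEL_CORE,   mm_init   },
};

/* Run every call of one level before any call of the next level. */
int do_initcalls(void)
{
    size_t n = sizeof(initcalls) / sizeof(initcalls[0]);

    for (int level = 0; level < LEVEL_COUNT; level++) {
        for (size_t i = 0; i < n; i++) {
            if ((int)initcalls[i].level != level)
                continue;
            int err = initcalls[i].fn();
            if (err)
                return err;
        }
    }
    return 0;
}
```

Two calls placed at the same level still run in whatever order the table (or, in the real kernel, the link) happens to give them, which is where the remaining ordering problems come from.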
Now Rusty Russell has posted a new mechanism
which allows kernel hackers to make initialization dependencies explicit.
If driver1 must be set up before driver2 can be
initialized, driver2 can simply mark its initialization call as:
initcall (driver2_init, driver2, init_after(driver1));
There is also an
init_before marker, of course, along with
init_as_part_of for complicated subsystems. A new
build_initcalls script has the job of sorting out the dependencies
and creating an ordered list at kernel build time. The patch looks simple
and straightforward; initialization order problems could soon be a thing of
the past.
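Sorting out explicit dependencies is a classic topological sort. The following user-space sketch shows the idea for the init_after case only; it is not Rusty's build_initcalls script, and the init_node structure, the function names, and the one-dependency-per-call limit are simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INITS 8

struct init_node {
    const char *name;    /* name of this initcall */
    const char *after;   /* call that must run first, or NULL */
};

static int find_node(const struct init_node *nodes, size_t n,
                     const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(nodes[i].name, name) == 0)
            return (int)i;
    return -1;
}

/* Depth-first visit; state is 0 = unseen, 1 = in progress, 2 = placed. */
static int visit(const struct init_node *nodes, size_t n, size_t i,
                 int *state, size_t *order, size_t *count)
{
    if (state[i] == 2)
        return 0;
    if (state[i] == 1)
        return -1;                        /* dependency cycle */
    state[i] = 1;
    if (nodes[i].after) {
        int dep = find_node(nodes, n, nodes[i].after);
        /* An unknown dependency is treated as an error here. */
        if (dep < 0 || visit(nodes, n, (size_t)dep, state, order, count) < 0)
            return -1;
    }
    state[i] = 2;
    order[(*count)++] = i;                /* placed after its dependency */
    return 0;
}

/* Fill order[] with indices into nodes[] such that every call comes
 * after the call it depends on. Returns 0 on success, -1 on error. */
int order_initcalls(const struct init_node *nodes, size_t n, size_t *order)
{
    int state[MAX_INITS] = {0};
    size_t count = 0;

    if (n > MAX_INITS)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (visit(nodes, n, i, state, order, &count) < 0)
            return -1;
    assert(count == n);
    return 0;
}
```

Doing this work at build time, as the patch does, means that a circular dependency shows up as a build failure rather than as a kernel which will not boot.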
Patches and updates
The LWN.net kernel patch ticker
Since it was easy to do with the new site: there is now
a new page where you can see the latest kernel
patches as they get fed into our system. It is currently just an
unorganized stream. We would like to hear if this feature is useful to
anybody; if so, we may develop it further.
Kernel trees
Core kernel code
- Rusty Russell: initcall dependency solution. A mechanism for ensuring that kernel subsystems get initialized in the proper order.
(June 11, 2002)
Development tools
Device drivers
- Jeff Garzik: ANN: Linux 2.2 driver compatibility toolkit. "Don't load your drivers up with 2.2.x compatibility junk. Write a 2.4.x
driver... and use this toolkit to make it work under 2.2."
(June 10, 2002)
Documentation
- Dan Aloni: On the use of typedefs. A change to the CodingStyle document laying down Linus's approach to typedefs.
(June 11, 2002)
Filesystems and block I/O
Janitorial
Kernel building
- Andrew Morton: CONFIG_NR_CPUS. Trims 240KB from the kernel on a 2-processor system.
(June 9, 2002)
Networking
Architecture-specific
Miscellaneous
- Pavel Machek: S4bios support. Suspend/resume support for the S4 BIOS.
(June 12, 2002)
Page editor: Jonathan Corbet