Kernel development

Kernel release status

The current development kernel is 2.5.51, which was released by Linus on December 9. It's a huge patch containing several hundred changesets; some of the more significant changes include a big frame buffer device merge, some memory management performance improvements, an ACPI update, various architecture updates (PPC64, S/390, x86-64, SPARC64), a reorganization of the AGP code, a Linux Security Module update, the addition of the Twofish and Serpent crypto algorithms, a new system call restart mechanism (see below), an XFS update, more driver model work, more loadable module fixes, and a long list of other fixes and updates. The long-format changelog has the details.

The current 2.5 Status Summary from Guillaume Boissiere is dated December 10. Dave Jones has released a new version of his 2.5 Changes Document, which is a comprehensive look at what has changed in this development series.

The current stable kernel is 2.4.20. Marcelo started the 2.4.21 process on December 10 with the first 2.4.21 prepatch. It includes a bunch of new IDE code, a number of driver updates, a Summit chipset support update, and, of course, a fix for the data=journal ext3 corruption bug (see below). "Test it carefully, since the new IDE code is not yet fully tested. Do not use it with critical data."

Alan Cox has released 2.4.20-ac2, which adds a number of fixes (some backported from 2.5) to the 2.4.20 kernel.

Kernel development news

The 2.4.20 ext3 corruption bug

Shortly after the release of the 2.4.20 stable kernel, word got out that there was a bug which could lead to corruption on ext3 filesystems. This particular bug will not affect all that many users: to be bitten, one must (1) use the non-default data=journal option, and (2) unmount the filesystem after making changes, but before those changes are synced to disk. Nonetheless, filesystem corruption is not a good feature to include in a stable kernel release.

2.4.20 users who wish to be protected from this bug should apply this patch from Andrew Morton. Andrew also includes some information on how the bug came to be. The trouble, it seems, comes from a longstanding confusion between two operations:

  • Flushing data to a filesystem to get it out of main memory, and

  • Fully synchronizing a filesystem to get it into a consistent, current state on disk.

The write_super() filesystem operation once performed the second operation above. A full sync, however, requires waiting for all of the I/O operations to complete. Most of the time, that is not what the kernel wants to do; it simply wants to get dirty buffers headed toward the disk sometime soon. So the ext3 write_super() method was made asynchronous, as a way of increasing performance. After another tweak went in, however, the lack of synchronization allowed the filesystem to be unmounted before the data actually made it to disk. And that, of course, led to corruption.

The solution is to properly separate the two operations. So Andrew's patch adds a new sync_fs() operation; it writes everything to the filesystem, and does not return until the job is done. With this patch in place, write_super() can be safely made into an asynchronous flush operation; kernel code which needs to be sure that everything has been written out will use sync_fs() instead.
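
As a rough illustration of the distinction, here is a small user-space model in C. It is not Andrew's patch, nor kernel code of any kind: the toy_fs structure, its counters, and every toy_ function are invented for this sketch, and the "disk" is simulated by a counter. Only the names write_super() and sync_fs() come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a filesystem's buffers; not kernel code. */
struct toy_fs {
    int dirty;      /* buffers modified in memory, no I/O started yet */
    int in_flight;  /* buffers whose write was submitted but has not completed */
    int on_disk;    /* buffers whose write has completed */
};

/* Simulate the disk finishing one submitted write. */
static void toy_complete_one(struct toy_fs *fs)
{
    assert(fs->in_flight > 0);
    fs->in_flight--;
    fs->on_disk++;
}

/* Asynchronous flush: start I/O on everything dirty, but do not wait. */
static void toy_write_super(struct toy_fs *fs)
{
    fs->in_flight += fs->dirty;
    fs->dirty = 0;
}

/* Full sync: start the I/O, then do not return until all of it is done. */
static void toy_sync_fs(struct toy_fs *fs)
{
    toy_write_super(fs);
    while (fs->in_flight > 0)
        toy_complete_one(fs);   /* stands in for sleeping on I/O completion */
}

/* Unmounting is only safe once nothing is dirty or still in flight. */
static bool toy_safe_to_unmount(const struct toy_fs *fs)
{
    return fs->dirty == 0 && fs->in_flight == 0;
}
```

In this model, calling toy_write_super() and then checking toy_safe_to_unmount() yields false while writes are still in flight, which is the window the bug exposed; after toy_sync_fs() the check succeeds.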

Andrew has also posted a version of the patch for the 2.5 kernel. It is a more extensive change (though the patch is still small) in that it tries to improve performance by getting all sync operations going before waiting for any of them.

A new system call restart mechanism

System calls often have to wait for things - I/O completion, availability of a resource, or simply for a timeout to expire, for example. Normally the process making the system call becomes unblocked at the appropriate time, and the call completes its work and returns to user space. What happens, though, if a signal is queued for the process while it is waiting? In that case, the system call needs to abort its work and allow the actual delivery of the signal. For this reason, kernel code which sleeps tends to follow the sleep with a test like:

    if (signal_pending(current))
        return -ERESTARTSYS;

After the signal has been handled, the system call will be restarted (from the beginning), and the user-space application need not deal with "interrupted system call" errors. For cases where restarting is not appropriate, a -EINTR return status will cause a (post-signal) return to user space without restarting the system call.

In general, this mechanism works reasonably well. But, what about cases where the system call should not just be restarted from the beginning? The case which raised that question is the nanosleep() system call, which puts the process to sleep for a (potentially) short time. By the POSIX standard, nanosleep() should not return early as a result of a signal if the process has no handler for that signal. So the call should be restarted. The problem is that the argument to nanosleep() tells how long the process wants to sleep - not when it wants to wake up. When the call is restarted, it must take into account how long the process had slept before the signal, and how long it took to deal with the signal, and adjust the sleep time accordingly. In other words, it should save the absolute time when the process wanted to wake up, and the restarted call should sleep until that time (or just return if the time has already passed). But there is no easy place for a system call to save that sort of information.
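
The arithmetic involved is simple enough to sketch. The two helpers below are hypothetical (they are not taken from the kernel) and work on plain nanosecond counts from some monotonic clock: the first records the absolute wakeup time once, and the second works out how much sleep is left whenever the call is restarted.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute wakeup time, computed once when the sleep is first requested. */
static int64_t wakeup_time_ns(int64_t start_ns, int64_t requested_ns)
{
    assert(requested_ns >= 0);
    return start_ns + requested_ns;
}

/*
 * Time still to be slept when the call is restarted at now_ns.
 * Returns 0 if the wakeup time has already passed, in which case
 * the restarted call should simply return.
 */
static int64_t remaining_sleep_ns(int64_t wakeup_ns, int64_t now_ns)
{
    return wakeup_ns > now_ns ? wakeup_ns - now_ns : 0;
}
```

A sleep of 500ns requested at time 1000 has a wakeup time of 1500; if the call is restarted at time 1200, 300ns remain, while a restart at time 1600 should return immediately. Restarting with the original 500ns argument instead would oversleep in both cases.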

To solve this problem, Linus added a new mechanism to the 2.5.51 kernel, based on work by George Anzinger. This mechanism allows interrupted system calls to specify a different function to run when the call is restarted, along with information to be passed to that function.

Specifically, the thread_info structure now includes a restart_block structure. A system call needing different restart behavior can put a restart handler function into that structure, along with some arguments for that function. Then, if interrupted, the system call should return -ERESTART_RESTARTBLOCK. After the signal is dispatched, and if there was no handler specified by the process (and the process still lives), the function in the restart block will be called, with the block itself as an argument.

nanosleep(), which is currently the only user of this mechanism, need only save the wakeup time in the restart block, along with pointers to the user arguments. Interrupted sleeps will now be handled properly. It is not clear how many other system calls will make use of the new restart system; in most cases it is better to just return -EINTR in complicated situations. But, for cases where you really need to see the operation through, the new mechanism should help.
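
The flow can be modeled in user space. What follows is a simplified sketch, not the kernel's implementation: the toy_restart_block layout, the TOY_RESTART_WITH_BLOCK code, the simulated clock and signal, and all toy_ functions are assumptions made for illustration. The loop in toy_syscall_with_restart() stands in for the kernel's signal-return path in the case where the process has no handler for the signal.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RESTART_WITH_BLOCK 1   /* hypothetical "restart via block" code */

/* Hypothetical model of a restart block: a handler plus saved state. */
struct toy_restart_block {
    long (*fn)(struct toy_restart_block *);
    int64_t wakeup_ns;             /* saved absolute wakeup time */
};

static int64_t toy_clock_ns;           /* simulated current time */
static int64_t toy_signal_at_ns = -1;  /* simulated signal arrival, -1 = none */

/* Restart handler: sleep until the saved wakeup time, or until a signal. */
static long toy_nanosleep_restart(struct toy_restart_block *rb)
{
    if (toy_signal_at_ns >= toy_clock_ns && toy_signal_at_ns < rb->wakeup_ns) {
        toy_clock_ns = toy_signal_at_ns;   /* woken early by the signal */
        toy_signal_at_ns = -1;             /* the signal is consumed */
        return -TOY_RESTART_WITH_BLOCK;
    }
    if (toy_clock_ns < rb->wakeup_ns)
        toy_clock_ns = rb->wakeup_ns;      /* slept the full remaining time */
    return 0;
}

/* First entry: save the handler and the absolute wakeup time, then sleep. */
static long toy_nanosleep(struct toy_restart_block *rb, int64_t duration_ns)
{
    assert(duration_ns >= 0);
    rb->fn = toy_nanosleep_restart;
    rb->wakeup_ns = toy_clock_ns + duration_ns;
    return toy_nanosleep_restart(rb);
}

/* Stand-in for the signal-return path when no handler is installed. */
static long toy_syscall_with_restart(struct toy_restart_block *rb,
                                     int64_t duration_ns,
                                     int64_t signal_cost_ns)
{
    long ret = toy_nanosleep(rb, duration_ns);

    while (ret == -TOY_RESTART_WITH_BLOCK) {
        toy_clock_ns += signal_cost_ns;    /* time spent on the signal */
        ret = rb->fn(rb);                  /* restart via the saved handler */
    }
    return ret;
}
```

With the clock at 0, a signal arriving at 300, and 50ns spent dealing with it, a 1000ns sleep in this model still ends at time 1000 rather than 1350, because the restart handler sleeps until the saved wakeup time instead of starting over with the original argument.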

Shrinking the x86 stack

The kernel stack on x86 systems is two pages - 8KB - in length. This stack area exists for every process on the system; one can easily see that, in a system with a large number of processes, the amount of memory given over to stacks could get large. This memory is unpageable kernel memory; it also requires an "order one" (two page) allocation for every new process. As memory becomes fragmented, multi-page allocations get harder to satisfy, and creation of new processes can fail. So there are plenty of reasons for wanting to reduce the size of the kernel stack.
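
To put numbers on that, the hypothetical helper below (not kernel code) computes the memory tied up in kernel stacks for a given process count and allocation order, assuming the 4KB x86 page size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* x86 page size, in bytes */

/*
 * Bytes of unpageable memory consumed by kernel stacks when each
 * process gets one allocation of 2^order contiguous pages.
 */
static size_t kernel_stack_bytes(size_t nr_processes, unsigned int order)
{
    assert(order < 8);        /* keep the shift well within range */
    return nr_processes * ((size_t)TOY_PAGE_SIZE << order);
}
```

For 10,000 processes, order-one stacks take 81,920,000 bytes (about 78MB); order-zero, single-page stacks would halve that to 40,960,000 bytes.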

Dave Hansen has posted a patch (originally by Ben LaHaise) which cuts the per-process kernel stack down to a single page. To accomplish that, this patch must do a few things:

  • One, of course, is to provide an option to use the smaller stack. Since there is a very real possibility of overflowing the reduced stack, this option will not be for everybody - at least, until all of the overflows have been found.

  • To help in finding those overflows, the patch includes a debugging option which uses the gcc profiling mechanism to regularly check the state of the stack. If it gets more than half full, a warning is emitted; should the stack overflow, the system will panic immediately. Or, almost immediately - it switches to an "overflow stack" first to give the panic code room to operate in.

  • Interrupt handling puts its own demands on the kernel stack. But handling of interrupts has nothing to do with any particular process, so there is no real need to use a per-process kernel stack. The patch thus sets up a separate, per-CPU stack which is used only for interrupts. Switching stacks when an interrupt happens is easy enough; the only tricky part is copying some information that the rest of the kernel expects to find on the stack - the preempt count and task pointer - when switching from one stack to another.

    Having a separate, per-CPU interrupt stack can also give a small boost to performance through better cache behavior.
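
The sort of check the debugging option performs can be sketched as follows. This is a user-space illustration under stated assumptions, not code from the patch: it assumes the stack is a power-of-two-sized block aligned to its own size, that it grows downward, that the lowest "reserved" bytes hold per-thread data which must not be overwritten, and that sp points inside the block. The toy_ names and the exact thresholds are invented here.

```c
#include <assert.h>
#include <stdint.h>

enum toy_stack_state { TOY_STACK_OK, TOY_STACK_WARN, TOY_STACK_OVERFLOW };

/*
 * Classify stack usage from a stack pointer value.
 * sp is assumed to lie within [base, base + stack_size), where base is
 * aligned to stack_size; the stack grows down toward base.
 */
static enum toy_stack_state toy_check_stack(uintptr_t sp,
                                            uintptr_t stack_size,
                                            uintptr_t reserved)
{
    /* stack_size must be a nonzero power of two for the mask to work */
    assert(stack_size != 0 && (stack_size & (stack_size - 1)) == 0);
    assert(reserved < stack_size);

    uintptr_t offset = sp & (stack_size - 1);   /* distance of sp above base */
    uintptr_t used = stack_size - offset;       /* bytes between sp and the top */
    uintptr_t usable = stack_size - reserved;

    if (offset <= reserved)
        return TOY_STACK_OVERFLOW;   /* sp has reached the reserved area */
    if (used > usable / 2)
        return TOY_STACK_WARN;       /* more than half of the usable stack */
    return TOY_STACK_OK;
}
```

Because the block is aligned to its size, masking the stack pointer is enough to find its offset within the stack; no pointer to the stack's base needs to be kept around for the check.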

This patch does not try to address the problem of kernel code which puts large variables on the stack. Heavy stack usage has always been considered poor form, but there are still kernel functions which do it. A smaller kernel stack would, undoubtedly, increase interest in fixing those functions.

A variant of the smaller-stack patch has been circulated before, but Linus has not commented on it. It is not clear whether this patch, at this time, would pass the "feature freeze" test. The idea probably makes enough sense to be integrated at some point, however, whether in this development series or the next.

Patches and updates

Architecture-specific

  • Jeff Dike: uml-patch-2.5.50-1. "NOTE: I get reproducable filesystem corruption with this version. Offhand, it doesn't look like my fault, so I'm releasing it anyway." (December 7, 2002)

Benchmarks and bugs

  • Con Kolivas: 2.4.20-aa1. (Contest benchmark result). (December 5, 2002)

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds