Kernel release status
The current development kernel is 2.5.51, which was
released by Linus on December 9. It's a
huge patch containing several hundred changesets; some of the more
significant changes include a big frame buffer device merge, some memory
management performance improvements, an ACPI update, various architecture
updates (PPC64, S/390, x86-64, SPARC64), a reorganization of the AGP code,
a Linux Security Module update, the addition of the Twofish and Serpent
crypto algorithms, a new system call restart mechanism (see below), an XFS
update, more driver model work, more loadable module fixes, and a long list
of other fixes and updates.
The long-format
changelog has the details.
The current 2.5 Status Summary from
Guillaume Boissiere is dated December 10. Dave Jones has released a
new version of his 2.5 Changes Document,
which is a comprehensive look at what has changed in this development
series.
The current stable kernel is 2.4.20. Marcelo started the 2.4.21
process on December 10 with the first
2.4.21 prepatch. It includes a bunch of new IDE code, a number of driver updates,
a Summit chipset support update, and, of course, a fix for the
data=journal ext3 corruption bug (see below). The announcement carries a
warning: "Test it carefully, since the new IDE code is not yet fully tested.
Do not use it with critical data."
Alan Cox has released 2.4.20-ac2, which adds
a number of fixes (some backported from 2.5) to the 2.4.20 kernel.
Kernel development news
The 2.4.20 ext3 corruption bug
Shortly after the release of the 2.4.20 stable kernel, word got out that
there was a bug which could lead to corruption on ext3 filesystems. This
particular bug will not affect all that many users: to be bitten, one must
(1) use the non-default
data=journal option, and
(2) unmount the filesystem after making changes, but before those
changes are synced to disk. Nonetheless, filesystem corruption is not a
good feature to include in a stable kernel release.
2.4.20 users who wish to be protected from this bug should apply this patch from Andrew Morton. Andrew also
includes some information on how the bug came to be.
The trouble, it seems, comes from a longstanding confusion between two operations:
- Flushing data to a filesystem to get it out of main memory, and
- Fully synchronizing a filesystem to get it into a consistent, current
state on disk.
The write_super() filesystem operation once performed the second
operation above. A full sync, however, requires waiting for all of the I/O
operations to complete. Most of the time, that is not what the kernel
wants to do; it simply wants to get dirty buffers headed toward the disk
sometime soon. So the ext3 write_super() method was made
asynchronous, as a way of increasing performance. After another tweak went
in, however, the lack of synchronization allowed the filesystem to be
unmounted before the data actually made it to disk. And that, of course,
led to corruption.
The solution is to properly separate the two operations. So Andrew's patch
adds a new sync_fs() operation; it writes everything to the
filesystem, and does not return until the job is done. With this patch in
place, write_super() can be safely made into an asynchronous flush
operation; kernel code which needs to be sure that everything has been
written out will use sync_fs() instead.
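The difference between the two operations can be illustrated with a small user-space model. This is only a sketch, not kernel code: struct toy_fs and all of the toy_*() functions are hypothetical names invented here, and "I/O" is simulated with simple counters.

```c
#include <assert.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_fs {
    int dirty;      /* buffers modified in memory, not yet submitted */
    int in_flight;  /* buffers submitted to the "disk", not yet complete */
    int on_disk;    /* buffers safely written */
};

/* Asynchronous flush: start I/O on every dirty buffer, return at once. */
static void toy_write_super(struct toy_fs *fs)
{
    assert(fs->dirty >= 0);
    fs->in_flight += fs->dirty;
    fs->dirty = 0;
}

/* Simulate the disk finishing one outstanding write. */
static void toy_io_complete(struct toy_fs *fs)
{
    if (fs->in_flight > 0) {
        fs->in_flight--;
        fs->on_disk++;
    }
}

/* Full sync: submit everything, then wait until nothing is outstanding. */
static void toy_sync_fs(struct toy_fs *fs)
{
    toy_write_super(fs);
    while (fs->in_flight > 0)
        toy_io_complete(fs);    /* stands in for waiting on real I/O */
}

/* Unmounting is only safe when nothing is dirty or still in flight. */
static int toy_safe_to_unmount(const struct toy_fs *fs)
{
    return fs->dirty == 0 && fs->in_flight == 0;
}
```

In this model, calling toy_write_super() alone leaves buffers in flight, so toy_safe_to_unmount() reports failure; only toy_sync_fs() guarantees a safe state on return. That gap is the one the corruption bug fell into.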
Andrew has also posted a version of the
patch for the 2.5 kernel. It is a more extensive change (though the
patch is still small) in that it tries to improve performance by getting
all sync operations going before waiting for any of them.
A new system call restart mechanism
System calls often have to wait for things - I/O completion, availability
of a resource, or simply for a timeout to expire, for example. Normally
the process
making the system call becomes unblocked at the appropriate time, and the
call completes its work and returns to user space. What happens, though,
if a signal is queued for the process while it is waiting? In that case,
the system call needs to abort its work and allow the actual delivery of
the signal. For this reason, kernel code which sleeps tends to follow the
sleep with a test like:
    if (signal_pending(current))
        return -ERESTARTSYS;
After the signal has been handled, the system call will be restarted (from
the beginning), and the user-space application need not deal with
"interrupted system call" errors. For cases where restarting is not
appropriate, a -EINTR return status will cause a (post-signal)
return to user space without restarting the system call.
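For comparison, here is roughly what a user-space program has to do when a call does come back with EINTR instead of being restarted by the kernel. This is a minimal sketch; read_retry() is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * read_retry() is a hypothetical helper, not a library function.
 * It hides EINTR from the caller by reissuing the read() by hand,
 * which is the work the kernel's restart mechanism otherwise does.
 */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);   /* interrupted: try again */

    return n;   /* bytes read, 0 at end of file, or -1 on a real error */
}
```

Every blocking call in an application would need a wrapper of this kind, which is why transparent restarting in the kernel is the more convenient default.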
In general, this mechanism works reasonably well. But what about cases
where the system call should not just be restarted from the beginning? The
case which raised that question is the nanosleep() system call,
which puts the process to sleep for a (potentially) short time. By the
POSIX standard, nanosleep() should not return early as a result of
a signal if the process has no handler for that signal. So the call
should be restarted. The problem is that the argument to
nanosleep() tells how long the process wants to sleep - not when
it wants to wake up. When the call is restarted, it must take into account
how long the process had slept before the signal, and how long it took to
deal with the signal, and adjust the sleep time accordingly. In other
words, it should save the absolute time when the process wanted to wake up,
and the restarted call should sleep until that time (or just return if the
time has already passed). But there is no easy place for a system call to
save that sort of information.
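The arithmetic involved can be sketched in user-space C. The helper names wakeup_deadline() and remaining_sleep() are hypothetical, chosen for this illustration; the kernel's own implementation is not shown here and may well look different.

```c
#include <assert.h>
#include <time.h>

#define TOY_NSEC_PER_SEC 1000000000L

/* Absolute wakeup time = time of the original call + requested interval. */
static struct timespec wakeup_deadline(struct timespec now, struct timespec req)
{
    struct timespec d;

    assert(req.tv_nsec >= 0 && req.tv_nsec < TOY_NSEC_PER_SEC);
    d.tv_sec = now.tv_sec + req.tv_sec;
    d.tv_nsec = now.tv_nsec + req.tv_nsec;
    if (d.tv_nsec >= TOY_NSEC_PER_SEC) {    /* carry into the seconds */
        d.tv_sec++;
        d.tv_nsec -= TOY_NSEC_PER_SEC;
    }
    return d;
}

/* Time left to sleep at restart; zero if the deadline has already passed. */
static struct timespec remaining_sleep(struct timespec deadline,
                                       struct timespec now)
{
    struct timespec r = { 0 };

    if (now.tv_sec > deadline.tv_sec ||
        (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec))
        return r;                           /* nothing left to wait for */

    r.tv_sec = deadline.tv_sec - now.tv_sec;
    r.tv_nsec = deadline.tv_nsec - now.tv_nsec;
    if (r.tv_nsec < 0) {                    /* borrow from the seconds */
        r.tv_sec--;
        r.tv_nsec += TOY_NSEC_PER_SEC;
    }
    return r;
}
```

The deadline is computed once, when the call is first made; a restarted call only needs the saved deadline and the current time, so time spent sleeping and time spent handling the signal are both accounted for automatically.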
To solve this problem, Linus added a new
mechanism to the 2.5.51 kernel, based on work by George Anzinger. This
mechanism allows interrupted system calls to specify a different function
to run when the call is restarted, along with information to be passed to
that function.
Specifically, the thread_info structure now includes a
restart_block structure. A system call needing different restart
behavior can put a restart handler function into that structure, along with
some arguments for that function. Then, if interrupted, the system call
should return -ERESTART_RESTARTBLOCK. After the signal is
dispatched, and if there was no handler specified by the process (and the
process still lives), the function in the restart block will be called,
with the block itself as an argument.
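The flow just described can be modeled in a few lines of user-space C. Everything in this sketch is hypothetical: the toy_* names, the single arg0 field, the error-code values, and the integer "clock" are all invented for illustration and are not the kernel's definitions. The sketch also assumes that, when the process does have a handler for the signal, the call simply fails with an EINTR-like error.

```c
#include <assert.h>

#define TOY_EINTR          4    /* arbitrary value: give up, report error */
#define TOY_RESTART_BLOCK  516  /* arbitrary value: "call my restart function" */

struct toy_restart_block {
    long (*fn)(struct toy_restart_block *);  /* handler to run on restart */
    long arg0;                               /* handler-private data */
};

struct toy_thread {
    struct toy_restart_block restart;
    long clock;                              /* simulated current time */
};

static struct toy_thread toy_current;  /* stands in for the current thread */

/* Restart handler: sleep only until the saved absolute wakeup time. */
static long toy_sleep_restart(struct toy_restart_block *rb)
{
    long wakeup = rb->arg0;

    if (toy_current.clock < wakeup)
        toy_current.clock = wakeup;    /* "sleep" the remaining time */
    return 0;
}

/*
 * Sleep for `interval` ticks.  If interrupted_after is in [0, interval),
 * a signal arrives after that many ticks and the call bails out early.
 */
static long toy_sleep(long interval, long interrupted_after)
{
    long wakeup = toy_current.clock + interval;

    assert(interval >= 0);
    if (interrupted_after >= 0 && interrupted_after < interval) {
        toy_current.clock += interrupted_after;
        toy_current.restart.fn = toy_sleep_restart;
        toy_current.restart.arg0 = wakeup;   /* save the absolute deadline */
        return -TOY_RESTART_BLOCK;
    }
    toy_current.clock = wakeup;
    return 0;
}

/* What happens once the signal has been dispatched. */
static long toy_after_signal(long ret, int has_handler, long signal_cost)
{
    toy_current.clock += signal_cost;        /* dispatch took some time */
    if (ret != -TOY_RESTART_BLOCK)
        return ret;
    if (has_handler)
        return -TOY_EINTR;                   /* assumption, see above */
    return toy_current.restart.fn(&toy_current.restart);
}
```

Starting at tick 100, a 50-tick sleep interrupted after 20 ticks by a signal costing 5 ticks to dispatch resumes at tick 125 and still wakes at tick 150, because the restart function works from the saved deadline rather than the original interval.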
nanosleep(), which is currently the only user of this mechanism,
need only save the wakeup time in the restart block, along with pointers to
the user arguments. Interrupted sleeps will now be handled properly. It
is not clear how many other system calls will make use of the new restart
system; in complicated situations it is usually better to just return
-EINTR. But for cases where you really need to see the
operation through, the new mechanism should help.
Shrinking the x86 stack
The kernel stack on x86 systems is two pages - 8KB - in length. This stack
area exists for every process on the system; one can easily see that, in a
system with a large number of processes, the amount of memory given over to
stacks could get large. This memory is unpageable kernel memory; it also
requires an "order one" (two page) allocation for every new process. As
memory becomes fragmented, multi-page allocations get harder to satisfy,
and creation of new processes can fail. So there are plenty of reasons for
wanting to reduce the size of the kernel stack.
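To put rough numbers on this, the following sketch computes the memory consumed by kernel stacks and the allocation order needed for each one. The function names are hypothetical, and the process count used in the note after the code is an arbitrary example, not a measurement.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL    /* x86 page size in bytes (8KB / 2 pages) */

/* Bytes of unpageable memory consumed by per-process kernel stacks. */
static unsigned long stack_bytes(unsigned long nr_processes,
                                 unsigned int pages_per_stack)
{
    assert(pages_per_stack > 0);
    return nr_processes * pages_per_stack * TOY_PAGE_SIZE;
}

/* Allocation "order": smallest n such that 2^n pages cover the request. */
static unsigned int alloc_order(unsigned int pages)
{
    unsigned int order = 0;

    assert(pages > 0);
    while ((1U << order) < pages)
        order++;
    return order;
}
```

For a hypothetical system with 10,000 processes, stack_bytes(10000, 2) gives 81,920,000 bytes of stack with two-page stacks, against 40,960,000 bytes with single-page stacks. alloc_order(2) is 1 and alloc_order(1) is 0, so a one-page stack also avoids the need to find two physically contiguous pages.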
Dave Hansen has posted a patch (originally
by Ben LaHaise) which cuts the per-process kernel stack down to a single
page. To accomplish that, this patch must do a few things:
This patch does not try to address the problem of kernel code which puts
large variables on the stack. Heavy stack usage has always been considered
poor form, but there are still kernel functions which do it. A smaller
kernel stack would, undoubtedly, increase interest in fixing those
functions.
A variant of the smaller-stack patch has been circulated before, but Linus
has not commented on it. It is not clear whether this patch, at this time,
would pass the "feature freeze" test. The idea probably makes enough sense
to be integrated at some point, however, whether in this development series
or the next.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
- Jeff Dike: uml-patch-2.5.50-1. "NOTE: I get reproducable filesystem corruption with this version. Offhand,
it doesn't look like my fault, so I'm releasing it anyway."
(December 7, 2002)
Security-related
Benchmarks and bugs
- Con Kolivas: 2.4.20-aa1. (Contest benchmark result).
(December 5, 2002)
Miscellaneous
Page editor: Jonathan Corbet