The current 2.6 prepatch is 2.6.8-rc3, which was announced by Linus on
August 3. Most of
the additions this time around are relatively small fixes; they include
some kbuild work, a great many "sparse" annotations, the removal of the
(non-functional) "fastroute" networking option, some crypto-API work
(including an x86-optimized AES implementation which may be yanked out due
to licensing concerns), and several architecture updates. The long-format changelog
has all the details.
Linus's BitKeeper repository contains no patches after 2.6.8-rc3 as of this
writing.
The current patch set from Andrew Morton is 2.6.8-rc2-mm2. Recent additions to -mm include
some read-copy-update work (to address more latency issues), performance
improvements for O_SYNC disk I/O, the staircase CPU scheduler (see
below), token-based thrashing control (see below again), a change to
/dev/mem allowing architectures to block accesses to kernel memory,
a new vprintk() function, and large numbers of fixes and
The current 2.4 prepatch is 2.4.27-rc5, released by Marcelo on August 3. This one
contains a fix for a new
security issue (which could allow unprivileged processes to read kernel
memory), takes out DVD-RW support for now (they will try again in 2.4.28),
and adds a few fixes.
Kernel development news
Con Kolivas has been working on his staircase scheduler patch for a while;
it was covered here previously. That scheduler found its way into the
2.6.8-rc2-mm2 patch, along with this comment from Andrew Morton:
This will probably have to come out again because various people
are still fiddling with the CPU scheduler. But my feeling here is
that the current 1st-gen CPU scheduler has been tweaked as far as
it can go and is still not 100% right. It is time to start
thinking about a new design which addresses the requirements and
current problems by algorithmic means rather than by tweaking.
So it would seem that it is now open season for scheduler work.
Initial reports on the staircase scheduler are generally - but not
uniformly - good. Martin Bligh posted some
benchmark results showing some significant performance improvements for
the 2.6.8-rc2-mm2 kernel, especially for "low to mid loads." Ingo Molnar,
instead, has found a workload which
performs poorly with this scheduler; it involves running multiple processes
each of which wants most, but not all, of the CPU.
Con, meanwhile, has posted a couple of further patches implementing
additional policies in the staircase scheduler. SCHED_BATCH is another attempt at an "idle
process" mode, where batch processes only run if nothing else wants the
processor. This patch attempts to avoid priority inversion problems by
scheduling SCHED_BATCH processes at normal priority when they are
running in kernel mode.
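The SCHED_BATCH rule described above can be sketched in a few lines of C. This is only an illustration of the policy as summarized here, not code from Con's patch; the task structure, the policy names, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy values, for illustration only. */
enum policy { POLICY_NORMAL, POLICY_BATCH };

/* Hypothetical, much-reduced task descriptor. */
struct task {
    enum policy policy;   /* policy requested for the task */
    bool in_kernel;       /* true while the task runs kernel code */
    bool runnable;        /* true if the task wants the CPU */
};

/*
 * A batch task running in kernel mode is treated as a normal task, so it
 * cannot be starved while it may be holding a kernel resource that a
 * higher-priority task needs (the priority inversion case).
 */
enum policy effective_policy(const struct task *t)
{
    if (t->policy == POLICY_BATCH && t->in_kernel)
        return POLICY_NORMAL;
    return t->policy;
}

/*
 * Return the index of the task to run next, or -1 if nothing is runnable.
 * A task which is still "batch" after the adjustment above is chosen only
 * when no effectively-normal task wants the processor.
 */
int pick_next(const struct task *tasks, size_t n)
{
    int batch_candidate = -1;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (effective_policy(&tasks[i]) == POLICY_NORMAL)
            return (int)i;
        if (batch_candidate < 0)
            batch_candidate = (int)i;
    }
    return batch_candidate;
}
```

A real scheduler would, of course, also weigh priorities and time slices within each class; the sketch only shows the "run batch work last, except in kernel mode" decision.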
SCHED_ISO, instead, is an "isochronous" mode
intended for applications which need soft real-time response. Putting a
process into SCHED_ISO is an unprivileged operation; any user can
do it. Isochronous tasks start out with a relatively high priority, and
should get scheduled quickly. Their allocated time slices are half of what
they would otherwise be, however, and their priority drops especially quickly with CPU
usage. So this mode is suitable for I/O bound processes which need to
respond quickly (audio recording, CD burning, etc.), but it should not
allow a hostile user to take over the system.
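The trade-off made by SCHED_ISO (a high starting priority paid for with short slices and a fast priority decay) might look roughly like the following. Every number here is an assumption chosen for illustration; the article gives only the "half the usual slice" figure, and the function names are hypothetical.

```c
#include <assert.h>

/* All constants below are assumed values, not taken from the patch. */
#define BASE_SLICE_MS      100  /* slice a normal task would get */
#define ISO_START_PRIO       0  /* 0 is the best priority in this sketch */
#define NORMAL_START_PRIO   10
#define WORST_PRIO          39
#define ISO_DECAY_STEP       4  /* priority levels lost per slice used */
#define NORMAL_DECAY_STEP    1

enum policy { POLICY_NORMAL, POLICY_ISO };

/* Isochronous tasks get half of the slice they would otherwise have. */
int time_slice_ms(enum policy p)
{
    return p == POLICY_ISO ? BASE_SLICE_MS / 2 : BASE_SLICE_MS;
}

/*
 * Priority after a task has burned through slices_used full time slices.
 * Larger numbers mean worse priority; the result is clamped to WORST_PRIO.
 */
int priority_after(enum policy p, int slices_used)
{
    assert(slices_used >= 0);
    int start = (p == POLICY_ISO) ? ISO_START_PRIO : NORMAL_START_PRIO;
    int step = (p == POLICY_ISO) ? ISO_DECAY_STEP : NORMAL_DECAY_STEP;
    int prio = start + step * slices_used;

    return prio > WORST_PRIO ? WORST_PRIO : prio;
}
```

With these made-up numbers, an isochronous task starts ahead of a normal one but falls behind it after four slices of continuous CPU use, which is the property that keeps a CPU hog from exploiting the mode.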
Peter Williams has been working on a different set of scheduler patches.
His approach is to get rid of the "expired" array (where processes go to
languish when they have used up their time slices) and move everything to a
single array. The patch offers two modes: the traditional
priority-based mode and a new "entitlement" mode which tries to figure out how
much processor time each task is entitled to, then works to ensure that
each is given at least that much time. His patches are available in a dizzying number of varieties; they seem to
have seen less testing so far, but Andrew has said that one of them might
get a turn in -mm for a while.
Nick Piggin's -np trees
also contain a new scheduler. Nick's work tries to simplify many of the
scheduler calculations while retaining logic which tries to evaluate the
"interactivity" of each process. Unlike some implementations, this
scheduler gives longer time slices to higher-priority processes. All slices
are scaled depending on the job mix, however; low-priority processes will
get longer slices if there are no high-priority processes around.
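One way to picture slices which scale with the job mix is shown below. The formula is an assumption made up for this sketch (Nick's actual calculation is not described here), as are the constants and the function name; it simply gives the best runnable priority the full slice and everything else a proportionally shorter one.

```c
#include <assert.h>

/* Assumed values for illustration; not taken from the -np patches. */
#define NUM_PRIOS     40   /* priorities 0 (best) through 39 (worst) */
#define MAX_SLICE_MS 100
#define MIN_SLICE_MS   1

/*
 * Slice for a task at priority prio when the best priority among all
 * runnable tasks is best_runnable_prio. Higher-priority tasks get longer
 * slices, and a low-priority task gets the full slice when nothing better
 * is runnable.
 */
int scaled_slice_ms(int prio, int best_runnable_prio)
{
    assert(0 <= best_runnable_prio);
    assert(best_runnable_prio <= prio && prio < NUM_PRIOS);

    int weight = NUM_PRIOS - prio;
    int best_weight = NUM_PRIOS - best_runnable_prio;
    int slice = MAX_SLICE_MS * weight / best_weight;

    return slice < MIN_SLICE_MS ? MIN_SLICE_MS : slice;
}
```

In this model a priority-30 task gets a quarter of the maximum slice while a priority-0 task is runnable, and the whole slice once it is the most important thing on the system.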
Ingo Molnar has continued his work on voluntary preemption; his voluntary-preempt-2.6.8-rc2-O2 patch features a
new implementation of the interrupt threads feature. The available reports
indicate that, with this patch, latency problems in the 2.6 kernel are
becoming few and far between.
There is no way to tell, at this point, which of these scheduler approaches
- if any - will find its way into the mainline kernel. Evaluating
schedulers takes a long time, and, for any given scheduler, there always
seems to be some strange workload out there which makes it fall apart. The
approaches described above (with the exception of voluntary preemption)
share one nice feature, however, which is likely to argue in favor of
including one of them: they all remove a significant amount of code and
make the scheduler simpler and easier to understand. That, in and of
itself, may be a worthwhile step toward the implementation of a top-quality
scheduler.
A system which is in the throes of VM thrashing is no fun to work with.
The kernel is forever throwing out pages which it will need in the near
future in favor of pages needed right now, and little work actually gets
done. It seems like there has to be a better way.
Rik van Riel has put together a patch based
on the work of Song Jiang which might help. The basic idea is that a
process which is currently bringing in pages should, for a short period,
not have its other pages booted out to swap. With luck, that process will
actually make some progress during that grace period before the VM grim
reaper swoops down and consigns it, once again, to the swap ghetto.
Clearly, not all processes which are bringing in pages can be sheltered
from page reclamation at the same time; if they could, the system would not
be thrashing in the first place. This problem is addressed through the
creation of a "swap token." A process holding the swap token will be
allowed to bring in pages without having its current working set molested
for a period of time. After a while, the token is passed on to the next
process.
In Rik's patch, the (single, system-wide) token is implemented through
swap_token_mm, a pointer to the mm structure of the
process holding the token. If the kernel, on behalf of a process incurring
a page fault, decides that the token is available, swap_token_mm
will be set and the faulting process will get its breathing space for a
while. The token is deemed to be available if (1) it has been held
for longer than the maximum period, which is set to a surprisingly long 300
seconds, or (2) the process holding the token has not incurred any page
faults recently. Once the token becomes available, the first process which
comes looking for it will grab it - unless it has held the token in the
Rik's tests show some performance improvements with this patch applied.
There are potential improvements which could be made, such as trying to add
some intelligence to the decision of which process gets the token. A huge
process may hold the token for some time, grow to fill much of memory, and
still not have enough to get any real work done. Meanwhile, small
processes which could have benefited from a few extra pages continue to
thrash. Some tweaks could be made to the patch to address this issue, but
there are limits to how much code and complexity should be added to the
kernel to deal with a (hopefully) rare situation.
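The token hand-off rules described above can be modeled in user space as follows. This is a sketch under stated assumptions rather than Rik's code: the mm structure is a one-field stand-in, the "recently" threshold is an invented value, an unheld token is assumed to be available, and the rule about processes which held the token before is left out. Only the swap_token_mm name and the 300-second limit come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOKEN_MAX_HOLD_SECS 300  /* maximum hold time given in the article */
#define FAULT_RECENT_SECS     2  /* assumed meaning of "recently" */

/* Stand-in for the kernel's mm structure; only what the sketch needs. */
struct mm {
    long last_fault;             /* time of the most recent page fault */
};

static struct mm *swap_token_mm; /* current holder, NULL if nobody */
static long token_acquired;      /* time at which the holder got the token */

/* The token is up for grabs if unheld, held too long, or the holder is idle. */
bool token_available(long now)
{
    if (swap_token_mm == NULL)   /* assumption: an unheld token is free */
        return true;
    if (now - token_acquired > TOKEN_MAX_HOLD_SECS)
        return true;
    if (now - swap_token_mm->last_fault > FAULT_RECENT_SECS)
        return true;
    return false;
}

/*
 * Called when the process owning mm takes a page fault at time now.
 * Returns true if that process holds the token afterward.
 */
bool fault_and_try_grab(struct mm *mm, long now)
{
    assert(mm != NULL);
    mm->last_fault = now;
    if (mm == swap_token_mm)
        return true;
    if (!token_available(now))
        return false;
    swap_token_mm = mm;
    token_acquired = now;
    return true;
}
```

A page reclaim path would then skip over pages belonging to swap_token_mm; that part is not shown.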
A number of interesting kernel patches have been posted in recent times.
Since your editor is pressed for time, a few of those patches will be
quickly covered here.
Nigel Cunningham has been working at getting some small pieces of his
software suspend implementation into the kernel. One of those pieces is this patch, which has to do with the "freezing"
of kernel threads prior to suspending the system. As processes are put on
hold, the kernel risks stopping a process which is needed later on in the
suspend process; think about a process handling NFS service or software
interrupts, for example. To avoid this situation, kernel threads are
simply not frozen. But many of them can be, and that would make the
suspend process more robust. So Nigel's patch goes through and tries to
set up each thread with the appropriate flags, so that only truly necessary
kernel threads continue to run while the system is being suspended.
A number of these threads, it turns out, are part of a workqueue. As a way
of setting up every workqueue process with the right flags, Nigel changed
the interface to create_workqueue() and
create_singlethread_workqueue(), thus breaking all code which
creates its own workqueues. Andrew Morton expressed some discomfort at the API change,
but acknowledged that it was useful in that it forces people to think about
whether every workqueue needs to run during a system suspend operation or
not. This patch has not yet appeared in -mm, as of this writing.
Rik van Riel and Arjan van de Ven have put together a new patch which allows normal users to lock
memory into physical RAM without root privilege. The
RLIMIT_MEMLOCK resource limit puts an upper bound on how much
memory can be locked, and its default value is zero. By raising this
limit, system administrators can enable users to lock a single page (useful
for cryptographic applications which do not want to see passphrases and
clear text swapped to disk) or larger amounts (for CD writing tasks, for
example). Various issues were raised regarding the security of this patch,
but the latest version appears to have resolved them. This code should
eventually replace the magic "mlock group" hack that was covered here last May.
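From the application side, using this feature comes down to the standard getrlimit() and mlock() calls. The helper names below are hypothetical, and whether the lock succeeds for an unprivileged user depends on the kernel in use and on the limit the administrator has set; the sketch simply reports failure in that case.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS with strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Fetch the soft RLIMIT_MEMLOCK limit in bytes. Returns 0 on success. */
int memlock_soft_limit(rlim_t *limit)
{
    struct rlimit rl;

    assert(limit != NULL);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *limit = rl.rlim_cur;
    return 0;
}

/*
 * Map one anonymous page and try to lock it into RAM, as a program might
 * do before reading a passphrase. Returns the page, or NULL if either the
 * mapping or the lock fails (for example, because the limit is too low).
 */
void *alloc_locked_page(size_t *len)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(len != NULL);
    if (page <= 0)
        return NULL;

    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mlock(p, (size_t)page) != 0) {
        munmap(p, (size_t)page);
        return NULL;
    }
    *len = (size_t)page;
    return p;
}

/* Unlock and unmap a page obtained from alloc_locked_page(). */
void free_locked_page(void *p, size_t len)
{
    munlock(p, len);
    munmap(p, len);
}
```

A cryptographic application would keep its secrets in the returned page and clear them before calling free_locked_page().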
Fistgen 0.1 has been released; this is the
first version for the 2.6 kernel. The announcement describes fistgen as "a
package of stackable templates," which may not be particularly illuminating
to many readers. More information can be found at filesystems.org; one
developer calls it "a yacc for filesystems." Using fistgen and a small
amount of code, a set of filters can be set up to create a filesystem with
a given set of characteristics. For example, this template describes a filesystem which
encrypts data using the sophisticated "rot13" algorithm. The fistgen
parser reads the template file and generates C code implementing the
filesystem, which can then be loaded into the kernel.
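For the curious, the transformation such a filter applies to each buffer is tiny. The function below is a plain C rendition of rot13 with a hypothetical name; it is neither the fistgen template nor the code fistgen generates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rotate every ASCII letter in buf by 13 places, leaving all other bytes
 * alone. Applying the function twice restores the original data, so the
 * same routine serves for both "encryption" and "decryption".
 */
void rot13_buf(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];

        if (c >= 'a' && c <= 'z')
            buf[i] = (char)('a' + (c - 'a' + 13) % 26);
        else if (c >= 'A' && c <= 'Z')
            buf[i] = (char)('A' + (c - 'A' + 13) % 26);
    }
}
```

A stackable filesystem would call something like this on data passing down to the lower filesystem on write, and again on data coming back up on read.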
John McCutchan has been working on his inotify
patch for some time. Inotify is meant to be a replacement for the
dnotify mechanism, used by processes which wish to be alerted when files
are changed. The inotify patch takes a different approach; it creates a
char device which supports a small set of ioctl() operations.
After opening this device and using ioctl() to express interest in
a particular set of files, a process need only read the device to get the
change events for those files.
OpenSSI 1.0 is out. OpenSSI is a "single
system image" clustering environment based on the 2.4 kernel;
it includes membership functions,
the CFS and Lustre Lite filesystems, process management, and a cluster-wide
device mechanism built on devfs. See the
OpenSSI web page for more information.
The sysfs directory /sys/module contains, among other things,
attributes for parameters exported by loaded modules. Dominik Brodowski
noticed that, if these modules are built directly into the kernel, those
parameters are not available via sysfs. Even if they were, they would not
belong under /sys/module, since the code in question is not
part of a module. So he has posted a patch
creating a new directory (/sys/parameters) and putting attributes
there, for both modules and built-in code. This is a user-space API
change, but it is unlikely that anything of any consequence depends on
parameters under /sys/module at this point.
Jens Axboe has posted a new SCSI generic ("sg")
implementation (called "bsg") which works through the block layer. This driver
implements the SG_IO ioctl() call, and also allows
communication through regular reads and writes. The latter functionality
caused some complaints; when structures are passed between user and kernel
space with read() and write() calls, it becomes very hard
to convert them when the process is running in 32-bit mode on a 64-bit
platform. For all that the developers dislike ioctl(), that
interface does, at least, make it clear when and where a structure is being
transferred across the user-kernel boundary. To address these complaints,
the bsg driver may be restricted to the ioctl() mode only.
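The 32-bit problem is easy to see with a small example. The structures below are hypothetical and are not the bsg or sg command format; they only show how a structure containing a pointer and a long changes shape between a 32-bit process and a 64-bit kernel, and what a conversion step has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command structure as a 64-bit kernel would declare it. */
struct cmd_native {
    void *data;           /* user-space buffer */
    unsigned long len;    /* buffer length */
};

/* The same structure as a 32-bit process lays it out in memory. */
struct cmd_compat32 {
    uint32_t data;        /* 32-bit user-space pointer */
    uint32_t len;         /* 32-bit long */
};

/* Widen a 32-bit command into the native form, field by field. */
void cmd_from_compat32(struct cmd_native *dst, const struct cmd_compat32 *src)
{
    assert(dst != NULL && src != NULL);
    dst->data = (void *)(uintptr_t)src->data;
    dst->len = src->len;
}
```

With ioctl(), the command number tells the kernel that a structure of a known type is crossing the boundary, so a conversion like this can be hooked in at a single well-defined point. Data arriving through write() is just a stream of bytes, and the kernel must work out for itself that it contains a structure needing translation.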
Page editor: Jonathan Corbet