The current 2.6 prepatch is 2.6.19-rc5, released
by Linus on
November 7. It contains another pile of fixes, many of them in
architecture-specific code; the long-format
changelog has the details. Linus says "there may be a -rc6, but
maybe we don't even need one."
Adrian Bunk calls those "famous last words" in his 2.6.19-rc5 known regressions list.
The current -mm tree is 2.6.19-rc5-mm1. Recent changes
to -mm include the latest kevent code (see below), the kernel virtual machine patch
set, and some big updates to the high-resolution timer and dynamic tick code -
which still has some problems.
The current stable 2.6 release is 2.6.18.2, released on November 3.
Once again, quite a long list of patches has been merged into this release.
On the 2.6.16 front, 2.6.16.31 was released on
November 3, followed by 2.6.16.32 on November 7.
Between these two releases quite a few bugs have been fixed, including
several which are security-related.
For 2.4 users, 2.4.34-pre5
came out on November 4. The first 2.4.34 release candidate is
expected before too long.
Kernel development news
It took a long time to come about, but it has happened: OSDL has pulled
together the money to fund a technical writer to work on kernel
documentation for a year. The job
posting is available on the net for anybody who might be interested in
applying.
One of the more complicated core kernel functions is
copy_process(), found in kernel/fork.c.
This routine
is the heart of the fork() and clone()
system calls; it
must create a
coherent copy of a running process, bearing in mind the various clone flags
which are present. There are sixteen different goto
error exits. This is clearly a place where a lot of things can go wrong.
It is also an operation of interest to many other kernel subsystems. A look
at copy_process() reveals hooks for task delay accounting, auditing,
the process fork connector, SYSV semaphore undo information management,
NUMA memory policy enforcement, cpuset maintenance, keyring management, and
more. Many of these subsystems want to know about other events in the
process lifecycle as well, with the result that hooks are placed all over
the process code. It might just be nice to have a cleaner solution to the
problem of learning about process-related events.
That cleaner solution would appear to be present in the form of Matt
Helsley's task watchers patch
set, currently in its second major iteration. This patch takes an
interesting approach to providing what is essentially just another notifier
interface in order to minimize overhead in a performance-critical part of
the kernel.
In this patch, a "task watcher" is a function which is notified whenever an
interesting process event takes place. Watchers have this prototype:
int my_watcher(unsigned long info, struct task_struct *tsk);
When the watcher function is called, info will have additional
information for the specific event, and tsk points to the
process generating the event. Arranging for a task watcher to be called is
a simple matter of adding a declaration like the following:
task_watcher_func(event, function);
where event is the event of interest, and function is the
task watcher function to be called in response to that event. The possible
events are:
- init: a process is first created; info is the set of
flags passed to clone().
- clone: a process forks; info is the set of
clone() flags. Note that this watcher appears to be called with
the child process; it differs from init in that it is called
toward the end of copy_process(), when creation of the new process
is complete.
- exec: a process executes a new program; info is zero.
- uid: a process changes its real or effective UID;
info is zero.
- gid: a process changes its real or effective GID;
info is zero.
- exit: a process dies; info is the exit code.
- free: a process's task structure is being freed;
info is the exit code.
The task_watcher_func() macro creates a pointer to the watcher
function in a special ELF section. There is a separate section for each
watched-for event; when such an event is signaled, the watcher code simply
iterates through each function found in the relevant executable section.
There are a couple of implications resulting from this mechanism: task
watchers exist for the life of the system (they cannot be registered and
unregistered), and they cannot be located in loadable modules (though this
restriction will eventually go away).
One might well wonder why things were done this way, rather than using a
simple notifier list. Your editor wondered, and asked Mr. Helsley about
it. The problem is that process creation is a performance-critical part of
the kernel, and any change which increases process fork time tends to get a
lot of scrutiny. Fork times are measured by a number of benchmarks; quick
process creation is also important in fork-heavy loads. Since kernel
compilation can require a lot of forks, there is an especially strong
incentive to keep it fast.
If a notifier list is used with watchers, some sort of locking is required
to keep that list from being corrupted when watchers come and go. The
separate ELF sections, instead, are read-only structures created at kernel
build time. So they impose less overhead on the process lifecycle and,
thus, are less likely to bother kernel developers who, perhaps, are not
really interested in the watcher functionality.
The proposed kevent interface was last covered here
in August. This
new API, which seeks to provide a single interface for applications to
receive events of interest, has been under development for the better part
of a year now. It continues to evolve, so, in celebration of the version 23 kevent patch,
another look is called for.
Parts of the interface remain relatively stable. So, the main multiplexer
system call remains:
int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
struct ukevent *arg);
The functions performed by this call are reduced in number, however. It is
no longer used to create the kevent file descriptor in the first place;
instead, an open of /dev/kevent is called for. But
kevent_ctl() is still the place to add events of interest, and to
remove and modify them.
The synchronous interface for waiting for events is also pretty much as it
has been for a little while:
int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
__u64 timeout, struct ukevent *buf,
unsigned flags);
This system call will wait until at least min_nr events are ready
for consumption, then copy up to max_nr completed events into
buf. The call will return early, however, if timeout
nanoseconds pass before min_nr events are signaled. The current
kevent documentation says that an indefinite wait can be had by passing -1 for
timeout - slightly strange, given that timeout is an
unsigned quantity. It would not be surprising to see some sort of
KEVENT_WAIT_FOREVER value defined for that purpose instead.
The biggest changes can be found in the kevent ring buffer code which, last
time we looked, was rather awkward to use. The previous implementation
also placed the ring buffer in nailed-down kernel memory, potentially
opening the system up to denial of service problems. So, in the new
implementation, the ring buffer is kept entirely in user space. The
application simply allocates an array of the desired size, laid out as:
struct kevent_ring {
unsigned int ring_kidx;
struct ukevent event[0];
};
The actual number of events to be stored in the ring is determined by the
application. The kevent subsystem must be told about this ring with:
int kevent_ring_init(int fd, struct kevent_ring *ring,
unsigned int num);
where num is the number of ukevent structures in the
ring. This call will remember the ring's address and size, and set
ring_kidx - the index of the entry where the kernel will store the
next completed event - to zero.
There are a few things to be aware of when working with the kevent ring.
One is that there is no place in this data structure to track which event
the application should consume next; the application must store that index
elsewhere. There also appears to be no way to disconnect or resize the
ring buffer without simply closing the event file descriptor and starting
over; an attempt to replace one ring with another will fail. Finally, the
application must tell the kernel to put events into the ring with:
int kevent_wait(int fd, unsigned int num, __u64 timeout);
This system call will wait until at least one event is available, then copy up
to num events into the ring buffer. Once the events are copied,
the kernel considers them to be consumed and will forget about them (or
requeue them if the event so requests). The application can work through
the events at leisure - stopping before hitting the current
ring_kidx value - with no further system calls required.
The current API seems to have made most of the people who are paying
attention happy - though it has been a little while since Ulrich Drepper,
an important player, has chimed in. In the past, he has been unhappy about
the timeout parameter (preferring that the interface use an absolute
timespec value rather than a relative value). Ulrich has also
suggested that the blocking system calls could use a version which
specifies an event mask, much like the recently added ppoll() and
pselect() system calls. He points out that, while it is possible
to receive signals as kevents, some applications will certainly still use
traditional signals, with their traditional atomicity problems.
So there may be a few remaining issues to take care of before the kevent
API is merged into the mainline kernel - and consequently set in stone.
But there is apparent progress in that direction, and the number of
developers showing interest in this API appears to be on the increase. It
may not be too many more kernel cycles before Linux has a unified event
interface of its very own.
The "sparse" utility has long been one of Linux's best-kept secrets. It is
a static analysis tool which can find a wide variety of bugs in the kernel
code base; sparse is a useful tool, but it can be surprisingly hard to
find. It has never had a web page, and almost no distributions package
it. Interested users must, instead, track down the git tree or Dave
Jones's tarball releases.
Sparse was originally written by Linus Torvalds, but he has
not done much with it for a while, and he recently suggested that somebody else should take it over:
Anyway, I suspect it would be better if people didn't consider me
the maintainer for sparse, simply because it does the things I
really cared about, and as a result I'm not really very active.
As a result of this discussion, sparse has a new maintainer: Josh
Triplett. Josh started things off with sparse 0.1, the first-ever
sparse release with a version number. He has set up a new git tree
for sparse, and, even, a sparse web page.
Josh was kind enough to answer some questions posed by your editor. It
turns out that he has been working with sparse for a while; it was part of
his PhD work, where he enhanced it to be able to verify proper use of the
read-copy-update (RCU) primitives. That work continued at IBM over the
summer, where he was able to work on RCU verification with Paul McKenney.
As a result, his first priority for sparse in the near future is the
continuation of the RCU work. This effort is also expanding into locking
verification in general; some of the necessary annotations and resulting
fixes have gone into the 2.6.18 and 2.6.19-rc kernels. Josh also plans to
work on the elimination of false positives and on noise reduction in
general. Then, there are various patches from other developers which have
been floating around for a while and really need to be merged into the
sparse tree.
In terms of project management, Josh says:
I plan to continue making regular Sparse releases, and I'd like to
get Sparse packaged in various distributions, at least in their
"experimental" sections or equivalent. Any potential distribution
packagers, feel free to join the linux-sparse list, and let me know
what I can do to help or to get things going more smoothly.
Getting sparse into distributions could only help increase its use - and
bring about a corresponding reduction in bugs in shipped code. This will
be especially true if Josh succeeds in another one of his goals: expanding
sparse usage beyond the kernel into user-space projects. X.org seems like
it could be an early sparse adopter.
Longer-term, Josh wants to look at more advanced techniques which can look
at larger chunks of a program and find potential bugs. Part of this effort
will require attracting other researchers interested in static analysis to
the sparse platform. Says Josh:
I feel that several classes of bugs
exist in the Linux kernel and in userspace code which simply should
not exist, because the tools exist to find and eliminate almost all
of them. This includes bugs like "scheduling while atomic",
__init-related bugs, errors on error paths, and many more.
One can only imagine that free software users all over are wishing Josh the
best of luck in his effort to track down and get rid of all those bugs.
Page editor: Jonathan Corbet