The current development kernel is 2.5.40, released
by Linus on October 1. Among the
usual fixes and updates, it includes high memory support for User-mode
Linux, the CPU frequency (power management) patches, more disk management
thrashups from Al Viro, the in-kernel NUMA topology API, the removal of the
task queue subsystem, an ISDN update, and an ARM update. Here's the long-format changelog
with the details.
Linus announced 2.5.39 on
September 27. The biggest change,
perhaps, was the inclusion of the deadline I/O scheduler (covered in last week's LWN Kernel Page);
this kernel also contained a bunch of XFS fixes, an SCTP update, a bunch of
memory management work by Andrew Morton, Ingo Molnar's in-kernel symbolic
oops dumper, some driver model work, and numerous other fixes and updates.
The long-format changelog is available.
Linus's pre-2.5.41 BitKeeper tree contains a big ALSA update (the source of
some grumbling from Linus), Ingo Molnar's
"workqueue" implementation (see below), and a relatively small number (as
of this writing) of other fixes and updates.
Dave Jones jumped back into the prepatch business with 2.5.39-dj1, which contained a number of fixes
from his tree. Dave evidently still has a substantial pile of fixes to
push on to Linus, but has been busy.
After a long absence, Alan Cox has also started putting out development
kernel prepatches again. 2.5.40-ac1
includes support for the Voyager architecture, a merge of the uClinux
distribution, and a number of fixes.
The latest 2.5 status summary from Guillaume
Boissiere is dated October 2.
The current stable kernel is 2.4.19. Marcelo released 2.4.20-pre8 seconds after last week's Kernel
Page was posted; it included an IBM hotplug driver update, a couple
of security fixes, an x86-64 update, and a number of other fixes.
The current prepatch from Alan Cox is 2.4.20-pre8-ac3. Alan's recent releases have
contained quite a few fixes, but no major new work.
Kernel development news
As part of the 2.5.40 announcement, Linus reminded the world that the
feature freeze is coming soon:
And a small reminder that we're now officially in the last month of
features, and since I'm going to be away basically the last week of
October, so I actually personally consider Oct 20th to be the
drop-date, unless you've got a really good and scary costume.. So
don't try to leave it to the last day.
Linus also let it be known that he's "perfectly happy with the kernel" and
feels no need to deal with last-minute code submissions.
In fact, the list of outstanding features is getting smaller. A couple of
big changes that are pending, and which could be disruptive, are:
- Changing the sector_t type to 64 bits, allowing block devices
to be larger than 2TB. One would think that 2TB would last for a
while, even by the standards of modern disks, but large RAID arrays
are already pushing that boundary. A patch (by Peter Chubb) is being
prepared, but it's taking him a while; among other things, he points
out that testing is a slow process because it takes a full day just to
write 4TB to a device.
- Turning dev_t into a 32-bit value. Increasing the number of
devices has been on the list since long before the 2.5 series began,
but the change has not yet been made. This is not a trivial change,
since the major device number is still used to index into static
arrays within the kernel. Drastically increasing the number of devices
requires dealing with those arrays. Alexander Viro has a plan to that end, but a lot of work
remains to be done.
Beyond that, quite a few other developments are pending, and they won't all
get in. Some outstanding items include the completion of the Linux
Security Module and asynchronous I/O merges, ext3 indexed directory
support, Rusty Russell's new module loader, a new kernel configuration and
build system, a whole pile of memory management work, etc.
What also remains to be seen is how serious Linus is about the feature
freeze. Past kernel freezes have tended to be slushy at best. Some
substantial work will have to be integrated after the freeze; it will be
interesting to see what gets in as "stabilization" or "feature completion."
Then, there is the much-publicized debate over whether the next stable
series should be 2.6 or 3.0. Linus started by saying that there was
nothing all that revolutionary in this kernel, and that it should be called
2.6. Numerous other developers disagreed, however, and Linus appears to have relented. It seems likely that
the next major stable kernel will be called Linux 3.0.
Kernel code often needs to set aside a task to be performed "a little
later." The classic example is that of an interrupt handler, which must
perform its task quickly, without blocking. Typically interrupt handlers
simply acknowledge the interrupt, then arrange for the real work to be done
outside of interrupt context. That work, which can include starting new
I/O operations, delivering data to user space, or cleanup actions, gets
done when the kernel gets around to it - and, usually, when it's safe to
do so.
In the good old days, the "bottom half" mechanism was used to set aside
tasks in this manner. Linux bottom halves were quite inflexible, being
identified by globally-unique, compile-time numbers. There could be no
more than 32 of them - the number that could be tracked in a single-word
bitmask. And bottom halves were not safe places for extended processing or
tasks that needed to sleep.
More recent kernels moved much of the bottom half work to "task queues." A
task queue is a simple linked list of functions to call (and data to pass
to them). Certain predefined task queues were run at well-defined times;
one was executed whenever the scheduler was called, and another was run out
of the timer interrupt handler. Task queues cleaned things up
significantly, but they were not particularly transparent and,
fundamentally, they were still bottom halves. Their removal has been on
numerous people's "todo" lists for some time.
One replacement for task queues is the "tasklet" interface, which was
introduced in the 2.3 development series. Tasklets provide a
high-performance interface for quick tasks that do not sleep; they are thus
suitable for certain sorts of operations, but they do not replace task
queues in all situations.
More recently, an attempt was made to address other deferred processing
needs by wrapping a new interface (schedule_task()) around (what
was) the scheduler task queue, and creating a special kernel thread
(keventd) to run that queue. keventd provided a
well-defined process context for tasks that need it (in particular, those
which can sleep). But keventd still suffered the limitations of
task queues, plus one other: all tasks were executed by a single thread.
One very slow task could thus hold up everything else in the queue,
creating unpredictable latencies.
A couple of patches recently posted by Ingo Molnar address these problems
and clean up deferred processing substantially. The first patch removes the task
queue interface and converts its remaining users over to
schedule_task(); this patch was included in 2.5.40. The more
interesting work is contained in the workqueue
patch (since updated),
which has not yet (as of this writing) been merged by Linus.
This patch replaces the task queue mechanism (and schedule_task())
entirely with a mechanism which is simpler to use and which yields more
predictable behavior.
With the workqueue patch, task queues are replaced with the new "workqueue"
concept. The basic idea is the same: a workqueue is a linked list of
structures containing functions to call and data to pass to them. But the
internals of workqueues are better hidden so that users need not worry
about what is really going on. Workqueues are executed in process context,
so tasks executed from those queues may sleep. Each workqueue, however,
has its own worker threads (one per CPU), so one subsystem's workqueue will
not block others from running. There is a default workqueue (analogous to
the old schedule_task() functionality) for relatively simple tasks
that do not justify their own queue.
For those who are interested, we have written up a separate article with reasonably complete
documentation of the workqueue interface.
There has been a bit of discussion over the details of this
interface. It has been through one set of modifications already, and will
likely evolve more in the near future.
The basic idea, however, appears to have been well
received; some version of this patch will probably go in before too long.
The Linux Security Module code works by allowing security-related code to
hook into almost every access decision the kernel makes. Security modules
can only tighten restrictions by vetoing access that would otherwise have
been allowed. A number of security regimes - most notably the NSA's SELinux
- have been built on the LSM structure. The LSM patch has been partially
merged into the kernel; many of the LSM hooks are not yet there, however.
Recently some developers have been questioning some of the specific hooks.
In response, LSM maintainer Greg Kroah-Hartman has posted a patch removing a few LSM hooks: those for
creating, initializing, and deleting modules. Nobody seems to have an
issue with the ability to control those operations - it's just that no code
is currently using those hooks.
That is, in fact, Greg's stated policy with
LSM: any hooks that are not
actually being used by an available, open source security module will be
removed:
I am not happy with the idea that there would be hooks in the
kernel that are not being used. That's not the Linux way. If the
code isn't being used, it's removed.
The idea, of course, is that there is no point in trying to maintain code
that is not in use. By the time somebody actually tries to make use of it,
chances are it will be broken anyway. And, it is said, it is easy to
reintroduce a hook should the need develop.
Of course, given the LSM design, it's not that easy to put in a new
hook. LSM requires security modules to provide an explicit implementation
for every available hook, with the result that security modules accumulate
a lot of stub "no-op" hooks. Adding a hook will break every security
module out there until each implements a stub for that hook. Given that,
security module authors who see a use for some of the more obscure hooks
might want to document that use before too long.
The kernel is full of code which is not allowed to sleep. Anything which
is handling an interrupt or otherwise running out of process context, for
example, should not try to go to sleep. This particular case is easy to
catch in the scheduler, but others are not. For example, any code which is
holding a spinlock can not sleep either. Sleeping in this situation can
lead to deadlocks (some other process spinning on the lock can prevent the
holder from running again and releasing the lock), mutual exclusion
failures (on uniprocessor systems where spinlocks are optimized out), or,
at a minimum, excessive lock hold time and lock contention.
The problem is that it can be easy to sleep in the wrong places. Sleeps
are often not done directly; instead, a piece of atomic code calls a
function which calls some other function which sleeps. The "sleep
tendency" of functions is not always documented, and, in any case, kernel
hackers, being human, can make mistakes. Even if it seems, at times, that
they don't sleep.
Until recently, these mistakes have been hard to catch. There was no "I'm
running in an atomic section" flag, and thus no way for the kernel to know
that it is sleeping in a bad place - until something went badly wrong. The
preemptible kernel patch changed all that, however. Any place where the
code can not sleep is also certainly a bad place for that code to be
preempted. So the functions which mark atomic sections (such as spinlock
operations) now set a "don't preempt me" flag.
But once you have that flag, why not use it to detect sleeps in the wrong
place? Andrew Morton posted a patch which
does exactly that, and Linus merged it on the spot. The patch was titled
"increase traffic on linux-kernel," and it has done exactly that. There
are, it turns out, quite a few places where sleeping functions are called
within code that is supposed to be atomic. These mistakes are being fixed
almost as quickly as they are found. A small patch has done a lot to
eliminate a whole class of kernel programming errors.
A new CPU frequency subsystem, written by Dominik Brodowski and others, was
integrated by Linus into the 2.5.40 release. This code provides user-space
control over the clock frequency of the CPU(s) in the system - at least,
for processors which provide that capability.
One might wonder why it would be desirable to run a processor at anything
below its rated speed. The reasons, of course, are power consumption and
thermal control. A faster CPU requires more power to run. If you're using
your laptop on an airplane, and you're not trying to crack any encryption
keys or set kernel build time records before you land, you might just want
to slow down the processor a little to avoid draining your battery.
Meanwhile, the processor may decide
to slow down on its own if it's getting too warm.
In fact, some modern processors can take a fairly smart approach to
frequency control. If the processor notices that it is spending a lot of
time idle, it can slow itself down. If it's constantly busy, it can turn
up the speed a bit. If a particular processor supports setting its
frequency in a "dumb" mode only, it might be nice for the operating system
to provide the automatic adjustment in software.
For this reason, a simple "set the frequency" interface was deemed to be
insufficient. The CPU frequency code merged into 2.5.40 reflects the new
understanding of the problem: it allows the user to set a range of
acceptable frequencies and the desired policy. If the user selects
"performance" as the policy, the processor will be instructed to run at the
upper end of its range; if it slows down, it does so gradually. With the
"powersave" policy, speeds will be kept lower even in the face of sustained
work to do. Overall, the new interface gives the user a great deal of
control over how the system operates. Of course, this interface is just a
cryptic /proc file (see Documentation/cpufreq in the 2.5.40 tree for
details); look for the KDE and GNOME
applications to show up in the near future.
For now, the code that has been merged into the kernel supports only the
i386 architecture. Code for a number of other processors exists and will
show up in the proper, architecture-specific trees.
Page editor: Jonathan Corbet