The current 2.6 kernel is 2.6.9, released, at last, on October 18. Very
few fixes were merged since 2.6.9-final
which, in turn, contained only a small number of changes since 2.6.9-rc4.
The -final naming scheme drew a few complaints, to which Linus responded
"I'm a retard." One assumes he will not do that again.
For those just tuning in, 2.6.9 includes a lot of NTFS updates, block I/O
barrier support, a patch allowing unprivileged processes to lock small
amounts of memory in RAM, a new USB storage driver, cluster-wide file
locking infrastructure, completely out-of-line spinlocks, AMD dual-core
support, support for the POSIX waitid() system call, KProbes, USB "on
the go" support, the "flex mmap" user-space
memory layout, m32r architecture support, a bunch of latency-reduction
work, and lots of fixes.
See the (lengthy) changelog for a full list of
changes since 2.6.8.
There have been no 2.6.10 prepatches released yet, but the floodgates have
certainly opened; several hundred changesets have found their way into
Linus's BitKeeper repository. These include a set of SCSI updates, a big
rework of the IRQ subsystem (pulling lots of duplicated code into a single,
generic core - no functional changes), some software suspend fixes, a
number of scheduler tweaks, CDRW packet writing support, switchable and
loadable I/O schedulers, a new version of
the completely fair queueing (CFQ) I/O scheduler, the removal of the
(unused) wake_up_all_sync() function, a simple generic circular
buffer implementation, a big USB update, version 17 of the wireless
extensions API, the kernel events notification mechanism, a patch changing
the core device model function exports to GPL-only, a PCI subsystem update,
the BSD "secure levels" security module, and lots of fixes.
Andrew Morton has not released any -mm patches over the last week.
The current 2.4 prepatch is still 2.4.28-pre4; Marcelo has not
released any prepatches since October 8.
Kernel development news
On a side note, the GPL buyout previously offered has been
modified. We will be contacting individual contributors and
negotiating with each copyright holder for the code we wish to
convert on a case by case basis....
SCO has contacted us and identifed [sic] with precise detail and factual
documentation the code and intellectual property in Linux they
claim was taken from Unix. We have reviewed their claims and they
appear to create enough uncertianty [sic] to warrant removal of the
-- Jeff Merkey, of course.
Yes, I can reveal them. All of XFS, All of JFS, and All of the SMP
Support in Linux. I have no idea what the hell RCU is and when I
find it, I'll remove it from the code.
-- Yes, him again.
Sorry, couldn't resist; we'll stop now.
A large number of patches have already been merged and will show up in the
first 2.6.10 prepatch. Some of those have been covered on this page
before, but others have not. As a way of catching up with current events,
we'll take a quick look at a few of these patches.
The completely fair queueing (CFQ) I/O scheduler endeavors to get good
performance from block devices while dividing the available bandwidth
equally between the processes contending for each device. 2.6.10 will
contain a major rework of the CFQ scheduler, called "CFQ v2." Some of
the changes in this version are:
- Process I/O context information is maintained for the lifetime of each
process, rather than just for the periods when the process has
outstanding I/O. This change fixes some starvation scenarios which
came up with CFQ v1.
- Grouping of processes can be done by user ID, group ID, thread group,
or process group; the policy in force can be changed at runtime.
- Request ordering is more strictly enforced as a way of limiting the
maximum latency experienced by any given request.
- Small backward seeks are occasionally allowed if they look like they
will improve responsiveness.
The code is also more heavily commented; author Jens Axboe says that was
done to increase its AAF - "akpm acceptance factor." AKPM is Andrew
Morton, who has been known to complain about insufficiently commented code.
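As a rough illustration of the fairness idea (and emphatically not the CFQ code itself), the sketch below keeps one request queue per process and serves those queues in round-robin order, so a process with a deep backlog cannot starve one with a single outstanding request. Everything here is hypothetical and exists only for this example: the fair_sched structure, the fs_add() and fs_dispatch() names, and the fixed queue sizes. The real scheduler also deals with request ordering, seeks, and the grouping policies described above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 4   /* hypothetical limit on contending processes */
#define MAX_REQS  8   /* hypothetical per-process queue depth */

/* One FIFO of pending request ids per process. */
struct proc_queue {
    int reqs[MAX_REQS];
    int head;    /* index of the oldest pending request */
    int count;   /* number of pending requests */
};

struct fair_sched {
    struct proc_queue q[MAX_PROCS];
    int next;    /* queue that gets the next turn */
};

/* Queue a request for the given process; returns 0 or -1 on error. */
static int fs_add(struct fair_sched *s, int proc, int req)
{
    struct proc_queue *q;

    if (proc < 0 || proc >= MAX_PROCS)
        return -1;
    q = &s->q[proc];
    if (q->count == MAX_REQS)
        return -1;
    q->reqs[(q->head + q->count) % MAX_REQS] = req;
    q->count++;
    return 0;
}

/*
 * Pick the next request, giving each non-empty queue one turn per pass.
 * Returns 1 and fills *proc and *req, or 0 if nothing is pending.
 */
static int fs_dispatch(struct fair_sched *s, int *proc, int *req)
{
    int i;

    for (i = 0; i < MAX_PROCS; i++) {
        int p = (s->next + i) % MAX_PROCS;
        struct proc_queue *q = &s->q[p];

        if (q->count == 0)
            continue;
        *proc = p;
        *req = q->reqs[q->head];
        q->head = (q->head + 1) % MAX_REQS;
        q->count--;
        s->next = (p + 1) % MAX_PROCS;   /* the turn passes to the next queue */
        return 1;
    }
    return 0;
}
```

If process 0 queues four requests and process 1 queues one, process 1's request is dispatched second rather than fifth.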
Simple circular buffers
Circular buffers are a common data structure in the kernel, but there has
never been a generic implementation available for use. Stelian Pop decided
to change that; he was almost certainly surprised, however, by the large number of
iterations it took to respond to all the comments he got. In the end, this
effort showed the value of having a single, generic implementation in the
kernel. Even a data structure as simple as a circular buffer can be tricky
to implement correctly; it makes no sense for every developer to go through
that process each time a new one is needed. With a single, well-reviewed
implementation, the chances of it being truly correct are much better.
A circular buffer is represented by struct kfifo, defined in
<linux/kfifo.h>. A statically allocated buffer can be
initialized with kfifo_init(), or allocation and initialization
can be performed together with kfifo_alloc():
struct kfifo *kfifo_init(unsigned char *buffer, unsigned int size,
int gfp_mask, spinlock_t *lock);
struct kfifo *kfifo_alloc(unsigned int size, int gfp_mask,
spinlock_t *lock);
Either way, size is the desired size of the buffer (in bytes, must
be a power of two), gfp_mask is a set of GFP_ flags
controlling how memory allocations will be performed, and lock is
a spinlock which will be used to serialize access to the data structure.
The functions for moving data into and out of the buffer are:
unsigned int kfifo_put(struct kfifo *fifo, unsigned char *buffer,
unsigned int len);
unsigned int kfifo_get(struct kfifo *fifo, unsigned char *buffer,
unsigned int len);
These functions move at most len bytes between the structure and
buffer; the actual number of bytes transferred is returned. The
number of bytes currently stored in a circular buffer can be obtained by
passing it to kfifo_len(), and a buffer may be flushed by passing
it to kfifo_reset(). A dynamically allocated buffer may be
returned to the system with kfifo_free(); there does not seem to
be a way to free memory from statically allocated buffers.
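For readers who want to see the idea in miniature, here is a user-space sketch of a power-of-two circular buffer with put and get semantics like those described above. It is not the kernel's implementation: struct ring and the ring_*() functions are names invented for this example, there is no locking at all, and the index arithmetic (free-running counters masked by size - 1, which is why the size must be a power of two) is simply one common way of doing it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space model; not the kernel's struct kfifo. */
struct ring {
    unsigned char *buf;
    unsigned int size;   /* buffer size in bytes, must be a power of two */
    unsigned int in;     /* free-running count of bytes written */
    unsigned int out;    /* free-running count of bytes read */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

static void ring_init(struct ring *r, unsigned char *buf, unsigned int size)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    r->buf = buf;
    r->size = size;
    r->in = r->out = 0;
}

/* Number of bytes currently stored. */
static unsigned int ring_len(const struct ring *r)
{
    return r->in - r->out;
}

/* Discard all stored data. */
static void ring_reset(struct ring *r)
{
    r->in = r->out = 0;
}

/* Copy at most len bytes in; returns the number actually stored. */
static unsigned int ring_put(struct ring *r, const unsigned char *src,
                             unsigned int len)
{
    unsigned int off, first;

    len = min_u(len, r->size - ring_len(r));     /* limit to free space */
    off = r->in & (r->size - 1);                 /* write position */
    first = min_u(len, r->size - off);           /* bytes before the wrap */
    memcpy(r->buf + off, src, first);
    memcpy(r->buf, src + first, len - first);    /* remainder, after the wrap */
    r->in += len;
    return len;
}

/* Copy at most len bytes out; returns the number actually retrieved. */
static unsigned int ring_get(struct ring *r, unsigned char *dst,
                             unsigned int len)
{
    unsigned int off, first;

    len = min_u(len, ring_len(r));               /* limit to stored data */
    off = r->out & (r->size - 1);                /* read position */
    first = min_u(len, r->size - off);
    memcpy(dst, r->buf + off, first);
    memcpy(dst + first, r->buf, len - first);
    r->out += len;
    return len;
}
```

Both transfer functions return short counts rather than blocking or failing, which matches the "at most len bytes" behavior described above; a caller that needs everything transferred has to check the return value.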
The kernel events notification mechanism has been covered here a couple of
times. This code provides a way for user-space processes to learn about
important events by way of a netlink socket. The final form of the event
generation interface (for now) is:
int kobject_uevent(struct kobject *kobj, enum kobject_action action,
struct attribute *attr);
The kobject describes where the interesting event happened. For the one
explicit use currently in the kernel (filesystem mount and unmount events),
the kobject corresponds to the disk partition involved. action is
a small set of possible events; it is currently one of KOBJ_ADD,
KOBJ_REMOVE, KOBJ_CHANGE, KOBJ_MOUNT, and
KOBJ_UMOUNT. The "add" and "remove" actions are generated along
with hotplug events; "change" describes attribute value changes, and
"mount" and "unmount" are for filesystem events. The final parameter
(attr) is an optional attribute of the given kobject which
provides further information.
The patches merged also modify how hotplug events are handled; such events
now are reported in two ways: via the new events mechanism and through an
invocation of /sbin/hotplug.
In last week's episode, we saw the release
of a number of patches intended to bring (something closer to) realtime
response to the standard Linux kernel. The level of activity in this area
remains high; here is what has been happening over the last week.
Bill Huey of LynuxWorks surfaced to
announce that he, too, has been working on realtime preemption; his patches
are available at mmlinux.sourceforge.net.
Mr. Huey seemed a bit annoyed at the posting from MontaVista which started
the current discussion; his version, it seems, has been working for some
months. But, by his own admission, he had
been sitting on the patches for some time as a result of the "commercial
development attitude" at his employer. "Release early" is the kernel
developers' mantra for a reason.
The mmlinux patch resembles the others, in that it turns all spinlocks into
semaphores and makes most critical sections preemptible. It includes a
threaded interrupt handler patch from TimeSys, and uses standard Linux
semaphores, without priority inheritance. See the mmlinux release announcement for more information.
The folks at MontaVista must be feeling a bit like their own vehicle has
taken off and left them behind. Even so, Daniel Walker announced a new MontaVista realtime patch,
based on Ingo Molnar's work. It includes an architecture-independent mutex
implementation (but still different from regular Linux kernel semaphores),
and some latency tracing code.
The real work, however, continues to be done by Ingo Molnar; he has been
releasing patches at such a rate that some
developers working on slower systems may have trouble simply compiling them
before the next one comes out. Ingo's focus has been the
elimination of the (numerous) remaining spinlocks, especially those outside
of the core kernel. The current situation, as he put it, is "an
opt-in model to correctness which is bad from a maintenance and upstream
acceptance point of view." With his current patches (the latest is
RT-2.6.9-rc4-mm1-U8 as of this writing, but
that is likely to have changed by the time anybody reads this), over 90% of
the raw spinlock calls have been removed, and most non-core subsystems are
entirely free of spinlocks. At least, that is the case when realtime
preemption is configured into the kernel; without that option, the
situation is mostly unchanged.
To get to that point, Ingo had to make changes to a number of Linux mutual
exclusion primitives which got in the way. One of those is per-CPU
variables, which are based around the idea that, as long as each processor
only works with its own copy of a variable, no locking is required to make
that work safe. That assumption only holds, however, if threads are not
preempted while manipulating per-CPU variables. So using a per-CPU
variable requires disabling preemption, which runs counter to the whole
"make everything preemptible" idea. To address this problem, Ingo
introduced a new "locked" per-CPU variable type:
Threads which use the "locked" type of per-CPU variable can be preempted
while working with that variable - they can even be shifted to a different
processor while sleeping. The result could be a thread updating the
"wrong" processor's version of the variable. The lock will prevent race
conditions, however, so, as Ingo puts it,
variable is still per-CPU and update correctness is fully
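The mechanics can be modeled in user space. The sketch below is an illustration built on assumptions, not Ingo's code: the locked_counter structure, the counter_get_locked() and counter_put_locked() helpers, and the current_cpu variable standing in for "the processor this thread is running on" are all invented for this example, and a pthread mutex plays the part of the kernel lock. The point it tries to make is that the thread remembers which slot it locked, so an update that finishes after a simulated migration still goes to a consistent, locked copy.

```c
#include <assert.h>
#include <pthread.h>

#define NR_CPUS 2   /* hypothetical processor count for the model */

/* A per-CPU value with a lock attached to each copy. */
struct locked_counter {
    pthread_mutex_t lock;
    long value;
};

static struct locked_counter counters[NR_CPUS] = {
    { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Stand-in for "which CPU is this thread running on right now". */
static int current_cpu;

/*
 * Lock and return the current CPU's copy. The CPU number is handed
 * back so the caller can release the same copy later, even if it has
 * moved to another processor in the meantime.
 */
static struct locked_counter *counter_get_locked(int *cpu)
{
    *cpu = current_cpu;
    pthread_mutex_lock(&counters[*cpu].lock);
    return &counters[*cpu];
}

static void counter_put_locked(int cpu)
{
    pthread_mutex_unlock(&counters[cpu].lock);
}

/*
 * Add n to the counter, pretending that the thread is preempted and
 * migrated to CPU migrate_to in the middle of the update.
 */
static void counter_add(long n, int migrate_to)
{
    int cpu;
    struct locked_counter *c = counter_get_locked(&cpu);

    current_cpu = migrate_to;   /* simulated preemption and migration */
    c->value += n;              /* still updates the copy that was locked */
    counter_put_locked(cpu);    /* releases that same copy's lock */
}
```

A thread which starts on CPU 0 and is moved to CPU 1 mid-update ends up modifying CPU 0's copy, which is "wrong" in the sense described above but harmless, since the lock keeps any other user of that copy out until the update is complete.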
Then, there is the issue of read-copy-update, which also depends on
threads not being preempted while they hold a reference to RCU-protected
data. Ingo's approach here was, essentially, to dump RCU in the realtime
case and just go back to regular locking. This change is hard to do in any
sort of automatic way, however, because the RCU read locking primitive
(rcu_read_lock(), which, normally, just disables preemption) does
not identify which data is being protected. So converting RCU code
requires picking out a spinlock or semaphore which can be used to prevent
races with writers, and to change the rcu_read_lock() calls to one
of the many new variants:
rcu_read_lock_sem(struct semaphore *sem);
rcu_read_lock_down_read(struct rwsem *sem);
This API, Ingo notes, is still in flux. There does not seem to have been
any benchmarking done yet to determine what effect these changes have on the
scalability issues RCU was created to address.
Atomic kmaps were another problem. An atomic kmap is a mechanism used to
quickly map a high memory page into the kernel's address space. It is, for
all practical purposes, an implementation of per-CPU page table entries,
and it has the same preemption issues. The solution here was the addition
of a new function (kmap_atomic_rt()) which turns into a regular,
non-atomic kmap when realtime preemption is enabled. In this case (as with
many of the others) the low-latency imperative brings a small overall
As a sort of side project, many users of semaphores in the kernel were
changed over to the completion mechanism.
Some new completion functions have been added to help with that process:
int wait_for_completion_interruptible(struct completion *c);
unsigned long wait_for_completion_timeout(struct completion *c,
unsigned long timeout);
unsigned long wait_for_completion_interruptible_timeout(struct completion *c,
unsigned long timeout);
Quite a few other changes have gone in, but the idea should be clear by
now: a vast number of changes are being made to the kernel's fundamental
assumptions about locking and the execution environment. Few readers will
be surprised to learn that the brave souls testing these patches have been
encountering significant numbers of bugs. Those bugs are being squashed in
a hurry, though, to the point that Ingo can say:
...this is i believe the first correct conversion of the Linux kernel
to a fully preemptible (fully mutex-based) preemption model, while
still keeping all locking properties of Linux.
I also think that this feature can and should be integrated into
the upstream kernel sometime in the future. It will need
improvements and fixes and lots of testing, but i believe the basic
concept is sound and inclusion is manageable and desirable.
The interesting thing is that nobody has come forward to challenge that
statement. As the realtime preemption patches become more stable, and the
pressure for their inclusion starts to build, that situation may well
change. It is hard to imagine a patch this intrusive going in without some
sort of fight - especially when many developers are far from convinced
about the goal of supporting realtime applications in Linux to begin with.
It's hard to turn down an opportunity to give Rusty Russell some grief, so
let's take a moment to review a comment he posted on LWN:
Regarding module_param(): MODULE_PARM() will certainly stay
throughout the 2.6 series, so no need to change existing code just
Those who held off on changing their out-of-tree modules may want to do so
now. Rusty has sent out a patch marking
MODULE_PARM() obsolete in preparation for its removal from the
kernel. A set of companion patches deals with many of the remaining
MODULE_PARM() uses in the mainline tree.
MODULE_PARM() declares parameters for loadable modules; these
parameters can be changed when the module is loaded to affect its
operation. One of the many changes that came with the new module loader in
the 2.5 series was a new mechanism (module_param()) for declaring
module parameters. The new scheme has a number of advantages over the old
one: it is type safe, it allows module parameters to be represented (and
changed) in sysfs, and it provides a flexible mechanism for new types of
parameters. But, since the older way continued to work, many modules were
Under the old development model, things probably would have gone as Rusty
suggested: MODULE_PARM() would have remained through the 2.6 series
in order to avoid breaking things. The new development model lacks the
same sort of obvious demarcation point where compatibility can be broken,
so those changes end up going into the regular patch stream. This is
especially true of internal API changes, where there never has been a
guarantee of any sort of continuity, even in an old-style stable series.
So some of these changes are coming more quickly than some developers might
With regard to MODULE_PARM(), the current patches in circulation
suggest that the time to update to module_param() is running out.
Consider yourself warned.
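To make the "type safe" and "flexible" claims for the new scheme a little more concrete, here is a user-space model of a typed parameter table. It should not be mistaken for the kernel's module_param() implementation; the demo_param structure, the DEMO_PARAM_CHECK() macro, and the demo_parse() function are all made up for this example. The idea sketched is that each parameter carries its own setter function, so adding a new parameter type means adding a new setter, and that a small generated function lets the compiler complain when a variable's declared type does not match the type named in its declaration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: each parameter names a setter for its type. */
struct demo_param {
    const char *name;
    int (*set)(const char *val, void *arg);
    void *arg;
};

/* Setter for int parameters; rejects empty or non-numeric input. */
static int demo_set_int(const char *val, void *arg)
{
    char *end;
    long v = strtol(val, &end, 0);

    if (end == val || *end != '\0')
        return -1;
    *(int *)arg = (int)v;
    return 0;
}

/*
 * Compile-time type check: the generated function only compiles
 * cleanly if &var really is a "type *". The unused attribute is a
 * GCC/clang extension that silences unused-function warnings.
 */
#define DEMO_PARAM_CHECK(var, type) \
    static inline __attribute__((unused)) type *demo_check_##var(void) \
    { return &(var); }

static int debug;
DEMO_PARAM_CHECK(debug, int)

static struct demo_param demo_params[] = {
    { "debug", demo_set_int, &debug },
};

/* Apply one "name=value" string; returns 0 on success, -1 on error. */
static int demo_parse(const char *arg)
{
    const char *eq = strchr(arg, '=');
    size_t i, n;

    if (!eq)
        return -1;
    n = (size_t)(eq - arg);
    for (i = 0; i < sizeof(demo_params) / sizeof(demo_params[0]); i++) {
        const struct demo_param *p = &demo_params[i];

        if (strlen(p->name) == n && strncmp(p->name, arg, n) == 0)
            return p->set(eq + 1, p->arg);
    }
    return -1;
}
```

Changing the declaration of debug to a long without updating the DEMO_PARAM_CHECK() line would draw an incompatible-pointer diagnostic from the compiler, which is the sort of mistake a scheme based on type-description strings cannot catch at build time.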
Page editor: Jonathan Corbet