The current 2.6.10 prepatch remains 2.6.10-rc1; no new kernel
prepatches have been released since October 22.
Patches continue to accumulate in Linus's BitKeeper repository; they
include the ext3 block reservation and online resizing patches, sysfs backing store, locking behavior
annotations for the "sparse" utility, a reworking of spin lock
initialization (see below), the un-exporting of add_timer_on(),
sys_lseek(), and a number of other kernel functions, an x86 signal
delivery optimization, an IDE update, I/O space
write barrier support, a frame buffer driver update, more scheduler
tweaks, some big kernel lock preemption patches, a large
number of architecture updates, and lots of fixes.
The current prepatch from Andrew Morton is 2.6.10-rc1-mm2. Recent changes to -mm include
the kswapd high-order page freeing patch, a
new PCMCIA device model integration patch, some scheduler tweaks, a generic
CPU time abstraction (which comes from the S/390 port), and various fixes.
The current 2.4 prepatch is still 2.4.28-rc1; Marcelo has released
no prepatches since October 22.
Kernel development news
There have traditionally been two ways to initialize a spinlock inside the
kernel. It can be done with an explicit assignment:
spinlock_t lock = SPIN_LOCK_UNLOCKED;
or with a function call:
spin_lock_init(&lock);
Linus has recently merged a set of patches which move all in-kernel
initializations over to the function-based form. There has been no patch
to remove the SPIN_LOCK_UNLOCKED macro, but it is not hard to see
a move in that direction once the conversion is complete.
The stated reasons for this change include consistency and making life
easier for automatic lock validators. There is also an unstated, but
evident reason: the assignment form of lock initialization gets in the way
of the realtime preemption patches. Those patches change most spinlocks in
the kernel to a different, mutex type, and that breaks the initializers.
As a result, the preemption patches must change all of those
initializations throughout the kernel. By putting those specific changes
into the mainline, it is possible to make the realtime patches smaller,
less intrusive, and a little bit less scary.
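The fragility of the assignment form can be shown with a small user-space analogy. This is a sketch only, not kernel code: toy_lock_t, TOY_LOCK_UNLOCKED, toy_lock_init(), toy_trylock(), and toy_unlock() are hypothetical names invented for illustration, and the lock is not safe for concurrent use. The point is that the static initializer must mirror the layout of the lock type, while callers of the init function never see that layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock type. Imagine a later patch replacing it with a
 * mutex-like structure that has different fields. */
typedef struct {
    bool locked;
    int owner;              /* -1 when nobody holds the lock */
} toy_lock_t;

/* Assignment-style initializer: it must match the struct layout field
 * by field, so every user breaks when the type changes. */
#define TOY_LOCK_UNLOCKED { false, -1 }

/* Function-style initializer: the layout is hidden from callers. */
void toy_lock_init(toy_lock_t *lock)
{
    lock->locked = false;
    lock->owner = -1;
}

/* Single-threaded stand-in for acquiring the lock. */
bool toy_trylock(toy_lock_t *lock, int who)
{
    if (lock->locked)
        return false;
    lock->locked = true;
    lock->owner = who;
    return true;
}

void toy_unlock(toy_lock_t *lock)
{
    lock->locked = false;
    lock->owner = -1;
}
```

If toy_lock_t gained or reordered fields, only toy_lock_init() would need to change; every use of TOY_LOCK_UNLOCKED would have to be revisited.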
The 2.5 development series included the addition of the kernel crypto API.
This interface was added to enable in-kernel code to use cryptographic
functions where needed; the IPSec code was one of its first users. This
API has been extended since its addition, and it now supports a wide
variety of cryptographic algorithms.
There is just one little problem, however: the current Linux crypto API is
a synchronous interface. When kernel code requests that a transformation
be applied to a block of data, that work is done immediately, with a status
value returned to the caller. A synchronous interface works fine when the
cryptographic transformations are implemented in software. If the CPU has
to do the work anyway, there is usually no time like the present to get it
done.
Increasingly, however, computers are being equipped with hardware
cryptographic capabilities. It would be nice if Linux could make use of
crypto hardware, especially on systems (such as high-bandwidth servers)
which may have to do a lot of transformations. Hardware crypto complicates
the situation, however; hardware operations take time. A synchronous
interface does not work well when hardware is involved; the kernel needs to
be able to go off and do other things while the hardware works through the
data. Scheduling issues come into play as well; if a system has multiple
crypto cards installed, it would be nice to balance the load across them
and keep them all busy.
The current crypto API does not address hardware-related issues at all.
This shortcoming has been understood from the beginning; the initial crypto
API deliberately did not set out to solve the entire problem. Hardware
support was one of those "we'll get to that later" items.
Evgeniy Polyakov, based in Russia, has gotten around to it with his posting
of an asynchronous crypto layer patch.
This large patch creates a new cryptographic API which addresses the needs
of hardware cryptography. There is a callback-based asynchronous interface
which enables the queueing of transformation requests and notification of
their completion. The patch not only includes load balancing; it also has
a pluggable mechanism allowing a choice of which load balancer to use.
There is a priority mechanism built in, and a failover handler which does
the right thing when a cryptographic peripheral fails. There is even a
request routing feature for complicated transformations (encryption
followed by signing, say) which may have to be performed by a series of
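The general shape of a callback-based asynchronous interface can be sketched in a few lines of user-space C. This is not the API from the posted patch: struct toy_req, struct toy_queue, toy_submit(), toy_device_run(), and toy_count_done() are hypothetical names, the "cipher" is a trivial XOR, and the device is simulated by a function that drains the queue. Load balancing, priorities, and failover are left out.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LEN 8

struct toy_req;
typedef void (*toy_done_fn)(struct toy_req *req, int status);

/* One queued transformation request (hypothetical layout). */
struct toy_req {
    unsigned char *data;
    size_t len;
    unsigned char key;      /* key for the toy XOR "cipher" */
    toy_done_fn done;       /* completion callback */
    void *ctx;              /* caller's private data */
};

/* Fixed-size ring buffer of pending requests. */
struct toy_queue {
    struct toy_req *slots[QUEUE_LEN];
    size_t head, tail, count;
};

/* Queue a request and return at once: 0 on success, -1 if full. */
int toy_submit(struct toy_queue *q, struct toy_req *req)
{
    if (q->count == QUEUE_LEN)
        return -1;
    q->slots[q->tail] = req;
    q->tail = (q->tail + 1) % QUEUE_LEN;
    q->count++;
    return 0;
}

/* Stand-in for the hardware: process every queued request, then
 * notify its submitter. Returns the number of requests completed. */
size_t toy_device_run(struct toy_queue *q)
{
    size_t completed = 0;

    while (q->count > 0) {
        struct toy_req *req = q->slots[q->head];

        q->head = (q->head + 1) % QUEUE_LEN;
        q->count--;
        for (size_t i = 0; i < req->len; i++)
            req->data[i] ^= req->key;
        req->done(req, 0);
        completed++;
    }
    return completed;
}

/* Example callback: count successful completions in an int (ctx). */
void toy_count_done(struct toy_req *req, int status)
{
    if (status == 0)
        (*(int *)req->ctx)++;
}
```

The submitter gets control back as soon as toy_submit() returns and learns of the result only when its callback runs, which is the property a synchronous interface cannot offer.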
The new code has been welcomed, though the developers have a number of
issues with the specifics of the implementation. Chances are that those
issues can be overcome, and the new asynchronous API will eventually find
its way into the mainline. At that point, it will almost certainly
obsolete the existing crypto APIs - for both crypto users and the
implementation of software transforms. A certain amount of scrambling will
be required to make everything work again, but, when the dust settles,
Linux should have a much more comprehensive and capable cryptographic
An automounter implements a special filesystem which mounts remote
filesystems on demand, when requested by a user-space process. The Linux
automounter (autofs) is a mildly complicated subsystem; the autofsNG patches
make it somewhat more
complicated yet. Adam Richter decided that he could make things simpler,
and solve a wider class of problems at the same time. The result has been
recently posted as trapfs, a filesystem
which can do automounts and more in less than 500 lines.
Trapfs is derived from ramfs; by itself, it implements a simple,
memory-based filesystem. A user-space process can create files,
directories, device nodes, etc. in a trapfs filesystem, and everything will
work as expected. There is one additional little twist, however: a trapfs
filesystem can be mounted with the location of a special helper program
given as a parameter. Whenever an attempt is made to look up a nonexistent
file, the helper program is invoked and given a chance to cause that file to
exist. When the helper exits, trapfs will return whatever the helper left
behind to the original caller.
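The lookup behavior described above can be modeled in user space. The sketch below is not taken from the trapfs patch: struct toy_fs, toy_find(), toy_create(), toy_lookup(), and toy_net_helper() are hypothetical names, the "filesystem" is a flat table of names, and the helper is an ordinary function rather than an external program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 16
#define NAME_LEN 32

struct toy_fs;
typedef void (*toy_helper_fn)(struct toy_fs *fs, const char *name);

/* A flat, memory-based "filesystem": just a table of names. */
struct toy_fs {
    char names[MAX_ENTRIES][NAME_LEN];
    size_t count;
    toy_helper_fn helper;   /* called on lookup misses; may be NULL */
};

/* Plain lookup with no helper involvement. */
const char *toy_find(const struct toy_fs *fs, const char *name)
{
    for (size_t i = 0; i < fs->count; i++)
        if (strcmp(fs->names[i], name) == 0)
            return fs->names[i];
    return NULL;
}

/* Create an entry: 0 on success, -1 if the table is full or the
 * name is too long. */
int toy_create(struct toy_fs *fs, const char *name)
{
    if (fs->count == MAX_ENTRIES || strlen(name) >= NAME_LEN)
        return -1;
    strcpy(fs->names[fs->count++], name);
    return 0;
}

/* Lookup with the twist: on a miss, run the helper, then return
 * whatever the helper left behind (possibly still nothing). */
const char *toy_lookup(struct toy_fs *fs, const char *name)
{
    const char *hit = toy_find(fs, name);

    if (hit == NULL && fs->helper != NULL) {
        fs->helper(fs, name);
        hit = toy_find(fs, name);
    }
    return hit;
}

/* Example helper: only names starting with "net-" come into being. */
void toy_net_helper(struct toy_fs *fs, const char *name)
{
    if (strncmp(name, "net-", 4) == 0)
        toy_create(fs, name);
}
```

A second lookup of the same name finds the existing entry and never reaches the helper, which matches the idea that the helper runs only for nonexistent files.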
So, if you want to implement an automounter, you just set up a trapfs
filesystem with a little script which can figure out which remote
filesystem to mount in response to a lookup request. The task can be done
with a screenful of commands - especially if security is not a big concern.
Of course, there are some little details (such as unmounting idle
filesystems) which are left as an exercise for the reader, but the basic
idea is straightforward.
Another possibility is to use trapfs to create a devfs-style device
filesystem. The helper program responds to lookup requests by seeing if an
appropriate device node can be created.
Whether trapfs will prove useful for real-world tasks remains to be seen.
It could have a role, however, in the creation of simple, dynamic
filesystems in cases where a more complete solution (using FUSE, for example) is more work than is
justified by the task. Unless there are major objections, Adam plans to
try to get trapfs merged in the relatively near future.
A constant fact of Linux kernel development would appear to be that people always
want to play around with the CPU scheduler. Con Kolivas (with help from
William Lee Irwin) has decided to make this playing easier through the
creation of a pluggable scheduler. This mechanism is intended to make it possible for multiple
schedulers to exist in the kernel, with one being selected for use at boot
time. With "plugsched" in place, developers interested in experimenting
with schedulers could switch quickly between them while running the same
kernel.
The patch works by splitting the large body of code in
kernel/sched.c into public and private parts. Code meant to be
shared between schedulers goes into a new scheduler.c file, while
the current (and default) scheduler stays put. Also added to
scheduler.c is a new structure (struct sched_drv)
containing pointers to the functions which handle scheduling tasks. These
functions are invoked for various process events (fork(),
exit(), etc.), to obtain scheduling-related information, and, of
course, for calls to the core schedule() function. Implementing a new
scheduler is simply a matter of writing replacements for the relevant
functions and plugging the whole thing in.
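A table of function pointers of this kind might look something like the following user-space sketch. The actual members of struct sched_drv are not given here, so everything below is hypothetical: struct toy_sched_drv has a single pick_next() operation, the two "schedulers" are trivial, and the convention that a larger prio value means a more urgent task is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_task {
    int pid;
    int prio;               /* larger = more urgent (assumption) */
    int runnable;
};

/* Hypothetical operations table in the spirit of struct sched_drv. */
struct toy_sched_drv {
    const char *name;
    struct toy_task *(*pick_next)(struct toy_task *tasks, size_t n);
};

/* Scheduler 1: first runnable task in table order. */
struct toy_task *fifo_pick(struct toy_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].runnable)
            return &tasks[i];
    return NULL;
}

/* Scheduler 2: runnable task with the highest priority. */
struct toy_task *prio_pick(struct toy_task *tasks, size_t n)
{
    struct toy_task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (tasks[i].runnable &&
            (best == NULL || tasks[i].prio > best->prio))
            best = &tasks[i];
    return best;
}

const struct toy_sched_drv toy_drivers[] = {
    { "fifo", fifo_pick },
    { "prio", prio_pick },
};

/* "Boot-time" selection by name; the first entry is the default. */
const struct toy_sched_drv *toy_select(const char *name)
{
    size_t n = sizeof(toy_drivers) / sizeof(toy_drivers[0]);

    for (size_t i = 0; i < n; i++)
        if (name != NULL && strcmp(toy_drivers[i].name, name) == 0)
            return &toy_drivers[i];
    return &toy_drivers[0];
}
```

Code that wants a scheduling decision calls through the selected table and never names a particular scheduler, so adding a third one means writing one function and one table entry.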
There have been few objections to the pluggable scheduler implementation.
Ingo Molnar, however, is strongly opposed to
the idea in the first place:
I believe that by compartmenting in the wrong way we kill the
natural integration effects. We'd end up with 5 (or 20) bad generic
schedulers that happen to work in one precise workload only, but
there would not be enough push to build one good generic scheduler,
because the people who are now forced to care about the Linux
scheduler would be content about their specialized schedulers.
Ingo's position is that having one core scheduler forces developers to
think about the whole problem, rather than one small piece of it. In
particular, claims Ingo, the scheduling
domains patch would never have come about if the kernel had pluggable
schedulers; instead there would be a separate NUMA scheduler, an SMP
scheduler, and so on.
Ingo, meanwhile, continues his efforts to make the One Big Scheduler
provide real-time response. The latest patch is -RT-2.6.10-rc1-mm2-V0.7.1. The biggest change
in recent times is a new semaphore/mutex implementation which sticks closer
to the original Linux semaphore semantics; this change allows a number of
patches switching parts of the kernel over to the completion interface to
be dropped.
The new semaphores also include a priority inheritance mechanism. Whenever
a process blocks on a semaphore, the kernel checks to see if that process
has a higher priority than the process currently holding the semaphore. If
so, the holder's priority is bumped up to match that of the blocking
process. This technique should help to avoid situations where a
low-priority process can keep higher-priority tasks from running for
extended periods of time.
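The priority check described above can be sketched in user-space C. This is an illustration only, not code from the realtime patch: struct pi_task, struct pi_sem, pi_down(), and pi_up() are hypothetical names, a larger prio value is assumed to mean a more urgent task, there is no real blocking or wait queue, and restoring the holder's base priority on release is an assumption the text above does not spell out.

```c
#include <assert.h>
#include <stddef.h>

struct pi_task {
    int base_prio;          /* priority assigned to the task */
    int prio;               /* effective priority, possibly boosted */
};

struct pi_sem {
    struct pi_task *holder; /* NULL when the semaphore is free */
};

/* Try to take the semaphore. Returns 1 if acquired, 0 if the caller
 * would have to block. In the blocking case, a higher-priority
 * caller lends its priority to the current holder. */
int pi_down(struct pi_sem *sem, struct pi_task *task)
{
    if (sem->holder == NULL) {
        sem->holder = task;
        return 1;
    }
    if (task->prio > sem->holder->prio)
        sem->holder->prio = task->prio;
    return 0;
}

/* Release the semaphore and drop any inherited priority
 * (assumed behavior, see the note above). */
void pi_up(struct pi_sem *sem)
{
    if (sem->holder != NULL) {
        sem->holder->prio = sem->holder->base_prio;
        sem->holder = NULL;
    }
}
```

With the boost in place, a medium-priority task can no longer preempt the low-priority holder indefinitely while a high-priority task waits on the semaphore.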
Patches and updates
Core kernel code
Filesystems and block I/O
- Andrea Arcangeli: PG_zero.
(November 1, 2004)
Page editor: Jonathan Corbet