Kernel development

Brief items

Kernel release status

The current 2.6.10 prepatch remains 2.6.10-rc1; no new kernel prepatches have been released since October 22.

Patches continue to accumulate in Linus's BitKeeper repository; they include the ext3 block reservation and online resizing patches, sysfs backing store, locking behavior annotations for the "sparse" utility, a reworking of spin lock initialization (see below), the un-exporting of add_timer_on(), sys_lseek(), and a number of other kernel functions, an x86 signal delivery optimization, an IDE update, I/O space write barrier support, a frame buffer driver update, more scheduler tweaks, some big kernel lock preemption patches, a large number of architecture updates, and lots of fixes.

The current prepatch from Andrew Morton is 2.6.10-rc1-mm2. Recent changes to -mm include the kswapd high-order page freeing patch, a new PCMCIA device model integration patch, some scheduler tweaks, a generic CPU time abstraction (which comes from the S/390 port), and various fixes.

The current 2.4 prepatch is still 2.4.28-rc1; Marcelo has released no prepatches since October 22.

Kernel development news

Unified spinlock initialization

There have traditionally been two ways to initialize a spinlock inside the kernel. It can be done with an explicit assignment:

	spinlock_t lock = SPIN_LOCK_UNLOCKED;

or with a function call:

	spinlock_t lock;
	spin_lock_init(&lock);

Linus has recently merged a set of patches which move all in-kernel initializations over to the function-based form. There has been no patch to remove the SPIN_LOCK_UNLOCKED macro, but it is not hard to see a move in that direction once the conversion is complete.

The stated reasons for this change include consistency and making life easier for automatic lock validators. There is also an unstated, but evident reason: the assignment form of lock initialization gets in the way of the realtime preemption patches. Those patches change most spinlocks in the kernel to a different, mutex type, and that breaks the initializers. As a result, the preemption patches must change all of those initializations throughout the kernel. By putting those specific changes into the mainline, it is possible to make the realtime patches smaller, less intrusive, and a little bit less scary.
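As a rough userspace illustration of why a constant initializer can get in the way: suppose (and this is an assumption, not a description of the realtime patches) that the replacement lock type carries a wait list whose empty state points back at the lock itself. A single constant like SPIN_LOCK_UNLOCKED cannot know the address of the variable it is assigned to, while an initialization function receives that address as its argument. All of the demo_ names below are hypothetical.

```c
/* Hypothetical types for illustration; these are NOT kernel definitions. */
struct demo_list {
	struct demo_list *next, *prev;
};

typedef struct {
	int locked;
	struct demo_list waiters;	/* an empty list points at itself */
} demo_rt_lock_t;

/*
 * Function-style initialization: the function is handed the lock's
 * address, so it can set up the self-referential fields. Call sites
 * never name the fields and need no change if the type grows.
 */
static void demo_rt_lock_init(demo_rt_lock_t *lock)
{
	lock->locked = 0;
	lock->waiters.next = &lock->waiters;
	lock->waiters.prev = &lock->waiters;
}

/* True when the lock is free and nobody is waiting for it. */
static int demo_rt_lock_is_idle(const demo_rt_lock_t *lock)
{
	return !lock->locked && lock->waiters.next == &lock->waiters;
}
```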

Asynchronous crypto

The 2.5 development series included the addition of the kernel crypto API. This interface was added to enable in-kernel code to use cryptographic functions where needed; the IPSec code was one of its first users. This API has been extended since its addition, and it now supports a wide variety of cryptographic algorithms.

There is just one little problem, however: the current Linux crypto API is a synchronous interface. When kernel code requests that a transformation be applied to a block of data, that work is done immediately, with a status value returned to the caller. A synchronous interface works fine when the cryptographic transformations are implemented in software. If the CPU has to do the work anyway, there is usually no time like the present to get it done.

Increasingly, however, computers are being equipped with hardware cryptographic capabilities. It would be nice if Linux could make use of crypto hardware, especially on systems (such as high-bandwidth servers) which may have to do a lot of transformations. Hardware crypto complicates the situation, however; hardware operations take time. A synchronous interface does not work well when hardware is involved; the kernel needs to be able to go off and do other things while the hardware works through the data. Scheduling issues come into play as well; if a system has multiple crypto cards installed, it would be nice to balance the load across them and keep them all busy.

The current crypto API does not address hardware-related issues at all. This shortcoming has been understood from the beginning; the initial crypto API deliberately did not set out to solve the entire problem. Hardware support was one of those "we'll get to that later" items.

Evgeniy Polyakov, based in Russia, has gotten around to it with his posting of an asynchronous crypto layer patch. This large patch creates a new cryptographic API which addresses the needs of hardware cryptography. There is a callback-based asynchronous interface which enables the queueing of transformation requests and notification of their completion. The patch not only includes load balancing; it also has a pluggable mechanism allowing a choice of which load balancer to use. There is a priority mechanism built in, and a failover handler which does the right thing when a cryptographic peripheral fails. There is even a request routing feature for complicated transformations (encryption followed by signing, say) which may have to be performed by a series of devices.
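The general shape of a callback-based asynchronous interface can be sketched in a few lines of userspace C. This is not the API from the posted patch; every name below (demo_request, demo_submit(), and so on) is invented for illustration, the "device" is a simple XOR loop, and the priority ordering is just one plausible reading of "a priority mechanism."

```c
#include <stddef.h>

/* All names here are hypothetical; this is not the posted patch's API. */
struct demo_request;
typedef void (*demo_done_fn)(struct demo_request *req, int status);

struct demo_request {
	unsigned char *data;
	size_t len;
	int priority;			/* higher values are served first */
	demo_done_fn done;		/* completion callback */
	void *context;			/* caller's private data */
	struct demo_request *next;
};

struct demo_queue {
	struct demo_request *head;
};

/* Queue a request and return immediately; the work happens later. */
static void demo_submit(struct demo_queue *q, struct demo_request *req)
{
	struct demo_request **pp = &q->head;

	while (*pp && (*pp)->priority >= req->priority)
		pp = &(*pp)->next;
	req->next = *pp;
	*pp = req;
}

/*
 * Stand-in for the hardware: take the first queued request, apply a toy
 * "transformation", then notify the submitter through its callback.
 * Returns 1 if a request was processed, 0 if the queue was empty.
 */
static int demo_process_one(struct demo_queue *q)
{
	struct demo_request *req = q->head;
	size_t i;

	if (!req)
		return 0;
	q->head = req->next;
	for (i = 0; i < req->len; i++)
		req->data[i] ^= 0x5a;
	req->done(req, 0);
	return 1;
}

/* Example callback: counts successful completions via the context. */
static void demo_count_done(struct demo_request *req, int status)
{
	if (status == 0)
		++*(int *)req->context;
}
```

The important property is that demo_submit() returns before any work has been done; the caller learns the result only when its callback runs.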

The new code has been welcomed, though the developers have a number of issues with the specifics of the implementation. Chances are that those issues can be overcome, and the new asynchronous API will eventually find its way into the mainline. At that point, it will almost certainly obsolete the existing crypto APIs - for both crypto users and the implementation of software transforms. A certain amount of scrambling will be required to make everything work again, but, when the dust settles, Linux should have a much more comprehensive and capable cryptographic subsystem.

Trapfs - an automounter on the cheap

An automounter implements a special filesystem which mounts remote filesystems on demand, when requested by a user-space process. The Linux automounter (autofs) is a mildly complicated subsystem; the autofsNG patches make it somewhat more complicated yet. Adam Richter decided that he could make things simpler, and solve a wider class of problems at the same time. The result has been recently posted as trapfs, a filesystem which can do automounts and more in less than 500 lines.

Trapfs is derived from ramfs; by itself, it implements a simple, memory-based filesystem. A user-space process can create files, directories, device nodes, etc. in a trapfs filesystem, and everything will work as expected. There is one additional little twist, however: a trapfs filesystem can be mounted with the location of a special helper program given as a parameter. Whenever an attempt is made to look up a nonexistent file, the helper program is invoked and given a chance to cause that file to exist. When the helper exits, trapfs will return whatever the helper left behind to the original caller.
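The lookup path described above can be modeled in userspace as follows. This is only a sketch of the control flow: real trapfs runs a separate user-space helper program, while this model substitutes a function pointer, and all of the demo_ names and the "net-" naming rule are made up for the example.

```c
#include <string.h>

#define DEMO_MAX_FILES	8
#define DEMO_NAME_LEN	32

/* Hypothetical in-memory model; not the trapfs data structures. */
struct demo_trapfs {
	char names[DEMO_MAX_FILES][DEMO_NAME_LEN];
	int count;
	/* Stands in for the helper program given at mount time. */
	void (*helper)(struct demo_trapfs *fs, const char *name);
};

static int demo_find(const struct demo_trapfs *fs, const char *name)
{
	int i;

	for (i = 0; i < fs->count; i++)
		if (strcmp(fs->names[i], name) == 0)
			return i;
	return -1;
}

static int demo_create(struct demo_trapfs *fs, const char *name)
{
	if (fs->count >= DEMO_MAX_FILES || strlen(name) >= DEMO_NAME_LEN)
		return -1;
	strcpy(fs->names[fs->count], name);
	return fs->count++;
}

/*
 * On a miss, give the helper a chance to create the file, then return
 * whatever it left behind (possibly still nothing).
 */
static int demo_lookup(struct demo_trapfs *fs, const char *name)
{
	int idx = demo_find(fs, name);

	if (idx < 0 && fs->helper) {
		fs->helper(fs, name);
		idx = demo_find(fs, name);
	}
	return idx;
}

/* Example helper: only willing to create names starting with "net-". */
static void demo_helper(struct demo_trapfs *fs, const char *name)
{
	if (strncmp(name, "net-", 4) == 0)
		demo_create(fs, name);
}
```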

So, if you want to implement an automounter, you just set up a trapfs filesystem with a little script which can figure out which remote filesystem to mount in response to a lookup request. The task can be done with a screenful of commands - especially if security is not a big concern. Of course, there are some little details (such as unmounting idle filesystems) which are left as an exercise for the reader, but the basic idea is straightforward.

Another possibility is to use trapfs to create a devfs-style device filesystem. The helper program responds to lookup requests by seeing if an appropriate device node can be created.

Whether trapfs will prove useful for real-world tasks remains to be seen. It could have a role, however, in the creation of simple, dynamic filesystems in cases where a more complete solution (using FUSE, for example) is more work than is justified by the task. Unless there are major objections, Adam plans to try to get trapfs merged in the relatively near future.

Schedulers, pluggable and realtime

A constant fact of Linux kernel development would appear to be that people always want to play around with the CPU scheduler. Con Kolivas (with help from William Lee Irwin) has decided to make this playing easier through the creation of a pluggable scheduler framework. This mechanism is intended to make it possible for multiple schedulers to exist in the kernel, with one being selected for use at boot time. With "plugsched" in place, developers interested in experimenting with schedulers could switch quickly between them while running the same kernel.

The patch works by splitting the large body of code in kernel/sched.c into public and private parts. Code meant to be shared between schedulers goes into a new scheduler.c file, while the current (and default) scheduler stays put. Also added to scheduler.c is a new structure (struct sched_drv) containing pointers to the functions which handle scheduling tasks. These functions are invoked for various process events (fork(), exit(), etc.), to obtain scheduling-related information, and, of course, for calls to the core schedule() function. Implementing a new scheduler is simply a matter of writing replacements for the relevant functions and plugging the whole thing in.
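The idea of a table of function pointers selected by name can be sketched in userspace. The actual members of struct sched_drv are not listed here, so the structure below is a hypothetical stand-in with invented fields, and the two "schedulers" are deliberately trivial.

```c
#include <stddef.h>
#include <string.h>

struct demo_task {
	int id;
	int prio;	/* larger means more important in this sketch */
};

/* Hypothetical operations table; not the real struct sched_drv. */
struct demo_sched_drv {
	const char *name;
	void (*task_fork)(struct demo_task *t);
	struct demo_task *(*pick_next)(struct demo_task *tasks, int n);
};

static void demo_noop_fork(struct demo_task *t)
{
	(void)t;
}

static struct demo_task *demo_pick_first(struct demo_task *tasks, int n)
{
	return n > 0 ? &tasks[0] : NULL;
}

static struct demo_task *demo_pick_highest(struct demo_task *tasks, int n)
{
	struct demo_task *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (!best || tasks[i].prio > best->prio)
			best = &tasks[i];
	return best;
}

static const struct demo_sched_drv demo_drivers[] = {
	{ "fifo", demo_noop_fork, demo_pick_first },
	{ "prio", demo_noop_fork, demo_pick_highest },
};

/* The first entry plays the role of the default scheduler. */
static const struct demo_sched_drv *demo_current = &demo_drivers[0];

/* Select a scheduler by name, much as a boot parameter might. */
static int demo_select(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_drivers) / sizeof(demo_drivers[0]); i++) {
		if (strcmp(demo_drivers[i].name, name) == 0) {
			demo_current = &demo_drivers[i];
			return 0;
		}
	}
	return -1;	/* unknown name: keep the current scheduler */
}

/* Shared entry point: always dispatches through the selected driver. */
static struct demo_task *demo_schedule(struct demo_task *tasks, int n)
{
	return demo_current->pick_next(tasks, n);
}
```

Callers only ever use demo_schedule(); which policy runs behind it depends entirely on the table entry that was selected.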

There have been few objections to the pluggable scheduler implementation. Ingo Molnar, however, is strongly opposed to the idea in the first place:

I believe that by compartmenting in the wrong way we kill the natural integration effects. We'd end up with 5 (or 20) bad generic schedulers that happen to work in one precise workload only, but there would not be enough push to build one good generic scheduler, because the people who are now forced to care about the Linux scheduler would be content about their specialized schedulers.

Ingo's position is that having one core scheduler forces developers to think about the whole problem, rather than one small piece of it. In particular, claims Ingo, the scheduling domains patch would never have come about if the kernel had pluggable schedulers; instead there would be a separate NUMA scheduler, an SMP scheduler, and so on.

Ingo, meanwhile, continues his efforts to make the One Big Scheduler provide real-time response. The latest patch is -RT-2.6.10-rc1-mm2-V0.7.1. The biggest change in recent times is a new semaphore/mutex implementation which sticks closer to the original Linux semaphore semantics; this change allows a number of patches switching parts of the kernel over to the completion interface to be dropped.

The new semaphores also include a priority inheritance mechanism. Whenever a process blocks on a semaphore, the kernel checks to see if that process has a higher priority than the process currently holding the semaphore. If so, the holder's priority is bumped up to match that of the blocking process. This technique should help to avoid situations where a low-priority process can keep higher-priority tasks from running for extended periods of time.
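The check made at blocking time can be modeled with a small userspace sketch. The types and function names are hypothetical, larger numbers mean higher priority here, and restoring the holder's original priority on release is an assumption of this sketch rather than something stated above.

```c
#include <stddef.h>

/* Hypothetical model; not the realtime patch's data structures. */
struct demo_proc {
	int base_prio;	/* priority the process normally runs at */
	int prio;	/* effective priority, possibly boosted */
};

struct demo_sem {
	struct demo_proc *holder;
};

/*
 * Returns 1 if the semaphore was acquired. Returns 0 if the caller
 * would have to block; in that case the holder inherits the caller's
 * priority when the caller's is higher.
 */
static int demo_sem_acquire(struct demo_sem *sem, struct demo_proc *p)
{
	if (!sem->holder) {
		sem->holder = p;
		return 1;
	}
	if (p->prio > sem->holder->prio)
		sem->holder->prio = p->prio;
	return 0;
}

/* Assumption: the boost is dropped when the semaphore is released. */
static void demo_sem_release(struct demo_sem *sem)
{
	if (sem->holder) {
		sem->holder->prio = sem->holder->base_prio;
		sem->holder = NULL;
	}
}
```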

Patches and updates

Memory management

  • Andrea Arcangeli: PG_zero. (November 1, 2004)



Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds