
Kernel development

Release status

Kernel release status

The current development kernel is 2.5.36, which was released by Linus on September 17. The big news is, of course, the merge of the XFS journaling filesystem. There's also the x86 "huge page" patch, an IEEE-1394 ("Firewire") update, a big USB update (converting the code to the new driver model scheme), an IDE update, and various other fixes. See the long-format changelog for the details.

Linus has had a busy week; 2.5.35 was released on September 15. This (large) patch included, among other things, the merge of User-mode Linux, a large IDE update, various memory management improvements, more threading improvements, a bunch of NFS server patches and PPC64 and SPARC updates. Again, the long-format changelog has the details.

Linus's BitKeeper tree, which will become 2.5.37, has some block I/O work, some RPC fixes, a bit of memory management work, and Linus's simple solution to the get_pid() problem (see below).

The current 2.5 Status Summary from Guillaume Boissiere is dated September 17.

The current stable kernel is 2.4.19; Marcelo released 2.4.20-pre7 on September 12. Big MIPS and IA-64 updates make up the bulk of the patch this time around, along with a relatively small set of other fixes.

Alan Cox's current prepatch is 2.4.20-pre7-ac2. The IDE work continues; this patch also contains a number of other, unrelated fixes.

The current ancient kernel is 2.2.22, which was released by Alan Cox on September 16. It contains a few security fixes, so people still running 2.2 will probably want to have a look at this update.


Kernel development news

A new way to sleep?

A quick look through the kernel source will turn up no end of examples of code like:

    while (some_condition)
        interruptible_sleep_on(some_queue);

The idea, of course, is to put the process to sleep until something of interest has happened. The problem with this kind of code is that if the condition changes (and the wakeup happens) between the two lines of code above, the process will miss the wakeup and could sleep for far longer than intended. Because of this inherent race condition, the elimination of sleep_on() and its variants has been on the kernel hackers' todo list for some time.

There is a macro (wait_event) which can be used to sleep safely, but most code which includes race-free sleeps does so manually with the following approximate steps:

  • Create a wait queue entry (usually with DECLARE_WAITQUEUE).

  • Change the process to a state (usually TASK_INTERRUPTIBLE) which indicates that it is asleep - even though the process is still running in kernel code.

  • Add the current process to a wait queue which will be awakened when the condition is met.

  • Test the condition of interest; if no sleep is necessary, reset the process state to TASK_RUNNING, remove the wait queue entry, and get on with the job at hand.

  • Otherwise call the scheduler to let some other process run until somebody wakes the current process up.

  • On wakeup, go back to the top and do it all again.

This sequence works because a wakeup will reset the task state to TASK_RUNNING; this "shorts out" the sleep should the process test its condition at the wrong time and call the scheduler after the wakeup has happened. In many places, the above steps are complicated by the need to release locks or other resources before invoking the scheduler. The result is a lot of duplicated (and error-prone) code throughout the kernel - and this is the "safe" way of doing things.
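The reason the ordering matters can be seen in a small userspace model. This is only a sketch: struct task, model_wake_up() and model_schedule() are invented stand-ins that mimic the behavior described above, not kernel code. Both test functions let the wakeup arrive between the condition test and the call to the scheduler; only the point at which the task state is changed differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; the names imitate the kernel's but nothing here is kernel code. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct task {
    enum task_state state;
    bool blocked;               /* true once the model has really gone to sleep */
};

/* A wakeup always puts the task back into TASK_RUNNING. */
void model_wake_up(struct task *t)
{
    assert(t != NULL);
    t->state = TASK_RUNNING;
    t->blocked = false;
}

/* The scheduler only takes a task off the CPU if it is not TASK_RUNNING. */
void model_schedule(struct task *t)
{
    assert(t != NULL);
    if (t->state != TASK_RUNNING)
        t->blocked = true;
}

/* sleep_on() style: test first, change state afterwards. */
bool racy_sleep_misses_wakeup(void)
{
    struct task t = { TASK_RUNNING, false };
    bool condition_not_met = true;      /* tested: we think we must sleep */

    if (condition_not_met) {
        model_wake_up(&t);              /* wakeup arrives in the window */
        t.state = TASK_INTERRUPTIBLE;   /* state is only set now, too late */
        model_schedule(&t);
    }
    return t.blocked;                   /* asleep, and the wakeup is already gone */
}

/* Manual "safe" style: change state first, then test. */
bool safe_sleep_misses_wakeup(void)
{
    struct task t = { TASK_RUNNING, false };
    bool condition_not_met;

    t.state = TASK_INTERRUPTIBLE;       /* marked asleep before the test */
    condition_not_met = true;           /* tested: we think we must sleep */
    if (condition_not_met) {
        model_wake_up(&t);              /* wakeup arrives in the same window */
        model_schedule(&t);             /* no-op: state is TASK_RUNNING again */
    }
    return t.blocked;
}
```

In this model racy_sleep_misses_wakeup() ends with the task blocked and nobody left to wake it, while safe_sleep_misses_wakeup() falls straight through the scheduler call.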

As part of his 2.5.35-mm1 patch, Andrew Morton has included a new interface designed to simplify the coding of safe sleeps. Code using the new API looks like:

    DECLARE_WAIT(queue_entry);
    prepare_to_wait(&wait_queue, &queue_entry, TASK_INTERRUPTIBLE);
    if (condition_not_met)
        schedule();
    finish_wait(&wait_queue, &queue_entry);

The actual series of events has not really changed; things have just been packaged inside the new prepare_to_wait() and finish_wait() functions. The result, though, is code which is cleaner and more likely to be correct. Now it's just a matter of those hundreds of sleep_on() calls still in the 2.5 kernel source...


Solving the process ID allocation problem

Ingo Molnar, in his project to give Linux "world-class threading support," has set his sights on another Linux performance problem: the allocation of process ID (PID) numbers for new processes. This does not seem like it should be a difficult problem, but the current kernel get_pid() shows quadratic behavior when the number of processes gets large. Essentially, the algorithm looks like this:

    for each possible PID
        for each task in the system
            if task_pid == pid
                keep_trying

The above is an oversimplification, since the get_pid() code tries to find a range of usable PIDs, not just one. Look here for the real get_pid() implementation. The point is that, with very large numbers of processes (i.e. on the order of 100,000), process ID allocation can lock up the system for long periods of time.
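A stripped-down model of that nested loop, written as ordinary userspace C, makes the cost visible. It is not the kernel's get_pid(): the task list is a plain array, the range-finding optimization is left out, and naive_get_pid(), pid_in_use() and the comparisons counter are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts task-list comparisons so the amount of work can be inspected. */
static unsigned long comparisons;

/* Walk the whole task list looking for an owner of `pid`. */
static int pid_in_use(const int *task_pids, size_t ntasks, int pid)
{
    for (size_t i = 0; i < ntasks; i++) {
        comparisons++;
        if (task_pids[i] == pid)
            return 1;
    }
    return 0;
}

/*
 * Try each candidate PID in turn. Every candidate costs up to one full
 * walk of the task list, which is where the quadratic behavior comes from.
 * Returns the first free PID in [first_pid, max_pid], or -1 if none is free.
 */
int naive_get_pid(const int *task_pids, size_t ntasks,
                  int first_pid, int max_pid)
{
    assert(task_pids != NULL || ntasks == 0);

    for (int pid = first_pid; pid <= max_pid; pid++)
        if (!pid_in_use(task_pids, ntasks, pid))
            return pid;
    return -1;
}
```

With four tasks holding PIDs 1 through 4, a search starting at PID 1 walks the task array five times before settling on 5. When N tasks hold N consecutive PIDs the work grows roughly as N squared.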

Ingo's solution starts with some work done by William Lee Irwin. William's "idtag" infrastructure adds hash tables for managing things with numeric ID tags; it is used in this patch to manage PID-related things like process groups and session IDs. The idtags help to eliminate many iterations over the whole process space done in the kernel, but do not solve the PID allocation problem.

Ingo handles PID allocation through a new allocator that he wrote from scratch. This allocator maintains an array of pages (allocated as needed) which are used as PID bitmaps; allocating a new PID becomes a matter of finding a page with a free PID available, then finding and setting the first clear bit. It all happens with no locking required. Ingo claims:
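The bitmap idea can be sketched in a few lines of userspace C. This is not Ingo's patch: the page size, the MAX_PAGES limit and the names alloc_pid() and free_pid() are assumptions, the sketch adopts the convention that a 1 bit marks a PID as in use, and it is single-threaded, so it makes no attempt at the lock-free behavior described above (that would need atomic bit operations).

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define PAGE_BYTES    4096                      /* assumed page size */
#define PIDS_PER_PAGE (PAGE_BYTES * CHAR_BIT)   /* PIDs tracked by one page */
#define MAX_PAGES     32                        /* hypothetical limit */

struct pid_page {
    unsigned char *bits;    /* NULL until the page is first needed */
    int nr_free;            /* free PIDs remaining in this page */
};

static struct pid_page pid_pages[MAX_PAGES];

/* Returns the lowest free PID (the model starts at 0), or -1 on failure. */
int alloc_pid(void)
{
    for (int p = 0; p < MAX_PAGES; p++) {
        struct pid_page *pg = &pid_pages[p];

        if (!pg->bits) {                        /* allocate pages lazily */
            pg->bits = calloc(1, PAGE_BYTES);
            if (!pg->bits)
                return -1;
            pg->nr_free = PIDS_PER_PAGE;
        }
        if (pg->nr_free == 0)
            continue;                           /* full page: skip it whole */

        for (int byte = 0; byte < PAGE_BYTES; byte++) {
            if (pg->bits[byte] == UCHAR_MAX)
                continue;                       /* all bits in this byte taken */
            for (int bit = 0; bit < CHAR_BIT; bit++) {
                if (!(pg->bits[byte] & (1u << bit))) {
                    pg->bits[byte] |= (unsigned char)(1u << bit);
                    pg->nr_free--;
                    return p * PIDS_PER_PAGE + byte * CHAR_BIT + bit;
                }
            }
        }
    }
    return -1;
}

/* Releases a PID previously returned by alloc_pid(). */
void free_pid(int pid)
{
    assert(pid >= 0 && pid < MAX_PAGES * PIDS_PER_PAGE);

    struct pid_page *pg = &pid_pages[pid / PIDS_PER_PAGE];
    int off = pid % PIDS_PER_PAGE;

    assert(pg->bits && (pg->bits[off / CHAR_BIT] & (1u << (off % CHAR_BIT))));
    pg->bits[off / CHAR_BIT] &= (unsigned char)~(1u << (off % CHAR_BIT));
    pg->nr_free++;
}
```

The per-page nr_free counter is what keeps the worst case cheap in this sketch: a full page is rejected with one comparison, so the bit search only ever runs inside a page known to have room.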

Ie. even in the most hopeless situation, if there are 999,999 PIDs allocated already, it takes less than 10 usecs to find and allocate the remaining one PID. The common fastpath is a couple of instructions only.

So it's fast - though a few extra features have been requested. But this patch has stirred up a bit of a debate. Rather than put in a complicated new PID allocator, it is asked, why not just make the maximum PID be very large? Then, in theory, the quadratic part of get_pid() will never run so the performance problems go away, and the code stays simpler. Linus prefers this approach, as do a number of other developers; he has put a simple patch along these lines into his pre-2.5.37 BitKeeper tree.

Ingo disagrees, pointing out that any reasonable maximum PID size can be exceeded eventually. He would rather fix the problem than try to hide it behind a large process ID space. In the absence of real-world examples that show people being bitten by get_pid()'s behavior in a larger PID space, though, Linus appears unlikely to accept any more complicated fix.


Asynchronous I/O moves forward

There has been little (visible) progress with the asynchronous I/O code since the AIO core was merged into the 2.5.32 kernel. AIO author Ben LaHaise has not been idle, however. Slowly the other pieces of the AIO package are beginning to show up for the 2.5 tree.

One piece is this patch which adds "synchronous IOCBs" to the mix. One might wonder why an asynchronous I/O infrastructure needs I/O control blocks which have a synchronous option. The answer is that the synchronous IOCB is needed to achieve the goal of making most or all low-level I/O operations in the kernel be asynchronous. Once the I/O primitives expect an IOCB, and they work in an asynchronous mode, it is easy to layer the older, synchronous versions on top through the use of a synchronous IOCB. For now, synchronous IOCBs are only used in the generic_file_read() function.
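How a synchronous call can sit on top of an asynchronous primitive is easy to show with a toy model. Nothing here comes from the actual patch: struct toy_iocb, toy_read_async(), toy_run_completions() and toy_read_sync() are hypothetical names, the "device" is a memcpy(), and where this sketch loops on the completion queue a kernel would put the caller to sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature I/O control block; not the kernel's structure. */
struct toy_iocb {
    bool done;
    long result;
    void (*complete)(struct toy_iocb *iocb, long result);
    const char *src;        /* the request: copy len bytes from src to dst */
    char *dst;
    size_t len;
};

#define QUEUE_MAX 8
static struct toy_iocb *queue[QUEUE_MAX];
static int queued;

/* Asynchronous primitive: queue the request and return immediately. */
int toy_read_async(struct toy_iocb *iocb)
{
    assert(iocb != NULL && iocb->complete != NULL);
    if (queued == QUEUE_MAX)
        return -1;
    iocb->done = false;
    queue[queued++] = iocb;
    return 0;
}

/* Stands in for the device finishing its queued work at some later time. */
void toy_run_completions(void)
{
    for (int i = 0; i < queued; i++) {
        struct toy_iocb *iocb = queue[i];

        memcpy(iocb->dst, iocb->src, iocb->len);
        iocb->complete(iocb, (long)iocb->len);
    }
    queued = 0;
}

/* Completion handler for a "synchronous" IOCB: record the result. */
void sync_complete(struct toy_iocb *iocb, long result)
{
    iocb->result = result;
    iocb->done = true;
}

/* Old-style synchronous call, layered on the asynchronous primitive. */
long toy_read_sync(const char *src, char *dst, size_t len)
{
    struct toy_iocb iocb = {
        .complete = sync_complete, .src = src, .dst = dst, .len = len,
    };

    if (toy_read_async(&iocb) < 0)
        return -1;
    while (!iocb.done)
        toy_run_completions();      /* a kernel would sleep here instead */
    return iocb.result;
}
```

The low-level routine only ever sees an IOCB; whether the caller waits for it or goes off to do something else is decided one layer up.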

The next step, perhaps, is this patch from Badari Pulavarty, which reworks the direct I/O (DIO) infrastructure. The DIO code handles direct operations on block devices - such as when a "raw" device is used, or when a file is opened with the O_DIRECT option. The DIO operations, with this patch, are all asynchronous, with synchronous IOCBs used when synchronous behavior is required. With this change, the task of making the block I/O subsystem asynchronous internally is nearly complete. Other subsystems (e.g. char devices, networking) remain to be converted over to the AIO scheme, however.


The Linux TPC results and kernel changes

HP has recently been trumpeting its results running the TPC-C benchmark with Oracle on Linux. Slightly better performance than that achieved with Windows is claimed. What may be more interesting is this note posted to the linux-kernel list on what HP did to its kernel to achieve those results. The kernel that ran the benchmark had a few patches:

  • Asynchronous I/O. Apparently using AIO improved performance by about 5%.

  • Large pages. Going to 2MB pages (i.e. using the large page patch that went into 2.5.36) improved performance by 8%.

The benchmark also made extensive use of high memory (16GB worth), direct I/O, and a number of other recent kernel features.


Patches and updates

Memory management

  • Rik van Riel: rmap 14b. (September 18, 2002)

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds