Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.38, which was released by Linus on September 21. It contains a bunch of IA-64 updates, more partition handling and filesystem work, a JFS update, some IDE changes, and a few important bug fixes. The long-format changelog is available with the details.

Linus released 2.5.37 on September 20. Among other things, this release included a bunch more memory management and performance work from Andrew Morton, James Bottomley's x86 "subarchitecture" work (finally), an ACPI update, more threading performance work, an IrDA update, some IDE and block I/O enhancements, some device model work, various architecture updates, and the removal of Keith Owens from the MAINTAINERS file. Again, see the long-format changelog for the details.
Linus's BitKeeper tree, which will become 2.5.39, contains some preemptible kernel fixes, a temporary disk elevator fix to deal with some performance problems (see below for the likely form of the real fix), some thread fixups, a USB update, more VM and block I/O work, an ISDN update, the removal of the global blk_size array (Al Viro: "it is an ex-parrot"), and various other fixes and updates.
The current stable kernel is 2.4.19; there have been no 2.4.20 prepatches or -ac patches over the last week.
Kernel development news
Some followups from last week
Andrew Morton sent us a note stating that last week's discussion of the new API for putting processes to sleep missed an important objective of that work. The new interface is nice, but what he was really setting out to do was to improve wakeup performance. The new code removes waiting processes from the wait queue immediately at wakeup time, rather than letting the processes remove themselves whenever they get around to it. The result is that subsequent wakeups, if they come quickly, will run faster because they do not need to deal with processes that have already been awakened.

We also mentioned, last week, a posting on the leading-edge features used in the TPC benchmark results posted by HP. Lest anybody think that HP was using a highly patched, special-purpose kernel, they have posted a followup stating that a stock Red Hat kernel (from Advanced Server 2.1) was used in the benchmark runs.
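To make the wakeup change described above a bit more concrete, here is a toy illustration in ordinary user-space C - not the kernel's actual wait queue code, and every name in it is made up. The point is simply that, when the wakeup path unlinks the waiter immediately, a second wakeup arriving right behind it never has to step over entries that have been woken but have not yet removed themselves.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy waiter entry; the real kernel structures look nothing like this. */
    struct waiter {
        int id;
        struct waiter *next;
    };

    static struct waiter *queue;    /* head of the toy wait queue */

    static void add_waiter(int id)
    {
        struct waiter *w = malloc(sizeof(*w));
        w->id = id;
        w->next = queue;
        queue = w;
    }

    /* Wake the first waiter and unlink it on the spot.  In the old scheme the
     * entry would stay on the list until the woken process ran and removed
     * itself, so a second wakeup arriving quickly had to skip over it. */
    static void wake_one(void)
    {
        struct waiter *w = queue;
        if (!w)
            return;
        queue = w->next;            /* removed at wakeup time */
        printf("woke waiter %d\n", w->id);
        free(w);
    }

    int main(void)
    {
        add_waiter(1);
        add_waiter(2);
        wake_one();                 /* wakes waiter 2 */
        wake_one();                 /* wakes waiter 1 without scanning past 2 */
        return 0;
    }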
Ingo Molnar's new process ID allocator - and the objections to it - were covered last week. Ingo posted a new version of the patch which addressed some of the complaints, and which was to Linus's liking; it was merged into the 2.5.37 kernel.
A new deadline I/O scheduler
One of the results of all the recent VM improvements in the 2.5 tree is that the shortcomings of the current block I/O scheduler are becoming more apparent. The scheduler (or "elevator") tries to join adjacent requests together, and to sort requests along the disk in order to minimize seeks and optimize I/O performance. Unfortunately, the current scheduler is not all that good at treating requests fairly. In the name of overall performance, the scheduler can cause the starvation of individual requests for a very long time. Requests eventually get serviced, but if the starved request is satisfying a page fault in the X server, the user can get seriously annoyed in the meantime.

The scheduler shortcomings have always been there, but they are more apparent now because the VM subsystem does a more effective job of filling the disk request queues. Now that the scheduler problems can't hide behind VM problems, they need to be addressed. To that end, Jens Axboe has dusted off and fixed up his deadline I/O scheduler patch, and is looking for people to try it out and report on how it works. His initial results are good.
The "AMIW" works, essentially, by creating a great deal of disk write traffic, then seeing how long it takes to list out a large directory while the writes are being handled.
The idea behind the deadline scheduler is that all read requests get satisfied within a specified time period - 1/2 second by default. (Write requests do not have specific deadlines; processes usually can not continue until their reads are performed, but writes can generally be deferred indefinitely). The deadline performance is achieved by sorting I/O requests into several different queues:
- The sorted queues (sort_list) contain I/O requests sorted by the elevator for best disk performance. There are two queues, one for read requests, and one for writes.
- The FIFO queue (read_fifo) contains all read requests. Since it's a FIFO, the request at the head of the queue will be the oldest, and will be the one whose deadline expires first.
- The dispatch queue contains requests which are ready to be passed down to the block driver. Requests do not go into the dispatch queue until the scheduler decides that it is time to execute them.
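Putting those pieces together, the scheduler's bookkeeping might look something like the sketch below. This is a reconstruction from the description above, not the actual patch; the deadline_data name and its fields are assumptions, and it leans on the kernel's list_head machinery.

    #include <linux/list.h>

    /* Hypothetical sketch of the deadline scheduler's per-queue state. */
    struct deadline_data {
        struct list_head sort_list[2];  /* elevator-sorted requests: [0] reads, [1] writes */
        struct list_head read_fifo;     /* reads in arrival order; the head is the oldest */
        struct list_head *dispatch;     /* requests already chosen for the driver */

        unsigned long read_expire;      /* read deadline, 1/2 second by default */
        unsigned int  writes_starved;   /* read batches sent since the last write batch */
        unsigned int  fifo_batch;       /* batch budget: 64, with a seek counting as 16 */
    };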
With these lists, the operation of the scheduler is not that hard. Incoming requests are merged into the proper sort_list in the usual way: they are sorted by sector number, and merged with adjacent requests. (The real operation is complicated by barrier operations and a hashing mechanism designed to improve the performance of the scheduler itself, but the basic idea is as described.) Read requests are also given a deadline time stamp and put on the end of the read_fifo list.
When the block driver is ready to start another disk I/O request, the core algorithm of the deadline scheduler comes into play. In simplified form, it works like this:
- If there are requests waiting in the dispatch queue then the scheduler need do no work, since it has already decided what it wants to do next.
- Otherwise it's time to move a new set of requests to the dispatch queue. The scheduler looks in the following places, taking requests from (only) the first source that it finds:
  - If there are pending write requests, and the scheduler has not selected any writes for "a long time," a set of write requests is selected.
  - If there are expired read requests in the read_fifo list, take a set from there.
  - If there are pending reads in the sorted queue, some of them are moved.
  - Finally, if there are pending write operations, the dispatch queue is filled from the sorted write queue.
- Once that's done, if there are any requests in the dispatch queue (i.e. if there's anything for the disk to do at all) the first one is passed back to the driver.
The definition of "a long time" for write request starvation is simply two iterations of the scheduler algorithm. After two sets of read requests have been moved to the dispatch queue, the scheduler will send over a set of writes. A "set," by default, can be as many as 64 contiguous requests - but a request that requires a disk seek counts the same as 16 contiguous requests.
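As a rough sketch of that decision sequence - built from the description above rather than from the patch itself, with all names invented - the selection logic boils down to a simple priority order:

    #include <stdbool.h>
    #include <stdio.h>

    enum source { NOTHING, ALREADY_QUEUED, WRITE_BATCH, EXPIRED_READS, SORTED_READS };

    /* A condensed view of the scheduler's state at dispatch time. */
    struct sched_state {
        bool dispatch_nonempty;      /* requests already moved to the dispatch queue */
        bool reads_pending;          /* anything on the read sort_list? */
        bool writes_pending;         /* anything on the write sort_list? */
        bool oldest_read_expired;    /* head of read_fifo past its 500 ms deadline? */
        int  batches_since_write;    /* "a long time" is two read batches */
    };

    static enum source pick_source(const struct sched_state *s)
    {
        if (s->dispatch_nonempty)
            return ALREADY_QUEUED;              /* decision was already made */
        if (s->writes_pending && s->batches_since_write >= 2)
            return WRITE_BATCH;                 /* keep writes from starving */
        if (s->oldest_read_expired)
            return EXPIRED_READS;               /* start a batch at the fifo head */
        if (s->reads_pending)
            return SORTED_READS;                /* the common case */
        if (s->writes_pending)
            return WRITE_BATCH;                 /* nothing else to do */
        return NOTHING;
    }

    int main(void)
    {
        struct sched_state s = { false, true, true, false, 2 };
        printf("source = %d\n", pick_source(&s));   /* prints 2 (WRITE_BATCH) */
        return 0;
    }

A real implementation would also move a batch of requests (up to 64 requests' worth of work, with a seek counting as 16) onto the dispatch queue and reset the write-starvation count; the sketch only shows which queue gets picked.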
There has been no word from Linus on this patch; barring problems, however, it would be surprising if some form of it were not merged in the near future.
Keeping disks busy
Some of the changes that have been stressing the I/O scheduler have gone in very recently. A couple of patches from Andrew Morton are currently sitting in Linus's 2.5.39 BitKeeper tree; they are worth a look.

In the end, much of the work done by the VM subsystem is deciding which pages to move to which disks, and when. A good set of decisions will lead to good performance; but if the VM is not smart about which pages it shoves out, performance can suffer. Some of Andrew's recent efforts stem from an important observation that has been somewhat overlooked until now: there is little point in trying to write pages to disks which are already overwhelmed with requests.
If you want to try to direct your efforts toward disks which are not overly busy, you first need some sort of indication of just how much work each drive has to do. So Andrew has added a new set of functions that report on whether a device's request queue is congested or not. The test used is simplistic: a device's read or write queue is not congested if at least 25% of the allocated request queue entries (a fixed number of these is allocated at queue creation time) are available for use. A simple test is good enough, though, especially considering that the size of a request queue tends to be volatile.
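In minimal form, and assuming the 25% threshold described above, the test might look like this (the names are illustrative, not the kernel's actual interface):

    #include <stdbool.h>

    /* Each queue gets a fixed number of request structures at creation time. */
    struct request_pool {
        int total_slots;    /* allocated when the queue was created */
        int free_slots;     /* currently available for new requests */
    };

    /* Congested: fewer than a quarter of the preallocated slots are free. */
    static bool queue_is_congested(const struct request_pool *rl)
    {
        return rl->free_slots < rl->total_slots / 4;
    }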
Once you can test for a congested state, you can start making smarter decisions. With these functions in place, Linus merged another patch which causes the ext2 filesystem to cut back on speculative readahead operations if the underlying device is busy. If the disks are backed up, presumably there are more important things for them to be doing than reading ahead data that may or may not be used.
More impressive performance gains, however, can be had by looking at the pdflush subsystem. pdflush is a set of kernel threads whose job it is to write dirty file data back to the underlying filesystems. A fair amount of effort goes into keeping separate pdflush threads from trying to write back to the same device, and into simply keeping the right number of pdflush threads around.
With the new scheme, life gets easier. pdflush simply passes over pages whose destination queue is congested, concentrating instead on pages that can be written to less busy devices. Thus pdflush no longer blocks on request queues, and can focus on keeping them all full. A side benefit is that a single pdflush thread may now be sufficient.
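In toy form, a writeback pass built on such a test might look like the following; this is illustrative user-space C, not the pdflush code itself, and it just prints what it would write.

    #include <stdbool.h>
    #include <stdio.h>

    struct dev   { const char *name; int free_slots, total_slots; };
    struct dpage { struct dev *dev; int index; };   /* a dirty page and its target device */

    static bool congested(const struct dev *d)
    {
        return d->free_slots < d->total_slots / 4;
    }

    /* Skip pages whose device is backed up instead of blocking on its queue;
     * less busy devices keep getting work. */
    static void writeback_pass(struct dpage *pages, int n)
    {
        for (int i = 0; i < n; i++) {
            if (congested(pages[i].dev))
                continue;           /* leave it dirty; try again on a later pass */
            printf("writing page %d to %s\n", pages[i].index, pages[i].dev->name);
        }
    }

    int main(void)
    {
        struct dev busy = { "hda", 10, 128 }, idle = { "hdb", 100, 128 };
        struct dpage pages[] = { { &busy, 1 }, { &idle, 2 }, { &busy, 3 } };
        writeback_pass(pages, 3);   /* only page 2 gets written on this pass */
        return 0;
    }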
According to Andrew: "This code can keep sixty spindles saturated - we've never been able to do that before."

It is increasingly apparent that the 2.6 kernel is going to be an amazing performer in numerous areas, thanks to work like this.
Doing things in the kernel
Kernel developers often put a great deal of effort into keeping tasks out of the kernel; the thinking is that if a task can be done in user space, that's where it should happen. Keeping things out of the kernel helps to keep the kernel code smaller, and it makes it easier for users to set their own policies. Sometimes, however, there are reasons to move tasks which have been performed in user space for a long time over the line and into the kernel. A couple of patches have been circulating recently which do exactly that.

Ingo Molnar's kksymoops patch performs the same task as the classic "ksymoops" utility: it takes the numeric "oops" output from a kernel panic and attaches symbolic information so it actually makes sense to humans. Or, at least, to kernel hackers, who really are human, despite occasional rumors to the contrary.
Why move this task into the kernel? Anybody who has actually used ksymoops to get a symbolic trace knows that it can be quite a pain to set up, especially when loadable modules are involved. Without quite a bit of information on the run-time configuration of the actual kernel that was running when the oops happened, ksymoops can not do its job. Expecting casual users (or customers) to set up ksymoops and use it properly when reporting problems is asking too much.
But the kernel itself knows a lot about its current configuration. It's actually not all that hard for the kernel to generate symbolic names for its own addresses; it's just a matter of keeping a symbol table around. If the kernel can generate useful oops reports from the beginning, the kernel developers will get better information on crashes, and problems will get fixed more quickly. It's probably worth the trouble.
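The core of the idea is simple enough to sketch: keep a table of symbol addresses sorted by address, and report a raw address as the nearest symbol below it plus an offset. The table below is invented for illustration; the real patch naturally works from the kernel's own symbol information.

    #include <stdio.h>

    struct sym { unsigned long addr; const char *name; };

    /* A tiny, made-up stand-in for the kernel's symbol table (sorted by address). */
    static const struct sym symtab[] = {
        { 0xc0100000UL, "startup_32" },
        { 0xc0114a30UL, "schedule"   },
        { 0xc0152bd0UL, "do_fork"    },
    };
    static const int nsyms = sizeof(symtab) / sizeof(symtab[0]);

    /* Print an address as "symbol+offset", the form a human can make sense of. */
    static void print_symbolic(unsigned long addr)
    {
        const struct sym *best = NULL;
        for (int i = 0; i < nsyms; i++)
            if (symtab[i].addr <= addr)
                best = &symtab[i];

        if (best)
            printf("0x%08lx  <%s+0x%lx>\n", addr, best->name, addr - best->addr);
        else
            printf("0x%08lx  <unknown>\n", addr);
    }

    int main(void)
    {
        print_symbolic(0xc0114b10UL);   /* reports schedule+0xe0 with this toy table */
        return 0;
    }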
Then, there is Rusty Russell's loadable module rewrite. The insmod utility has traditionally done much of the work of loading modules, including tasks like parsing the executable file and doing symbol binding and relocation. Rusty's patch, instead, moves these tasks into the kernel. Thus, with this patch, the kernel is in the business of picking apart ELF binaries and linking symbols itself.
Surely this sort of work belongs in user space, right? Rusty's answer is:
- The amount of kernel code required by the new module loader is smaller than that required by the older, "simpler" one. That is because the in-kernel loader has fewer "communication with user space" issues, and doesn't have to deal with differences between the user and kernel space runtime environments.
- The amount of user-space code is, of course, much smaller. This reduction is useful for some situations, such as initramfs, where the module utilities need to fit into a small space.
- In-kernel module linking is more flexible and has an easier time with tasks like discarding initialization code.
- In the end, the whole body of code is simpler.
Whether Rusty will succeed in convincing the other kernel developers remains to be seen. One way or another, he has made it clear that the proper place to draw the line between kernel and user space is not always obvious.
Page editor: Jonathan Corbet