Kernel development
Brief items
Kernel release status
The current development kernel is still 2.5.59; Linus remains away from his keyboard.

The current stable kernel is 2.4.20; Marcelo released the fourth 2.4.21 prepatch on January 29. It includes some MTD driver fixes, a couple of netfilter bug fixes, the new IPMI driver, fixes for the ethernet information leakage vulnerability, an x86-64 update, and various other fixes and updates.
The latest prepatch from Alan Cox is 2.4.21-pre3-ac5; it has a few new IDE changes and a small set of fixes.
Kernel development news
Anticipatory I/O scheduling
If an operating system is to perform well, it must get top performance out of its disk drives. The best use of disks is generally obtained by recognizing a couple of basic facts of life:
- Disk seeks are very slow. Overall transfer rates go up significantly
if requests can be ordered to minimize the number and distance of seek
operations.
- Write operations can (usually) happen whenever, but there is almost always a process waiting for the completion of a read. Prioritizing reads over writes will increase the parallelism and perceived responsiveness of a system.
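The seek-minimization point can be illustrated with a tiny user-space model (a hypothetical sketch, not kernel code): sorting a batch of queued requests by target sector greatly reduces total head movement.

```c
#include <stdlib.h>

/* Compare two requests by target sector, for qsort(). */
static int cmp_sector(const void *a, const void *b)
{
    long sa = *(const long *)a, sb = *(const long *)b;
    return (sa > sb) - (sa < sb);
}

/* Total head movement needed to service the requests in array order,
 * starting with the head at sector 0. */
long total_seek_distance(const long *sectors, int n)
{
    long pos = 0, dist = 0;
    for (int i = 0; i < n; i++) {
        dist += labs(sectors[i] - pos);
        pos = sectors[i];
    }
    return dist;
}
```

Sorting the batch {100, 10, 90, 20} cuts the total movement from 340 to 100 sectors. Real elevator schedulers are cleverer — they also merge adjacent requests and guard against starvation — but the principle is the same.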
Unfortunately, minimizing seeks and prioritizing reads can be somewhat contradictory goals. The way to keep seeks small and relatively rare is to maintain a sizeable backlog of requests; that way, nearby requests can be grouped and executed together. Getting the best performance on writes, in other words, requires delaying write operations for a period of time. If reads are to be handled quickly, however, they cannot be delayed and kept in the request queue in this way.
It is even worse than that, actually. A process which is writing a file can have several outstanding write requests at any given time, since that process usually does not care when any particular request is completed. Processes issuing reads, on the other hand, usually generate them one at a time. The next read operation will not be requested until the previous one completes. So, even if delaying read operations were an acceptable thing to do, accumulating a backlog of close read operations is an unlikely prospect.
The 2.4 kernel tends to mix reads in with the queue of write operations. Given the way things work (a read is not particularly likely to be close to an unrelated set of writes), reads tend to get put toward the end of the queue and executed slowly. That is one reason why 2.4 can be slow to respond when there is a lot of write activity going on.
In the 2.5 kernel, reads are not allowed to languish long before they are pushed to the head of the queue and executed. This change can improve performance significantly, but it still does not solve the whole problem. As Andrew Morton put it in the 2.5.59-mm5 patch set:
But in the common case, the application which submitted a read is about to go and submit another one, closeby on-disk to the first. So whoops, we have to seek back to service that one as well.
As a result, overall performance suffers, since the disk is spending too much time seeking.
2.5.59-mm5 contains a new "anticipatory I/O scheduler" by Nick Piggin which attempts to address this problem. The basic idea is simple: if the drive has just handled a read request, assume that there is another one coming behind it and simply wait for a little bit. In this case, the request queue is plugged, the I/O scheduler sets a timer, and no more requests are passed down to the drive for a millisecond or so. If a "close" request shows up during the wait time, it is serviced right away; the distance that the kernel considers "close" grows as time passes. Eventually the close requests will stop coming, or the kernel will decide that it's time to get around to some writes regardless; at that point normal request dispatching resumes.
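The core decision — dispatch a waiting request now, or keep the queue plugged a bit longer — can be modeled with a toy helper. Everything here (the names, the linear-growth policy, the `MAX_RADIUS` constant) is an assumption for illustration; the real patch is considerably more elaborate.

```c
/* Hypothetical maximum "close" distance, in sectors. */
enum { MAX_RADIUS = 1024 };

/* Decide whether a newly arrived request should be dispatched while the
 * queue is plugged for anticipation: dispatch if the wait window has
 * expired, or if the request is within the current "close" radius,
 * which grows as the window ages. */
int should_dispatch_now(long elapsed_us, long seek_distance, long window_us)
{
    if (elapsed_us >= window_us)
        return 1;   /* waited long enough; service whatever is queued */
    long close_radius = (long)MAX_RADIUS * elapsed_us / window_us;
    return seek_distance <= close_radius;
}
```

Early in the window only requests very near the last read get through; as the deadline approaches, more distant requests qualify, until the timer fires and normal dispatching resumes.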
Andrew reports a big improvement in performance (nearly a factor of six) over 2.5.59 for a simple test he was running. The code, it is said, still needs "quite some work," but it's a good start. 2.6 looks on track to be the most responsive Linux kernel yet.
Fast reader/writer locks
The Linux kernel contains a number of primitives for controlling mutual exclusion. Semaphores and spinlocks (in several varieties) have been around for a while, and the read-copy-update mechanism was added in the 2.5 series. Yet another mechanism, called "fast reader/writer locks," has found its way into Andrew Morton's -mm patch set, and appears likely to be forwarded on to Linus soon for inclusion. So this seems like as good a time as any to look at how "frlocks" work.

Frlocks, as implemented by Stephen Hemminger, are aimed at solving a couple of problems with the gettimeofday() system call. One is simple performance; gettimeofday() is not particularly slow, but some applications (including anything using the X Window System) call it frequently. It also turns out that the current implementation, which uses reader/writer spinlocks, is susceptible to a denial of service problem. Frequent calls to gettimeofday() can delay or lock out timer tick updates.
The frlock patch works by not blocking readers or writers at all. Code wishing write access to the protected data structure is given that access immediately (at least, in the absence of other writers), so there is no way that time updates can be blocked or delayed. Readers, too, get immediate access to the data structure. The catch is that readers must be prepared to retry the access if it collides with a writer for access to the data.
The lock works by maintaining "pre" and "post" sequence numbers. A writer process does the following:
- Take out a spinlock associated with the frlock
- Increment the "pre" sequence
- Mess around with the data structure
- Increment the "post" sequence
- Release the spinlock
Readers do something like the following:
- Remember the lock's "post" sequence number
- Grab the data of interest
- Ensure that the lock's "pre" sequence matches the remembered "post" sequence. If not, go back to the beginning.
In other words, as long as the sequence numbers match, the reader knows that no writer changed the data while the reader was doing its thing. In practice, the reader side tends to be expressed in code like:
    do {
        seq = fr_read_begin(&some_lock);
        /* Copy the data */
    } while (seq != fr_read_end(&some_lock));
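Putting the two halves together, a user-space sketch of the mechanism might look like the following. A pthread mutex stands in for the kernel spinlock; the function names follow the article's fr_read_begin()/fr_read_end(), but the bodies are an illustrative assumption, not Stephen's actual patch.

```c
#include <pthread.h>

struct frlock {
    unsigned pre, post;        /* writer sequence counters */
    pthread_mutex_t lock;      /* stand-in for the kernel spinlock */
};

#define FRLOCK_INIT { 0, 0, PTHREAD_MUTEX_INITIALIZER }

/* Writer side: serialize against other writers, bump "pre" before
 * touching the data and "post" afterward. */
void fr_write_lock(struct frlock *fr)
{
    pthread_mutex_lock(&fr->lock);
    fr->pre++;
    __sync_synchronize();      /* data writes must not pass "pre" */
}

void fr_write_unlock(struct frlock *fr)
{
    __sync_synchronize();      /* "post" must not pass the data writes */
    fr->post++;
    pthread_mutex_unlock(&fr->lock);
}

/* Reader side: sample "post" first and "pre" last; a mismatch means a
 * writer was active and the copied data cannot be trusted. */
unsigned fr_read_begin(struct frlock *fr)
{
    unsigned seq = fr->post;
    __sync_synchronize();
    return seq;
}

unsigned fr_read_end(struct frlock *fr)
{
    __sync_synchronize();
    return fr->pre;
}
```

While a write is in progress, "pre" runs ahead of "post", so any overlapping reader sees mismatched sequence numbers and loops; once the writer finishes, the counters agree again. This is essentially the same idea as the seqlocks found in later kernels.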
Frlocks, clearly, will not be suitable for lengthy calculations, or those which have immediate side effects. In cases where a small data structure is changed infrequently but read often, however, frlocks may be the key to improved performance. In the introduction to the latest set of frlock patches, Stephen claims an 18% improvement in the speed of gettimeofday() - and the elimination of the timer tick lockout problem.
The return of modversions
Perhaps the biggest bit of unfinished work with the new kernel module loader is the module versioning support. Module versioning is an attempt to make binary loadable modules work with multiple kernel versions. It works by attaching a checksum to each exported kernel symbol; the checksum is calculated from the prototype or declaration of the symbol. As long as the checksums in a module match those in the running kernel, it is assumed that the module can be safely loaded. Kernel hackers tend not to use module versioning, which explains why it took so long to get this feature fixed. Production kernels shipped by distributors need this feature, however; otherwise it is essentially impossible for vendors to support binary-only modules.

Kai Germaschewski has posted a modversions implementation which works with recent 2.5 kernels. The underlying idea is essentially the same as that found in previous implementations, but the implementation is entirely different.
The old scheme used the genksyms program to generate the checksums, and to create a bunch of include files (ending in .ver) which redefined the exported kernel names to include those checksums. The effect was to create a bunch of preprocessor definitions like:
#define printk printk_R1b7d4074
(The actual definitions were a little more complicated). Loadable modules would thus be built to call printk_R1b7d4074() instead of printk(). The names were stored in that form in the kernel symbol table, so the insmod program simply needed to look for a direct match. If the interface had changed, the names would not match, and the module would refuse to load.
The new implementation does away with the include files. It does still use genksyms, but the output is reprocessed into a set of structure declarations. One structure (containing the symbol name and its checksum) is created for each symbol used by the module; the array of structures is then linked into the module in a special section. When a module is loaded into the kernel, the checksums are used to verify compatibility, and the special section can be discarded. Among other things, this approach makes it easier to force the loading of a module with mismatched symbols, should anybody be unwise enough to attempt such a thing.
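The load-time check itself can be sketched as follows. The structure layout, the function names, and the toy symbol table are all assumptions for illustration (the record in the actual patch differs in detail); the checksum used for printk here is simply the one visible in the mangled name above.

```c
#include <string.h>
#include <stddef.h>

/* One per-symbol version record, as linked into a module's special
 * section (hypothetical layout for illustration). */
struct modversion_info {
    unsigned long crc;       /* checksum of the symbol's interface */
    char name[64];           /* exported symbol name */
};

/* Toy "running kernel" symbol table. */
struct ksym { const char *name; unsigned long crc; };
static const struct ksym ktab[] = {
    { "printk", 0x1b7d4074UL },   /* crc from the mangled name above */
};

static unsigned long kernel_crc(const char *name)
{
    for (size_t i = 0; i < sizeof ktab / sizeof ktab[0]; i++)
        if (strcmp(ktab[i].name, name) == 0)
            return ktab[i].crc;
    return 0;
}

/* At load time, every checksum recorded in the module must match the
 * running kernel's checksum for that symbol. */
int check_versions(const struct modversion_info *mv, int n)
{
    for (int i = 0; i < n; i++)
        if (kernel_crc(mv[i].name) != mv[i].crc)
            return 0;   /* interface changed; refuse to load */
    return 1;
}
```

Since the comparison happens in one place at load time, rather than being baked into every mangled symbol name, overriding it for a forced load becomes a matter of skipping (or warning from) this check.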
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Janitorial
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet