Kernel development [LWN.net]

Kernel release status

The current development kernel is 2.6.0-test2, which was released by Linus on July 27. It contains a lot of fixes, of course, including a bunch of forward-ported 2.4 patches, numerous architecture updates, some IDE fixes, an option to remove I/O schedulers from the kernel entirely, and a new local_t type for CPU-local data. See the long-format changelog for the details.

As of this writing, there are no patches beyond -test2 in Linus's BitKeeper repository.

The current stable kernel is 2.4.21. Marcelo has been busy with the prepatches, however; 2.4.22-pre8 was released on July 24, and 2.4.22-pre9 on July 29. Both patches limit themselves to fixes.

Fixing interactive response in 2.6

While the performance improvements in the 2.6 kernel have impressed and pleased many users, there has been a constant level of complaining about the scheduler. In particular, many users are unhappy with the interactive feel of the 2.6 kernel; reports of jerky response and skipping audio are common. Some have gone as far as to compare the 2.6-test scheduler with the 2.4 virtual memory subsystem at the same point in the development cycle. There is some concern that scheduling could be one of the (few) embarrassments in the upcoming 2.6.0 release. Those worries are probably overdone, but the point is that 2.6.0-test scheduling still needs some work.

The good news is that said work is being done. Con Kolivas has been posting a set of interactivity patches for about a month now. Those wanting to try them out can find them in a recent -mm kernel; patches against the Linus kernels can be found on Con's web site. His most recent patch is O11int.

Con's work follows a familiar theme: improve interactive performance by giving a priority boost to interactive processes. All you have to do is figure out which processes are the interactive ones. Of course, that is rather easier said than done. Con's interactivity patches have been through several iterations in an attempt to find the best way to identify interactive processes and the proper amount of bonus to give them. This patch series may be converging on a result; several testers have reported good results.

The core idea (already part of the 2.6.0-test scheduler) is that an interactive process is one which sleeps much of the time. With Con's patches, any process which sleeps for at least one clock tick gets a priority bonus when it wakes up; processes which have done enough sleeping recently to be explicitly marked as "interactive" get a bigger bonus than others. Processes which run without sleeping for their entire time slice lose a point. In this way, CPU-hog processes sink to the lowest priorities, while a process reading from a terminal (or an audio stream) will get quick access to the processor. Additionally, processes which have maxed out their sleep bonus and are seen as truly interactive can hang out in the run queue for a while even after their time slice expires.
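The bonus rules described above can be illustrated with a small model. This is only a sketch, not code from Con's patches: the `Task` class, the bonus cap, the "interactive" threshold, and the step sizes are all assumptions picked to make the behavior visible.

```python
from dataclasses import dataclass

MAX_BONUS = 10       # assumed cap on the sleep bonus
INTERACTIVE_AT = 8   # assumed threshold for the "interactive" mark


@dataclass
class Task:
    name: str
    bonus: int = 0   # a higher bonus means a better effective priority

    @property
    def interactive(self) -> bool:
        return self.bonus >= INTERACTIVE_AT


def on_wakeup(task: Task, slept_ticks: int) -> None:
    """Reward a task that slept for at least one clock tick."""
    if slept_ticks < 1:
        return
    # Tasks already marked interactive get a bigger boost (assumed: 2 vs. 1).
    step = 2 if task.interactive else 1
    task.bonus = min(MAX_BONUS, task.bonus + step)


def on_slice_expired(task: Task) -> None:
    """A task that ran through its whole time slice loses one point."""
    task.bonus = max(0, task.bonus - 1)


editor = Task("editor")          # sleeps waiting for keystrokes
hog = Task("cpu_hog", bonus=5)   # never sleeps voluntarily
for _ in range(10):
    on_wakeup(editor, slept_ticks=3)
    on_slice_expired(hog)
```

After ten rounds the sleeper has climbed to the cap and carries the interactive mark, while the CPU hog has sunk to zero, which is the separation the patches are trying to achieve.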

Life is not always so simple, however. Early versions of this patch tended to make life hard for newborn processes; it took them quite a while to build up enough of an interactivity bonus to be able to respond quickly on a loaded system. So things had to be tweaked to let new processes find their natural level quickly. There is also the issue of processes that sleep for a long time, then wake up to do some serious cranking. So processes that sleep for longer than one second lose their interactivity bonus, and end up at an "idle" level just below the interactive level. Work has also been required to balance priorities properly when an interactive process forks.

More recently, Ingo Molnar has also started looking at the interactivity problem; his sched-2.6.0-test1-G6 patch takes a different approach. Ingo starts by changing the scheduler to use nanosecond resolution in its timekeeping; his claim is that, by working with high-resolution time, audio skipping problems can be fixed. Then, the patch splits up time slices so that processes running at the same priority switch off with each other much more often, ensuring that none of them have to wait too long before getting some processor time. Finally, Ingo's patch extends the sleep bonus to include time that the process sits in the run queue, but does not actually get into the processor.
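The effect of splitting time slices can be seen in a toy round-robin simulation. This sketch is not taken from Ingo's patch; the function name, the millisecond units, and the chunk sizes are assumptions chosen for illustration.

```python
from collections import deque


def worst_first_wait(n_tasks: int, slice_ms: int, chunk_ms: int) -> int:
    """Return the longest time any task waits before it first runs.

    Every task has the same priority and the same total time slice; the
    CPU is handed over after at most chunk_ms of running.
    """
    queue = deque((task_id, slice_ms) for task_id in range(n_tasks))
    first_run = {}
    now = 0
    while queue:
        task_id, remaining = queue.popleft()
        first_run.setdefault(task_id, now)   # record the first time it ran
        ran = min(chunk_ms, remaining)
        now += ran
        if remaining > ran:                  # slice not used up: requeue
            queue.append((task_id, remaining - ran))
    return max(first_run.values())


whole = worst_first_wait(n_tasks=3, slice_ms=100, chunk_ms=100)
split = worst_first_wait(n_tasks=3, slice_ms=100, chunk_ms=25)
```

With three equal-priority tasks and 100ms slices, the last task waits 200ms when slices run to completion, but only 50ms when they are split into 25ms chunks. Each task still receives the same total processor time.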

The two sets of patches are mostly orthogonal to each other; while it remains hard to apply both Con's and Ingo's patches to a single system, the two really are addressing different issues. Recent versions of Con's patches, however, also include some of Ingo's work (almost everything except the nanosecond resolution). In the end, code from both patches is likely to find its way into the kernel.

As a postscript, it's worth taking a look at this post from Daniel Phillips, where he states that the wrong approach is being taken for the audio skipping problem. Audio playback, says Daniel, is not an interactive task - it is a realtime task. What is really needed for the audio case is a bounded-latency soft realtime scheduler, not an endless series of interactive scheduler tweaks.

A different approach to module races

The topic of module unload races - where the kernel can end up calling into a module which has been removed - comes back occasionally. Much work has been done in 2.5 to reduce and eliminate these races. Part of that effort was moving module reference counting outside of the modules themselves. The result was a safer scheme, but one which imposes new requirements on kernel code which calls into modules. In some kernel subsystems (networking, for example), the maintainers have decided that there is no need to worry about reference counting for modules; they simply ignore it.

Enter Rusty Russell. Since the reference counts are seen to be a pain, and some code isn't using them at all, why not simply get rid of them? He has submitted a patch which does exactly that.

Of course, the issue of how to safely remove modules remains. Without reference counts, how does the kernel know when it can actually get rid of a particular module? With Rusty's patch, a different approach is taken: modules are never actually removed. If an administrator invokes rmmod, the module's cleanup function will be called and all kernel knowledge of the module will go away - but the module code itself will remain in the kernel. The patch thus sacrifices some system memory on every unload as a way of avoiding unload races.
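The memory cost of this approach can be sketched with a toy model. Nothing here comes from Rusty's patch: the `ModuleTable` class, the driver names, and the 40,000-byte module size are hypothetical.

```python
class ModuleTable:
    """Toy model of unload-without-free; sizes are in bytes."""

    def __init__(self) -> None:
        self.visible = {}        # name -> (size, cleanup): what the kernel knows
        self.retained_bytes = 0  # module code left resident after unloads

    def load(self, name, size, cleanup) -> None:
        self.visible[name] = (size, cleanup)

    def unload(self, name) -> None:
        size, cleanup = self.visible[name]
        cleanup()                    # the module's cleanup function runs
        del self.visible[name]       # the kernel forgets the module...
        self.retained_bytes += size  # ...but its code is never freed


table = ModuleTable()
cleaned = []
# An installer probing hardware: load each driver, then unload it again.
for name in ("drv_a", "drv_b", "drv_c"):
    table.load(name, 40_000, lambda n=name: cleaned.append(n))
    table.unload(name)
```

After three probe cycles the table of known modules is empty again, yet 120,000 bytes remain resident. Because stale code is never freed, a late call into it cannot land on reused memory, and the price is a leak that grows with every unload.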

Some developers liked this patch, others didn't. For a kernel hacker who is debugging a module, a little lost memory for each load/unload cycle is probably not a big problem; the system will likely be rebooted soon anyway. The patch does present a bigger problem for Linux installers, however; many of these do hardware detection by loading almost every module available and seeing which ones actually find something. On a "small" system (that is, say, 64MB), it is possible that some distribution installers would simply run out of memory and die.

Rusty proposed adding a special rmmod option which would clean up memory left behind by deleted modules (while also marking the kernel tainted). For now, however, all of this has been made irrelevant by Linus, who decreed: "First off - we're not changing fundamental module stuff any more." This statement drew an amused response from Rusty ("OK. Who are you and what have you done with the real Linus?"), but the general sigh of relief from most kernel hackers could be heard worldwide. It seems that Linus is truly holding the line and keeping out potentially disruptive changes this time around.

Reiser4 is coming

The final part of the 2.3 development series featured a strong campaign to get the ReiserFS filesystem merged. That campaign was successful; ReiserFS was added in 2.4.1. Now it appears that history may repeat itself with the 2.6 kernel. Hans Reiser has posted a note asking that the soon-to-be-posted Reiser4 patch be merged into 2.6.0-test.

Reiser4 is not an updated version of ReiserFS; it is an entirely new filesystem. According to the posted benchmarks, Reiser4 outperforms ReiserFS and ext3 on several fronts. According to Hans, the performance of Reiser4 is now good enough to justify including it in 2.6-test.

The truly interesting part of Reiser4 is not limited to performance, however. Reiser4 is presented as a fully atomic filesystem - every operation either executes fully or not at all. It thus offers the same sort of crash resistance found in journaling filesystems, but with a couple of differences. The first, it is claimed, is that the "wandering log" technique used in Reiser4 offers greater speed since, unlike other journaling schemes, it does not need to write data twice. The second is that the "fully atomic" nature of the filesystem can extend beyond individual operations. Reiser4, in other words, can provide actual transactions.

A typical journaling filesystem works by writing all of the blocks to be changed in a given operation to a special journal file, followed by a "commit record." Once the operation is committed, the blocks can be copied from the journal to their real destination on the disk. If the system dies before the commit record is written, the operation is simply discarded and the filesystem is unchanged. If, instead, a fully committed operation is found in the journal, it can be replayed. With a scheme like this, an operation may be lost in a crash, but the filesystem itself will not be corrupted.
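The replay rule described above fits in a few lines. This is a minimal sketch and does not model any particular filesystem: the record layout (`"write"` and `"commit"` tuples) and the dictionary standing in for the disk are assumptions.

```python
def replay(journal, disk):
    """Apply only the operations whose commit record reached the journal."""
    pending = {}  # operation id -> list of (block, data) not yet committed
    for record in journal:
        kind, op_id = record[0], record[1]
        if kind == "write":
            _, _, block, data = record
            pending.setdefault(op_id, []).append((block, data))
        elif kind == "commit":
            # Committed: copy the journaled blocks to their real location.
            for block, data in pending.pop(op_id, []):
                disk[block] = data
    # Anything still in `pending` never committed and is simply discarded.
    return disk


disk = {"A": "old-A", "B": "old-B"}
journal = [
    ("write", 1, "A", "new-A"),
    ("write", 1, "B", "new-B"),
    ("commit", 1),
    ("write", 2, "A", "torn"),   # crash before the commit record for op 2
]
replay(journal, disk)
```

Operation 1 is replayed in full and operation 2 leaves no trace, so the disk ends up in a consistent state. Every committed block is written twice, once to the journal and once to its final location.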

The Reiser4 wandering log technique works a little differently. It does not overwrite blocks in the filesystem; instead, blocks to be changed are relocated and the data is written in the new spot. The block pointers in the filesystem are changed in an (also relocated) directory block. This process continues up the filesystem tree until, with a single write pointing to the new root block, the whole operation is committed. The elimination of the need to write data separately to a journal file can increase performance, but this technique also has the potential to fragment files across the disk, hurting read performance. For that reason, Reiser4 allows for plugin modules which can look at operations and opt for a more normal journaling scheme when it makes sense. There will also be a "repacker" program which will go through occasionally and rearrange disks for better read performance.
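The relocate-and-repoint idea can be modeled with a toy block store. This sketch is not Reiser4's on-disk format or algorithm; the `Store` class, the dictionary-based tree nodes, and the path names are assumptions used to show the single-write commit.

```python
class Store:
    def __init__(self) -> None:
        self.blocks = {}    # address -> content; existing blocks are never overwritten
        self.next_addr = 0
        self.root = None    # the one pointer whose update commits a change

    def alloc(self, content) -> int:
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = content
        return addr


def update(store, path, data) -> int:
    """Write data at path, relocating every block along the way.

    Returns the address of the new root block; nothing is committed yet.
    """
    def rewrite(addr, keys):
        if not keys:
            return store.alloc(data)                  # relocated data block
        node = dict(store.blocks[addr]) if addr is not None else {}
        node[keys[0]] = rewrite(node.get(keys[0]), keys[1:])
        return store.alloc(node)                      # relocated pointer block
    return rewrite(store.root, list(path))


def read(store, root, path):
    addr = root
    for key in path:
        addr = store.blocks[addr][key]
    return store.blocks[addr]


store = Store()
path = ["home", "notes"]
store.root = update(store, path, "v1")
new_root = update(store, path, "v2")        # new blocks written, old ones intact
before_commit = read(store, store.root, path)
store.root = new_root                       # the single committing write
after_commit = read(store, store.root, path)
```

Until the root pointer is switched, readers still see "v1"; a crash at that point loses the update but leaves the old tree intact. The data is written only once, though each update leaves its blocks at new addresses, which is where the fragmentation concern comes from.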

The ability to perform multi-operation, multi-file transactions is what will make Reiser4 truly unique, however. A transactional capability will allow applications to perform complicated operations without the need to resort to tricks with fsync() and file renaming, and without the need to use a separate database manager. Of course, there are a few residual issues, like the fact that the standard Unix system calls make no provision for starting, committing, and rolling back transactions. So a new system call interface will be required. The Reiser4 developers are working on this interface, but have not yet posted it for wide review.
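For reference, the fsync()-and-rename trick that applications rely on today looks roughly like the following. This is a sketch under assumptions: the helper name `atomic_write` is made up, error handling is minimal, and the final fsync() of the containing directory that careful applications also perform is omitted.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at path so that readers see either old or new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())     # force the new contents to disk first
        os.replace(tmp_path, path)     # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config")
atomic_write(target, b"one")
atomic_write(target, b"two")
```

The trick protects one file at a time. An update that must change several files together has no equivalent, which is the gap a transactional interface would fill.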

Linus has not committed himself with regard to merging Reiser4 into 2.6. It's worth noting that, when ReiserFS was merged, it had been stable and widely used for some time. That is not the case for Reiser4, which is still in an early stage. Chances are that Reiser4 will have a harder time getting into the kernel than ReiserFS did. (For more information on Reiser4, see this document on transactions, and this one on wandering logs, dancing trees, and other journaling topics.)

Patches and updates

Linus Torvalds: Linux v2.6.0-test2

Andrew Morton: 2.6.0-test2-mm1

Stephen Hemminger: 2.6.0-test2-osdl1

Martin J. Bligh: 2.6.0-test2-mjb1

Marcelo Tosatti: Linux 2.4.22-pre9

Marcelo Tosatti: Linux 2.4.22-pre8

J.A. Magallon: Linux 2.4.22-pre8-jam1m

Rusty Russell: Remove module reference counting.

Con Kolivas: O9int for interactivity

Con Kolivas: O10int for interactivity

Con Kolivas: O11int for interactivity

Ingo Molnar: sched-2.6.0-test1-G6, interactivity changes

Erich Focht: [patch 2.6.0-test1] node affine NUMA scheduler extension

Erich Focht: [patch] scheduler fix for 1cpu/node case

Fabian Frederick: shm kobject model against 2.6t1

Pavel Machek: swsusp updates

Nigel Cunningham: Annouce: swsusp 1.1-pre1

Jaroslav Kysela: ALSA update 0.9.6

Benjamin Herrenschmidt: Framebuffer: client notification mecanism & PM

Philip Graham Willoughby: PATCH : LEDs - possibly the most pointless kernel subsystem ever

Krzysztof Halasa: 2.6.0-test2 wanXL driver

Bagalkote, Sreenivas: megaraid 2.00.6 driver

Denis Vlasenko: lk maintainers

Bernardo Innocenti: Make I/O schedulers optional (Was: Re: Kernel 2.6 size increase)

Lever, Charles: Announcing Release Two of NFSv4 on Linux 2.4

Christoph Hellwig: remove the release timer from all pcmcia net drivers

Thomas Graf: Extended Generic Packet Classifier

Olaf Dietsche: 2.6.0-test2: access permission filesystem 0.16

Martin J. Bligh: 2.6.0-test2-mm1 results

Greg KH: udev 0.2 release
