Kernel development

Brief items

Current release status

The current development kernel is 2.5.23, which was announced by Linus on June 18. Says Linus:

I asked 'what more can you ask for' for 2.5.22, and somebody immediately piped up with raid5 working again. Well, here you have a big MD merge from Neil Brown, which may or may not get you there. Good luck.

Other stuff in this release includes an x86-64 merge, a number of VM/filesystem patches from Andrew Morton, some asynchronous I/O precursor patches from Ben LaHaise (see below), more kbuild tweaks, another set of IDE fixes, and numerous other changes. The long-format changelog is available for people wanting all the details.

Linus released 2.5.22 on June 16; this release included a big x86-64 merge, some important bug fixes, an IrDA update, another set of kbuild tweaks, more IDE work, and a bunch of other changes. Once again, the long-format changelog is also available.

The current prepatch from Dave Jones is 2.5.23-dj2. The patch has been pruned somewhat; various obsolete bits have been thrown out. It also features a visit by the "mad axemen," who have been carving up large, monolithic files (such as the MTRR code). A new, optimized select/poll implementation by Andi Kleen went in, along with a number of compile fixes.

Guillaume Boissiere's latest 2.5 status summary came out on June 19. It takes a quick look at what has been accomplished since the last kernel summit, and what remains to be discussed at the next one.

The current stable kernel is 2.4.18. There have been no 2.4.19 prepatches released since June 4. Rumor has it that Marcelo is too busy following the Brazilian team's fortunes in the World Cup, but that could not be confirmed.

Kernel development news

The beginning of the asynchronous I/O merge?

Ben LaHaise's asynchronous I/O patch has been waiting for inclusion for many months. Asynchronous I/O happens, of course, without blocking the calling process; it also goes directly to or from the user process buffer whenever possible. The feature is used by certain demanding applications, such as relational database systems. Ben's patch is working, and has been shipped in Red Hat's Advanced Server product. But it is not yet part of the mainline kernel.

There are a couple of apparent reasons for this patch's long wait for inclusion. One is that Linus is unconvinced about the value of asynchronous I/O; he thinks there are better ways to solve the problem (see the May 16, 2002 LWN Kernel Page). The other reason is that this patch reaches deeply into the kernel and changes some fundamental interfaces - for example, it changes the read and write functions provided by device drivers. Big changes make Linus (and others) nervous; it is considered preferable to break things multiple times in small pieces.

So now some of the structure needed for asynchronous I/O is being submitted in the requisite small chunks. The first patch simply splits the fput() function into two pieces to simplify its invocation (indirectly) from an interrupt handler.
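The shape of such a split can be sketched in user space. This is a toy model only, assuming the split separates "drop a reference" from "tear the file down"; the names toy_file, toy_put_ref(), toy_final_put() and toy_fput() are hypothetical, and the plain integer count stands in for what would be an atomic counter in real kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an open file object. */
struct toy_file {
    int refcount;   /* a real kernel would use an atomic counter */
    int released;   /* set once teardown has run */
};

/* Piece one: drop a reference. Returns 1 if it was the last one. */
int toy_put_ref(struct toy_file *f)
{
    assert(f != NULL && f->refcount > 0);
    return --f->refcount == 0;
}

/* Piece two: the teardown work, callable separately by code that
 * has already dropped the last reference (for example, from a
 * deferred context rather than directly from an interrupt handler). */
void toy_final_put(struct toy_file *f)
{
    assert(f != NULL && f->refcount == 0);
    f->released = 1;
}

/* The ordinary path simply chains the two pieces together. */
void toy_fput(struct toy_file *f)
{
    if (toy_put_ref(f))
        toy_final_put(f);
}
```

With the pieces separated, a caller that cannot safely do the teardown itself can drop the reference and leave the second step to be run later.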

The second patch is, perhaps, more interesting. Here the wait queue mechanism is being changed in fundamental ways. The first version of this patch simply added a callback function which would be invoked when a wakeup happens on the queue. This callback is needed for the asynchronous I/O subsystem; it needs to know when an I/O operation completes, but it cannot block on the wait queue. Following suggestions from Linus, later revisions of the patch have moved some of the wakeup functionality to that callback function. There can even be different callbacks for "exclusive waits" (where only one process should be awakened even if many are waiting) and the standard "wake everybody" variety. By providing different callbacks, kernel subsystems can change the semantics of the wait operation.

Wait queues, in other words, are evolving from a mechanism that puts a process to sleep for a while into a more general event notification mechanism. The immediate application for this mechanism is asynchronous I/O, but it will be interesting to see what others turn up.
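The callback idea can be modelled in a few lines of user-space C. This is a minimal sketch, not the interface from Ben's patch: the names wait_entry, wait_queue, wq_add(), wq_wake() and note_completion() are all hypothetical. Each queue entry carries its own function, and the wakeup loop calls that function instead of waking a sleeping process; entries marked exclusive count against a budget, so the walk stops once enough of them have been notified.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model; none of these names are the kernel's. */
struct wait_entry;
typedef int (*wake_fn)(struct wait_entry *entry);

struct wait_entry {
    wake_fn func;             /* called on wakeup; returns nonzero if it "woke" */
    int exclusive;            /* nonzero: counts against the exclusive budget */
    void *data;               /* callback-private state */
    struct wait_entry *next;
};

struct wait_queue {
    struct wait_entry *head;
};

/* Append an entry at the tail so wakeups happen in arrival order. */
void wq_add(struct wait_queue *q, struct wait_entry *e)
{
    struct wait_entry **pp;

    assert(q != NULL && e != NULL);
    for (pp = &q->head; *pp != NULL; pp = &(*pp)->next)
        ;
    e->next = NULL;
    *pp = e;
}

/* Call every entry's callback in order, stopping once nr_exclusive
 * exclusive entries have reported success. Returns the number of
 * callbacks invoked. */
int wq_wake(struct wait_queue *q, int nr_exclusive)
{
    int called = 0;

    for (struct wait_entry *e = q->head; e != NULL; e = e->next) {
        int woke = e->func(e);
        called++;
        if (woke && e->exclusive && --nr_exclusive == 0)
            break;
    }
    return called;
}

/* Example callback: record a completion without ever sleeping. */
int note_completion(struct wait_entry *e)
{
    int *counter = (int *)e->data;
    (*counter)++;
    return 1;
}
```

A subsystem that wants different semantics supplies a different callback; nothing in the queue-walking code has to change.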

Moving things on and off the kernel stack

The Linux kernel stack is a limited resource; it must fit into two pages of memory, which it shares with some process information. Overflowing the kernel stack can be a catastrophic event, and it can happen at surprising times, such as in interrupt handlers. After a recent Stanford Checker posting pointing out numerous places where large structures have been allocated on the stack, and with proposals to consider reducing the size of the stack, interest in minimizing kernel stack usage has grown.

One bit of code that caught Andries Brouwer's eye was the resolution of symbolic links. In the process of symlink resolution, the kernel can encounter new links which must also be resolved; this is handled by a recursive call into the resolution code. Each call, of course, requires kernel stack space, so recursive calls must be looked at with care - unless the recursion is carefully bounded, it can easily overflow the kernel stack. The symlink code handles this constraint by limiting the symlink depth to five.

Andries has posted a new symlink implementation that eliminates the recursion. Instead, it maintains its own stack - allocated separately - which contains the current state of symlink resolution. In this way, the five-level limit can be lifted without fear of overrunning the kernel stack. Of course, it is extremely rare that anybody actually hits the five-level limit; there are special cases, however, where users do interesting things with symbolic links.
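The general technique of replacing recursion with a separately allocated stack can be illustrated with a toy model. This is not Andries's code: the link table, the names toy_readlink() and toy_resolve(), and the limits MAX_DEPTH and MAX_LINKS are all hypothetical, and "resolution" here just means expanding every link in a slash-separated path into ordinary components. Each stack slot remembers how much of a path is still to be walked, which is the state a recursive call would otherwise keep on the kernel stack.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 32   /* hypothetical limit on nested links */
#define MAX_LINKS 64   /* hypothetical limit on total expansions */

struct toy_link { const char *name; const char *target; };

/* Hypothetical namespace: "a" points to "b/c", "b" to "d". */
static const struct toy_link toy_links[] = {
    { "a", "b/c" },
    { "b", "d" },
    { "loop", "loop" },
};

/* Return the link target for a component, or NULL if it is not a link. */
const char *toy_readlink(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof(toy_links) / sizeof(toy_links[0]); i++)
        if (strlen(toy_links[i].name) == len &&
            memcmp(toy_links[i].name, name, len) == 0)
            return toy_links[i].target;
    return NULL;
}

/* Expand all links in path into out. Leading slashes are ignored in
 * this model. Returns 0 on success, -1 on failure (nesting too deep,
 * too many links, output too small, or out of memory). */
int toy_resolve(const char *path, char *out, size_t outsz)
{
    const char **stack;   /* heap-allocated: one "rest of path" per level */
    int depth = 0, links = 0, ret = -1;
    size_t used = 0;

    assert(path != NULL && out != NULL && outsz > 0);
    stack = malloc(MAX_DEPTH * sizeof(*stack));
    if (stack == NULL)
        return -1;
    out[0] = '\0';
    stack[depth++] = path;

    while (depth > 0) {
        const char *p = stack[depth - 1];
        while (*p == '/')
            p++;
        if (*p == '\0') {          /* this level is finished: pop it */
            depth--;
            continue;
        }
        size_t len = strcspn(p, "/");
        stack[depth - 1] = p + len;   /* remember where to continue */
        const char *target = toy_readlink(p, len);
        if (target != NULL) {         /* a link: push instead of recursing */
            if (depth == MAX_DEPTH || ++links > MAX_LINKS)
                goto out;
            stack[depth++] = target;
            continue;
        }
        if (used + len + 2 > outsz)   /* room for '/', name and NUL */
            goto out;
        if (used > 0)
            out[used++] = '/';
        memcpy(out + used, p, len);
        used += len;
        out[used] = '\0';
    }
    ret = 0;
out:
    free(stack);
    return ret;
}
```

Because the depth limit is now just the size of a heap array, it can be raised without touching the amount of call-stack space used.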

Not all developments are oriented toward reducing kernel stack usage, however. Andi Kleen has posted a patch which does the opposite in order to make the select and poll system calls perform better. These calls (which share most of an internal implementation) allocate a couple of pages of kernel memory to hold the requisite data structures; they are sized to be able to handle situations where large numbers of file descriptors are being waited on. In reality, however, many (if not most) select and poll calls are given only a small number of file descriptors, so much of that memory is wasted.

Andi's patch works by setting up a separate fast path for when only a small number of file descriptors are in use. Rather than allocate those two pages, the fast path uses a small, in-stack array. The stack space usage is limited to 256 bytes, which will fit easily even on a reduced-size stack. The new implementation not only saves a couple of kernel pages for each process calling select (and there can be many on a typical Linux system), it's faster as well. The patch has been included in 2.5.23-dj2, and will likely find its way into the mainline before too long.
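The fast-path pattern itself is easy to show outside the kernel. The sketch below is a user-space illustration, not Andi's patch: toy_pollfd, toy_fd_ready() and toy_poll() are hypothetical names, and the readiness test is a placeholder. Only the 256-byte figure comes from the description above. Small requests use a fixed on-stack array; anything larger falls back to a heap allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FAST_STACK_BYTES 256   /* on-stack budget, as quoted in the text */

/* Hypothetical per-descriptor record. */
struct toy_pollfd {
    int fd;
    short events;
    short revents;
};

#define FAST_ENTRIES (FAST_STACK_BYTES / sizeof(struct toy_pollfd))

/* Placeholder readiness test: pretend even-numbered descriptors are ready. */
int toy_fd_ready(int fd)
{
    return fd % 2 == 0;
}

/* Fill in revents for n entries. *used_stack reports which path ran.
 * Returns the number of ready entries, or -1 if allocation fails. */
int toy_poll(struct toy_pollfd *ufds, size_t n, int *used_stack)
{
    struct toy_pollfd small[FAST_ENTRIES];   /* fast path: no allocation */
    struct toy_pollfd *work = small;
    int ready = 0;

    assert(ufds != NULL && used_stack != NULL);
    *used_stack = (n <= FAST_ENTRIES);
    if (!*used_stack) {                      /* slow path: heap buffer */
        work = malloc(n * sizeof(*work));
        if (work == NULL)
            return -1;
    }

    memcpy(work, ufds, n * sizeof(*work));
    for (size_t i = 0; i < n; i++) {
        work[i].revents = toy_fd_ready(work[i].fd) ? work[i].events : 0;
        if (work[i].revents)
            ready++;
    }
    memcpy(ufds, work, n * sizeof(*work));

    if (work != small)
        free(work);
    return ready;
}
```

The common case, a handful of descriptors, never reaches the allocator at all, which is where both the memory saving and the speedup would come from.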

Reverse mapping VM comes to 2.5

Rik van Riel's reverse-mapping virtual memory implementation (RMAP) has been under development for several months; it has attracted some attention as a possible way of improving Linux VM performance in the future. Thus far, however, RMAP has only been available for the 2.4 series, so it has been hard to evaluate as a possible addition to 2.5.

That situation has just changed, however: Craig Kulesa decided to port RMAP to the 2.5.23 kernel. He posted it in two forms: a full port which makes many changes, and a minimal version which adds only the reverse mapping code itself. Craig's preliminary benchmark results show a respectable performance improvement in 2.5.23 when the RMAP code is added in.

A much more serious benchmarking effort will have to be done before any real conclusions about RMAP in 2.5 can be drawn. This port, however, has attracted a fair amount of interest. If more detailed numbers can be obtained soon, RMAP in 2.5 should be an active area of discussion at next week's kernel summit.

2.5 and IDE development

It has been a few weeks since a "concerns about the IDE reimplementation process" article appeared here, so it must be about time. The conversation this time around started with a complaint that recent kernels can deadlock when reading partition tables; it included "a small plea for more testing" before IDE patches are unleashed upon the world. Dave Jones followed up with a remark of his own:

When the IDE carnage first started back circa 2.5.3, I had contemplated not merging *any* of the IDE patches, just so that people who want to work on other areas could have something solid to build upon. I regret not following through on that instinct.

Linus, however, remains unworried:

We're not supposed to be writing code and then releasing it when it is done. We _want_ incremental changes, and open breakage.

So the IDE process is likely to continue as it has. Be careful out there.

In a separate conversation, a user requested the restoration of the IDE taskfile operations. Those operations had been removed relatively early in Martin Dalecki's series of patches. He has not promised to restore them, but previous IDE maintainer Andre Hedrick jumped in with an interesting comment:

In the end, I will end up writing a closed ATA binary driver for sale as a replacement. I have had several requests to consider the option. As much as I do not like the idea, it is less offensive than the current direction.

It would be a shame if Linux users were driven to use a binary-only driver for such a fundamental subsystem due to lack of support for needed operations. The next stable kernel is still far away, however; plenty of time remains for these issues to be dealt with.

Jens Axboe has, meanwhile, released a version of his "tagged command queueing for IDE" patch, backported to the 2.4.19-pre kernel.

Patches and updates

Kernel trees

Chris Wright 2.4.19-pre10-lsm1 2.4.19-pre10 kernel with the Linux Security Module patch applied. (LSM patch also available separately).
Chris Wright 2.5.23-lsm1

Core kernel code

Benjamin LaHaise 2.5.22 add __fput for aio A precursor patch providing a facility needed by the asynchronous I/O patch.
Benjamin LaHaise v2.5.22 - add wait queue function callback support Another piece of asynchronous I/O support.
Andi Kleen poll/select fast path Optimizes the select/poll system calls when the number of file descriptors is small.
Andi Kleen poll/select fast path A new implementation fixing some problems with the first version.
Ingo Molnar migration thread & hotplug fixes, 2.5.23 Make the migration code deal with nonlinear CPUs.

Device drivers

Adam J. Richter Patch (2.5.21): bio size fixes for ll_rw_kio and mpage.c Don't let the BIO layer make requests that are too big for the underlying device.
Martin Dalecki 2.5.21 IDE 88
Martin Dalecki 2.5.21 IDE 89
Martin Dalecki 2.5.21 IDE 90
Martin Dalecki 2.5.21 IDE 91
Martin Dalecki 2.5.21 ide 92
Roland Dreier 2.4 add __dma_buffer alignment macro A macro for addressing the "DMA to small buffers on cache incoherent systems" problem discussed in the June 13 LWN Kernel Page.
Roland Dreier use __dma_buffer for USB
Martin Schwidefsky 2.5.22: new xpram driver 2nd try.
Marc Boucher New hsflinmodem-5.03.03.L3mbsibeta02061700 release Conexant HSF "linmodem" driver.
Marc Boucher New hcflinmodem-0.95mbsibeta02061700 release Conexant HCF "linmodem" driver.
Kurt Garloff /proc/scsi/map Creates a /proc file listing SCSI devices with controller, target, and unit numbers.

Filesystems and block I/O

Andrew Morton go back to 256 requests per queue Raising the request queue size in 2.5.20 dropped dbench performance by 40%...
Jens Axboe block-highmem-all-19 Block I/O out of high memory without bounce buffers (this patch intended for a future 2.4.20 prepatch).
Martin Schwidefsky 2.5.22: ibm partition support.
Andries.Brouwer@cwi.nl symlink recursion An implementation of symbolic link resolution which is not recursive (and, thus, takes less kernel stack space).
Jens Axboe ide+block tag support, 2.4.19-pre10 Latest 2.4.19-pre backport of tagged command queueing for IDE.
Anton Altaparmakov NTFS 2.0.9 update

Janitorial

Matthew Wilcox Remove SCSI_BH Make the SCSI system use a tasklet instead.
Andrew Morton ext3 corruption fix
Andrew Morton take bio.h out of highmem.h
Andrew Morton rename get_hash_table() to find_get_block() "get_hash_table() is too generic a name. Plus it doesn't even use a hash any more."

Memory management

Rik van Riel rmap VM 13b
Craig Kulesa (1/2) reverse mapping VM for 2.5.23 (rmap-13b) A thorough rmap port to 2.5.23 which changes a lot of things.
Craig Kulesa (2/2) reverse mappings for current 2.5.23 VM A "minimal" rmap port to 2.5.23.
Andrew Morton writeback tunables Adds five sysctl entries for tuning writeback behavior.
Andrew Morton Reduce the radix tree nodes to 64 slots Save some memory and avoid order-1 allocations.
Andrew Morton direct-to-BIO I/O for swapcache pages Major changes to the swap code.

Networking

Miscellaneous

Denis Vlasenko linld 0.95 A Linux boot loader.
Rusty Russell Initcall depends Updated version of the initialization order patch (covered in the June 13 Kernel Page).

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds