Brief items
The current development kernel is 2.5.23, which was
announced by Linus on June 18. Says Linus:
"I asked 'what more can you ask for' for 2.5.22, and somebody
immediately piped up with raid5 working again. Well, here you have
a big MD merge from Neil Brown, which may or may not get you
there. Good luck."
Other stuff in this release includes an x86-64 merge, a number of
VM/filesystem patches from Andrew Morton, some asynchronous I/O precursor
patches from Ben LaHaise (see below), more kbuild tweaks, another set of
IDE fixes, and
numerous other changes. The long-format
changelog is available for people wanting all the details.
Linus released 2.5.22 on June 16; this
release included a big x86-64 merge, some important bug fixes, an IrDA
update, another set of kbuild tweaks, more IDE work, and a bunch of other
changes. Once again, the long-format changelog is also available.
The current prepatch from Dave Jones is 2.5.23-dj2. The patch has been pruned somewhat;
various obsolete bits have been thrown out. It also features a visit by
the "mad axemen," who have been carving up large, monolithic files (such as
the MTRR code). A new, optimized select/poll implementation by Andi Kleen
went in, along with a number of compile fixes.
Guillaume Boissiere's latest 2.5 status
summary came out on June 19. It takes a quick look at what has
been accomplished since the last kernel summit, and what remains to be
discussed at the next one.
The current stable kernel is 2.4.18. There have been no 2.4.19
prepatches released since June 4. Rumor has it that Marcelo is too
busy following the Brazilian team's fortunes in the World Cup, but that
could not be confirmed.
Kernel development news
Ben LaHaise's asynchronous I/O patch has been waiting for inclusion for
many months. Asynchronous I/O happens, of course, without blocking the
calling process; it also goes directly to or from the user process buffer
whenever possible. The feature is used by certain demanding applications,
such as relational database systems. Ben's patch is working, and has been
shipped in Red Hat's Advanced Server product. But it is not yet part of
the mainline kernel.
There are a couple of apparent reasons for this patch's long wait for
inclusion. One is that Linus is unconvinced about the value of
asynchronous I/O; he thinks there are better ways to solve the problem (see
the May 16, 2002
LWN Kernel Page). The other reason is that this patch reaches deeply
into the kernel and changes some fundamental interfaces - for example, it
changes the read and write functions provided by device
drivers. Big changes make Linus (and others) nervous; it is considered
preferable to break things multiple times in small pieces.
So now some of the structure needed for asynchronous I/O is being submitted
in the requisite small chunks. The first
patch simply splits the fput() function into two pieces to
simplify its invocation (indirectly) from an interrupt handler.
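One way to picture why such a split helps is the userspace model below. It is only a sketch under stated assumptions: the names (file_model, put_file(), release_file(), put_file_deferred(), run_deferred()) are hypothetical and do not come from the patch, and the deferral list is a guess at how a caller that cannot do the teardown itself might use the separated second half.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct file_model {
    int refcount;                    /* outstanding references */
    int released;                    /* set once teardown has run */
    struct file_model *defer_next;   /* link in the deferred-release list */
};

/* Files whose last reference was dropped where teardown was not allowed. */
struct file_model *deferred = NULL;

/* Second half: the actual teardown (stands in for the split-out piece). */
void release_file(struct file_model *f)
{
    f->released = 1;
}

/* First half for ordinary callers: drop a reference, tear down at zero. */
void put_file(struct file_model *f)
{
    if (--f->refcount == 0)
        release_file(f);
}

/* Assumed variant for restricted contexts: drop the reference now,
 * but queue the teardown instead of performing it here. */
void put_file_deferred(struct file_model *f)
{
    if (--f->refcount == 0) {
        f->defer_next = deferred;
        deferred = f;
    }
}

/* Run the queued teardowns later, from a context where that is safe.
 * Returns the number of files released. */
int run_deferred(void)
{
    int n = 0;

    while (deferred) {
        struct file_model *f = deferred;

        deferred = f->defer_next;
        release_file(f);
        n++;
    }
    return n;
}
```

With the teardown in its own function, the reference drop and the release no longer have to happen in the same place, which is the property the text attributes to the split.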
The second patch is, perhaps, more
interesting. Here the wait queue mechanism is being changed in fundamental
ways. The first version of this patch simply added a callback function
which would be invoked when a wakeup happens on the queue. This callback is
needed for the asynchronous I/O subsystem; it needs to know when an I/O
operation completes, but it cannot block on the wait queue. Following
suggestions from Linus, later revisions of the patch have moved some of the
wakeup functionality to that callback function. There can even be
different callbacks for "exclusive waits" (where only one process should be
awakened even if many are waiting) and the standard "wake everybody"
variety. By providing different callbacks, kernel subsystems can change
the semantics of the wait operation.
Wait queues, in other words, are evolving from a mechanism that puts a
process to sleep for a while into a more general event notification
mechanism. The immediate application for this mechanism is asynchronous
I/O, but it will be interesting to see what others turn up.
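The callback scheme described above can be modeled in a few lines of userspace C. This is a sketch, not the kernel's wait queue code: every name here (wait_entry, wq_add(), wq_wake(), note_completion()) is hypothetical, entries are never removed from the queue, and there is no locking. It shows only the two ideas from the text: each waiter supplies its own wakeup callback, and an "exclusive" waiter stops the walk once enough of them have been woken.

```c
#include <stddef.h>

/* Hypothetical userspace model of a callback-driven wait queue. */
struct wait_entry;
typedef int (*wake_fn)(struct wait_entry *entry);

struct wait_entry {
    wake_fn func;              /* invoked when a wakeup hits the queue */
    int exclusive;             /* nonzero: counts toward the wake limit */
    void *data;                /* private state for the callback */
    struct wait_entry *next;
};

struct wait_queue {
    struct wait_entry *head;
    struct wait_entry *tail;
};

void wq_init(struct wait_queue *q)
{
    q->head = q->tail = NULL;
}

/* Append an entry to the end of the queue. */
void wq_add(struct wait_queue *q, struct wait_entry *e)
{
    e->next = NULL;
    if (q->tail)
        q->tail->next = e;
    else
        q->head = e;
    q->tail = e;
}

/* Call each entry's callback in order. Stop once nr_exclusive exclusive
 * entries have reported success; nr_exclusive == 0 means "wake everybody".
 * Returns the number of callbacks invoked. */
int wq_wake(struct wait_queue *q, int nr_exclusive)
{
    int called = 0;

    for (struct wait_entry *e = q->head; e; e = e->next) {
        called++;
        if (e->func(e) && e->exclusive && --nr_exclusive == 0)
            break;
    }
    return called;
}

/* A callback in the style asynchronous I/O needs: it records that the
 * event happened rather than putting anything to sleep or waking it. */
int note_completion(struct wait_entry *e)
{
    int *completed = e->data;

    (*completed)++;
    return 1;
}
```

Because the queue only ever calls `func`, a subsystem can change what "being woken" means simply by installing a different callback.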
The Linux kernel stack is a limited resource; it must fit into two pages of
memory, which it shares with some process information. Overflowing the
kernel stack can be a catastrophic event, and it can happen at surprising
times, such as in interrupt handlers. After a recent Stanford Checker
posting pointing out numerous places where large structures have been
allocated on the stack, and with proposals to consider reducing the size of
the stack, there has been an increase in interest in minimizing kernel
stack usage.
One bit of code that caught Andries Brouwer's eye was the resolution of
symbolic links. In the process of symlink resolution, the kernel can
encounter new links which must also be resolved; this is handled by a
recursive call into the resolution code. Each call, of course, requires
kernel stack space, so recursive calls must be looked at with care - unless
the recursion is carefully bounded, it can easily overflow the kernel
stack. The symlink code handles this constraint by limiting the symlink
depth to five.
Andries has posted a new symlink
implementation that eliminates the recursion. Instead, it maintains
its own stack - allocated separately - which contains the current state of
symlink resolution. In this way, the five-level limit can be lifted
without fear of overrunning the kernel stack. Of course, it is extremely
rare that anybody actually hits the five-level limit; there are special cases, however, where users do
interesting things with symbolic links.
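The general technique, replacing recursion with a separately allocated stack of pending work, can be sketched in userspace C as follows. This is not Andries's patch: the toy filesystem table, the names (node, lookup(), resolve()), and the MAX_LINKS loop guard are all assumptions made for illustration, and "." and ".." are not handled. When a link is found partway through a path, the unresolved remainder stays on the heap-allocated stack while the link target is walked.

```c
#include <stdlib.h>
#include <string.h>

/* Toy filesystem model; all names here are hypothetical. */
enum kind { K_DIR, K_FILE, K_LINK };

struct node {
    int parent;            /* index of the containing directory, -1 for root */
    const char *name;
    enum kind kind;
    const char *target;    /* link target path, used only for K_LINK */
};

#define MAX_LINKS 40       /* assumed guard against symlink loops */

/* Find the entry called name (len bytes long) in directory dir; -1 if absent. */
int lookup(const struct node *fs, int n, int dir, const char *name, size_t len)
{
    for (int i = 0; i < n; i++)
        if (fs[i].parent == dir && strlen(fs[i].name) == len &&
            memcmp(fs[i].name, name, len) == 0)
            return i;
    return -1;
}

/* Resolve path starting at the root (node 0) without recursion.
 * Returns the index of the final node, or -1 on failure. */
int resolve(const struct node *fs, int n, const char *path)
{
    /* Explicit stack of path remainders, allocated off the call stack. */
    const char **pending = malloc((MAX_LINKS + 1) * sizeof(*pending));
    int top = 0, cur = 0, links = 0;

    if (!pending)
        return -1;
    pending[top++] = path;

    while (top > 0) {
        const char *p = pending[top - 1];

        while (*p == '/')
            p++;
        if (*p == '\0') {              /* this remainder is used up */
            top--;
            continue;
        }
        size_t len = strcspn(p, "/");  /* length of the next component */
        pending[top - 1] = p + len;    /* save what is left after it */

        int next = lookup(fs, n, cur, p, len);
        if (next < 0) {
            cur = -1;
            break;
        }
        if (fs[next].kind == K_LINK) {
            if (++links > MAX_LINKS) { /* too many links: assume a loop */
                cur = -1;
                break;
            }
            if (fs[next].target[0] == '/')
                cur = 0;               /* absolute target restarts at root */
            pending[top++] = fs[next].target;
        } else {
            cur = next;
        }
    }
    free(pending);
    return cur;
}
```

The call stack stays the same depth no matter how many links are nested; the only limit left is the loop guard, which bounds the heap allocation rather than kernel stack usage.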
Not all developments are oriented toward reducing kernel stack usage, however.
Andi Kleen has posted a patch which does the
opposite in order to make the select and poll system
calls perform better. These calls (which share most of an internal
implementation) allocate a couple of pages of kernel memory to hold the
requisite data structures; they are sized to be able to handle situations
where large numbers of file descriptors are being waited on. In reality,
however, many (if not most) select and poll calls are
given only a small number of file descriptors, so much of that memory is
wasted.
Andi's patch works by setting up a separate fast path for when only a small
number of file descriptors are in use. Rather than allocate those two
pages, the fast path uses a small, in-stack array. The stack space usage
is limited to 256 bytes, which will fit easily even on a reduced-size
stack. The new implementation not only saves a couple of kernel pages for
each process calling select (and there can be many on a typical Linux
system), it's faster as well. The patch has been included in 2.5.23-dj2,
and will likely find its way into the mainline before too long.
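The fast-path idea can be illustrated with a small userspace sketch. It is not Andi's code: the slot layout, the names (scan_fds(), fd_is_even()), and the readiness test are invented for the example, and only the 256-byte stack budget comes from the text. A small request is served from an on-stack array; a larger one falls back to a heap allocation.

```c
#include <stdlib.h>

/* Hypothetical per-descriptor record; the real layout is not shown here. */
struct slot {
    int fd;
    short events;
    short revents;
};

#define FAST_BYTES 256     /* stack budget mentioned in the text */
#define FAST_SLOTS (FAST_BYTES / sizeof(struct slot))

/* Stand-in readiness test for the example: even descriptors are "ready". */
int fd_is_even(int fd)
{
    return fd % 2 == 0;
}

/* Count ready descriptors among fds[0..n-1]. Small requests use the
 * on-stack table; larger ones allocate. *used_heap reports which path ran.
 * Returns the ready count, or -1 if allocation fails. */
int scan_fds(const int *fds, size_t n, int (*ready)(int fd), int *used_heap)
{
    struct slot fast[FAST_SLOTS];   /* fast path: no allocation at all */
    struct slot *table = fast;
    int count = 0;

    *used_heap = 0;
    if (n > FAST_SLOTS) {           /* slow path: too many for the stack */
        table = malloc(n * sizeof(*table));
        if (!table)
            return -1;
        *used_heap = 1;
    }

    for (size_t i = 0; i < n; i++) {
        table[i].fd = fds[i];
        table[i].events = 1;
        table[i].revents = (short)(ready(fds[i]) ? 1 : 0);
        count += table[i].revents;
    }

    if (table != fast)
        free(table);
    return count;
}
```

Keeping the on-stack table to a fixed, small size is what makes the trade acceptable: the common case avoids the allocator entirely, while the stack cost stays bounded regardless of how many descriptors a caller passes.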
Rik van Riel's reverse-mapping virtual memory implementation (RMAP) has
been under development for several months; it has attracted some attention
as a possible way of improving Linux VM performance in the future. Thus
far, however, RMAP has only been available for the 2.4 series, so it has
been hard to evaluate as a possible addition to 2.5.
That situation has just changed, however: Craig Kulesa decided to port RMAP
to the 2.5.23 kernel. He posted it in two forms: a full port which makes many changes, and a minimal version which adds only the reverse
mapping code itself. Craig's preliminary benchmark results show a
respectable performance improvement in 2.5.23 when the RMAP code is added
in.
A much more serious benchmarking effort will have to be done before any
real conclusions about RMAP in 2.5 can be drawn. This port, however, has
attracted a fair amount of interest. If more detailed numbers can be
obtained soon, RMAP in 2.5 should be an active area of discussion at next
week's kernel summit.
It has been a few weeks since a "concerns about the IDE reimplementation
process" article appeared here, so it must be about time. The conversation
this time around started with
a complaint
that recent kernels can deadlock when reading partition tables; it included
"a small plea for more testing" before IDE patches are unleashed upon the
world. Dave Jones
followed up with a remark
of his own:
"When the IDE carnage first started back circa 2.5.3, I had
contemplated not merging *any* of the IDE patches, just so that
people who want to work on other areas could have something solid
to build upon. I regret not following through on that instinct."
Linus, however, remains unworried:
"We're not supposed to be writing code and then releasing it when it
is done. We _want_ incremental changes, and open breakage."
So the IDE process is likely to continue as it has. Be careful out there.
In a separate conversation, a user requested
the restoration of the IDE taskfile operations. Those operations had been
removed relatively early in Martin Dalecki's series of patches. He has not
promised to restore them, but previous IDE maintainer Andre Hedrick jumped in with an interesting comment:
"In the end, I will end up writing a closed ATA binary driver for
sale as a replacement. I have had several requests to consider the
option. As much as I do not like the idea, it is less offensive
than the current direction."
It would be a shame if Linux users were driven to use a binary-only driver
for such a fundamental subsystem due to lack of support for needed
operations. The next stable kernel is still far away, however; plenty of
time remains for these issues to be dealt with.
Jens Axboe has, meanwhile, released a version
of his "tagged command queueing for IDE" patch, backported to the
2.4.19-pre kernel.
Patches and updates
Kernel trees
- Chris Wright: 2.4.19-pre10-lsm1. 2.4.19-pre10 kernel with the Linux Security Module patch applied. (LSM patch also available separately).
(June 14, 2002)
Core kernel code
- Benjamin LaHaise: 2.5.22 add __fput for aio. A precursor patch providing a facility needed by the asynchronous I/O patch.
(June 17, 2002)
- Andi Kleen: poll/select fast path. Optimizes the select/poll system calls when the number of file descriptors is small.
(June 18, 2002)
- Andi Kleen: poll/select fast path. A new implementation fixing some problems with the first version.
(June 19, 2002)
Device drivers
- Roland Dreier: 2.4 add __dma_buffer alignment macro. A macro for addressing the "DMA to small buffers on cache incoherent systems" problem discussed in the June 13 LWN Kernel Page.
(June 13, 2002)
- Kurt Garloff: /proc/scsi/map. Creates a /proc file listing SCSI devices with controller, target, and unit numbers.
(June 19, 2002)
Filesystems and block I/O
- Jens Axboe: block-highmem-all-19. Block I/O out of high memory without bounce buffers (this patch intended for a future 2.4.20 prepatch).
(June 18, 2002)
- Andries Brouwer: symlink recursion. An implementation of symbolic link resolution which is not recursive (and, thus, takes less kernel stack space).
(June 18, 2002)
Janitorial
- Matthew Wilcox: Remove SCSI_BH. Make the SCSI system use a tasklet instead.
(June 17, 2002)
Memory management
- Andrew Morton: writeback tunables. Adds five sysctl entries for tuning writeback behavior.
(June 17, 2002)
Networking
Miscellaneous
- Denis Vlasenko: linld 0.95. A Linux boot loader.
(June 14, 2002)
- Rusty Russell: Initcall depends. Updated version of the initialization order patch (covered in the June 13 Kernel Page).
(June 17, 2002)
Page editor: Jonathan Corbet