The current development kernel is 2.5.25, which was announced by Linus on July 5. It includes
a 1000 HZ internal clock on x86 processors (though that may change; the
real point of interest is that the internal clock has been detached from
the HZ seen in user space), some SCSI midlayer work (see last week's LWN Kernel Page
description of the plan for SCSI), a bunch of filesystem and VM layer
cleanups, an NTFS update, more kbuild tweaks, and many other changes.
Those wanting details can look at the long-format changelog.
Linus's BitKeeper tree for 2.5.26 contains only a small set of fixes as of this writing.
The latest prepatch from Dave Jones is 2.5.25-dj1, which catches up to the 2.5.25
kernel and throws in a number of fixes and a "fatfs crapectomy."
The latest 2.5 status summary from Guillaume
Boissiere is dated July 10.
The current stable kernel is 2.4.18; Marcelo has not released any
new 2.4.19 release candidates over the last week.
Alan Cox has released 2.4.19-rc1-ac1, which
catches up to the first 2.4.19 release candidate and adds a small set of fixes.
Kernel development news
Andrew Morton's "direct-to-BIO for O_DIRECT" patch
is another step in the process of converting the file I/O
subsystem over to the new BIO request structure. Files opened with
O_DIRECT are a bit of a special case, in that I/O happens directly
to or from a userspace buffer. Andrew's patch sets up a BIO request
pointing directly to that buffer; for large operations, the result is a
significant performance improvement.
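To make the idea concrete, here is a much-simplified sketch of the
technique - not Andrew's actual code; the function name, fixed-size page
array, and error handling are invented here, and 2.5-era block
interfaces are assumed. The user's pages are pinned in memory and a BIO
is pointed straight at them, so no kernel-side buffering is needed:

    #include <linux/mm.h>
    #include <linux/bio.h>

    static struct bio *user_buffer_to_bio(struct block_device *bdev,
                                          unsigned long uaddr, int nr_pages,
                                          sector_t sector, int rw)
    {
        struct page *pages[64];   /* assume nr_pages <= 64 for brevity */
        struct bio *bio;
        int i, got;

        /* Pin the userspace pages so they cannot go away during I/O. */
        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr, nr_pages,
                             rw == READ, 0, pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (got < nr_pages)
            return NULL;          /* real code would unpin and clean up */

        bio = bio_alloc(GFP_KERNEL, nr_pages);
        bio->bi_bdev = bdev;
        bio->bi_sector = sector;

        /* Point each BIO segment directly at a pinned user page. */
        for (i = 0; i < got; i++) {
            bio->bi_io_vec[i].bv_page   = pages[i];
            bio->bi_io_vec[i].bv_len    = PAGE_SIZE;
            bio->bi_io_vec[i].bv_offset = 0;
        }
        bio->bi_vcnt = got;
        bio->bi_size = got * PAGE_SIZE;
        return bio;               /* caller hands this to submit_bio() */
    }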
That sort of optimization is certainly worthwhile. The really interesting
part of this patch, however, is that it shorts out the "kiobuf" layer for
O_DIRECT, and for the raw block I/O devices as well. Kiobufs were
initially implemented to support that sort of raw I/O; they were intended
to be a generic abstraction for a collection of physical pages in I/O
operations. Kiobufs have been gradually falling out of favor over the last
couple of years, however, as their limitations have come to light. They
are a relatively heavyweight data structure, with high setup and teardown
costs. Kiobufs also break down operations into relatively small chunks
which must be processed sequentially, slowing down large requests.
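For reference, a trimmed (and lightly paraphrased) version of the 2.4
structure from <linux/iobuf.h> shows where that weight comes from; note
in particular the per-block buffer_head array, which is what forces
large requests to be broken into small, sequentially-processed chunks:

    struct kiobuf {
        int            nr_pages;   /* pages actually referenced */
        int            array_len;  /* space in the allocated lists */
        int            offset;     /* offset to start of valid data */
        int            length;     /* number of valid bytes of data */
        struct page    **maplist;  /* the pinned pages themselves */
        struct buffer_head **bh;   /* one buffer_head per block */
        unsigned long  *blocks;    /* device block numbers */
        atomic_t       io_count;   /* I/Os still outstanding */
        void           (*end_io)(struct kiobuf *);
        /* ... locking and wait-queue fields omitted ... */
    };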
The direct-to-BIO patch has eliminated the original and largest use of
kiobufs within the kernel. That leads to the obvious question: is it time
to remove kiobufs from 2.5? The answer seems to be "yes," and some patches removing the last remaining uses of
kiobufs have started appearing. Kiobufs, it seems, are on the way out.
The only gap left if kiobufs are removed would be direct I/O support for
character devices. There are devices which can benefit from direct I/O:
consider the SCSI generic layer, video devices, or high-speed tape drives.
Requests have been posted for a function which would map a userspace buffer
into a "scatterlist," a data structure representing memory which has been
set up for DMA operations. This capability would take almost all of the
pain out of supporting direct I/O in character devices; no such patch has
yet been posted, though.
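As a sketch of what is being asked for - the declaration below is purely
hypothetical; its name and signature are invented here, and nothing like
it has been merged - a character driver would want something along these
lines:

    /*
     * Hypothetical helper: pin the pages behind a user buffer and
     * describe them in a scatterlist, ready for DMA mapping.
     * Returns the number of scatterlist entries used.
     */
    int map_user_to_scatterlist(struct scatterlist *sg, int max_entries,
                                unsigned long uaddr, size_t len, int write);

A driver like the SCSI generic layer could then hand the resulting
scatterlist to pci_map_sg() and start DMA directly to or from the user's
memory, with no intermediate copy.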
The volume of complaints about the 2.5 IDE subsystem is increasing.
Consider this posting
from Russell King:
If stuff in 2.5 wasn't soo broken (looking at IDE here) then more
people would be using it, and less people would be wanting the 2.5
features back ported to 2.4. IMHO, at the moment 2.5 has a major
problem. It is not getting the testing it deserves because things
like IDE and such like aren't reasonably stable enough.
...or this one from Andi Kleen...
Testing 2.5 (in this case with x86-64) is a major problem unless
you're lucky enough to find a SCSI adapter and a SCSI disk. IDE
just deadlocks and hangs too often. This prevents testing
everything else and stops development in 2.5 for many things.
The state of the IDE code is seen by many as a drag on the 2.5 development
process as a whole. For those who are concerned, there are a few things
worth looking at.
Part of the problem, apparently, is that the 2.5.25 kernel is missing
several of the more recent patches, which fix serious problems. As Martin Dalecki puts it:
My plan is to provide a 98 soon which will be cummulative against
2.5.25, just to geive people a chance to work on it again. But as
it stands - *plain* 2.5.25 is indeed very dangerous in this regard.
Martin's IDE-98 patch has not been posted as of this writing; those wanting
to run 2.5.25 on an IDE system in the meantime and actually keep their files
should apply this set of patches.
Interestingly, most of those patches were not posted
by Martin (who has been on vacation). Instead, the recent IDE patches have
been produced by Bartlomiej Zolnierkiewicz. Bartlomiej seems to take a bit
more cautious approach, and even has the
respect of former IDE maintainer Andre Hedrick. With luck, he will be
more involved in future IDE work. Few people contest the need to "clean
up" the IDE layer, but this work needs to be done in a very careful way.
Meanwhile, a different approach has been taken by Jens Axboe. It is normal
for interesting features in the current development series to be backported
to the previous stable kernel. Thus, for example, Alan Cox's 2.4.19-ac
patch includes the O(1) scheduler from 2.5. Jens has gone the other
direction and posted a patch (since updated) which "foreports" the 2.4 IDE layer to
2.5. His purpose was to have a stable platform to work on; the patch will
be maintained until the 2.5 IDE layer becomes a little more trustworthy.
It is not intended to be a long-term replacement for that layer.
With luck, the 2.5 IDE issues will settle out soon. Meanwhile, caution (or
a SCSI system) is suggested for people running 2.5.
In the beginning, Alan Cox created the big
kernel lock (BKL), and Linux became SMP-capable. The BKL ensured that only
one processor could be running kernel code at any given time, thus keeping
the processors from stepping on each other. It was an effective way of
bringing SMP support to a kernel which had not been designed for multiple processors.
The problem with the BKL, of course, is that multiple processors often want
to run concurrently in kernel code. Most of the time, those processors are
working on entirely different tasks and would not interfere with each
other. The more processors you have, the worse the problem gets; the Linux
kernel with just one big lock (i.e. 2.0) really did not function all that
well with more than two processors. Any additional CPUs would just spend
their time waiting to be able to get into the kernel code.
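Working with the BKL could hardly be simpler; code needing serialization
just brackets itself with a lock/unlock pair. A minimal illustration
(not taken from any particular driver):

    #include <linux/smp_lock.h>

    static int example_operation(void)
    {
        lock_kernel();    /* no other CPU may run BKL-covered code now */
        /* ... touch data structures protected by the BKL ... */
        unlock_kernel();
        return 0;
    }

That simplicity is exactly why the BKL worked as a first step - and why
it serializes far more than it needs to.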
Scalability to larger systems thus requires finer-grained locking. The BKL can
be split into a memory management lock, a networking lock, a filesystem
lock, etc. In the 2.1 development series, for example, the block I/O
subsystem adopted its own lock (io_request_lock) to keep the block
code and drivers from getting into trouble. Scalability was improved,
since the block code no longer needed the BKL, and could execute
concurrently with other kernel code.
But the io_request_lock serialized all block request handling. A
process submitting requests for one drive could not run concurrently with a
different process working with a different device. Floppy operations
contended for the same lock as performance-critical disk requests. The I/O
request lock improved scalability but, on systems with enough processors and
drives, it remained a bottleneck. So, one of the first steps in the 2.5
block subsystem work was to replace io_request_lock with a
per-queue lock, one for each device. The result will be better performance
on large, disk-intensive systems.
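In simplified form - both structures are trimmed to the lock itself
here - the change looks like this:

    /* 2.4 style: one global lock serializes all block request handling. */
    extern spinlock_t io_request_lock;

    /* 2.5 style: each request queue carries its own lock, so requests
     * for different devices can be handled concurrently. */
    struct request_queue {
        spinlock_t *queue_lock;   /* per-queue (usually per-device) lock */
        /* ... many other fields trimmed ... */
    };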
Most other kernel subsystems have been going through a similar development
process: global locks are replaced by multiple locks which protect smaller
data structures. This increasingly fine-grained locking makes the kernel
scalable to more and more processors, but it also brings some real costs.
For example, most of us do not run Linux on huge systems, and probably
never will. Embedded SMP systems are also rare.
All that locking has a real cost on the SMP systems people do run, even
though the compiler optimizes it away on uniprocessor builds.
The real cost, however, is in the complexity of the kernel code. As the
kernel becomes populated with thousands of little locks, it becomes
increasingly difficult to write correct kernel code. Which lock(s) must
you have to access a given data structure, or to call a given function? In
which order should locks be taken? Consider two code paths, both of which
need locks L1 and L2. The first thread takes L1, the second takes L2; each
then tries to take the other lock. The result is a deadlocked system.
Avoiding this problem requires specifying ordering relationships for every
pair of locks that can be held together - and the number of such pairs
grows quadratically with the number of locks.
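The scenario is easy to reproduce; here is a minimal user-space
demonstration using POSIX threads in place of kernel spinlocks (the
sleep() calls just make the race reliable - the program deadlocks every
time):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    static void *thread_a(void *arg)
    {
        pthread_mutex_lock(&L1);  /* takes L1 first... */
        sleep(1);                 /* let thread_b grab L2 */
        pthread_mutex_lock(&L2);  /* ...then blocks forever on L2 */
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
        return NULL;
    }

    static void *thread_b(void *arg)
    {
        pthread_mutex_lock(&L2);  /* takes L2 first... */
        sleep(1);                 /* let thread_a grab L1 */
        pthread_mutex_lock(&L1);  /* ...then blocks forever on L1 */
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);    /* never returns: classic AB-BA deadlock */
        pthread_join(b, NULL);
        return 0;
    }

Imposing a single order - both threads always take L1 before L2 - makes
the deadlock impossible; that ordering is precisely what must be
specified, and remembered, for the kernel's locks.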
One can try to document the locking requirements of each data structure and
function in the kernel, and every lock ordering constraint.
But, even if one honestly believed that such a
document would be created (and, importantly, maintained), it would be a
very thick, complicated manual. A kernel with many locks will be a kernel
that is difficult to program.
Some people (Larry McVoy, for example) have been arguing for years that Linux
should not chase the "scalability" goal too far. Down that road lies a
kernel that is twisted beyond maintainability, and, once you realize that
this has happened, it is too late to go back. For the most part,
scalability work has continued in the face of those warnings, but there are
signs that things are beginning to change. For example, a recent patch which removed the BKL from the driverfs
code was shouted down in a fairly strong way. Alexander Viro stated, in characteristic fashion:
"Zillion little spinlocks" means that kernel is scaled into
oblivion. Literally. If you want to play with resulting body -
feel free, but I like it less kinky.
So, while there has been no definitive statement of policy, it looks like
at least some kernel developers are thinking that locking in the kernel is
complex enough. There may be no 64-processor Linux in our future...
...at least, not in the classic SMP form. Larry McVoy has been pushing "cache-coherent clusters" as an alternative
approach for some time. A CC/cluster takes a large machine and divides it
into small groups of processors (four, say); each group runs an independent
Linux kernel. The kernels have minimal interactions with each other, so
locking issues fade to the background. Nobody has, yet, implemented such a
cluster, though a lot of the pieces are there. If somebody runs with this
idea, Linux could yet be the most scalable system of them all.
Patches and updates
Core kernel code
- William Lee Irwin III: lazy_buddy-2.5.25-1. Defers coalescing of adjacent pages in the buddy allocator as a way of making some operations go faster.
(July 10, 2002)
- Rusty Russell: cpu_mask_t. "This fixes the last of my cpu_online_map damage, completing the abstraction."
(July 10, 2002)
- Jens Axboe: 2.4 IDE core for 2.5. "I needed stable IDE for 2.5 testing and it
was/is clear that 2.5 just isn't quite there yet. I intend to maintain
this patch set until I deem 2.5 IDE stable enough (in code) that I'm
willing to spend time on that instead."
(July 9, 2002)
Filesystems and block I/O
- Vitaly Fertman: reiserfsprogs release. "The most of changes are just bug fixes and speedups."
(July 10, 2002)
Page editor: Jonathan Corbet