The current development kernel is 2.5.64, unchanged from one week
ago. Linus has been busy, however; his BitKeeper tree includes more driver
model work, the continuing removal of unwanted stuff from devfs, a uClinux
update, an x86-64 update, some block layer cleanups (see below), scheduler
changes for improved interactive response (see below again), and a number
of other fixes.
Alan Cox has released 2.5.64-ac3 which adds
a new set of IDE updates. "Handle with care."
The current stable kernel is 2.4.20; Marcelo has not released any
2.4.21 prepatches over the last week.
Alan Cox's current 2.4.21 prepatch is 2.4.21-pre5-ac3. Here you'll find an even
newer set of IDE changes, along with quite a few other fixes and updates.
Kernel development news
The 2.5 kernel features a massively reworked scheduler which, among other
things, improves the interactive feel of a desktop system. It goes to
great lengths to try to separate interactive tasks from "background"
processes, and to give a priority boost to the former. One way that this
distinction is made is to look at how much time each process spends
sleeping. Processes that sleep a lot are generally waiting for humans to
do something, so the kernel tries to ensure that, when they wake up, they
get quick access to the processor.
This heuristic works well much of the time, but it also fails badly in some
situations. Consider, for example, the case of a user dragging a window
across the screen. That sort of operation can require a fair amount of
computation on the part of the X server. If the system is busy anyway
(with a kernel compilation, for example), the X server can end up using all
of the processor time that is available to it. When the server stops
sleeping, the kernel concludes that it is a compute-bound background task
and drops its priority. At that point, the pointer stops keeping up with
the mouse, and the desktop experience becomes generally unpleasant.
A classic solution (which predates Linux) for this problem is to raise the
priority of the X server. A higher-priority server can make things work
better for some users, but it ignores the fact that similar situations can
arise with other interactive processes that require
a fair amount of processor time. Streaming media applications tend to work
this way, for example. Raising the priority of the X server can make
things worse for this sort of application. Also, as Linus points out, tweaking priorities in this way is
an indication that the system has failed somehow:
Something is wrong, and we couldn't fix it, so here's the band-aid
to avoid that problem for that particular case. It's acceptable as a
band-aid, but if you don't realize that it's indicative of a
problem, then you're just kidding yourself.
A few patches have gone into the 2.5.65 kernel which, by most reports, make
things a lot better. One of them, which originally came from Linus, is
based on the recognition that, if an interactive process is waiting for
another process to do something, that other process should be considered
interactive as well. The X server may be using a fair amount of CPU time,
but, since interactive processes (i.e. the clients that the user works
with) are waiting for it, the X server should still be seen as an
interactive process.
The ideal time to make this adjustment might be when an interactive process
goes to sleep waiting for an event. Unfortunately, that is hard to do; the
kernel has no way to know, in the general case, who will be waking up
processes that sleep on a particular queue. On the other hand, when the
wakeup actually occurs, the relationship is immediately obvious. So the
new scheduler will, at wakeup time, look at the interactivity bonus for the
process being awakened. If that process has maxed out its bonus (as
processes that sleep a lot will), the "excess" interactivity bonus is
given, instead, to the process which is performing the wakeup. Thus, a
sleeping mail client gives some of its bonus to the X server, which wakes
it up. This patch is said to improve the interactivity of X.
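The bonus hand-off described above can be modeled in a few lines. This is only a toy sketch, not the kernel's code: the `Task` class, the `wake_up()` helper, the `MAX_BONUS` cap, and the units are all hypothetical, chosen to show the sleeper's excess credit flowing to the process that wakes it.

```python
from dataclasses import dataclass

MAX_BONUS = 10  # hypothetical cap on the interactivity bonus


@dataclass
class Task:
    name: str
    bonus: int = 0  # accumulated interactivity credit (arbitrary units)


def wake_up(waker: Task, sleeper: Task, slept: int) -> None:
    """Credit `sleeper` for its sleep time; hand any excess to `waker`."""
    total = sleeper.bonus + slept
    excess = max(0, total - MAX_BONUS)
    sleeper.bonus = min(total, MAX_BONUS)
    # The waker's own bonus is capped as well.
    waker.bonus = min(waker.bonus + excess, MAX_BONUS)


# The X server wakes a mail client that has been asleep for a while.
x_server = Task("X server", bonus=2)
mail = Task("mail client", bonus=9)
wake_up(waker=x_server, sleeper=mail, slept=4)
print(mail.bonus, x_server.bonus)
```

In this model the mail client tops out at the cap and the three units it cannot use end up with the X server, which is the relationship the patch is trying to capture.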
Ingo Molnar has taken Linus's patch and merged it into a larger set of
scheduler changes (which, in turn, has gone into 2.5.65). Some of the
additional changes that have been made include:
- Various scheduler parameter tweaks. The maximum timeslice given to
any process has been reduced, for example (to 200ms).
- One process can preempt another with the same priority, if the former
has a longer remaining timeslice.
- The first wakeup of a newly-forked child has been made smarter,
resulting in less work being redone.
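As a rough illustration of the equal-priority preemption rule in the list above, here is a minimal sketch. The `Runnable` tuple and `should_preempt()` are hypothetical names; the only details taken from the text are the 200ms maximum timeslice and the tie-breaking rule itself.

```python
from typing import NamedTuple

MAX_TIMESLICE_MS = 200  # maximum timeslice mentioned in the text


class Runnable(NamedTuple):
    prio: int          # lower value means higher priority (kernel convention)
    timeslice_ms: int  # remaining timeslice


def should_preempt(current: Runnable, woken: Runnable) -> bool:
    """Decide whether a newly woken task should displace the running one."""
    if woken.prio != current.prio:
        return woken.prio < current.prio
    # Same priority: the task with more timeslice left wins.
    return woken.timeslice_ms > current.timeslice_ms


print(should_preempt(Runnable(120, 50), Runnable(120, 150)))
```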
The end result of these changes is a kernel which provides a much more
satisfying interactive experience. Note, however, that some causes of X
server stalls - in particular, those related to disk I/O scheduling - still
have not been resolved. Work is ongoing, however.
(See also: Jim Houston's self-tuning scheduler
patch, which takes a different approach to scheduler improvement).
Long-suffering block driver maintainers will have to cope with a new change
in 2.5.65: this patch from Andries Brouwer
changes the prototype of register_blkdev(), which is used by block
drivers to tell the kernel of their existence. The previous version of
this function took a pointer to a struct block_device_operations,
which contains some of the operations provided by the driver. That
parameter has not been used for some time (block operations are now
directly associated with disks, and are kept in the generic disk
structure), so Andries removed it.
Not everybody agreed with this change. With all of the work that has been
done in the block layer, register_blkdev() does not actually do
very much anymore. Its main remaining purpose is to associate a driver
name with a major number, so that it shows up in /proc/devices. A
block driver can now function nicely without calling
register_blkdev() at all. The long-term plan is to remove
register_blkdev() altogether. In the meantime, it was asked, why
bother changing the prototype of a doomed function? Even so, the change
was merged into 2.5.65.
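The name-to-major association that register_blkdev() still provides is what user space sees in /proc/devices. As a small sketch of reading that table, the hypothetical helper below pulls out the block device section; the sample text is made up for illustration and is not output from any particular system.

```python
from typing import Dict


def parse_block_majors(text: str) -> Dict[int, str]:
    """Return {major: driver name} from the block section of /proc/devices."""
    majors: Dict[int, str] = {}
    in_block = False
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("devices:"):
            # Section headers: "Character devices:" and "Block devices:"
            in_block = line.startswith("Block")
            continue
        if in_block and line:
            major, name = line.split(None, 1)
            majors[int(major)] = name
    return majors


# Illustrative sample only.
SAMPLE = """Character devices:
  1 mem
  4 tty

Block devices:
  3 ide0
  8 sd
"""

print(parse_block_majors(SAMPLE))
```

On a running Linux system the same function can be fed the contents of /proc/devices, for example `parse_block_majors(open("/proc/devices").read())`.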
The real purpose of Andries's patch, however, was to get rid of the static
blkdevs array used to keep track of block devices in the kernel.
blkdevs is about the only static array left in the block
subsystem, and thus is one of the remaining impediments to Andries's real
goal: the long-awaited expansion of dev_t to 32 bits.
The 32-bit dev_t is one of the final items on the 2.5
"todo" list. It is still considered important by many users: an Oracle
engineer mentions 4000-disk systems that
"want to go to Linux" but can't, and from IBM we hear about a 5000-drive system with waiting
customers. There appears to be little opposition to the adoption of a
larger dev_t, even at this late stage. But everybody agrees that
it would be best to get this change done sooner rather than later.
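To see why large disk farms run out of room, it helps to work through the arithmetic. The sketch below models a device number as a major and minor packed into one integer. The traditional 16-bit dev_t gives eight bits to each; the 12:20 layout for a 32-bit dev_t is an assumption made only for illustration, since the text does not say how the bits would be divided. The figure of 16 minors per disk is likewise hypothetical.

```python
from typing import Tuple

OLD_SPLIT = (8, 8)     # traditional 16-bit dev_t: 8-bit major, 8-bit minor
WIDE_SPLIT = (12, 20)  # assumption: one possible layout of a 32-bit dev_t


def make_dev(major: int, minor: int, split: Tuple[int, int]) -> int:
    """Pack a major/minor pair, refusing values that do not fit."""
    major_bits, minor_bits = split
    if not 0 <= major < (1 << major_bits):
        raise ValueError("major %d does not fit in %d bits" % (major, major_bits))
    if not 0 <= minor < (1 << minor_bits):
        raise ValueError("minor %d does not fit in %d bits" % (minor, minor_bits))
    return (major << minor_bits) | minor


def split_dev(dev: int, split: Tuple[int, int]) -> Tuple[int, int]:
    """Unpack a device number into (major, minor)."""
    _, minor_bits = split
    return dev >> minor_bits, dev & ((1 << minor_bits) - 1)


def majors_needed(disks: int, minors_per_disk: int, split: Tuple[int, int]) -> int:
    """How many major numbers a driver needs to address `disks` disks."""
    _, minor_bits = split
    disks_per_major = (1 << minor_bits) // minors_per_disk
    return -(-disks // disks_per_major)  # ceiling division


# 4000 disks, 16 minors each (hypothetical allotment for partitions).
print(majors_needed(4000, 16, OLD_SPLIT))
print(majors_needed(4000, 16, WIDE_SPLIT))
```

Under these assumptions a 4000-disk system would consume 250 of the 256 possible majors with the old layout, and a single major with the wider one.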
The amount of work remaining is said to be relatively small. The block
layer, for example, is almost ready for a larger dev_t now. The
char subsystem could take more work - many drivers "know" that device numbers
(especially minor numbers) are only eight bits. So a detailed audit of
many drivers could be required. This suggestion
from Alan Cox could make life a little easier, though. The idea would
be to replace the venerable register_chrdev() function with a new
register_chr_device() which takes a parameter indicating the
largest minor number that the driver can deal with. A change to
all char drivers would still be required, but, by defaulting the maximum
minor number to 255, these drivers could be made safe without the need for
a larger "audit and fix" operation. The few drivers that actually need
more minor numbers could be fixed individually.
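The suggestion can be modeled outside the kernel to show how the default limit protects unaudited drivers. Everything in this sketch is hypothetical: the `CharRegistry` class, the `open()` method, and the choice of ENXIO for an out-of-range minor are illustrative stand-ins, not the proposed kernel interface.

```python
import errno


class CharRegistry:
    """Toy model of a char device table with a per-driver minor limit."""

    def __init__(self) -> None:
        self._drivers = {}  # major -> (name, max_minor)

    def register_chr_device(self, major: int, name: str, max_minor: int = 255) -> None:
        if major in self._drivers:
            raise ValueError("major %d already registered" % major)
        self._drivers[major] = (name, max_minor)

    def open(self, major: int, minor: int) -> str:
        """Return the driver name, or fail before the driver ever sees the minor."""
        name, max_minor = self._drivers[major]
        if minor > max_minor:
            raise OSError(errno.ENXIO, "minor %d beyond driver limit" % minor)
        return name


registry = CharRegistry()
registry.register_chr_device(4, "legacy")             # unaudited: default limit
registry.register_chr_device(180, "wide", max_minor=4095)

print(registry.open(4, 255))
print(registry.open(180, 1000))
```

A driver registered with the default never sees a minor above 255, so code that assumes eight-bit minors stays safe even after dev_t grows.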
There are, of course, other issues to deal with before a larger
dev_t will be truly stable. Some protocols (e.g. NFSv2) aren't
prepared for large device numbers. The interface to user space may well
hold a surprise or two. And so on. These are all problems that can be
solved, but the process will take time.
(As an aside, Alexander Viro, who has been an active participant in the
block layer and dev_t work, has been absent from kernel
development for a few months. In a recent
message, however, he proclaimed "I'm finally back - hopefully for
good." Welcome back, Al).
Another incomplete 2.5 development item is initramfs - an initial
filesystem attached to the kernel image. The plan is to move much of the
early boot code into initramfs, so that it can be run in user mode. But
there has not been a whole lot of progress in that direction.
One part of the process is klibc, a small C library to be used in initramfs
applications. A patch exists which adds a
working klibc to the 2.5.64 kernel, but Linus is
not ready to merge it:
However, I also have to say that klibc is pretty late in the game,
and as long as it doesn't add any direct value to the kernel build
the whole thing ends up being pretty moot right now. It might be
different if we actually had code that needed it (ie ACPI in user
space or whatever).
In other words, unless some code which really needs klibc shows up
soon, it may not get merged into 2.5 at all. That would have the effect of
pushing the whole initramfs project back into the next development series.
There are people working on creating this code, but, as Linus says,
it's late in the game.
Smatch is Dan Carpenter's project to
create a free version of the Stanford Checker. The project is making
progress, and smatch is now capable of finding several classes of bugs in
the Linux kernel. Some patches fixing problems
found by smatch have already begun to appear.
The database of problems found by smatch is now hosted at kbugs.org. As of 2.5.64, there are just over
1000 potential bugs in the database. Many of them are certainly false
alarms, but others will be real. An interesting feature of the kbugs.org
site is the ability to "moderate" bugs as being real problems or not.
With this capability, interested volunteers can help to sift out the real
bugs, even if they don't feel able to contribute patches to fix them.
The smatch project is still in an early stage, but it is already showing
great promise as a tool which can help in the creation of a better kernel.
The new epoll interface was covered here previously.
The epoll system calls offer a significant performance
improvement for applications which must frequently poll large numbers of
file descriptors. They do so by performing the setup work only once, and
then trapping new I/O events as they occur.
One aspect of the epoll interface is that it is edge-triggered; it
will only return a file descriptor as being available for I/O after a
change has happened on that file descriptor. In other words, if you tell
epoll to watch a particular socket for readability, and a certain amount of
data is already available for that socket, epoll will block anyway. It
will only flag that socket as being readable when new data shows
up.
Edge-triggered interfaces have their own advantages and disadvantages. One
of their disadvantages, as epoll author Davide
Libenzi has discovered, would appear to be that many programmers do not
understand edge-triggered interfaces. Additionally, most existing
applications are written for
level-triggered interfaces (such as poll() and
select()) instead. Rather than fight this tide, he has sent out
a new patch which switches epoll over to
level-triggered behavior. A subsequent
patch makes the behavior configurable on a per-file-descriptor basis.
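The difference between the two modes is easy to observe from user space. The sketch below uses Python's standard `select.epoll` wrapper around the epoll interface as it exists in current Linux kernels, which is not necessarily identical to the patches discussed here; `EPOLLET` is the per-descriptor flag that selects edge-triggered behavior there. The `poll_twice()` helper is a hypothetical name. It registers an empty pipe, writes one byte, and polls twice without reading the data.

```python
import os
import select


def poll_twice(flags: int):
    """Register a pipe's read end, write once, then poll twice without reading.

    Returns the number of events reported by each poll.
    """
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, flags)
        os.write(w, b"x")
        first = ep.poll(0.1)   # the write makes the descriptor readable
        second = ep.poll(0)    # data is still unread; has anything changed?
        return len(first), len(second)
    finally:
        ep.close()
        os.close(r)
        os.close(w)


print("level-triggered:", poll_twice(select.EPOLLIN))
print("edge-triggered: ", poll_twice(select.EPOLLIN | select.EPOLLET))
```

In level-triggered mode both polls report the descriptor, since data remains available. In edge-triggered mode only the first poll does; the second sees no new change and returns nothing, which is the behavior that tends to surprise programmers used to poll() and select().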
The end result is a more flexible epoll interface that can be more easily
used in existing applications. The patch has not been merged as of this
writing, but there does not seem to be any reason why it shouldn't be.
After all, epoll has not yet appeared in a stable kernel release; now is
the best time to be making improvements to the interface.
Larry McVoy has announced the availability
of the current BitKeeper kernel repository in CVS format. Things are still
stabilizing, but the plan is to have the current 2.4 and 2.5 repositories
available in CVS format in near real time. Almost all of the change and
commit information will be available, making it easy for people who are
unwilling or unable to run BitKeeper to peruse the kernel's revision
history and track current developments. Says Larry:
Our goal is to provide the data in a way that you can get at it
without being dependent on us or BK in any way. As soon as we have
this debugged, I'd like to move the CVS repositories to kernel.org
(if I can get HPA to agree) and then you'll have the revision
history and can live without the fear of the "don't piss Larry off
license". Quite frankly, we don't like the current situation any
better than many of you, so if this addresses your concerns that
will take some pressure off of us.
Of course, when dealing with this sort of topic, things are never that
easy. People will certainly be happy to have the CVS repository available,
but one other aspect of the announcement has made people nervous. It seems
that the near-SCCS file format used by BitKeeper is increasingly difficult
to work with; now that BitKeeper repositories can be accessed in CVS
format, the BitKeeper developers would like to move to a new, proprietary
format. And that idea does not fly with all developers; this complaint from Ben Collins has been echoed
by a few hackers:
You've made quite a marketing move. It's obvious to me, maybe not
to others. By providing this CVS gateway, you make it almost
pointless to work on an alternative client. Also by providing it,
you make it easier to get away with locking the revision history
into a proprietary format.
It is clear that, as long as BitKeeper is in use by the kernel development
community, some people are going to be unhappy. Nothing short of the
complete freeing of the BitKeeper source will satisfy some users, and that
does not appear to be in the cards. Fortunately this disagreement, while
noisy, hasn't really gotten in the way of continued kernel development.
In fact, it hasn't even gotten in the way of BitKeeper as it improves the
kernel development process. Regardless of what one thinks of BitKeeper or
its license, the fact remains that kernel development has been working well
over the last year; an incredible stream of patches has been merged, and
the people involved have stayed sane. As sane as they were before,
anyway.
(As an aside, Larry has suggested that the
license clause that forbids (free) BitKeeper use by people working on other
source management systems could be removed in the future "if we feel
we have pulled far enough ahead that everyone else is just playing
The first big, disruptive changes to the 2.6 kernel came from the reworking
of the block I/O layer. As one might guess, the result of all this work is
a great many changes as seen by driver authors - or anybody else who works
with block I/O. The transition may be painful for some, but it's worth it:
the new block layer is easier to work with and offers much better
performance than its predecessor.
Fully covering the changes that have been made will require a whole series
of articles. So we'll start with an overview which highlights the major
changes that have been made without getting into any sort of detail.
Subsequent articles will fill in the rest.
Note that parts of the block layer remain volatile - this development is
not yet complete. We'll keep up with further changes as they happen.
So, what has changed with the block layer?
- A great deal of old cruft is gone. For example, it is no longer
necessary to work
with a whole set of global arrays within block drivers. These arrays
(blk_size, blksize_size, hardsect_size,
read_ahead, etc.) have simply vanished. The kernel still
maintains much of the same information, of course, but the management
of that information is much improved.
- As part of the cruft removal, most of the <linux/blk.h>
macros (DEVICE_NAME, DEVICE_NR, CURRENT,
INIT_REQUEST, etc.) have been removed;
<linux/blk.h> is now empty. Any block driver
which used these macros to implement its request loop will have to be
rewritten. It is still possible to implement a simple request loop
for straightforward devices where performance is not a big issue, but
the mechanisms have changed.
- The io_request_lock is gone; locking is now done on a
per-queue basis.
- Request queues have, in general, gotten more sophisticated. Quite a
bit of work has been done in the area of fancy request scheduling
(though drivers don't generally need to know about that). There is
simple support for tagged command queueing, along with features like
request barriers and queue-time device command generation. Request
queues must be allocated dynamically in 2.6.
- Buffer heads are no longer used in the block layer; they have been
replaced with the new "bio" structure. The new
representation of block I/O operations is designed for flexibility and
performance; it encourages keeping large operations intact. Simple
drivers can pretend that the bio structure does not exist,
but most performance-oriented drivers - i.e. those that want to
implement clustering and DMA - will need to be changed to work with
bios.
One of the most significant features of the
bio structure is that it represents I/O buffers directly with
page structures and offsets, not in terms of kernel virtual
addresses. By default, I/O buffers can be located in high memory, on
the assumption that computers equipped with that much memory will also
have reasonably modern I/O controllers. Support operations have been
provided for tasks like bio splitting and the creation of DMA
scatter/gather maps.
- Sector numbers can now be 64 bits wide, making it possible to support
very large block devices.
- The rudimentary gendisk ("generic disk") structure from 2.4
has been greatly improved in 2.6; generic disks are now used
extensively throughout the block layer. Among other things, each
generic disk has its own block_device_operations structure;
the operations are no longer directly associated with the driver. The
most significant change for block driver authors, though, may be the
fact that partition handling has been moved up into the block layer,
and drivers no longer need know anything about partitions. That is,
of course, the way things should always have been.
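To put the 64-bit sector numbers mentioned in the list above into perspective, here is a short worked calculation. It assumes the traditional 512-byte sector and, for comparison, a 32-bit sector number such as an unsigned long provides on a 32-bit machine; both are stated assumptions rather than details from the text.

```python
SECTOR_SIZE = 512  # bytes; assumed traditional sector size
TIB = 1 << 40      # bytes in one tebibyte


def max_device_bytes(sector_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    """Largest device addressable with sector numbers of the given width."""
    return (1 << sector_bits) * sector_size


print(max_device_bytes(32) // TIB, "TiB with 32-bit sector numbers")
print(max_device_bytes(64) // TIB, "TiB with 64-bit sector numbers")
```

With 32-bit sector numbers the ceiling is 2 TiB, a size that large arrays can already reach; 64-bit sector numbers push the limit far beyond any current hardware.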
Subsequent articles will explore the above changes in depth; stay tuned.
Patches and updates
Core kernel code
Filesystems and block I/O
- Rik van Riel: rmap 15e.
(March 12, 2003)
Benchmarks and bugs
Page editor: Jonathan Corbet