The current development kernel is still 2.5.59. Linus remains away from
his keyboard.
The current stable kernel is 2.4.20; Marcelo released the fourth 2.4.21 prepatch on January 29.
It includes some MTD driver fixes, a couple of netfilter bug fixes, the new
IPMI driver, fixes for the ethernet information leakage vulnerability, an
x86-64 update, and various other fixes and updates.
The latest prepatch from Alan Cox is 2.4.21-pre3-ac5; it has a few new IDE changes
and a small set of fixes.
Kernel development news
If an operating system is to perform well, it must get top performance out
of its disk drives. The best use of disks is generally obtained by
recognizing a couple of basic facts of life:
- Disk seeks are very slow. Overall transfer rates go up significantly
  if requests can be ordered to minimize the number and distance of seeks.
- Write operations can (usually) be deferred until a convenient time, but there is almost
always a process waiting for the completion of a read. Prioritizing
reads over writes will increase the parallelism and perceived
responsiveness of a system.
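The two rules above can be sketched in a toy, userspace-only dispatcher (all names here are invented for illustration; real I/O schedulers are far more elaborate):

```c
#include <stdlib.h>

struct request {
    long sector;
    int  is_read;              /* 1 = read, 0 = write */
};

/* Reads sort ahead of writes; within each class, requests sort by
 * sector so the disk head can sweep in one direction rather than
 * seeking back and forth. */
static int cmp_requests(const void *a, const void *b)
{
    const struct request *ra = a, *rb = b;

    if (ra->is_read != rb->is_read)
        return rb->is_read - ra->is_read;   /* reads first */
    return (ra->sector > rb->sector) - (ra->sector < rb->sector);
}

/* Reorder a pending queue: reads first, then writes in sector order. */
static void schedule_queue(struct request *q, size_t n)
{
    qsort(q, n, sizeof(*q), cmp_requests);
}
```

Given a queue containing reads at sectors 10 and 40 and writes at 50, 100, and 900, this sketch dispatches the two reads first, then the writes in ascending sector order.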
Unfortunately, minimizing seeks and prioritizing reads can be somewhat
contradictory goals. The way to keep seeks small and relatively rare is to
maintain a sizeable backlog of requests; that way, nearby requests can be
grouped and executed together. Getting the best performance on writes, in
other words, requires delaying write operations for a period of time.
If reads are to be handled quickly, however, they cannot be delayed and
kept in the request queue in this way.
It is even worse than that, actually. A process which is writing a file
can have several outstanding write requests at any given time, since that
process usually does not care when any particular request is completed.
Processes issuing reads, on the other hand, usually generate them one at a
time. The next read operation will not be requested until the previous one
completes. So, even if delaying read operations were an acceptable thing
to do, accumulating a backlog of close read operations is an unlikely
prospect.
The 2.4 kernel tends to mix reads in with the queue of write operations.
Given the way things work (a read is not particularly likely to be close to
an unrelated set of writes), reads tend to get put toward the end of the
queue and executed slowly. That is one reason why 2.4 can be slow to
respond when there is a lot of write activity going on.
In the 2.5 kernel, reads are not allowed to languish long before they are
pushed to the head of the queue and executed. This change can improve
performance significantly, but it still does not solve the whole problem.
As Andrew Morton put it in the 2.5.59-mm5 announcement:
So far so good, but these fixes are still dumb. Because we're
solving the dependent read problem by creating a seek storm. Every
time someone submits a read, we stop writing, seek over and service
the read, and then *immediately* seek back and start servicing writes again.
But in the common case, the application which submitted a read is
about to go and submit another one, closeby on-disk to the first.
So whoops, we have to seek back to service that one as well.
As a result, overall performance suffers, since the disk is spending too
much time seeking.
2.5.59-mm5 contains a new "anticipatory I/O scheduler" by Nick Piggin which
attempts to address this problem. The basic idea is simple: if the drive has
just handled a read request, assume that there is another one coming behind
it and simply wait for a little bit. In this case, the request queue is
plugged, the I/O scheduler sets a timer, and no more requests are passed
down to the drive for a millisecond or so. If a "close" request shows up
during the wait
time, it is serviced right away; the distance that the kernel considers
"close" grows as time passes. Eventually the close requests will stop
coming, or the kernel will decide that it's time to get around to some
writes regardless; at that point normal request dispatching resumes.
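The dispatch-or-wait decision described above can be sketched as a small heuristic; the names, timeout, and distance values below are invented for this example, and the real scheduler is considerably more involved:

```c
#include <stdlib.h>

#define ANTIC_TIMEOUT_US    1000  /* wait at most ~1 ms (illustrative) */
#define BASE_CLOSE_SECTORS  64    /* initial "close" radius (illustrative) */

/* After a read completes, the queue stays plugged.  An incoming request
 * is dispatched immediately only if it is a read "close" to the last
 * one, where the acceptable distance grows as the wait drags on.  Once
 * the timer expires, normal dispatching resumes for everything. */
static int antic_should_dispatch(long last_read_sector,
                                 long new_sector, int new_is_read,
                                 long waited_us)
{
    long close;

    if (waited_us >= ANTIC_TIMEOUT_US)
        return 1;                 /* give up anticipating */
    if (!new_is_read)
        return 0;                 /* keep holding writes back */

    /* "close" widens with elapsed time */
    close = BASE_CLOSE_SECTORS * (1 + waited_us / 100);
    return labs(new_sector - last_read_sector) <= close;
}
```

A nearby dependent read is thus serviced with no extra seek, while an unrelated write submitted during the window simply waits until the timer runs out.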
Andrew reports a big improvement in performance (nearly a factor of six)
over 2.5.59 for a simple test he was running. The code, it is said, still
needs "quite some work," but it's a good start. 2.6 looks on track to be
the most responsive Linux kernel yet.
The Linux kernel contains a number of primitives for controlling mutual
exclusion. Semaphores and spinlocks (in several varieties) have been
around for a while, and the read-copy-update mechanism was added in the 2.5
series. Yet another mechanism, called "fast reader/writer locks," has
found its way into Andrew Morton's -mm patch set, and appears likely to be
forwarded on to Linus soon for inclusion. So this seems like as good a
time as any to look at how "frlocks" work.
Frlocks, as implemented by Stephen Hemminger, are aimed at solving a couple
of problems with the gettimeofday() system call. One is simple
performance; gettimeofday() is not particularly slow, but some
applications (including anything using the X Window System) call it
frequently. It also turns out that the current implementation, which uses
reader/writer spinlocks, is susceptible to a denial of service problem.
Frequent calls to gettimeofday() can delay or lock out timer tick
processing.
The frlock patch works by not blocking readers or writers at all. Code
needing write access to the protected data structure is given that access
immediately (at least, in the absence of other writers), so there is no way
that time updates can be blocked or delayed. Readers, too, get immediate
access to the data structure. The catch is that readers must be prepared
to retry the access if it collides with a writer.
The lock works by maintaining "pre" and "post" sequence numbers. A writer
process does the following:
- Take out a spinlock associated with the frlock
- Increment the "pre" sequence
- Mess around with the data structure
- Increment the "post" sequence
- Release the spinlock
Readers do something like the following:
- Remember the lock's "post" sequence number
- Grab the data of interest
- Ensure that the lock's "pre" sequence matches the remembered "post"
sequence. If not, go back to the beginning.
In other words, as long as the sequence numbers match, the reader knows
that no writer changed the data while the reader was doing its thing.
In practice, the reader side tends to be expressed in code like:
	do {
		seq = fr_read_begin(&some_lock);
		/* Copy the data */
	} while (seq != fr_read_end(&some_lock));
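The whole protocol can be exercised in a minimal userspace rendition; the names below are invented for this sketch, which omits the writer-side spinlock and the memory barriers the real kernel code needs:

```c
struct frlock {
    unsigned pre;          /* bumped before the writer touches the data */
    unsigned post;         /* bumped after the writer is done */
};

static struct frlock lock;
static long protected_value;

/* Writer: increment "pre", update, increment "post".  In the kernel a
 * spinlock serializes concurrent writers around this sequence. */
static void write_value(long v)
{
    lock.pre++;
    protected_value = v;
    lock.post++;
}

/* Reader: remember "post", copy the data, and retry from the top if
 * "pre" no longer matches, meaning a writer intervened. */
static long read_value(void)
{
    unsigned seq;
    long v;

    do {
        seq = lock.post;
        v = protected_value;
    } while (seq != lock.pre);
    return v;
}
```

If a writer runs between the reader's snapshot and its final check, "pre" will have moved past the remembered "post" value and the reader simply copies the data again.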
Frlocks, clearly, will not be suitable for lengthy calculations, or those which
have immediate side effects. In cases where a small data structure is
changed infrequently but read often, however, frlocks may be the key to
improved performance. In the introduction
to the latest set of frlock patches, Stephen claims an 18% improvement in
the speed of gettimeofday() - and the elimination of the timer
tick lockout problem.
Perhaps the biggest bit of unfinished work with the new kernel module
loader is the module versioning support. Module versioning is an attempt
to make binary loadable modules work with multiple kernel versions. It
works by attaching a checksum to each exported kernel symbol; the checksum
is calculated from the prototype or declaration of the symbol. As long as
the checksums in a module match those in the running kernel, it is assumed
that the module can be safely loaded. Kernel hackers tend not to use
module versioning, which explains why it took so long to get this feature
fixed. Production kernels shipped by distributors need this feature,
however; otherwise it is essentially impossible for vendors to support
binary loadable modules.
Kai Germaschewski has posted a modversions
implementation which works with recent 2.5 kernels. The underlying idea is
essentially the same as that found in previous implementations, but the
implementation is entirely different.
The old scheme used the genksyms program to generate the
checksums, and to create a bunch of include files (ending in .ver)
which redefined the exported kernel names to include those checksums. The
effect was to create a bunch of preprocessor definitions like:
#define printk printk_R1b7d4074
(The actual definitions were a little more complicated). Loadable modules
would thus be built to call printk_R1b7d4074() instead of
printk(). The names were stored in that form in the kernel symbol
table, so the insmod program simply needed to look for a direct
match. If the interface had changed, the names would not match, and the
module would refuse to load.
The new implementation does away with the include files. It does
still use genksyms, but the output is reprocessed into a set of
structure declarations. One structure (containing the symbol name and its
checksum) is created for each symbol used by the module; the array of
structures is then linked into the module in a special section. When a
module is loaded into the kernel, the checksums are used to verify
compatibility, and the special section can be discarded. Among other
things, this approach makes it easier to force the loading of a module with
mismatched symbols, should anybody be unwise enough to attempt such a thing.
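The per-symbol records might look something like the following; the field layout and section name here are illustrative, not necessarily the exact ones in Kai's patch, and the checksum value is simply reused from the printk example earlier in the text:

```c
/* One record per symbol the module imports. */
struct modversion_info {
    unsigned long crc;            /* checksum of the symbol's interface */
    char name[60];                /* name of the exported symbol */
};

/* Each module carries an array of these, linked into a dedicated ELF
 * section; at load time the checksums are compared against those of
 * the running kernel, and the section can then be discarded. */
static const struct modversion_info my_versions[]
    __attribute__((section("__versions"))) = {
    { 0x1b7d4074UL, "printk" },
};
```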
Page editor: Jonathan Corbet