The current development kernel is 2.6.0-test7; there have been no
development kernel releases in the last week.
Linus's BitKeeper tree does contain a pile of patches, most of which are
stability fixes, as one would expect. It also includes a (controversial)
patch to allow kernel threads to handle signals properly, a fix for a
possible interrupt handling deadlock, and a workaround for an AMD Opteron
problem.
The current stable kernel is 2.4.22. Marcelo released 2.4.23-pre7 on October 9; it includes
Jens Axboe's laptop mode patch, a new MegaRAID driver, BIOS enhanced disk
detection support, USB gadget support, and various other fixes and
updates. The plan is apparently to get the first release candidate out
within a month.
Kernel development news
Some attention has been given to the "2.7 thoughts" list which has been
circulating on linux-kernel. Looking
forward to what can be done in the next development series can be an
interesting exercise. In this case, though, the exercise has mostly been
carried out by people who will not actually be doing the work; as a result,
the list has been dismissed by a few kernel hackers; one called it
"crackpot wishlist gunk."
So what are the crackpots wishing for? Some of the items they want (marked
"mandatory features" on the list) are already in the works; these include
support for CPU hotplugging, full NTFS support and virtual machine
support. Others are somewhat vague, including "complete user quota
centralization" and "improve kobject model for security, quota rendering."
And some will never happen; there is just not a whole lot of call for
features like an in-kernel Gopher server or a /proc implementation
of the loadable module tools.
Kernel hackers have far more respect for code (and those who produce it)
than they do for list makers. The 2.7 thoughts list may yet inspire
somebody to do some hacking, but its influence on the development process
is likely to remain small.
A more interesting view into what could happen with 2.7 might be found in a
conversation between Linus and Joel Becker of Oracle. The discussion turned
to what information was needed from the kernel to perform direct I/O, which led to this outburst from Linus:
Have you ever noticed that O_DIRECT is a piece of crap? The
interface is fundamentally flawed, it has nasty security issues, it
lacks any kind of sane synchronization, and it exposes stuff that
shouldn't be exposed to user space.
Linus went on to wish an early death upon disk-based databases; he seems to
think that all but the largest databases should just be done in-memory.
Direct I/O does bring its share of problems. It is hard to keep the kernel
page cache in a coherent condition when I/O operations are allowed to
circumvent it; page cache confusion can lead to corrupted data. Getting
good performance out of direct I/O is hard unless asynchronous I/O is used
as well. Direct I/O can also confuse the disk I/O scheduler by creating
request patterns (especially overlapping requests) which don't otherwise
happen. In other words, the direct I/O idea is hard to get right for both
kernel and user space.
But systems like Oracle do need some of the capabilities that direct I/O
provides. They need to be able to move large amounts of data without
polluting the page cache with stuff that will not be used. Databases which
use shared storage need to be able to force data to be reread from disk
when another system has changed it. Large applications also tend to have a
better idea of how their access patterns work than the kernel does; they
know when a particular block of data will not be used any more. The need
for the level of control and performance direct I/O can provide will
persist, whether it is a "piece of crap" or not.
Linus seems to understand this need; he would just like to push development
toward what he sees as a better interface. Such an interface would work
with the page cache, rather than trying to circumvent it. Some of his
thoughts, as expressed in this posting, include:
- A mechanism for moving pages between user space and the page cache.
An application wishing to do a direct write would then just transfer
ownership of the pages containing the data to the kernel, which would
put them into the page cache. A simple flush finishes the job.
- A way for an application to tell the kernel that certain pages in the
cache are stale and should not be used. This mechanism could also be
used to tell the kernel about pages which are no longer needed and can
be dropped from the cache. The fadvise() system call already
does part of this task.
- The ability to mark I/O on a particular file descriptor (or by a
particular process) as being a one-shot affair that should not be
cached. This idea was suggested in response to a description of performance
problems triggered by the PostgreSQL vacuum operation, which
touches much of the database exactly once.
Much time and effort over the 2.5 development series went into making
direct I/O work well. This work helped to close a gap between Linux and
some proprietary Unix systems. It could well be that, in 2.7, that effort
goes into coming up with a better way of solving the problem altogether.
Certain kernel subsystems - journaling filesystems in particular - have some
strict requirements about how their disk I/O operations are ordered. Open
transactions must be committed to the journal before the actual filesystem
structure can be touched. If this requirement is not met, the integrity of
the filesystem could be lost if a crash happens at the wrong time.
One way to implement ordering is to explicitly wait on the buffers that
must make it to disk. If no new operations are submitted before the old
ones complete, the ordering requirements will be met (though write caching
in disk drives can create problems of its own). This waiting is hard on
performance, however; the filesystem would be better off queuing up new
requests than waiting for the old ones to complete.
As a way of improving journaling filesystem performance, the design goals for
the block layer rework in 2.5 included write barriers. A write barrier is
simply a specially marked I/O request; the block layer will not reorder any
other request past a barrier request in either direction. In this way, all
requests issued prior to the barrier request are guaranteed to be completed
before any requests issued after the barrier are begun. With this feature,
a journaling system can simply issue a barrier request when it commits its
journal, then go on with implementing the next transaction.
The problem is that barriers don't actually work yet. That little
shortcoming shouldn't last much longer, however, now that Jens Axboe has dusted off his write barrier patch and is
actively working on it again.
Barrier requests still work pretty much as described in the LWN Driver Porting series. A driver which
honors barriers must now inform the block layer of that fact, however, with
a call to:
void blk_queue_ordered(request_queue_t *queue, int flag);
where flag is QUEUE_ORDERED_NONE if the device does not
support barriers (the default), QUEUE_ORDERED_TAG if barriers are
implemented with ordered command tags, or QUEUE_ORDERED_FLUSH if
an explicit hardware flush command is used. If higher-level code attempts
to create a barrier request for a device which does not support them, the
block layer will return an error.
The code does not currently appear to care which of the two methods a
driver says it implements, as long as it picks one.
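Putting the pieces together, a driver's queue setup under this patch might look like the following sketch. This is not runnable outside a kernel tree, and the function name is hypothetical; only the blk_queue_ordered() call and the flag names come from the patch as described above.

```c
/* Hypothetical 2.6-era driver initialization (sketch): declare that this
 * queue implements barriers with an explicit cache-flush command, as the
 * patch's IDE implementation does. */
static void mydrv_init_queue(request_queue_t *q)
{
        /* QUEUE_ORDERED_TAG would be used instead if the hardware
         * supported ordered command tags. */
        blk_queue_ordered(q, QUEUE_ORDERED_FLUSH);
}
```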
Also included with the patch is a barrier implementation for IDE drives
(using QUEUE_ORDERED_FLUSH) and simple patches to a couple of
filesystems to make them use the barrier feature. Now it's mostly a matter
of waiting to see whether Linus considers barriers to be suitable for 2.6.
William Lee Irwin recently tried the 2.6.0-test kernel
on a system limited to 16MB of memory. In the modern world,
that is a shockingly small amount of RAM, just slightly above storing your
data on an abacus. There are people out there, however, who are doing
their best to get work done on limited hardware, and, as Andrew Morton
says, "we should try to not suck in this situation." William's results
indicate that some work is still required for 2.6 to perform adequately on
such small systems.
One of the more striking results from this test is that a substantial chunk
of the system's memory is consumed by the inode and dentry caches. Those
caches, in fact, took up over 10% of the memory which was available at boot
time. If some way could be found to reduce the size of the inode and
dentry caches, enough memory would be freed to make a noticeable difference
on low-memory systems.
The culprit in this case is sysfs. Each entry in sysfs creates an inode
and a directory entry, and both are pinned into memory for the life of the
system. Pinning the
entries is a standard way of creating virtual filesystems in the kernel; it
frees the code from the need to create any sort of backing store for the
filesystem. This scheme works less well when a filesystem can have
thousands of entries, however. Even a minimal system's sysfs directory can
have several hundred files and directories, and there is a clear intent to
add many more.
One approach to the problem is to simply get rid of sysfs; Andrew Morton
has posted a patch which adds a
"nosysfs" boot-time option. This capability may be of interest to
creators of embedded systems and such, but it is hard to see its utility
extending much beyond that. Sysfs is becoming an increasingly important
communications channel between user and kernel space; it can't just be
ripped out without breaking things.
So the kernel hackers will have to figure out how to preserve sysfs while
trimming its memory requirements. One set of patches posted recently tried
to achieve this goal by adding a real, in-kernel backing store for sysfs.
The patch did not get very far, however, because it made the kobject
structure significantly bigger. The real solution will probably involve a
bit of clever filesystem hacking. The internal kobject hierarchy contains
the information that is really needed to implement sysfs; the existing
cached inodes and dentries just make it work easily. But those cached
entries - especially those for the attributes that make up the bottom
leaves of sysfs - could be generated on demand when user space
actually needs them. It will take some work, but users of small systems
will doubtless be thankful for the result.
The Linux kernel tries to save power by, among other things, halting the
processor when there is no work to be done. The processor's sleep can be
fitful, however; even when there is no work, the timer interrupt will
continue to wake the processor every 1/1000 to 1/100 of a second,
depending on the kernel's HZ value.
George Anzinger's new variable scheduling
timeouts (VST) patch
seeks to solve this problem by eliminating timer
interrupts when there is nothing for that interrupt to do.
The kernel timer interrupt is responsible for keeping track of time for
the kernel by updating the value of jiffies and handling other
housekeeping and process accounting functions.
When processing the timer interrupt, the kernel will periodically also check
the timer list to see if any kernel timers have expired and if so, call
the completion function for that timer. Timers in the kernel are one of
the mechanisms used to schedule work that needs to be done in the future.
In the absence of a running process, the only real work that needs to be
done in the timer interrupt is the maintenance of the timer list.
When no processes are running,
the VST patch causes the idle task to scan the timer list and delay the
timer interrupt if there are no timers that will expire in the next timer
tick. It does this by changing the value in the Programmable Interrupt
Timer (PIT) to generate an interrupt when the next timer is set to
expire. The resolution of the PIT only allows delays of up to about 50ms,
so that is currently the limit on how long the timer interrupt can be
held off, but
there are plans to use the Real Time Clock hardware in the future
to remove this restriction. When the timer interrupt eventually occurs,
the VST code will update jiffies and do the necessary housekeeping
to handle the amount of time that has been missed.
If the system is idle, there are no runnable tasks currently active, but
an interrupt from the hardware could change that situation. To handle this
case, the VST patch
hooks into the low-level interrupt handling code to re-enable the timer
interrupt when another interrupt occurs. It also runs the timer interrupt
at that time to update the kernel time information
as if the timer interrupts had occurred normally.
The benefit of this patch is that when the system is idle
the kernel can halt the processor in order to
conserve power. Eliminating needless timer interrupts helps to keep the
processor idle longer.
The result is that battery-operated, Linux-based devices
can operate longer on a single charge, which should make PDA and laptop
users happier. As of this writing, there are no hard numbers on how
much this patch reduces power consumption; hopefully some information on
that will be forthcoming.
Page editor: Jonathan Corbet