Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.0-test7; there have been no development kernel releases in the last week.

Linus's BitKeeper tree does contain a pile of patches, most of which are stability fixes as one would expect. It also includes a (controversial) patch to allow kernel threads to handle signals properly, a fix for a possible interrupt handling deadlock, and a workaround for the AMD Opteron prefetch bug.

The current stable kernel is 2.4.22. Marcelo released 2.4.23-pre7 on October 9; it includes Jens Axboe's laptop mode patch, a new MegaRAID driver, BIOS enhanced disk detection support, USB gadget support, and various other fixes and updates. The plan is apparently to get the first release candidate out within a month.

Kernel development news

Looking forward to 2.7

Some attention has been given to the "2.7 thoughts" list which has been circulating on linux-kernel. Looking forward to what can be done in the next development series can be an interesting exercise. In this case, though, the exercise has mostly been carried out by people who will not actually be doing the work. As a result, a few kernel hackers have dismissed the list; one called it "crackpot wishlist gunk."

So what are the crackpots wishing for? Some of the items they want (marked "mandatory features" on the list) are already in the works; these include support for CPU hotplugging, full NTFS support and virtual machine support. Others are somewhat vague, including "complete user quota centralization" and "improve kobject model for security, quota rendering." And some will never happen; there is just not a whole lot of call for features like an in-kernel Gopher server or a /proc implementation of the loadable module tools.

Kernel hackers have far more respect for code (and those who produce it) than they do for list makers. The 2.7 thoughts list may yet inspire somebody to do some hacking, but its influence on the development process is likely to remain small.

A more interesting view into what could happen with 2.7 might be found in a conversation between Linus and Joel Becker of Oracle. The discussion turned to what information was needed from the kernel to perform direct I/O, which led to this outburst from Linus:

Have you ever noticed that O_DIRECT is a piece of crap? The interface is fundamentally flawed, it has nasty security issues, it lacks any kind of sane synchronization, and it exposes stuff that shouldn't be exposed to user space.

Linus went on to wish an early death upon disk-based databases; he seems to think that all but the largest databases should just be done in-memory.

Direct I/O does bring its share of problems. It is hard to keep the kernel page cache in a coherent condition when I/O operations are allowed to circumvent it; page cache confusion can lead to corrupted data. Getting good performance out of direct I/O is hard unless asynchronous I/O is used as well. Direct I/O can also confuse the disk I/O scheduler by creating request patterns (especially overlapping requests) which don't otherwise happen. In other words, the direct I/O idea is hard to get right for both kernel and user space.

But systems like Oracle do need some of the capabilities that direct I/O provides. They need to be able to move large amounts of data without polluting the page cache with stuff that will not be used. Databases which use shared storage need to be able to force data to be reread from disk when another system has changed it. Large applications also tend to have a better idea of how their access patterns work than the kernel does; they know when a particular block of data will not be used any more. The need for the level of control and performance direct I/O can provide will persist, whether it is a "piece of crap" or not.

Linus seems to understand this need; he would just like to push development toward what he sees as a better interface. Such an interface would work with the page cache, rather than trying to circumvent it. Some of his thoughts, as expressed in this posting, include:

  • A mechanism for moving pages between user space and the page cache. An application wishing to do a direct write would then just transfer ownership of the pages containing the data to the kernel, which would put them into the page cache. A simple flush finishes the job.

  • A way for an application to tell the kernel that certain pages in the cache are stale and should not be used. This mechanism could also be used to tell the kernel about pages which are no longer needed and can be dropped from the cache. The fadvise() system call already does part of this task.

  • The ability to mark I/O on a particular file descriptor (or by a particular process) as being a one-shot affair that should not be cached. This idea was suggested in response to a description of performance problems triggered by the PostgreSQL vacuum operation, which touches much of the database exactly once.
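The "read once, then drop it" pattern behind the second and third items can already be approximated from user space. What follows is a minimal sketch, not a proposal from the discussion: it assumes a C library which exposes the fadvise() call as posix_fadvise(), and the helper name read_once_and_drop() is made up for illustration. The advice is only a hint; the kernel is free to ignore it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for the posix_fadvise() declaration */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a file sequentially exactly once, then tell
 * the kernel that its cached pages are no longer needed.
 * Returns the number of bytes read, or -1 on error.
 */
long read_once_and_drop(const char *path)
{
    char buf[4096];
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The one and only pass over the data. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    if (n < 0) {
        close(fd);
        return -1;
    }

    /*
     * Offset 0 with length 0 covers the whole file. This is advisory:
     * a failure here does not affect the data already read, so the
     * return value is deliberately ignored.
     */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return total;
}
```

A vacuum-style scan could call such a helper on each file it touches; whether the pages really leave the cache remains up to the kernel.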

Much time and effort over the 2.5 development series went into making direct I/O work well. This work helped to close a gap between Linux and some proprietary Unix systems. It could well be that, in 2.7, that effort goes into coming up with a better way of solving the problem altogether.

Making write barriers actually work

Certain kernel subsystems - journaling filesystems in particular - have some strict requirements about how their disk I/O operations are ordered. Open transactions must be committed to the journal before the actual filesystem structure can be touched. If this requirement is not met, the integrity of the filesystem could be lost if a crash happens at the wrong time.

One way to implement ordering is to explicitly wait on the buffers that must make it to disk. If no new operations are submitted before the old ones complete, the ordering requirements will be met (though write caching in disk drives can create problems of its own). This waiting is hard on performance, however; the filesystem would be better off setting up more requests than waiting for the old ones.

As a way of improving journaling filesystem performance, the design goals for the block layer rework in 2.5 included write barriers. A write barrier is simply a specially marked I/O request; the block layer will not reorder any other request past a barrier request in either direction. In this way, all requests issued prior to the barrier request are guaranteed to be completed before any requests issued after the barrier are begun. With this feature, a journaling system can simply issue a barrier request when it commits its journal, then go on with implementing the next transaction.
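The ordering rule can be illustrated with a toy model. The sketch below is ordinary user-space C, not block layer code, and every name in it (toy_request, toy_schedule) is invented for the example. It sorts requests by sector, as an I/O scheduler might, but treats each barrier as a fixed point that no request may cross.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an I/O request; not a kernel structure. */
struct toy_request {
    unsigned long sector;
    int is_barrier;
};

static int cmp_sector(const void *a, const void *b)
{
    const struct toy_request *ra = a;
    const struct toy_request *rb = b;

    return (ra->sector > rb->sector) - (ra->sector < rb->sector);
}

/*
 * Reorder the queue by sector, but only within the runs of requests
 * that lie between barriers. A barrier keeps its position, so every
 * request queued before it stays before it, and every request queued
 * after it stays after it.
 */
void toy_schedule(struct toy_request *q, size_t n)
{
    size_t start = 0;
    size_t i;

    assert(q != NULL);

    for (i = 0; i <= n; i++) {
        if (i == n || q[i].is_barrier) {
            if (i > start)
                qsort(q + start, i - start, sizeof(*q), cmp_sector);
            start = i + 1;   /* next run begins after the barrier */
        }
    }
}
```

Given the queue 50, 10, barrier(30), 5, 1, this model yields 10, 50, barrier(30), 1, 5: each side is sorted, but nothing moves past the barrier.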

The problem is that barriers don't actually work yet. That little shortcoming shouldn't last much longer, however, now that Jens Axboe has dusted off his write barrier patch and is actively working on it again.

Barrier requests still work pretty much as described in the LWN Driver Porting series. A driver which honors barriers must now inform the block layer of that fact, however, with a call to:

    void blk_queue_ordered(request_queue_t *queue, int flag);

where flag is QUEUE_ORDERED_NONE if the device does not support barriers (the default), QUEUE_ORDERED_TAG if barriers are implemented with ordered command tags, or QUEUE_ORDERED_FLUSH if an explicit hardware flush command is used. If higher-level code attempts to create a barrier request for a device which does not support them, the block layer will return an error. The code does not currently appear to care which of the two methods a driver says it implements, as long as it picks one.

Also included with the patch is a barrier implementation for IDE drives (using QUEUE_ORDERED_FLUSH) and simple patches to a couple of filesystems to make them use the barrier feature. Now it's mostly a matter of waiting to see whether Linus considers barriers to be a stability-related patch.

Sysfs and small memory machines

William Lee Irwin recently tried the 2.6.0-test kernel on a system limited to 16MB of memory. In the modern world, that is a shockingly small amount of RAM, just slightly above storing your data on an abacus. There are people out there, however, who are doing their best to get work done on limited hardware, and, as Andrew Morton says, "we should try to not suck in this situation." William's results indicate that some work is still required for 2.6 to perform adequately on low-end hardware.

One of the more striking results from this test is that a substantial chunk of the system's memory is consumed by the inode and dentry caches. Those caches, in fact, took up over 10% of the memory which was available at boot time. If some way could be found to reduce the size of the inode and dentry caches, enough memory would be freed to make a noticeable difference on low-memory systems.

The culprit in this case is sysfs. Each entry in sysfs creates an inode and a directory entry, and both are pinned into memory for the life of the system. Pinning the entries is a standard way of creating virtual filesystems in the kernel; it frees the code from the need to create any sort of backing store for the filesystem. This scheme works less well when a filesystem can have thousands of entries, however. Even a minimal system's sysfs directory can have several hundred files and directories, and there is a clear intent to add many more.

One approach to the problem is to simply get rid of sysfs; Andrew Morton has posted a patch which adds a "nosysfs" boot-time option. This capability may be of interest to creators of embedded systems and such, but it is hard to see its utility extending much beyond that. Sysfs is becoming an increasingly important communications channel between user and kernel space; it can't just be ripped out without breaking things.

So the kernel hackers will have to figure out how to preserve sysfs while trimming its memory requirements. One set of patches posted recently tried to achieve this goal by adding a real, in-kernel backing store for sysfs. The patch did not get very far, however, because it made the kobject structure significantly bigger. The real solution will probably involve a bit of clever filesystem hacking. The internal kobject hierarchy contains the information that is really needed to implement sysfs; the existing cached inodes and dentries just make it work easily. But those cached entries - especially those for the attributes that make up the bottom leaves of sysfs - could be generated on demand when user space actually needs them. It will take some work, but users of small systems will doubtless be thankful for the result.

Letting sleeping processors lie

October 15, 2003

This article was contributed by Jake Edge.

The Linux kernel tries to save power by, among other things, halting the processor when there is no work to be done. The processor's sleep can be fitful, however; even when there is no work, the timer interrupt will continue to wake the processor every 1/1000 to 1/100 second. George Anzinger's new variable scheduling timeouts (VST) patch seeks to solve this problem by eliminating timer interrupts when there is nothing for that interrupt to do.

The kernel timer interrupt is responsible for keeping track of time for the kernel by updating the value of jiffies and handling other housekeeping and process accounting functions. When processing the timer interrupt, the kernel will periodically also check the timer list to see if any kernel timers have expired and if so, call the completion function for that timer. Timers in the kernel are one of the mechanisms used to schedule work that needs to be done in the future. In the absence of a running process, the only real work that needs to be done in the timer interrupt is the maintenance of the timer list.

When no processes are running, the VST patch causes the idle task to scan the timer list and delay the timer interrupt if there are no timers that will expire in the next timer tick. It does this by reprogramming the Programmable Interval Timer (PIT) to generate an interrupt when the next timer is set to expire. The PIT can only be programmed for intervals up to about 50ms, so that is currently the limit of how long a timer interrupt can be held off; there are plans to use the Real Time Clock hardware in the future to remove this restriction. When the timer interrupt eventually occurs, the VST code will update jiffies and do the necessary housekeeping to handle the amount of time that has been missed.
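The decision made by the idle task can be sketched as a small calculation. The code below is an illustration of the idea, not code from the VST patch: the function name and its parameters are hypothetical, the timer list is simplified to a plain array of expiry times in jiffies, and wraparound of the jiffies counter is ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: how many ticks may the next timer interrupt be
 * held off? The answer is the distance to the earliest pending timer,
 * capped by what the hardware can be programmed for (about 50ms for
 * the PIT). Jiffies wraparound is deliberately not handled.
 */
unsigned long vst_ticks_until_wakeup(unsigned long now,
                                     const unsigned long *expires,
                                     size_t n,
                                     unsigned long hz,
                                     unsigned long max_holdoff_ms)
{
    unsigned long max_ticks;
    unsigned long best;
    size_t i;

    assert(hz > 0);
    assert(expires != NULL || n == 0);

    /* Convert the hardware limit from milliseconds to ticks. */
    max_ticks = max_holdoff_ms * hz / 1000;
    if (max_ticks < 1)
        max_ticks = 1;   /* never wait less than one tick */

    best = max_ticks;
    for (i = 0; i < n; i++) {
        /* A timer that is already due is serviced at the next tick. */
        unsigned long delta = expires[i] <= now ? 1 : expires[i] - now;

        if (delta < best)
            best = delta;
    }
    return best;
}
```

With a 1000Hz tick and a 50ms limit, an idle system whose nearest timer is 20 jiffies away would sleep for 20 ticks instead of being woken 20 times; with no pending timers at all it would still wake after 50 ticks.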

If the system is idle, there are no runnable tasks currently active, but an interrupt from the hardware could change that situation. To handle this case, the VST patch hooks into the low-level interrupt handling code to re-enable the timer interrupt when another interrupt occurs. It also runs the timer interrupt service routine at that time to update the kernel time information as if the timer interrupts had occurred normally.

The benefit of this patch is that when the system is idle the kernel can halt the processor in order to conserve power. Eliminating needless timer interrupts helps to keep the processor idle longer. The result is that battery-operated, Linux-based devices can operate longer on a single charge, which should make PDA and laptop users happier. As of this writing, there are no hard numbers on how well this patch reduces power consumption; hopefully some information on that will be forthcoming.

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds