Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.25-rc7, released on March 25. Says Linus: "The shortlog has more details, but it boils down to some reverts, some docbook fixes, some sparse annotation fixups, a number of trivial patches, and a healthy sprinkling of small fixups. Give it a good testing, because we're hopefully now well on our way towards that eventual real 2.6.25 release!" Said shortlog can be found in the announcement, or see the long-format changelog for the details.
The current stable 2.6 kernel is 2.6.24.4, released on March 24. This release contains a large number of patches for significant problems in the 2.6.24 kernel.
Kernel development news
Quotes of the week
Kernel markers and binary-only modules
Kernel markers are a mechanism which allows developers to put static tracepoints into the kernel. Once placed, these markers can be used by operations staff to trace well-known events on running systems without having to know anything about the kernel code involved. Solaris provides a long list of static tracepoints for use with DTrace, but Linux, thus far, has none. That situation should eventually change - static markers were only merged into the mainline in 2.6.24. But, as the developers start to look more seriously at markers, some interesting issues are coming up.
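For those who have not seen the interface, placing a marker is essentially a one-line affair. The snippet below is a rough sketch based on the 2.6.24 marker documentation; the function name, marker name, and arguments are invented for illustration:

    #include <linux/module.h>
    #include <linux/marker.h>

    /* Hypothetical driver function; the marker name and format are made up. */
    static void mydrv_submit(int queue, unsigned int len)
    {
            /*
             * Record the event.  This is nearly free until a probe is
             * registered on the "mydrv_submit" marker, at which point
             * the formatted arguments are handed to that probe.
             */
            trace_mark(mydrv_submit, "queue %d len %u", queue, len);

            /* ... the actual work of the function ... */
    }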
One of those emerged as a result of this patch from Mathieu Desnoyers which allows proprietary modules to contain markers. The fact that current kernels do not recognize markers in binary-only modules is mostly an accident: markers are disabled in modules with any sort of taint flag set as a way to prevent kernel crashes - a kernel oops being a rather heavier-weight marker than most people wish to encounter. Mathieu tightened that test in a way that allows markers in proprietary modules, saying "let's see how people react". Needless to say, he saw.
One might well wonder why the kernel developers, not known for their sympathy toward proprietary modules in general, would want to consider letting those modules include static tracepoints. The core argument here is that static markers allow proprietary modules to export a bit more internal information to the kernel, and to their users. It is seen as a sort of (very) small opening up on the part of the proprietary module writer. Mathieu says:
The idea is that, by placing these tracepoints, module authors can help others learn more about what's going on inside the module and help people track down problems. The result should be a more stable kernel which - whether proprietary modules have been loaded or not - is generally considered to be a good thing.
On the other hand, there's no shortage of developers who are opposed to extending any sort of helping hand to binary module authors. Giving those modules more access to Linux kernel internals, it is argued, only leads to trouble. Ingo Molnar put it this way:
Ingo also worries that allowing binary modules to use markers will serve to make the marker API that much harder to change in the future. Since that API is quite young, chances are good that changes will happen. As much as the kernel developers profess not to care about binary-only modules, the fact of the matter is that there are good reasons to avoid breaking those modules. The testing community certainly gets smaller when testers cannot load the modules they need to make their systems work in the manner to which they have become accustomed. So it is possible that allowing proprietary modules to use markers could make the marker API harder to fix in future kernel releases.
The grumbles have been loud enough that Mathieu's patch will probably not be merged for 2.6.25. The idea is likely to come back again, but not necessarily right away: the marker feature may have been merged in 2.6.24, but it would appear that 2.6.25 will be released with no actual markers defined in the source. It is hard to see binary-only module authors pushing to add tracepoints when none of the other developers are doing so. Until somebody starts actually using static markers, debates over where they may be used will remain academic.
Predictive ELF bitmaps
When the kernel executes a program, it must retrieve the code from disk, which it normally does by demand paging it in as required by the execution path. If the kernel could somehow know which pages would be needed, it could page them in more efficiently. Andi Kleen has posted an experimental set of patches that do just that.
Programs do not know about their layout on disk, nor is their path through the executable file optimized to reduce seeking, but with some information about which pages will be needed, the kernel can optimize the disk accesses. If one were to gather a list of the pages that get faulted in as a program runs, that information could be saved for future runs. It could then be turned into a bitmap indicating which of the pages should be prefetched.
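For the curious, the closest user-space analogue to this training step is probably mincore(), which reports which pages of a mapping are currently resident. The sketch below is purely illustrative - it is not how Kleen's patch works (his code walks the page tables from within the kernel as the process exits) - but it shows roughly what such a bitmap might look like:

    #define _DEFAULT_SOURCE
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /*
     * Build a one-bit-per-page residency bitmap for a mapped region:
     * a bit is set if the corresponding page is currently in memory.
     */
    static unsigned char *residency_bitmap(void *addr, size_t length)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            size_t npages = (length + pagesize - 1) / pagesize;
            unsigned char *vec = malloc(npages);             /* mincore() output */
            unsigned char *bitmap = calloc((npages + 7) / 8, 1);
            size_t i;

            if (!vec || !bitmap || mincore(addr, length, vec) != 0) {
                    free(vec);
                    free(bitmap);
                    return NULL;
            }
            for (i = 0; i < npages; i++)
                    if (vec[i] & 1)
                            bitmap[i / 8] |= 1 << (i % 8);
            free(vec);
            return bitmap;
    }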
Once you have such a bitmap, where to store it becomes a problem. Kleen's method uses a "hack" to the on-disk ELF format, putting the bitmap at the end of the executable. This approach has a number of drawbacks: a seek to get the information, modification of the executable each time it is trained, and only a single usage pattern system-wide. It does have one very nice attribute, though: the bitmap and the executable stay in sync; if the executable changes, due to an upgrade for instance, the bitmap gets cleared in the process. Alternative bitmap storage locations—somewhere in users' home directories, for example—do not have this property.
Andrew Morton questions whether this need be done in the kernel at all:
Ulrich Drepper does not want to see the ELF format abused in the fashion it was for this patch; Kleen does not want that either, but used it as an expedient. Drepper thinks the linker should be taught to emit a new header type which would store the bitmap. It would sit near the beginning of the ELF file, eliminating the seek. A problem with that approach is that old binaries would not be able to take advantage of the technique; they would have to be re-linked.
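What the consuming side of that suggestion might look like is easy enough to imagine. The segment type below (PT_PREFETCH_BITMAP) is entirely hypothetical - no such header has been defined - but the scan over the program headers, which live near the start of the file, is standard ELF fare:

    #include <elf.h>
    #include <stddef.h>

    /* Hypothetical segment type in the OS-specific range. */
    #define PT_PREFETCH_BITMAP  0x6ffffff0

    /*
     * Look through the program headers for the (hypothetical) segment
     * describing the prefetch bitmap; return NULL if there is none.
     */
    static const Elf64_Phdr *find_bitmap_phdr(const Elf64_Ehdr *ehdr)
    {
            const Elf64_Phdr *phdr =
                    (const Elf64_Phdr *)((const char *)ehdr + ehdr->e_phoff);
            int i;

            for (i = 0; i < ehdr->e_phnum; i++)
                    if (phdr[i].p_type == PT_PREFETCH_BITMAP)
                            return &phdr[i];
            return NULL;
    }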
Then the question arises: how does that bitmap get initialized? Drepper suggests that SystemTap be used:
Kleen's patch walks the page tables for a process when it is exiting, setting a bit in the bitmap if that page has been faulted in. Drepper sees this as suboptimal:
The problem is in finding the balance between just prefetching the entire executable—which might be very wasteful—and prefetching the subset of pages that are most commonly used. It will take some heuristics to make that decision. As Drepper points out, recording the entire runtime of a program "will result in all the pages of a program to be marked (unless you have a lot of dead code in the binary and it's all located together)."
The place where Drepper sees a need for kernel support is in providing a bitmap interface to madvise() so that any holes in the pages to be prefetched do not get filled in by the readahead mechanism. The current interface would require a call to madvise() for each contiguous region, which could add up to a large number of calls. Both he and Morton favor doing the bulk of the work in user space.
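To see where the calls add up, here is a sketch of what prefetching from a bitmap looks like with the madvise() interface as it exists now: one MADV_WILLNEED call per contiguous run of set bits. The helper itself is hypothetical:

    #include <sys/mman.h>
    #include <stddef.h>

    /*
     * Issue one madvise(MADV_WILLNEED) call per contiguous run of set
     * bits; a sparse bitmap can translate into a great many system calls.
     */
    static void prefetch_from_bitmap(char *base, const unsigned char *bitmap,
                                     size_t npages, size_t pagesize)
    {
            size_t i = 0;

            while (i < npages) {
                    size_t start, len;

                    /* Skip pages that are not wanted. */
                    while (i < npages && !(bitmap[i / 8] & (1 << (i % 8))))
                            i++;
                    start = i;
                    /* Find the end of this contiguous run of wanted pages. */
                    while (i < npages && (bitmap[i / 8] & (1 << (i % 8))))
                            i++;
                    len = i - start;
                    if (len)
                            madvise(base + start * pagesize, len * pagesize,
                                    MADV_WILLNEED);
            }
    }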
Overall, there is lots more work to do before "predictive bitmaps" make their way into a Linux system—if they ever do. To start with, some benchmarking will have to be done to show that performance improves enough to consider making a change like this. David Miller expresses some pessimism about the approach:
Frankly, based upon my experiences then and what I know now, I think it's a lose to do this.
It is an interesting idea, though, one that will likely crop up again if this particular incarnation does not go anywhere. Since the biggest efficiency gain comes from reducing seeks, however, it may not be interesting in the long term. As Morton says, "solid-state disks are going to put a lot of code out of a job."
Atomic context and kernel API design
An API should refrain from making promises that it cannot keep. A recent episode involving the kernel's in_atomic() macro demonstrates how things can go wrong when a function does not really do what it appears to do. It is also a good excuse to look at an under-documented (but fundamental) aspect of kernel code design.

Kernel code generally runs in one of two fundamental contexts. Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary. But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.
There is more to it than that, though: any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.
"Deadlocking forever" tends not to appear on users' wishlists for the kernel, so the kernel developers go out of their way to avoid that situation. To that end, code which is running in atomic context carefully follows a number of rules, including (1) no access to user space, and, crucially, (2) no sleeping. Problems can result, though, when a particular kernel function does not know which context it might be invoked in. The classic example is kmalloc() and friends, which take an explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying whether sleeping is possible or not.
The wish to write code which can work optimally in either context is common, though. Some developers, while trying to write such code, may well stumble across the following definitions from <linux/hardirq.h>:
    /*
     * Are we doing bottom half or hardware interrupt processing?
     * Are we in a softirq context? Interrupt context?
     */
    #define in_irq()        (hardirq_count())
    #define in_softirq()    (softirq_count())
    #define in_interrupt()  (irq_count())
    #define in_atomic()     ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
It would seem that in_atomic() would fit the bill for any developer trying to decide whether a given bit of code needs to act in an atomic manner at any specific time. A quick grep through the kernel sources shows that, in fact, in_atomic() has been used in quite a few different places for just that purpose. There is only one problem: those uses are almost certainly all wrong.
The in_atomic() macro works by checking whether preemption is disabled, which seems like the right thing to do. Handlers for events like hardware interrupts will disable preemption, but so will the acquisition of a spinlock. So this test appears to catch all of the cases where sleeping would be a bad idea. Certainly a number of people who have looked at this macro have come to that conclusion.
But if preemption has not been configured into the kernel in the first place, the kernel does not raise the "preemption count" when spinlocks are acquired. So, in this situation (which is common - many distributors still do not enable preemption in their kernels), in_atomic() has no way to know if the calling code holds any spinlocks or not. So it will return zero (indicating process context) even when spinlocks are held. And that could lead to kernel code thinking that it is running in process context (and acting accordingly) when, in fact, it is not.
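As a concrete (and invented) example of the trap, consider code like the following, which mirrors the pattern found in a number of in-kernel users:

    #include <linux/hardirq.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /*
     * Broken: on a kernel built without CONFIG_PREEMPT, spin_lock()
     * does not raise the preemption count, so in_atomic() returns
     * zero here and kmalloc() may sleep while the lock is held.
     */
    static void *broken_alloc(spinlock_t *lock, size_t size)
    {
            void *p;

            spin_lock(lock);
            p = kmalloc(size, in_atomic() ? GFP_ATOMIC : GFP_KERNEL);
            spin_unlock(lock);
            return p;
    }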
Given this problem, one might well wonder why the function exists in the first place, why people are using it, and what developers can really do to get a handle on whether they can sleep or not. Andrew Morton answered the first question in a relatively cryptic way:
In other words, in_atomic() works in a specific low-level situation, but it was never meant to be used in a wider context. Its placement in hardirq.h next to macros which can be used elsewhere was, thus, almost certainly a mistake. As Alan Stern pointed out, the fact that Linux Device Drivers recommends the use of in_atomic() will not have helped the situation. Your editor recommends that the authors of that book be immediately sacked.
Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:
This pattern is consistent through the kernel - once again, the GFP_ flags example stands out in this regard. But it's also clear that this practice has not been documented to the point that kernel developers understand that things should be done this way. Consider this recent posting from Rusty Russell, who understands these issues better than most:
In fact, kmalloc() cannot figure out on its own whether sleeping is allowable or not. It has to be told by the caller. This rule is unlikely to change, so expect a series of in_atomic() removal patches starting with 2.6.26. Once that work is done, the in_atomic() macro can be moved to a safer place where it will not create further confusion.
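In practice, following that rule usually means pushing a gfp_t parameter out to the exported interface so that callers, which do know their context, can say so. A hypothetical example:

    #include <linux/list.h>
    #include <linux/slab.h>

    struct pending {
            struct list_head list;
            void *data;
    };

    /*
     * Rather than guessing with in_atomic(), let the caller say whether
     * the allocation may sleep - the same convention kmalloc() uses.
     * All names here are hypothetical.
     */
    static int mydev_queue_event(struct list_head *queue, void *data, gfp_t gfp)
    {
            struct pending *p = kmalloc(sizeof(*p), gfp);

            if (!p)
                    return -ENOMEM;
            p->data = data;
            list_add_tail(&p->list, queue);
            return 0;
    }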
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet