Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.25-rc7, released on March 25. Says Linus: "The shortlog has more details, but it boils down to some reverts, some docbook fixes, some sparse annotation fixups, a number of trivial patches, and a healthy sprinkling of small fixups. Give it a good testing, because we're hopefully now well on our way towards that eventual real 2.6.25 release!" Said shortlog can be found in the announcement, or see the long-format changelog for the details.
The current stable 2.6 kernel is 2.6.24.4, released on March 24. This release contains a large number of patches for significant problems in the 2.6.24 kernel.
Kernel development news
Quotes of the week
Kernel markers and binary-only modules
Kernel markers are a mechanism which allows developers to put static tracepoints into the kernel. Once placed, these markers can be used by operations staff to trace well-known events on running systems without having to know anything about the kernel code involved. Solaris provides a long list of static tracepoints for use with DTrace, but Linux, thus far, has none. That situation should eventually change - static markers were only merged into the mainline in 2.6.24. But, as the developers start to look more seriously at markers, some interesting issues are coming up.
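For those who have not seen the interface, placing a marker is essentially a one-line affair. The snippet below is a rough sketch based on the 2.6.24 marker documentation; the function name, marker name, and arguments are invented for illustration:

    #include <linux/module.h>
    #include <linux/marker.h>

    /* Hypothetical driver function; the marker name and format are made up. */
    static void mydrv_submit(int queue, unsigned int len)
    {
            /*
             * Record the event.  This is nearly free until a probe is
             * registered on the "mydrv_submit" marker, at which point
             * the formatted arguments are handed to that probe.
             */
            trace_mark(mydrv_submit, "queue %d len %u", queue, len);

            /* ... the actual work of the function ... */
    }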
One of those emerged as a result of this patch from Mathieu Desnoyers which allows proprietary modules to contain markers. The fact that current kernels do not recognize markers in binary-only modules is mostly an accident: markers are disabled in modules with any sort of taint flag set as a way to prevent kernel crashes - a kernel oops being a rather heavier-weight marker than most people wish to encounter. Mathieu tightened that test in a way that allows markers in proprietary modules, saying "let's see how people react". Needless to say, he saw.
One might well wonder why the kernel developers, not known for their sympathy toward proprietary modules in general, would want to consider letting those modules include static tracepoints. The core argument here is that static markers allow proprietary modules to export a bit more internal information to the kernel, and to their users. It is seen as a sort of (very) small opening up on the part of the proprietary module writer. Mathieu says:
The idea is that, by placing these tracepoints, module authors can help others learn more about what's going on inside the module and help people track down problems. The result should be a more stable kernel which - whether proprietary modules have been loaded or not - is generally considered to be a good thing.
On the other hand, there's no shortage of developers who are opposed to extending any sort of helping hand to binary module authors. Giving those modules more access to Linux kernel internals, it is argued, only leads to trouble. Ingo Molnar put it this way:
Ingo also worries that allowing binary modules to use markers will serve to make the marker API that much harder to change in the future. Since that API is quite young, chances are good that changes will happen. As much as the kernel developers profess not to care about binary-only modules, the fact of the matter is that there are good reasons to avoid breaking those modules. The testing community certainly gets smaller when testers cannot load the modules they need to make their systems work in the manner to which they have become accustomed. So it is possible that allowing proprietary modules to use markers could make the marker API harder to fix in future kernel releases.
The grumbles have been loud enough that Mathieu's patch will probably not be merged for 2.6.25. The idea is likely to come back again, but not necessarily right away: the marker feature may have been merged in 2.6.24, but it would appear that 2.6.25 will be released with no actual markers defined in the source. It is hard to see binary-only module authors pushing to add tracepoints when none of the other developers are doing so. Until somebody starts actually using static markers, debates over where they may be used will remain academic.
Predictive ELF bitmaps
When the kernel executes a program, it must retrieve the code from disk, which it normally does by demand paging it in as required by the execution path. If the kernel could somehow know which pages would be needed, it could page them in more efficiently. Andi Kleen has posted an experimental set of patches that do just that.
Programs do not know about their layout on disk, nor is their path through the executable file optimized to reduce seeking, but with some information about which pages will be needed, the kernel can optimize the disk accesses. If one were to gather a list of the pages that get faulted in as a program runs, that information could be saved for future runs. It could then be turned into a bitmap indicating which of the pages should be prefetched.
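For the curious, the closest user-space analogue to this training step is probably mincore(), which reports which pages of a mapping are currently resident. The sketch below is purely illustrative - it is not how Kleen's patch works (his code walks the page tables from within the kernel as the process exits) - but it shows roughly what such a bitmap might look like:

    #define _DEFAULT_SOURCE
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /*
     * Build a one-bit-per-page residency bitmap for a mapped region:
     * a bit is set if the corresponding page is currently in memory.
     */
    static unsigned char *residency_bitmap(void *addr, size_t length)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            size_t npages = (length + pagesize - 1) / pagesize;
            unsigned char *vec = malloc(npages);             /* mincore() output */
            unsigned char *bitmap = calloc((npages + 7) / 8, 1);
            size_t i;

            if (!vec || !bitmap || mincore(addr, length, vec) != 0) {
                    free(vec);
                    free(bitmap);
                    return NULL;
            }
            for (i = 0; i < npages; i++)
                    if (vec[i] & 1)
                            bitmap[i / 8] |= 1 << (i % 8);
            free(vec);
            return bitmap;
    }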
Once you have such a bitmap, where to store it becomes a problem. Kleen's method uses a "hack" to the on-disk ELF format, putting the bitmap at the end of the executable. This approach has a number of drawbacks: a seek to get the information, modification of the executable each time it is trained, and only a single usage pattern system-wide. It does have one very nice attribute, though: the bitmap and the executable stay in sync; if the executable changes, due to an upgrade for instance, the bitmap gets cleared in the process. Alternative bitmap storage locations—somewhere in users' home directories, for example—do not have this property.
Andrew Morton questions whether this need be done in the kernel at all:
Ulrich Drepper does not want to see the ELF format abused in the fashion it was for this patch; Kleen does not want that either, but used it as an expedient. Drepper thinks the linker should be taught to emit a new header type which would store the bitmap. It would sit near the beginning of the ELF file, eliminating the seek. A problem with that approach is that old binaries would not be able to take advantage of the technique; they would have to be re-linked.
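What the consuming side of that suggestion might look like is easy enough to imagine. The segment type below (PT_PREFETCH_BITMAP) is entirely hypothetical - no such header has been defined - but the scan over the program headers, which live near the start of the file, is standard ELF fare:

    #include <elf.h>
    #include <stddef.h>

    /* Hypothetical segment type in the OS-specific range. */
    #define PT_PREFETCH_BITMAP  0x6ffffff0

    /*
     * Look through the program headers for the (hypothetical) segment
     * describing the prefetch bitmap; return NULL if there is none.
     */
    static const Elf64_Phdr *find_bitmap_phdr(const Elf64_Ehdr *ehdr)
    {
            const Elf64_Phdr *phdr =
                    (const Elf64_Phdr *)((const char *)ehdr + ehdr->e_phoff);
            int i;

            for (i = 0; i < ehdr->e_phnum; i++)
                    if (phdr[i].p_type == PT_PREFETCH_BITMAP)
                            return &phdr[i];
            return NULL;
    }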
Then the question arises: how does that bitmap get initialized? Drepper suggests that SystemTap be used:
Kleen's patch walks the page tables for a process when it is exiting, setting a bit in the bitmap if that page has been faulted in. Drepper sees this as suboptimal:
The problem is in finding the balance between just prefetching the entire executable—which might be very wasteful—and prefetching the subset of pages that are most commonly used. It will take some heuristics to make that decision. As Drepper points out, recording the entire runtime of a program "will result in all the pages of a program to be marked (unless you have a lot of dead code in the binary and it's all located together)."
The place where Drepper sees a need for kernel support is in providing a bitmap interface to madvise() so that any holes in the pages to be prefetched do not get filled in by the readahead mechanism. The current interface would require a call to madvise() for each contiguous region, which could add up to a large number of calls. Both he and Morton favor doing the bulk of the work in user space.
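To see where the calls add up, here is a sketch of what prefetching from a bitmap looks like with the madvise() interface as it exists now: one MADV_WILLNEED call per contiguous run of set bits. The helper itself is hypothetical:

    #include <sys/mman.h>
    #include <stddef.h>

    /*
     * Issue one madvise(MADV_WILLNEED) call per contiguous run of set
     * bits; a sparse bitmap can translate into a great many system calls.
     */
    static void prefetch_from_bitmap(char *base, const unsigned char *bitmap,
                                     size_t npages, size_t pagesize)
    {
            size_t i = 0;

            while (i < npages) {
                    size_t start, len;

                    /* Skip pages that are not wanted. */
                    while (i < npages && !(bitmap[i / 8] & (1 << (i % 8))))
                            i++;
                    start = i;
                    /* Find the end of this contiguous run of wanted pages. */
                    while (i < npages && (bitmap[i / 8] & (1 << (i % 8))))
                            i++;
                    len = i - start;
                    if (len)
                            madvise(base + start * pagesize, len * pagesize,
                                    MADV_WILLNEED);
            }
    }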
Overall, there is lots more work to do before "predictive bitmaps" make their way into a Linux system—if they ever do. To start with, some benchmarking will have to be done to show that performance improves enough to consider making a change like this. David Miller expresses some pessimism about the approach:
Frankly, based upon my experiences then and what I know now, I think it's a lose to do this.
It is an interesting idea, though, one that will likely crop up again if this particular incarnation does not go anywhere. Since the biggest efficiency gain comes from reducing seeks, however, it may not be interesting in the long term. As Morton says, "solid-state disks are going to put a lot of code out of a job."
Atomic context and kernel API design
An API should refrain from making promises that it cannot keep. A recent episode involving the kernel's in_atomic() macro demonstrates how things can go wrong when a function does not really do what it appears to do. It is also a good excuse to look at an under-documented (but fundamental) aspect of kernel code design.

Kernel code generally runs in one of two fundamental contexts. Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary. But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.
There is more to it than that, though: any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.
"Deadlocking forever" tends not to appear on users' wishlists for the kernel, so the kernel developers go out of their way to avoid that situation. To that end, code which is running in atomic context carefully follows a number of rules, including (1) no access to user space, and, crucially, (2) no sleeping. Problems can result, though, when a particular kernel function does not know which context it might be invoked in. The classic example is kmalloc() and friends, which take an explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying whether sleeping is possible or not.
The wish to write code which can work optimally in either context is common, though. Some developers, while trying to write such code, may well stumble across the following definitions from <linux/hardirq.h>:
    /*
     * Are we doing bottom half or hardware interrupt processing?
     * Are we in a softirq context? Interrupt context?
     */
    #define in_irq()        (hardirq_count())
    #define in_softirq()    (softirq_count())
    #define in_interrupt()  (irq_count())
    #define in_atomic()     ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
It would seem that in_atomic() would fit the bill for any developer trying to decide whether a given bit of code needs to act in an atomic manner at any specific time. A quick grep through the kernel sources shows that, in fact, in_atomic() has been used in quite a few different places for just that purpose. There is only one problem: those uses are almost certainly all wrong.
The in_atomic() macro works by checking whether preemption is disabled, which seems like the right thing to do. Handlers for events like hardware interrupts will disable preemption, but so will the acquisition of a spinlock. So this test appears to catch all of the cases where sleeping would be a bad idea. Certainly a number of people who have looked at this macro have come to that conclusion.
But if preemption has not been configured into the kernel in the first place, the kernel does not raise the "preemption count" when spinlocks are acquired. So, in this situation (which is common - many distributors still do not enable preemption in their kernels), in_atomic() has no way to know if the calling code holds any spinlocks or not. So it will return zero (indicating process context) even when spinlocks are held. And that could lead to kernel code thinking that it is running in process context (and acting accordingly) when, in fact, it is not.
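As a concrete (and invented) example of the trap, consider code like the following, which mirrors the pattern found in a number of in-kernel users:

    #include <linux/hardirq.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /*
     * Broken: on a kernel built without CONFIG_PREEMPT, spin_lock()
     * does not raise the preemption count, so in_atomic() returns
     * zero here and kmalloc() may sleep while the lock is held.
     */
    static void *broken_alloc(spinlock_t *lock, size_t size)
    {
            void *p;

            spin_lock(lock);
            p = kmalloc(size, in_atomic() ? GFP_ATOMIC : GFP_KERNEL);
            spin_unlock(lock);
            return p;
    }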
Given this problem, one might well wonder why the function exists in the first place, why people are using it, and what developers can really do to get a handle on whether they can sleep or not. Andrew Morton answered the first question in a relatively cryptic way:
In other words, in_atomic() works in a specific low-level situation, but it was never meant to be used in a wider context. Its placement in hardirq.h next to macros which can be used elsewhere was, thus, almost certainly a mistake. As Alan Stern pointed out, the fact that Linux Device Drivers recommends the use of in_atomic() will not have helped the situation. Your editor recommends that the authors of that book be immediately sacked.
Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:
This pattern is consistent through the kernel - once again, the GFP_ flags example stands out in this regard. But it's also clear that this practice has not been documented to the point that kernel developers understand that things should be done this way. Consider this recent posting from Rusty Russell, who understands these issues better than most:
In fact, kmalloc() cannot figure out on its own whether sleeping is allowable or not. It has to be told by the caller. This rule is unlikely to change, so expect a series of in_atomic() removal patches starting with 2.6.26. Once that work is done, the in_atomic() macro can be moved to a safer place where it will not create further confusion.
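In practice, following that rule usually means pushing a gfp_t parameter out to the exported interface so that callers, which do know their context, can say so. A hypothetical example:

    #include <linux/list.h>
    #include <linux/slab.h>

    struct pending {
            struct list_head list;
            void *data;
    };

    /*
     * Rather than guessing with in_atomic(), let the caller say whether
     * the allocation may sleep - the same convention kmalloc() uses.
     * All names here are hypothetical.
     */
    static int mydev_queue_event(struct list_head *queue, void *data, gfp_t gfp)
    {
            struct pending *p = kmalloc(sizeof(*p), gfp);

            if (!p)
                    return -ENOMEM;
            p->data = data;
            list_add_tail(&p->list, queue);
            return 0;
    }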
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet