The current 2.6 development kernel is 2.6.25-rc7, released on March 25. Says
Linus: "The shortlog has more details, but it boils down to some
reverts, some docbook fixes, some sparse annotation fixups, a number of
trivial patches, and a healthy sprinkling of small fixups. Give it a good
testing, because we're hopefully now well on our way towards that eventual
real 2.6.25 release!
" Said shortlog can be found in the
announcement, or see the
for the details.
The current stable 2.6 kernel is 2.6.24.4, released on March 24. This
release contains a large number of patches for significant problems in the
2.6.24 kernel.
Comments (1 posted)
Kernel development news
I think I preferred it when people just stared blankly when I told
them what I do.
-- Val Henson
When you reject useful patches based on "this is not our preferred
style", you piss people off. That is a significant reason why
people choose to spend their time elsewhere. In certain cases
having people abandon the kernel may be a net gain, in many it is a net loss.
-- Jörn Engel
[M]y experience with checkpatch.pl is the exact opposite of what you
fear: it _widened_ the contributor base: a good number of newbies
felt encouraged that an objective piece of tool reports an "error"
in a file that was written by otherwise "much more knowledgable"
kernel hackers. checkpatch.pl is basically the "yes, really, you
are right, this piece of code in the Linux kernel is indeed crap"
review tool that reinforces newbies. It lowers the bar of entry to
kernel hacking, and it does so for exactly those pieces of code
that we want newbies to be active on: barely maintained source code.
-- Ingo Molnar
Comments (19 posted)
Kernel markers are a mechanism which allows developers to put static tracepoints into the
kernel. Once placed, these markers can be used by operations staff to
trace well-known events in running systems without that staff having to
know about kernel code. Solaris provides a long list of static tracepoints
for use with DTrace, but Linux, thus far, has none. That situation should
eventually change - static markers were only merged into the mainline in
2.6.24. But, as the developers start to look more seriously at markers,
some interesting issues are coming up.
One of those emerged as a result of this
patch from Mathieu Desnoyers which allows proprietary modules to
contain markers. The fact that current kernels do not recognize markers in binary-only
modules is mostly an accident: markers are disabled in modules with any sort
of taint flag set as a way to prevent kernel crashes - a kernel oops being
a rather heavier-weight marker than most people wish to encounter.
Mathieu tightened that test in a way that allows markers in proprietary
modules, saying "let's see how people react." Needless to
say, he saw.
One might well wonder why the kernel developers, not known for their
sympathy toward proprietary modules in general, would want to consider
letting those modules include static tracepoints. The core argument here
is that static markers allow proprietary modules to export a bit more
internal information to the kernel, and to their users. It is seen as a
sort of (very) small opening up on the part of the proprietary module
writer. Mathieu says:
I think it's only useful for the end user to let proprietary
modules open up a bit, considering that proprietary module writers
can use the markers as they want in-house, but would have to leave
them disabled on shipped kernels.
The idea is that, by placing these tracepoints, module authors can help
others learn more about what's going on inside the module and help people
track down problems. The result should be a more stable kernel which -
whether proprietary modules have been loaded or not - is generally
considered to be a good thing.
On the other hand, there's no shortage of developers who are opposed to
extending any sort of helping hand to binary module authors. Giving those
modules more access to Linux kernel internals, it is argued, only leads to
trouble. Ingo Molnar put it this way:
Why are we even arguing about this? Binary modules should be as
isolated as possible - it's a totally untrusted entity and history
has shown it again and again that the less infrastructure coupling
we have to them, the better.
Ingo also worries that allowing binary modules to use markers will serve to
make the marker API that much harder to change in the future. Since that
API is quite young, chances are good that changes will happen. As much as
the kernel developers profess not to care about binary-only modules, the
fact of the matter is that there are good reasons to avoid breaking those
modules. The testing community certainly gets smaller when testers cannot
load the modules they need to make their systems work in the manner to
which they have become accustomed. So it is possible that allowing
proprietary modules to use markers could make the marker API harder to fix
in future kernel releases.
The grumbles have been loud enough that Mathieu's patch will probably not
be merged for 2.6.25. The idea is likely to come back again, but
not necessarily right away: the marker feature may have been merged in
2.6.24, but it would appear that 2.6.25 will be released with no actual
markers defined in the source. It's not clear that binary-only module
authors are pushing to add tracepoints when none of the other developers
are doing so. Until somebody starts actually using static markers, debates
on where they can be used will continue to be of an academic nature.
Comments (none posted)
When the kernel executes a program, it must retrieve the code from disk,
which it normally does by demand paging it in as required by the execution
path. If the kernel could somehow know which pages would be needed, it
could page them in more efficiently. Andi Kleen has posted an experimental set of patches that do just that.
Programs do not know about their layout on disk, nor is their path through
the executable file optimized to reduce seeking, but with some information
about which pages will be needed, the kernel can optimize the disk
accesses. If one were to gather a list of the pages that get faulted in
as a program runs, that information could be saved for future runs. It
could then be turned into a bitmap indicating which of the pages should be brought in when the program is run again.
Once you have such a bitmap, where to store it becomes a problem. Kleen's
method uses a "hack" to the ELF format on disk, putting the bitmap at the
end of the executable. This has a number of drawbacks: a seek to get
the info, modifying the executable each time you train, and only allowing a
single usage pattern system-wide. It does have one very nice attribute,
though, the bitmap and executable stay in sync; if the executable changes,
due to an upgrade for instance, the bitmap would get cleared in the
process. Alternative bitmap storage locations—somewhere in users'
home directories for example—do not have this property.
Andrew Morton questions whether this need be done in the kernel:
Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,
use /proc/self/maps and the new pagemap code to work out which pages of
which files were faulted in, write that info into the elf file (or a
separate per-executable shadow file), then use that info the next time the
app is executed, either with an LD_PRELOAD or just a wrapper.
Ulrich Drepper does not want to see the ELF format abused in the fashion it was for this
patch; Kleen doesn't either, but used it as an expedient. Drepper thinks the linker
should be taught to emit a new header type which would store the bitmap. It
would be near the beginning of the ELF file, eliminating the seek. A
problem with that approach is that old binaries would not be able to take
advantage of the technique; a re-linking would be required.
The question arises: how does that bitmap get initialized? Drepper suggests that systemtap be used:
To fill in the bitmaps one can
have a separate tool which is explicitly asked to update the
bitmap data. To collect the page fault data one could use systemtap.
It's easy enough to write a script which monitors the minor page
faults for each binary and writes the data into a file. The binary
update tool can then use the information from that file to generate the bitmap.
Kleen's patch walks the page tables for a process when it is exiting,
setting a bit in the bitmap if that page has been faulted in. Drepper sees
this as suboptimal:
Over many uses of a program all kinds of
pages will be needed. Far more than in most cases. The prefetching
should really only cover the commonly used code paths in the program.
If you pull in everything, this will have advantages if you have that
much page cache to spare. In that case just prefetching the entire
file is even easier. No, such an improved method has to be more intelligent.
The problem is in finding the balance between just prefetching the entire
executable—which might be very wasteful—and prefetching the
subset of pages that are most commonly used. It will take some heuristics
to make that decision. As Drepper points out, recording the entire runtime
of a program "will result in all the pages of a
program to be marked (unless you have a lot of dead code in the binary
and it's all located together)."
The place where Drepper sees a need for kernel support is in providing a
bitmap interface to madvise() so that any holes in the pages that
get prefetched do not get filled by the readahead mechanism. The current interface
would require a call to madvise() for each contiguous region, which
could add up to a large number of calls. Both he and Morton favor the
bulk of the work being done in user space.
Overall, there is lots more work to do before "predictive bitmaps" make
their way into a Linux system—if they ever do. To start with, some benchmarking will have to be done
to show that performance improves enough to consider making a change like
this. David Miller expresses some pessimism about the idea:
I wrote such a patch ages ago as well.
Frankly, based upon my experiences then and what I know now, I think
it's a lose to do this.
It is an interesting idea though, one that will likely crop up again if
this particular incarnation does not go anywhere. Since the biggest efficiency
gain is from reducing seeks, though, it may not be interesting long-term.
As Morton says, "solid-state disks are going to put a lot of code out of a job."
Comments (20 posted)
An API should refrain from making promises that it cannot keep. A recent
episode involving the kernel's in_atomic()
macro demonstrates how
things can go wrong when a function does not really do what it appears to
do. It is also a good excuse to look at an under-documented (but
fundamental) aspect of kernel code design.
Kernel code generally runs in one of two fundamental contexts. Process
context reigns when the kernel is running directly on behalf of a (usually)
user-space process; the code which implements system calls is one example.
When the kernel is running in process context, it is allowed to go to sleep
if necessary. But when the kernel is running in atomic context, things
like sleeping are not allowed. Code which handles hardware and software
interrupts is one obvious example of atomic context.
There is more to it than that, though: any kernel function moves into
atomic context the moment it acquires a spinlock. Given the way spinlocks
are implemented, going to sleep while holding one would be a fatal error;
if some other kernel function tried to acquire the same lock, the system
would almost certainly deadlock forever.
"Deadlocking forever" tends not to appear on users' wishlists for the
kernel, so the kernel developers go out of their way to avoid that
situation. To that end, code which is running in atomic context carefully follows a
number of rules, including (1) no access to user space, and,
crucially, (2) no sleeping. Problems can result, though, when a
particular kernel function does not know which context it might be invoked
in. The classic example is kmalloc() and friends, which take an
explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying
whether sleeping is possible or not.
The wish to write code which can work optimally in either context is
common, though. Some developers, while trying to write such code, may well
stumble across the following definitions from <linux/hardirq.h>:
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
#define in_irq() (hardirq_count())
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())
#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
It would seem that in_atomic() would fit the bill for any
developer trying to decide whether a given bit of code needs to act in an
atomic manner at any specific time. A quick grep through the kernel
sources shows that, in fact, in_atomic() has been used in quite a
few different places for just that purpose.
There is only one problem: those uses are almost certainly all wrong.
The in_atomic() macro works by checking whether preemption is
disabled, which seems like the right thing to do. Handlers for events like
hardware interrupts will disable preemption, but so will the
acquisition of a spinlock. So this test appears to catch all of the cases
where sleeping would be a bad idea. Certainly a number of people who have
looked at this macro have come to that conclusion.
But if preemption has not been configured into the kernel in the first
place, the kernel does not raise the "preemption count" when spinlocks are
acquired. So, in this situation (which is common - many distributors still
do not enable preemption in their kernels), in_atomic() has no way
to know if the calling code holds any spinlocks or not. So it will return
zero (indicating process context) even when spinlocks are held. And that
could lead to kernel code thinking that it is running in process context
(and acting accordingly) when, in fact, it is not.
Given this problem, one might well wonder why the function exists in the
first place, why people are using it, and what developers can really do to
get a handle on whether they can sleep or not. Andrew Morton answered the first question in a relatively brief note:
in_atomic() is for core kernel use only. Because in special
circumstances (ie: kmap_atomic()) we run inc_preempt_count() even
on non-preemptible kernels to tell the per-arch fault handler that
it was invoked by copy_*_user() inside kmap_atomic(), and it must fail.
In other words, in_atomic() works in a specific low-level
situation, but it was never meant to be used in a wider context. Its
placement in hardirq.h next to macros which can be used
elsewhere was, thus, almost certainly a mistake. As Alan Stern pointed out, the fact that Linux
Device Drivers recommends the use of in_atomic() will not have
helped the situation. Your editor recommends that the authors of that book
be immediately sacked.
Once these mistakes are cleared up, there is still the question of just
how kernel code should decide whether it is running in an atomic context or
not. The real answer is that it just can't do that. Quoting Andrew Morton again:
The consistent pattern we use in the kernel is that callers keep
track of whether they are running in a schedulable context and, if
necessary, they will inform callees about that. Callees don't
work it out for themselves.
This pattern is consistent through the kernel - once again, the GFP_
flags example stands out in this regard. But it's also clear that this practice has
not been documented to the point that kernel developers understand that
things should be done this way. Consider this recent
posting from Rusty Russell, who understands these issues better than most:
This flag indicates what the allocator should do when no memory is
immediately available: should it wait (sleep) while memory is freed
or swapped out (GFP_KERNEL), or should it return NULL immediately
(GFP_ATOMIC). And this flag is entirely redundant: kmalloc() itself
can figure out whether it is able to sleep or not.
In fact, kmalloc() cannot figure out on its own whether sleeping
is allowable or not. It has to be told by the caller. This rule is
unlikely to change, so expect a series of in_atomic() removal
patches starting with 2.6.26. Once that work is done, the
in_atomic() macro can be moved to a safer place where it will not
create further confusion.
Comments (26 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet