The current development kernel is 2.6.39-rc5, released on April 26. Linus said:
We have slightly fewer commits than in -rc4, which is good. At the
same time, I have to berate some people for merging some dubious
regression fixes. Sadly, the 'people' I have to berate is me,
because -rc5 contains what technically _is_ a regression, but it's
a performance thing, and it's a bit scary. It's the patches from
Andi (with some editing by Eric) to make it possible to do the
whole RCU pathname walk even if you have SElinux enabled.
See the full changelog for all the details.
Stable updates: the 188.8.131.52 update
was released on April 21; 184.108.40.206 and 220.127.116.11 followed one day later; all contain
another long list of important fixes.
The 18.104.22.168 and 22.214.171.124 updates are in the review process as
of this writing; they can be expected on or after April 28.
Can't be helped. No one has ever written a polite application
regarding disk usage. Applications are like seagulls, scanning for
free disk blocks and chanting "Mine! Mine!".
-- Casey Schaufler
That works. But Greg might see us doing it, so some additional
mergeable patches which *need* that export will keep him happy.
(iow, you're being extorted into doing some kernel cleanup work)
-- Andrew Morton
I'd been offline since Mar 25 for a very nasty reason - popped
aneurysm in right choroid artery. IOW, a hemorrhagic stroke. A
month in ICU was not fun, to put it very mildly. A shitty local
network hadn't been fun either... According to the hospital folks
I've ended up neurologically intact, which is better (for me) than the alternatives. Said state is unlikely to continue if I try to dig through ~15K
pending messages in my mailbox; high pressure is apparently _the_
cause for repeated strokes.
-- Al Viro, making a welcome return
The dentry cache scalability patch set was merged for the 2.6.38 kernel; it works by attempting to perform pathname
lookup with no locks held at all. The read-copy-update (RCU) mechanism is
used to ensure that dentry structures remain in existence for long enough
to perform the lookup. This patch set has removed a significant
scalability problem from the kernel, improving lookup times considerably.
Except, as it turns out, it doesn't always work that way. A set of patches
merged for 2.6.39-rc5 - rather later in the development cycle than one
would ordinarily expect - has helped to address this problem.
The fact that the pathname lookup fast path runs under RCU means that no
operation can block. Should it turn out that the lookup cannot be
performed without blocking (if a directory entry must be read from disk, for
example), the fast-path lookup is aborted and the whole process starts over
in the slow mode. In the 2.6.38 lookup code, the mere fact that security
modules have been built into the kernel will force a fallback to slow mode,
even if no actual security module is active. Things were done this way
because nobody had taken the time to verify whether the security module
inode_permission() checks were RCU-safe or not. So, if security
modules are enabled, the result is not just that the scalability advantages
over 2.6.37 are not available; in fact, the code runs slower than it
did in 2.6.37.
Enterprise distributions have a tendency to enable security modules, so
this performance problem is a real concern. In response, Andi Kleen took a look at
the code and found that improving the situation was not that hard; his patches led to what was merged for 2.6.39.
Andi started by allowing individual security modules to decide whether they
could perform the inode permissions check safely in the RCU mode or not, with the
default being to fall
back to slow mode. Since the default inode_permission() check
does nothing, it could easily be made RCU safe; with just that change,
systems with security modules enabled but with no module active can make
use of the fast lookup path.
Looking further, Andi discovered that both SELinux and SMACK already use
RCU for their permissions checking. Given that the code is already
RCU-safe, extending it to do RCU-safe permission checks was relatively
straightforward. The only remaining glitch comes in situations where auditing is
enabled; auditing is not RCU-safe, so things will still slow down on such
systems. Otherwise, though, the advantages of the dcache scalability work
should now have been extended to systems with security modules enabled -
assuming that the late-cycle patches do not result in regressions that
cause them to be reverted.
Kernel development news
Back in 2007, LWN readers learned about the SEEK_HOLE and SEEK_DATA
options to the lseek()
system call. These options allow an application to map
out the "holes" in a sparsely-allocated file; they were originally
implemented in Solaris for the ZFS filesystem. At that time, this
extension was rejected for Linux; the Linux filesystem developers thought
they had a better way to solve the problem. In the end, though, it may
have turned out that the Solaris crew had the better approach.
Filesystems on POSIX-compliant systems are not required to allocate blocks
for files if those blocks would contain nothing but zeros. A range within
a file for which blocks have not been allocated is called a "hole."
Applications which read from a hole will get lots of zeros in response;
most of the time, applications will not be aware that the actual underlying
storage has not been allocated. Files with holes are relatively rare, but
some applications do create "sparse" files which are more efficiently
stored if the holes are left out.
Most of the time, applications need not care about holes, but there are
exceptions. Backup utilities can save storage space if they notice and
preserve the holes in files. Simple utilities like cp can also,
if made aware of holes, ensure that those holes are not filled in any
copies made of the relevant files. Thus, it makes sense for the system to
provide a way for applications which care to learn about where the holes in
a file - if any - may be found.
The interface created at Sun used the lseek() system call, which
is normally used to change the read/write position within a file. If the
SEEK_HOLE option is provided to lseek(), the offset will
be moved to the beginning of the first hole which starts after the
specified position. The SEEK_DATA option, instead, moves to the
beginning of the first non-hole region which starts after the given
position. A "hole," in this case, is defined as a range of zeroes which
need not correspond to blocks which have actually been omitted from the
file, though in practice it almost certainly will. Filesystems are not
required to know about or report holes; SEEK_HOLE is an
optimization, not a means for producing a 100% accurate map of every range
of zeroes in the file.
When Josef Bacik posted his 2007 SEEK_HOLE patch, it was received
with comments like:
I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
It abuses the seek operation to become a query operation, it
requires a total number of system calls proportional to the number
holes+data and it isn't general enough for other similar uses
(e.g. total number of contiguous extents, compressed extents,
offline extents, extents currently shared with other inodes,
extents embedded in the inode (tails), etc.)
So this patch was not merged. What we got instead was a new
ioctl() operation called FIEMAP. There can be no doubt
that FIEMAP is a more powerful operation; it allows the precise
mapping of the extents in the file, with knowledge of details like extents
which have been allocated but not written to and those which have been
written to but which do not, yet, have exact block numbers assigned.
Information for multiple extents can be had with a single system call.
With an interface like this, it was figured, there is no need for something
like SEEK_HOLE. Recently, though, Josef has posted a new
SEEK_HOLE patch with the comment:
Turns out using fiemap in things like cp cause more problems than
it solves, so lets try and give userspace an interface that doesn't
suck.
A quick search on the net will turn up a long list of bug reports related
to FIEMAP. Some of them are simply bugs in specific filesystem
implementations, like the problems related to
delayed allocation that were discovered in February. Others have to do
with the rather complicated semantics of some of the FIEMAP
options and whether, for example, the file in question must be synced to
the disk before the operation can be run. And others just seem to be
related to the complexity of the system call itself. The end result has
been a long series of reports of corrupted files - not the sort of thing
filesystem developers want to find in their mailboxes.
It seems that FIEMAP is a power tool with sharp
edges which has been given to applications which just wanted a
butter knife. For the purpose of simply finding out which parts of a file
need not be copied, a simple interface like SEEK_HOLE seems to be
more appropriate. So, one assumes, this time the interface will make it
into the kernel.
That said, it looks like a few tweaks will be needed first. The API as
posted by Josef does not exactly match what Solaris does; to add an API
which is not compatible with the existing Solaris implementation makes
little sense. There is also the question of what happens when the
underlying filesystem does not implement the SEEK_HOLE and
SEEK_DATA options; the current patch returns EINVAL in
this situation. A proposed alternative is to have a VFS-level
implementation which just assumes that the file has no holes; that makes
the API appear to be supported on all filesystems and eliminates one error
case from applications.
Once these details are worked out - and appropriate man pages written -
SEEK_HOLE should be set to be merged this time around.
FIEMAP will still exist for applications which need to know more
about how files are laid out on disk; tools which try to optimize readahead
at bootstrap time are one example of such an application. For everything
else, though, there should be - finally - a simpler alternative.
As the effort to bring proper abstractions to the ARM architecture and
remove duplicated code continues, one clear problem area that has arisen is
in the area of DMA memory management. The ARM architecture brings some
unique challenges to this area, but the problems are not all ARM-specific.
We are also seeing an interesting view into a future where more complex
hardware requires new mechanisms within the kernel to operate properly.
One development in the ARM sphere is the somewhat belated addition of I/O
memory management units (IOMMUs) to the architecture. An IOMMU sits
between a device and main memory, translating addresses between the two.
One obvious application of an IOMMU is to make physically scattered memory
look contiguous to the device, simplifying large DMA transfers. An IOMMU
can also restrict DMA access to a specific range of memory, adding a layer
of protection to the system. Even in the absence of security worries,
a device which can scribble on random memory can cause no end of problems.
As this feature has come to ARM systems, developers have, in the classic
ARM fashion, created special interfaces for
the management of IOMMUs. The
only problem is that the kernel already has an interface for the management
of IOMMUs - it's the DMA API. Drivers which use this API should work on
just about any architecture; all of the related problems, including cache
coherency, IOMMU programming, and bounce buffering, are nicely hidden. So
it seems clear that the DMA API is the mechanism by which ARM-based
drivers, too, should work with IOMMUs; ARM maintainer Russell King recently
made this point in no uncertain terms.
That said, there are some interesting difficulties which arise when using
the DMA API on the ARM architecture. Most of these problems have their
roots in the architecture's inability to deal with multiple mappings to a
page if those mappings do not all share the same attributes. This is a
problem which has come up before; see this
article for more information. In the DMA context, it is quite easy to
create mappings with conflicting attributes, and performance concerns are
likely to make such conflicts more common.
Long-lasting DMA buffers are typically allocated with
dma_alloc_coherent(); as might be expected from the name, these
are cache-coherent mappings. One longstanding problem (not just on ARM) is
that some drivers need large, physically-contiguous DMA areas which can be
hard to come by after the system has been running for a while. A number of
solutions to this problem have been tried; most of them, like the CMA allocator, involve setting aside memory at
boot time. Using such memory on ARM can be tricky, as it may end up being
mapped as if it were device memory, and may run afoul of the
conflicting-attributes rules.
More recently, a different problem has come up: in some cases, developers
want to establish these DMA areas as uncached memory. Since main memory is
already mapped into the kernel's address space as cached, there is no way
to map it as uncached in another context without breaking the rules. Given
this conflict, one might well wonder (as some developers did) why uncached
DMA mappings are wanted. The reason, as explained
by Rebecca Schultz Zavin, has to do with graphics. It's common for
applications to fill memory with images and textures, then hand them over
to the GPU without touching them further. In this situation, there's no
advantage to having the memory represented in the CPU's cache; indeed,
using cache lines for that memory can hurt performance. Going uncached
(but with write combining) turns out to give a significant performance
improvement.
But nobody will appreciate the higher speed if the CPU behaves strangely in
response to multiple mappings with different attributes. Rebecca listed
a few possible solutions to that problem that she had thought of; some
have been tried before, and none are seen as ideal. One is to set aside
memory at boot time - as is sometimes done to provide large buffers - and
never map that memory into the kernel's address space. Another approach is
to use high memory for these buffers; high memory is normally not mapped
into the kernel's address space. ARM-based systems have typically not
needed high memory, but as more systems ship with 1GB or more of memory,
we'll see high memory used more often. The final alternative would
be to tweak the attributes in the kernel's mapping of the affected memory.
That would be somewhat tricky; that memory is mapped with huge pages which
would have to be split apart.
These issues - and others - have been summarized in a "to do" list by Arnd
Bergmann. There's clearly a lot of work to be done to straighten out this
interface, even given the current set of problems. But there is another
cloud on the horizon in the form of the increasing need to share these
buffers between devices. One example can be found in this patch, which is an attempt to establish
graphical overlays as proper objects in the kernel mode setting graphics
environment. Overlays are a way of displaying (usually) high-rate graphics
on top of what the window system is doing; they are traditionally used for
tasks like video playback. Often, what is wanted is to take frames
directly from a camera and show them on the screen, preferably without
copying the data or involving user space. These new overlays, if properly
tied into the Video4Linux layer's concept of overlays, should allow that to
happen.
Hardware is getting more sophisticated over time, and, as a result, device
drivers are becoming more complicated. A peripheral device is now often a
reasonably capable computer in its own right; it can be programmed and left
to work on its own for extended periods of time. It is only natural to
want these peripherals to be able to deal directly with each other. Memory
is the means by which these devices will communicate, so we need an
allocation and management mechanism that can work in that environment.
There have been suggestions
that the GEM memory manager - currently used with GPUs - could be
generalized to work in this mode.
So far, nobody has really described how all this could work, much less
posted patches. Working all of these issues out is clearly going to take
some time. It looks like a fun challenge for those who would like to help
set the direction for our kernels in the future.
Thomas Gleixner gets asked regularly about a "roadmap" for getting the
realtime Linux (aka PREEMPT_RT) patches into the mainline. As readers of
LWN will know, it has been a multiple-year effort to move pieces of the
realtime patchset into the mainline—and its completion has been
predicted several times, though
not for a few years now. Gleixner presented an update on the realtime
patches at this year's Embedded Linux Conference. In the talk, he showed a
roadmap—of sorts—but more importantly described what is still
lurking in that tree, and what approach the realtime developers will be
taking to get those pieces into the mainline.
Gleixner started out by listing the parts of the realtime tree that have
already made it into the mainline. That includes high-resolution timers,
the mutex infrastructure, preemptible and hierarchical RCU, threaded
interrupt handlers, and more. Interrupt handlers can now be forced to run
as threads by using a kernel command line option.
There have also been cleanups done in lots of
places to make it easier to bring in features from the realtime tree,
including cleaning up the locking namespace and infrastructure "so
spinlocks becomes a more moderate sized patch", he said.
What's left are the "tough ones," as all of the changes that
are "halfway easy to do" are already in the mainline. The next
piece that will likely appear is the preemptible mmu_gather patches, which will
allow much of the memory management code to be preemptible. Gleixner said
that it was hoped that code could make it into 2.6.39; that didn't happen,
but it should go in for 2.6.40.
Per-CPU data structures are a current problem that "makes me scratch
my head a lot", Gleixner said. The whole idea is to keep the data
structures local to a particular CPU and avoid cache contention between
CPUs, which requires
that any code modifying those data structures stay running on that CPU. In
order to do that, the code disables preemption while modifying the per-CPU
data. If that code "just did a little fiddling" with
preemption disabled, it would not be a problem, but currently there are
often thousands of lines of code executed. The realtime developers have
talked with the per-CPU folks and they "see our pain". The
right solution is to use inline functions to annotate the real atomic
accesses, so that the preemption-disabled window can be
reduced. "Right now, there is a massive amount of code protected by
preempt_disable()", he said.
The next area that needs to be addressed is preemptible memory and page
allocators. Right now, the realtime tree uses SLAB because the others are
"too hard to deal with". There has been talk about creating a
memory allocator specifically for the realtime tree, but some recent
developments in the SLUB allocator may have removed the need for that.
SLUB has been converted to be completely lockless for the fast path and
Christoph Lameter has promised to deal with the slow path, which is
"good news" for the realtime developers. The page allocator
problem is "not that hard to solve", Gleixner said. Some
developers have claimed that a fully preemptible, lockless page allocator
is possible, so he is not worried about that part.
Another area "that we still have to twist our brain around" is
software interrupts, he said. Those currently disable preemption, but then
can be interrupted themselves, leading to unbounded latency. One
possibility is to split up the software interrupts into different threads
and to wake them up when an interrupt is generated, whether it comes from
kernel or user space. There are performance implications with that,
however, because there is a context switch associated with the interrupt.
There are some other "nasty implications" as well, because it
will be difficult to tune the priorities of the interrupt threads.
Another possibility would be to add an argument to
local_bh_disable() that would indicate which software interrupts
should be held off. But cleaning up the whole tree to add those new
arguments is "nothing I can do right now", he said. There are
tools to help with adding the argument itself, but figuring out which
software interrupts should be disabled is a much bigger task.
The "last thing" that is still pending in the realtime tree is
sleeping spinlocks. That work is fairly straightforward, he said, only
requiring adding one file and patching three others. But that will only
come once the other problems have been solved, he said.
So, when will the merge to mainline be finished? That's a question
Gleixner and the other realtime developers have been hearing for seven
years or so. The patchset is huge and "very intrusive in many
ways", he said. It has been slowly getting into the mainline piece
by piece, but it will probably never be complete, because people keep
coming up with new features at roughly the same rate as things move into
the mainline. As always, Gleixner said, "it will be done by the end
of next year".
Gleixner used a 2010 quote from Linus Torvalds ("The RT people have actually been pretty good at slipping their stuff in,
in small increments, and always with good reasons for why they aren't
crazy.") to illustrate the approach taken by the realtime
developers. The realtime changes are slipped into "nice Trojan
horses" that are useful for more than just realtime. Torvalds is
"well aware that we are cheating, but he doesn't care" because
the changes fix other problems as well.
The realtime tree has been pinned to kernel 2.6.33 for some time now (with
126.96.36.199-rt having been released just prior to Gleixner's talk). There are
plans to update to 2.6.38 soon. There are several reasons why the realtime
tree is not updated very rapidly, starting with a lack of developer time.
The tree also requires a long stabilization phase, partly because
"some of the bugs we find are very complex race conditions",
and those bugs can have serious impacts on filesystems or other parts of the
kernel. Typically the problem is not fixing those kinds of bugs, but
finding them as they can be quite hard to reproduce.
Another problem is that because the realtime changes aren't in the
mainline Gleixner "can't yell at people yet" when they break
things. Also, other upstream work and merging other code often takes
priority over work in the realtime tree. But he is "tired of
maintaining that thing out of tree", so work will progress. Often
getting a piece of the realtime tree accepted requires lots of work
elsewhere in the tree, which consumes a lot of time and brain power.
"People ship crap faster than you can fix it", he said.
There are about 20 active contributors to the realtime tree, as well as
large testing efforts going on at Red Hat, IBM, OSADL, and Gleixner's
company, Linutronix.
Looking beyond the current code, Gleixner outlined two potential future
features. The first is non-priority-based scheduling, which is needed to
solve certain kinds of problems, but brings with it a whole new set of
problems. Even though priorities are not used, there are still
"priority-inversion-like problems" that will have to be solved
with mechanisms similar to priority inheritance. Academics have proved
that such schedulers can work on uni-processor systems, but have just now
started to "understand that there is this thing called SMP".
He did note that there is a group in Pisa, Italy (working on deadline
scheduling) that he specifically excluded from his complaints about
academic researchers.
The other new feature is CPU isolation, which is not exactly realtime work,
but the realtime
developers have been asked to look into it. The idea is to hand over a CPU
to a particular task, so that it gets the full use of that CPU. In order
to do that, the CPU must be removed from the timer interrupt and the RCU
pool among other things. The problem isn't so much that users want to be
able to run undisturbed for an hour on a CPU or core, but that they then
want to be able to interact with the rest of the kernel to send data over
the network or write to disk. In general, it's fairly clear what needs to
be done to implement CPU isolation, he said.
It is obvious that Gleixner is tired of being asked for a roadmap for
the realtime patches. Typically it isn't engineers working on devices or
other parts of the kernel who ask for it, but is, instead, their managers
who are looking for such a thing. There are several reasons why there is
no roadmap, starting with the fact that kernel developers don't use
PowerPoint. More seriously, though, the realtime developers are making
their own road into the kernel, so they are looking for a road to follow
themselves. But, so that it could no longer be said that he hadn't shown a
roadmap, Gleixner presented one (shown at right) to much laughter.
He also fielded quite a few audience questions about the realtime tree,
what others can do to help it progress, and why some of the troublesome
Linux features couldn't be eliminated to make it easier to get the code
merged. In terms of help, the biggest need is for more testing. In
particular, Gleixner encouraged people to test the realtime patches atop
Greg Kroah-Hartman's 2.6.33-stable series.
Software interrupts are still required in various places in the kernel, in
particular the network and block layers. Any change to try to remove them
would require changes in too much code. On the other hand, counting
semaphores are mostly gone, though some uses come in through the staging
tree. Those are mostly cleaned up before the staging code moves out of
that tree, he said. From time to time, he looks through the staging tree
for significant new users of counting semaphores and doesn't really find
any, so he is not concerned about those.
As for the choice of 2.6.38 as the basis for the next realtime tree,
Gleixner said that he picks the "most convenient" tree when
making that decision. It depends on what is pending for the mainline, and
what went into the various kernel versions, because he does not want to
backport things into the realtime tree: "I'm not insane", he said.
The realtime tree got started partially because of a conference he attended
in 2004, where the assembled academics agreed that it was not
possible to turn a general purpose operating system into a realtime one.
He started working on it because of that technical challenge. Along the
same lines, when asked what he would do with all the free time he would
have once the realtime code was upstream, Gleixner replied that he would
like to eliminate jiffies in the kernel. He has a "strong affinity
to mission impossible", he said.
One should be careful about choosing the realtime kernel, he said, and only
use it when the latency guarantees are really needed. Smartphone kernels,
for example, might not have any real need for it. But if the baseband stack
were to move to the main CPU, then it might make sense to look at using the
realtime code. One "should only run such a beast if you really need
it". That said, he rattled off a number of different projects that
were using the realtime kernel, including military, banking, and automation
applications. He closed with a short description of a rather fancy
gummy-bear-sorting machine that used the realtime kernel; after watching it
run for a while, though, you wouldn't want to see gummy bears again for a year.
Page editor: Jonathan Corbet