The current development kernel is 2.6.35-rc3, released on June 11. "So
I've been hardnosed now for a week - perhaps overly so - and hopefully that
means that 2.6.35-rc3 will be better than -rc2 was. Not only do we have a
number of regressions handled, we don't have that silly memory corruptor
that bit so many people with -rc2 and confused people with its many varied
forms of bugs it seemed to take, depending on just what random memory it
happened to corrupt.
" The short-form changelog is in the
announcement, or see the
for all the details. Linus now evidently goes offline
for a little while, so the flow of changes into the mainline will slow
Stable updates: there have been no stable updates in the last week.
Comments (3 posted)
The kernel's whole approach to messaging is pretty haphazard and
lame and sad. There have been various proposals to improve the
usefulness and to rationally categorise things in ways which are
more useful to operators, but nothing seems to ever get over the
-- Andrew Morton
I do fairly commonly see patches where the description can be
summarised as "change lots and lots of stuff to no apparent end"
and one does have to push and poke to squeeze out the thinking and
the reasons. It's a useful exercise and will sometimes cause the
originator to have a rethink, and sometimes reveals that it just
wasn't a good change.
-- Andrew Morton
Comments (none posted)
Back in May, Jan Kara posted a VFS patch
that fixed a regression and he sent the patch to the stable tree folks as
well. Linus Torvalds noted that it had
been introduced in the merge window, so it wasn't relevant for the stable
tree. That led to a discussion about how to figure out which kernel
version includes a particular patch. While the conversation is a month old,
the advice is pretty much timeless.
Andrew Morton's method is rather
sub-optimal: "I just keep lots of kernel trees around and poke about with `patch
--dry-run'. PITA." Christoph Hellwig and James Bottomley both
suggested git-describe <revid>, which will show the tag of the version a
patch was applied to, or was pulled into if you use the --contains
flag. As one might guess, though, Torvalds had some more elaborate
suggestions. One can use git name-rev in much the same way as
git-describe --contains, but a more "obscure" way to
get the same kind of information is:
git log --tags --source --author=viro --oneline fs/namei.c
which shows commits by Al Viro of fs/namei.c
along with the tagged version that
each commit was included into. On a recent kernel tree, the start of that
output looks like:
d83c49f v2.6.34 Fix the regression created by "set S_DEAD on unlink()..." commit
3e297b6 v2.6.34-rc3 Restore LOOKUP_DIRECTORY hint handling in final lookup on op
781b167 v2.6.34-rc2 Fix a dumb typo - use of & instead of &&
1f36f77 v2.6.34-rc2 Switch !O_CREAT case to use of do_last()
While the specific example Torvalds gave might not be widely applicable, the
idea behind it is. Using git-blame to track down the commit where
a particular change was made is often useful, but the dates in the log can
be misleading with regards to which kernel(s) the change ended up in.
Using some combination of describe and log will make
figuring those kinds of things out much easier.
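The difference between the plain and --contains forms is easy to see in a
throwaway repository; everything below (the repository, commits, and tag
names) is invented for illustration:

```shell
# Build a tiny repository with two tagged "releases" (names made up).
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "fix: some regression"
first=$(git rev-parse HEAD)
git tag v2.6.34-rc1
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "later work"
git tag v2.6.34

# In which tagged release did the fix first appear?
git describe --contains "$first"
git name-rev --tags "$first"
```

Both commands name the fix relative to v2.6.34-rc1, the nearest tag
containing it, which answers the "which release has this fix?" question
directly.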
Comments (27 posted)
The Managed Runtime Initiative
has recently announced its
existence. This group is dedicated to making "managed runtime"
code (Java programs in particular) run faster on Linux systems. MRI's
effort might not seem like a suitable topic for the Kernel Page, except for
one thing: this group has just released thousands of lines of
questionable code which, it claims, it plans to push upstream.
The specific problem that the MRI people (actually Azul Systems employees)
have set out to solve appears to be application pauses caused by garbage
collection. Their solution is implemented at several levels, some of which
are found in the kernel. For the curious, the patches can be found on the MRI download page,
helpfully packaged as a tarball filled with source-RPM files. They have
also thoughtfully included all of Red Hat's patches; look for files
containing "az" to pick the new stuff out of the noise.
The first kernel patch adds an interface for loadable memory management
modules. With this in place, loadable modules can create and claim their
own VMAs which they manage. The Azul-supplied module creates a special
device which provides a few dozen ioctl() operations for the
management of memory within those VMAs. What is actually done by this
module is on the
obscure side; it involves dividing memory into "accounts" with names like
"GC Pause Prevention." There appears to be code to provide transparent
hugepage access to interested applications. There is also some sort of
relaxed locking done within the special VMAs, designed to improve
performance.
Then, there is the pluggable scheduler patch, creating a new SCHED_ALT
scheduling class which sits between CFS and the realtime classes. The
actual scheduler module's purpose is described as:
The Azul scheduler is designed to provide a cpu resource guarantee
on Linux: specifically that any process with 'committed' cpus and
runnable threads available for those cpus will have its threads
running on those cpus within 10ms.
It allows the partitioning of the system into "committed" and ordinary
CPUs, with special applications getting priority access to the committed
CPUs.
The MRI web page claims that "it is the initiative's goal to upstream
those related contributions into existing and complementary OSS projects
(e.g. kernel.org and openjdk.org)," but the kernel-related code has
never, to your editor's knowledge, been seen on any kernel-related mailing
list. It is heavy with #ifdefs, light on comments, and it adds exports
for large numbers of low-level functions in the scheduler and VM code. Plus there is the
little detail that the development community is unlikely to agree with this
code's fundamental purpose. Pluggable schedulers have been rejected in the
past; until now, nobody has even dared to suggest pluggable memory management
modules.
In other words, we have a bunch of hackish code which was developed in
total isolation; one wonders how many customers it has been shipped to. If
Azul Systems and the MRI are serious about wanting to upstream it, they
might just want to start talking with the development community fairly
soon. One expects that they might just have a few changes to make.
Comments (100 posted)
Kernel development news
Interrupts are a device's way of telling the kernel that something
interesting has happened. One of the key benefits of using interrupts is
that they free the kernel from the need to poll a device to learn what its
state is. Like any other part of a computer, though, interrupts can go
wrong, leading to situations where the system is overwhelmed by a flood of
spurious interrupts - or, instead, left waiting for an interrupt which will
never arrive. The kernel has some defensive mechanisms in its generic
interrupt layer for dealing with situations like these; Tejun Heo has now
posted a patch series
intended to improve those mechanisms. As it happens, the necessary
response when interrupts go bad is returning to polling.
One problem which is familiar to driver authors is missing interrupts. A
driver will typically set up an I/O operation, get it started, then wait
until an interrupt indicating completion arrives. If that interrupt never
shows up, the driver can end up waiting for a very long time. Missing
interrupts can have a number of causes, including flaky devices or an
interrupt routing problem somewhere in the system. Either way, if the
driver author has not anticipated this situation and taken the appropriate
measures - setting a timeout, for example - things will not end well.
Waiting for interrupt timeouts will slow a device's performance
considerably, though. That problem can be mitigated by polling the device
state frequently, but rapid polling has its own costs. In an attempt to
obtain the best results consistently, Tejun's patch adds a new driver API:
struct irq_expect *init_irq_expect(unsigned int irq, void *dev_id);
void expect_irq(struct irq_expect *exp);
void unexpect_irq(struct irq_expect *exp, bool timedout);
A call to init_irq_expect() will allocate an opaque token to be
used with the other two functions; it should be passed the interrupt number
of interest and the same dev_id value as was used to allocate the
interrupt initially. When the driver initiates an action which should
result in a device interrupt, it should make a call to
expect_irq(). When the operation is completed,
unexpect_irq() should be called, with timedout indicating
whether the operation timed out (the interrupt did not arrive). Note that
it's not necessary for the driver to free the struct irq_expect
structure; that will happen automatically when the interrupt is released.
A call to expect_irq() will initiate polling on the given
interrupt line, where "polling" means making an occasional call to the
device's interrupt handler. Initially, that polling is quite slow. If it
turns out that the device is dropping interrupts (as indicated by the
timedout parameter to unexpect_irq()), the polling
frequency will be increased - up to once every millisecond. Working
devices should interrupt before the slow poll period passes, so the result
should be no real polling at all on reliable devices. If there is a
problem with interrupt delivery, though, the kernel will automatically take
responsibility for poking the interrupt handler when interrupts are not
delivered.
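As a rough sketch, a driver's I/O path using this API might look like the
following; only init_irq_expect(), expect_irq(), and unexpect_irq() come from
Tejun's patch, while the device structure and helper names are invented for
the example:

```c
/* Hypothetical driver fragment; not from Tejun's patch set. */
struct my_device {
	int irq;
	struct irq_expect *exp;
	struct completion done;
};

static int my_device_setup(struct my_device *dev)
{
	/* Same irq and dev_id as were passed to request_irq() */
	dev->exp = init_irq_expect(dev->irq, dev);
	return dev->exp ? 0 : -ENOMEM;
}

static int my_device_do_io(struct my_device *dev)
{
	bool timedout;

	expect_irq(dev->exp);		/* begin (slow) polling of the line */
	my_device_start_io(dev);	/* hypothetical: kick the hardware */

	timedout = !wait_for_completion_timeout(&dev->done, HZ);
	unexpect_irq(dev->exp, timedout); /* report whether the IRQ arrived */

	return timedout ? -EIO : 0;
}
```

Note that there is no matching "free" call: as described above, the
irq_expect structure is cleaned up when the interrupt itself is released.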
This interface works well if the driver knows when to expect interrupts,
but not all devices work that way. For hardware which can interrupt at any
time, there is an "IRQ watching" API instead:
void watch_irq(unsigned int irq, void *dev_id);
This function will begin polling of the specified interrupt line; it will
also initiate tracking of interrupt delivery status. If it determines that
interrupts are being lost (indicated by an IRQ_HANDLED return
status from a polled call to the handler), it will continue to poll at a
higher frequency. Otherwise, interrupt delivery will eventually be deemed
reliable and polling will be turned off.
Tejun's patch also changes the way that the kernel responds to spurious
interrupts - those which no driver is interested in. Current kernels count
the number of interrupts on each line for which no handler returned
IRQ_HANDLED; if 99,000 out of 100,000 interrupts are spurious, the
kernel loses patience, disables the interrupt line forevermore, and starts
polling the line instead. There is a real cost to this action, which is
why the kernel allows spurious interrupts to get to such a high proportion
of the total. Once the response is triggered, there is no going back, even
if the spurious interrupts were the result of a brief hardware glitch.
With the adaptive polling mechanisms put into place to support the above
features, the kernel is also able to take a more flexible approach to
handling of spurious interrupts. 9,900 bad interrupts out of 10,000 are
now enough to cause the spurious interrupt handling mechanism to kick in;
as before, it disables the interrupt and begins polling. After a period,
though, the new code will reenable the interrupt line, just to see what
happens. If the source of spurious interrupts has stopped, the interrupt
can be used as before. If, instead, spurious interrupts are still being
delivered, the line will be blocked again for a longer period of time.
There has not been a lot of discussion of this patch set so far; one comment worried that polling could cause users
not to realize that there are problems in their systems. But Tejun says
that this kind of response is required to get reasonably solid behavior out
of flaky hardware, and nobody seems to want to challenge that claim. So it
seems fairly likely that a future version of this patch will find its way
into the mainline at some point.
Comments (4 posted)
Since 2005, the realtime preemption project has worked to provide
deterministic response times in stock Linux kernels. Over that time,
though, it has come to appear that there is no guaranteed latency with
regard to when all of this code will actually be merged. At LinuxTag 2010,
realtime hacker Thomas Gleixner talked about the state of this patch set,
what's coming, and, yes, when it might actually be merged in its entirety.
Don't hold your breath.
In truth, the realtime preemption code has been going into the mainline,
piece by piece, for years. Some recently-merged pieces include threaded interrupt handlers and
the sleeping spinlock precursor patches. The threaded handlers make a
number of driver tasks simpler (regardless of any realtime needs) by
eliminating much of the need for tasklets and workqueues. They have also
proved to be useful in providing support for some strange i2c-attached
interrupt controller hardware. The spinlock changes do not affect the
generated code (in mainline kernels), but they are useful for annotating
the type of each lock.
Recent movements of code into the mainline notwithstanding, the realtime
patchset isn't getting any smaller. It seems that the realtime developers
have an interesting problem: the realtime kernel is a really good place to
try out a wide variety of new features. So, despite the fact that code
occasionally moves to the mainline, new stuff keeps getting added to the
tree.
This tree's attractiveness for the testing of new code comes from the fact
that it tends to reveal scalability problems much more quickly than
mainline kernels do. The extra preemptibility offered by this kernel comes
at a cost: the price for lock contention is much higher. So the realtime
tree shows scalability issues at lower levels of contention than
non-realtime kernels. The important point is that the scalability
bottlenecks encountered by realtime kernels are not unique to realtime;
they just come sooner than the same bottlenecks will show up with the
mainline. So realtime kernels can be used to look forward to the problems
that the mainline kernel will be experiencing next year.
Thus, for example, realtime kernels exhibit scalability problems in the
virtual filesystem layer that are otherwise only seen in big-iron
torture-test labs. That makes them useful for testing features, and
especially useful for testing scalability improvements. That is why code
like the VFS scalability patch
set currently makes its home in that tree. Eventually, most of these
pieces will get merged into the mainline. Thomas says that it will all be
in by the end of the year - but which year is not something he is
willing to commit to.
The next patch set to move to the mainline might be Peter Zijlstra's memory management preemptibility
series, which solves some long latencies in the memory management
code; the current plan is to push these patches for 2.6.36. Another bit of
code which might make the move is an option to force all drivers to use
threaded interrupt handlers regardless of whether they explicitly request
them. This option would almost certainly not be turned on for most
production kernels, but it makes the testing of drivers with involuntarily
threaded handlers easier.
The realtime tree also suffers from a few unsolved problems. One of them
is latencies in the slab allocator, which runs with preemption disabled for
long periods of time. The SLQB
allocator had raised hopes for a while, but it appears that it will not
be pushed for merging anytime soon. So the realtime hackers have to find a
way to fix one of the existing allocators, or give up and write a slab
allocator of their own. Thomas noted that there are still a few letters
left in the SL?B namespace, so there might just be an SLRB in the future.
That is all quite vague at this point, though; Thomas admitted that he has
no idea how this problem will be resolved.
Another ongoing problem is the increasing use of per-CPU data. In
throughput-oriented environments, per-CPU data increases scalability by
eliminating contention between processors. But use of per-CPU data
necessarily requires that preemption be disabled while the data is being
manipulated; to do otherwise is to risk that the process working with that
data will be preempted or moved to another processor, making a mess of
things. Disabling preemption is anathema in an environment where
everything is always supposed to be preemptable, though. So the realtime
patch set currently puts a lock around per-CPU data accesses, eliminating
the preemption problem but wrecking scalability. Here, too, a real
solution has not yet been found.
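The mainline pattern that causes the trouble can be seen in a tiny,
hypothetical example; get_cpu_var() works by disabling preemption until the
matching put_cpu_var(), and it is exactly that preempt_disable() which the
realtime tree cannot tolerate:

```c
/* Hypothetical example of the usual mainline per-CPU pattern. */
DEFINE_PER_CPU(unsigned long, event_count);

static void note_event(void)
{
	/* get_cpu_var() disables preemption so this task can be neither
	 * preempted nor migrated while it touches this CPU's counter. */
	unsigned long *count = &get_cpu_var(event_count);

	(*count)++;

	put_cpu_var(event_count);	/* re-enables preemption */
}
```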
Thomas finished with a bit of talk about testing of the realtime tree.
Quite a bit of "enterprise-class" testing is done in the well-furnished
labs at companies like IBM and Red Hat. At the embedded level, the Open Source Automation Development Lab has a
lab of its own. But there's another interesting source of testing: the
Linux audio community has been enthusiastic in its use of the realtime
kernel and has helped find a number of issues. There's also a growing set
of tools maintained in the rt-tests collection.
All told, the picture painted by Thomas was one of a healthy project, even
if we still don't know when it will all get into the mainline. Even in the
realtime world, there are things we simply have to wait for.
Comments (5 posted)
The kernel tree for the ARM architecture is large and fairly complicated.
Because of the large number of ARM system-on-chip (SoC) variants, as well
as different versions of the ARM CPU itself, there is something of a
combinatorial explosion occurring in the architecture tree. That, in turn,
led to an explosion from Linus,
who is getting tired of "pointless churn" in the tree.
A pull request from Daniel Walker for some
updates to arch/arm/mach-msm was the proximate cause of
Torvalds's unhappiness, but it goes deeper than that. He responded to
Walker's request by pointing out a problem he sees with ARM:
There's something wrong with ARM development. The amount of pure noise in
the patches is incredibly annoying. Right now, ARM is already (despite me
not reacting to some of the flood) 55% of all arch/ changes since 2.6.34,
and it's all pointless churn [...]
and at a certain point in the merge window I simply could not find it in
me to care about it any more.
He goes on to note that the majority of the diffs are
"mind-deadening" because they aren't sensibly readable by
humans. He further analyzes the problem by
comparing the sizes of the x86 and ARM trees, with the latter being some
800K lines of "code"—roughly
three times the size of x86. Of that, 200K
lines are default config (i.e. defconfig) files for 170+ different SoCs.
To Torvalds, those
files are "pure garbage".
In fact, he is "actually considering just getting rid of all the 'defconfig'
files entirely". Each of those files represents the configuration
choices someone made when building a kernel for a specific ARM SoC, but
keeping them around is just a waste, he said:
And I suspect that it really is best to just remove the existing defconfig
files. People can see them in the history to pick up what the heck they
did, but no way will any sane model ever look even _remotely_ like them,
so they really aren't a useful basis for going forward.
Another problem that Torvalds identified is the proliferation of
platform-specific drivers, which could very likely be combined into shared
drivers in the drivers/ tree or coalesced in other ways.
Basically, "we need somebody who cares, and
doesn't just mindlessly aggregate all the crud". Ben Dooks agreed that
there is a problem, but said that "many of the big company players have
yet to really see the necessity" of combining drivers. He also
noted that at least some of the defconfig files were being used in
automated build testing, but did agree that there are older defconfigs that
should be culled.
Dooks also had a longer description of
the problems that ARM maintainers have in trying to
support so many different SoCs, while also trying to reduce the size and
complexity of the sub-architecture trees. Essentially, the maintainers are
swamped and "until it
hits these big companies in the pocket it [is] very difficult to get them
to actually pay" for cleaning up the ARM tree and keeping it clean
in the future.
Because Torvalds said that he was planning to remove the ARM (and other)
defconfig files, ARM maintainer Russell King posted a
warning to the linux-arm-kernel mailing list:
Linus doesn't appear to be listening to reason, so I see now this as
a fait accompli. It'll [apparently] happen at the next merge window.
So, don't send anything which contains a defconfig file or updates to
it. It's pointless.
That set off a separate discussion on that mailing list—King's and
others' attempts to redirect it back to linux-kernel
notwithstanding—about ways to reduce the amount of mostly redundant
information carried around in the defconfig files. Ryan Mallon is in favor
of proactively eliminating some defconfigs,
while others discussed various ways to only keep the deltas between the
config files for various SoCs.
Based on Torvalds's comments on linux-kernel, some kind of delta scheme is
unlikely to fly. His main complaint is that the defconfig files are neither
readable nor writable by humans, as they are
generated by various tools. He made some specific suggestions of alternatives
that would still allow the generation of those config files, using
Kconfig files that are usable by humans.
Reducing the number of defconfigs, as Mallon suggested, may be helpful, but
King at least is convinced that it doesn't go far enough. He believes that Torvalds has already made up his
mind to remove the defconfigs in the next merge window and that the ARM
community had better be ready with something else:
I believe the only acceptable solution is to get an [alternative] method
in place - no matter what it is - and remove all but one of the
defconfig files from the mainline kernel. _And_, most importantly,
kautobuild needs to be fixed so that we still get build coverage.
The loss of kautobuild is a major concern here, and I believe it trumps
everything else for the next merge window. Kautobuild is an extremely
important resource that we simply can not afford to lose.
The discussion ranged from possible solutions to the immediate defconfig
problem to the larger issue of reducing the duplication throughout the ARM
trees. There is an effort underway to produce a single kernel that
would support multiple ARM platforms for Ubuntu 10.10, which will
likely help consolidate various sub-architectures. Given that Canonical is
working closely with the newly formed Linaro
organization—founded to simplify ARM Linux—there is reason to believe that things will get better.
Meanwhile, though, back on linux-kernel, Torvalds started a new thread to
present his ideas for a hierarchical collection of Kconfig files that would
essentially take the place of the defconfigs. After some back and forth,
Torvalds gave an example of exactly what
he is suggesting:
Let's say that I want a x86 configuration that
has USB enabled. I can basically _ask_ the Kconfig machinery to generate
that with something like this:
- create a "Mykconfig" file:
and then I just do
KBUILD_KCONFIG=Mykconfig make allnoconfig
and look what appears in the .config file.
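The contents of the Mykconfig file did not survive here; a minimal sketch of
what such a file might contain, an assumption in the spirit of the example
rather than Torvalds's exact text, would be:

```
# Hypothetical Mykconfig: force on the options we want, then pull in
# the normal x86 Kconfig so "make allnoconfig" fills in the rest.
config MY_FORCED_OPTIONS
	def_bool y
	select USB

source "arch/x86/Kconfig"
```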
He goes on to describe a theoretical Kconfig.omap3_evm file that
sets the specific requirements for that platform and then includes
Kconfig.omap3. That file sets up whatever is required for the
OMAP3 platform and includes Kconfig.arm. That would allow
developers or tools like kautobuild to generate the necessary config files
without having to carry them around in the kernel tree. Those
Kconfig files would also be much more readable and any diffs would be
understandable, which is important to Torvalds.
That solves a significant subset of the problem, but there is still a fly
in the ointment: dependencies. In Torvalds's example, CONFIG_USB
requires CONFIG_USB_SUPPORT, so that would need to be added to
Mykconfig. Not accounting for dependencies will get you a kernel
that doesn't build or, worse yet, won't run.
There are a number of possible solutions to the dependency problem, though,
from Catalin Marinas's patch to track unmet
dependencies of options used in select statements to Vegard
Nossum's Summer of Code project to add a
satisfiability solver into the configuration editors (menuconfig, etc.).
It certainly seems likely that defconfig files will be removed from the kernel
tree in the 2.6.36 merge window. Whether there is another
solution—based on Torvalds's ideas or something else—to replace
them is really up to the architecture teams, as Torvalds is perfectly happy
to move on without them. ARM, PowerPC, MIPS, and
others all have lots of defconfig files, but unless he changes his
mind, they won't in a few short months. They can keep maintaining those
files in a separate repository somewhere, or find an acceptable method to
generate them. While it may be painful in the
short term, it will reduce the size of the kernel tree and make Torvalds's
job easier, both of which are worth striving for.
Comments (8 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet