The 3.1 kernel is out, released by
Linus on October 24. Headline features in this somewhat delayed release include
improved Xen memory management, enhancements to process tracing (the new
PTRACE_SEIZE command), enhancements to
lseek() to ease the finding of holes in files, OpenRISC
architecture support, and more.
As of this writing, some 4400 patches have been pulled into the mainline
for the 3.2 release. Trees pulled thus far include networking, USB,
staging, and security; a full merge
window summary will appear in next week's edition.
Stable updates: 3.0.8 was released
on October 25 with the usual pile of important fixes.
What worries me more than the kernel summit is just that the 3.1
release cycle has dragged out longer than usual, so I'm a bit afraid
that the 3.2 merge window will just be more chaotic than usual just
because there might be more stuff there to be merged. But that's
independent of any KS issues, and I also suspect that the added time
for development has been largely nullified by the productivity lost
due to the k.org mess.
-- Linus Torvalds
I wasn't going to do this... but then I did. I think that
sometimes coding is a bit like chocolate.
-- Neil Brown
Kernel development news
The realtime Linux minisummit was held on October 22 in Prague on the third
day of the 13th Realtime Linux Workshop. This year's gathering featured a
relatively unstructured discussion that does not lend itself to a tidy
writeup. What has been attempted here is a report on the most important
themes, supplemented with material from some of the workshop talks.
One of the biggest challenges for the realtime preemption patch set from
the beginning has been the handling of per-CPU data. Throughput-oriented
developers are quite fond of per-CPU variables; they facilitate lockless
data access and improve the scalability of the code. But working with
per-CPU data usually requires disabling preemption, which runs directly
counter to what the realtime preemption patch set has been trying to do.
It is a classic example of how high throughput and low latency can
sometimes be incompatible goals.
The way the realtime patch set has traditionally dealt with per-CPU data is
to put a lock around it and treat it like any other shared resource. The
get_cpu_var() API has supported that mode fairly well. In more
recent times, though, that API has been pushed aside by a new set of
functions with names like this_cpu_read() and
this_cpu_write(). The value of these functions is speed,
especially on the x86 architecture where most of the operations boil down
to a single machine instruction. But this new API has also caused some
headaches, especially (but not exclusively) in the realtime camp.
The first problem is that there is no locking mechanism built into this
API; it does not even disable preemption. So a value returned from
this_cpu_read() may not correspond to the CPU that the calling
thread is actually running on by the time it does something with the value
it reads. Without some sort of external preemption disable, a sequence of
this_cpu_read() followed by this_cpu_write() can write a
value back on a different CPU than where the value was read, corrupting the
resulting data structure; this kind of problem has already been experienced
a few times.
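The hazard can be sketched as follows; the per-CPU counter is hypothetical, but the race is exactly the one described above:

```c
/* Hypothetical per-CPU counter used only for illustration. */
DEFINE_PER_CPU(unsigned long, my_counter);

void unsafe_increment(void)
{
	unsigned long v = this_cpu_read(my_counter);  /* runs on, say, CPU 0... */
	/* ...nothing prevents preemption and migration here... */
	this_cpu_write(my_counter, v + 1);            /* ...may land on CPU 1 */
}

void safe_increment(void)
{
	unsigned long *p = &get_cpu_var(my_counter);  /* disables preemption */
	(*p)++;
	put_cpu_var(my_counter);                      /* re-enables preemption */
}
```

Single operations like this_cpu_inc() are safe on their own; it is the combination of separate read and write calls that can be split across CPUs.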
The bigger problem, though, is that the API does not in any way indicate
where the critical section involving the per-CPU data lies in the code.
So there is no way to put in extra protection for realtime operation, and
there is no way to put in any sort of debugging infrastructure to verify
that the API is being used correctly. There have already been some heated
discussions on the mailing list about these functions, but the simple fact
is that they make things faster, so they are hard to argue against. There
is real value in the this_cpu API; Ingo Molnar noted that the API is not a
problem as such - it is just an incomplete solution.
Incomplete or not, usage of this API is growing by hundreds of call sites
in each kernel release, so something needs to be done. A lot of time was
spent talking about the naming of these functions, which is seen as
confusing. What does "this CPU" mean in a situation where the calling
thread can be migrated at any time? There may be an attempt to change the
names to discourage improper use of the API, but that is clearly not a
complete solution.
The first step may be to figure out how to add some instrumentation to the
interface that is capable of finding buggy usage. If nothing else, real
evidence of bugs will help provide a justification for any changes that
need to be made. There was also talk of creating some sort of "local lock"
that can be used to delineate per-CPU critical sections, disable migration
(and possibly preemption) in those sections, and, on realtime, to simply
take a lock. Along with hopefully meeting everybody's goals, such an
interface would allow the proper use of per-CPU data to be verified with
the locking checker.
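No such interface existed at the time of the discussion; a rough sketch of what it might look like (all names here are invented for illustration):

```c
/* Hypothetical "local lock" as discussed in the session.  On a
 * throughput-oriented kernel it might just disable migration (and
 * perhaps preemption); on realtime it would take a real, sleeping
 * lock.  Either way it delineates the critical section for lockdep. */
DEFINE_LOCAL_LOCK(stats_lock);
DEFINE_PER_CPU(struct stats, stats);

void update_stats(void)
{
	local_lock(stats_lock);            /* marks the per-CPU critical section */
	__this_cpu_add(stats.packets, 1);
	local_unlock(stats_lock);          /* the checker can verify pairing */
}
```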
Meanwhile, the realtime developers have recently come up with a new
approach to the management of per-CPU data in general. Rather than
disabling preemption while working with such data, they simply disable
migration between processors instead. If (and this is a bit of a big "if")
the per-CPU data is properly protected by locking, preventing migration
should be sufficient to keep processes from stepping on each other's toes
while preserving the ability for high-priority processes to preempt
lower-priority processes when the need arises.
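In code, the approach looks something like the following sketch; migrate_disable() comes from the realtime patches, while the lock and data names are illustrative:

```c
/* Per-CPU data protected by a lock (a sleeping lock on realtime);
 * only migration is disabled, so a higher-priority task can still
 * preempt this code at any time. */
DEFINE_PER_CPU(struct my_data, my_data);
DEFINE_PER_CPU(spinlock_t, my_lock);

void update_local_data(void)
{
	migrate_disable();                     /* pin the task to this CPU */
	spin_lock(this_cpu_ptr(&my_lock));
	/* work with this_cpu_ptr(&my_data) safely here */
	spin_unlock(this_cpu_ptr(&my_lock));
	migrate_enable();
}
```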
The fear that comes with disabling migration is that processes could get
stuck and experience long latencies even though there are free CPUs
available. If a low-priority process disables migration, then is
preempted, it cannot be moved to an available processor. So that process
will simply wait until the preempting process either completes or gets
migrated itself. Some work has been done to try to characterize the
problem; Paul McKenney talked for a bit about some modeling work he has
done in that area. At this point, it's really not clear how big the
problem really is, but some developers are clearly worried about how this
technique may affect some workloads.
For a long time, the realtime preemption patch set has split software
interrupt ("softirq") handling into a separate thread, making it
preemptable like everything else. More recent patches have done away with
that split, though, for a couple of reasons: to keep the patches closer to
the mainline, and to avoid delaying network processing. The
networking layer depends heavily on softirqs to keep the data moving;
shifting that processing to a separate thread can add latency in an
unwelcome place.
But there had been a good reason for the initial split: software interrupts
can introduce latencies of their own. So there was some talk of bringing
the split back in a partial way. Most softirqs could run in a separate
thread but, perhaps, the networking softirqs would continue to run as
"true" software interrupts. It is not a solution that pleases anybody, but
it would make things better in the short term.
There appears to be agreement on the form of the long-term solution:
software interrupts simply need to go away altogether. Most of the time,
softirqs can be replaced by threaded interrupt handlers; as Thomas Gleixner
put it, the networking layer's NAPI driver API is really just a poor-man's
form of threaded IRQ handling. There is talk of moving to threaded
handlers in the storage subsystem as well; that evidently holds out some
hope of simplifying the complex SCSI error recovery code. That is not a
trivial change in either storage or networking, though; it will take a long
time to bring about. Meanwhile, partially-split softirq handling will
probably return to the realtime patches.
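The mainline already provides the building block for this direction in request_threaded_irq(); a minimal sketch, with invented driver names:

```c
/* The hard handler runs in interrupt context and only quiets the
 * device; the real work happens in the thread function, which runs
 * in a schedulable (and therefore preemptable) kernel thread. */
static irqreturn_t mydev_quick_check(int irq, void *dev_id)
{
	/* acknowledge the interrupt at the device */
	return IRQ_WAKE_THREAD;            /* defer to the handler thread */
}

static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
	/* process the data; this code may sleep and can be preempted */
	return IRQ_HANDLED;
}

/* in the driver's probe routine: */
err = request_threaded_irq(irq, mydev_quick_check, mydev_thread_fn,
			   IRQF_ONESHOT, "mydev", dev);
```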
An interesting problem related to threaded softirq handlers is the
prevalence of "trylock" operations in the kernel. There are a number of
sites where the code will attempt to obtain a lock; if the attempt fails,
the handler will stop with the idea that it will try again on the next
softirq run. If the handler is running as a thread, though, chances are
good that it will be the highest-priority thread, so it will run again
immediately. A softirq handler looping in this manner might prevent the
scheduling of the thread that actually holds the contended lock, leading to
a livelock situation. This problem has been observed a few times on real
systems; fixing it may require a restructuring of the affected code.
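The problematic pattern, in schematic form (the lock and softirq names are invented):

```c
/* In classic softirq context this retry is harmless: the handler
 * exits and the lock holder eventually runs.  As the highest-priority
 * thread, though, the handler is rescheduled immediately, the lock
 * holder never runs, and the system livelocks on the retry loop. */
static void my_softirq_action(struct softirq_action *h)
{
	if (!spin_trylock(&resource_lock)) {
		raise_softirq(MY_SOFTIRQ);   /* try again on the next run */
		return;
	}
	/* do the protected work */
	spin_unlock(&resource_lock);
}
```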
There was a long discussion on how the realtime developers planned to get
most or all of the patch set upstream. It has been out of tree for some
time now - the original preempt-rt patch was announced almost
exactly seven years ago. Vast amounts of code have moved from the
realtime tree into the mainline over that time, but there are still some
300 patches that remain out of the mainline. Thomas suggested a couple of
times that he has maintained those patches for about long enough, and does
not want or plan to do it forever.
Much of what is in the realtime tree seems to be upstreamable, though
possibly requiring some cleanup work first. There are currently about 75
patches queued to go in during the 3.2 merge window, and there is a lot
more that could be made ready for 3.3 or 3.4. The current goal is to merge
as much code as possible in the next few development cycles; that should
serve to reduce the size of the realtime patch set considerably, making the
remainder easier to maintain.
There are a few problem areas. Sampling of randomness is disabled in the
realtime tree; otherwise contention for the entropy pool causes unwanted
latencies. It seems that contribution of entropy is on the decline
throughout the kernel; as processors add hardware entropy generators, it's
not clear that maintaining the software pool has value. The seqlock API
has a lot of functions with no users in the kernel; the realtime patch set
could be simplified if those functions were simply removed from the
mainline. There are still a lot of places where interrupts are disabled,
probably unnecessarily; fixing those will require auditing the code to
understand why interrupts were disabled in the first place. Relayfs has
problems of its own, but there is only one in-tree user; for now, it will
probably just be disabled for realtime. The stop_machine()
interface is seen to be a pain, and is probably also unnecessary most of
the time.
And so on. A number of participants went away from this session with a
list of patches to fix up and send upstream; with luck, the out-of-tree
realtime patch set will shrink considerably.
Once the current realtime patch set has been merged, is the job done? In
the past, shrinking the patch set has been made harder by the simple fact
that the developers continue to come up with new stuff. At the moment,
though, it seems that there is not a vast amount of new work on the
horizon. There are only a couple of significant additions expected in the
near future:
- Deadline scheduling. That patch is in a condition that is "not
too bad," but the developer has gotten distracted with new things. It
looks like the deadline scheduler may be picked up by a new developer
and progress will resume.
- CPU isolation - the ability to run one or more processors with no
clock tick or other overhead. Frédéric Weisbecker continues to
work on those patches, but there is a lot to be done and no
pressing need to merge them right away. So this is seen as a
longer-term project.
Once upon a time, the addition of deterministic response to a
general-purpose operating system was seen as an impossible task: the
resulting kernel would not work well for anybody. While some people were
saying that, though, the realtime preemption developers simply got to work;
they have now mostly achieved that goal. The realtime work may never be
"done" - just like the Linux kernel as a whole is never done - but this
project looks like one that has achieved the bulk of its objectives. Your
editor has learned not to make predictions about when everything will be
merged; suffice to say that the developers themselves appear to be
determined to get the job done before too long. So, while there may always
be an out-of-tree realtime patch set, chances are it will shrink
considerably in the coming development cycles.
The 2011 Kernel Summit was held in Prague on October 23-25. The
organization of the event was changed somewhat this year; the first day was
dedicated to a small number of minisummits. We do not currently have
coverage from those events - that is a gap we hope to fill in the near
future.
The Kernel Summit proper started on October 24 with the traditional closed
session. The topics discussed there were:
- Kernel.org report; including some more
information about the compromise along with what is being done to secure
this critical infrastructure going forward.
- Tracing for large data centers looked
at Google's needs in the tracing area, but then the discussion moved into
whether moving the tools into the kernel tree will help alleviate some of
the problems with ABIs and backward compatibility.
- Structured error logging; a
presentation (only partly delivered) on improving kernel error logging.
- Coming to love cgroups; while love for
control groups may not have been the outcome, some more understanding of
cgroups, controllers, and the problems with the latter did emerge.
- Memory management issues; a session on
the progress the subsystem has made over the last year along with a list of
patches that are in need of review and rework with an eye to eventually
merging them.
- Preemption disable and verifiable APIs
discussed calls to the this_cpu*() family as
well as preempt_disable() calls multiplying throughout the kernel,
which cause problems, and not just for the realtime kernel.
- Scheduler testing; a reworked version
of the LinSched scheduling simulator was discussed and proposed to be added
to the kernel tools/ directory. It may provide the long-sought
ability to more reliably test scheduler changes.
- Patch review; in a wide-ranging
discussion, problems in the area of patch review were covered. The discussion
eventually turned to Android's kernel patches and, in particular, suspend
blockers with a, perhaps, surprising conclusion.
- Development process issues; there were
fewer problems in this area than there have been in recent years. Linus is
happy with how things are going, overall, though the growth of complexity
in the kernel is somewhat worrisome.
The second day of the 2011 Kernel Summit featured a changed format: the session
was open to all attendees. Coverage for this day has
been split into two parts:
- Morning: reports from a large number
of minisummits, more on kernel.org security and the web of trust, and more.
- Afternoon: shared libraries, failure
handling, the media controller, the kernel build and configuration
subsystem, and the future of the event itself.
Needless to say, coverage would not be complete without a picture of the
expanded group arranged into a rather compressed space:
(Your editor would like to thank the Linux Foundation for supporting his
travel to Prague for this event).
Patches and updates
- Linus Torvalds: Linux 3.1 (October 24, 2011)
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet