The current development kernel is 3.7-rc2, released
on October 20. Linus comments:
Anyway, it's been roughly a week, and -rc2 is out. The most
noticeable thing tends to be fixing various fallout issues -
there's lots of patches to finish up (and fix the fallout) from the
UAPI include file reorganization, for example, but there's also
some changes to how module signing is done etc etc.
Stable updates: 3.0.47, 3.4.15, and
3.6.3 were released on October 21;
each contains another set of important fixes. Note that 3.4.15 and 3.6.3 contain
an ext4 data corruption bug (as do their immediate predecessors and 3.5.7),
so waiting for the next update might be advisable. 3.0.47, instead, contains a
block subsystem patch that "could cause problems"; 3.0.48 was released on October 22 with a
Meanwhile, 3.2.32 was released on
Comments (none posted)
When I was on Plan 9, everything was connected and uniform. Now
everything isn't connected, just connected to the cloud, which
isn't the same thing. And uniform? Far from it, except in
mediocrity. This is 2012 and we're still stitching together little
microcomputers with HTTPS and ssh and calling it revolutionary. I
sorely miss the unified system view of the world we had at Bell
Labs, and the way things are going that seems unlikely to come back
any time soon.
— Rob Pike
local_irq_save() and local_irq_restore() were mistakes :( It's
silly to write what appears to be a C function and then have it
operate like Pascal (warning: I last wrote some Pascal in 66 B.C.).
— Andrew Morton
So this is it. The big step why we did all the work over the past
kernel releases. Now everything is prepared, so nothing protects us
from doing that big step.
| | \ \ nnnn/^l | |
| | \ / / | |
| '-,.__ => \/ ,-` => | '-,.__
| O __.´´) ( .` | O __.´´)
~~~ ~~ `` ~~~ ~~
— Jiri Slaby
So really Rasberry Pi and Broadcom - get a big FAIL for even bothering to make a press release for this, if they'd just stuck the code out there and gone on with things it would have been fine, nobody would have been any happier, but some idiot thought this crappy shim layer deserved a press release, pointless.
— Dave Airlie
Comments (6 posted)
Stable kernel updates are supposed to be just that — stable. But they are
not immune to bugs, as a recent ext4 filesystem problem has shown. In
short: ext4 users would be well advised to avoid versions 3.4.14, 3.4.15,
3.5.7, 3.6.2, and 3.6.3; they all contain a patch which can, in some
situations, cause filesystem corruption.
The problem, as explained in this note from Ted
Ts'o, has to do with how the ext4 journal is managed. In some
situations, unmounting the filesystem fails to truncate the journal,
leaving stale (but seemingly valid) data there. After a single
unmount/remount (or reboot) cycle little harm is done; some old
transactions just get replayed unnecessarily. If the filesystem is quickly
unmounted again, though, the journal can be left in a corrupted state; that
corruption will be helpfully replayed onto the filesystem at the next mount.
Fixes are in the works. The ext4 developers are taking some time, though,
to be sure that the problem has been fully understood and completely fixed;
there are signs that the bug may have roots
far older than the patch that actually caused it to bite people. Once that
process is complete, there should be a new round of stable updates
(possibly even for 3.5, which is otherwise at end-of-life) and the world
will be safe for ext4 users again.
(Thanks are due to LWN reader "nix" who alerted
readers in the comments and reported the bug to the ext4 developers).
Update: Ted now thinks that his
initial diagnosis was incomplete at best; the problem is not as well
understood as it seemed. Stay tuned.
Comments (46 posted)
The Raspberry Pi Foundation has announced that the
source code for its video driver is now available under the BSD license.
"If you’re not familiar with the status of open source drivers on ARM
SoCs this announcement may not seem like such a big deal, but it does
actually mean that the BCM2835 used in the Raspberry Pi is the first
ARM-based multimedia SoC with fully-functional, vendor-provided (as opposed
to partial, reverse engineered) fully open-source drivers, and that
Broadcom is the first vendor to open their mobile GPU drivers up in this way."
Comments (60 posted)
Kernel development news
The Linux scheduler, in its various forms, has always been optimized for
the (sometimes conflicting) goals of throughput and interactivity.
Balancing those two objectives across all possible workloads has proved to
be enough of a challenge over the years; one could argue that the last thing
the scheduler developers need is yet another problem to worry about. In
recent times, though, that is exactly what has happened: the scheduler is
now expected to run the workload while also minimizing power consumption.
Whether the system lives in a pocket or in a massive data center, the owner is almost
certainly interested in more power-efficient operation. This problem has
proved to be difficult to solve, but Vincent Guittot's recently posted small-task packing patch set
may be a step in
the right direction.
A "small task" in this context is one that uses a relatively small amount
of CPU time; in particular, small tasks are runnable less than 25% of the
time. Such tasks, if they are spread out across a multi-CPU system, can
cause processors to stay awake (and powered up) without actually using
those processors to any great extent. Rather than keeping all those CPUs
running, it clearly makes sense to coalesce those small tasks onto a
smaller number of processors, allowing the remaining processors to be powered down.
The first step toward this goal is, naturally, to be able to identify those
small tasks. That can be a challenge: the scheduler in current kernels
does not collect the information needed to make that determination. The
good news is that this problem has already been solved by Paul Turner's per-entity load tracking patch set, which
allows for proper tracking of the load added to the system by every
"entity" (being either a process or a control group full of processes) in
the system. This patch set has been out-of-tree for some time, but the
clear plan is to merge it sometime in the near future.
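The 25% criterion can be made concrete with a small sketch. The structure and field names below are invented for illustration and are not taken from either patch set; the sketch also uses a plain fixed observation window, which is a simplification rather than a description of how the load-tracking code actually accounts for time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-task accounting. The names are illustrative only
 * and do not come from the kernel or from the patch sets.
 */
struct task_load {
    uint64_t runnable_ns;   /* time spent runnable during the window */
    uint64_t window_ns;     /* length of the observation window */
};

/* A task is "small" if it was runnable for less than 25% of the window. */
static bool task_is_small(const struct task_load *tl)
{
    assert(tl->window_ns > 0);
    assert(tl->runnable_ns <= tl->window_ns);

    /* runnable / window < 1/4, written without a division */
    return tl->runnable_ns * 4 < tl->window_ns;
}
```

With a window of 1000ns, a task that was runnable for 200ns counts as small, while one that was runnable for exactly 250ns does not.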
The kernel's scheduling domains mechanism
represents the topology of the underlying system; among other things, it is
intended to help the scheduler decide when it makes sense to move a process
from one CPU to another. Vincent's patch set starts by adding a new flag
bit to indicate when two CPUs (or CPU groups, at the higher levels) share
the same power line. In the shared case, the two CPUs cannot be powered
down independently of each other. So, when two CPUs live in the same power domain,
moving a process from one to the other will not significantly change the
system's power consumption. By default, the "shared power line" bit is set
for all CPUs; that preserves the scheduler's current behavior.
The real goal, from the power management point of view, is to vacate all
CPUs on a given power line so the whole set can be powered down. So the
scheduler clearly wants to use the new information to move small tasks out
of CPU power domains. As we have recently
seen, though, process-migration code needs to be written carefully lest
it impair the performance of the scheduler as a whole. So, in particular,
it is important that the scheduler not have to scan through a (potentially
long) list of CPUs when contemplating whether a small task should be moved
or not. To that end, Vincent's patch assigns a "buddy" to each CPU at
system initialization time. Arguably "buddy" is the wrong term to use,
since the relationship is a one-way affair; a CPU can dump small tasks onto
its buddy (and only onto the buddy), but said buddy cannot reciprocate.
Imagine, for a moment, a simple two-socket, four-CPU system that looks
(within the constraints of your editor's severely limited artistic
capabilities) like this:
For each CPU, the scheduler tries to find the nearest suitable CPU
on a different power line to buddy it with. The most "suitable" CPU is
typically the lowest-numbered one in each group, but, on heterogeneous
systems, the code will pick the CPU with the lowest power consumption on
the assumption that it is the most power-efficient choice.
So, if each CPU and each
socket in the above system could be powered down independently, the buddy
assignments would look like this:
Note that CPU 0 has no buddy, since it is the lowest-numbered processor in
the system. If CPUs 2 and 3 shared a power line, the buddy
assignments would be a little different:
In each case, the purpose is to define an easy path by which an
alternative, power-independent CPU can be chosen as the new home for a small task.
With that structure in place, the actual changes to the scheduler are quite
small. The normal load-balancing code is unaffected for the simple reason
that small tasks, since they are more likely than not to be sleeping when
the load balancer runs, tend not to be moved in the balancing process.
Instead, the scheduler will, whenever a known small task is awakened,
consider whether that task should be moved from its current CPU to the
buddy CPU. If the buddy is sufficiently idle, the task will be moved;
otherwise the normal wakeup logic runs as always. Over time, small tasks
will tend to migrate toward the far end of the buddy chain as long as the
load on those processors does not get too high. They should, thus, end up
"packed" on a relatively small number of power-efficient processors.
Vincent's patch set included some benchmark results showing that throughput with the
modified scheduler is essentially unchanged. Power consumption is a
different story, though; using "cyclictest" as a benchmark, he showed power
consumption at about ⅓ its previous level. The benefits are sure to be
smaller with a real-world workload, but it seems clear that pushing small
tasks toward a small number of CPUs can be a good move. Expect discussion
of approaches like this one to pick up once the per-entity load tracking
patches have found their way into the mainline.
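For readers who prefer code to prose, the wakeup-time decision described above might be modeled roughly as follows. This is a user-space sketch under stated assumptions, not code from Vincent's patches: the structure, the function name, and the 50% threshold standing in for "sufficiently idle" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define NO_BUDDY  (-1)

/* Assumed stand-in for "sufficiently idle"; the real criterion may differ. */
#define BUDDY_IDLE_THRESHOLD_PCT 50

/* Hypothetical per-CPU state, assigned once at initialization time. */
struct cpu_state {
    int buddy;              /* CPU that small tasks may be pushed to */
    unsigned int load_pct;  /* current utilization, 0..100 */
};

/*
 * Choose the CPU for a task that is being awakened on cur_cpu.
 * Only the single buddy is examined, so no list of CPUs is scanned.
 */
static int select_wakeup_cpu(const struct cpu_state cpus[NR_CPUS],
                             int cur_cpu, bool task_is_small)
{
    assert(cur_cpu >= 0 && cur_cpu < NR_CPUS);

    if (!task_is_small)
        return cur_cpu;     /* normal wakeup logic applies */

    int buddy = cpus[cur_cpu].buddy;
    if (buddy == NO_BUDDY)
        return cur_cpu;     /* end of the buddy chain */

    if (cpus[buddy].load_pct < BUDDY_IDLE_THRESHOLD_PCT)
        return buddy;       /* pack the task onto the buddy */

    return cur_cpu;         /* buddy too busy; fall back */
}
```

Each wakeup moves a task at most one hop, which is why small tasks drift toward the end of the chain over time rather than jumping there directly. A buddy table such as 3→2, 2→0, 1→0 (made up here for the example) would take a small task on CPU 3 two wakeups to reach CPU 0.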
Comments (5 posted)
In an article last week, we saw that
the EPOLL_CTL_DISABLE operation proposed by Paton Lewis provides a
way for multithreaded applications that cache information about file
descriptors to safely delete those file descriptors from an epoll interest
list. For the sake of brevity, in the remainder of this article we'll use
the term "the EPOLL_CTL_DISABLE problem" to label the underlying
problem that EPOLL_CTL_DISABLE solves.
This article revisits the EPOLL_CTL_DISABLE story from a
different angle, with the aim of drawing some lessons about the design of
the APIs that the kernel presents to user space. The initial motivation for
pursuing this angle arises from the observation that the
EPOLL_CTL_DISABLE solution has some difficulties of its own. It is
neither intuitive (it relies on some non-obvious details of the epoll
implementation) nor easy to use. Furthermore, the solution is somewhat
limiting, since it forces the programmer to employ the
EPOLLONESHOT flag. Of course, these difficulties arise at least in
part because EPOLL_CTL_DISABLE is designed so as to satisfy one of
the cardinal rules of Linux development: interface changes must not break
existing user-space applications.
If there had been an awareness of the EPOLL_CTL_DISABLE
problem when the epoll API was originally designed, it seems likely that a
better solution would have been built, rather than bolting on
EPOLL_CTL_DISABLE after the fact. Leaving aside the question of
what that solution might have been, there's another interesting question:
could the problem have been foreseen?
One might suppose that predicting the EPOLL_CTL_DISABLE
problem would have been quite difficult. However, the synchronized-state
problem is well known and the epoll API was designed to be thread
friendly. Furthermore, the notion of employing a user-space cache of the
ready list to prevent file descriptor starvation was documented in the epoll(7)
man page (see the sections "Example for Suggested Usage" and "Possible
Pitfalls and Ways to Avoid Them") that was supplied as part of the original
In other words, almost all of the pieces of the puzzle were known when
the epoll API was designed. The one fact whose implications might
not have been clear was the presence of a blocking interface
(epoll_wait()) in the API. One wonders if more review (and
building of test applications) as the epoll API was being designed might
have uncovered the interaction of epoll_wait() with the remaining
well-known pieces of the puzzle, and resulted in a better initial design
that addressed the EPOLL_CTL_DISABLE problem.
So, the first lesson from the EPOLL_CTL_DISABLE story is that
more review is necessary in order to create better API designs (and we'll
see further evidence supporting that claim in a moment). Of course, the
need for more review is a general problem in all aspects of Linux
development. However, the effects of insufficient review can be especially
painful when it comes to API design. The problem is that once an API has
been released, applications come to depend on it, and it becomes at the
very least difficult, or, more likely, impossible to later change the
aspects of the API's behavior that applications depend upon. As a
consequence, a mistake in API design by one kernel developer can create
problems that thousands of user-space developers must live with for many
A second lesson about API design can be found in a comment that Paton made when responding to a
question from Andrew Morton about the design of
EPOLL_CTL_DISABLE. Paton was speculating about whether a call of
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &epoll_event);
could be used to provide the required functionality. The
EPOLL_CTL_DEL operation does not currently use the fourth argument
of epoll_ctl(), and applications should specify it as
NULL (but more on that point in a moment). The idea would be that
"epoll_ctl [EPOLL_CTL_DEL] could set a bit in epoll_event.events
(perhaps called EPOLLNOTREADY)" to notify the caller that the file
descriptor was in use by another thread.
But Paton noted a shortcoming of this approach:
However, this could cause a problem with any legacy code that relies on the
fact that the epoll_ctl epoll_event parameter is ignored for
EPOLL_CTL_DEL. Any such code which passed an invalid pointer for that
parameter would suddenly generate a fault when running on the new kernel
code, even though it worked fine in the past.
In other words, although the EPOLL_CTL_DEL operation doesn't
use the epoll_event argument, the caller is not required to
specify it as NULL. Consequently, existing applications are free
to pass random addresses in epoll_event. If the kernel now started
using the epoll_event argument for EPOLL_CTL_DEL, it
seems likely that some of those applications would break. Even though those
applications might be considered poorly written, that's no justification
for breaking them. Quoting
We care about user-space interfaces to an insane degree. We go to extreme
lengths to maintain even badly designed or unintentional
interfaces. Breaking user programs simply isn't acceptable.
The lesson here is that when an API doesn't use an argument, usually
the right thing to do is for the implementation to include a check that
requires the argument to have a suitable "empty" value, such as
NULL or zero. Failure to do that means that we may later be
prevented from making the kind of API extensions that Paton was talking
about. (We can leave aside the question of whether this particular
extension to the API was the right approach. The point is that the option
to pursue this approach was unavailable.) The kernel-user-space API
provides numerous examples of failure to do this sort of checking.
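As an illustration of the kind of check being advocated, consider the following sketch of an invented interface. Nothing here is part of epoll; demo_ctl(), its operations, and its error choices are hypothetical and serve only to show an unused argument being required to hold an "empty" value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations for an invented API (not epoll). */
enum demo_op { DEMO_OP_ADD, DEMO_OP_DEL };

struct demo_event {
    unsigned int events;
};

/*
 * DEMO_OP_DEL has no use for 'ev', so it insists on NULL. Because no
 * working caller can be passing anything else, a later version remains
 * free to give the argument a meaning for DEL.
 */
static int demo_ctl(enum demo_op op, int fd, const struct demo_event *ev)
{
    if (fd < 0)
        return -EBADF;

    switch (op) {
    case DEMO_OP_ADD:
        if (ev == NULL)
            return -EFAULT;     /* ADD needs the event description */
        return 0;
    case DEMO_OP_DEL:
        if (ev != NULL)
            return -EINVAL;     /* unused argument must be "empty" */
        return 0;
    }
    return -EINVAL;             /* unknown operation */
}
```

The important line is the EINVAL return in the DEL case: it costs almost nothing at design time, and it is exactly the check whose absence closed off the option Paton was considering.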
However, there is yet more life in this story. Although there have been
many examples of system calls that failed to check that "empty" values were
passed for unused arguments, it turns out that
epoll_ctl(EPOLL_CTL_DEL) fails to include the check for another
reason. Quoting the BUGS section of the epoll_ctl() man page:
In kernel versions before 2.6.9, the EPOLL_CTL_DEL operation
required a non-NULL pointer in event [the epoll_event
argument], even though this argument is ignored. Since Linux 2.6.9,
event can be specified as NULL when using EPOLL_CTL_DEL.
Applications that need to be portable to kernels before 2.6.9 should
specify a non-NULL pointer in event.
In other words, applications that use EPOLL_CTL_DEL are not
only permitted to pass random values in the epoll_event
argument: if they want to be portable to Linux kernels before 2.6.9 (which
fixed the problem), they are required to pass a pointer to some random,
but valid user-space address. (Of course, most such applications would
simply allocate an unused epoll_event structure and pass a pointer
to that structure.) Here, we're back to the first lesson: more review of
the initial epoll API design would almost certainly have uncovered this
fairly basic design error. (It's this writer's contention that one of the
best ways to conduct that sort of review is by thoroughly documenting the
API, but he admits to a certain bias on this point.)
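In practice, the portable idiom looks something like the sketch below. The helper name epoll_del_portable() is invented for this example; the epoll_ctl() call itself is the standard one, and the dummy structure exists only to supply a valid, non-NULL address.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>

/*
 * Remove fd from the interest list of epfd in a way that also works
 * on kernels that reject a NULL event pointer for EPOLL_CTL_DEL.
 * Returns 0 on success, or -1 with errno set, as epoll_ctl() does.
 */
static int epoll_del_portable(int epfd, int fd)
{
    struct epoll_event unused;

    /*
     * The contents are ignored for EPOLL_CTL_DEL; zeroing the
     * structure simply avoids passing uninitialized memory.
     */
    memset(&unused, 0, sizeof(unused));
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &unused);
}
```

Applications that only need to run on newer kernels can pass NULL instead, but the existence of callers written like this one is precisely what prevents the kernel from assigning a meaning to that argument now.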
Failing to check that unused arguments (or unused pieces of arguments)
have "empty" values can cause subtle problems long after the fact. Anyone
looking for further evidence on that point does not need to go far: the
epoll_ctl() system call provides another example.
Linux 3.5 added a new epoll flag, EPOLLWAKEUP, that can be specified in the
epoll_event.events field passed to epoll_ctl(). The
effect of this flag is to prevent the system from being suspended while
epoll readiness events are pending for the corresponding file
descriptor. Since this flag has a system-wide effect, the caller must have
a capability, CAP_BLOCK_SUSPEND (initially misnamed
In the initial EPOLLWAKEUP implementation, if the caller did
not have the CAP_BLOCK_SUSPEND capability, then
epoll_ctl() returned an error so that the caller was informed of
the problem. However, Jiri Slaby reported
that the new flag caused a regression: an existing program failed because
it was setting formerly unused bits in epoll_event.events when
calling epoll_ctl(). When one of those bits acquired a meaning (as
EPOLLWAKEUP), the call failed because the program lacked the
required capability. The problem of course is that epoll_ctl() has
never checked the flags in epoll_event.events to ensure that the
caller has specified only flag bits that are actually implemented in the
kernel. Consequently, applications were free to pass random garbage in the
When one of those random bits suddenly caused the application to fail,
what should be done? Following the logic outlined above, of course the
answer is that the kernel must change. And that is exactly what happened in
this case. A patch was applied so that if
the EPOLLWAKEUP flag was specified in a call to
epoll_ctl() and the caller did not have the
CAP_BLOCK_SUSPEND capability, then epoll_ctl()
silently ignored the flag instead of returning an error. Of course,
in this case, the calling application might easily carry on, unaware that
the request for EPOLLWAKEUP semantics had been ignored.
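The difference between the two behaviors can be shown with a small user-space model. This is a sketch, not the kernel's code: DEMO_EPOLLWAKEUP is a stand-in whose bit position is chosen only for the demonstration, the function names are hypothetical, and the has_block_suspend parameter stands in for the CAP_BLOCK_SUSPEND capability check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for EPOLLWAKEUP; the bit chosen here is for illustration only. */
#define DEMO_EPOLLWAKEUP (UINT32_C(1) << 29)

/*
 * Model of the initial behavior: a request for the flag without the
 * capability is refused, so the caller hears about the problem.
 */
static bool events_allowed_strict(uint32_t events, bool has_block_suspend)
{
    return !(events & DEMO_EPOLLWAKEUP) || has_block_suspend;
}

/*
 * Model of the patched behavior: the flag is silently dropped, so
 * programs passing garbage bits keep working, but a program that
 * really wanted the flag is never told that it did not get it.
 */
static uint32_t filter_events_lenient(uint32_t events, bool has_block_suspend)
{
    if ((events & DEMO_EPOLLWAKEUP) && !has_block_suspend)
        events &= ~DEMO_EPOLLWAKEUP;
    return events;
}
```

A caller that sets every bit in the mask is rejected by the strict model and accepted, minus one bit, by the lenient one; that is the trade-off between the two approaches in miniature.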
One might observe that there is a certain arbitrariness about the
approach taken to dealing with the EPOLLWAKEUP breakage. Taken to
the extreme, this type of logic would say that the kernel can never add new
flags to APIs that didn't hitherto check their bit-mask arguments—and
there is a long list of such system calls (mmap(),
splice(), and timer_settime(), to name just a
few). Nevertheless, new flags are added. So, for example, Linux 2.6.17
added the epoll event flag EPOLLRDHUP, and since no one complained
about a broken application, the flag remained. It seems likely that the
same would have happened for the original implementation of
EPOLLWAKEUP that returned an error when CAP_BLOCK_SUSPEND
was lacking, if someone hadn't chanced to make an error report.
As an aside to the previous point, in cases where someone reports a
regression after an API change has been officially released, there
is a conundrum. On the one hand, there may be old applications that depend
on the previous behavior; on the other hand, newer applications may
already depend on the newly implemented change. At that point, there is no
simple remedy: to fix things almost certainly means that some applications will break.
We can conclude with two observations, one specific, and the other more
general. The specific observation is that, ironically,
EPOLL_CTL_DISABLE itself seems to have had surprisingly little
review before being accepted into the 3.7 merge window. And in fact, now
that more attention has been focused on it, it looks as though the proposed API will see some
changes. So, we have a further, very current, piece of evidence that there
is still insufficient review of kernel-user-space APIs.
More generally, the problem seems to be that—while the kernel
code gets reviewed on many dimensions—it is relatively uncommon for
kernel-user-space APIs to be reviewed on their own merits. The kernel has
maintainers for many subsystems. By now, the time seems ripe for there to
be a kernel-user-space API maintainer—someone whose job it is
to actively review and ack every kernel-user-space API change, and to
ensure that test cases and sufficient documentation are supplied with the
implementation of those changes. Lacking such a maintainer, it seems likely
that we'll see many more cases where kernel developers add badly designed
APIs that cause years
of pain [PDF] for user-space developers.
Comments (29 posted)
As is generally the case when realtime Linux developers get together, the
discussion soon turns to how (and when) to get the remaining pieces of the
realtime patch set into the mainline. That was definitely the case at the 2012
realtime minisummit, which was held October 18 in conjunction with the 14th Real Time
Linux Workshop (RTLWS) in Chapel Hill, North Carolina. Some other topics were
addressed as well, of course, and a lively discussion, which Thomas Gleixner
characterized as "twelve people sitting around a table not agreeing on
anything", ensued. Gleixner's joke was just that, as there was actually a
great deal of agreement around that table.
I unfortunately missed the first hour or so of the
minisummit, so I am using Darren Hart's notes, Gleixner's recap for the
entire workshop on October 19, and some conversations with attendees as the
basis for the report on that part of the meeting.
The first topic was on using Bugzilla to track bugs in the realtime
patches. Hart and
Clark Williams have agreed to shepherd a Bugzilla to help ensure the bugs
have useful information and provide the needed pieces for the developers to
track the problems down. Bugs can now be reported to the kernel Bugzilla using
PREEMPT_RT for the "Tree" field. Doing so will send an email to
developers who have registered their interest with Hart.
Gleixner has "mixed feelings" about it because it
involves "web browsers, mouse clicks and other things developers hate".
Previously, the normal way to report a bug was via the realtime or kernel
mailing lists, but Bugzilla does provide a way to attach large files
(e.g. log files) to bugs, which may prove helpful. The realtime hackers
will know better in a year how well Bugzilla is working out and will report
on it then, he said.
There was general agreement that the development process for realtime is
working well. Currently, Gleixner is maintaining a patch set based on 3.6,
which will be turned over to Steven Rostedt when it stabilizes. Rostedt
then follows the mainline stable releases and is, in effect, the stable
"team" for realtime. Those stable kernels are the ones that users and
distributions generally base their efforts on. In the future, Gleixner has
plans to update his 3.6-rt tree with incremental patches that have already
been merged into other stable realtime kernels (3.0, 3.2, 3.4) to keep it
closer to the mainline 3.6 stable release.
There was some discussion of the long-term
support initiative (LTSI) kernels and what relationship those kernels have
with the realtime stable kernels. The answer is: not much. LTSI plans to
have realtime versions of its kernels, but when Hart suggested
aligning the realtime kernel versions with those of LTSI, it was not
met with much agreement. Gleixner said that the LTSI kernels would likely be
supported for years, "probably decades", which is well beyond the scope of
what the realtime developers are interested in doing.
3.6 softirq changes
One of the topics that came up frequently as part of both the
workshop/minisummit and the extensive hallway/micro-brewery track was
Gleixner's softirq processing changes
released in 3.6-rt1. The locks for the ten different softirq types have been
separated so that the softirqs raised in the context of a thread can be
handled in that thread—without having to handle unrelated softirqs.
This solves a number of problems with softirq handling (victimizing
unrelated threads to process softirqs, configuring separate softirq threads
to get the desired behavior, etc.), but is a big change from the existing
mainline implementation—as well as from previous
realtime patch sets.
In the minisummit, Gleixner emphasized that more testing of the patches is
needed. Networking, which is the most extensive user of softirqs in the
kernel, needs more testing in particular.
But the larger issue is the possibility of eventually eliminating softirqs
in the kernel completely. To that end, each of the kernel's softirq users
was discussed, with an eye toward eliminating the softirq dependency for
both realtime and mainline.
The use of softirqs in the network subsystem is "massive" and even the
network developers are not quite sure why it all works, according to
Gleixner. But, softirqs seem to work fine for Linux networking, though the
definition of "working" is not necessarily realtime friendly. If the kernel can pass the
network throughput tests and fill the links on high-speed test
hardware, then it is considered
to be working. Any alternate solution will have to meet or exceed the
current performance, which may be difficult.
The block subsystem's use of softirqs is mostly legacy code. Something
like 90% of the deferred work has been shifted to workqueues over the
years. Eliminating the rest won't be too difficult, Gleixner said.
The story with tasklets is similar. They should be "easy to get rid of",
he said; it will just be a lot of work. Tasklets are typically used by
legacy drivers and are not on a performance-critical path. Tasklet
handling could be moved to its own thread, Rostedt suggested, but Gleixner
thought it would be better to
eliminate them entirely.
The timer softirq, which is used for the timer wheel (described and
diagrammed in this LWN article), is more
problematic. The timer wheel is mostly used for timeouts in the network
and elsewhere, so it is pretty low priority. It can't run with interrupts
disabled in either the mainline or in the realtime kernel, but it has to run
somewhere, so pushing it off to ksoftirqd is a possibility.
The high-resolution timers softirq is mostly problematic because of POSIX
timers and their signal-delivery semantics. Determining which thread
should be the "victim" to deliver the signal to can be a lengthy process,
so it is not done in the softirq handler in the realtime patches as it is
in mainline. One solution that may be acceptable to mainline developers is
to set a flag in the thread which requested the timer, and allow it to do
all of the messy victim-finding and signal delivery. That would mean that
the thread which requests a POSIX timer pays the price for its semantics.
Williams asked if users were not being advised to avoid signal-based
timers. Gleixner said that he tells users to "use pthreads". But,
"customers aren't always reasonable", Frank Rowand observed. He pointed
out that some he knows of are using floating point in the kernel, and now
that they have hardware floating point want to add that context to what is
saved during context switches. Paul McKenney noted that many processors
have lots of floating point registers which can add "multiple hundreds of
microseconds" to save or restore. Similar problems exist for the
auto-vectorization code that is being added to GCC, which will result in
many more registers needing to be saved.
Back to the softirqs, McKenney said that the read-copy-update (RCU) work
had largely moved to
threads in 3.6, but that not all of the processing moved out of the
softirq. He had tried to completely move out of softirq in a patch a ways
back, but Linus Torvalds "kicked it out immediately". He has some ideas of
ways to address those complaints, though, so eliminating the RCU softirq
should be possible.
Finally, the scheduler softirq does "nothing useful that I can see",
Gleixner said. It mostly consists of heuristics to do load balancing, and
Peter Zijlstra may be amenable to moving it elsewhere. Mike Galbraith
pointed out that the NUMA scheduling
work will make the problem worse, as
management. ARM's big.LITTLE scheduling
could also complicate things, Rowand said.
There is a great deal of interest in getting those
changes into the 3.2 and 3.4 realtime kernels. Later in the meeting,
Rostedt said that he would
create an unstable branch of those kernels to facilitate that. The
modifications are "pretty local", Gleixner said, so it should be fairly
straightforward to backport the changes. In addition, it is unlikely that
backports of other fixes into the mainline stable kernels (which are picked up
by the realtime stable kernels) will touch the changed areas, so the
ongoing maintenance should not be a big burden.
Gleixner said that he is "swamped" by a variety of tasks, including
stabilizing the realtime tree, the softirq split, and a "huge backlog" of
work that needs to be done for the CPU hotplug rework. Part of the latter
was merged for 3.7, but there is lots more to do. Rusty Russell has
offered to help once Gleixner gets the infrastructure in place, so he needs
to "get that out the door". Beyond that, he also spends a lot of time
tracking down bugs found by the Open Source Automation Development Lab
(OSADL) testing and from Red Hat bug
He needs some help from the other realtime kernel developers in order
to move more of the patch set into the mainline.
Those in the room seemed very willing to help. The first step is to go
through all of the realtime patches and work on any that are "halfway
get upstream" first.
One of the top priorities for upstreaming is not a kernel change, but is a
change needed in the GNU C library (glibc). Gleixner noted that the
development process for glibc has gotten a "lot better" recently and that
the new maintainers are doing a "great job". That means that a
longstanding problem with
condvars and priority inheritance may finally be able to be addressed.
When priority inheritance was added to the kernel, Ulrich Drepper wrote the
user-space portion for glibc. He had a solution for the problem of
condvars not being
able to specify that they want to use a priority-inheriting mutex, but that
solution was one that Gleixner and Ingo Molnar didn't like, so nothing was
added to glibc.
Three years ago, Hart presented a solution at the RTLWS in Dresden, but he
was unable to get it into glibc. It is a real problem for users according
to Gleixner and Williams, so Hart's solution (or something derived from it)
should be merged into glibc. Hart said he would put that at the top of his list.
Another area that should be fairly easy to get upstream is the set of changes to the
SLUB allocator to make it work with the realtime code. SLUB developer
Christoph Lameter has done some work to make the core allocator lockless
and for it not to disable interrupts or preemption. Lameter's work was mostly
to support enterprise users on large NUMA systems, but it
should also help make SLUB work better with realtime.
If SLUB can be made to work relatively easily, Gleixner would be quite
willing to drop support for SLAB. The SLOB allocator is targeted at
smaller, embedded systems, including those without an MMU, so it is not
an interesting target. Besides which, SLOB's "performance is terrible",
Rostedt said. During the minisummit, Williams was able to build and boot
SLUB on a realtime system, which "didn't explode right away", Gleixner
reported in the recap. That, coupled with SLUB's better NUMA performance,
may make it a much better target anyway, he said.
Switching to SLUB might also get rid of a whole pile of "intrusive changes"
in the memory allocator code. The realtime memory
management changes will be some of the hardest to sell to the
upstream developers, so any reduction in the size of those patches will be
There are a number of places where drivers call local_irq_save()
and local_irq_restore() that have been changed in the realtime tree
to call *_nort() variants. There are about 25 files that use
those variants, mostly drivers designed for uniprocessor machines that have
never been fixed for multiprocessor systems. No one really cares about
those drivers any more, Gleixner said, so the _nort changes can either go into
mainline or be trivially maintained out of it.
Bit spinlocks (i.e. single bits used as spinlocks) need to be changed to
support realtime, and that can probably be sold because it would add
lockdep coverage. Right now, bit spinlocks are not checked by lockdep,
which is a debugging issue. In converting bit spinlocks to regular
spinlocks, Gleixner said
he found 3-4 locking bugs in the mainline, so it would be beneficial to
have a way to check them.
The problem is that bit spinlocks typically use flag
bits in size-constrained structures (e.g. struct page). But, for
debugging, it will be acceptable to grow those structures when lockdep is
enabled. For realtime, there is a need to just "live with the fact that we
are growing some structures", Gleixner said. There aren't that many bit
spinlocks; two others that he mentioned were the buffer head lock and the
journal head lock.
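The size tradeoff is easier to see in code. What follows is a user-space sketch using C11 atomics, not the kernel's bit_spin_lock() implementation; small_obj, big_obj, dbg_lock, and the LOCK_BIT value are all hypothetical names chosen for illustration. The first structure borrows one bit of an existing flags word as its lock, so it costs no space but leaves nowhere to hang debugging state. The second carries a separate lock object, which is where a checker could keep its bookkeeping, at the price of a larger structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_BIT 0x1UL  /* hypothetical: bit 0 of the flags word is the lock */

/* Bit-spinlock style: the lock lives inside an existing flags word. */
struct small_obj {
    atomic_ulong flags;  /* the remaining bits carry unrelated state */
};

/* Try to take the lock; returns true if the bit was previously clear. */
static inline bool bit_trylock(struct small_obj *o)
{
    unsigned long old = atomic_fetch_or_explicit(&o->flags, LOCK_BIT,
                                                 memory_order_acquire);
    return !(old & LOCK_BIT);
}

static inline void bit_lock(struct small_obj *o)
{
    while (!bit_trylock(o))
        ;  /* spin until the bit is released */
}

static inline void bit_unlock(struct small_obj *o)
{
    unsigned long old = atomic_fetch_and_explicit(&o->flags, ~LOCK_BIT,
                                                  memory_order_release);
    assert(old & LOCK_BIT);  /* unlocking a lock that was not held is a bug */
}

/* Dedicated-lock style: room for debugging state, but the object grows. */
struct dbg_lock {
    atomic_flag locked;
    const char *name;    /* hypothetical slot for checker bookkeeping */
};

struct big_obj {
    atomic_ulong    flags;
    struct dbg_lock lock;
};
```

Comparing sizeof(struct big_obj) with sizeof(struct small_obj) shows the growth that has to be accepted when the lock gets its own storage.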
Hart brought up the sleeping spinlock conversion, but Gleixner said that part is
the least of his worries. Most of the annotations needed have already been
merged, as have the header file changes. The patches are "really
unintrusive now", though it is still a big change.
The CPU hotplug rework should eliminate most of the changes required for
realtime once it gets merged. The migrate
enable and disable patches are
self-contained. The high-resolution timers changes and softirq changes can
be fairly self-contained as well. Overall, getting the realtime patches
upstream is "not that far away", Gleixner said, though some thought is
needed on good arguments to get around the "defensive list" of some
To try to ensure they hadn't skipped over anything, Williams put up a January
email from Gleixner with a "to do" list for the realtime patches. There
are some printk()-related issues that were on the list. Gleixner
said those still linger, and it will be "messy" to deal with them.
Zijlstra was at one time opposed to the explicit migrate enable/disable
calls, but that may not be true anymore, Gleixner said. The problem may
be that there will be a question of who uses the code when trying to get
the infrastructure merged. It is a "hen-egg problem", but there needs to
be a way to ensure that processes do not move between CPUs, particularly
In the mainline, spinlocks disable preemption (which disables migration),
but that's not true in realtime. The current mainline behavior is somewhat
"magic", and realtime adds an explicit way to disable migration if that's
truly what's needed. As Paul Gortmaker put it, "making it explicit is an
argument for it in its own right". Gleixner said he would talk to
Zijlstra about a use case and get the code into shape for mainline.
Gortmaker asked if there were any softirq uses that could be completely
eliminated. McKenney believes he can do so for the RCU softirq, but he
does have the reservation that he has never successfully done so in the past.
High-resolution timers and the timer wheel can both move out of softirqs,
Gleixner said, though the former may be tricky. The block layer softirq
work can be moved to workqueues, but the network stack is the big issue.
One possible solution for the networking softirqs is something Rostedt
calls "ENAPI" (even newer API, after "NAPI", the new API). When using
threaded interrupt handlers, the polling that is currently done in the
softirq handler could be done directly in the interrupt handler thread. If
that works, and shows a performance benefit, Gleixner said, the network driver
writers will do much of the work on the conversion.
Wait queues are another problem area. While most are "pretty
straightforward", there are some where the wait queue has a callback that
is called on wakeup for every woken task. Those callbacks could do most
anything, including sleep, which prevents those wait queues from being
converted to use raw locks. Lots of places can be replaced, but not for
"NFS and other places with massive callbacks", Gleixner said.
There are a number of pieces that should be able to go into mainline
largely uncontended. Code to shorten the time that locks are held and
to reduce the interrupt and preempt disabled regions is probably
non-controversial. The _nort annotations may also fall into that
category as they don't hurt things in mainline.
The final item on the day's agenda is a feature that is not part of the
realtime patches, but is of interest to many of the same users: CPU
isolation. That feature, which is known by other names such as "adaptive
NOHZ", would allow users to dedicate one or more cores to user-space
processing by removing all kernel processing from those cores. Currently, nearly all
processing can be moved to other cores using CPU affinity, but there is
still kernel housekeeping (notably CPU time accounting and RCU) that will
run on those CPUs.
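For reference, the affinity half of that picture is available to user space today. This is a minimal sketch for Linux with glibc, using sched_getaffinity() and sched_setaffinity(); the helper names pin_to_cpu() and first_allowed_cpu() are made up for this example. It only restricts where the calling thread runs. It does nothing about the timer tick or the other kernel housekeeping described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU; returns 0 on success, -1 on error. */
static inline int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the first CPU the caller is currently allowed to run on, or -1. */
static inline int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

A CPU-bound worker would call pin_to_cpu() once at startup, while other tasks are given affinity masks that exclude that CPU; choosing the CPU from the existing mask avoids failing in environments where some CPUs are off limits.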
Frédéric Weisbecker has been working on CPU isolation, and he attended the
minisummit at least partly to give an update on the status of the feature.
Accounting for CPU time without the presence of the timer tick is one of
the areas that needs work. Users still want to see things like load
averages reflect the time being spent in user-space processing on an
isolated CPU, but that information is normally updated in the timer tick
In order to isolate the CPU, though, the timer tick needs to be turned
off. In the recap, Gleixner noted that the high-performance computing users
of the feature are less concerned about the time spent in the timer tick
(which is minimal) than about the cache effects from running the code. Knocking
data and instructions out of the cache can result in a 3% performance hit,
which is significant for those workloads.
To account for CPU time usage without the tick, adaptive NOHZ will use the
same hooks that RCU uses to calculate the CPU usage. While the CPU is
isolated, the CPU time will just be calculated, but won't be updated until
the user-space process enters the kernel (e.g. via a system call). The
tick might be restarted when system calls are made, which will eventually
occur so that the CPU-bound process can report its results or get new
data. Restarting the tick would allow
accounting and RCU housekeeping to be done. Weisbecker felt that it
should only be restarted if it was needed for RCU; even that might possibly
be offloaded to a separate CPU.
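The deferred accounting idea can be modeled in a few lines. This is a toy sketch only, assuming two hooks that fire on transitions between kernel and user mode; the tickless_acct structure and the acct_user_enter() and acct_user_exit() functions are hypothetical and do not correspond to the kernel's actual data structures or hook names. Instead of sampling on every tick, the time spent in user space is computed as a delta and folded in only when the process next enters the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU accounting state; not the kernel's data structure. */
struct tickless_acct {
    uint64_t user_ns;       /* user time folded in so far */
    uint64_t entered_user;  /* timestamp of the last kernel-to-user transition */
    int      in_user;       /* nonzero while the CPU runs user-space code */
};

/* Called when the CPU returns to user space: just note the time. */
static inline void acct_user_enter(struct tickless_acct *a, uint64_t now_ns)
{
    assert(!a->in_user);
    a->entered_user = now_ns;
    a->in_user = 1;
}

/* Called on kernel entry (e.g. a system call): fold in the elapsed delta. */
static inline void acct_user_exit(struct tickless_acct *a, uint64_t now_ns)
{
    assert(a->in_user && now_ns >= a->entered_user);
    a->user_ns += now_ns - a->entered_user;
    a->in_user = 0;
}
```

Between the two calls no code runs on the isolated CPU on behalf of accounting, so the reported total is stale until the next kernel entry.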
That led to a discussion of what restrictions there are for using CPU
isolation. There was talk of trying to determine which system calls will
actually require restarting the tick, but that was deemed too
kernel-version specific to be useful. The guidelines will be that anything
other than one thread that makes no system calls on the CPU may result in
less than 100% of the CPU being available. Gleixner suggested adding a
tracepoint that would indicate when the CPU exited isolated mode and why.
McKenney suggested a warning
like "this code needs a more deterministic universe"—to some chuckles
around the table.
Weisbecker and Rostedt plan to work on CPU isolation in the near term, with
an eye toward getting it upstream soon.
And that is pretty much where the realtime minisummit ended. While there
is plenty of work still to do, it is clear that there is increasing
interest in "finishing" the task of getting the realtime changes merged.
Gleixner confessed in the recap session to being tired of maintaining the patch set,
and that feeling is probably shared by others. Given that the mainline has
benefited from the realtime changes already merged, it seems likely
that will continue as more goes upstream. The trick will be in convincing
the other kernel hackers of that.
[ I would like to thank OSADL and RTLWS for supporting my travel to the minisummit and workshop. ]
Patches and updates
- Thomas Gleixner: 3.6.3-rt6 (October 24, 2012)
- Thomas Gleixner: 3.6.2-rt4 (October 20, 2012)
Page editor: Jonathan Corbet