By Jonathan Corbet
September 28, 2009
Prior to the
Eleventh
Real Time Linux Workshop in Dresden, Germany, a small
group met to discuss the further development of the realtime preemption
work for the Linux kernel. This "mini-summit" covered a wide range of
topics, but was driven by a straightforward set of goals: the continuing
improvement of realtime capabilities in Linux and the merging of the
realtime preemption patches into the mainline.
The participants were:
Stefan Assmann,
Jan Blunck,
Jonathan Corbet,
Sven-Thorsten Dietrich,
Thomas Gleixner,
Darren Hart,
John Kacur,
Paul McKenney,
Ingo Molnar,
Oleg Nesterov,
Steven Rostedt,
Frederic Weisbecker,
Clark Williams, and
Peter Zijlstra.
Together they represented several companies working in the area of realtime
Linux; they brought a lot of experience with customer needs to the table.
The discussion was somewhat unstructured - no formal agenda existed - but
a lot of useful topics were covered.
Threaded interrupt handlers came out early in the discussion. This
feature was merged into the mainline for the 2.6.30 kernel; it is useful in
realtime situations because it allows interrupt handlers to be prioritized
and scheduled like any other process.
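For driver authors, the conversion is reasonably mechanical. Here is a
rough sketch of the pattern (the device structure and function names are
invented for illustration): the hard handler does the minimum necessary
work in interrupt context, then hands off to a handler thread via
request_threaded_irq().

    #include <linux/interrupt.h>

    /* Hypothetical device structure, for illustration only. */
    struct mydev {
        void __iomem *regs;
    };

    /* Hard handler: runs in interrupt context, quiesces the device,
     * then asks for the threaded handler to be run. */
    static irqreturn_t mydev_hardirq(int irq, void *data)
    {
        /* ... acknowledge the interrupt in the hardware ... */
        return IRQ_WAKE_THREAD;
    }

    /* Threaded handler: runs in process context, so it can sleep and
     * can be prioritized and scheduled like any other thread. */
    static irqreturn_t mydev_thread(int irq, void *data)
    {
        /* ... do the real, possibly lengthy, processing ... */
        return IRQ_HANDLED;
    }

    static int mydev_setup_irq(struct mydev *dev, int irq)
    {
        return request_threaded_irq(irq, mydev_hardirq, mydev_thread,
                                    0, "mydev", dev);
    }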
There is one part of the threaded interrupt code which remains outside of
the mainline: the piece which forces all drivers to use threaded
handlers. There are no plans to move that code into the mainline; instead,
it's going to be a matter of persuasion to get driver writers to switch to
the newer way of doing things.
Uptake in the mainline is small so
far; few drivers are actually using this feature. That is beginning to
change, though; the SCSI layer is one example. SCSI has always featured
relatively heavyweight interrupt-handling code and work done in
single-threaded workqueues. This code could move fairly
naturally to process context; the SCSI developers are said to be evaluating
a possible move toward threaded interrupt handlers in the near future.
There have also been suggestions that the network stack
might eventually move in that direction.
System management interrupts (SMIs) are a very different sort of
problem. These interrupts happen at a very low level in the hardware and
are handled by the BIOS code. They often perform hardware monitoring
tasks, from simple thermal monitoring to far more complex operations not
normally associated with BIOS-level software. SMIs are almost entirely
invisible to the operating system and are
generally not subject to control at that level, but they are visible in
some important ways: they can monopolize anywhere from one CPU to every CPU
in the system for a measurable period of time, and they can change
important parameters like the system clock rate. SMIs on some types of
hardware can run for surprisingly long periods; one vendor sells systems
where an SMI for managing ECC memory runs for 200µs every three
minutes. That is long enough to play havoc with any latency deadlines
that the operating system is trying to meet.
Dealing with the SMI problem is a challenge. Some hardware allows SMIs to
be disabled, but it's never clear what the consequences of doing so might
be; if the CPU melts into a puddle of silicon, the resulting latencies will
be even worse than before. Sharing information about SMI problems can be
hard because
many of the people working in this area are working under non-disclosure
agreements with the hardware vendors; this is unfortunate, because some
vendors have done a far better job of avoiding SMI-related latencies than
others. There is a tool now (hwlat_detector)
which can measure SMI latency, so we should start seeing more
publicly-posted information on this issue. And, with luck, vendors will
start to deal with the problem.
Not all hardware latency is caused by SMIs; hypervisors, too, can be a
significant source of latency problems.
A related issue is hardware changes imposed by SMI handlers. If the BIOS
determines that the system is overheating, it may respond by slowing the
clock rate or lowering the processor voltage. On a throughput-oriented
system, that may well be the right thing to do. When latencies are
important, though, slowing the processor could be a mistake - it could
cause applications to miss their deadlines. A better response might be to
simply shut down some processors while keeping others at full speed. What
is really needed here is a way to get this information to user space so
that policy decisions can be made there.
Testing is always an issue in this kind of software development; how
do the developers know that they are really making things better? There
are various test suites out there (RTMB, for example),
but there is no complete and integrated test suite.
There was some talk of trying to move more of the realtime testing code into the
Linux Test Project, but LTP is a huge body of code. So the realtime tests
might remain on their own, but it would be nice, at least, to standardize
test options and output formats to help with the automation of testing.
XML output from test programs is favored by some, but it is fair to say
that XML is not universally loved in this crowd.
The big kernel lock is a perennial outstanding issue for realtime
development for a couple of reasons. One is that, despite having been
pushed out of much of the core code, the BKL can still create long
latencies. The other is that elimination of the BKL would appear to be
part of the price for an eventual merge of sleeping spinlocks into the
mainline kernel. The ability to preempt code running under the BKL was
removed in 2.6.26; this change was directly motivated by a performance
regression caused by the semaphore rewrite, but it was also seen as a way
to help inspire BKL-removal efforts by those who care about latencies.
Much of the hard work in getting rid of the BKL has been done; one big
outstanding piece is the conversion of reiserfs being done by Frederic
Weisbecker. After that, what's left is a lot of grunt work: figuring out
what (if anything) is protected by a lock_kernel() call and putting in
proper locking. The "tip" tree has a branch (rt/kill-the-bkl) where this
work can be coordinated and collected.
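Once that analysis has been done, the conversion itself is usually simple
in form. A sketch, using a hypothetical driver with a private mutex
standing in for whatever locking the analysis actually calls for:

    #include <linux/fs.h>
    #include <linux/mutex.h>
    #include <linux/smp_lock.h>

    /* Before: the BKL covers the driver's state - and who knows
     * what else. */
    static long mydev_ioctl_bkl(struct file *file, unsigned int cmd,
                                unsigned long arg)
    {
        lock_kernel();
        /* ... manipulate driver state ... */
        unlock_kernel();
        return 0;
    }

    /* After: a driver-private mutex covering just that state. */
    static DEFINE_MUTEX(mydev_mutex);

    static long mydev_ioctl(struct file *file, unsigned int cmd,
                            unsigned long arg)
    {
        mutex_lock(&mydev_mutex);
        /* ... manipulate the same driver state ... */
        mutex_unlock(&mydev_mutex);
        return 0;
    }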
Signal delivery is still not an entirely solved problem. Actually,
signals are always a problem, for implementers and users alike. In the
realtime context, signal delivery has some specific latency issues. Signal
delivery to thread groups involves an O(n) algorithm to determine which
specific thread to target; getting through this code can create excessive
latencies. There are also some locks in the delivery path
which interfere with the delivery of signals in realtime interrupt
context.
Everybody agrees that the proper solution is to avoid signals in
applications whenever possible. For example, a timerfd file descriptor can be
used for timer events. But everybody also agrees that applications will
continue to use signals, so they have to be made to work somehow. The
probable solution is to remove much of the work from the immediate signal
delivery path. Signal delivery would just enqueue the information and set
a bit in the task structure; the real work would then be done in the
context of the receiving process. That work might still be expensive, but
it would at least fall to the process which is actually using signals
instead of imposing latencies on random parts of the system.
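As an illustration of the avoidance advice: a periodic timer which might
traditionally be implemented with setitimer() and a SIGALRM handler can,
instead, deliver its expirations over a file descriptor, keeping the
signal delivery path out of the picture entirely. A minimal userspace
sketch:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <unistd.h>

    int main(void)
    {
        /* A timer that fires one second from now, then every
         * second thereafter. */
        struct itimerspec its = {
            .it_value    = { .tv_sec = 1 },
            .it_interval = { .tv_sec = 1 },
        };
        uint64_t expirations;
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);

        if (fd < 0 || timerfd_settime(fd, 0, &its, NULL) < 0)
            return 1;

        /* read() blocks until the timer fires; no signal handler,
         * no async-signal-safety concerns, no delivery-path locks. */
        while (read(fd, &expirations, sizeof(expirations))
               == sizeof(expirations))
            printf("timer fired %llu time(s)\n",
                   (unsigned long long)expirations);
        return 0;
    }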
A side discussion on best practices for efficient realtime
application development yielded a few basic recommendations. The best API
to use, it turns out, is the basic pthread interface; it has been well
optimized over time. SYSV IPC is best avoided.
Cpusets work better than the affinity mechanism
for CPU isolation. In general, developers should realize that getting the
best performance out of a realtime system will require a certain amount of
manual tuning effort. Realtime Linux allows the prioritization of things
like interrupt handlers, but the hard work of figuring out what those
priorities should be can only be done by developers or administrators. It
was acknowledged that the interfaces provided to administrators currently
are not entirely easy to use; it can be hard to identify interrupt threads,
for example. Red Hat's tuna
tool can help in this regard, but more needs to be done.
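As a small example of the pthread-based approach recommended above:
starting a worker thread at an explicit realtime priority takes nothing
beyond the standard pthread calls. The worker function and priority value
here are placeholders, and the calling process needs the appropriate
scheduling privileges:

    #include <pthread.h>
    #include <sched.h>
    #include <string.h>

    /* Hypothetical worker function, for illustration. */
    static void *rt_worker(void *arg)
    {
        /* ... latency-sensitive work ... */
        return NULL;
    }

    /* Create a thread running under SCHED_FIFO at the given
     * priority, using only standard pthread interfaces. */
    static int start_rt_thread(pthread_t *thread, int priority)
    {
        pthread_attr_t attr;
        struct sched_param param;

        memset(&param, 0, sizeof(param));
        param.sched_priority = priority;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &param);

        return pthread_create(thread, &attr, rt_worker, NULL);
    }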
Scalability was a common theme at the meeting. As a general rule,
realtime development has not been focused specifically on scalability
issues. But there is interest in running realtime applications on larger
systems, and that is bringing out problems. The realtime kernel tends to
run into scalability problems before the mainline kernel does; it was
described as an early warning system which highlights issues that the
mainline will be dealing with five years from now. So realtime will tend
to scale more poorly than mainline, but fixing realtime's problems will
eventually benefit mainline users as well.
Darren Hart presented a couple of charts
containing the results of some work by John Stultz
showing the impact of running the realtime kernel on a 24-processor
system. When running in anything other than uniprocessor mode, the
realtime kernel imposes a roughly 50% throughput penalty on a suitably
pathological workload - a severe price.
Interestingly, if the locking changes from the realtime kernel are removed
while leaving all of the other changes, most of the performance loss goes
away. This has led Darren to wonder if there should be a hybrid option
available for situations where hard latency requirements are not present.
In other situations, the realtime kernel generally shows performance
degradation starting at eight CPUs, with sixteen showing unacceptable
overhead.
As it happens, nobody really understands where the performance cost of
realtime locking comes from. It could be in the sleeping spinlocks, but
there is also a lot of suspicion directed at reader-writer locks. In the
mainline kernel, rwlocks allow multiple readers to run in parallel; in the
realtime kernel, instead, only one reader runs at a time. That change is
necessary to make priority inheritance work; priority inheritance in the
presence of multiple readers is a difficult problem. One obvious
conclusion that comes from this observation is that, perhaps, rwlocks
should not implement priority inheritance. There is resistance to that
idea, though; priority inheritance is important in situations where the
highest-priority process should always run as quickly as possible.
The alternative to changing rwlocks is to simply stop using them whenever
possible. The usual way to remove an rwlock is to replace it with a
read-copy-update scheme. Switching to RCU will improve scalability,
arguably at the cost of increasing complexity. But before embarking on any
such effort, it is important to get a handle on how much of the problem
really comes down to rwlocks. Some research will be done in the near
future to better understand the source of the scalability problems.
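The shape of such a conversion is well established elsewhere in the
kernel. In rough sketch form (the data structure is invented for
illustration), readers run without taking any lock at all, while updaters
publish a new version under an ordinary lock and free the old one once all
pre-existing readers are done:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /* Hypothetical shared structure, for illustration. */
    struct item {
        int value;
    };

    static struct item *cur_item;
    static DEFINE_SPINLOCK(item_lock);  /* serializes updaters only */

    /* Read side: no lock taken, so readers run fully in parallel
     * and cannot block (or be blocked by) a realtime task. */
    static int read_value(void)
    {
        struct item *p;
        int v;

        rcu_read_lock();
        p = rcu_dereference(cur_item);
        v = p ? p->value : -1;
        rcu_read_unlock();
        return v;
    }

    /* Update side: publish a new version, wait for pre-existing
     * readers to finish, then free the old version. */
    static void update_value(struct item *new_item)
    {
        struct item *old;

        spin_lock(&item_lock);
        old = cur_item;
        rcu_assign_pointer(cur_item, new_item);
        spin_unlock(&item_lock);

        synchronize_rcu();  /* all old readers are now gone */
        kfree(old);
    }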
Another problem is per-CPU variables, which work by disabling preemption
while a specific variable is being used. Disabling preemption is anathema
to the realtime developers, so per-CPU variables in the realtime tree are
protected by sleeping locks instead. That increases overhead. The problem
is especially acute in slab-level memory allocators, which make extensive
use of per-CPU variables.
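The mainline pattern at issue looks like the sketch below (with a made-up
variable); the preemption-off window opened by get_cpu_var() is precisely
what the realtime tree must replace with a sleeping lock:

    #include <linux/percpu.h>

    /* A hypothetical per-CPU counter. */
    static DEFINE_PER_CPU(unsigned long, my_counter);

    static void count_event(void)
    {
        /* get_cpu_var() disables preemption, guaranteeing that the
         * task cannot migrate while it touches this CPU's copy... */
        get_cpu_var(my_counter)++;
        /* ...and put_cpu_var() enables it again.  On the realtime
         * tree, this preemption-off region becomes a sleeping lock,
         * with the associated overhead. */
        put_cpu_var(my_counter);
    }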
Solutions take a number of forms. There will eventually be a more
realtime-friendly slab allocator, probably a variant of SLQB. Minimizing
the use of per-CPU variables in general makes sense for realtime.
There are also schemes involving the creation of multiple virtual "CPUs" so
that even processes running on the same processor can have their own
"per-CPU" variables. That decreases contention for those variables
considerably at the cost of a slightly higher cache footprint.
Plain old locks can also be a problem; a run of dbench on a 16-processor
system during the workshop showed a 90% reduction in throughput, with the
processors sitting idle half the time. The problem in this case turns out
to be dcache_lock, one of the last global spinlocks remaining in
the kernel. The realtime tree feels the effects of this lock more strongly
for a couple of reasons. One is that threads holding the lock can be
preempted; that leads to longer lock hold times and more context switches. The
other is that sleeping spinlocks are simply more complicated, especially in
the contended slow path of the code. So the locking primitives themselves
require more CPU time.
The solution to this particular problem can only be the elimination of the
global dcache_lock. Nick Piggin has a patch set which does
exactly that, but it has not yet been tested with the realtime tree.
Realtime makes life harder for the scheduler. On a normal system, the
scheduler can optimize for overall system throughput. The constraints
imposed by realtime, though, require the scheduler to respond much more
aggressively to events. So context-switch rates are higher and processes are
much more likely to migrate between CPUs - better for bounded response
times, but worse for throughput. By the time the system scales up to
something relatively large - 128 CPUs, say - there does not seem to be
any practical way to get consistently good decisions from the scheduler.
There is some interest in deadline-oriented schedulers. Adding an
"earliest deadline first" or related scheduler could be useful for
application developers, but nobody seems to feel that a deadline scheduler
would scale better than the current code.
What all this means is that realtime applications running on that kind of system
must be partitioned. When specific CPUs are set aside for specific
processes, the scheduling problem gets simpler. Partitioning requires real
work on the part of the administrator, but it seems unavoidable for larger
systems.
It doesn't help that complete CPU isolation is still hard to accomplish on
a Linux system. Certain sorts of operations, such as workqueue flushes,
can spill into a processor which has been set aside for specific
processes. In general, anything involving interrupts - both device
interrupts and inter-processor interrupts - is a problem when one is trying
to dedicate a CPU to a task. Steering device interrupts to a given
processor is not that hard, though the management tools could use
improvement. Inter-processor interrupts are currently harder to avoid;
code generating IPIs needs to be reviewed and, when possible, modified to
avoid interrupting processors which do not actually have work to do.
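Steering a device interrupt is indeed straightforward; it comes down to
writing a CPU bitmask to the relevant file under /proc. A sketch,
confining a hypothetical IRQ 42 to CPU 0:

    #include <stdio.h>

    int main(void)
    {
        /* The IRQ number and the mask are examples only. */
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

        if (!f)
            return 1;
        fprintf(f, "1\n");  /* bitmask: CPU 0 only */
        return fclose(f) ? 1 : 0;
    }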
Integrating interrupt management into the current cpuset and control group
code would be useful for system administrators. That seems to be a harder
task; Paul Jackson, the original cpuset developer, was strongly opposed to
trying to include interrupt management there. There's a lack of good
abstractions for this kind of administration, though the generic IRQ layer
helps. The opinion at the meeting seemed to be that this was a solvable
problem; if it can be solved for the x86 architecture, the other
architectures will eventually follow.
Going to a fully tickless kernel is also an important step for full CPU
isolation. Some work has recently been done in that direction, but much
remains to be done.
Stable kernel ABI concerns made a surprising appearance. The
"enterprise" Linux offerings from distributors generally include a promise
that the internal kernel interface will not change. The realtime
enterprise distributions have been an exception to this rule, though; the
realtime code is simply in too much flux to make such a promise practical.
This exemption has made life easier for developers working on that code,
naturally; it also has made it possible for customers to get the newest
code much more quickly. There are some concerns that, once the remaining
realtime code is merged into the mainline, the same kernel ABI constraints
may be imposed on realtime distributions. It is not clear that this needs
to happen, though; realtime customers seem to be more interested in keeping
up with newer technology and more willing to put up with large changes.
Future work was discussed briefly. Some of the things remaining to
be done include:
- More SMP work, especially on NUMA systems.
- A realtime idle loop. There is the usual tension there between
preserving the best response time and minimizing power consumption.
- Supporting hardware-assisted operations - things like onboard
cryptographic acceleration hardware.
- Elimination of the timer tick.
- Synchronization of clock events across CPUs. Clock synchronization is
always a challenging task. In this case, it's complicated by the fact
that a certain amount of clock skew can actually be advantageous on an
SMP system. If clock events are strictly synchronized, processors
will be trying to do things at the same time and lock contention will
increase.
A near-future issue is spinlock naming. Merging the sleeping
spinlock code requires a way to distinguish between traditional, spinning
locks and the newer type of lock which might sleep on a realtime system.
The best solution, in theory, is to rename sleeping locks to something like
lock_t, but that would be a huge change affecting many thousands
of files. So the realtime developers have been contemplating a new name
for non-sleeping locks instead. There are far fewer of these locks, so
renaming them to something like atomic_spinlock would be much less
disruptive.
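To make the distinction concrete, here is how code might look under the
contemplated naming. This is purely illustrative; neither the
atomic_spinlock name nor any of the API details were settled at the time:

    /* Purely hypothetical API, using the atomic_spinlock name
     * discussed at the meeting. */

    atomic_spinlock_t hw_lock;  /* always spins; never sleeps;
                                 * usable in hard-IRQ context. */
    spinlock_t data_lock;       /* spins on mainline, but becomes
                                 * a priority-inheriting sleeping
                                 * lock on a realtime kernel. */

    static void example(void)
    {
        atomic_spin_lock(&hw_lock);
        /* ... tiny critical section, no sleeping allowed ... */
        atomic_spin_unlock(&hw_lock);

        spin_lock(&data_lock);
        /* ... may sleep while contended on a realtime kernel ... */
        spin_unlock(&data_lock);
    }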
There was some talk of the best names for "atomic spinlocks"; they could be
"core locks," "little kernel locks," or "dread locks." What really came
out of the discussion, though, is that there was a fair amount of confusion
regarding the two types of locks even in this group, which understands them
better than anybody else. That suggests that some extra care should go
into the naming, with the goal of making the locking semantics clear and
discouraging the use of non-sleeping locks. If the semantics of
spinlock_t change, there is a good argument that the name should
also change. That supports the idea of the massive lock renaming,
regardless of how disruptive it might be.
Whether such a change would be accepted is an open question, though. For
now, both the small renaming and the massive renaming will be prepared for
review. The issue may then be taken to the kernel summit in October for a
final decision.
Tools for realtime developers came up a couple of times. There are
a number of tools for application optimization now, but they are scattered
and not always easy to use. And, it is said, there needs to be a tool with
a graphical interface or a lot of users simply will not take it seriously.
The "perf" tool, part of the kernels "performance events" subsystem, seems
poised to grow into this role. It can handle many of the desired tasks -
latency tracing, for example - now, and new features are being added. The
"tuna" tool may be extended to provide a nicer interface to perf.
User-space tracepoints seem to be high on the list of desirable features
for application developers. Best would be to integrate these tracepoints
with ftrace somehow. Alternatively, user-space trace data could be
collected separately and integrated with kernel trace data at
postprocessing time. That leads to clock synchronization issues, though,
which are never easy to solve.
The final part of the meeting became a series of informal discussions and
hacking efforts. The participants universally saw it as a worthwhile
gathering, with much learned by all. There are some obvious
action items, including more testing to better understand scalability
problems, increasing adoption of threaded interrupt handlers, solving the
spinlock naming problem, improving tools, and more. Plenty of work for all
to do. But your editor has been assured that the work will be done and
merged in the next year - for real this time.