By Jonathan Corbet
October 23, 2011
The realtime Linux minisummit was held on October 22 in Prague on the third
day of the 13th Realtime Linux Workshop. This year's gathering featured a
relatively unstructured discussion that does not lend itself to a tidy
writeup. What has been attempted here is a report on the most important
themes, supplemented with material from some of the workshop talks.
Per-CPU data
One of the biggest challenges for the realtime preemption patch set from
the beginning has been the handling of per-CPU data. Throughput-oriented
developers are quite fond of per-CPU variables; they facilitate lockless
data access and improve the scalability of the code. But working with
per-CPU data usually requires disabling preemption, which runs directly
counter to what the realtime preemption patch set has been trying to do.
It is a classic example of how high throughput and low latency can
sometimes be incompatible goals.
The way the realtime patch set has traditionally dealt with per-CPU data is
to put a lock around it and treat it like any other shared resource. The
get_cpu_var() API has supported that mode fairly well. In more
recent times, though, that API has been pushed aside by a new set of
functions with names like this_cpu_read() and
this_cpu_write(). The value of these functions is speed,
especially on the x86 architecture where most of the operations boil down
to a single machine instruction. But this new API has also caused some
headaches, especially (but not exclusively) in the realtime camp.
The first problem is that there is no locking mechanism built into this
API; it does not even disable preemption. So a value returned from
this_cpu_read() may not correspond to the CPU that the calling
thread is actually running on by the time it does something with the value
it reads. Without some sort of external preemption disable, a sequence of
this_cpu_read() followed by this_cpu_write() can write a
value back on a different CPU than where the value was read, corrupting the
resulting data structure; this kind of problem has already been experienced
a few times.
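To make that failure mode concrete, here is a minimal sketch of the pattern; the per-CPU variable and function names are invented for illustration:

    #include <linux/percpu.h>

    /* Hypothetical per-CPU counter, made up for this example. */
    static DEFINE_PER_CPU(unsigned long, demo_stat);

    static void racy_update(unsigned long delta)
    {
        unsigned long val;

        /* Nothing here disables preemption, so the thread can be
           migrated to another CPU between the read and the write... */
        val = this_cpu_read(demo_stat);

        /* ...in which case this write lands in a different CPU's copy
           of demo_stat, clobbering whatever that CPU had stored. */
        this_cpu_write(demo_stat, val + delta);
    }

    static void safe_update(unsigned long delta)
    {
        /* One way to close the window: get_cpu_var() disables
           preemption (and the realtime tree can put a lock behind it)
           for the duration of the update. */
        get_cpu_var(demo_stat) += delta;
        put_cpu_var(demo_stat);
    }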
The bigger problem, though, is that the API does not in any way indicate
where the critical section involving the per-CPU data lies in the code.
So there is no way to put in extra protection for realtime operation, and
there is no way to put in any sort of debugging infrastructure to verify
that the API is being used correctly. There have already been some heated
discussions on the mailing list about these functions, but the simple fact
is that they make things faster, so they are hard to argue against. There
is real value in the this_cpu API; Ingo Molnar noted that the API is not a
problem as such - it is just an incomplete solution.
Incomplete or not, usage of this API is growing by hundreds of call sites
in each kernel release, so something needs to be done. A lot of time was
spent talking about the naming of these functions, which is seen as
confusing. What does "this CPU" mean in a situation where the calling
thread can be migrated at any time? There may be an attempt to change the
names to discourage improper use of the API, but that is clearly not a
long-term solution.
The first step may be to figure out how to add some instrumentation to the
interface that is capable of finding buggy usage. If nothing else, real
evidence of bugs will help provide a justification for any changes that
need to be made. There was also talk of creating some sort of "local lock"
that can be used to delineate per-CPU critical sections, disable migration
(and possibly preemption) in those sections, and, on realtime, to simply
take a lock. Along with hopefully meeting everybody's goals, such an
interface would allow the proper use of per-CPU data to be verified with
the locking checker.
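No such interface exists yet; purely as a sketch of the idea (all of the names below are invented, and CONFIG_PREEMPT_RT_FULL stands in for the realtime tree's full-preemption configuration option), a local lock might look something like this:

    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/preempt.h>

    /* Invented name: a per-CPU lock standing in for the proposed
       "local lock"; initialization is omitted in this sketch. */
    static DEFINE_PER_CPU(spinlock_t, demo_local_lock);

    static void demo_local_lock_acquire(void)
    {
    #ifdef CONFIG_PREEMPT_RT_FULL
        /* Realtime: pin the task to this CPU and take a real (sleeping)
           per-CPU lock; the section stays preemptible and the locking
           checker can see it. */
        migrate_disable();
        spin_lock(this_cpu_ptr(&demo_local_lock));
    #else
        /* Mainline: disabling preemption is enough to delineate the
           per-CPU critical section; no lock is needed. */
        preempt_disable();
    #endif
    }

    static void demo_local_lock_release(void)
    {
    #ifdef CONFIG_PREEMPT_RT_FULL
        spin_unlock(this_cpu_ptr(&demo_local_lock));
        migrate_enable();
    #else
        preempt_enable();
    #endif
    }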
Meanwhile, the realtime developers have recently come up with a new
approach to the management of per-CPU data in general. Rather than
disabling preemption while working with such data, they simply disable
migration between processors instead. If (and this is a bit of a big "if")
the per-CPU data is properly protected by locking, preventing migration
should be sufficient to keep processes from stepping on each other's toes
while preserving the ability for high-priority processes to preempt
lower-priority processes when the need arises.
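Schematically, using the migrate_disable() and migrate_enable() primitives found in the realtime tree and a made-up per-CPU queue, the approach looks like this:

    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/list.h>

    /* Made-up per-CPU work queue, protected by a per-CPU lock. */
    struct demo_queue {
        spinlock_t lock;
        struct list_head items;
    };
    static DEFINE_PER_CPU(struct demo_queue, demo_queues);

    static void demo_enqueue(struct list_head *item)
    {
        struct demo_queue *q;

        /* Pin the task to its current CPU; unlike preempt_disable(),
           this still lets a higher-priority task preempt it. */
        migrate_disable();
        q = this_cpu_ptr(&demo_queues);

        /* The lock is what actually keeps preempting tasks on this
           CPU from corrupting the list. */
        spin_lock(&q->lock);
        list_add_tail(item, &q->items);
        spin_unlock(&q->lock);

        migrate_enable();
    }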
The fear that comes with disabling migration is that processes could get
stuck and experience long latencies even though there are free CPUs
available. If a low-priority process disables migration and is then
preempted, it cannot be moved to an available processor. So that process
will simply wait until the preempting process either completes or gets
migrated itself. Some work has been done to try to characterize the
problem; Paul McKenney talked for a bit about some modeling work he has
done in that area. At this point, it is not clear how big the problem
really is, but some developers are clearly worried about how this
technique may affect some workloads.
Software interrupts
For a long time, the realtime preemption patch set has split software
interrupt ("softirq") handling into a separate thread, making it
preemptable like everything else. More recent patches have done away with
that split, though, for a couple of reasons: to keep the patches closer to
the mainline, and to avoid delaying network processing. The
networking layer depends heavily on softirqs to keep the data moving;
shifting that processing to a separate thread can add latency in an
unwanted place.
But there had been a good reason for the initial split: software interrupts
can introduce latencies of their own. So there was some talk of bringing
the split back in a partial way. Most softirqs could run in a separate
thread but, perhaps, the networking softirqs would continue to run as
"true" software interrupts. It is not a solution that pleases anybody, but
it would make things better in the short term.
There appears to be agreement on the form of the long-term solution:
software interrupts simply need to go away altogether. Most of the time,
softirqs can be replaced by threaded interrupt handlers; as Thomas Gleixner
put it, the networking layer's NAPI driver API is really just a poor man's
form of threaded IRQ handling. There is talk of moving to threaded
handlers in the storage subsystem as well; that evidently holds out some
hope of simplifying the complex SCSI error recovery code. That is not a
trivial change in either storage or networking, though; it will take a long
time to bring about. Meanwhile, partially-split softirq handling will
probably return to the realtime patches.
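For reference, a threaded handler is set up with the mainline request_threaded_irq() interface; the driver fragment below is a made-up, minimal example of the pattern:

    #include <linux/interrupt.h>

    /* Hard IRQ handler: runs with interrupts disabled, does only what
       is needed to quiet the device, then defers to the thread. */
    static irqreturn_t demo_hardirq(int irq, void *dev_id)
    {
        /* Acknowledge/mask the interrupt in the hardware here. */
        return IRQ_WAKE_THREAD;
    }

    /* Threaded handler: runs in process context, fully preemptible,
       so long-running work here no longer adds to overall latency. */
    static irqreturn_t demo_thread_fn(int irq, void *dev_id)
    {
        /* Do the real processing of the device's data here. */
        return IRQ_HANDLED;
    }

    static int demo_setup(int irq, void *dev)
    {
        return request_threaded_irq(irq, demo_hardirq, demo_thread_fn,
                                    0, "demo-device", dev);
    }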
An interesting problem related to threaded softirq handlers is the
prevalence of "trylock" operations in the kernel. There are a number of
sites where the code will attempt to obtain a lock; if the attempt fails,
the handler will give up, with the idea that it will try again on the next
softirq run. If the handler is running as a thread, though, chances are
good that it will be the highest-priority thread, so it will run again
immediately. A softirq handler looping in this manner might prevent the
scheduling of the thread that actually holds the contended lock, leading to
a livelock situation. This problem has been observed a few times on real
systems; fixing it may require a restructuring of the affected code.
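The troublesome pattern, in schematic form (the names are invented), looks like this:

    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static DEFINE_SPINLOCK(demo_lock);

    /* Schematic softirq action using the trylock-and-retry idiom. */
    static void demo_softirq_action(struct softirq_action *unused)
    {
        if (!spin_trylock(&demo_lock)) {
            /* Lock held elsewhere: arrange to be run again on the
               next softirq pass and bail out.  As a true softirq that
               is a short delay; as the highest-priority thread on the
               CPU, the retry happens immediately and the lock holder
               may never get to run: a livelock. */
            return;
        }

        /* ... do the protected work ... */

        spin_unlock(&demo_lock);
    }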
Upstreaming
There was a long discussion on how the realtime developers planned to get
most or all of the patch set upstream. It has been out of tree for some
time now - the original preempt-rt patch was announced almost
exactly seven years ago. Vast amounts of code have moved from the
realtime tree into the mainline over that time, but there are still some
300 patches that remain out of the mainline. Thomas suggested a couple of
times that he has maintained those patches for about long enough, and does
not want or plan to do it forever.
Much of what is in the realtime tree seems to be upstreamable, though
possibly requiring some cleanup work first. There are currently about 75
patches queued to go in during the 3.2 merge window, and there is a lot
more that could be made ready for 3.3 or 3.4. The current goal is to merge
as much code as possible in the next few development cycles; that should
serve to reduce the size of the realtime patch set considerably, making the
remainder easier to maintain.
There are a few problem areas. Sampling of randomness is disabled in the
realtime tree; otherwise contention for the entropy pool causes unwanted
latencies. It seems that contribution of entropy is on the decline
throughout the kernel; as processors add hardware entropy generators, it's
not clear that maintaining the software pool has value. The seqlock API
has a lot of functions with no users in the kernel; the realtime patch set
could be simplified if those functions were simply removed from the
mainline. There are still a lot of places where interrupts are disabled,
probably unnecessarily; fixing those will require auditing the code to
understand why interrupts were disabled in the first place. Relayfs has
problems of its own, but there is only one in-tree user; for now, it will
probably just be disabled for realtime. The stop_machine()
interface is seen as a pain, and is probably also unnecessary most of
the time.
And so on. A number of participants went away from this session with a
list of patches to fix up and send upstream; with luck, the out-of-tree
realtime patch set will shrink considerably.
The future
Once the current realtime patch set has been merged, is the job done? In
the past, shrinking the patch set has been made harder by the simple fact
that the developers continue to come up with new stuff. At the moment,
though, it seems that there is not a vast amount of new work on the
horizon. There are only a couple of significant additions expected in the
near future:
- Deadline scheduling. That patch is in a condition that is "not
too bad," but the developer has gotten distracted with new things. It
looks like the deadline scheduler may be picked up by a new developer
and progress will resume.
- CPU isolation - the ability to run one or more processors with no
clock tick or other overhead. Frédéric Weisbecker continues to
work on those patches, but there is a lot to be done and no
pressing need to merge them right away. So this is seen as a
long-term project.
Once upon a time, the addition of deterministic response to a
general-purpose operating system was seen as an impossible task: the
resulting kernel would not work well for anybody. While some people were
saying that, though, the realtime preemption developers simply got to work;
they have now mostly achieved that goal. The realtime work may never be
"done" - just like the Linux kernel as a whole is never done - but this
project looks like one that has achieved the bulk of its objectives. Your
editor has learned not to make predictions about when everything will be
merged; suffice to say that the developers themselves appear to be
determined to get the job done before too long. So, while there may always
be an out-of-tree realtime patch set, chances are it will shrink
considerably in the coming development cycles.