
The 2011 realtime minisummit

By Jonathan Corbet
October 23, 2011
The realtime Linux minisummit was held on October 22 in Prague on the third day of the 13th Realtime Linux Workshop. This year's gathering featured a relatively unstructured discussion that does not lend itself to a tidy writeup. What has been attempted here is a report on the most important themes, supplemented with material from some of the workshop talks.

Per-CPU data

One of the biggest challenges for the realtime preemption patch set from the beginning has been the handling of per-CPU data. Throughput-oriented developers are quite fond of per-CPU variables; they facilitate lockless data access and improve the scalability of the code. But working with per-CPU data usually requires disabling preemption, which runs directly counter to what the realtime preemption patch set has been trying to do. It is a classic example of how high throughput and low latency can sometimes be incompatible goals.

The way the realtime patch set has traditionally dealt with per-CPU data is to put a lock around it and treat it like any other shared resource. The get_cpu_var() API has supported that mode fairly well. In more recent times, though, that API has been pushed aside by a new set of functions with names like this_cpu_read() and this_cpu_write(). The value of these functions is speed, especially on the x86 architecture where most of the operations boil down to a single machine instruction. But this new API has also caused some headaches, especially (but not exclusively) in the realtime camp.
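As a rough illustration of the difference between the two styles (the per-CPU counter here is invented for the example, not taken from any actual kernel subsystem):

    #include <linux/percpu.h>

    /* Illustrative per-CPU counter, invented for this example. */
    DEFINE_PER_CPU(unsigned long, my_counter);

    /* Traditional style: get_cpu_var() disables preemption until the
     * matching put_cpu_var(), so there is an explicit critical
     * section around the access. */
    get_cpu_var(my_counter)++;
    put_cpu_var(my_counter);

    /* Newer style: on x86 this compiles down to a single instruction,
     * with no visible critical section at all. */
    this_cpu_inc(my_counter);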

The first problem is that there is no locking mechanism built into this API; it does not even disable preemption. So a value returned from this_cpu_read() may not correspond to the CPU that the calling thread is actually running on by the time it does something with the value it reads. Without some sort of external preemption disable, a sequence of this_cpu_read() followed by this_cpu_write() can write a value back on a different CPU than where the value was read, corrupting the resulting data structure; this kind of problem has already been experienced a few times.
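A minimal sketch of the sort of sequence that can go wrong (using the invented counter from above):

    /* Unsafe without external protection: the thread can be preempted
     * and migrated between the read and the write, so the incremented
     * value may be written back to a different CPU's copy. */
    unsigned long val = this_cpu_read(my_counter);
    /* ... any preemption point here can move us to another CPU ... */
    this_cpu_write(my_counter, val + 1);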

The bigger problem, though, is that the API does not in any way indicate where the critical section involving the per-CPU data lies in the code. So there is no way to put in extra protection for realtime operation, and there is no way to put in any sort of debugging infrastructure to verify that the API is being used correctly. There have already been some heated discussions on the mailing list about these functions, but the simple fact is that they make things faster, so they are hard to argue against. There is real value in the this_cpu API; Ingo Molnar noted that the API is not a problem as such - it is just an incomplete solution.

[Group photo]

Incomplete or not, usage of this API is growing by hundreds of call sites in each kernel release, so something needs to be done. A lot of time was spent talking about the naming of these functions, which is seen as confusing. What does "this CPU" mean in a situation where the calling thread can be migrated at any time? There may be an attempt to change the names to discourage improper use of the API, but that is clearly not a long-term solution.

The first step may be to figure out how to add some instrumentation to the interface that is capable of finding buggy usage. If nothing else, real evidence of bugs will help provide a justification for any changes that need to be made. There was also talk of creating some sort of "local lock" that could be used to delineate per-CPU critical sections, disable migration (and possibly preemption) in those sections and, on realtime, simply take a lock. Along with hopefully meeting everybody's goals, such an interface would allow the proper use of per-CPU data to be verified with the locking checker.
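No such interface exists in the mainline; as a purely hypothetical sketch (every name here is invented for illustration), it might look something like:

    /* Hypothetical local-lock API; all names are invented. */
    DEFINE_LOCAL_LOCK(my_local_lock);

    local_lock(my_local_lock);    /* mainline: disable migration (and
                                   * possibly preemption); realtime:
                                   * acquire a real, sleeping lock */
    this_cpu_inc(my_counter);     /* the critical section is now
                                   * explicit and checkable */
    local_unlock(my_local_lock);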

Meanwhile, the realtime developers have recently come up with a new approach to the management of per-CPU data in general. Rather than disabling preemption while working with such data, they simply disable migration between processors. If (and this is a bit of a big "if") the per-CPU data is properly protected by locking, preventing migration should be sufficient to keep processes from stepping on each other's toes while preserving the ability of high-priority processes to preempt lower-priority ones when the need arises.
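In rough terms, the pattern looks like the sketch below; the per-CPU structure is invented for the example, and migrate_disable() is the realtime tree's primitive for pinning a task to its current CPU while leaving it preemptable.

    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    /* Illustrative per-CPU data protected by its own lock. */
    struct my_pcpu_data {
        spinlock_t lock;        /* becomes a sleeping lock on realtime */
        unsigned long count;
    };
    DEFINE_PER_CPU(struct my_pcpu_data, my_data);

    struct my_pcpu_data *d;

    migrate_disable();          /* stay on this CPU, but allow preemption */
    d = this_cpu_ptr(&my_data); /* safe: we can no longer change CPUs */
    spin_lock(&d->lock);        /* the lock serializes access to the data */
    d->count++;
    spin_unlock(&d->lock);
    migrate_enable();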

The fear that comes with disabling migration is that processes could get stuck and experience long latencies even though there are free CPUs available. If a low-priority process disables migration, then is preempted, it cannot be moved to an available processor; it will simply wait until the preempting process either completes or gets migrated itself. Some work has been done to try to characterize the problem; Paul McKenney talked for a bit about some modeling work he has done in that area. At this point, it is not clear how big the problem really is, but some developers are clearly worried about how this technique may affect certain workloads.

Software interrupts

For a long time, the realtime preemption patch set has split software interrupt ("softirq") handling into a separate thread, making it preemptable like everything else. More recent patches have done away with that split, though, for a couple of reasons: to keep the patches closer to the mainline, and to avoid delaying network processing. The networking layer depends heavily on softirqs to keep the data moving; shifting that processing to a separate thread can add latency in an unwanted place. But there had been a good reason for the initial split: software interrupts can introduce latencies of their own. So there was some talk of bringing the split back in a partial way. Most softirqs could run in a separate thread but, perhaps, the networking softirqs would continue to run as "true" software interrupts. It is not a solution that pleases anybody, but it would make things better in the short term.

There appears to be agreement on the form of the long-term solution: software interrupts simply need to go away altogether. Most of the time, softirqs can be replaced by threaded interrupt handlers; as Thomas Gleixner put it, the networking layer's NAPI driver API is really just a poor man's form of threaded IRQ handling. There is talk of moving to threaded handlers in the storage subsystem as well; that evidently holds out some hope of simplifying the complex SCSI error recovery code. That is not a trivial change in either storage or networking, though; it will take a long time to bring about. Meanwhile, partially-split softirq handling will probably return to the realtime patches.
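The mainline already provides the basic building block in request_threaded_irq(); a minimal sketch of how a driver uses it (the device-specific helpers are invented):

    #include <linux/interrupt.h>

    /* Hard handler: runs in interrupt context, does only enough to
     * quiet the device, then asks for the handler thread to be woken. */
    static irqreturn_t my_hard_handler(int irq, void *dev)
    {
        if (!my_device_interrupted(dev))        /* invented helper */
            return IRQ_NONE;
        my_device_mask_interrupts(dev);         /* invented helper */
        return IRQ_WAKE_THREAD;
    }

    /* Thread handler: runs in a schedulable, preemptable kernel
     * thread - the sort of context that softirq work could move to. */
    static irqreturn_t my_thread_handler(int irq, void *dev)
    {
        /* ... process the device's data at leisure ... */
        return IRQ_HANDLED;
    }

    /* In the driver's setup code: */
    err = request_threaded_irq(irq, my_hard_handler, my_thread_handler,
                               0, "mydev", dev);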

An interesting problem related to threaded softirq handlers is the prevalence of "trylock" operations in the kernel. There are a number of sites where the code will attempt to obtain a lock; if the attempt fails, the softirq handler will stop with the idea that it will try again on the next softirq run. If the handler is running as a thread, though, chances are good that it will be the highest-priority thread, so it will run again immediately. A softirq handler looping in this manner can prevent the scheduling of the thread that actually holds the contended lock, leading to a livelock situation. This problem has been observed a few times on real systems; fixing it may require restructuring the affected code.
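In sketch form (the lock, the softirq number, and the work are all invented), the pattern is:

    /* Trylock pattern that can livelock when run from a
     * high-priority thread; all names are invented. */
    static void my_softirq_action(struct softirq_action *unused)
    {
        if (!spin_trylock(&my_lock)) {
            /* As a true softirq, this backs off until the next
             * softirq run.  As the highest-priority thread, it runs
             * again immediately, and the lock holder may never be
             * scheduled: a livelock. */
            raise_softirq(MY_SOFTIRQ);
            return;
        }
        /* ... do the real work ... */
        spin_unlock(&my_lock);
    }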

Upstreaming

There was a long discussion on how the realtime developers planned to get most or all of the patch set upstream. It has been out of tree for some time now - the original preempt-rt patch was announced almost exactly seven years ago. Vast amounts of code have moved from the realtime tree into the mainline over that time, but some 300 patches still remain out of the mainline. Thomas suggested a couple of times that he has maintained those patches for just about long enough, and does not want or plan to do it forever.

Much of what is in the realtime tree seems to be upstreamable, though possibly requiring some cleanup work first. There are currently about 75 patches queued to go in during the 3.2 merge window, and there is a lot more that could be made ready for 3.3 or 3.4. The current goal is to merge as much code as possible in the next few development cycles; that should serve to reduce the size of the realtime patch set considerably, making the remainder easier to maintain.

There are a few problem areas:

  • Sampling of randomness is disabled in the realtime tree; otherwise contention for the entropy pool causes unwanted latencies. Contribution of entropy seems to be on the decline throughout the kernel anyway; as processors add hardware entropy generators, it's not clear that maintaining the software pool has value.

  • The seqlock API has a lot of functions with no users in the kernel; the realtime patch set could be simplified if those functions were simply removed from the mainline.

  • There are still a lot of places where interrupts are disabled, probably unnecessarily; fixing those will require auditing the code to understand why interrupts were disabled in the first place.

  • Relayfs has problems of its own, but there is only one in-tree user; for now, it will probably just be disabled for realtime.

  • The stop_machine() interface is seen as a pain, and is probably also unnecessary most of the time.

And so on. A number of participants went away from this session with a list of patches to fix up and send upstream; with luck, the out-of-tree realtime patch set will shrink considerably.

The future

Once the current realtime patch set has been merged, is the job done? In the past, shrinking the patch set has been made harder by the simple fact that the developers continue to come up with new stuff. At the moment, though, it seems that there is not a vast amount of new work on the horizon. There are only a couple of significant additions expected in the near future:

  • Deadline scheduling. That patch is in a condition that is "not too bad," but its developer has gotten distracted by new things. It looks like the deadline scheduler may be picked up by a new developer, allowing progress to resume.

  • CPU isolation - the ability to run one or more processors with no clock tick or other overhead. Frédéric Weisbecker continues to work on those patches, but there is a lot to be done and no pressing need to merge them right away. So this is seen as a long-term project.

Once upon a time, the addition of deterministic response to a general-purpose operating system was seen as an impossible task: the resulting kernel would not work well for anybody. While some people were saying that, though, the realtime preemption developers simply got to work; they have now mostly achieved that goal. The realtime work may never be "done" - just like the Linux kernel as a whole is never done - but this project looks like one that has achieved the bulk of its objectives. Your editor has learned not to make predictions about when everything will be merged; suffice to say that the developers themselves appear to be determined to get the job done before too long. So, while there may always be an out-of-tree realtime patch set, chances are it will shrink considerably in the coming development cycles.




The 2011 realtime minisummit

Posted Oct 24, 2011 10:49 UTC (Mon) by abacus (guest, #49001) (6 responses)

Great to read that people are still working on getting the real-time patches upstream. Any chance that names could be added to the picture so that we know who was participating in this minisummit?

Names

Posted Oct 24, 2011 11:08 UTC (Mon) by corbet (editor, #1) (5 responses)

Sorry, I should have done that from the beginning.

Standing, left to right: Jonathan Corbet, Ingo Molnar, John Kacur, Paul McKenney, Peter Zijlstra, Juri Lelli, Paul Gortmaker, Frank Rowand, Thomas Gleixner. Sitting: Tux. Kneeling: Darren Hart, Frédéric Weisbecker, Steven Rostedt.

real-time Photo

Posted Oct 24, 2011 20:10 UTC (Mon) by jkacur (guest, #63) (4 responses)

Was that seriously the best photo you could find of us!!!? That's horrid, half of us have our eyes shut, the other half are looking in different directions. I know we're not a good looking bunch, but that photo is just awful!

real-time Photo

Posted Oct 24, 2011 20:28 UTC (Mon) by corbet (editor, #1) (2 responses)

I didn't have a whole lot of options in my camera, sorry. If you don't cooperate with the picture taker, that's what happens...:)

real-time Photo

Posted Oct 24, 2011 22:06 UTC (Mon) by neilbrown (subscriber, #359) (1 response)

It's not so bad... There is a clear view of Frédéric's watch. What more do we need?

real-time Photo

Posted Oct 25, 2011 7:43 UTC (Tue) by nevets (subscriber, #11875)

And he was very proud of his pink watch too.

real-time Photo

Posted Oct 25, 2011 8:22 UTC (Tue) by nevets (subscriber, #11875)

John, that was after a very productive minisummit and most of us were still suffering from jet-lag. I would say our eyes were closed more often than they were open throughout the day. The picture just reflects the reality of our state.

The 2011 realtime minisummit

Posted Oct 27, 2011 12:07 UTC (Thu) by richard_weinberger (subscriber, #38938)

"Your editor has learned not to make predictions about when everything will be merged" :-)

