The realtime preemption endgame
The point of the realtime preemption project is to enable a general-purpose Linux kernel to provide deterministic response times to high-priority processes. "Realtime" does not (necessarily) mean "fast"; it means knowing for sure that the system can respond to important events within a specific time period. It has often been said that this cannot be done, that the complexity of a full operating system would thwart any attempt to guarantee bounded response times. Of course, it was also said that free software developers could never create a full operating system in the first place. The realtime hackers believe that both claims are equally false, and they have been working to prove it.
One of the long-term realtime features was threaded interrupt handlers. A "hard" interrupt handler can monopolize the CPU for as long as it runs; that can create latencies for other users. Moving interrupt handlers into their own threads, instead, allows them to be scheduled like any other process on the system. Thus, threaded interrupt handlers cannot get in the way of higher-priority processes.
Much of the threaded interrupt handling code moved into the mainline for the 2.6.30 release, but in a somewhat different form. While the threading of interrupt handlers is nearly universal in a realtime kernel, it's an optional (and, thus far, little-used) feature in the mainline, so the APIs had to change somewhat. Realtime interrupt handling has been reworked on top of the mainline threaded interrupt mechanism, but it still has its own twists.
In particular, the kernel can still be configured to force all interrupt handlers into threads. If a given driver explicitly requests a threaded handler, behavior is similar to a non-realtime kernel: the driver's (presumably minimal) "hard" handler still runs in IRQ context, while the real work is done in the handler thread. Drivers which do not request threaded handlers get one anyway, with a special hard handler which masks the interrupt line while the driver's handler runs in a thread. Interrupt handler threads are now per-device, rather than per-IRQ line. All told, the amount of code which is specific to the realtime tree is fairly small now; the bulk of it is in the mainline.
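For reference, the mainline API works roughly as sketched below. Only request_threaded_irq(), the irqreturn_t values, and IRQF_SHARED are real kernel interfaces here; the device structure and the mydev_*() helpers are invented for illustration.

    #include <linux/interrupt.h>

    struct mydev {                          /* hypothetical device */
        void __iomem *regs;
        int irq;
    };

    /* Hard handler: runs in IRQ context; just quiesce the device. */
    static irqreturn_t mydev_hard_handler(int irq, void *dev_id)
    {
        struct mydev *dev = dev_id;

        if (!mydev_irq_pending(dev))    /* hypothetical helper */
            return IRQ_NONE;            /* not our interrupt */
        mydev_mask_irq(dev);            /* hypothetical: mask at the device level */
        return IRQ_WAKE_THREAD;         /* wake the handler thread */
    }

    /* Threaded handler: runs in process context and may sleep. */
    static irqreturn_t mydev_thread_handler(int irq, void *dev_id)
    {
        struct mydev *dev = dev_id;

        mydev_handle_events(dev);       /* hypothetical: the real work */
        mydev_unmask_irq(dev);
        return IRQ_HANDLED;
    }

    /* At initialization time: */
    err = request_threaded_irq(dev->irq, mydev_hard_handler,
                               mydev_thread_handler, IRQF_SHARED,
                               "mydev", dev);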
Software interrupt handling is somewhat different in the realtime tree. Mainline kernels will normally handle software interrupts at convenient moments - context switches or returns to user space from a system call, usually. If the software interrupt load gets too heavy, though, handling is deferred to the per-CPU "ksoftirqd" thread. In the realtime tree (subject to a configuration option), all software interrupt handling goes into ksoftirqd - but now there is a separate thread for each software interrupt type. So each CPU will get a couple of ksoftirqd threads for network processing, one for the block subsystem, one for RCU, one for tasklets, and so on. Software interrupts are also preemptable, though that may not happen very often; they run at realtime priority.
The work which first kicked off the realtime preemption tree was the replacement of spinlocks with sleeping mutexes. The spinlock technique is difficult to square with deterministic latencies; a processor spinning on a lock will wait an arbitrary period of time, depending on what the code running on another CPU is doing. Code holding spinlocks also cannot be preempted; allowing preemption there would cause serious latencies (at best) or deadlocks. So the goal of ensuring bounded response times required the elimination of spinlocks to the greatest extent possible.
Replacing spinlocks throughout the kernel with realtime mutexes solves much of the problem. Threads waiting for a mutex will sleep, freeing the processor for some other task. Threads holding mutexes can be preempted if a higher-priority process comes along. So, if the priorities have been set properly, there should be little in the way of the highest-priority process being able to respond to events at any time. This is the core idea behind the entire realtime preemption concept.
As it happens, though, not all spinlocks can be replaced by mutexes. At the lowest levels of the system, there is still a need for true (or "raw") spinlocks; the locks which are used to implement mutexes are one obvious example. Over the years, a fair amount of effort has gone into the task of figuring out which spinlocks really needed to be "raw" locks. At the code level, the difference was papered over through the use of some rather ugly trickery in the spinlock primitives. Regardless of whether a raw spinlock or a sleeping lock was being used, the code would call spin_lock() to acquire it; the only difference was where the lock was declared.
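Schematically, the old arrangement looked like the fragment below; this is an illustration, not actual kernel code, and the dispatch on the lock's declared type was done with compiler builtins buried in the -rt locking macros.

    /* Two locks of different types... */
    spinlock_t     sleeping_lock;   /* becomes a sleeping rtmutex in -rt */
    raw_spinlock_t raw_lock;        /* remains a true, spinning lock */

    /* ...manipulated with the very same primitives: */
    spin_lock(&sleeping_lock);      /* may sleep on a realtime kernel... */
    spin_unlock(&sleeping_lock);

    spin_lock(&raw_lock);           /* ...while this call always spins */
    spin_unlock(&raw_lock);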
This approach was probably useful during the early development phases where it was often necessary to change the type of specific locks. But ugly compiler trickery which serves to obfuscate the type of lock being used in any specific context seems unlikely to fly when it comes to merger into the mainline. So the realtime hackers have bitten the bullet and split the two types of locks entirely. The replacement of "spinlocks" with mutexes still happens as before, for the simple reason that changing every spinlock call would be a massive, disruptive change across the entire kernel code base. But the "raw" spinlock type, which is used in far fewer places, is more amenable to this kind of change.
The result is a new mutual exclusion primitive, called atomic_spinlock_t, which looks a lot like traditional spinlocks:
    #include <linux/spinlock.h>

    DEFINE_ATOMIC_SPINLOCK(name)
    atomic_spin_lock_init(atomic_spinlock_t *lock);

    void atomic_spin_lock(atomic_spinlock_t *lock);
    void atomic_spin_lock_irqsave(atomic_spinlock_t *lock, long flags);
    void atomic_spin_lock_irq(atomic_spinlock_t *lock);
    void atomic_spin_lock_bh(atomic_spinlock_t *lock);
    int atomic_spin_trylock(atomic_spinlock_t *lock);

    void atomic_spin_unlock(atomic_spinlock_t *lock);
    void atomic_spin_unlock_irqrestore(atomic_spinlock_t *lock, long flags);
    void atomic_spin_unlock_irq(atomic_spinlock_t *lock);
    void atomic_spin_unlock_bh(atomic_spinlock_t *lock);
These new "atomic spinlocks" are used in the scheduler, low-level interrupt handling code, clock-handling, PCI bus management, ACPI subsystem, and in many other places. The change is still large and disruptive - but much less so than changing ordinary "spinlock" users would have been.
One might argue that putting atomic spinlocks back into the kernel will reintroduce the same latency problems that the realtime developers are working to get rid of. There is certainly a risk of that happening, but it can be minimized with due care. Auditing every kernel path which uses spinlocks is clearly not a feasible task, but it is possible to look very closely at the (much smaller) number of code paths using atomic spinlocks. So there can be a reasonable degree of assurance that the remaining atomic spinlocks will not cause the kernel to exceed the latency goals.
(As an aside, Thomas Gleixner is looking for a better name for the atomic_spinlock_t type. Suggest the winning idea, and free beer at the next conference may be your reward.)
Similar changes have been made to a number of other kernel mutual exclusion mechanisms. There is a new atomic_seqlock_t variant on seqlocks for cases where the seqlock writer must not be preempted. The anon_semaphore type mostly appears to be a renaming of semaphores and their related functions; it is a part of the continuing effort to eliminate the use of semaphores in any place where a mutex or completion should be used instead. There is also a rw_anon_semaphore type as a replacement for rw_semaphore.
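For reference, the pattern being preserved here is the mainline seqlock's lockless, retrying reader; presumably the atomic_seqlock_t variant keeps the same interface while the writer's lock stays a true spinlock. A sketch using the mainline seqlock API (the counter and function names are invented):

    #include <linux/seqlock.h>

    static DEFINE_SEQLOCK(stats_lock);
    static u64 packet_count;

    static void stats_update(void)          /* writer side */
    {
        write_seqlock(&stats_lock);
        packet_count++;
        write_sequnlock(&stats_lock);
    }

    static u64 stats_read(void)             /* reader side: no lock, just retry */
    {
        unsigned int seq;
        u64 val;

        do {
            seq = read_seqbegin(&stats_lock);
            val = packet_count;
        } while (read_seqretry(&stats_lock, seq));
        return val;
    }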
Quite a few other realtime-specific changes remain in the -rt tree. The realtime code is incompatible with the SLUB allocator, so only slab is allowed. There is also an interesting problem with kmap_atomic(); this function creates a temporary, per-CPU kernel-space address mapping for a given memory page. Preemption cannot be allowed to happen when an atomic kmap is active; it would be possible for other code to change the mapping before the preempted code tries to use it. In the realtime setting, the performance benefits from atomic kmaps are outweighed by the additional latency they can cause. So, for all practical purposes, kmap_atomic() does not exist in a realtime kernel; calls to kmap_atomic() are mapped to ordinary kmap() calls. And so on.
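The tradeoff is visible in how the two mappings are used; here is a schematic example with the 2.6.31-era interfaces (the wrapper functions and their arguments are invented, only the kmap calls themselves are real):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* kmap_atomic(): fast, but preemption is off until kunmap_atomic(). */
    static void copy_to_page_atomic(struct page *page, const void *buf, size_t len)
    {
        void *vaddr = kmap_atomic(page, KM_USER0);

        memcpy(vaddr, buf, len);
        kunmap_atomic(vaddr, KM_USER0);
    }

    /* kmap(): may sleep, so the holder stays preemptible; this is what a
     * realtime kernel substitutes for kmap_atomic(). */
    static void copy_to_page(struct page *page, const void *buf, size_t len)
    {
        void *vaddr = kmap(page);

        memcpy(vaddr, buf, len);
        kunmap(page);
    }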
As for work which is not yet even in the realtime tree, the first priority would appear to be clear:
    At this point, the remaining BKL-removal work comes down to low-level audits of individual filesystems and drivers; for the most part, it has been pushed out of the core kernel.
Beyond that, of course, there is the little task of getting as much of this code as possible into the mainline kernel. To that end, a proper git tree with a bisectable sequence of patches is being prepared, though that work is not yet complete. There will also be a gathering of realtime Linux developers at the Eleventh Real-Time Linux Workshop this September in Dresden; getting the realtime work into the mainline is expected to be discussed seriously there. As it happens, your editor plans to be in the room; watch this space in late September for an update.
kmap_atomic
Posted Aug 5, 2009 14:56 UTC (Wed) by corbet (editor, #1)

That wouldn't quite do it - you still have to avoid contention for the atomic kmap slots or chaos will result. That generally implies disabling preemption, which is what they're trying to avoid. I suppose one could implement some sort of slot-management layer so that one task's KM_USER0 is different from another's, but that sounds messy...

kmap_atomic
Posted Aug 5, 2009 17:28 UTC (Wed) by iq-0 (subscriber, #36655)

[...] of a thread switch) the whole kmap_atomic and kmap could be folded back into one implementation?

The realtime preemption endgame
Posted Aug 5, 2009 20:39 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624)

Just to make sure I understand, your concern is exemplified by the following sequence of events, right? [...] Or am I missing your point?

The realtime preemption endgame
Posted Aug 5, 2009 21:21 UTC (Wed) by tertium (guest, #56169)

1. High-priority task A takes a mutex and runs for some time without going to sleep.
2. Low-priority task B tries to take the mutex and spins, since A is running. This is where we lose latency, because...
3. ...while B spins, it can't be preempted, so a medium-priority task C can't run on either of the CPUs, even if there is some work for it.

I presume I'm missing something obvious and would be glad to know what it is.

The realtime preemption endgame
Posted Aug 5, 2009 22:17 UTC (Wed) by tertium (guest, #56169)

By the way, in your scenario (if I'm not off the mark again), at step 5 task A wouldn't spin, since B isn't running. Instead, A would block immediately, awakening B and taking the lock.

The realtime preemption endgame
Posted Aug 5, 2009 20:55 UTC (Wed) by glikely (subscriber, #39601)

For those who were confused by this, as I was, this post from tglx may help: http://lwn.net/Articles/324980/

Drivers which request threaded irq handlers actually register 2 handlers in the call to request_threaded_irq(). The primary or "hard" handler runs in the usual IRQ context, and it checks whether the IRQ originated from the device. If so, it masks the IRQ at the device level and returns IRQ_WAKE_THREAD. Then the generic IRQ code can wake the thread that runs the threaded handler.

In the realtime context, it doesn't make sense to run the "hard" handler in a thread, since it is already designed to be low latency and it would just turn around and ask for another thread to be scheduled anyway. Plus, since the hard handler masks the IRQ at the device level, the generic IRQ code can leave the IRQ line unmasked, so that IRQs from other devices on the same line don't have to be inhibited as well.

Not knowing this background, the paragraph from the article sounded like drivers requesting a threaded handler would get the old non-threaded behaviour. Hence my confusion.

The realtime preemption endgame
Posted Aug 5, 2009 20:56 UTC (Wed) by felixfix (subscriber, #242)

The obvious name is raw_spinlock, as opposed to cooked_spinlock. There is great UNIX tradition in this.

The obvious name ...
Posted Aug 6, 2009 9:28 UTC (Thu) by eliezert (subscriber, #35757)

[...] This is what they do, and this will also make it obvious they should be used rarely. I like your cooked / raw proposal, but I think a name that would discourage unnecessary use would be better.

busywait_spinlock
Posted Aug 6, 2009 18:03 UTC (Thu) by cpeterso (guest, #305)

busywait_spinlock is a great name! It accurately describes its behavior and discourages people from using it (without deep thought).

The obvious name ...
Posted Aug 10, 2009 14:44 UTC (Mon) by kjp (guest, #39639)

Of course, I do think it would be better to rename the spinlock itself if it no longer does that.... I guess to facilitate the changeover you could have atomic_lock or busywait_lock (spin is a bit redundant) and mutex, and only very old code would need a typedef for the now-nonexistent 'spinlock' type.

The obvious name ...
Posted Aug 7, 2009 0:40 UTC (Fri) by Kamilion (subscriber, #42576)

And there's always the alarm-clock connotation, someone always manages to mash the snooze button.

The obvious name ...
Posted Aug 7, 2009 1:08 UTC (Fri) by martinfick (subscriber, #4455)

If I understand the naming dilemma, the problem is that the old spin_locks which will not be renamed are in fact becoming sleep locks. If your naming convention is adopted, you would have spin_locks which sleep, and snooze_locks which spin!

So, raw_spinlock or real_spinlock seem more appropriate to me.

The obvious name ...
Posted Aug 10, 2009 18:08 UTC (Mon) by kjp (guest, #39639)

Seemed pretty self-explanatory to me. :)

The realtime preemption endgame
Posted Aug 6, 2009 3:48 UTC (Thu) by tialaramex (subscriber, #21167)

It seems as though proving thoroughly that the BKL does nothing useful (in some particular driver) is nearly impossible, since it interacts so comprehensively with other subsystems, so my first thought was that I should just try removing uses of the BKL from the driver and recompiling. But presumably the LKML would also frown on a patch which says "Drivername: Remove BKL" and offers as justification only the fact that "It works for me".

The realtime preemption endgame
Posted Aug 6, 2009 8:06 UTC (Thu) by johill (subscriber, #25196)

You're expected to carefully review interactions with other subsystems and locking within the driver, identify things relying on the BKL (often it's very little or nothing), and then remove/replace the BKL usage.

The realtime preemption endgame
Posted Aug 6, 2009 10:07 UTC (Thu) by mb (subscriber, #50428)

If you just remove any locking from a driver, it will most likely just work as before. Locking usually is not such a critical thing in a driver that it immediately breaks things. _But_ it will introduce silent cornercase breakages that you will probably only notice weeks later when some weird coincidence happens. So you'd better make sure you _do_ understand the locking requirements before removing the BKL from a driver. Removing the BKL from part of the code requires deep understanding of that code and its interactions with other BKL holders(!). That is the actual problem.

name games
Posted Aug 10, 2009 13:57 UTC (Mon) by tridge (guest, #26906)

[...] the lock. It should also discourage use :-)

Or grid_lock() ? Too much grid lock and nobody gets to move.

Finally, flint_lock(), because it's very primitive :-)

Cheers, Tridge