The realtime preemption endgame
The point of the realtime preemption project is to enable a general-purpose Linux kernel to provide deterministic response times to high-priority processes. "Realtime" does not (necessarily) mean "fast"; it means knowing for sure that the system can respond to important events within a specific time period. It has often been said that this cannot be done, that the complexity of a full operating system would thwart any attempt to guarantee bounded response times. Of course, it was also said that free software developers could never create a full operating system in the first place. The realtime hackers believe that both claims are equally false, and they have been working to prove it.
One of the long-term realtime features was threaded interrupt handlers. A "hard" interrupt handler can monopolize the CPU for as long as it runs; that can create latencies for other users. Moving interrupt handlers into their own threads, instead, allows them to be scheduled like any other process on the system. Thus, threaded interrupt handlers cannot get in the way of higher-priority processes.
Much of the threaded interrupt handling code moved into the mainline for the 2.6.30 release, but in a somewhat different form. While the threading of interrupt handlers is nearly universal in a realtime kernel, it's an optional (and, thus far, little-used) feature in the mainline, so the APIs had to change somewhat. Realtime interrupt handling has been reworked on top of the mainline threaded interrupt mechanism, but it still has its own twists.
In particular, the kernel can still be configured to force all interrupt handlers into threads. If a given driver explicitly requests a threaded handler, behavior is similar to a non-realtime kernel: the driver's (presumably minimal) "hard" handler still runs in IRQ context, while the real work is done in the handler thread. Drivers which do not request threaded handlers get one anyway, with a special hard handler which masks the interrupt line while the driver's handler runs in a thread. Interrupt handler threads are now per-device, rather than per-IRQ line. All told, the amount of code which is specific to the realtime tree is fairly small now; the bulk of it is in the mainline.
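For reference, the mainline API works roughly as sketched below. Only request_threaded_irq(), the irqreturn_t values, and IRQF_SHARED are real kernel interfaces here; the device structure and the mydev_*() helpers are invented for illustration.

    #include <linux/interrupt.h>

    struct mydev {                          /* hypothetical device */
        void __iomem *regs;
        int irq;
    };

    /* Hard handler: runs in IRQ context; just quiesce the device. */
    static irqreturn_t mydev_hard_handler(int irq, void *dev_id)
    {
        struct mydev *dev = dev_id;

        if (!mydev_irq_pending(dev))    /* hypothetical helper */
            return IRQ_NONE;            /* not our interrupt */
        mydev_mask_irq(dev);            /* hypothetical: mask at the device level */
        return IRQ_WAKE_THREAD;         /* wake the handler thread */
    }

    /* Threaded handler: runs in process context and may sleep. */
    static irqreturn_t mydev_thread_handler(int irq, void *dev_id)
    {
        struct mydev *dev = dev_id;

        mydev_handle_events(dev);       /* hypothetical: the real work */
        mydev_unmask_irq(dev);
        return IRQ_HANDLED;
    }

    /* At initialization time: */
    err = request_threaded_irq(dev->irq, mydev_hard_handler,
                               mydev_thread_handler, IRQF_SHARED,
                               "mydev", dev);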
Software interrupt handling is somewhat different in the realtime tree. Mainline kernels will normally handle software interrupts at convenient moments - context switches or returns to user space from a system call, usually. If the software interrupt load gets too heavy, though, handling is deferred to the per-CPU "ksoftirqd" thread. In the realtime tree (subject to a configuration option), all software interrupt handling goes into ksoftirqd - but now there is a separate thread for each software interrupt type. So each CPU will get a couple of ksoftirqd threads for network processing, one for the block subsystem, one for RCU, one for tasklets, and so on. Software interrupts are also preemptable, though that may not happen very often; they run at realtime priority.
The work which first kicked off the realtime preemption tree was the replacement of spinlocks with sleeping mutexes. The spinlock technique is difficult to square with deterministic latencies; a processor spinning on a lock will wait an arbitrary period of time, depending on what the code running on another CPU is doing. Code holding spinlocks also cannot be preempted; allowing preemption there would cause serious latencies (at best) or deadlocks. So the goal of ensuring bounded response times required the elimination of spinlocks to the greatest extent possible.
Replacing spinlocks throughout the kernel with realtime mutexes solves much of the problem. Threads waiting for a mutex will sleep, freeing the processor for some other task. Threads holding mutexes can be preempted if a higher-priority process comes along. So, if the priorities have been set properly, there should be little in the way of the highest-priority process being able to respond to events at any time. This is the core idea behind the entire realtime preemption concept.
As it happens, though, not all spinlocks can be replaced by mutexes. At the lowest levels of the system, there is still a need for true (or "raw") spinlocks; the locks which are used to implement mutexes are one obvious example. Over the years, a fair amount of effort has gone into the task of figuring out which spinlocks really needed to be "raw" locks. At the code level, the difference was papered over through the use of some rather ugly trickery in the spinlock primitives. Regardless of whether a raw spinlock or a sleeping lock was being used, the code would call spin_lock() to acquire it; the only difference was where the lock was declared.
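Schematically, the old arrangement looked like the fragment below; this is an illustration, not actual kernel code, and the dispatch on the lock's declared type was done with compiler builtins buried in the -rt locking macros.

    /* Two locks of different types... */
    spinlock_t     sleeping_lock;   /* becomes a sleeping rtmutex in -rt */
    raw_spinlock_t raw_lock;        /* remains a true, spinning lock */

    /* ...manipulated with the very same primitives: */
    spin_lock(&sleeping_lock);      /* may sleep on a realtime kernel... */
    spin_unlock(&sleeping_lock);

    spin_lock(&raw_lock);           /* ...while this call always spins */
    spin_unlock(&raw_lock);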
This approach was probably useful during the early development phases where it was often necessary to change the type of specific locks. But ugly compiler trickery which serves to obfuscate the type of lock being used in any specific context seems unlikely to fly when it comes to merger into the mainline. So the realtime hackers have bitten the bullet and split the two types of locks entirely. The replacement of "spinlocks" with mutexes still happens as before, for the simple reason that changing every spinlock call would be a massive, disruptive change across the entire kernel code base. But the "raw" spinlock type, which is used in far fewer places, is more amenable to this kind of change.
The result is a new mutual exclusion primitive, called atomic_spinlock_t, which looks a lot like traditional spinlocks:
    #include <linux/spinlock.h>

    DEFINE_ATOMIC_SPINLOCK(name)
    atomic_spin_lock_init(atomic_spinlock_t *lock);

    void atomic_spin_lock(atomic_spinlock_t *lock);
    void atomic_spin_lock_irqsave(atomic_spinlock_t *lock, long flags);
    void atomic_spin_lock_irq(atomic_spinlock_t *lock);
    void atomic_spin_lock_bh(atomic_spinlock_t *lock);
    int atomic_spin_trylock(atomic_spinlock_t *lock);

    void atomic_spin_unlock(atomic_spinlock_t *lock);
    void atomic_spin_unlock_irqrestore(atomic_spinlock_t *lock, long flags);
    void atomic_spin_unlock_irq(atomic_spinlock_t *lock);
    void atomic_spin_unlock_bh(atomic_spinlock_t *lock);
These new "atomic spinlocks" are used in the scheduler, low-level interrupt handling code, clock-handling, PCI bus management, ACPI subsystem, and in many other places. The change is still large and disruptive - but much less so than changing ordinary "spinlock" users would have been.
One might argue that putting atomic spinlocks back into the kernel will reintroduce the same latency problems that the realtime developers are working to get rid of. There is certainly a risk of that happening, but it can be minimized with due care. Auditing every kernel path which uses spinlocks is clearly not a feasible task, but it is possible to look very closely at the (much smaller) number of code paths using atomic spinlocks. So there can be a reasonable degree of assurance that the remaining atomic spinlocks will not cause the kernel to exceed the latency goals.
(As an aside, Thomas Gleixner is looking for a better name for the atomic_spinlock_t type. Suggest the winning idea, and free beer at the next conference may be your reward.)
Similar changes have been made to a number of other kernel mutual exclusion mechanisms. There is a new atomic_seqlock_t variant on seqlocks for cases where the seqlock writer must not be preempted. The anon_semaphore type mostly appears to be a renaming of semaphores and their related functions; it is a part of the continuing effort to eliminate the use of semaphores in any place where a mutex or completion should be used instead. There is also a rw_anon_semaphore type as a replacement for rw_semaphore.
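For reference, the pattern being preserved here is the mainline seqlock's lockless, retrying reader; presumably the atomic_seqlock_t variant keeps the same interface while the writer's lock stays a true spinlock. A sketch using the mainline seqlock API (the counter and function names are invented):

    #include <linux/seqlock.h>

    static DEFINE_SEQLOCK(stats_lock);
    static u64 packet_count;

    static void stats_update(void)          /* writer side */
    {
        write_seqlock(&stats_lock);
        packet_count++;
        write_sequnlock(&stats_lock);
    }

    static u64 stats_read(void)             /* reader side: no lock, just retry */
    {
        unsigned int seq;
        u64 val;

        do {
            seq = read_seqbegin(&stats_lock);
            val = packet_count;
        } while (read_seqretry(&stats_lock, seq));
        return val;
    }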
Quite a few other realtime-specific changes remain in the -rt tree. The realtime code is incompatible with the SLUB allocator, so only slab is allowed. There is also an interesting problem with kmap_atomic(); this function creates a temporary, per-CPU kernel-space address mapping for a given memory page. Preemption cannot be allowed to happen when an atomic kmap is active; it would be possible for other code to change the mapping before the preempted code tries to use it. In the realtime setting, the performance benefits from atomic kmaps are outweighed by the additional latency they can cause. So, for all practical purposes, kmap_atomic() does not exist in a realtime kernel; calls to kmap_atomic() are mapped to ordinary kmap() calls. And so on.
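The tradeoff is visible in how the two mappings are used; here is a schematic example with the 2.6.31-era interfaces (the wrapper functions and their arguments are invented, only the kmap calls themselves are real):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* kmap_atomic(): fast, but preemption is off until kunmap_atomic(). */
    static void copy_to_page_atomic(struct page *page, const void *buf, size_t len)
    {
        void *vaddr = kmap_atomic(page, KM_USER0);

        memcpy(vaddr, buf, len);
        kunmap_atomic(vaddr, KM_USER0);
    }

    /* kmap(): may sleep, so the holder stays preemptible; this is what a
     * realtime kernel substitutes for kmap_atomic(). */
    static void copy_to_page(struct page *page, const void *buf, size_t len)
    {
        void *vaddr = kmap(page);

        memcpy(vaddr, buf, len);
        kunmap(page);
    }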
As for work which is not yet even in the realtime tree, the first priority would appear to be clear:
    At this point, the remaining BKL-removal work comes down to low-level audits of individual filesystems and drivers; for the most part, it has been pushed out of the core kernel.
Beyond that, of course, there is the little task of getting as much of this code as possible into the mainline kernel. To that end, a proper git tree with a bisectable sequence of patches is being prepared, though that work is not yet complete. There will also be a gathering of realtime Linux developers at the Eleventh Real-Time Linux Workshop this September in Dresden; getting the realtime work into the mainline is expected to be discussed seriously there. As it happens, your editor plans to be in the room; watch this space in late September for an update.
kmap_atomic
Posted Aug 5, 2009 14:56 UTC (Wed) by corbet (editor, #1)

That wouldn't quite do it - you still have to avoid contention for the atomic kmap slots or chaos will result. That generally implies disabling preemption, which is what they're trying to avoid. I suppose one could implement some sort of slot-management layer so that one task's KM_USER0 is different from another's, but that sounds messy...

kmap_atomic
Posted Aug 5, 2009 17:28 UTC (Wed) by iq-0 (subscriber, #36655)

[...] of a thread switch) the whole kmap_atomic and kmap could be folded back into one implementation?

The realtime preemption endgame
Posted Aug 5, 2009 20:39 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624)

Just to make sure I understand, your concern is exemplified by the following sequence of events, right? [...] Or am I missing your point?

The realtime preemption endgame
Posted Aug 5, 2009 21:21 UTC (Wed) by tertium (guest, #56169)

1. High-priority task A takes a mutex and runs for some time without going to sleep.
2. Low-priority task B tries to take the mutex and spins, since A is running. This is where we lose latency, because...
3. ...while B spins, it can't be preempted, so a medium-priority task C can't run on either of the CPUs, even if there is some work for it.

I presume I'm missing something obvious and would be glad to know what it is.

The realtime preemption endgame
Posted Aug 5, 2009 22:17 UTC (Wed) by tertium (guest, #56169)

By the way, in your scenario (if I'm not off the mark again), at step 5 task A wouldn't spin, since B isn't running. Instead, A would block immediately, awakening B and taking the lock.

The realtime preemption endgame
Posted Aug 5, 2009 20:55 UTC (Wed) by glikely (subscriber, #39601)

For those who were confused by this, as I was, this post from tglx may help: http://lwn.net/Articles/324980/

Drivers which request threaded irq handlers actually register 2 handlers in the call to request_threaded_irq(). The primary or "hard" handler runs in the usual IRQ context, and it checks whether the IRQ originated from the device. If so, it masks the IRQ at the device level and returns IRQ_WAKE_THREAD. Then the generic IRQ code can wake the thread that runs the threaded handler.

In the realtime context, it doesn't make sense to run the "hard" handler in a thread, since it is already designed to be low latency and it would just turn around and ask for another thread to be scheduled anyway. Plus, since the hard handler masks the IRQ at the device level, the generic IRQ code can leave the IRQ line unmasked, so that IRQs from other devices on the same line don't have to be inhibited as well.

Not knowing this background, the paragraph from the article sounded like drivers requesting a threaded handler would get the old non-threaded behaviour. Hence my confusion.

The realtime preemption endgame
Posted Aug 5, 2009 20:56 UTC (Wed) by felixfix (subscriber, #242)

The obvious name is raw_spinlock, as opposed to cooked_spinlock. There is great UNIX tradition in this.

The obvious name ...
Posted Aug 6, 2009 9:28 UTC (Thu) by eliezert (subscriber, #35757)

[...] This is what they do, and this will also make it obvious they should be used rarely. I like your cooked / raw proposal, but I think a name that would discourage unnecessary use would be better.

busywait_spinlock
Posted Aug 6, 2009 18:03 UTC (Thu) by cpeterso (guest, #305)

busywait_spinlock is a great name! It accurately describes its behavior and discourages people from using it (without deep thought).

The obvious name ...
Posted Aug 10, 2009 14:44 UTC (Mon) by kjp (guest, #39639)

Of course, I do think it would be better to rename the spinlock itself if it no longer does that.... I guess to facilitate the changeover you could have atomic_lock or busywait_lock (spin is a bit redundant) and mutex, and only very old code would need a typedef for the now-nonexistent 'spinlock' type.

The obvious name ...
Posted Aug 7, 2009 0:40 UTC (Fri) by Kamilion (subscriber, #42576)

And there's always the alarm-clock connotation, someone always manages to mash the snooze button.

The obvious name ...
Posted Aug 7, 2009 1:08 UTC (Fri) by martinfick (subscriber, #4455)

If I understand the naming dilemma, the problem is that the old spin_locks which will not be renamed are in fact becoming sleep locks. If your naming convention is adopted, you would have spin_locks which sleep, and snooze_locks which spin!

So, raw_spinlock or real_spinlock seem more appropriate to me.

The obvious name ...
Posted Aug 10, 2009 18:08 UTC (Mon) by kjp (guest, #39639)

Seemed pretty self-explanatory to me. :)

The realtime preemption endgame
Posted Aug 6, 2009 3:48 UTC (Thu) by tialaramex (subscriber, #21167)

It seems as though proving thoroughly that the BKL does nothing useful (in some particular driver) is nearly impossible, since it interacts so comprehensively with other subsystems, so my first thought was that I should just try removing uses of the BKL from the driver and recompiling. But presumably the LKML would also frown on a patch which says "Drivername: Remove BKL" and offers as justification only the fact that "It works for me".

The realtime preemption endgame
Posted Aug 6, 2009 8:06 UTC (Thu) by johill (subscriber, #25196)

You're expected to carefully review interactions with other subsystems and locking within the driver, identify things relying on the BKL (often it's very little or nothing), and then remove/replace the BKL usage.

The realtime preemption endgame
Posted Aug 6, 2009 10:07 UTC (Thu) by mb (subscriber, #50428)

If you just remove any locking from a driver, it will most likely just work as before. Locking usually is not such a critical thing in a driver that it immediately breaks things. _But_ it will introduce silent cornercase breakages that you will probably only notice weeks later when some weird coincidence happens. So you'd better make sure you _do_ understand the locking requirements before removing the BKL from a driver. Removing the BKL from part of the code requires deep understanding of that code and its interactions with other BKL holders(!). That is the actual problem.

name games
Posted Aug 10, 2009 13:57 UTC (Mon) by tridge (guest, #26906)

[...] the lock. It should also discourage use :-)

Or grid_lock() ? Too much grid lock and nobody gets to move.

Finally, flint_lock(), because it's very primitive :-)

Cheers, Tridge