LWN: Comments on "A realtime preemption overview" https://lwn.net/Articles/146861/ This is a special feed containing comments posted to the individual LWN article titled "A realtime preemption overview". A realtime preemption overview https://lwn.net/Articles/647505/ https://lwn.net/Articles/647505/ PaulMcKenney The fact that tasks can block while acquiring a spinlock in -rt kernels is a side-effect of the fact that tasks can be preempted while holding a spinlock. Note "can be preempted" rather than "can block". <p> To see why it is illegal to acquire a spinlock in -rt kernels with interrupts or preemption disabled, consider the following sequence of events: <ol> <li> A task is preempted while holding a spinlock. <li> One task per CPU starts spinning on that same spinlock. <li> At this point, the system is deadlocked. The original task cannot release its spinlock until it gets a CPU, but the other tasks won't let go of their CPUs until they acquire the lock. </ol> Fortunately, lockdep detects this situation. Mon, 08 Jun 2015 17:56:45 +0000 A realtime preemption overview https://lwn.net/Articles/355449/ https://lwn.net/Articles/355449/ ecapoccia <div class="FormattedComment"> #1 Preemptible critical sections<br> <p> "This preemptibility means that you can block while acquiring a spinlock, which in turn means that it is illegal to acquire a spinlock with either preemption or interrupts disabled (the one exception to this rule being the _trylock variants, at least as long as you don't repeatedly invoke them in a tight loop). This also means that spin_lock_irqsave() does -not- disable hardware interrupts when used on a spinlock_t."<br> <p> a) "..while acquiring": one thing that is not clear to me is whether, with the new spinlocks, a process can block only before having obtained the lock (namely while spinning), or also while holding the lock itself, inside the critical section. From the rest of the article (and the title of the subsection) it appears the second interpretation is correct, i.e. "while acquiring" also covers "while holding". Is this correct?<br> <p> b) "it is illegal to acquire a spinlock with either preemption or interrupts disabled": I can see that if a process blocks while holding a spinlock with preemption disabled, the local processor will hang, since it wouldn't be able to run any other kernel path. Is this correct? However, I can't figure out why it is mandatory to have interrupts enabled. Can you clarify?<br> </div> Mon, 05 Oct 2009 15:18:13 +0000 A realtime preemption overview https://lwn.net/Articles/148479/ https://lwn.net/Articles/148479/ PaulMcKenney One could indeed upgrade a single reader at a time, and on a single-CPU system this might make sense (and rumor has it that some RTOSes actually do take this approach). But on (say) a 4-CPU system, you could potentially be handling four readers at a time, and so upgrading a single reader at a time would result in delaying the high-priority waiting writer four times longer than necessary. This added scheduling delay will be unacceptable for some applications. And commodity multi-CPU systems (hyperthreading, dual cores) are becoming readily available at reasonable cost.
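<p> (Returning for a moment to the deadlock sequence PaulMcKenney lays out in the first comment above: a minimal sketch of the illegal pattern on a -rt kernel might look like the following. The lock and function names are made up for the example; this is not code from the patch set itself.)
<pre>
/*
 * Sketch of the pattern that is illegal under PREEMPT_RT: spin_lock() on a
 * spinlock_t may sleep there, so taking one from a non-preemptible region
 * can lead to the deadlock described above (and lockdep will complain).
 */
#include <linux/spinlock.h>
#include <linux/preempt.h>

static DEFINE_SPINLOCK(example_lock);   /* a sleeping lock under -rt */

static void illegal_under_rt(void)
{
        preempt_disable();              /* non-preemptible from here on */
        spin_lock(&example_lock);       /* may block under -rt: not allowed */
        /* ... critical section ... */
        spin_unlock(&example_lock);
        preempt_enable();
}
</pre>
<p> (Code that genuinely must run with preemption disabled would need a raw_spinlock_t, which still spins, instead.)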
A single-chip 4-CPU ARM system was demoed recently, so even the deep embedded CPUs are starting to take this approach.<br> <p> OK, so you could upgrade N readers at a time, where N is the number of CPUs. But suppose we are doing priority inheritance for semaphores, and one of the readers blocks. Then we again have an idle CPU that could be getting readers out of the way of the high-priority waiting writer, again needlessly degrading the waiting writer's scheduling latency.<br> <p> And this latter issue causes problems even on a single-CPU system. :-(<br> <p> Therefore, the current approach is to allow only one reader task at a time to hold a reader-writer lock or semaphore.<br> Fri, 19 Aug 2005 21:43:26 +0000 A realtime preemption overview https://lwn.net/Articles/148096/ https://lwn.net/Articles/148096/ goaty Concerning the multiple-reader problem, if you'll forgive my naivety, isn't it sufficient just to upgrade one reader at a time? This assumes that upgrades are transitive, so if the upgraded reader is waiting on an un-upgraded reader, the upgrade will be passed down the chain. Assuming each lock has a linked list of readers, this seems straightforward. Obviously, once a reader releases the lock, we need to give the next reader an upgrade, as long as there are readers remaining. This should limit the scheduler overhead to be proportional to the number of locks, rather than the number of readers.<br> Thu, 18 Aug 2005 07:35:05 +0000 Thanks for this outstanding article! https://lwn.net/Articles/147967/ https://lwn.net/Articles/147967/ nettings I have no background in kernel programming at all, but being a sound engineer and musician, low latency is an important topic for me. Debugging latency-related problems in my signal chain needs rather more insight than I have, and this article did a lot to remedy that.<br> Hopefully we will read more from Paul McKenney in the future. Thanks also to mingo for sharing his knowledge in this article's comments.<br> <p> Wed, 17 Aug 2005 15:00:18 +0000 A realtime preemption overview https://lwn.net/Articles/147785/ https://lwn.net/Articles/147785/ mingo #1 There's no requirement that critical sections must never be preempted - in fact, we do "preempt" most spinlock sections with interrupt contexts in the stock kernel.<br> <p> A critical section is "critical" only because the data structure affected must be updated transactionally (fully or not at all). How that is achieved, and whether certain types of contexts may or may not execute during such critical sections, is not specified. But I think we mostly agree here.<br> <p> #2 An alias will only cause confusion, and most of the code won't be changed. A wholesale namespace cleanup might eventually be done, but it is not practical right now.<br> <p> #4 More configurability also means more ways to misconfigure, but that's a natural consequence, not a bad thing. There are already some userspace tools emerging that simplify things and boost certain types of processing (such as "audio"). I'm not putting too much effort into formalizing this, though, until the infrastructure has been finalized.<br> Tue, 16 Aug 2005 05:08:29 +0000 A realtime preemption overview https://lwn.net/Articles/147784/ https://lwn.net/Articles/147784/ balbir Thanks for answering all the questions.<br> <p> #1. Pre-empting critical sections sounds like an oxymoron: if the sections are critical, why pre-empt them? Just kidding, I like the idea of a true priority-based pre-emptive scheduler.
I agree that SPL() will disable pre-emption and is opaque, but if you want to disable IRQ pre-emption by other tasks (if there is any), it works well, though it has all the limitations you mentioned.<br> <p> #2. I meant, let's create an alias and encourage people to use the new name instead of spinlock_t, like raw_spinlock_t is an alias for the older spinlock_t. Let's still have spinlock_t and a newer name for it.<br> <p> #4. Your numbers look very good, and the optimizations seem good as well. What scares me now is that correct assignment of task priority will be critical to programming Linux drivers/kernel components in the future. Is this understanding correct? Are there any guidelines that you follow?<br> <p> I will search for your patch and read the code to understand it better.<br> Tue, 16 Aug 2005 04:49:26 +0000 A realtime preemption overview https://lwn.net/Articles/147783/ https://lwn.net/Articles/147783/ balbir typo: pedagogy style ==> pedagogical style<br> Tue, 16 Aug 2005 04:37:57 +0000 A realtime preemption overview https://lwn.net/Articles/147782/ https://lwn.net/Articles/147782/ mingo Re #1, isn't SPL() disabling process/process preemption? That makes the concept unsuitable for the purposes of PREEMPT_RT. SPL() is also a pretty 'opaque' serialization method, only a little better than the 'Big Kernel Lock' that Linux is still trying to get rid of. Thirdly, it only has a limited number of (32? 64?) "priority levels" available, while Linux has thousands of separate types of critical sections. Fourthly, isn't SPL() nested? E.g. blocking up to level 5 means all execution covered by levels 0, 1, 2, 3 and 4 is blocked - while with a spinlock you will only block access to the data structure affected. Such artificial nesting is pretty bad if you want to avoid deadlocks and want to have good SMP scalability. The natural expression of locking hierarchies is not a flat "priority space" as with SPL(), but more like a forest of trees of independent entities, where we want to maintain as much independence as possible.<br> <p> Re #2, there are over 5000 uses of the spin-lock APIs in the Linux kernel; renaming them just to show that they might not spin anymore is not really worth the trouble (and the huge intrusion!) at this point.<br> <p> Re #3, yes, priority inheritance is pretty important when an RT task wants to make use of kernel services.<br> <p> Re #4, what precisely do you mean by "interrupt context" and "process context" in this particular case? The current situation is the following:<br> <p> In the stock kernel there are 3 basic types of contexts: "interrupt context" (non-preemptible), "soft interrupt context" (non-preemptible) and "process context" (preemptible, unless executing in one of the many types of critical sections such as spinlocks).<br> <p> In the PREEMPT_RT kernel there are 4 essential types of contexts: "hard interrupt context", "interrupt context", "soft interrupt context" and "process context". The hard interrupt context is an extremely small shim, in essence - a few tens of lines total, per arch - it just deals with the interrupt controller, masks the IRQ line, acks the controller and returns. The "interrupt context" is a separate per-IRQ interrupt thread, which behaves like a process and is fully preemptible. "Soft interrupt context" is a separate per-softirq system thread too, fully preemptible. "Process context" is what it used to be, and fully preemptible too.
['Fully preemptible' means preemptible for, in essence, everything but the scheduler code and the basic RT-mutex/PI code.]<br> <p> Considering the above description, your comment that "the less we run in interrupt context, the better" is indeed correct: in PREEMPT_RT the hardirq context execution time and complexity have been reduced to an absolute physical minimum. That is fundamentally important for achieving determinism. Everything else is a "thread", as far as the scheduler is concerned, and is as preemptible as possible. You can then use individual thread priorities to make some interrupts more important than others.<br> <p> There is (inevitably) some scheduling overhead due to having more contexts; I've measured it to be 3-5% worst-case [80 thousand irqs/sec], and near zero for the common case [a couple of thousand irqs/sec], which is pretty good.<br> <p> Note that Linux has specific scheduler optimizations that make the introduction and use of system threads cheaper: e.g. the 'lazy-TLB' optimization will skip TLB flushes when switching between system threads, by letting system threads 'inherit' the TLB context of the previous user process. Thus we might not need to do any TLB flushing if we switch back to the same user process - and we don't have to do any TLB flushing if we switch between system threads. So in the TLB-flushing sense, system threads are completely transparent and do not increase the number of TLB flushes.<br> <p> Tue, 16 Aug 2005 04:29:45 +0000 A realtime preemption overview https://lwn.net/Articles/147555/ https://lwn.net/Articles/147555/ balbir Wow! Paul, amazingly good article. I hope you write a book on the Linux kernel and RCU someday; your pedagogy style is very good.<br> <p> I have so many comments (mostly doubts), so I will probably start with a few at a time.<br> <p> It has been mentioned that:<br> <p> 1. Interrupt handlers are now pre-emptible and run in process context.<br> <p> I remember that BSD-based implementations on x86 have always used an SPL() level to keep interrupt handlers pre-emptible based on priority; couldn't we adopt this approach? With interrupts being handled in process context, I see a scheduling overhead for each interrupt being executed; what does this do to interrupt latency?<br> <p> 2. With spinlocks now being able to block, would it be a good idea to create an alias (with a different name) for spinlock, since a spinlock might no longer spin as its name suggests? (I hope I understood this correctly.)<br> <p> 3. I like that priority inheritance is now being implemented in the Linux kernel.<br> <p> 4. There seems to be an inbuilt assumption that the less we run in interrupt context and the more in process context, the better. I agree with this principle, but there might be exceptions (for example point 1, above).<br> <p> Mon, 15 Aug 2005 09:26:43 +0000 A realtime preemption overview https://lwn.net/Articles/147439/ https://lwn.net/Articles/147439/ minty Many thanks Paul and all at LWN. Another high-quality article and an excellent companion to your other recent comparative review.<br> <p> This article fills in the kernel side of things, but getting the best from the 'new-improved' kernel in userspace is often difficult, not because of any kernel problems but because there are few tools for a user to use. In my mind, one of the big plusses of PREEMPT_RT is the number of extra ways we can debug and measure responsiveness.<br> <p> Any chance of a companion article on how one might go about using these facilities?
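<p> (A small example of the kind of recipe being asked about here - using ordinary thread priorities to make some interrupts more important than others, as mingo describes above - is sketched below. It assumes the per-IRQ interrupt threads show up as normal tasks whose PIDs can be looked up; the SCHED_FIFO priority of 80 is purely illustrative.)
<pre>
/*
 * Sketch only: give an already-identified per-IRQ interrupt thread (or any
 * task) a fixed real-time priority.  The PID is passed on the command line;
 * the priority value 80 is an arbitrary choice for the example.  Needs
 * sufficient privileges (root) to succeed.
 */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        struct sched_param sp = { .sched_priority = 80 };
        pid_t pid;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid-of-irq-thread>\n", argv[0]);
                return 1;
        }
        pid = (pid_t)atoi(argv[1]);

        /* SCHED_FIFO: fixed priority, runs until it blocks or is preempted
         * by a higher-priority task */
        if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0) {
                perror("sched_setscheduler");
                return 1;
        }
        return 0;
}
</pre>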
<br> <p> A wishlist:<br> <p> How to time/debug/amend your apps for low latency - what kernel options for instrumentation (already covered briefly), how to use them to assess performance (with examples e.g. from recent JACK-related threads on LKML)<br> <p> How to configure a kernel for real-time responsiveness - Ingo often mentions tips on what might be done to improve performance for specific apps (rather than for general desktop use) e.g. configuring interrupt priorities.<br> <p> Sat, 13 Aug 2005 19:08:00 +0000 A realtime preemption overview https://lwn.net/Articles/147280/ https://lwn.net/Articles/147280/ jrigg My LWN subscription has a job - to save me time. The cost of subscribing is far less than the cost of wasting working time trying to find this stuff elsewhere.<br> Fri, 12 Aug 2005 11:41:41 +0000 A realtime preemption overview https://lwn.net/Articles/147237/ https://lwn.net/Articles/147237/ gravious How? Does it have a job? Could it get my subscription a job too?<br> Fri, 12 Aug 2005 00:46:38 +0000 A realtime preemption overview https://lwn.net/Articles/147106/ https://lwn.net/Articles/147106/ dhess Fantastic! My LWN subscription continues to pay for itself.<br> <p> Thu, 11 Aug 2005 07:31:15 +0000
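<p> (On the first wishlist item above - timing an application for low latency - a minimal sketch of a wakeup-latency measurement loop, the kind of thing dedicated tools such as cyclictest do far more thoroughly, might look like this. The 1 ms period and priority 80 are arbitrary values chosen for the example.)
<pre>
/*
 * Sketch of a minimal wakeup-latency test: sleep until an absolute deadline
 * under SCHED_FIFO and record how late each wakeup was.  Period and priority
 * are arbitrary example values.
 */
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS       1000000L        /* 1 ms */

int main(void)
{
        struct sched_param sp = { .sched_priority = 80 };
        struct timespec next, now;
        long lat_ns, max_ns = 0;
        int i;

        /* real-time priority and locked memory, or the numbers mean little */
        sched_setscheduler(0, SCHED_FIFO, &sp);
        mlockall(MCL_CURRENT | MCL_FUTURE);

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (i = 0; i < 10000; i++) {
                next.tv_nsec += PERIOD_NS;
                while (next.tv_nsec >= NSEC_PER_SEC) {
                        next.tv_nsec -= NSEC_PER_SEC;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                clock_gettime(CLOCK_MONOTONIC, &now);
                lat_ns = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                         + (now.tv_nsec - next.tv_nsec);
                if (lat_ns > max_ns)
                        max_ns = lat_ns;
        }
        printf("worst-case wakeup latency: %ld us\n", max_ns / 1000);
        return 0;
}
</pre>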