
Eliminating rwlocks and IRQF_DISABLED

By Jonathan Corbet
December 1, 2009
Reader-writer spinlocks and interrupt-enabled interrupt handlers both have a long history in the Linux kernel. But both may be nearing the end of their story. This article looks at the push to remove both of these longstanding kernel mechanisms.

Reader-writer spinlocks (rwlocks) behave like ordinary spinlocks, but with some significant exceptions. Any number of readers can hold the lock at any given time; this allows multiple processors to access a shared data structure if none of them are making changes to it. Reader locks are also naturally nestable; a single processor can acquire a given read lock more than once if need be. Writers, instead, require exclusive access; before a write lock can be granted, all read locks must be released, and only one write lock can be held at any given time.
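
To make those semantics concrete, here is a minimal sketch of the rwlock API described above; the lock and the counters it protects are invented purely for illustration.

    #include <linux/spinlock.h>

    static DEFINE_RWLOCK(counters_lock);
    static unsigned long counters[16];

    /* Reader: any number of these can run at the same time. */
    static unsigned long read_counter(int i)
    {
        unsigned long val;

        read_lock(&counters_lock);
        val = counters[i];
        read_unlock(&counters_lock);
        return val;
    }

    /* Writer: waits until every reader has left, then runs alone. */
    static void reset_counters(void)
    {
        int i;

        write_lock(&counters_lock);
        for (i = 0; i < 16; i++)
            counters[i] = 0;
        write_unlock(&counters_lock);
    }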

Rwlocks in Linux are inherently unfair in that readers can stall writers for an arbitrary period of time. New read locks are allowed even if a writer is waiting, so a steady stream of readers can block a writer indefinitely. In practice this problem rarely surfaces, but Nick Piggin has reported a case where the right user-space workload can cause an indefinite system livelock. This is a performance problem for specific users, but it is also a potential denial of service attack vector on many systems. In response, Nick started pondering the challenge of implementing fairer rwlocks which do not create performance regressions.
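
As a purely hypothetical illustration of that starvation (the module and all names below are invented, and it assumes enough reader threads to keep the lock continuously held in read mode), the writer in this sketch can spin forever because new readers keep being admitted ahead of it:

    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/spinlock.h>
    #include <linux/cpumask.h>
    #include <linux/delay.h>

    static DEFINE_RWLOCK(demo_lock);

    static int reader_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            read_lock(&demo_lock);
            udelay(50);    /* overlapping readers keep the lock read-held */
            read_unlock(&demo_lock);
        }
        return 0;
    }

    static int writer_fn(void *unused)
    {
        /* Can spin indefinitely: new readers are admitted while we wait. */
        write_lock(&demo_lock);
        pr_info("writer finally got the lock\n");
        write_unlock(&demo_lock);
        return 0;
    }

    static int __init starve_init(void)
    {
        int cpu;

        /* one reader per CPU, plus a single unlucky writer */
        for_each_online_cpu(cpu)
            kthread_run(reader_fn, NULL, "demo-reader/%d", cpu);
        kthread_run(writer_fn, NULL, "demo-writer");
        return 0;
    }
    module_init(starve_init);
    MODULE_LICENSE("GPL");

(The reader threads are never cleaned up; this is a sketch of the failure mode, not a module anybody should actually load.)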

That is not an easy task. The obvious solution - blocking new readers when a writer gets in line - will not work for the most important rwlock (tasklist_lock) because that lock can be acquired by interrupt handlers. If a processor already holding a read lock on tasklist_lock is interrupted, and the interrupt handler, too, needs that lock, forcing the handler to wait will deadlock the processor. So workable solutions require either allowing nested read locks to be acquired even when writers are waiting, or disabling interrupts whenever tasklist_lock is held. Neither solution is entirely pleasing.
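
To spell the problem out in hypothetical, invented code: the process-context path below reads the task list with interrupts enabled, and the interrupt handler takes the same lock again on the same CPU. Today the nested read_lock() always succeeds; if new readers queued behind a waiting writer, the handler would spin on a lock that cannot be released until the handler itself returns.

    #include <linux/sched.h>        /* tasklist_lock, for_each_process() */
    #include <linux/interrupt.h>

    static int task_count;

    /* Process context: interrupts stay enabled while the lock is read-held. */
    static void count_tasks(void)
    {
        struct task_struct *p;
        int nr = 0;

        read_lock(&tasklist_lock);
        for_each_process(p)
            nr++;                   /* an interrupt can arrive right here */
        read_unlock(&tasklist_lock);
        task_count = nr;
    }

    /* Interrupt handler needing the same lock on the same CPU. */
    static irqreturn_t demo_irq(int irq, void *dev_id)
    {
        struct task_struct *p;
        int nr = 0;

        read_lock(&tasklist_lock);  /* nested reader: fine with unfair
                                     * rwlocks, deadly with writer priority */
        for_each_process(p)
            nr++;
        read_unlock(&tasklist_lock);
        task_count = nr;
        return IRQ_HANDLED;
    }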

Beyond that, there has been a general sentiment toward the removal of rwlocks for some years. The locking primitives themselves are significantly slower than plain spinlocks, so any performance gain from allowing multiple readers must be large enough to make up for that extra cost. In many cases, that gain does not appear to actually exist. So, over time, kernel developers have been changing rwlocks to normal spinlocks or replacing them with read-copy-update mechanisms. Even so, a few hundred rwlocks remain in the kernel. Perhaps it would be better to focus on removing them instead of putting a lot of work into making them more fair.
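
A hedged sketch of the kind of conversion being described, using an invented data structure: a read-mostly list that once sat under an rwlock can often drop the read-side lock entirely in favor of RCU, leaving an ordinary spinlock to serialize the rarer writers.

    #include <linux/rculist.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct item {
        struct list_head node;
        int key;
        struct rcu_head rcu;
    };

    static LIST_HEAD(item_list);
    static DEFINE_SPINLOCK(item_lock);  /* writers only; readers use RCU */

    /* Reader: formerly read_lock(&item_rwlock), now lockless. */
    static bool item_exists(int key)
    {
        struct item *it;
        bool found = false;

        rcu_read_lock();
        list_for_each_entry_rcu(it, &item_list, node) {
            if (it->key == key) {
                found = true;
                break;
            }
        }
        rcu_read_unlock();
        return found;
    }

    static void item_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct item, rcu));
    }

    /* Writer: formerly write_lock(&item_rwlock), now a plain spinlock. */
    static void item_remove(struct item *it)
    {
        spin_lock(&item_lock);
        list_del_rcu(&it->node);
        spin_unlock(&item_lock);
        call_rcu(&it->rcu, item_free_rcu);  /* free after readers drain */
    }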

Almost all of those rwlocks could be turned into spinlocks tomorrow and nobody would ever notice. But tasklist_lock is a bit of a thorny problem; it is acquired in many places in the core kernel and it's not always clear what this lock is protecting. This lock is also taken in a number of critical kernel fast paths, so any change has to be done carefully to avoid performance regressions. For these reasons, kernel developers have generally avoided messing with tasklist_lock.

Even so, it would appear that, over time, a number of the structures protected by tasklist_lock have been shifted to other protection mechanisms. This lock has also been changed in the realtime preemption tree, though that code has not yet made it to the mainline. Seeing all this, Thomas Gleixner decided to try to get rid of this lock, saying "If nobody beats me I'm going to let sed loose on the kernel, lift the task_struct rcu free code from -rt and figure out what explodes." As of this writing, the results of this exercise have not been posted. But Thomas is still active on the mailing list, so one concludes that any explosions experienced cannot have been fatal.

If tasklist_lock can be converted successfully to an ordinary spinlock, the conversion of the remaining rwlocks is likely to happen quickly. Shortly after that, rwlocks may go away altogether, simplifying the set of mutual exclusion primitives in Linux considerably.

IRQF_DISABLED

Meanwhile, a different sort of exclusion happens with interrupt handlers. In the early days of Linux, these handlers were divided into "fast" and "slow" varieties. Fast handlers could be run with other interrupts disabled, but slow handlers needed to have other interrupts enabled. Otherwise, a slow handler (perhaps doing a significant amount of work in the handler itself) could block the processing of more important interrupts, impacting the performance of the system.

Over the years, this distinction has slowly faded away, for a number of reasons. The increase in processor speeds means that even an interrupt handler which does a fair amount of work can be "fast." Hardware has gotten smarter, minimizing the amount of work which absolutely must be done immediately on receipt of the interrupt. The kernel has gained improved mechanisms (threaded interrupt handlers, tasklets, and workqueues) for deferred processing. And the quality of drivers has generally improved. So driver authors generally need not even think about whether their handlers run with interrupts enabled or not.

Those authors still need to make that choice when setting up interrupt handlers, though. Unless the handler is established with the IRQF_DISABLED flag set, it will be run with interrupts enabled. For added fun, handlers for shared interrupts (perhaps the majority on most systems) can never be assured of running with interrupts disabled; other handlers running on the same interrupt line might enable them at any time. So many handlers will be running with interrupts enabled, even though that is not needed.
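
For reference, a hedged sketch (the device names, handlers, and IRQ numbers are all invented) of how that choice looks at request_irq() time: only IRQF_DISABLED asks for local interrupts to be off in the handler, and a handler on a shared line cannot count on that in any case.

    #include <linux/interrupt.h>

    static irqreturn_t fastdev_irq(int irq, void *dev_id)
    {
        /* Requested with IRQF_DISABLED: runs with local interrupts off. */
        return IRQ_HANDLED;
    }

    static irqreturn_t shareddev_irq(int irq, void *dev_id)
    {
        /* Shared line: another handler on the same IRQ may have enabled
         * interrupts before this one runs, whatever flags were passed. */
        return IRQ_HANDLED;
    }

    static int setup_irqs(int fast_irq, int shared_irq, void *dev)
    {
        int ret;

        ret = request_irq(fast_irq, fastdev_irq, IRQF_DISABLED,
                          "fastdev", dev);
        if (ret)
            return ret;
        return request_irq(shared_irq, shareddev_irq, IRQF_SHARED,
                           "shareddev", dev);
    }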

The solution, it would seem, would be to eliminate the IRQF_DISABLED flag and just run all handlers with interrupts disabled. In almost all cases, everything will work just fine. There are just a few situations where interrupt handling still takes too long, or where one interrupt handler depends on interrupts for another device being delivered at any time. Those handlers could be identified and dealt with. "Dealt with" in this case could take a few forms. One would be to equip the driver with a better-written interrupt handler which does not have this problem. Another, related approach would be to move the driver to a threaded handler which, naturally, will run with interrupts enabled. Or, finally, the handler could be set up with a new flag (IRQF_NEEDS_IRQS_ENABLED, perhaps) which would cause it to run with interrupts turned on in the old way.
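
The threaded-handler option mentioned above might look roughly like this sketch built on request_threaded_irq() (merged for 2.6.30); the driver and names are invented. The hard handler only quiets the device and returns IRQ_WAKE_THREAD; the real work then runs in a kernel thread, which naturally executes with interrupts enabled and may sleep.

    #include <linux/interrupt.h>

    /* Hard handler: interrupt context, does the bare minimum. */
    static irqreturn_t demo_quick(int irq, void *dev_id)
    {
        /* read and acknowledge the device's interrupt status here */
        return IRQ_WAKE_THREAD;     /* hand the rest to the thread below */
    }

    /* Threaded handler: process context, interrupts enabled, may sleep. */
    static irqreturn_t demo_thread(int irq, void *dev_id)
    {
        /* the lengthy part of the processing goes here */
        return IRQ_HANDLED;
    }

    static int demo_setup_irq(int irq, void *dev)
    {
        return request_threaded_irq(irq, demo_quick, demo_thread,
                                    0, "demo-dev", dev);
    }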

It's not clear when all this might happen, but it could be that, in the near future, all hard interrupt handlers are expected to run - quickly - with interrupts disabled. Few people will even notice, aside from some maintainers of out-of-tree drivers who will need to remove IRQF_DISABLED from their code. But the kernel as a whole should be faster for it.



Eliminating rwlocks and IRQF_DISABLED

Posted Dec 3, 2009 15:33 UTC (Thu) by johnflux (guest, #58833) [Link]

How does this interact with the realtime kernel stuff? Presumably keeping interrupts disabled for some arbitrary time conflicts with the realtime goals?

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 3, 2009 18:59 UTC (Thu) by adamgundy (subscriber, #5418) [Link]

the realtime branch makes all interrupts threaded (except maybe the timer tick?), hence they all run 'interrupts enabled'.

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 3, 2009 19:01 UTC (Thu) by kjp (guest, #39639) [Link]

I believe the threaded interrupts handle that, by having a 'fast' handler that acks the interrupt while most of the processing runs in a thread.

I'm REALLY looking forward to threaded interrupts. Right now, our network server spends so much time in hard and softirq (napi) that it makes the scheduler make really bad decisions (the scheduler appears to not factor in processor interrupt usage).

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 6, 2009 16:25 UTC (Sun) by kleptog (subscriber, #1183) [Link]

Aha, glad I'm not the only one with this issue. We recently found out that if you have more than 8 CPUs the round-robin IRQ distribution of the IO-APIC is disabled, so that *all* IRQs from a network card (soft and hard) end up on one CPU. As you point out, the scheduler doesn't handle this very well. (It was the first time I saw a CPU spending 90% of its time in kernel space while the other CPUs were almost idle.)

The only response we got from kernel developers was that "round-robin IRQs suck" which completely sidesteps the point that what happens now doesn't work at all, and any perceived suckyness of round-robin IRQs would at least be spread evenly.

Threaded IRQs do seem to be the solution here, I hope they are implemented soon.

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 6, 2009 16:36 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

there is a userspace tool to balance interrupts between cpus, have you tried that?

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 6, 2009 16:47 UTC (Sun) by kleptog (subscriber, #1183) [Link]

Sure, it makes it so the CPU with 90% usage is not always the same one, but jumps around every now and then. That doesn't actually solve the problem; if anything it makes it worse, because then you can't use CPU binding on other processes to avoid them landing on the unlucky CPU.

What I would like is for the IRQs to be distributed over a few CPUs (say 4 CPUs that share the same L3 cache). Anything to avoid all traffic being handled by one CPU.
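
For what it's worth, the mask can be set by hand: the sketch below (the IRQ number 24 and the 0x0f mask are made-up examples) restricts an interrupt to CPUs 0-3 by writing a hex cpumask to /proc/irq/<n>/smp_affinity. Whether deliveries actually rotate within that mask still depends on the APIC behaviour being complained about here.

    #include <stdio.h>

    int main(void)
    {
        /* 0x0f = CPUs 0-3; pick CPUs that share a cache, as suggested above */
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");

        if (!f) {
            perror("/proc/irq/24/smp_affinity");
            return 1;
        }
        fprintf(f, "0f\n");
        fclose(f);
        return 0;
    }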

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 7, 2009 11:26 UTC (Mon) by xav (guest, #18536) [Link]

I don't think it's such a good solution. If all interrupts are related (say, you're receiving data from the network and must process it in an app), distributing the interrupts from CPU to CPU means the application must follow, and with it its cache, so you'll have lots of cache line bouncing, which is expensive.

Round-robin IRQs

Posted Dec 7, 2009 11:56 UTC (Mon) by kleptog (subscriber, #1183) [Link]

Since you cannot run the app on the same CPU as the one receiving the interrupts you're going to get a cache bounce *anyway*, right?

But it's worse than that, it's not *an* app, there are several apps which all need to see the same data and since they are running on different CPUs you're going to get a cache bounce for each one anyway.

What you're basically saying is: round-robin IRQ handling is bad because you're sometimes going to get 6 cache-bounces per packet instead of 5. BFD. Without round-robin IRQs, if the amount of traffic doubles we have to tell the client we can't do it.

The irony is, if you buy a more expensive network card you get MSI-X which gets the network card to do the IRQ distribution. You then get the same number of cache bounces as if we programmed the IO-APIC to do round-robin, but the load is more evenly distributed. So we've worked around a software limitation by modifying the hardware!

I'm mostly irked by a built-in feature of the IO-APIC being summarily disabled on machines with 8+ CPUs with the comment "but you don't really want that" while I believe I should at least be given the choice.

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 4, 2009 5:05 UTC (Fri) by naptastic (subscriber, #60139) [Link]

I don't understand how disabling interrupts during interrupt handlers is a good thing. I can see the increase in throughput because there's one less context switch, but isn't the added latency from non-interruptible interrupt handlers much worse?

It's frustrating for someone like me to watch these decisions keep getting made in favor of throughput (which you could get just as easily by overclocking your processor another half percent) and at the expense of latency, which seems to never go down, especially when the costs and benefits are so lopsided.

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 4, 2009 14:19 UTC (Fri) by corbet (editor, #1) [Link]

Bear in mind that much of this work is being done by the principal developers behind the realtime preemption tree. I suspect that they're uninterested in increasing latencies...

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 4, 2009 18:49 UTC (Fri) by naptastic (subscriber, #60139) [Link]

Yeah... what are they seeing that I'm not?

Eliminating rwlocks and IRQF_DISABLED

Posted Dec 5, 2009 23:51 UTC (Sat) by smipi1 (subscriber, #57041) [Link]

Provided interrupt handlers do nothing more than the absolute minimum in interrupt context, this approach will not adversely affect latencies.

Allowing interrupt handlers to be interrupted actually is far worse: it introduces context switching overhead and screws up overall predictability.

If all interrupt handlers simply feed threads (that actually do the associated work) without interruption, latencies can be tuned by design and not by accident. Balancing throughput and latency simply becomes a question of educating your scheduler.
