
Nested bottom-half locking for realtime kernels

By Jonathan Corbet
June 17, 2024
Software-interrupt handlers (also called "bottom halves") have a long history in the Linux kernel; for much of that history, developers have wished that they could go away. One of their unfortunate characteristics is that they can add unexpected latency to the execution of unrelated processes; this problem is felt especially acutely in the realtime-preemption community. The solution adopted there has created problems of its own, though; in response Sebastian Andrzej Siewior is proposing a new locking mechanism for realtime builds of the kernel that may have benefits for non-realtime users as well.

In normal kernel builds, a software-interrupt handler will run, if needed, at the earliest opportunity that the kernel finds; usually, that is immediately after the completion of a hardware-interrupt handler or on return from the kernel to user space. Either way, software-interrupt handling can delay the execution of a process that may have nothing to do with the creation of that interrupt. For most systems, that delay is not usually a problem, but realtime kernels are all about response time; a badly timed software-interrupt handler has the potential to cause a realtime task to miss its deadline.

It turns out that the realtime developers are firmly of the opinion that they have not worked on that project for over two decades just to be thwarted by a software-interrupt handler. So those handlers have been made preemptible like nearly everything else in realtime kernels. That change only addresses part of the problem, though. The kernel makes heavy use of per-CPU variables as a way of avoiding contention between processors; as long as no other CPU can access a memory location, there will be no contention for it, and no need for locking to ensure mutual exclusion. Except, of course, if a software-interrupt handler runs and tries to access the same data.

To avoid such problems, kernel code can call local_bh_disable(), which blocks the execution of software-interrupt handlers until local_bh_enable() is called. A call to local_bh_disable() will also disable preemption and migration for the running task, ensuring that it has sole access to the CPU during its critical section. That solves the problem of racing with software-interrupt handlers (or any other kernel code) on the same CPU, but creates another latency problem for realtime kernels; as long as preemption is disabled, a higher-priority process cannot run on that CPU, once again threatening to increase latency for that higher-priority process.
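The classic pattern can be sketched in a few lines. This is a user-space model only: the two functions below are stand-ins for the real kernel primitives from <linux/bottom_half.h>, and the counter merely models the per-CPU softirq-disable count.

```c
#include <assert.h>

/* Stand-ins for the kernel's local_bh_disable()/local_bh_enable();
 * the counter models the per-CPU softirq-disable nesting count. */
static int bh_disable_count;
static void local_bh_disable(void) { bh_disable_count++; }
static void local_bh_enable(void)  { bh_disable_count--; }

/* A per-CPU statistic touched from both process context and a
 * software-interrupt handler (hypothetical example data). */
static unsigned long percpu_stat;

static void update_stat(void)
{
	local_bh_disable();	/* softirqs, preemption, migration off */
	percpu_stat++;		/* sole access to this CPU's data */
	local_bh_enable();	/* pending softirqs may now run */
}
```

Note that nothing in this pattern says *which* data is being protected; that opacity is exactly the problem the patch set described below addresses.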

The solution to that problem in the realtime tree is to make tasks preemptible while software-interrupt handlers are disabled. But, since a task may be depending on local_bh_disable() to keep other tasks from accessing its per-CPU data, local_bh_disable() takes a per-CPU lock on realtime kernels. As a result, only one task can be running with software interrupts disabled on any given CPU at a time.

But, it almost goes without saying, there is another problem. If a low-priority process enters a local_bh_disable() section, it can be preempted within that section and prevented from executing (and restoring software interrupts) for a long time. That could block a higher-priority process from completing a local_bh_disable() call of its own. It is, in other words, a classic priority-inversion situation. Here, the problem is worsened by the fact that, in all likelihood, there is no actual contention between the two tasks; they are probably calling local_bh_disable() to protect entirely different data.

This situation highlights a problem with disabling software interrupt handling: it is essentially a big lock that provides no indication of what data it is actually protecting. That, in turn, points to a potential solution: replace the big lock with fine-grained locking that protects a limited and well-defined set of data. That is the approach taken by Siewior's patch set. Specifically, it adds a pair of new macros:

    local_lock_nested_bh(local_lock_t *lock);
    local_unlock_nested_bh(local_lock_t *lock);

Using this mechanism requires auditing each local_bh_disable() section, figuring out which data is protected therein, and adding a local_lock_t (a specialized lock that only prevents access from the same CPU) to that data structure. That lock can then be passed to local_lock_nested_bh() to protect only that structure while not blocking concurrent execution by unrelated code.
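As a sketch of that pattern (the structure and field names here are invented for illustration, and the kernel primitives from <linux/local_lock.h> are modeled with trivial user-space stand-ins), the lock is embedded directly in the per-CPU structure it protects:

```c
#include <assert.h>

/* Stand-ins for the kernel's local_lock_t and the new nested-BH
 * lock/unlock macros; in real kernel code these come from
 * <linux/local_lock.h>. */
typedef struct { int locked; } local_lock_t;
static void local_lock_nested_bh(local_lock_t *l)   { l->locked = 1; }
static void local_unlock_nested_bh(local_lock_t *l) { l->locked = 0; }

/* The per-CPU structure gains its own lock, documenting exactly
 * which data the critical section protects (hypothetical example). */
struct softnet_stats {
	local_lock_t lock;
	unsigned long packets;
};

static struct softnet_stats stats;	/* one instance per CPU in real code */

static void account_packet(void)
{
	/* In the kernel this still runs inside a local_bh_disable()
	 * section; the nested lock narrows the protection to this
	 * one structure rather than everything on the CPU. */
	local_lock_nested_bh(&stats.lock);
	stats.packets++;
	local_unlock_nested_bh(&stats.lock);
}
```

The design choice worth noting is that the lock names its data: anyone auditing the code can see what the critical section covers, which the bare local_bh_disable() pattern never revealed.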

Code using this approach must still call local_bh_disable() to prevent access by software-interrupt handlers and to prevent migration to another CPU. But, once all of the local_bh_disable() sections have been audited and adjusted (a job that is reminiscent of the long effort to remove the Big Kernel Lock), it will be possible to remove the lock that realtime kernels take in local_bh_disable(), eliminating a significant source of contention and latency. Benchmark results posted with the patch series show a significant improvement (14.5%) for a networking workload once that lock is removed.

In non-realtime kernels, by contrast, local_lock_nested_bh() is essentially a no-op, though it does provide information to the locking checker for debugging purposes. Local locks have no effect on non-realtime kernels, and require no storage there. Thus, this patch satisfies one of the rules that has constrained realtime development from the start: realtime-specific features must not have a performance impact on non-realtime kernels.

This work will have a significant benefit for the rest of the kernel, though, even if it doesn't change the generated code in most cases. With the current local_bh_disable() pattern, there is no indication of what data is being protected. That makes it hard to reason about concurrent access, and makes the introduction of bugs more likely. Once this work is done, the locking rules for the affected data structures will be documented; in many cases, that may make it possible to stop disabling software interrupts entirely in favor of a more focused locking scheme.

This patch set is in its sixth revision. Previous postings have resulted in some significant changes, mostly in how some networking subsystems were changed to use the new mechanism; the core concept has remained mostly the same. A few developers have indicated their acceptance of this work, so chances are good that it will find its way upstream before too long.

Index entries for this article
KernelInterrupts/Software
KernelRealtime
KernelReleases/6.11



largest source of unpredictable latency removed -- we hope

Posted Jun 18, 2024 4:45 UTC (Tue) by alison (subscriber, #63752) [Link] (3 responses)

The best source of information about the proposed change is Siewior's LPC 2023 talk. In that presentation he showed an example of how replacing local_bh_disable() with local_locks allowed a block softirq to preempt a network one since the block hard IRQ's priority was greater than the network hard IRQ's. That change would be a giant benefit in cases where softirqs don't touch the same data and could run concurrently.

local_locks are scoped, meaning that their addition requires no new goto labels to free the lock. Thus they not only improve performance, but are rather elegant.

Given that tools for monitoring and debugging lock behavior are bountiful and those for softirqs not so much, the patches would have the added benefit of making the kernel easier to understand.

If local_bh_disable() can eventually be replaced at most of its current 300+ call sites, that change would make RT Linux more competitive with well-known RTOSes. Now we just need an ASIL rating for the kernel that is not "QM."

Efforts to move the timer softirq out of ksoftirqd have been underway for years, but have had limited success. Perhaps this completely new approach is better.

local lock scoping

Posted Jun 18, 2024 18:00 UTC (Tue) by bnorris (subscriber, #92090) [Link] (1 responses)

> local_locks are scoped

Are they, inherently? Looking through include/linux/local_lock.h and include/linux/local_lock_internal.h I don't see any such thing.

Looking at the slides you link, I see usage of scoped_guard(), which is new to me. It seems opt-in, so it'd be a matter of awareness and conventional usage. It also seems like it's available for other lock types (e.g., spinlock, mutex). Anyway, looks nice!

lock/other scoping

Posted Jun 19, 2024 8:12 UTC (Wed) by johill (subscriber, #25196) [Link]

Right, this is not inherent to the new local lock, but it is probably the first locking primitive introduced after the guard infrastructure became available, and is thus more likely to be used with it in general.

You can also easily add your own new types, e.g. for use in a specific subsystem/driver, with the DEFINE_GUARD() macro; see e.g. the guard(rcu)(); usage.
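For readers unfamiliar with the mechanism: the guard infrastructure in the kernel's <linux/cleanup.h> is built on the compiler's cleanup attribute. The following is a user-space model of the idea, not the kernel's actual implementation; the mutex stub, macro, and names are all invented for illustration.

```c
#include <assert.h>

/* A trivial stand-in lock. */
typedef struct { int held; } mutex_t_stub;
static void stub_lock(mutex_t_stub *m)   { m->held = 1; }
static void stub_unlock(mutex_t_stub *m) { m->held = 0; }

/* Roughly what DEFINE_GUARD() generates: a guard type plus a release
 * function invoked automatically via __attribute__((cleanup)) when
 * the guard variable goes out of scope. */
typedef mutex_t_stub *guard_mutex_t;
static void guard_mutex_release(guard_mutex_t *g) { stub_unlock(*g); }
#define guard_mutex(m)							\
	__attribute__((cleanup(guard_mutex_release), unused))		\
	guard_mutex_t __guard = (stub_lock(m), (m))

static mutex_t_stub big_lock;
static int counter;

static void bump(void)
{
	guard_mutex(&big_lock);	/* lock taken here ... */
	counter++;
	/* ... and dropped automatically on return: no unlock call,
	 * no goto-based error paths needed */
}
```

This is why guard-based locking needs no new goto labels: the unlock is tied to scope exit, whatever the exit path.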

largest source of unpredictable latency removed -- we hope

Posted Jun 20, 2024 12:09 UTC (Thu) by ppisa (subscriber, #67307) [Link]

I am eager to see fine-grained locking in networking. Our continuous benchmarking of SocketCAN latency on mainline and RT kernels reveals big problems in the Linux kernel.

https://canbus.pages.fel.cvut.cz/#can-bus-channels-mutual...

You can find a copy of our related embedded world Conference article on our CAN site, as well as a presentation and a pointer to our results from more than a year of benchmarking.

We are working with OSADL.org to offer this service on their QA Farm on Real-time of Mainline Linux as well:

https://www.osadl.org/OSADL-QA-Farm-Real-time.linux-real-...

The Xilinx Zynq-based CAN latency tester hardware with the CTU CAN FD core is already installed there, and I hope to have some time now that the theses and the subjects I supervise are finished for the current semester.

When Linux SocketCAN latency is compared with our CAN FD stack developed for the RTEMS RTOS on the same hardware, there is really a big gap. Our RTEMS stack results have been published at the international CAN Conference, but the article has not yet been made publicly available by CiA. Michal Lenc's thesis includes the results as well: RTEMS under full load of its integrated BSD TCP/IP stack stays within 65 usec in all cases (compare with Linux worst cases of up to 10 msec on a loaded RT kernel).

The thesis text
CAN FD Support for Space Grade Real-Time RTEMS Executive; Michal Lenc; 2024;
https://wiki.control.fel.cvut.cz/mediawiki/images/c/cc/Dp...

Revisions moving quickly

Posted Jun 19, 2024 14:57 UTC (Wed) by Kamiccolo (subscriber, #95159) [Link]

Heh, they are moving fast!

> This patch set is in its sixth revision.

Two days after publication, and it's already on its 8th revision:
https://lore.kernel.org/all/20240619072253.504963-1-bigea...

BKL is gone... long live the BKL?

Posted Jun 20, 2024 15:08 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

It's funny how the Linux kernel continues to have distant echoes of the BKL, in many forms, long after "the" BKL is gone.

I guess re-architecting a huge, mission-critical software project is hard, eh?

BKL is gone... long live the BKL?

Posted Jun 25, 2024 17:23 UTC (Tue) by willy (subscriber, #9762) [Link]

It's a similar design pattern rather than being a consequence of the BKL. Had the BKL never existed, a locking scheme like this might still have been devised.

What happened to trying to remove softirqs?

Posted Jul 3, 2024 13:12 UTC (Wed) by zse (guest, #120483) [Link]

I'm wondering: is the effort / desire to eventually remove softirqs dead by now? Things like this seem (from my limited understanding) to go in rather the opposite direction, making the softirq system even more intricate.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds