Detecting missing memory barriers with KCSAN

By Jonathan Corbet
December 2, 2021

Writing (correct) concurrent code that uses locking to avoid race conditions is difficult enough. When the objective is to use lockless algorithms, relying on memory barriers instead of locks to eliminate locking overhead, the problem becomes harder still. Bugs are easy to create and hard to find in this type of code. There may be some help on the way, though, in the form of this patch set from Marco Elver that enhances the Kernel Concurrency Sanitizer (KCSAN) with the ability to detect some types of missing memory barriers.

KCSAN works in a statistical manner by watching accesses to specific memory addresses and trying to detect racy patterns; the algorithm used is described in this article. In its current form, though, KCSAN can only catch certain types of race conditions, specifically those that arise from locking errors. Other types of races remain invisible to this tool, including a number that can arise in incorrect lockless code. KCSAN is, by design, blind to the kinds of problems that occur when CPUs and memory controllers reorder the visibility of memory writes.

Consider this code example, taken from this documentation patch in Elver's series (and lightly edited):

    int x, flag;
    void T1(void)
    {
        x = 1;                  // data race!
        WRITE_ONCE(flag, 1);    // should be: smp_store_release(&flag, 1)
    }

    void T2(void)
    {
        while (!READ_ONCE(flag))    // should be: smp_load_acquire(&flag)
	    ;
        ... = x;                    // data race!
    }

At a first glance, this code appears correct. T1() stores a value into the variable x then sets flag to indicate that x is valid. The other thread, running in T2(), does not attempt to read x until flag is set, so it should always proceed with a valid value. There is only one little problem: the lack of memory barriers gives the CPU the permission to reorder those operations, which appear to be independent; the write to x in T1() could, in fact, be visible to the rest of the system after the write to flag. That could cause T2() to proceed thinking it has a valid value of x when the real value is not yet visible to that thread.

Correct code would have used smp_store_release() to write flag; that would ensure that all writes done prior to that store would be visible globally before the store to flag becomes visible. Similarly, smp_load_acquire() is needed to read flag in a way that doesn't allow later reads to be reordered to happen earlier. Barriers almost always need to be paired in this way to work properly; omitting either half of the pair creates erroneous code.

This kind of bug can be difficult for developers implementing lockless algorithms to avoid; the need for a memory barrier for a specific access is not always obvious. Code containing such bugs can work just fine with all of the developer's tests, only to fail on a handful of production systems in obscure settings. That is why some developers, once they understand the challenges of lockless programming, conclude that dedicating their career to implementing annoying popups in JavaScript isn't such a bad thing after all.

KCSAN, in current kernels, cannot detect this race either. A system running under KCSAN may decide that the store to x in T1() is interesting and worthy of watching, but the way that KCSAN watches work will prevent the race from occurring. KCSAN will start monitoring accesses to x and, while that watch exists, it will delay the further execution of T1() to see if any racy accesses happen. T1() will only proceed (and set flag) once the watch concludes. KCSAN will thus delay the write to flag, causing T2() to wait, until the watch concludes. So the racy access cannot happen as long as KCSAN is watching, and will go undetected.

The new code makes one seemingly simple change to try to detect this kind of problem — though that apparent simplicity is belied by the fact that a 25-part patch series is required to implement it. Rather than just forgetting about x after the watch period ends, KCSAN will repeat the watch after every subsequent memory access until either a memory barrier is encountered or the function returns. In the case described above, KCSAN will watch x again after the assignment to flag, essentially simulating the reordering of the writes to those two variables. In essence, this repeated watch is seeing what will happen if the write to x becomes visible later than the developer expects. Any access to x seen on repeated watches is still racy, since no memory barrier has been executed to ensure correct ordering. So KCSAN will now detect the racy read of x in T2() and raise the alarm.

This algorithm can detect a set of race conditions caused by missing barriers, but not all of them. Most notably, it can test the effects of delaying an access to shared data — such as delaying the visibility of the write to x in the example above — but not the effects of executing that access earlier than the developer expects. But, even if its coverage is incomplete, the improved KCSAN may well be able to prevent a number of barrier-related bugs from getting to the point where they impact users. That may make lockless algorithms a bit more accessible to non-superhuman developers and might even free a few of them from the specter of a future devoted to JavaScript.

Index entries for this article
Kernel	Development tools/Sanitizers
Kernel	Lockless algorithms

Detecting missing memory barriers with KCSAN

Posted Dec 2, 2021 20:14 UTC (Thu) by acarno (subscriber, #123476) [Link] (1 responses)

> might even free a few of them from the specter of a future devoted to JavaScript

The technical content is fantastic, but it's little gems like this that keep me coming back to this site on the regular :) Thanks for the laugh!!

Detecting missing memory barriers with KCSAN

Posted Dec 9, 2021 18:44 UTC (Thu) by gutschke (subscriber, #27910) [Link]

You can always tell when an article was written by Jonathan. Nobody can imitate his unique dry sense of humor.

I frequently catch myself laughing out loud, then scrolling back to verify the by line. Works without fail.

Detecting missing memory barriers with KCSAN

Posted Dec 2, 2021 20:54 UTC (Thu) by ncm (guest, #165) [Link] (9 responses)

I usually fix performance problems nowadays by eliminating wrong-headed "lock-free data structures". Besides being too hard for normal programmers to get right, they are most frequently a lame attempt to paper over design failures. In typical well-designed systems, taking a mutex is rare and uncontested enough not to construct the performance bottleneck the lock-free data structure is a try at avoiding.

The sort of locking that happens at the hardware level when doing lock-free operations is very similar to what you do in a mutex, with the distinction that in place of risking deadlock, you risk silently-wrong results. So, performance problems for a locked structure are often also problems for unlocked ones, but with more difficult debugging, and harder-to-measure bottlenecks.

Kernel coders do not get the luxury of fixing the ill-conceived systems calling into their code, so must do their best to perform well anyway. So, lockless algorithms are an unavoidable necessity in the kernel. This is an argument for using kernel-bypass mechanisms that move the design choices to the application, so failures can be found and fixed without counting on anonymous kernel maintainers' thankless heroics.

Detecting missing memory barriers with KCSAN

Posted Dec 3, 2021 4:10 UTC (Fri) by Paf (subscriber, #91811) [Link] (8 responses)

It’s weird to me that you specifically cite a mutex, which is a type of sleeping lock, rather than a spinlock, which would be more comparable. Mutexes are quite heavy, intended for longer critical sections, and probably not competition for something like this where only a few specific accesses - at most - are managed.

Anyway, for what it’s worth, the type of operations used for memory barriers can be much, much cheaper and not that similar to the full compare exchange with cache clearing and shoot downs etc etc required to implement any form of atomic access. I don’t normally consider a “lock free algorithm” implemented with atomics to be lock free because atomics can be used to implement a lock, and the operations are basically equivalent. This isn’t true for memory barriers.

Detecting missing memory barriers with KCSAN

Posted Dec 3, 2021 11:39 UTC (Fri) by melver (subscriber, #134990) [Link]

There are well-established definitions of the different levels of forward-progress guarantees, with "lock free" being one of them.

The concept was introduced by [1]: "The lock-free condition guarantees that some process will always make progress despite arbitrary halting failures or delays by other processes, while the wait-free condition guarantees that all non-halted processes make progress."

There are even more levels of forward-progress guarantees, which [2] summarizes in Sec. 14.2.

[1] Maurice Herlihy, "A Methodology for Implementing Highly Concurrent Data Objects", 1993. https://dl.acm.org/doi/abs/10.1145/161468.161469
[2] Paul E. McKenney, "Is Parallel Programming Hard, And, If So, What Can You Do About It?" https://arxiv.org/pdf/1701.00854.pdf

Detecting missing memory barriers with KCSAN

Posted Dec 3, 2021 16:27 UTC (Fri) by notriddle (subscriber, #130608) [Link] (3 responses)

The point, at least what I got from ncm’s post, is that if you’re using a bunch of lock-free thread-safe data structures, then you have a bunch of shared, mutable state.

The behavior of systems like that is extremely hard to predict. Even if you manage to avoid coding a data race, you still have the opportunity to code classical race conditions, which are annoying because they often can’t be reproduced on demand.

It also has performance bottlenecks, even if it is lock free. On SMP machines, it creates cache contention. On NUMA it’s even worse, and it cannot be scaled into a cluster at all without being rearchitected.

As proven by BitTorrent, shared immutable state can scale up extremely well, even when running on a truly hostile fabric. And mutable state, if you only have one mutator, isn’t really anything special: if you’re not doing I/O, it can be modeled as a pure function anyway, and if you are, it’s a part of your problem domain.

Nirvana, if you can find it, is a shared nothing architecture, a system where each piece of data is owned by a single, conceptual “thread” with sole permission to change it. Linux can’t work this way, because everything is a file, and the file system is a big blob of shared, mutable state. It’s a shame, really.

Detecting missing memory barriers with KCSAN

Posted Dec 5, 2021 9:15 UTC (Sun) by matthias (subscriber, #94967) [Link] (2 responses)

> As proven by BitTorrent, shared immutable state can scale up extremely well, even when running on a truly hostile fabric.

Except that the BitTorrent approach does not scale at all. Even if today, the processing power of the BitTorrent network is billion times larger than in the beginning, the number of processed blocks/transactions is essentially the same.

Detecting missing memory barriers with KCSAN

Posted Dec 5, 2021 14:04 UTC (Sun) by notriddle (subscriber, #130608) [Link] (1 responses)

That’s BitCoin

Detecting missing memory barriers with KCSAN

Posted Dec 5, 2021 14:07 UTC (Sun) by matthias (subscriber, #94967) [Link]

Maybe I should not write comments that early in the morning ;)

Detecting missing memory barriers with KCSAN

Posted Dec 3, 2021 20:01 UTC (Fri) by ncm (guest, #165) [Link] (2 responses)

There are places for all of these.

In modern mutex implementations, spinning happens for a little while before finally sleeping, so a mutex and a spinlock are not different in kind. (Explicit spinlocks are mainly a kernel phenomenon.) And, barriers are the building block for everything. Atomics mostly just package up barriers (plus some extras) in the type system so normal people can reason about them.

In principle, raw barriers could give you lower overhead than a mutex, but it is better to design so you don't need to care about that. I.e., if the extra overhead of a mutex is a problem, you should try to redesign so it isn't.

Detecting missing memory barriers with KCSAN

Posted Dec 3, 2021 21:41 UTC (Fri) by itsmycpu (guest, #139639) [Link] (1 responses)

> but it is better to design so you don't need to care about that.

I agree in so far as a good design that reduces the need, amount or frequency of using shared data is more important than using lock-free algorithms. For example, by having each set of otherwise shared data be managed by a dedicated thread.

However I think lock-free algorithms are worth it for example if they are used within a few central library functions as a ready-to-use resource.

They can be the natural result of reducing the critical section, the code inside a lock, as you want to minimize contention. When it is reduced to a single (RMW) instruction, it becomes a lock-free construct. However that usually implies a fair amount of work, so that's why I'm saying it (usually) should be limited to library functions.

But then, most desktop applications are not really very multi-threaded in the first place, so don't need to bother anyway (except under the hood in the GUI shell, libraries, and kernel functions they use). These are surely better off just using a few locks if needed.

Detecting missing memory barriers with KCSAN

Posted Dec 4, 2021 15:30 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

In short, "Use the right tool for the job!" ;-)

Detecting missing memory barriers with KCSAN

Posted Dec 10, 2021 2:02 UTC (Fri) by kpfleming (subscriber, #23250) [Link]

This isn't a typo, but maybe a clarification suggestion:

"the write to x in T1() could, in fact, be visible to the rest of the system after the write to flag"

As I understand it, the write to x will *always* be visible after the write to flag, the issue is when it is not visible *before* the write to flag as it needs to be. A possible change to:

"the write to x in T1() could, in fact, not be visible to the rest of the system until after the write to flag"