LWN: Comments on "Lockless patterns: an introduction to compare-and-swap" https://lwn.net/Articles/847973/ This is a special feed containing comments posted to the individual LWN article titled "Lockless patterns: an introduction to compare-and-swap". en-us Thu, 23 Oct 2025 01:26:07 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849830/ foom <div class="FormattedComment"> While this certainly helps scalability significantly, the second cmpxchg operation to execute will still necessarily fail and require a retry, because its "compare" value is out of date.<br> <p> (Where the performance really shines is in the operations that can be fully implemented in the memory controller, such as atomic fetch-and-add/sub/or/and/etc., because these cannot fail.)<br> </div> Fri, 19 Mar 2021 12:36:24 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849828/ excors <div class="FormattedComment"> I don't think there's any need to implement CAS in RAM directly, because there's already a memory controller in front of the RAM that has to coordinate and serialise memory accesses from multiple CPUs, and you can put the atomic logic there instead. And that's already supported by some CPUs: see e.g. AXI (<a href="https://developer.arm.com/documentation/ihi0022/latest">https://developer.arm.com/documentation/ihi0022/latest</a>) and CHI (<a href="https://developer.arm.com/documentation/ihi0050/latest">https://developer.arm.com/documentation/ihi0050/latest</a>), which are ARM's bus protocols for connecting the various components on a SoC (CPU clusters, GPU, memory controllers, etc). They define "atomic transactions" including AtomicCompare (CAS of up to 16 bytes). A CPU can send an AtomicCompare packet to the memory controller, which will execute it atomically on DRAM and send the result back, avoiding all the costs of cache coherency. If two CPUs do a simultaneous AtomicCompare, the memory controller will just buffer one transaction for a few cycles until the other has finished.<br> </div> Fri, 19 Mar 2021 11:38:44 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849818/ epa <div class="FormattedComment"> I suppose the logical next step is a special kind of RAM that implements compare-and-swap as an atomic operation in hardware.<br> </div> Fri, 19 Mar 2021 08:13:07 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849782/ Alan.Stern <div class="FormattedComment"> There's a technical point that's relevant here. In the section where you mention using cmpxchg_release instead of cmpxchg:<br> <p> "Thread 4 would always read node2 from node3->next, because it read the value that thread 3 wrote to first. However, there would be no happens before edge from thread 1 and thread 2 to thread 4; therefore, thread 4 would need a smp_load_acquire() in order to see node1 in node2->next."<br> <p> In fact this isn't so. In the Linux kernel memory model, an edge from a store-release to a relaxed load _does_ establish a happens-before relation.
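<br> <p> (A minimal litmus-style sketch of that edge, using hypothetical variables x and y that both start at zero, rather than the article's list nodes:)<br> <pre>
/* Thread A */
WRITE_ONCE(x, 1);
smp_store_release(&y, 1);   /* store-release */

/* Thread B */
r1 = READ_ONCE(y);          /* relaxed load; r1 == 1 creates the edge */
r2 = READ_ONCE(x);          /* nothing orders this load after the first,
                               so r2 == 0 remains possible */
</pre>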
What it doesn't do is guarantee that this store or prior ones will be visible to other loads following the first, because there will be nothing to force those loads to execute in order (i.e., there generally is no happens-before relation from a relaxed load to later loads in the same thread). But in threads 2 and 3 there are no loads following the one inside the cmpxchg_release, so this lack doesn't matter. Only thread 4 needs to use a load-acquire -- which it already does: xchg_acquire.<br> <p> One other thing you might consider mentioning in the last article. Full memory barriers, including those associated with value-returning atomic operations like xchg or cmpxchg, are even stronger than discussed so far. They do more than order all earlier and later memory accesses, both loads and stores, against each other; they also act as global synchronizing operations. That is, whenever two threads both carry out a full memory barrier, there is always a happens-before relation from the barrier that executes first to the one that executes second. To put it another way, if each thread has a full memory barrier between each pair of memory accesses, the whole program will end up obeying the Sequential Consistency model.<br> </div> Thu, 18 Mar 2021 21:22:08 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849552/ dgc <div class="FormattedComment"> <font class="QuotedText">> When you talk about cache coherency slowing things down, is it possible that a</font><br> <font class="QuotedText">> hardware compare-and-swap operation never using the cache, but always going</font><br> <font class="QuotedText">> directly to main memory, could end up being more scalable?</font><br> <p> Even if you have no caches, you still need a hardware mechanism that guarantees atomic, exclusive access to that memory when multiple CPUs all try to read/modify it at the same time. You can't have two CPUs perform concurrent atomic cmpxchg operations on the same memory and have both succeed - one has to fail and retry when the other succeeds. So even if you have no caches, you still need hardware-level memory access coordination protocols....<br> <p> -Dave.<br> </div> Wed, 17 Mar 2021 05:15:55 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849437/ dancol <div class="FormattedComment"> How does rr address the problem?<br> </div> Mon, 15 Mar 2021 19:24:04 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849323/ roc <div class="FormattedComment"> Hardly anyone cares, but LL/SC has the unfortunate property that in practice you sometimes have to retry even in the absence of contention (e.g. due to a hardware interrupt). That happens to break rr (https://rr-project.org) because it introduces nondeterminism that we can't track.<br> </div> Sun, 14 Mar 2021 22:10:46 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849312/ pbonzini <div class="FormattedComment"> Isn't that backwards? LL/SC must always be wrapped in a loop, so it cannot produce wait-free algorithms.
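<br> <p> (The shape is always roughly the following, sketched here with hypothetical load_linked()/store_conditional() helpers standing in for the architecture's LL/SC pair, e.g. ldxr/stxr on arm64:)<br> <pre>
do {
        old = load_linked(&counter);    /* arm the exclusive monitor */
} while (!store_conditional(&counter, old + 1));   /* may fail; retry */
</pre>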
Instead, processor-level RMW instructions can be wait-free.<br> </div> Sun, 14 Mar 2021 11:48:26 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849311/ excors <div class="FormattedComment"> I think that's what GPUs usually do. There are multiple L1 caches and a single shared L2 cache, and no cache coherency. Atomic operations on global memory have to be sent to the L2 controller, which can typically handle one atomic operation per cycle to a single address (and perhaps multiple per cycle if spread over multiple addresses in a single cache line). Operations include cmpxchg, add/sub, min/max, and/or/xor, and sometimes even floating-point addition, performed by ALUs inside the L2 controller.<br> <p> The downside is that in the uncontended case where only a single thread is accessing the address, every atomic operation that returns a value (like cmpxchg) will have hundreds of cycles of latency. And if you have multiple logically-independent threads each accessing a different address in a different cache line, they can't execute independently - they all have to go to the L2 controller and create contention there. But it scales well when you have hundreds of threads contending for the same address.<br> <p> (GPUs also have local memory (closer to the L1 level) and a separate implementation of atomics for that. If you're doing a big reduction computation (e.g. finding the sum of a big array), you should do it hierarchically - each thread would reduce some input elements by simply looping and using registers, then a group of threads would merge their results using local memory atomics, then a group of groups would merge their results using global memory atomics. That way you can have a huge number of threads and still avoid L2 contention.)<br> </div> Sun, 14 Mar 2021 11:05:42 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849310/ pbonzini <div class="FormattedComment"> Heh, I doubt you'd learn anything new! I do hope that more experienced people get some insights about how to present abstract concepts and their concrete embodiments at the same time, for example happens-before both as a mathematical relation and as a property of APIs.<br> <p> There's a chicken-and-egg problem between teaching the "what" and the "when/why". My choice here was to first teach the "what" so that people would be able to read others' code, and leave the "when/why" for the end, but I'm sure it was possible to do it otherwise. I read somewhere that you can't teach something well until the fifth time you teach it.<br> <p> <font class="QuotedText">> I do "scalability" and not "lockless"....</font><br> <p> Well, everybody should be doing scalability and not lockless. I focused the series on "lockless patterns" because, again, my aim was to teach people how to deal with code written by others. (Fun fact: I had never seen the TCP code I explained in part 3 before. I literally grepped for smp_mb and picked a random example. It's too easy to omit details when explaining code you're familiar with, and I wanted to be in the same shoes as the readers.)<br> <p> But lockless (and sharding :)) are just tools that you use to achieve scalability.
If you don't need scalability, just throw a big lock around everything and call it a day!<br> </div> Sun, 14 Mar 2021 09:56:14 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849308/ epa <div class="FormattedComment"> When you talk about cache coherency slowing things down, is it possible that a hardware compare-and-swap operation never using the cache, but always going directly to main memory, could end up being more scalable? (For the sake of argument, suppose that the counter accessed for compare-and-swap is in a special area of memory that can't be accessed by any other instruction.)<br> </div> Sun, 14 Mar 2021 09:30:10 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849292/ dgc <div class="FormattedComment"> Sharding is another of those terms I dislike. It's a catch-all phrase that covers a huge number of different techniques. e.g. a simple per-cpu counter is a "sharded counter", a per-node list (e.g. list_lru.h) is a "sharded list", and splitting an index space up into multiple independent index trees is a "sharded tree". It just doesn't tell you anything about how the algorithm works to maintain global state, or about any of the downsides of doing global-scope operations - just that the data structure has been split up in some way.<br> <p> Even stuff like optimistic lock coupled trees could come under the "sharded" title, because each node in the tree has its own individual locks or mechanism of lockless atomic updates. (Yes, I really did just say lockless lock coupled trees :) Again, calling such a structure "sharded" does not provide any understanding of how data accesses have been isolated and reduced. Mention OLC and most people with knowledge of scalable tree structures immediately understand how it works, the constraints it works under, how it scales, and when it is desirable to use such a construct.<br> <p> Oh, there I go again, talking about "scalable" instead of "lockless" or, in this case, "sharding"... :)<br> <p> -Dave.<br> </div> Sat, 13 Mar 2021 22:02:11 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849290/ dgc <div class="FormattedComment"> Hi Paolo, you make some good points about the inherent complexity of lockless algorithms, and about what programmers and engineers need to understand about lockless patterns and when and how to use them. I've been reading your series closely because I might learn something I've missed from it, but I am probably looking at it from a different angle than most readers because I do "scalability" and not "lockless"....<br> <p> It was with this article that I put my finger on what was missing from the other articles: an explanation of when you _wouldn't_ want to use a particular pattern. Your articles explain the patterns and give examples of how to use them correctly and in a beneficial way, but the downsides of those patterns are not really enumerated. The context you've given here, I think, probably should have been in a leader article explaining what the series is going to teach the reader. Paul is good at doing this. :)<br> <p> Other than this, I'm finding the articles easy to read and fairly good at explaining complex behaviour without being overwhelming. Keep 'em coming!
:)<br> <p> -Dave.<br> </div> Sat, 13 Mar 2021 21:43:38 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849284/ foom <div class="FormattedComment"> While it is nice in theory, LL/SC can often have _worse_ scalability. There's a reason ARMv8.1 added the "LSE" instructions, consisting of cmpxchg and read-modify-write (add, sub, etc.) atomic memory operation instructions. RISC-V has the latter set as well.<br> <p> With these single instructions, the memory system is able to e.g. send an atomic-add-1-and-store operation over the memory bus and have that operation run in the memory controller, rather than having to fetch the data to the CPU core, run the add on the CPU, then write it back out. This can greatly increase the ability to scale to higher levels of parallelism.<br> </div> Sat, 13 Mar 2021 17:14:32 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849283/ jcm <div class="FormattedComment"> On a total tangent, this makes me wonder how long you could force vmap_area_lock to be held by performing an extreme number of small VMA allocations and then freeing them.<br> </div> Sat, 13 Mar 2021 16:18:32 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849279/ jcm <div class="FormattedComment"> This is why I'm a fan of load-link/store-conditional in classical RISC. On the other hand, it suffers from complexity too, in that you need to build an exclusive monitor that tracks updates in the coherency protocol to any affected lines (and is possibly permitted to only track one line, for example). RISC-V seems to have taken a purist position here.<br> </div> Sat, 13 Mar 2021 15:51:46 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849277/ PaulMcKenney <div class="FormattedComment"> The other oft-cited benefit of lock-free algorithms is behavior at high CPU load in workloads suffering from frequent preemption. Of course, as you say, livelock is no better than blocking, but carefully designed algorithms can in some cases avoid both. Another alternative is to limit CPU utilization, thus greatly reducing the duration of the preemptions, in turn allowing simpler algorithms to perform well.<br> <p> But no matter what you do, there can be no question that having your software work reliably under extreme conditions does require extreme care in both development and validation.<br> </div> Sat, 13 Mar 2021 15:50:23 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849278/ jcm <div class="FormattedComment"> ...which of course is literally one of the reasons willy is working on Maple Trees: to remove the locking required in nodes for red-black tree rotation :)<br> </div> Sat, 13 Mar 2021 15:49:04 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849266/ pbonzini <div class="FormattedComment"> Another related topic is sharding. Splitting a data structure into multiple parts can reduce the number of writes and reads to any single shard, which is beneficial in itself, and it can again turn a multiple-producer or multiple-consumer scenario into one with a single producer and/or a single consumer - without locks this time.
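<br> <p> (As a minimal sketch of the idea - assuming a hypothetical per-CPU "sharded" counter where each CPU is the only writer of its own shard and readers sum all the shards:)<br> <pre>
/* One shard per CPU, padded so shards don't share a cacheline. */
struct counter_shard {
        unsigned long count;
} ____cacheline_aligned;

static struct counter_shard shards[NR_CPUS];

/* Callers are assumed to run with preemption disabled, so each
 * CPU only ever writes its own shard: a single-producer update. */
static void counter_add(unsigned long n)
{
        struct counter_shard *s = &shards[smp_processor_id()];

        WRITE_ONCE(s->count, s->count + n);
}

/* A racy but approximately correct global sum. */
static unsigned long counter_read(void)
{
        unsigned long sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                sum += READ_ONCE(shards[cpu].count);
        return sum;
}
</pre>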
That in turn completely changes the tools you can use.<br> </div> Sat, 13 Mar 2021 10:00:20 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849250/ willy <div class="FormattedComment"> Understanding the behaviour of one's code is essential to choosing the locking pattern to use (lockfree being, of course, one locking pattern).<br> <p> I've seen several examples of code in the kernel that uses an rwlock_t for protection because one call path reads the data structure. The author didn't stop to think "Is there actual parallelism to be extracted here?" or "What's the cost of an rwlock vs a spinlock?"<br> </div> Sat, 13 Mar 2021 01:45:41 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849249/ pbonzini <div class="FormattedComment"> If you mean compare-and-swap in general and not Linux's cmpxchg, then ARM, PPC, and RISC-V all have opcodes (or can use LL/SC sequences) that provide a relaxed compare-and-swap.<br> </div> Sat, 13 Mar 2021 00:42:34 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849248/ pbonzini <div class="FormattedComment"> On Linux, cmpxchg() actually does imply a full memory barrier. The article is only written in terms of acquire and release because I wanted to introduce the "weaving" of the happens-before relation through all the add-to-list operations; this will come in handy in part 5. C11's atomic_compare_exchange_strong() does *not* have the stricter ordering semantics of Linux's cmpxchg(), though.<br> <p> There are a few other simplifications throughout the articles; they will all be "corrected" in due time.<br> </div> Sat, 13 Mar 2021 00:40:42 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849245/ pbonzini <div class="FormattedComment"> Hi Dave, thanks very much for this write-up. Some of the things you wrote are the main reason why I wasn't sure whether to include linked lists at all. On one hand they are the poster child of lock-free data structures; on the other hand they are substantially less useful than any of the other things I am talking about, for all the reasons that you went into. They do have their place, but they're easy to misuse. For example, the few times that I used llist, I did so not because of the lock-free producer side (which seems so great but has fairness/livelock problems if taken to the extreme), but because I knew writes would be rare and I wanted the (single) read side to be as fast as possible. This is not obvious at all.<br> <p> In the end I included them because they provided me with good running examples, but more than any other pattern in this series they require answering the question of *when* to use lock-free algorithms.<br> <p> I was going to put that into the concluding article (though now I might have to rethink that :), but I will quickly explain my ideas here. What I wanted to "teach" with this series is that, on one hand, you should not fear lockless algorithms because they can be properly distilled into common patterns and abstracted; on the other hand, they are nothing but a tool.
What matters the most in almost all cases is maintainability, and then:<br> <p> * knowing a few well-known and/or well-abstracted patterns can actually make your code *more* maintainable and scalable despite using more advanced techniques. This is where lock-free linked lists fail the worst, because they're far from being the silver bullet that an inexperienced programmer could imagine them to be.<br> <p> * you should split your code as much as possible into a lock-free part, which will usually be one of these patterns or another abstraction provided by Linux such as lockref, and a lock-based slow path. This makes it easier to figure out the specific technique to use, as the lockless part remains small;<br> <p> * {single,multiple} {producer,consumer} is a huge part of choosing your lockless technique. Knowing that you have a single producer or a single consumer lets you pick the right pattern and lets you keep the complication of the techniques at bay.<br> <p> Understanding the relative frequency of reads vs. writes (or the relative frequency of the slow vs. the fast path) is to some extent a prerequisite, since otherwise you won't even know where the scalability problems are. It is a tradeoff that is present everywhere Linux uses lockless patterns (the above personal anecdote with llist is one example, but the most common and also most extreme case is probably RCU), and it also touches the previous bullets because, for low-enough frequencies, you can just treat the less frequent case as a slow path, use a lock to go from multiple to single producer/consumer, and see what this simplification can do for you in terms of new patterns to use.<br> </div> Sat, 13 Mar 2021 00:18:43 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849239/ dgc <div class="FormattedComment"> A couple of important things about cmpxchg algorithms are implied by the article but not explicitly mentioned. It says that cmpxchg is an "atomic read-modify-write operation" but then doesn't explain the scalability limitations this imposes on "lockless" cmpxchg algorithms.<br> <p> 1. cmpxchg is a hardware-locked operation rather than a software-locked operation. It still bounces cachelines around the machine via the hardware cache coherency protocol, so frequent cmpxchg operations will always reach an update saturation point where CPU usage goes non-linear and the update rate goes backwards.<br> <p> 2. cmpxchg algorithms can be extremely unfair and effectively livelock when subject to high concurrent modification rates, because they directly expose the unfairness in hardware cache coherency protocols (i.e. update patterns usually reflect core-to-core latency differences).<br> <p> Hence a number of important cmpxchg algorithms in the kernel drop out of the cmpxchg() loop after a set number of failures and fall back to spinlocks to avoid the worst costs of these adverse behaviours. e.g. the lockref infrastructure (lockref_get/lockref_put) used by the dentry cache breaks out of cmpxchg loops after 100 consecutive failures.<br> <p> As an example of how much work can be done before cacheline contention (cache coherency protocol saturation) starts to be a problem, I'm seeing non-linearity in CPU usage in a hot cmpxchg loop become measurable at about 2.5-3 million updates/s to a single 64-bit variable on its own isolated cacheline, on a CPU-bound 32-way workload on a 2-year-old 2-socket, 32p machine.
It's not quite completely saturated yet, but profiles show a single cmpxchg loop starting to consume significant single-digit percentages of the entire CPU usage of a workload creating 650,000 files/s.<br> <p> In contrast, spinlocks that are seeing the same amount of traffic are consuming ~30% of the entire CPU usage of this workload. e.g. the per-sb inode list lock is being trafficked 2.6 million times/s, resulting in it burning ~10% of the CPU (three whole CPUs) spinning. That's the same atomic cacheline update rate as the hot cmpxchg loop where I'm also seeing CPU usage go non-linear....<br> <p> IOWs, cmpxchg() algorithms can scale better than locked algorithms, but they do not offer unbounded scalability because they are not truly lockless operations. A couple of decades of writing concurrent algorithms has taught me that scalability is really defined by the frequency with which the concurrent algorithm accesses/updates shared data, not by whether the software algorithm is considered "lockless" or not. A lock is simply a high-frequency atomic access to shared data, hence it's easy to see why locks become a scalability limitation very quickly if you consider them from a CPU cacheline access perspective. It should also be obvious now that cmpxchg() or atomic variables don't offer a huge improvement in fundamental scalability over locks - all they do is allow much finer-grained and hence less frequent access to shared data. It is the reduction in access frequency to shared data that improves scalability, not the use of a "lockless" pattern.<br> <p> This is the art of concurrent programming - it's not enough just to know what a lockless algorithm is; you need to understand the data access patterns those algorithms result in and when those access patterns are going to become a limitation for the software. Of course, knowing when not to use a lockless algorithm because there are better ways to reduce shared data access is also part of the art... :)<br> <p> -Dave.<br> </div> Fri, 12 Mar 2021 23:15:57 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849229/ jreiser > [not] equivalent to a full memory barrier on every architecture <p> Please name the two most-prevalent architectures on which compare-and-swap is not equivalent to a full memory barrier. Fri, 12 Mar 2021 18:01:57 +0000 Lockless patterns: an introduction to compare-and-swap https://lwn.net/Articles/849220/ jcm <div class="FormattedComment"> Worth noting that the compare-and-swap operation might provide both acquire and release semantics, but this doesn't make it equivalent to a full memory barrier on every architecture. The examples in this article carefully accommodate that but don't call it out.<br> </div> Fri, 12 Mar 2021 16:22:32 +0000
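<p> (For reference, Linux's cmpxchg() family makes exactly this distinction explicit; a sketch of the ordering contract of each variant, not of an implementation:)<br> <pre>
old = cmpxchg(&v, expected, new);          /* fully ordered: acts as a
                                              full barrier around the op */
old = cmpxchg_acquire(&v, expected, new);  /* orders later accesses after it */
old = cmpxchg_release(&v, expected, new);  /* orders earlier accesses before it */
old = cmpxchg_relaxed(&v, expected, new);  /* atomicity only, no ordering */
</pre> On a strongly ordered architecture such as x86 these all compile to much the same locked instruction, while weakly ordered architectures such as arm64 and powerpc genuinely relax the weaker forms (and, in the kernel's memory model, a failed cmpxchg() is generally unordered).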