
Lockless patterns: full memory barriers

March 5, 2021

This article was contributed by Paolo Bonzini


The first two articles in this series introduced four ways to order memory accesses: load-acquire and store-release operations in the first installment, read and write memory barriers in the second. The series continues with an exploration of full memory barriers, why they are more expensive, and how they are used in the kernel.

The primitives that were introduced so far constrain the ordering of loads and stores in four different ways:

  • Load-acquire operations are ordered before subsequent loads and stores.
  • Store-release operations are ordered after earlier loads and stores.
  • Read memory barriers order earlier loads against subsequent loads.
  • Write memory barriers order earlier stores against subsequent stores.

It may be surprising that no combination of these operations orders an earlier store against a later load:

                         Ordered against
                    Load                     Store
    Earlier load    smp_load_acquire(),      smp_load_acquire(),
                    smp_rmb()                smp_store_release()
    Earlier store   ??                       smp_store_release(),
                                             smp_wmb()

It turns out that guaranteeing a global ordering of stores against later loads is much more complicated for the processor, and it deserves a primitive of its own. To find out why, we need to abandon even the high-level concepts that we have used so far and take a direct peek at how processors operate.

How processors really do it

The first article already mentioned that, deep down, processors communicate via a message-passing architecture, such as QPI or HyperTransport. At the assembly-language level, however, the programmer sees operations like memory loads and stores. Any acquire and release semantics associated with these memory operations are an illusion that is provided by the processor's execution pipeline, based on the program it runs and on the constraints of the architecture.

For example, on x86 processors, all memory loads count as "load-acquire", and all memory stores count as "store-release"; this behavior is required by the architecture. When compiling for these processors, it's still important to annotate acquire and release operations and/or to insert memory barriers, but these annotations are only used by the compiler to block invalid optimizations. Even in this model, the architecture does not guarantee that all processors see stores and loads in the same order. Consider this example, where a and b are initially zero:

    CPU 1                    CPU 2
    -------------------      --------------------
    store 1 into a           store 1 into b
    load b into x            load a into y

If you try interleaving the four operations by hand, you'll see that, in every interleaving, at least one store comes before the load of the same variable on the other CPU. Therefore, one would expect at least one of x and y to be 1 after the above sequence runs to completion. Yet, even on an x86 processor, it is possible for both x and y to read as zero.
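
This can even be demonstrated from user space. Here is a minimal test harness, a sketch that assumes a POSIX system and C11 atomics; the accesses are relaxed so that only the processor, not the compiler, may reorder them, and all of the names (cpu, go, done) are mine. Run on an x86 machine, it will occasionally count rounds in which both x and y are zero:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int a, b, x, y;
    static atomic_int go[2], done[2];

    static void *cpu(void *arg)
    {
        int me = *(int *)arg;

        for (;;) {
            while (!atomic_exchange(&go[me], 0))
                ;                        /* wait for the next round */
            if (me == 0) {
                atomic_store_explicit(&a, 1, memory_order_relaxed);
                atomic_store_explicit(&x,
                    atomic_load_explicit(&b, memory_order_relaxed),
                    memory_order_relaxed);
            } else {
                atomic_store_explicit(&b, 1, memory_order_relaxed);
                atomic_store_explicit(&y,
                    atomic_load_explicit(&a, memory_order_relaxed),
                    memory_order_relaxed);
            }
            atomic_store(&done[me], 1);  /* tell main() this round is over */
        }
        return NULL;
    }

    int main(void)
    {
        const long runs = 10000000;
        long i, hits = 0;
        int id[2] = { 0, 1 };
        pthread_t t[2];

        pthread_create(&t[0], NULL, cpu, &id[0]);
        pthread_create(&t[1], NULL, cpu, &id[1]);
        for (i = 0; i < runs; i++) {
            atomic_store(&a, 0);
            atomic_store(&b, 0);
            atomic_store(&go[0], 1);     /* start both threads */
            atomic_store(&go[1], 1);
            while (!atomic_load(&done[0]) || !atomic_load(&done[1]))
                ;                        /* wait for both to finish */
            atomic_store(&done[0], 0);
            atomic_store(&done[1], 0);
            if (!atomic_load(&x) && !atomic_load(&y))
                hits++;
        }
        printf("x = y = 0 in %ld of %ld runs\n", hits, runs);
        return 0;
    }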

How so? The reason lies in the existence of the store buffer, which lives between the CPU and its L1 cache. Stores to memory usually only change one part of a cache line; completing those stores, even just to cache, may require fetching the full cache line from memory — a slow operation. The store buffer holds the stored data while this fetch takes place, allowing the CPU to proceed with execution.

Even a CPU with a store buffer can provide load-load, load-store, and store-store ordering relatively easily:

  • On out-of-order CPUs (which can execute operations in an order different from how they appear in the code to improve performance), ordering memory operations against earlier loads can be done via speculative execution. During the time between a cache line access and the retiring of the instruction that caused the access, the CPU tracks evictions of that cache line. Retiring of instructions happens in order, so this tracking will last until all earlier loads have completed. If the cache line is evicted before the instruction is retired, speculation is canceled and the processor retries the memory operation.
  • Keeping stores ordered is even simpler; it is enough to flush the entries of the store buffer to the cache in FIFO order.

However, store-load ordering is a different story. First of all, one CPU's store might still be sitting in its store buffer, and therefore it will not be visible to another CPU that loads the data from the L1 cache. Second, a store buffer also provides store forwarding, where the result of a memory load is taken directly from the store buffer instead of going through the cache. If CPU 1 happens to have an entry for b in its store buffer (or CPU 2 an entry for a), the value could be forwarded to the load, which would then return zero.

The only way around these problems is to flush the store buffer entirely between the store and the load. This is as expensive as it sounds (a few tens of clock cycles), but this is what the full memory barrier smp_mb() does under the hood. Here is the same pseudocode as above, fixed and rewritten into C:

    thread 1                 thread 2
    -------------------      --------------------
    WRITE_ONCE(a, 1);        WRITE_ONCE(b, 1);
    smp_mb();                smp_mb();
    x = READ_ONCE(b);        y = READ_ONCE(a);
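
As an aside, the same fix can be written in user-space C11 against the harness sketched earlier: a sequentially consistent fence between each thread's store and load (which typically compiles to a store-buffer-draining instruction such as x86's mfence) makes the x = y = 0 outcome disappear. For thread 1, that would be:

    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);    /* user-space analogue of smp_mb() */
    atomic_store_explicit(&x,
        atomic_load_explicit(&b, memory_order_relaxed),
        memory_order_relaxed);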

Let's say x is zero. I will use a squiggly horizontal line from a read to a write to express that WRITE_ONCE(b, 1) overwrites the value that thread 1 had read; then the situation (which is described in detail in the kernel's memory-model documentation) is as follows:

    WRITE_ONCE(a, 1);
           |
      -----+----- smp_mb();
           |
           v
    x = READ_ONCE(b);   ∿∿∿∿∿∿∿>  WRITE_ONCE(b, 1);
                                         |
                                    -----+----- smp_mb();
                                         |
                                         v
                                  y = READ_ONCE(a);

Because these are relaxed operations, this is not enough to introduce a happens-before edge between thread 1 and thread 2. The barriers also do not have acquire or release semantics by themselves, so there are no happens-before edges across the two threads at all.

However, the barriers do provide enough information to order the operations in the two threads. In particular, continuing with the case of x=0, the full memory barrier in thread 2 guarantees that the store buffer has been flushed by the time READ_ONCE(a) executes. Could READ_ONCE(a) have executed even before thread 1's READ_ONCE(b)? If so, READ_ONCE(b) would certainly have seen the earlier WRITE_ONCE(b, 1) from thread 2 (remember, the store buffer had been flushed by then), and x would be 1. That's a contradiction, therefore READ_ONCE(b) must have executed first:

    WRITE_ONCE(a, 1);              WRITE_ONCE(b, 1);
           |                              |
           |                         -----+----- smp_mb();
           |                              |
           v                              v
    x = READ_ONCE(b); -----------> y = READ_ONCE(a);
                        (if x=0)

Due to transitivity, READ_ONCE(a) sees the effect of WRITE_ONCE(a, 1), and y will be 1. Likewise, if thread 2 reads a as zero, thread 1's full memory barrier ensures that thread 2's READ_ONCE(a) must have executed before thread 1's READ_ONCE(b):

    WRITE_ONCE(a, 1);              WRITE_ONCE(b, 1);
           |                              |
      -----+----- smp_mb();               |
           |                              |
           v                              v
    x = READ_ONCE(b); <----------- y = READ_ONCE(a);
                        (if y=0)

meaning that y=0 implies x=1. Different executions may show one or the other possibility, but in any case x and y cannot both be zero; that would mean that READ_ONCE(a) executed before READ_ONCE(b) and vice versa, which is impossible.
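
This reasoning can also be checked mechanically. Here is the example written as a litmus test for the herd7 simulator, in the format used by the kernel's tools/memory-model (this is essentially the store-buffering test shipped with it); the simulator confirms that the "exists" condition can never be satisfied:

    C SB+mb

    {}

    P0(int *a, int *b)
    {
        int r0;

        WRITE_ONCE(*a, 1);
        smp_mb();
        r0 = READ_ONCE(*b);
    }

    P1(int *a, int *b)
    {
        int r0;

        WRITE_ONCE(*b, 1);
        smp_mb();
        r0 = READ_ONCE(*a);
    }

    exists (0:r0=0 /\ 1:r0=0)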

The Linux kernel memory model does not call these read-read edges "happens-before", because there's no indication of which is the release operation and which is the acquire operation; nevertheless, they have the same effect of ordering memory operations across threads. Therefore we can state that high-level APIs have acquire or release semantics even if internally they use full memory barriers; users of those APIs will be able to think of their code in terms of the happens-before framework, too. The next example will show how this is done in practice.

Sleep/wake-up synchronization

The detailed description in the previous section shows that full memory barriers impose constraints in a complicated and unintuitive manner. Fortunately, the above pattern of "two threads and two flags", where each thread writes to a flag and reads from the other, is probably all that you need to know about full memory barriers.

This pattern has many important and practical applications. Suppose thread 2 wants thread 1 to act on thread 2's requests; thread 1, on the other hand, wants to do something else—something that could even take an indefinite amount of time, for example removing itself from the scheduler's run queue and going to sleep. In this case, the code will look like this:

    thread 1                                   thread 2
    -------------------                        --------------------------
    WRITE_ONCE(dont_sleep, 1);                 WRITE_ONCE(wake_me, 1);
    smp_mb();                                  smp_mb();
    if (READ_ONCE(wake_me))                    if (!READ_ONCE(dont_sleep))
      wake(thread2);                             sleep();

If thread 2 reads dont_sleep as zero, thread 1 will read wake_me as one and wake up thread 2; it's useful to think of thread 1 as having release semantics (imagine that the wake() happens as part of mutex_unlock). If thread 1 reads wake_me as zero, thread 2 will read dont_sleep as one and will not go to sleep. Usually this is the side that needs to exhibit acquire semantics.

There is a hidden assumption here: thread 1's wake-up request must never be lost, even if, for example, wake() is called after thread 2's READ_ONCE() but before sleep(). A way to avoid this bug is for the wake() and sleep() calls to take the same lock, as in the sketch below. Again, we see how lockless patterns cooperate with traditional synchronization—after all, this is a slow path.
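
As an illustration, here is a user-space sketch of the fixed pattern; the names are mine, and sequentially consistent atomics stand in for smp_mb(). The flags are still accessed locklessly on the fast path, but the actual sleeping and waking take the same mutex, and pthread_cond_wait() releases the mutex and goes to sleep atomically, so a wake-up cannot fall into the window:

    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static atomic_int dont_sleep, wake_me;

    static void thread2_side(void)
    {
        atomic_store(&wake_me, 1);       /* seq_cst store+load replace smp_mb() */
        if (!atomic_load(&dont_sleep)) {
            pthread_mutex_lock(&lock);
            /* Re-check under the lock: a wake() racing with us has either
             * already set dont_sleep (and the mutex makes it visible here),
             * or it will find us sleeping and signal the condition. */
            while (!atomic_load(&dont_sleep))
                pthread_cond_wait(&cond, &lock);  /* atomic unlock + sleep */
            pthread_mutex_unlock(&lock);
        }
    }

    static void thread1_side(void)
    {
        atomic_store(&dont_sleep, 1);
        if (atomic_load(&wake_me)) {
            pthread_mutex_lock(&lock);   /* same lock as the sleeping side */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
    }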

It really does work, and we can see this, for example, in Linux's prepare_to_wait() and wake_up_process() APIs. This interface was introduced in the 2.5.x development kernels, and was duly covered back then by LWN. Here is how the pattern appears after expanding some of the functions that are involved:

    thread 1                                     thread 2
    -------------------                          --------------------------
    WRITE_ONCE(condition, 1);                    prepare_to_wait(..., TASK_INTERRUPTIBLE) {
    wake_up_process(p) {                           set_current_state(TASK_INTERRUPTIBLE) {
      try_to_wake_up(p, TASK_NORMAL, 0) {            WRITE_ONCE(current->state, TASK_INTERRUPTIBLE);
        smp_mb();                                    smp_mb();
        if (READ_ONCE(p->state) & TASK_NORMAL)     }
          ttwu_queue(p);                         }
      }                                          if (!READ_ONCE(condition))
    }                                              schedule();

As we saw in the seqcount case, the memory barriers are hidden inside the implementation of higher-level APIs. Embedding memory barriers or load-acquire/store-release operations inside the implementation is in fact how one defines an API that has acquire and release semantics; in this case, wake_up_process() has release semantics, while set_current_state() and its caller prepare_to_wait() have acquire semantics.

The sleep condition is often checked twice to limit unnecessary wakeups, like this:

    thread 1                               thread 2
    -------------------                    --------------------------
    WRITE_ONCE(dont_sleep, 1);             if (!READ_ONCE(dont_sleep)) {
    smp_mb();                                WRITE_ONCE(wake_me, 1);
    if (READ_ONCE(wake_me))                  smp_mb();
        wake(thread2);                       if (!READ_ONCE(dont_sleep))
                                               sleep();
                                           }

In the kernel, we can see this kind of check in the interaction between tcp_data_snd_check(), together with the call to tcp_check_space() that is defined immediately above it (thread 1), and tcp_poll() (thread 2). In this case, the code does not have the luxury of relying on a higher-level abstraction, so it merits a closer look. Thread 2 wants to sleep if the socket has no room in its send buffer; therefore, tcp_poll() sets the "wake me" SOCK_NOSPACE flag before checking __sk_stream_is_writeable(). The core of the lockless synchronization in tcp_poll() is:

    set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
    smp_mb__after_atomic();
    if (__sk_stream_is_writeable(sk, 1))
      mask |= EPOLLOUT | EPOLLWRNORM;

The caller will ultimately go to sleep if mask is zero. smp_mb__after_atomic() is a specialized version of smp_mb() and has the same semantics. These optimized barriers will be explained in a future article.

Thread 1, instead, must wake up thread 2 after consuming data in the send buffer. tcp_data_snd_check() first sends out packets to make room in the buffer ("do not sleep, there's room now"), then checks SOCK_NOSPACE, and finally (through the sk->sk_write_space() function pointer) reaches sk_stream_write_space(), where thread 2 is woken up. The call stack is relatively shallow, so I'll let the reader explore the code. I will however point out this comment in tcp_check_space():

    /* pairs with tcp_poll() */
    smp_mb();
    if (test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
      tcp_new_space(sk);

The reference to the "pairing" of barriers tells us that this function has either acquire or release semantics. A read memory barrier would tell us straight away that the function has acquire semantics, and a write memory barrier that it has release semantics. With a full memory barrier, we need to look at the code around the barrier. In this case, we know that the function is on the "wake-up" side of the pattern and therefore has release semantics; tcp_poll(), instead, has acquire semantics.

Something similar happens almost everywhere an smp_mb() can be found in the kernel. For example:

  • Workqueues use this idiom to decide whether workers have more work to do. In this case, thread 1's role is taken by insert_work(), while thread 2's role is taken by wq_worker_sleeping().
  • In the futex() system call, thread 1's write is in user space, while the memory barrier and read are part of futex(FUTEX_WAKE). Thread 2's operation is entirely part of futex(FUTEX_WAIT) (because wake_me is a flag in kernel memory); FUTEX_WAIT accepts the expected value of the futex as an argument to the system call, and uses it to decide whether to sleep, as in the sketch after this list. See the long comment at the head of kernel/futex.c for details on how this works.
  • Within KVM, sleep() is replaced by the act of entering the processor's guest mode and running a virtual machine. In order to kick the processor out of guest mode, kvm_vcpu_kick() sends an inter-processor interrupt to the processor. The familiar comment about memory-barrier pairing can be found down the call chain in kvm_vcpu_exiting_guest_mode(), which is also where vcpu->mode is read.
  • Virtio devices use two instances of the above pattern. On one hand, the driver wants to stop processing completed requests, and waking it up consists of sending an interrupt. On the other hand, the device wants to stop processing submitted requests, and the driver can wake it up by writing to a "doorbell" memory location.
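
To make the futex case concrete, here is a rough user-space sketch; the raw syscall wrapper is mine (glibc does not provide a futex() function), and so are the two helper names. The check of the expected value happens inside the kernel, after the full memory barrier, so a concurrent wake-up cannot be missed:

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static atomic_int futex_word;

    /* Thread 2: sleep as long as the word still holds the value the
     * caller saw; FUTEX_WAIT re-checks it inside the kernel. */
    static void wait_for_change(int expected)
    {
        while (atomic_load(&futex_word) == expected)
            futex(&futex_word, FUTEX_WAIT, expected);
    }

    /* Thread 1: the write happens in user space; the memory barrier and
     * the read of the wait-queue state are inside futex(FUTEX_WAKE). */
    static void store_and_wake(int value)
    {
        atomic_store(&futex_word, value);
        futex(&futex_word, FUTEX_WAKE, INT_MAX);
    }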

That completes the introduction of memory barriers. From here, we will look at compare-and-swap operations, how they can be combined with locks to create lock-free fast paths, and their role in the implementation of lock-free linked lists.





Posted Mar 5, 2021 18:50 UTC (Fri) by jcm (subscriber, #18262)

A key thing to remember with memory barriers is they are only about the observed order of memory operations. An implementation is actually free to do very different things with those barriers (including speculating right through them, which you can do as long as there is no circumstance under which someone can observe incorrect ordering).

Posted Mar 5, 2021 19:06 UTC (Fri) by pbonzini (subscriber, #60935)

Yes, it's only about the observed order, and in different ways depending on whether the tricks come from the compiler or the processor.

The genius stroke of C++11, compared to e.g. the Java memory model, was to treat the compiler+processor combo as weakly ordered even if the underlying architecture is TSO. I don't think any compiler makes much use of the leeway that it's given, but it does make for a nice and consistent model at the language level.

I do prefer the LKMM and its clear foundation on the behavior of hardware (which I tried to convey in this article) to C++11's slightly too handwavy "just use sequential consistency unless you need something else".

Posted Mar 6, 2021 15:25 UTC (Sat) by jcm (subscriber, #18262)

One of the things that has been keeping me up at night recently is speculating through barriers. You do it by tracking cache ownership/state transitions as you go through, and rolling back as needed. The easiest way to think about it is probably that it's close to the apparatus needed for transactions. And there are a few papers in this area. So anyway, the point is that draining various buffers isn't the only way to pull this off. Think of how Intel does their MOB for TSO by tracking updates and replaying too.

Posted Mar 6, 2021 15:28 UTC (Sat) by jcm (subscriber, #18262)

The mental rabbit hole is separating the perceived observed order from reality. Reality might involve combinations of invalidation queue and store/load buffer tracking, but it might involve something very different :)

Posted Mar 6, 2021 17:44 UTC (Sat) by pbonzini (subscriber, #60935)

Yep, I even mentioned (with a little bit of simplification) what Intel does for TSO. To some extent it should be possible to do the same for full barriers, by delaying invalidate messages and keeping them buffered until the store buffer has been flushed. After all in most cases there is no race and therefore the effect of the barrier is kind of invisible.

Posted Mar 9, 2021 5:00 UTC (Tue) by jcm (subscriber, #18262)

Thing is you don’t even need to flush the sucker, just track age and ensure that they’ve all passed through. You can blow right through every barrier without cost provided you are willing to track enough cache lines in the process. I have been doing a lot of thinking lately about renaming SPRs and speculating through serializing instructions too. There’s no reason you couldn’t (provided you tracked everything, were willing to pay the cost, and also could precisely unwind the state to prevent side-channel crumbs, which might force you to add eg a side buffer). This has been dancing in my head for the past week solid.

Posted Mar 7, 2021 16:02 UTC (Sun) by jcm (subscriber, #18262)

x86 processors maintain the illusion of TSO ordering through the use of a MOB (Memory Ordering Buffer), which not only tracks the cache lines for ownership/invalidation but also replays at retirement if necessary in order to maintain ordering. So e.g. a load might be performed twice in order to ensure it retires correctly.

Posted Mar 7, 2021 16:10 UTC (Sun) by jcm (subscriber, #18262)

Btw, in the SB example, it's not that the store buffer is forwarding locally, "which would return zero", it's that each is writing into its local store buffer while reading the other variable from the initial state. The SB allows the stores to be delayed in terms of observation by other processors relative to the local one.

Posted Mar 7, 2021 20:56 UTC (Sun) by pbonzini (subscriber, #60935)

Read the paragraph again, it mentions both scenarios. :)

Posted Mar 8, 2021 20:56 UTC (Mon) by jcm (subscriber, #18262)

Yea, I did after. I agree, thanks :)

Posted Mar 9, 2021 1:41 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624)

Nice gentle introduction to the reads-from relation! ("~~~~~~~~") ;-)

Posted Mar 9, 2021 20:31 UTC (Tue) by pbonzini (subscriber, #60935)

If "that's all you have to say about that" (as Forrest Gump would put it), then I guess you didn't find any mistake, yay!

Posted Mar 9, 2021 20:57 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624)

Or I haven't yet had a chance to read it thoroughly, your choice. ;-)

Posted Mar 12, 2021 17:03 UTC (Fri) by firolwn (guest, #96711)

Hi Paolo, great article. I think maybe you forgot to add 'smp_mb();' to the two diagrams which are just around the line starting with 'Due to transitivity'.
--
Firo

Posted Mar 15, 2021 14:35 UTC (Mon) by firolwn (guest, #96711)

After reading the 'Multicopy atomicity' section of the kernel's Documentation/memory-barriers.txt, I realize that I was wrong and that no additional smp_mb() needs to be added.
--
Firo

Posted Apr 16, 2021 7:01 UTC (Fri) by pbonzini (subscriber, #60935)

It's not intuitive at all, but only one memory barrier matters in each of the two cases. But both are needed (separately) to ensure that x=0 && y=0 is not possible.

