
User-space RCU: Memory-barrier menagerie

November 12, 2013

This article was contributed by Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett


User-space RCU

Most parallel code does not use explicit memory barriers. After all, the required memory barriers are coded into the underlying primitives, such as locking, atomic operations, and RCU. This allows most developers to use these primitives and cheerfully ignore memory barriers. However, if you need the ultimate in performance and scalability, or if you need to work on the underlying primitives, then you will need a solid understanding of memory barriers. This document is intended to provide that understanding. Although it uses the userspace RCU library's primitives, the concepts also apply to other environments, including the Linux kernel.

Conceptual underpinnings

It is worth clearly stating what memory barriers are not:

  1. Memory barriers are not a way of making prior memory accesses get done faster.

    On the contrary, although memory barriers might slow down subsequent stores, they rarely make anything go faster.

  2. Memory barriers are not fairy dust that magically makes your concurrent program work correctly.

    Yes, sprinkling memory barriers might reduce your failure probabilities, but so might random delays. If you use memory barriers, you had better understand them.

  3. Memory barriers are not a mechanism that commits prior stores to main memory.

    In fact, memory barriers control memory-mapped I/O (MMIO), store buffers, and perhaps also caches, but not normally main memory.

  4. Memory barriers are not a broadcast mechanism that transmits prior stores to other CPUs.

    Memory barriers do not necessarily transmit anything; they only force ordering.

  5. Memory barriers are not a way of providing unconditional ordering guarantees.

    You cannot expect an isolated memory barrier to do much of anything for you, except in a very few special cases. You must instead work with pairs of memory barriers.

The truth is that pairs of memory barriers provide conditional ordering guarantees. The conditional nature of these guarantees is encapsulated in if-then conditions. To illustrate this, let's look at a simple example where x and y both are initially equal to zero:

CPU 0                    CPU 1

x = 1;                   r1 = y;
cmm_smp_mb();            cmm_smp_mb();
y = 1;                   r2 = x;

[memory barrier example] The pair of memory barriers is the pair of cmm_smp_mb() statements. Now, if r1 == 1, then we know that CPU 0's store to y preceded CPU 1's load from y. If this condition is met, then the two memory barriers guarantee that CPU 0's store to x also preceded CPU 1's load from x, which means that r2 == 1. In short, this memory-barrier pair guarantees that if r1 == 1, then r2 == 1, as illustrated in the diagram to the right.

With a little Boolean algebra, we can convert this if-then condition into the following assertion: r1 == 0 || r2 == 1. Assertions are much more convenient for more complex memory-barrier examples, which have numerous if-then forms, but only one assertion (give or take the commutative law for logical OR). Either way, the following outcomes are possible:

  1. r1 == 0 && r2 == 0: CPU 0's code started after CPU 1's code completed.
  2. r1 == 0 && r2 == 1: CPU 0's and CPU 1's code executed concurrently.
  3. r1 == 1 && r2 == 1: CPU 0's code completed before CPU 1's code started.

Only the outcome r1 == 1 && r2 == 0 is prohibited, which we will call the BUG_ON() expression. After all, this outcome requires that either CPU 0's or CPU 1's code be executed out of order, and ordering is exactly what the memory barriers are supposed to be guaranteeing in the first place.

Of course, all this assumes that all CPUs agree on the order of the sequence of values for any given memory location. Fortunately, all modern architectures provide some form of this guarantee that at least works in the presence of memory barriers.

Quick Quiz 1: Why can't CPU 0's cmm_smp_mb() be a cmm_smp_wmb()? For that matter, why can't CPU 1's cmm_smp_mb() be a cmm_smp_rmb()?

Answer

Quick Quiz 2: Suppose that some other CPU also stores to x. Are we still guaranteed that r2 == 1?
Answer

It is worth looking into what would happen without the memory barriers:

CPU 0                    CPU 1

x = 1;                   r1 = y;
y = 1;                   r2 = x;

In this case, both the compiler and the hardware are within their rights to reorder the statements, so that any of the four outcomes is now possible, including the r1 == 1 && r2 == 0 outcome that was prohibited by the previous example that had memory barriers.

It is now time to step back from the earlier specific example to the general rule:

CPU 0                    CPU 1

X0;                      Y1;
cmm_smp_mb();            cmm_smp_mb();
Y0;                      X1;

[memory barrier example] Here X0 and X1 are memory references involving the variable X, and Y0 and Y1 are memory references involving the variable Y.

If we can show that access Y0 preceded Y1 for a given execution, then this pair of memory barriers will guarantee that access X0 precedes X1 for that execution. This can also be expressed in the form of a diagram (seen to the right).

We say that the two memory barriers pair to make this guarantee. Again, a single isolated memory barrier cannot guarantee much of anything in most cases.

Quick Quiz 3: But why can't you instead show that X1 preceded X0, and then have this pair of memory barriers guarantee that Y1 also precedes Y0?

Answer

With this conceptual machinery in hand, we are ready for the memory-barrier menagerie.

The menagerie

The following table summarizes the two-variable memory-barrier scenarios. In all cases, the initial values of both x and y are zero. The values of the variables in the assertions are their final values, taken after all execution has ceased and the values from all assignment statements have propagated throughout the entire system. Although CPU 0 and CPU 1 are the main actors in these scenarios, some scenarios require additional stores, with Scenario 0 being the most extreme case. In fact, Scenario 0 is so extreme that we will ignore it.

Quick Quiz 4: But what if I don't want to ignore Scenario 0?

Answer

Quick Quiz 5: But what if I want to understand scenarios with more than two variables???

Answer

Each access to x and y must use a CMM_ACCESS_ONCE() to prevent the compiler from playing nasty games. As a service to those with finite displays, we substitute CAO() via the following definition:

    #define CAO(a) CMM_ACCESS_ONCE(a)

Scenario 3 is the most important, followed by Scenario 10, so on a first reading you might choose to focus on these two to the exclusion of the other 14.

Scenario 0: IRIW (mostly academic)
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      r2 = CAO(y);
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      r4 = CAO(x);
  CPU 2: CAO(x) = 1;
  CPU 3: CAO(y) = 1;
  BUG_ON: r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0

Scenario 1:
  CPU 0: r1 = CAO(x);  cmm_smp_{r,}mb();  r2 = CAO(y);
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      CAO(x) = 1;
  CPU 2: CAO(y) = 1;
  BUG_ON: r1 == 1 && r2 == 0 && r3 == 1

Scenario 2:
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      r2 = CAO(y);
  CPU 1: CAO(y) = 1;   cmm_smp_mb();      r4 = CAO(x);
  CPU 2: CAO(x) = 1;
  BUG_ON: r1 == 1 && r2 == 0 && r4 == 0

Scenario 3: Message (y) and flag (x)
  CPU 0: r1 = CAO(x);  cmm_smp_{r,}mb();  r2 = CAO(y);
  CPU 1: CAO(y) = 1;   cmm_smp_{w,}mb();  CAO(x) = 1;
  BUG_ON: r1 == 1 && r2 == 0

Scenario 4: Scenario 1, swapping x and y
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      CAO(y) = 1;
  CPU 1: r3 = CAO(y);  cmm_smp_{r,}mb();  r4 = CAO(x);
  CPU 2: CAO(x) = 1;
  BUG_ON: r1 == 1 && r3 == 1 && r4 == 0

Scenario 5: Inverse Dekker
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      CAO(y) = 1;
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      CAO(x) = 1;
  BUG_ON: r1 == 1 && r3 == 1

Scenario 6:
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      CAO(y) = 1;
  CPU 1: CAO(y) = 2;   cmm_smp_mb();      r4 = CAO(x);
  CPU 2: CAO(x) = 1;
  BUG_ON: y == 2 && r1 == 1 && r4 == 0

Scenario 7:
  CPU 0: r1 = CAO(x);  cmm_smp_mb();      CAO(y) = 1;
  CPU 1: CAO(y) = 2;   cmm_smp_{w,}mb();  CAO(x) = 1;
  BUG_ON: y == 2 && r1 == 1

Scenario 8: Scenario 2, swapping x and y
  CPU 0: CAO(x) = 1;   cmm_smp_mb();      r2 = CAO(y);
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      r4 = CAO(x);
  CPU 2: CAO(y) = 1;
  BUG_ON: r2 == 0 && r3 == 1 && r4 == 0

Scenario 9: Scenario 6, swapping x and y
  CPU 0: CAO(x) = 2;   cmm_smp_mb();      r2 = CAO(y);
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      CAO(x) = 1;
  CPU 2: CAO(y) = 1;
  BUG_ON: x == 2 && r2 == 0 && r3 == 1

Scenario 10: Dekker
  CPU 0: CAO(x) = 1;   cmm_smp_mb();      r2 = CAO(y);
  CPU 1: CAO(y) = 1;   cmm_smp_mb();      r4 = CAO(x);
  BUG_ON: r2 == 0 && r4 == 0

Scenario 11:
  CPU 0: CAO(x) = 2;   cmm_smp_mb();      r2 = CAO(y);
  CPU 1: CAO(y) = 1;   cmm_smp_mb();      CAO(x) = 1;
  BUG_ON: x == 2 && r2 == 0

Scenario 12: Message (x) and flag (y); Scenario 3, swapping x and y
  CPU 0: CAO(x) = 1;   cmm_smp_{w,}mb();  CAO(y) = 1;
  CPU 1: r3 = CAO(y);  cmm_smp_{r,}mb();  r4 = CAO(x);
  BUG_ON: r3 == 1 && r4 == 0

Scenario 13: Scenario 7, swapping x and y
  CPU 0: CAO(x) = 2;   cmm_smp_{w,}mb();  CAO(y) = 1;
  CPU 1: r3 = CAO(y);  cmm_smp_mb();      CAO(x) = 1;
  BUG_ON: x == 2 && r3 == 1

Scenario 14: Scenario 11, swapping x and y
  CPU 0: CAO(x) = 1;   cmm_smp_mb();      CAO(y) = 1;
  CPU 1: CAO(y) = 2;   cmm_smp_mb();      r4 = CAO(x);
  BUG_ON: y == 2 && r4 == 0

Scenario 15:
  CPU 0: CAO(x) = 1;   cmm_smp_{w,}mb();  CAO(y) = 2;
  CPU 1: CAO(y) = 1;   cmm_smp_{w,}mb();  CAO(x) = 2;
  BUG_ON: x == 1 && y == 1

The example in the previous section is Scenario 3 and Scenario 12, which are exactly the same scenario, but with the roles of x and y interchanged. In fact, there are six pairs of scenarios that are related in this way: 1 and 4, 2 and 8, 3 and 12, 6 and 9, 7 and 13, and finally 11 and 14. This leaves ten unique scenarios, namely 0-3, 5-7, 10, 11, and 15.

In some scenarios, weaker memory barriers can be used. These are indicated by cmm_smp_{r,}mb() and cmm_smp_{w,}mb(), indicating that the memory barrier can safely be weakened to a read or write barrier, respectively. For those with access to a compiler that handles C/C++11 atomics, a cmm_smp_rmb() can be combined with the preceding load via a C/C++11 memory_order_acquire load, and a cmm_smp_wmb() can be combined with the following store via a C/C++11 memory_order_release store.

Quick Quiz 6: Why can't Scenario 0's memory barriers be weakened to cmm_smp_rmb()?

Answer

Scenarios 3 and 12 are used very heavily for interprocess communication. For example, a writer fills out a message buffer, executes a memory barrier, and then sets a flag indicating that the message buffer is now ready. Readers check the flag, and if it is set, execute a memory barrier, and only then read out the message buffer. If either memory barrier is omitted, both the compiler and the CPU can rearrange both the writer's and the reader's accesses, resulting in the reader receiving invalid data.

Scenario 10 frequently appears in mutual-exclusion algorithms; in fact, it is named after Dekker's algorithm, the mutual-exclusion algorithm with which it is closely associated. Within the Linux kernel, SRCU makes heavy use of this scenario in order to coordinate the reads in an SRCU read-side critical section (which follow the write in the preceding srcu_read_lock()) with the counter reads in synchronize_srcu() (which follow the data-structure writes preceding the synchronize_srcu()).

Quick Quiz 7: Scenario 1's BUG_ON() expression reads r1 == 1 && r2 == 0 && r3 == 1. What does that even mean???

Answer

Of course, the other scenarios are sometimes used as well, as are combinations of these scenarios with each other or with locking, where an unlock followed by a lock sometimes acts as a full cmm_smp_mb() memory barrier (but check your user-level locking primitives: unlike those of the kernel, some do not make this guarantee). One special caution: when combining multiple scenarios, it is very easy to inadvertently rely on transitivity, which neither cmm_smp_rmb() nor cmm_smp_wmb() provides.

To see this, consider the following situation:

CPU 0                    CPU 1                    CPU 2

x = 1;                   r1 = y;                  r3 = z;
cmm_smp_wmb();           cmm_smp_mb();            cmm_smp_rmb();
y = 1;                   z = 1;                   r4 = x;

One might hope that r1 == 1 && r3 == 1 would imply that r4 == 1, but the non-transitivity of CPU 0's cmm_smp_wmb() and CPU 2's cmm_smp_rmb() means that one would be disappointed. You should use full memory barriers unless (1) you can actually measure the performance difference and (2) you can prove that it is safe to use the weaker barriers.

Be careful. The use of memory barriers might not be a dark art, but it can be a bit tricky sometimes.

Quick Quiz 8: I'll say they can be tricky! But you said that memory barriers can sometimes act singly instead of in pairs. How does that happen?

Answer

Quick Quiz 9: But memory barriers are so expensive! Can't we get these guarantees more cheaply?

Answer

Conclusions

There are of course other possible combinations of memory barriers, for example, SRCU places memory barriers between repeated reads from the same counter variable, relying on other mechanisms to prevent counter overflow. Nevertheless, a careful study of the above table will give you an excellent basic understanding of memory barriers and how they can be used.

All that aside, if you find yourself frequently using memory barriers, you should take a step back and think carefully about what you are doing. Most of the time, you should instead use primitives (for example, locks, atomic operations, or RCU) that provide the needed memory barriers internally, so that you don't need to worry about them. If none of the pre-existing primitives does what you need, only then should you use memory barriers. And even then, you should use a good design that places the needed memory barriers into new primitives to make things easier for those who later read your code or try to use your primitives.

More Information

  1. Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney. See Chapter 12 and Appendix C.
  2. Relaxed-Memory Concurrency by the group of Peter Sewell and Susmit Sarkar.
  3. Validating Memory Barriers and Atomic Instructions by Paul E. McKenney.
  4. A Tutorial Introduction to the ARM and POWER Relaxed Memory Models [PDF] by Luc Maranget, Susmit Sarkar, and Peter Sewell. This reference includes a much larger menagerie that includes the effects of smp_read_barrier_depends() as well as more elaborate combinations of threads and variables. This reference is therefore well worth perusing for those who are seriously interested in fully understanding memory barriers.
  5. Partial Orders for Efficient Bounded Model Checking of Concurrent Software [PDF] by Jade Alglave, Daniel Kroening, and Michael Tautschnig.

Answers to Quick Quizzes

Quick Quiz 1: Why can't CPU 0's cmm_smp_mb() be a cmm_smp_wmb()? For that matter, why can't CPU 1's cmm_smp_mb() be a cmm_smp_rmb()?

Answer: No reason. In this particular case both memory barriers could in fact be demoted, which would make the code run faster on many systems. However, this demotion would also give up the transitive property of the cmm_smp_mb() full memory barrier.

Back to Quick Quiz 1.

Quick Quiz 2: Suppose that some other CPU also stores to x. Are we still guaranteed that r2 == 1?

Answer: Not unless that other CPU only stores the value 1 to x. However, if that other CPU could store some other value, then we are guaranteed that the value of r2 will be either 1 or some value stored to x after CPU 0's store.

Back to Quick Quiz 2.

Quick Quiz 3: But why can't you instead show that X1 preceded X0, and then have this pair of memory barriers guarantee that Y1 also precedes Y0?

Answer: You can in fact do this, in which case the two CPUs have simply switched roles.

Back to Quick Quiz 3.

Quick Quiz 4: But what if I don't want to ignore Scenario 0?

Answer: OK, have it your way!

Let's suppose that r1 == 1. This means that CPU 0's load from x must have happened after CPU 2's store. But if r2 == 0, then CPU 0's load from y must have happened before CPU 3's store. Given the transitivity of this full memory barrier, this means that CPU 2's store must have happened before CPU 3's store.

Now, if r3 == 1, we know that CPU 1's load from y happened after CPU 3's store, which implies that it also happened after CPU 2's store. Therefore, CPU 1's load from x also happens after CPU 2's store, so that r4 == 1.

Similar lines of reasoning apply for arbitrary permutations of the assertion.

Hey, you asked!!!

Back to Quick Quiz 4.

Quick Quiz 5: But what if I want to understand scenarios with more than two variables???

Answer: Then you should peruse POWER and ARM Litmus Tests, which is also included in A Tutorial Introduction to the ARM and POWER Relaxed Memory Models by Luc Maranget, Susmit Sarkar, and Peter Sewell.

Back to Quick Quiz 5.

Quick Quiz 6: Why can't Scenario 0's memory barriers be weakened to cmm_smp_rmb()?

Answer: Because this scenario relies on transitivity, which is provided only by the full memory barrier, cmm_smp_mb().

Back to Quick Quiz 6.

Quick Quiz 7: Scenario 1's BUG_ON() expression reads r1 == 1 && r2 == 0 && r3 == 1. What does that even mean???

Answer: Relax, take a deep breath, and take it one equality at a time.

The first equality (r1 == 1) says that CPU 0's load from x happened after CPU 1's store to x. Then the second equality (r2 == 0) says that CPU 0's load from y happened after CPU 3's store to y. So far, so good.

But then the third equality (r3 == 1) says that CPU 1's load from y happened after CPU 3's store to y. But because CPU 0's load from x happened after CPU 1's store to x, the memory barriers force CPU 1's load from y to happen before CPU 0's load from y, which in turn happened before CPU 3's store to y. Given the memory barriers, CPU 3's store to y cannot happen in both places, so the BUG_ON() expression really cannot happen.

It is well worthwhile to do a similar analysis on the other assertions. It is also worthwhile to do the analysis in other orders. Even though the “&&” operator is commutative, the insight obtained from assertion analysis frequently is not. In fact, different orderings of the inequalities in an assertion often correspond to different execution orderings.

To see this, let's work Scenario 1's assertion backwards.

If r3 == 1 and r2 == 0, we know that CPU 3's store must have happened after CPU 0's load from y and before CPU 1's load from y. This in turn means that CPU 0's load from y must have happened before CPU 1's load from y.

But this means that the memory barriers will pair to guarantee that CPU 0's load from x happens before CPU 1's store to x. This guarantees that r1 == 0.

In other words, if r3 == 1 and r2 == 0, then it must be the case that r1 == 0. We also know that CPU 0 finished before CPU 1 started.

Back to Quick Quiz 7.

Quick Quiz 8: I'll say they can be tricky! But you said that memory barriers can sometimes act singly instead of in pairs. How does that happen?

Answer: One way is MMIO accesses, which can be ordered by a single memory barrier. Of course, you could argue that the memory barrier is still paired, but with a memory barrier in the I/O hardware or firmware. Besides which, MMIO accesses are rather rare in user-mode applications (though not unheard of).

Another way that one memory barrier can act on its own is as follows:

CPU 0                    CPU 1

r1 = x;                  x = 1;
cmm_smp_mb();
r2 = x;

In this case, any value seen by the load into r1 must also be seen by the load into r2.

Therefore, the corresponding BUG_ON() expression is r1 == 1 && r2 == 0, with the equivalent assertion being r1 == 0 || r2 == 1.

Of course, cache coherence alone covers this case, so that one might expect the following to provide the same guarantee:

CPU 0                        CPU 1

r1 = CMM_ACCESS_ONCE(x);     x = 1;
r2 = CMM_ACCESS_ONCE(x);

And in fact this does provide the same guarantee. Interestingly enough, the CMM_ACCESS_ONCE() macros are critically important on Itanium. Normal loads and stores on Itanium do not provide read-read cache coherence, so without the CMM_ACCESS_ONCE() there is no guarantee. Fortunately, Itanium's GCC emits ld.acq and st.rel instructions for volatile loads and stores, respectively, so proper use of CMM_ACCESS_ONCE() saves the day.

Back to Quick Quiz 8.

Quick Quiz 9: But memory barriers are so expensive! Can't we get these guarantees more cheaply?

Answer: You can!

One approach is to abandon portability, so that your software only runs on TSO systems such as x86 or the IBM Mainframe. On these systems, cmm_smp_rmb() and cmm_smp_wmb() constrain the compiler's code-reordering optimizations, but do not themselves generate any actual instructions.

If you need to run on weakly ordered systems such as ARM, Power, and Itanium, then you can use the RCU-based techniques described here.

Back to Quick Quiz 9.



User-space RCU: Memory-barrier menagerie

Posted Nov 15, 2013 1:30 UTC (Fri) by chantecode (guest, #54535) (1 response)

Excellent document. I've been waiting for such thorough reference on the various possible ordering and their expected guarantees. There is Documentation/memory-barrier.txt but I don't remember seeing the Dekker and inverted Dekker examples inside.

Also I'm still confused about the guarantee provided by Dekker, because it's the only barrier that seem to guarantee a state machine forward progress, which I can't manage to understand why. But at least I know the effect of its ordering now, heh I have to start somewhere...

-- Frederic

User-space RCU: Memory-barrier menagerie

Posted Nov 15, 2013 20:37 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

Glad you like it!

On Dekker, there are no more forward-progress guarantees than for any other scenario. To see this, consider that once CPU 0 and CPU 1 complete execution of the Dekker scenario, then at least one of them will have seen the other's store. Note that Scenario 15 has a similar guarantee: once both CPUs complete execution, then at least one of x and y will have the value 2.

Of course, the guarantee provided by the Dekker scenario is well-suited to mutual exclusion, but on the other hand I much prefer atomic operations. ;–)


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds