
Axiomatic validation of memory barriers and atomic instructions

August 19, 2014

This article was contributed by Paul McKenney and Alan Stern

Although memory barriers and atomic instructions are less obscure than they once were, they can still be rather confusing. The need for atomic instructions is easy to see: if multiple CPUs execute a++ simultaneously, some increments could be lost. For example, if the initial value of a was zero, then all (say) eight CPUs might concurrently load zero, add one to it, and then all store the value one, rather than the correct value of eight. In contrast, atomic_inc(&a) would faithfully record each and every increment to a.
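
As a concrete illustration, here is a minimal user-space sketch using C11 atomics rather than the kernel's atomic_t API (the thread count, function names, and use of pthreads are ours, purely for illustration): the plain increment can lose updates, while the atomic one cannot.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8

    static int plain_counter;           /* plain int: concurrent increments can be lost */
    static atomic_int atomic_counter;   /* C11 atomic: every increment is recorded */

    static void *incr(void *arg)
    {
        (void)arg;
        plain_counter++;                        /* racy load-add-store sequence */
        atomic_fetch_add(&atomic_counter, 1);   /* atomic read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, incr, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        /* atomic_counter is always 8; plain_counter might come up short. */
        printf("plain=%d atomic=%d\n", plain_counter,
               atomic_load(&atomic_counter));
        return 0;
    }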

To see the need for memory barriers, keep in mind that both the compiler and the CPU can cause memory-reference instructions to be reordered. It is important to remember that even strongly ordered systems such as x86 and the IBM Mainframe still do some reordering, namely, allowing prior stores to be executed after subsequent loads. For example, consider the following x86 litmus test, with x and y both initially zero:

    CPU 0		CPU 1
    -----		-----
    mov [x],$1		mov [y],$1
    mov eax,[y]		mov eax,[x]

It is possible for both CPUs' eax registers to be equal to zero at the end of this test, because both of the loads can be executed before either of the stores. (In fact, this outcome could occur even if just one of the loads was executed before both stores.) To prevent this reordering, we can insert mfence instructions as follows:

Quick Quiz 1: Why can't the CPU designers simply make their hardware do things in order???
Answer
    CPU 0		CPU 1
    -----		-----
    mov [x],$1		mov [y],$1
    mfence		mfence
    mov eax,[y]		mov eax,[x]

These added mfence instructions ensure that at least one of the CPUs will end up with a non-zero eax register.
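
For readers more at home in C than in x86 assembly, a rough user-space equivalent of the fenced test can be sketched with C11 atomics; this is only a sketch, with atomic_thread_fence(memory_order_seq_cst) standing in for mfence:

    #include <stdatomic.h>

    static atomic_int x, y;   /* both initially zero */
    static int r0, r1;        /* results observed by CPU 0 and CPU 1 */

    void cpu0(void)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* plays the role of mfence */
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void cpu1(void)
    {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
    }

    /* If cpu0() and cpu1() each run once, concurrently, the fences rule out
     * the outcome r0 == 0 && r1 == 0; without them, that outcome is permitted. */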

But this leads to the question: “Exactly where do I need to place memory-barrier instructions?” For most developers, even Linux kernel hackers, the answer is “Nowhere.” The reason for this reassuring answer is that most synchronization primitives already contain the needed memory barriers. Therefore, if you use locks, sequence counters, atomic operations, or read-copy-update (RCU) in the intended manner, you don't need to add any explicit memory-barrier instructions. Instead, the required memory-barrier instructions are included in these synchronization primitives.
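
For example, the following sketch (hypothetical variable names, pthreads chosen purely for illustration) hands data from one thread to another with no explicit memory barriers at all; the lock and unlock operations supply all the ordering required:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int data;
    static int ready;

    void producer(void)   /* runs in one thread */
    {
        pthread_mutex_lock(&lock);     /* implies acquire ordering */
        data = 42;
        ready = 1;
        pthread_mutex_unlock(&lock);   /* implies release ordering */
    }

    int consumer(void)    /* runs in another thread */
    {
        int val = -1;

        pthread_mutex_lock(&lock);
        if (ready)
            val = data;                /* guaranteed to see data == 42 */
        pthread_mutex_unlock(&lock);
        return val;
    }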

Of course, if you are writing tricky code (in which case, you should think twice) or if you are writing your own synchronization primitive (in which case you should also think twice), you will need to add your own memory barriers and atomic instructions. One indispensable source of help is the documentation in Documentation/atomic_ops.txt and Documentation/memory-barriers.txt. You most definitely need to read and understand these documents before using atomic operations or explicit memory-barrier instructions.

armmem/ppcmem tool

Once you have read this documentation, another source of help is the armmem/ppcmem tool discussed in an LWN article that appeared in 2011. This tool has been extremely helpful for validating short sequences of critical code for the Linux kernel and elsewhere; however, it can be a bit slow. For example, consider the “independent reads of independent writes” (IRIW) litmus test, in which one thread (call it “P0”) writes the value 1 to variable x, another thread (P1) writes the value 1 to variable y, a third thread (P2) reads x then y separated by a full barrier, and a fourth thread (P3) reads these two variables in the opposite order. Both x and y are initially zero. The corresponding PowerPC litmus test is as follows:

 1 PPC IRIW.litmus
 2 ""
 3 (* Traditional IRIW. *)
 4 {
 5 0:r1=1; 0:r2=x;
 6 1:r1=1;         1:r4=y;
 7         2:r2=x; 2:r4=y;
 8         3:r2=x; 3:r4=y;
 9 }
10  P0           | P1           | P2                 | P3                 ;
11  stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2)       | lwz r3,0(r4)       ;
12               |              | sync               | sync               ;
13               |              | lwz r5,0(r4)       | lwz r5,0(r2)       ;
14 
15 exists
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)

Line 1 identifies this as a PowerPC litmus file, and gives it the name “IRIW.litmus”. An ARM litmus file would, of course, substitute “ARM” for the above “PPC”. Line 2 provides a place for an alternative name for the test, which you will usually want to leave blank as shown. Comments may be inserted at this point, as shown on line 3, using the OCaml (or Pascal) syntax of “(* *)”. These comments can span multiple lines if desired.

Lines 4-9 provide initial values for the machine registers, and are of the form “P:R=V”, where “P” is the process identifier, “R” is the register identifier, and “V” is the value. For example, Process P0's register r1 initially contains the value 1, and its register r2 initially contains the address of variable x. Uninitialized variables default to the value zero, so that, in the example, x and y are both initially zero. If desired, x and y can be initialized as well, for example, adding a “x=1; y=2;” after line 4 would initialize x to the value one and y to the value two.

Line 10 provides identifiers for the four processes, and must be of the form “PN” numbered consecutively starting from zero. Strict, perhaps, but it does help avoid confusion.

Lines 11-13 contain the code, so that P0 and P1 do their respective stores, and P2 and P3 do their loads in opposite orders, separated by a full memory barrier in the form of a sync instruction.

Finally, lines 15 and 16 contain the assertion, which in this case will trigger if P2 and P3 see the two stores from P0 and P1 happening in different orders.

Running this example using the full state-space search tool ppcmem gives the correct answer, namely that the assertion cannot fire. Automatic computation of this result is a great improvement over the alternative of poring over manuals and, in fact, armmem/ppcmem has been an invaluable aid in tracking down several Linux kernel issues over the past few years. However, armmem/ppcmem does take about 14 CPU hours to compute this result, and also requires about 10GB of main memory to hold the state space. Of course, 14 hours is a great improvement over the weeks or months required by manual methods, but these two kernel hackers are impatient and prize quicker answers to hard questions.

herd tool

Jade Alglave and Luc Maranget, who were members of the original group behind ppcmem, were joined by Michael Tautschnig in some recent work [PDF] that replaces the full state-space search with an axiomatic, event-based approach. This axiomatic approach can greatly reduce the time required to analyze a particular code fragment because its use of partially ordered relations means that it does not need to examine large numbers of equivalent, totally ordered sequences of states. Instead, the axiomatic approach constructs the set of possible candidate executions, rejecting those that do not meet the constraints of the underlying memory model. In herd, a candidate execution consists of little more than a mapping giving, for each load, the store that provided the memory value that was read. This compact representation results in a much smaller number of candidate executions than are considered by the full-state-space approach used by armmem/ppcmem, which should allow herd to run somewhat faster.
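
To make the notion of a candidate execution concrete, note that the IRIW test has four loads, each of which can read either the initial zero or the value 1 from the corresponding store, for 2^4 = 16 candidate executions; the memory model then rejects any candidate that violates its constraints. The toy sketch below (our illustration, not herd's algorithm) enumerates the 16 candidates and rejects the single one in which P2 and P3 see the two stores in opposite orders, which is the outcome that the sync version of the test forbids:

    #include <stdio.h>

    /* Each candidate execution maps each of the four loads to the store it
     * reads from: 0 means the initial value, 1 means the store by P0 or P1. */
    int main(void)
    {
        int p2_x, p2_y, p3_y, p3_x, allowed = 0;

        for (p2_x = 0; p2_x <= 1; p2_x++)
        for (p2_y = 0; p2_y <= 1; p2_y++)
        for (p3_y = 0; p3_y <= 1; p3_y++)
        for (p3_x = 0; p3_x <= 1; p3_x++) {
            /* With full sync barriers, P2 and P3 must agree on the order in
             * which the two stores propagated.  The only candidate violating
             * that requirement is the classic IRIW outcome itself. */
            int iriw = (p2_x == 1 && p2_y == 0 &&
                        p3_y == 1 && p3_x == 0);
            if (!iriw) {
                allowed++;
                printf("2:r3=%d; 2:r5=%d; 3:r3=%d; 3:r5=%d;\n",
                       p2_x, p2_y, p3_y, p3_x);
            }
        }
        /* Prints 15, matching the "States 15" in the herd output below. */
        printf("%d of 16 candidates allowed\n", allowed);
        return 0;
    }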

Running herd (as in “herd -model ppc.cat IRIW.litmus -o /tmp” after downloading it via https://github.com/herd/herdtools) on the IRIW litmus test file gives the following output, which, like ppcmem, confirms that the assertion can never fire (note the second-to-last line beginning with “Observation”):

Test IRIW Allowed
States 15
2:r3=0; 2:r5=0; 3:r3=0; 3:r5=0;
2:r3=0; 2:r5=0; 3:r3=0; 3:r5=1;
2:r3=0; 2:r5=0; 3:r3=1; 3:r5=0;
2:r3=0; 2:r5=0; 3:r3=1; 3:r5=1;
2:r3=0; 2:r5=1; 3:r3=0; 3:r5=0;
2:r3=0; 2:r5=1; 3:r3=0; 3:r5=1;
2:r3=0; 2:r5=1; 3:r3=1; 3:r5=0;
2:r3=0; 2:r5=1; 3:r3=1; 3:r5=1;
2:r3=1; 2:r5=0; 3:r3=0; 3:r5=0;
2:r3=1; 2:r5=0; 3:r3=0; 3:r5=1;
2:r3=1; 2:r5=0; 3:r3=1; 3:r5=1;
2:r3=1; 2:r5=1; 3:r3=0; 3:r5=0;
2:r3=1; 2:r5=1; 3:r3=0; 3:r5=1;
2:r3=1; 2:r5=1; 3:r3=1; 3:r5=0;
2:r3=1; 2:r5=1; 3:r3=1; 3:r5=1;
No
Witnesses
Positive: 0 Negative: 15
Condition exists (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
Observation IRIW Never 0 15
Hash=41423414f4e33c57cc1c9f17cd585c4d

However, instead of taking 14 hours to compute this result, herd completes in about 16 milliseconds, representing a speedup in excess of 3,000,000x, which is something that you do not see every day. (The herd paper claims only a 45,000x speedup.) However, it is only fair to add that ppcmem/armmem was implemented in a deliberately straightforward way, without performance optimizations, in order to increase the probability that it was in fact implementing the intended memory model, a point that is also called out on pages 3 and 61 of the herd paper.

The tool also reports assertion violations. For example, substituting the lightweight-barrier lwsync instruction for the full-barrier sync instruction gives the following output:

Test IRIW Allowed
States 16
2:r3=0; 2:r5=0; 3:r3=0; 3:r5=0;
2:r3=0; 2:r5=0; 3:r3=0; 3:r5=1;
2:r3=0; 2:r5=0; 3:r3=1; 3:r5=0;
2:r3=0; 2:r5=0; 3:r3=1; 3:r5=1;
2:r3=0; 2:r5=1; 3:r3=0; 3:r5=0;
2:r3=0; 2:r5=1; 3:r3=0; 3:r5=1;
2:r3=0; 2:r5=1; 3:r3=1; 3:r5=0;
2:r3=0; 2:r5=1; 3:r3=1; 3:r5=1;
2:r3=1; 2:r5=0; 3:r3=0; 3:r5=0;
2:r3=1; 2:r5=0; 3:r3=0; 3:r5=1;
2:r3=1; 2:r5=0; 3:r3=1; 3:r5=0;
2:r3=1; 2:r5=0; 3:r3=1; 3:r5=1;
2:r3=1; 2:r5=1; 3:r3=0; 3:r5=0;
2:r3=1; 2:r5=1; 3:r3=0; 3:r5=1;
2:r3=1; 2:r5=1; 3:r3=1; 3:r5=0;
2:r3=1; 2:r5=1; 3:r3=1; 3:r5=1;
Ok
Witnesses
Positive: 1 Negative: 15
Condition exists (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
Observation IRIW Sometimes 1 15
Hash=a886ed63a2b5bddbf5ddc43195a85db7

This confirms that full barriers are required in order to prevent the assertion from firing: Compare the “Sometimes” on the second-to-last line of the output for lwsync to the “Never” on that same line in the previous output. This means that in the version of this litmus test using the sync instruction, the assertion can “Never” happen, while in the version using the lwsync instruction, it “Sometimes” can. Although this is nice to know, it would be even nicer to understand how the lwsync barriers allow this result to happen. Such understanding can be used to drive an interactive ppcmem/armmem session, providing a useful cross-check of herd's results. (And performance is not an issue when you are interactively studying a single path of execution rather than conducting a full state-space search.)

And that is what the “-graph cluster -show prop” arguments to the herd command are for. This generates a graphviz dot file as follows:

[Graphviz output]

This diagram shows that while the lwsync instructions enforce the order of the individual reads, they do not force the writes to be propagated in the same order to Thread 2 and Thread 3. This may not be a surprise, but it is still nice to see it in diagram form. The labels on the arrows are defined as follows:

Quick Quiz 2: But the diagram has a cycle involving the two fr arrows! How can that possibly happen? Do these computer systems have some sort of built-in time machine???
Answer
  • fr: From read, which connects earlier loads to later stores at the same address, for example, from Thread 3's read of x to Thread 0's later write to this same variable.

  • GHB: Global happens before, which is based on the effects of memory-barrier instructions and dataflow dependencies, illustrating the limited transitivity (aka “cumulativity”) of the lwsync instruction.

  • po: Program order, for example, between the pairs of reads.

  • propbase: Base store-propagation relationship that the herd tool uses to construct the full set of store-propagation relationships.

  • rf: Read from, for example from the store to x to the load that returns the value written by that store.
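
Roughly speaking, a model built from such relations works by requiring certain unions of them to be acyclic (the “acyclic GHB” check in the answer to Quick Quiz 3 below is one example). The following toy C sketch, which is our illustration and bears no relation to herd's actual implementation, hand-codes the six edges of the cycle visible in the diagram above, under the assumption that with sync (unlike lwsync) all six edges become part of the globally ordered relation, and then detects the cycle with a depth-first search; finding such a cycle is what lets the model reject this candidate execution when sync is used:

    #include <stdio.h>

    #define NEV 6   /* a: P0 st x, b: P1 st y, c: P2 ld x, d: P2 ld y,
                       e: P3 ld y, f: P3 ld x */

    /* Edges of the (assumed) globally ordered relation for the sync
     * version of IRIW: rf, barrier-enforced program order, and fr. */
    static const int edge[][2] = {
        {0, 2},   /* a -rf->   c */
        {2, 3},   /* c -sync-> d */
        {3, 1},   /* d -fr->   b */
        {1, 4},   /* b -rf->   e */
        {4, 5},   /* e -sync-> f */
        {5, 0},   /* f -fr->   a */
    };
    #define NEDGE ((int)(sizeof(edge) / sizeof(edge[0])))

    static int color[NEV];   /* 0 = unvisited, 1 = on the DFS stack, 2 = done */

    static int dfs(int v)
    {
        int i;

        color[v] = 1;
        for (i = 0; i < NEDGE; i++) {
            if (edge[i][0] != v)
                continue;
            if (color[edge[i][1]] == 1)
                return 1;                     /* back edge: cycle found */
            if (color[edge[i][1]] == 0 && dfs(edge[i][1]))
                return 1;
        }
        color[v] = 2;
        return 0;
    }

    int main(void)
    {
        int v, cyclic = 0;

        for (v = 0; v < NEV; v++)
            if (color[v] == 0 && dfs(v))
                cyclic = 1;
        printf(cyclic ? "cycle: candidate execution rejected\n"
                      : "acyclic: candidate execution allowed\n");
        return 0;
    }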
Quick Quiz 3: How could anyone possibly describe something as complex as a memory model in only 23 lines of text?
Answer

The herd tool handles PowerPC and a few variants of ARM, as well as x86, a fragment of the C++ language (including its memory model), and several generations of NVIDIA GPUs, with the C++11 and GPU models having been supplied by Jade Alglave, Mark Batty, Luc Maranget, Daniel Poetzl, Tyler Sorensen and John Wickerson. In fact, one of the strengths of the herd tool is that it uses a very compact description of memory models, so that the ppc.cat file describing the PowerPC memory model is 67 lines long, and the x86tso.cat file describing the x86 memory model is only 23 lines long.

An example x86 litmus test follows, which was discussed on LKML in November 2013:

 1 X86 LockRelease
 2 ""
 3 { 
 4 x=0; y=0; z=0;
 5 }
 6  P0          | P1          | P2          ;
 7  MOV [x],$1  | MOV EAX,[z] | MOV [y],$1  ;
 8  MOV [z],$1  | MOV EBX,[y] | MFENCE      ;
 9              |             | MOV EAX,[x] ;
10 exists (1:EAX=1 /\ 1:EBX=0 /\ 2:EAX=0)

The incantation “./herd -model x86tso.cat x86-lock-barrier.litmus -o ~/tmp” gives the following output:

Quick Quiz 4: What kind of locking primitive gets away without using atomic instructions, anyway???
Answer

Test LockRelease Allowed
States 7
1:EAX=0; 1:EBX=0; 2:EAX=0;
1:EAX=0; 1:EBX=0; 2:EAX=1;
1:EAX=0; 1:EBX=1; 2:EAX=0;
1:EAX=0; 1:EBX=1; 2:EAX=1;
1:EAX=1; 1:EBX=0; 2:EAX=1;
1:EAX=1; 1:EBX=1; 2:EAX=0;
1:EAX=1; 1:EBX=1; 2:EAX=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (1:EAX=1 /\ 1:EBX=0 /\ 2:EAX=0)
Observation LockRelease Never 0 7
Hash=98f5658ae7325da4775b1358920d466a

This is in agreement with the verdict from Intel's CPU architects—and herd delivered this verdict after a delay that extended for a full eleven milliseconds.

Linked data structures

It turns out that both ppcmem/armmem and herd can handle simple linked data structures, namely a singly linked list with no data. This makes it possible for these tools to validate some simple uses of RCU, for example, the 3-light scenario in the LWN article on the RCU-barrier menagerie:

 1 PPC RCU-3-tmp
 2 ""
 3 (* https://lwn.net/Articles/573497/ Scenario 3-light for Power.
 4    Also adds a temporary. *)
 5 {
 6 0:r0=1; 0:r3=x; 0:r4=y; 0:r5=z;
 7 1:r0=1; 1:r3=x; 1:r4=y; 1:r5=z;
 8 }
 9  P0             | P1            ;
10   lwz r6,0(r3)  | stw r0,0(r4)  ;
11   cmpwi r6,0    | lwsync        ;
12   beq Fail      | stw r4,0(r3)  ;
13   stw r6,0(r5)  |               ;
14   lwz r8,0(r5)  |               ;
15   lwz r7,0(r8)  |               ;
16  Fail:          |               ;
17 
18 exists
19 (0:r6=y /\ 0:r7=0)

Processor P1 initializes the new node (which is variable y) on line 10 and does an rcu_assign_pointer() into the list header (which is variable x) on lines 11 and 12. Variable x now contains a pointer to variable y, which in turn contains the value one. Meanwhile, processor P0 does an rcu_dereference() (line 10), checks for a NULL pointer (lines 11 and 12), stores the resulting non-NULL pointer to a temporary (line 13), loads the pointer back from the temporary (line 14), and finally loads the contents of the first element in the list.
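
In C, the pattern being modeled looks roughly like the following user-space sketch. The node structure, variable names, and use of C11 atomics are ours: memory_order_release approximates the lwsync emitted on Power by rcu_assign_pointer(), and memory_order_consume approximates the dependency ordering relied upon by rcu_dereference().

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {
        int val;
    };

    /* gp plays the role of the list header x in the litmus test. */
    static _Atomic(struct node *) gp;

    /* Updater: roughly what P1 does -- initialize, then publish. */
    void publish(void)
    {
        struct node *p = malloc(sizeof(*p));

        if (p == NULL)
            return;
        p->val = 1;                                           /* stw r0,0(r4) */
        atomic_store_explicit(&gp, p, memory_order_release);  /* lwsync; stw  */
    }

    /* Reader: roughly what P0 does -- fetch the pointer, then dereference. */
    int read_val(void)
    {
        struct node *p;

        p = atomic_load_explicit(&gp, memory_order_consume);  /* lwz r6,0(r3) */
        if (p == NULL)                                        /* cmpwi; beq   */
            return -1;
        return p->val;          /* guaranteed to see 1, never the stale zero */
    }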

If the value returned from rcu_dereference() is a pointer to y, but dereferencing that pointer returns a value of zero, we are in trouble. Therefore, the condition on line 19 should never happen. And, fortunately, both tools agree that it never does:

Test RCU-3-tmp Allowed
States 2
0:r6=0; 0:r7=0;
0:r6=y; 0:r7=1;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (0:r6=y /\ 0:r7=0)
Observation RCU-3-tmp Never 0 2
Hash=413c6e17f142bfa3835ed993c8ebf814

There is some work in progress to handle more complex data structures — though, in the words of one of the researchers: “Don't hold your breath!” Nevertheless, the above litmus test shows that even this limited ability to handle single-member data structures can be quite useful.

Scalability

We already saw that herd is blindingly fast, but it is still interesting to look at its scalability. Two tests will demonstrate this, the first being an extension of the IRIW litmus test discussed above, augmented so that the writing processors do additional (useless) writes. The five-writes-per-processor version of this test appears as follows:

 1 PPC IRIW5.litmus
 2 ""
 3 (* Traditional IRIW, but with five stores instead of just one. *)
 4 {
 5 0:r1=1; 0:r2=x;
 6 1:r1=1;         1:r4=y;
 7         2:r2=x; 2:r4=y;
 8         3:r2=x; 3:r4=y;
 9 }
10  P0           | P1           | P2                 | P3                 ;
11  stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2)       | lwz r3,0(r4)       ;
12  addi r1,r1,1 | addi r1,r1,1 | sync               | sync               ;
13  stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4)       | lwz r5,0(r2)       ;
14  addi r1,r1,1 | addi r1,r1,1 |                    |                    ;
15  stw r1,0(r2) | stw r1,0(r4) |                    |                    ;
16  addi r1,r1,1 | addi r1,r1,1 |                    |                    ;
17  stw r1,0(r2) | stw r1,0(r4) |                    |                    ;
18  addi r1,r1,1 | addi r1,r1,1 |                    |                    ;
19  stw r1,0(r2) | stw r1,0(r4) |                    |                    ;
20 
21 exists
22 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)

The time required as a function of the number of per-thread writes is as follows:

    Writes per Processor   Time (s)   Relative Slowdown   Slowdown vs. 1 Write
    1                        0.017
    2                        0.152           8.9x                 8.9x
    3                        1.940          12.8x               114.1x
    4                       25.182          13.0x             1,481.3x
    5                      984.798          39.1x            57,929.3x

So although herd is quite fast compared to full-state-space search, it is still most emphatically exponential. One of the reasons for the slowness is that “./herd -model ppc.cat” calculates all possible coherence orders (orders of writes to a given single variable), even those that do not matter to the reads. Fortunately, herd provides an alternative memory model that computes only the subset of the coherence order that actually matters to readers. Substituting this alternative “prettycat.cat” model for “ppc.cat” runs the five-write-per-processor litmus test in 3.888 seconds, roughly 250 times faster than the 984.798 seconds that ppc.cat requires. Of course, as with other software tools, adding optimization adds risk!

The second test chains the WWC+lwsyncs litmus test, but instead of extending the number of writes per processor, extends the number of processors. Here is the base-level three-processor version:

 1 PPC WWC+lwsyncs
 2 "Rfe LwSyncdRW Rfe LwSyncdRW Wse"
 3 Cycle=Rfe LwSyncdRW Rfe LwSyncdRW Wse
 4 {
 5 0:r2=x;
 6 1:r2=x; 1:r4=y;
 7 2:r2=y; 2:r4=x;
 8 }
 9  P0           | P1           | P2           ;
10  li r1,2      | lwz r1,0(r2) | lwz r1,0(r2) ;
11  stw r1,0(r2) | lwsync       | lwsync       ;
12               | li r3,1      | li r3,1      ;
13               | stw r3,0(r4) | stw r3,0(r4) ;
14 exists
15 (x=2 /\ 1:r1=2 /\ 2:r1=1)

In C, this looks something like the following (but in real life, don't forget the ACCESS_ONCE() calls!):

 1 CPU 0          CPU 1          CPU 2
 2 -----          -----          -----
 3 x = 2;         r1 = x;        r2 = y;
 4                lwsync();      lwsync();
 5                y = 1;         x = 1;
 6 
 7 assert(!(x==2 && r1==2 && r2==1));
(The assertion is intended to run after all the parallel code has finished.)
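
As an aside on those ACCESS_ONCE() calls: the kernel's definition at the time was essentially just a volatile cast, and its purpose is to keep the compiler from fusing, tearing, or hoisting the marked access. Roughly (the wait_for_flag() example is ours):

    /* Kernel-style definition, shown here for reference (uses GCC's typeof). */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    static int flag;

    /* Without ACCESS_ONCE(), the compiler may hoist the load of flag out of
     * the loop, turning this into an infinite loop even after another CPU
     * sets flag to 1. */
    void wait_for_flag(void)
    {
        while (!ACCESS_ONCE(flag))
            ;  /* spin */
    }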

This can be extended to add a fourth processor as follows:

 1 PPC WWC4+lwsyncs
 2 "Rfe LwSyncdRW Rfe LwSyncdRW Wse"
 3 Cycle=Rfe LwSyncdRW Rfe LwSyncdRW Wse
 4 {
 5 0:r2=x;
 6 1:r2=x; 1:r4=y;
 7 2:r2=y; 2:r4=z;
 8 3:r2=z; 3:r4=x;
 9 }
10  P0           | P1           | P2           | P3           ;
11  li r1,2      | lwz r1,0(r2) | lwz r1,0(r2) | lwz r1,0(r2) ;
12  stw r1,0(r2) | lwsync       | lwsync       | lwsync       ;
13               | li r3,1      | li r3,1      | li r3,1      ;
14               | stw r3,0(r4) | stw r3,0(r4) | stw r3,0(r4) ;
15 exists
16 (x=2 /\ 1:r1=2 /\ 2:r1=1 /\ 3:r1=1)

Which looks something like the following in C:

 1 CPU 0          CPU 1          CPU 2          CPU 3
 2 -----          -----          -----          -----
 3 x = 2;         r1 = x;        r2 = y;        r3 = z;
 4                lwsync();      lwsync();      lwsync();
 5                y = 1;         z = 1;         x = 1;
 6 
 7 assert(!(x==2 && r1==2 && r2==1 && r3 == 1));

This test continues adding processors up through 12 processors, resulting in this monstrosity, which is best viewed in full-screen mode. Surprisingly, herd deals just fine with twelve threads. And although herd still shows exponential behavior, the slowdowns are considerably milder than for the previous augmented IRIW example:

    Threads   Time (s)   Relative Slowdown   Slowdown vs. 3 Processors
    3           0.017
    4           0.023           1.4x                 1.4x
    5           0.049           2.1x                 2.9x
    6           0.107           2.2x                 6.3x
    7           0.255           2.4x                15.0x
    8           0.613           2.4x                36.1x
    9           1.471           2.4x                86.5x
    10          3.444           2.3x               202.6x
    11          7.938           2.3x               466.9x
    12         17.711           2.2x             1,041.8x

In short, herd appears to be able to handle most reasonably sized litmus tests with eminently reasonable runtime. Perhaps surprisingly, herd seems happier with multiple processors than with multiple writes to the same memory location.

Limitations and summary

However, it is important to consider some intrinsic limitations of the herd tool, which are similar to those of ppcmem/armmem:

  1. These tools do not constitute official statements by ARM, IBM, or Intel on their respective CPU architectures. For example, all three corporations reserve the right to report a bug at any time against any version of any of these tools. These tools are therefore not a substitute for careful stress testing on real hardware. Moreover, both the tools and the model that they are based on are under active development and might change at any time.
  2. Contrariwise, at least one counterexample (observed on real hardware) to the results reported by the tools has been recognized by the official architects as demonstrating a silicon bug. The processors are also under active development.
  3. ppcmem/armmem on the one hand and herd on the other do not always agree. The tools give the same results for every litmus test in a suite of more than 8,000 such tests, but a couple of examples are known where they differ. One is Figure 37 of the herd paper, and Alan found a smaller example; both are quite obscure. There is, however, a formal result in one direction: Section 7.3 of the paper (pages 45-49) contains a proof that whenever ppcmem/armmem believes an assertion can be triggered, herd will report the same. In other words, herd is more conservative: it may indicate that a litmus test contains a synchronization error even though ppcmem/armmem thinks the test is okay. Either way, when one of the tools reports a possible assertion violation, it is wise to use their reporting and visualization capabilities to help understand the reason for the violation.
  4. The herd tool is still exponential in nature, though much, much faster than its predecessors.
  5. Neither armmem/ppcmem nor herd is much good for complex data structures, although you can create and traverse super-simple linked lists using initialization statements of the form “x=y; y=z; z=42;”.
  6. Neither armmem/ppcmem nor herd handles memory-mapped I/O or device registers. Although there might come a time when we have formal models not only of normal memory, but of I/O devices as well, now is not yet that time.
  7. The tools will detect only those problems for which you code an assertion. This weakness is common to all formal methods, and is yet another reason why testing remains important. In the immortal words of Donald Knuth: “Beware of bugs in the above code; I have only proved it correct, not tried it.”

On the other hand, one nice thing about tools such as ppcmem/armmem and herd is that they are designed to model the full range of behaviors permitted by the architecture. This full range includes behaviors that current hardware does not exhibit, but which future hardware might. Use of these tools is thus one good way to future-proof your code.

All that said, it is best to design your code to be easily partitioned and to use higher-level synchronization primitives such as locks, sequence counters, atomic operations, and RCU. But for those of us who design and implement these synchronization primitives, ppcmem/armmem and herd are likely to be quite helpful.

Acknowledgments

We owe thanks to H. Peter Anvin (and the anonymous Intel CPU architects) who validated the x86 litmus test, and to Richard Grisenthwaite and Derek Williams for many helpful discussions of the ARM and the Power architectures, respectively. We are all indebted to Luc Maranget, Jade Alglave, Peter Sewell, and Susmit Sarkar for their review of an early draft of this article. We are grateful to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Why can't the CPU designers simply make their hardware do things in order???

Answer: Because full ordering destroys performance and scalability. This reordering of earlier stores against later loads allows the hardware to buffer stores, so that a long series of stores need not force the hardware to stall while waiting for ownership of the corresponding cache lines to make its way to the CPU doing the stores. In fact, thread-local reordering is just the beginning: compilers can also eliminate and introduce memory accesses, and in the Power and ARM architectures different writes by a given processor can propagate to the other processors in different orders.

Back to Quick Quiz 1.

Quick Quiz 2: But the diagram has a cycle involving the two fr arrows! How can that possibly happen? Do these computer systems have some sort of built-in time machine???

Answer: The hardware uses exactly the same time machine that we all use, namely the one that propels us into the future at a rate of one second per second. To see how that cycle can happen, imagine a system where Thread 0 and Thread 2 are close to each other, for example, they might be a pair of hardware threads running in the same CPU core. Imagine that Thread 1 and Thread 3 are similarly close to each other. This could cause Thread 2's loads to become aware of Thread 0's store much more quickly than Thread 3's loads do, as shown in the following diagram, with time advancing from top to bottom.

[Diagram: the two stores propagating to Thread 2 and Thread 3 with different delays, time advancing from top to bottom]

As you can see, this “counter-intuitive” result can be explained by simple propagation delay. The question of how sync instructions manage to prevent executions like this one is left as an open exercise for advanced students.

Back to Quick Quiz 2.

Quick Quiz 3: How could anyone possibly describe something as complex as a memory model in only 23 lines of text?

Answer: By using high-level notation, for example as follows for x86:

"X86 TSO"

(* Uniproc check *)
let com = rf | fr | co
acyclic po-loc | com

(* Atomic *)
empty atom & (fre;coe)

(* GHB *)
#ppo
let po_ghb = WW(po) | RM(po)

#implied barriers
let poWR = WR(po)
let i1 = PA(poWR)
let i2 = AP(poWR)
let implied = i1 | i2

let GHB = mfence | implied | po_ghb | rfe | fr | co
show mfence
show implied
acyclic GHB

This might serve as a useful reminder that “concise” and “easy to read” are orthogonal concepts. On the other hand, the 23 lines include both blank lines and comments. The notation is explained in the herd paper.

Back to Quick Quiz 3.

Quick Quiz 4: What kind of locking primitive gets away without using atomic instructions, anyway???

Answer: High-contention handoff of a queued locking primitive.

Now, the initial acquisition of a queued lock does in fact use atomic instructions, as it must to handle the possibility of multiple CPUs concurrently attempting to acquire the lock at a time when no CPU holds it. However, the high-contention handoff is from the CPU holding the lock to the CPU that is first in line to acquire it. Because there are only two CPUs involved, atomic instructions are not needed.

This should come as no surprise, as the classic single-producer/single-consumer array-based FIFO implementation also needs no atomic instructions, and for the same reason.
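
For the curious, here is a sketch of such a single-producer/single-consumer FIFO in C11 (the ring size and function names are arbitrary). Each index is written by only one thread, so ordinary loads and stores with acquire/release ordering suffice; no atomic read-modify-write instructions appear anywhere.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RING_SIZE 16   /* power of two, chosen arbitrarily for this sketch */

    static int ring[RING_SIZE];
    static atomic_uint head;   /* written only by the producer */
    static atomic_uint tail;   /* written only by the consumer */

    /* Producer: returns false if the ring is full. */
    bool enqueue(int val)
    {
        unsigned int h = atomic_load_explicit(&head, memory_order_relaxed);
        unsigned int t = atomic_load_explicit(&tail, memory_order_acquire);

        if (h - t == RING_SIZE)
            return false;
        ring[h % RING_SIZE] = val;
        /* Release ordering publishes the data before the new head value. */
        atomic_store_explicit(&head, h + 1, memory_order_release);
        return true;
    }

    /* Consumer: returns false if the ring is empty. */
    bool dequeue(int *val)
    {
        unsigned int t = atomic_load_explicit(&tail, memory_order_relaxed);
        unsigned int h = atomic_load_explicit(&head, memory_order_acquire);

        if (h == t)
            return false;
        *val = ring[t % RING_SIZE];
        /* Release ordering ensures the slot is read before being recycled. */
        atomic_store_explicit(&tail, t + 1, memory_order_release);
        return true;
    }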

Back to Quick Quiz 4.


Axiomatic validation of memory barriers and atomic instructions

Posted Aug 22, 2014 6:37 UTC (Fri) by kernel.developer1 (guest, #74895) [Link] (5 responses)

But this leads to the question: “Exactly where do I need to place memory-barrier instructions?” For most developers, even Linux kernel hackers, the answer is “Nowhere.” The reason for this reassuring answer is that most synchronization primitives already contain the needed memory barriers. Therefore, if you use locks, sequence counters, atomic operations, or read-copy-update (RCU) in the intended manner, you don't need to add any explicit memory-barrier instructions. Instead, the required memory-barrier instructions are included in these synchronization primitives.

The above mentions that with "atomic operations" you don't need to add any explicit memory-barrier instructions.

When I read Documentation/atomic_ops.txt

*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

So this confused me.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 22, 2014 9:30 UTC (Fri) by cladisch (✭ supporter ✭, #50193) [Link]

Atomic operations are synchronized, but only as far as their own value is concerned.

If you combine an atomic_t variable with other data, you need barriers to describe the relationships between them.

Locks and RCU are intended to be used to protect other data.
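
For example, something like this kernel-style sketch (hypothetical names, not from the article or the documentation) publishes data guarded by an atomic_t flag; atomic_set() and atomic_read() imply no barriers, so the ordering must be supplied explicitly:

    static int payload;
    static atomic_t ready = ATOMIC_INIT(0);

    void writer(void)
    {
        payload = 42;
        smp_wmb();              /* order the payload store before the flag store */
        atomic_set(&ready, 1);  /* atomic_set() itself implies no barrier */
    }

    int reader(void)
    {
        if (atomic_read(&ready)) {
            smp_rmb();          /* order the flag load before the payload load */
            return payload;
        }
        return -1;
    }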

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 26, 2014 0:42 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (3 responses)

The atomic operations that return values (other than, as you noted, atomic_read()) imply full barriers before and after. And these are the ones that need it; see http://lwn.net/Articles/423994/ for more information on this.

That said, controlling your concurrency with locks is usually easier than using atomic operations, at least in cases where locking's performance and scalability suffices.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 26, 2014 8:05 UTC (Tue) by jzbiciak (guest, #5246) [Link] (2 responses)

I think I missed that article the first time 'round. Thank you for linking it.

FWIW, on "future work" item #3, it seems like unconditional lock acquisition is equivalent to either:

  • A function with a return value that says that you acquired the lock (and just always happens to be 'true'), or
  • A function with a while-loop around a call to a conditional lock acquisition function that spins until the lock is actually acquired.

Those two statements actually say the same thing under the hood—that there's a fundamental step with a return value (whether you acquired the lock) and a composite step that repeats the fundamental step until a goal is achieved (you actually acquired the lock).

To tease this apart in the SNC framework, it seems like you need some additional criterion to determine whether your measurement point for SNC-ness is at the appropriate level, or that you must drill down.

In this case, the execution time of "unconditional lock acquisition" can not be bounded because it depends on the state of the lock and the behavior of the other lock holders, and so represents SNC communication. (In this case, to the CPU's conditional branch circuitry.) That suggests a potential criterion stated in terms of the termination condition for the API. If the termination condition that allows the API to exit is SNC, the API is SNC.

At least, that's my initial gut feel after reading that article.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 26, 2014 8:22 UTC (Tue) by jzbiciak (guest, #5246) [Link] (1 responses)

Minor clarification. On this point:

the execution time of "unconditional lock acquisition" can not be bounded

I'd like to add "... on the basis of the inputs to the lock acquisition API alone."

Basically, I was trying to capture the idea that if some code tries to take a lock unconditionally, the lock-taker's execution time is almost entirely at the mercy of the lock holder regardless of how clever the lock-taker's algorithms are. You cannot progress until the lock holder releases the lock.

And, of course, if multiple threads try to take the lock simultaneously, other threads may win ahead of this one, so the identity of "lock holder" may change during the process of attempting to acquire the lock. I don't think that changes the idea I'm attempting to express.

Stating these ideas precisely is simultaneously necessary and difficult. :-)

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 26, 2014 15:12 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Indeed, this stuff is not easy. Ten years from now, when all the details are worked out, it will probably seem a lot easier, but of course other problems will arise in the meantime. (Where would be the challenge otherwise?)

Believe it or not, in some special cases, the execution of lock acquisition can be bounded. I briefly cover this in a blog post (http://paulmck.livejournal.com/37329.html), and Bjoern Brandenburg analyzed it in depth in his dissertation (http://www.mpi-sws.org/~bbb/papers/index.html#thesis). Surprising, perhaps, but true!

Nevertheless, your original statement that lock acquisition cannot be bounded is correct in the not-uncommon case of non-FIFO locking primitives and potentially unbounded critical sections (e.g., waiting for a network packet to arrive while holding a lock). Even with FIFO locking primitives, the upper bounds for lock acquisition can grow extremely large as the number of threads increases, as the permitted lock nesting levels increase, and as the number of real-time priority levels in use increases.

Nevertheless, it is worth noting that locking does not necessarily imply unbounded execution times.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 24, 2014 11:47 UTC (Sun) by meyert (subscriber, #32097) [Link] (3 responses)

Why can R5 be 0 for P2 and P3 in the PPC IRIW litmus test? Both CPUs executed a sync, so why can R5 be 0, as the 15 states in the herd output suggest? Doesn't the sync operation make these reads dependent on the writes?

So the only assertion to make from this test seems to be that "P2 and P3 don't see the two stores from P0 and P1 happening in different orders." How can this assertion be useful at all? Say I want to make a decision based on R5 in P2 and P3, to only execute some code when P0 or P1 has set the value to 1?

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 25, 2014 7:50 UTC (Mon) by jzbiciak (guest, #5246) [Link] (2 responses)

Including the IRIW litmus test here for convenience:

 1 PPC IRIW.litmus
 2 ""
 3 (* Traditional IRIW. *)
 4 {
 5 0:r1=1; 0:r2=x;
 6 1:r1=1;         1:r4=y;
 7         2:r2=x; 2:r4=y;
 8         3:r2=x; 3:r4=y;
 9 }
10  P0           | P1           | P2                 | P3                 ;
11  stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2)       | lwz r3,0(r4)       ;
12               |              | sync               | sync               ;
13               |              | lwz r5,0(r4)       | lwz r5,0(r2)       ;
14 
15 exists
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)

So, you ask how R5 can be zero. I assume you mean to ask "How can R5 be zero for either processor when that processor's R3 is not zero," correct? The trivial answer to your question otherwise is "P2 and P3 ran before P0 and P1", which I doubt is satisfactory. :-)

Let's focus on the two most interesting states allowed by IRIW:

  • P2's R3=1 and R5=1, but P3's R3=1 and R5=0
  • P3's R3=1 and R5=1, but P2's R3=1 and R5=0

In the first case, P2 saw both writes to X and Y, but P3 only saw the write to Y. That is allowed, because P0 and P1 are not ordered with respect to each other. You could reason that P2 read strictly after P0 and P1, but P3 read after P1 but before P0.

In the second case, P3 saw both writes to X and Y, but P2 only saw the write to X. That is also allowed, because P0 and P1 are not ordered with respect to each other. You could reason that P3 read strictly after both P0 and P1, but P2 read after P0 but before P1.

So, then how about the assertion at the end?

The assertion at the end effectively says "If P2 saw X=1 and P3 saw Y=1, then either P2 also sees Y=1 or P3 sees X=1." What sync apparently does (and lwsync does not do) is provide an illusion of a global order for stores. Regardless of whether P0 wrote before P1 or vice versa, it appears that 'sync' ensures that all writes visible before the 'sync' have a consistent ordering with respect to all writes visible after the 'sync'; furthermore, this ordering is the same for all processors issuing a 'sync'.

In this example, the 'sync' on P2 and P3 has the effect of making them agree on whether P0 or P1 happened first. (In contrast, 'lwsync' apparently only locally orders their loads, which does not cause them to agree on which of P0 or P1 happened first.)

How could this assertion be useful? I'm hard pressed to say with certainty off the top of my head; however, I think some primitives such as Lamport's Bakery and certain other synchronization barriers rely on all readers seeing a consistent order of writers for multiple processors joining the primitive.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 25, 2014 18:10 UTC (Mon) by meyert (subscriber, #32097) [Link] (1 responses)

Thanks for this explanation, I think I got it now...

So the presentation in the litmus test is a bit misleading and has no information of when the given code is executed on each processor/thread. So line 10 does not execute exactly at the same time on each thread!

So, as an example, a correct usage of the sync would be:
Given a data structure that is written/created by only one thread:
One field in this data structure could be used as an indicator of whether this data structure is valid now, i.e., the creation is finished and readers are allowed to use this information now.

So the writer should write all necessary fields in the structure and do a sync to establish a global ordering. Then the writer could set the "information is valid" field to true and sync again.

All readers must spin till they see the indicator field set to true. After that they can begin to use the information in the structure.
I'm not sure if the readers need to call sync in this case to establish the same view on the global ordering.

Hope the above example isn't too bad :-)

Thanks again for your explanation.

Axiomatic validation of memory barriers and atomic instructions

Posted Aug 25, 2014 18:48 UTC (Mon) by jzbiciak (guest, #5246) [Link]

So the presentation in the litmus test is a bit misleading and has no information of when the given code is executed on each processor/thread. So line 10 does not execute exactly at the same time on each thread!

Precisely. There are four threads, and their relative execution timelines could vary arbitrarily, except for any ordering the architecture imposes on them (say, through sync or other architectural guarantees).

I read up a little more on the PPC sync instruction today, here. The sync instruction is a true full memory barrier, in that for the processor executing sync, all memory accesses in the system are either strictly before the barrier, or strictly after the barrier.

They also appear to speak to the situation in the example directly in the page linked above:

The first form is heavy-weight sync, or commonly just referred to as sync. This form is often needed by adapter device drivers, for ordering system memory accesses made by a driver critical section with accesses made to its I/O adapter. Executing sync ensures that all instructions preceding the sync instruction have completed before the sync instruction completes, and that no subsequent instructions are initiated until after the sync instruction completes. This does not mean that the previous storage accesses have completed before the sync instruction completes. A store access has not been performed until a subsequent load is guaranteed to see the results of the store. Similarly, a load is performed only when the value to be returned can no longer be altered by a store.

The memory barrier created by the sync instruction provides "cumulative" ordering. This means that after a program executing on processor P0 observes accesses performed by a program executing on processor P1 and then executes a sync instruction, subsequent storage accesses performed by the program executing on processor P0 will be performed with respect to other processors after the accesses it observed to have been performed by the program executing on processor P1. For example, a program executing on processor P3 will not see a sequence of storage accesses that would conflict with the order of the sequence observed and performed by the program executing on processor P0.

So calling it a global order of stores as I did in my previous comment is perhaps inaccurate. Rather, if one processor observes stores ordered in a particular way across a sync, then no other processor will view an ordering of stores inconsistent with that.

The page linked above is an interesting read, if you're curious about PPC memory ordering.


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds