LWN: Comments on "Axiomatic validation of memory barriers and atomic instructions" https://lwn.net/Articles/608550/ This is a special feed containing comments posted to the individual LWN article titled "Axiomatic validation of memory barriers and atomic instructions".

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609564/ https://lwn.net/Articles/609564/ PaulMcKenney <div class="FormattedComment"> Indeed, this stuff is not easy. Ten years from now, when all the details are worked out, it will probably seem a lot easier, but of course other problems will arise in the meantime. (Where would be the challenge otherwise?)<br> <p> Believe it or not, in some special cases, the execution time of lock acquisition can be bounded. I briefly cover this in a blog post (<a href="http://paulmck.livejournal.com/37329.html">http://paulmck.livejournal.com/37329.html</a>), and Bjoern Brandenburg analyzed it in depth in his dissertation (<a href="http://www.mpi-sws.org/~bbb/papers/index.html#thesis">http://www.mpi-sws.org/~bbb/papers/index.html#thesis</a>). Surprising, perhaps, but true!<br> <p> Nevertheless, your original statement that lock acquisition cannot be bounded is correct in the not-uncommon case of non-FIFO locking primitives and potentially unbounded critical sections (e.g., waiting for a network packet to arrive while holding a lock). Even with FIFO locking primitives, the upper bounds on lock acquisition can grow extremely large as the number of threads increases, as the permitted lock nesting levels increase, and as the number of real-time priority levels in use increases.<br> <p> Even so, it is worth noting that locking does not necessarily imply unbounded execution times.<br> </div> Tue, 26 Aug 2014 15:12:37 +0000

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609543/ https://lwn.net/Articles/609543/ jzbiciak <P>Minor clarification. On this point: <BLOCKQUOTE><I>the execution time of "unconditional lock acquisition" can not be bounded </I></BLOCKQUOTE> <P>I'd like to add "... on the basis of the inputs to the lock acquisition API alone."</P><P> Basically, I was trying to capture the idea that if some code tries to take a lock unconditionally, the lock-taker's execution time is almost entirely at the mercy of the lock holder, regardless of how clever the lock-taker's algorithms are. You cannot progress until the lock holder releases the lock.</P> <P>And, of course, if multiple threads try to take the lock simultaneously, other threads may win ahead of this one, so the identity of the "lock holder" may change during the process of attempting to acquire the lock. I don't think that changes the idea I'm attempting to express.</P> <P>Stating these ideas precisely is simultaneously necessary and difficult.<TT> :-)</TT></P> Tue, 26 Aug 2014 08:22:52 +0000

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609542/ https://lwn.net/Articles/609542/ jzbiciak <P>I think I missed that article the first time 'round. Thank you for linking it.</P> <P>FWIW, on "future work" item #3, it seems like unconditional lock acquisition is equivalent to either:</P> <UL><LI>A function with a return value that says that you acquired the lock (and just always happens to be 'true'), or <LI>A function with a while-loop around a call to a <I>conditional</I> lock acquisition function that spins until the lock is actually acquired.</UL> <P>Those two statements actually say the same thing under the hood: there's a fundamental step with a return value (whether you acquired the lock) and a composite step that repeats the fundamental step until a goal is achieved (you actually acquired the lock).</P><P>To tease this apart in the SNC framework, it seems like you need some additional criterion to determine whether your measurement point for SNC-ness is at the appropriate level, or whether you must drill down.</P> <P>In this case, the execution time of "unconditional lock acquisition" can not be bounded because it depends on the state of the lock and the behavior of the other lock holders, and so represents SNC communication. (Here, to the CPU's conditional branch circuitry.) That suggests a potential criterion stated in terms of the termination condition for the API. If the termination condition that allows the API to exit is SNC, the API is SNC.</P> <P>At least, that's my initial gut feel after reading that article.</P> Tue, 26 Aug 2014 08:05:57 +0000
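<P>A minimal sketch of the equivalence jzbiciak describes above, in Linux-kernel-style C (the wrapper function name is made up; <TT>spin_trylock()</TT> and <TT>cpu_relax()</TT> are the real kernel primitives): the "unconditional" acquisition is just the conditional step repeated until it succeeds, so its termination is in the hands of whoever currently holds the lock.</P>
<PRE>
#include <linux/spinlock.h>

/* Hypothetical wrapper, for illustration only. */
static void acquire_unconditionally(spinlock_t *lock)
{
        while (!spin_trylock(lock))     /* fundamental step: did we get it? */
                cpu_relax();            /* composite step: retry until we do */
}
</PRE>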
Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609537/ https://lwn.net/Articles/609537/ PaulMcKenney <div class="FormattedComment"> The atomic operations that return values (other than, as you noted, atomic_read()) imply full barriers before and after. And these are the ones that need it; see <a href="http://lwn.net/Articles/423994/">http://lwn.net/Articles/423994/</a> for more information on this.<br> <p> That said, controlling your concurrency with locks is usually easier than using atomic operations, at least in cases where locking's performance and scalability suffice.<br> </div> Tue, 26 Aug 2014 00:42:38 +0000
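<P>A minimal sketch of the distinction Paul describes, in Linux-kernel-style C (the <TT>struct obj</TT>, its fields, and the helper names are made up): <TT>atomic_dec_and_test()</TT> returns a value, so it implies a full memory barrier before and after the operation, while <TT>atomic_inc()</TT> returns nothing and implies none.</P>
<PRE>
#include <linux/atomic.h>
#include <linux/slab.h>

/* Hypothetical reference-counted object, for illustration only. */
struct obj {
        atomic_t refcount;
        int payload;
};

static void obj_get(struct obj *p)
{
        atomic_inc(&p->refcount);       /* no return value, no implied barrier */
}

static void obj_put(struct obj *p)
{
        /*
         * The full barrier implied before the decrement orders this
         * thread's earlier accesses to p->payload before the count can
         * reach zero, so the thread that frees the object cannot see
         * them race with the free.
         */
        if (atomic_dec_and_test(&p->refcount))
                kfree(p);
}
</PRE>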
Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609518/ https://lwn.net/Articles/609518/ jzbiciak <BLOCKQUOTE><I>So the presentation in the litmus test is a bit misleading and has no information of when the given code is executed on each processor/thread. So line 10 does not execute exactly at the same time on each thread!</I></BLOCKQUOTE> <P>Precisely. There are four threads, and their relative execution timelines could vary arbitrarily, except for any ordering the architecture imposes on them (say, through<TT> sync </TT>or other architectural guarantees).</P> <P>I read up a little more on the PPC<TT> sync </TT>instruction today, <A HREF="http://www.ibm.com/developerworks/systems/articles/powerpc.html">here.</A> The<TT> sync </TT>instruction is a true full memory barrier, in that for the processor executing<TT> sync</TT>, all memory accesses in the system are either strictly before the barrier, or strictly after the barrier.</P> <P>They also appear to speak to the situation in the example directly in the page linked above:</P> <BLOCKQUOTE><I>The first form is heavy-weight sync, or commonly just referred to as sync. This form is often needed by adapter device drivers, for ordering system memory accesses made by a driver critical section with accesses made to its I/O adapter. Executing sync ensures that all instructions preceding the sync instruction have completed before the sync instruction completes, and that no subsequent instructions are initiated until after the sync instruction completes. This does not mean that the previous storage accesses have completed before the sync instruction completes. A store access has not been performed until a subsequent load is guaranteed to see the results of the store. Similarly, a load is performed only when the value to be returned can no longer be altered by a store. <BR><BR> The memory barrier created by the sync instruction provides <B>"cumulative" ordering.</B> This means that after a program executing on processor P0 observes accesses performed by a program executing on processor P1 and then executes a sync instruction, subsequent storage accesses performed by the program executing on processor P0 will be performed with respect to other processors after the accesses it observed to have been performed by the program executing on processor P1. <U>For example, a program executing on processor P3 will not see a sequence of storage accesses that would conflict with the order of the sequence observed and performed by the program executing on processor P0.</U> </I></BLOCKQUOTE> <P>So calling it a global order of stores as I did in my previous comment is perhaps inaccurate. Rather, if one processor observes stores ordered in a particular way across a<TT> sync</TT>, then no other processor will view an ordering of stores inconsistent with that.</P> <P>The page linked above is an interesting read, if you're curious about PPC memory ordering.</P> Mon, 25 Aug 2014 18:48:10 +0000

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609515/ https://lwn.net/Articles/609515/ meyert <div class="FormattedComment"> Thanks for this explanation, I think I got it now...<br> <p> So the presentation in the litmus test is a bit misleading and has no information of when the given code is executed on each processor/thread. So line 10 does not execute exactly at the same time on each thread!<br> <p> So, as an example, a correct usage of the sync would be:<br> Given a data structure that is written/created by only one thread:<br> One field in this data structure could be used as an indicator that this data structure is valid now, i.e., the creation is finished and a reader is allowed to use this information now.<br> <p> So the writer should write all necessary fields in the structure and do a sync to establish a global ordering. Then the writer could set the "information is valid" field to true and sync again.<br> <p> All readers must spin until they see the indicator field set to true. After that they can begin to use the information in the structure.<br> I'm not sure if the readers need to call sync in this case to establish the same view on the global ordering.<br> <p> Hope the above example isn't too bad :-)<br> <p> Thanks again for your explanation.<br> <p> </div> Mon, 25 Aug 2014 18:10:47 +0000
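<P>A minimal sketch of the single-writer "valid flag" pattern meyert describes, written with the portable Linux barrier macros rather than raw PPC sync (the structure, field names, and <TT>do_work()</TT> are made up for illustration). Note that the readers do need a barrier of their own between reading the flag and reading the fields; the writer's barrier alone is not enough.</P>
<PRE>
/* Illustration only; struct, fields, and do_work() are hypothetical. */
struct shared_thing {
        int a;
        int b;
        int valid;              /* set to 1 once a and b may be read */
};

static struct shared_thing shared;

static void writer(void)
{
        shared.a = 1;
        shared.b = 2;
        smp_wmb();              /* order the field stores before the flag store */
        shared.valid = 1;
}

static void reader(void)
{
        while (!ACCESS_ONCE(shared.valid))
                cpu_relax();
        smp_rmb();              /* order the flag read before the field reads */
        do_work(shared.a, shared.b);
}
</PRE>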
Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609446/ https://lwn.net/Articles/609446/ jzbiciak <P>Including the IRIW litmus test here for convenience:</P>
<PRE>
 1 PPC IRIW.litmus
 2 ""
 3 (* Traditional IRIW. *)
 4 {
 5 0:r1=1; 0:r2=x;
 6 1:r1=1; 1:r4=y;
 7 2:r2=x; 2:r4=y;
 8 3:r2=x; 3:r4=y;
 9 }
10 P0           | P1           | P2           | P3           ;
11 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
12              |              | sync         | sync         ;
13              |              | lwz r5,0(r4) | lwz r5,0(r2) ;
14
15 exists
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
</PRE>
<P> So, you ask how R5 can be zero. I assume you mean to ask "How can R5 be zero for either processor when that processor's R3 is not zero," correct? The trivial answer to your question otherwise is "P2 and P3 ran before P0 and P1", which I doubt is satisfactory.<TT> :-)</TT></P> <P>Let's focus on the two most interesting states allowed by IRIW:</P> <UL><LI>P2's R3=1 and R5=1, but P3's R3=1 and R5=0 <LI>P3's R3=1 and R5=1, but P2's R3=1 and R5=0</UL> <P> In the first case, P2 saw both writes to X and Y, but P3 only saw the write to Y. That is allowed, because P0 and P1 are not ordered with respect to each other. You could reason that P2 read strictly after P0 and P1, but P3 read after P1 but before P0. </P> <P> In the second case, P3 saw both writes to X and Y, but P2 only saw the write to X. That is <I>also</I> allowed, because P0 and P1 are not ordered with respect to each other. You could reason that P3 read strictly after both P0 and P1, but P2 read after P0 but before P1. </P> <P>So, then how about the assertion at the end?</P> <P>The assertion at the end effectively says "If P2 saw X=1 and P3 saw Y=1, then either P2 also sees Y=1 or P3 sees X=1." What sync apparently does (and lwsync does not do) is provide an illusion of a global order for stores. Regardless of whether P0 wrote before P1 or vice versa, it appears that 'sync' ensures that all writes visible before the 'sync' have a consistent ordering with respect to all writes visible after the 'sync'; furthermore, this ordering is the same for all processors issuing a 'sync'.</P> <P>In this example, the 'sync' on P2 and P3 has the effect of making them agree on whether P0 or P1 happened first. (In contrast, 'lwsync' apparently only locally orders their loads, which does not cause them to agree on which of P0 or P1 happened first.)</P> <P>How could this assertion be useful? I'm hard pressed to say with certainty off the top of my head; however, I think some primitives such as Lamport's Bakery and certain other synchronization barriers rely on all readers seeing a consistent order of writers for multiple processors joining the primitive.</P> Mon, 25 Aug 2014 07:50:18 +0000
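<P>For readers more comfortable with C than PPC assembly, here is a hypothetical C11 rendering of the same IRIW shape (not taken from the article; each function runs on its own thread, and the names are invented). With the default <TT>memory_order_seq_cst</TT> on every access, the "exists" outcome above is forbidden, which parallels what the sync version guarantees here; weaken the loads to <TT>memory_order_acquire</TT> and the outcome becomes allowed, much like the lwsync version.</P>
<PRE>
#include <stdatomic.h>

static atomic_int x, y;
static int p2_r3, p2_r5, p3_r3, p3_r5;

void p0(void) { atomic_store(&x, 1); }          /* runs on its own thread */
void p1(void) { atomic_store(&y, 1); }          /* runs on its own thread */

void p2(void)                                   /* runs on its own thread */
{
        p2_r3 = atomic_load(&x);
        p2_r5 = atomic_load(&y);
}

void p3(void)                                   /* runs on its own thread */
{
        p3_r3 = atomic_load(&y);
        p3_r5 = atomic_load(&x);
}
</PRE>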
Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609416/ https://lwn.net/Articles/609416/ meyert <div class="FormattedComment"> Why can R5 be 0 for P2 and P3 in the PPC IRIW litmus test? Both CPUs executed a sync, so why can R5 be 0, as the 15 states in the herd output suggest? Doesn't the sync operation make these reads dependent on the writes?<br> <p> So the only assertion to make from this test seems to be that "P2 and P3 don't see the two stores from P0 and P1 happening in different orders." How can this assertion be useful at all? Say I want to make a decision based on R5 in P2 and P3, to execute some code only when P0 or P1 has set the value to 1?<br> <p> </div> Sun, 24 Aug 2014 11:47:16 +0000

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609233/ https://lwn.net/Articles/609233/ cladisch Atomic operations are synchronized, but only as far as <em>their own value</em> is concerned.<br> <p> If you combine an atomic_t variable with other data, you need barriers to describe the relationships between them.<br> <p> Locks and RCU are <em>intended</em> to be used to protect other data. Fri, 22 Aug 2014 09:30:12 +0000

Axiomatic validation of memory barriers and atomic instructions https://lwn.net/Articles/609221/ https://lwn.net/Articles/609221/ kernel.developer1 <div class="FormattedComment"> Quoting the article:<br> <p> But this leads to the question: “Exactly where do I need to place memory-barrier instructions?” For most developers, even Linux kernel hackers, the answer is “Nowhere.” The reason for this reassuring answer is that most synchronization primitives already contain the needed memory barriers. Therefore, if you use locks, sequence counters, atomic operations, or read-copy-update (RCU) in the intended manner, you don't need to add any explicit memory-barrier instructions. Instead, the required memory-barrier instructions are included in these synchronization primitives. <br> <p> The above says that with "atomic operations" you don't need to add any explicit memory-barrier instructions.<br> <p> But when I read Documentation/atomic_ops.txt, it says: <br> <p> *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***<br> <p> So this confused me. <br> </div> Fri, 22 Aug 2014 06:37:46 +0000
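<P>A minimal sketch of the point cladisch makes above, with made-up names: the atomic_t keeps its own value consistent, but ordering it against other data still requires explicit barriers, precisely because atomic_set() and atomic_read() imply none.</P>
<PRE>
/* Illustration only; "payload", "ready", and the helpers are hypothetical. */
static int payload;
static atomic_t ready = ATOMIC_INIT(0);

static void publisher(void)
{
        payload = compute_payload();    /* hypothetical producer */
        smp_wmb();                      /* order the payload store before the flag */
        atomic_set(&ready, 1);          /* atomic_set() implies no barrier */
}

static void subscriber(void)
{
        if (atomic_read(&ready)) {      /* atomic_read() implies no barrier */
                smp_rmb();              /* needed before touching payload */
                consume(payload);       /* hypothetical consumer */
        }
}
</PRE>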