

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 20:11 UTC (Sun) by itsmycpu (guest, #139639)
In reply to: Protecting control dependencies with volatile_if() by tialaramex
Parent article: Protecting control dependencies with volatile_if()

So why (and how) can this be implemented more efficiently (on any architecture) than a combination of load_acquire(A) and store_release(B)?



Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 20:31 UTC (Sun) by itsmycpu (guest, #139639) [Link] (12 responses)

...or, in this case, just a load_acquire.

How is a barrier like

> #define barrier() __asm__ __volatile__("":::"memory")

more efficient than the barrier implied by load_acquire (on any architecture)?
(Not a rhetorical question.)

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 22:41 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (11 responses)

This barrier has *zero* impact on the machine: it doesn't emit any instructions for the machine to run. This barrier only changes what the compiler will do. I can't emphasise that enough. There might in fact be no difference in the machine code at all; the goal of barrier() is only to prevent the compiler from spotting an optimisation that's inappropriate and mustn't happen. If your compiler was not in fact going to attempt an optimisation that's now forbidden, the actual machine code is literally identical.
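
(To illustrate — a minimal sketch of the idea, not the kernel's exact definition; reader() and the globals are made up:)

#define barrier() __asm__ __volatile__("" ::: "memory")

int data, flag;

int reader(void)
{
    int d = data;   /* the compiler may not sink this load below... */
    barrier();      /* ...this point, nor hoist the next load above it */
    int f = flag;   /* no instruction is emitted for barrier() itself; */
    return d + f;   /* the CPU remains free to reorder at runtime */
}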

In contrast, acquire/release have consequences for the machine as well as the compiler. So the compiler will emit different machine code to ensure that the sequence of operations has acquire/release semantics, and then at runtime the actual CPU may do something different to ensure the semantics you asked for are achieved.

On x86-64 the instructions are only slightly different, and only in some cases: you actually pay for implicit acquire/release semantics all the time on x86 without knowing it. On the original DEC Alpha it's a horrible nightmare of locking and CPU barriers; and on some ARM CPUs you can literally say "this load has acquire semantics" and "this store has release semantics" and everything works nicely.

So barrier() definitely CAN be faster, because it doesn't emit any instructions at all. Of course, just emitting fewer instructions doesn't automatically make you faster; to find out reliably in each case you'd have to benchmark. But first you'd have to be sure you care enough to go to that effort.
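
(For comparison, here is the message-passing pattern those acquire/release operations implement, in standard C11 atomics — a sketch; the kernel's smp_load_acquire()/smp_store_release() are the analogous primitives:)

#include <stdatomic.h>

int payload;        /* plain data */
atomic_int ready;   /* synchronization flag */

void producer(void)
{
    payload = 42;
    /* release: earlier stores become visible to anyone who
     * observes this store with an acquire load */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* acquire: on weaker architectures this may cost an extra
     * barrier or a special load instruction */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return payload;   /* guaranteed to read 42 */
    return -1;
}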

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 0:24 UTC (Mon) by itsmycpu (guest, #139639) [Link] (10 responses)

> This barrier has *zero* impact on the machine: it doesn't emit any instructions for the machine to run. This barrier only changes what the compiler will do.

Are you sure? I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions. In this case, it sounds like the conditional branch instruction might have (welcome) effects on hardware ordering. It is even called a "dummy" conditional branch. Specifically, there appears to be a certain equivalence on arm64 between

LDR X0, [X1]
CBNZ X0, 1f // Dummy ctrl

and ("with LTO")

LDAPR X0, [X1] // RCpc

> In contrast, acquire/release have consequences for the machine as well as the compiler.

Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.

My question is about the specifics of architectures where this isn't so simple.

The ordering required here isn't really very different from acquire and/or release, and barrier() is apparently a very general tool. So it isn't clear what advantage it is meant to have on specific architectures, especially since by that reasoning barrier() could seemingly be used to implement even load_acquire.

Or perhaps, by the end of the email discussion, volatile_if is meant more as a convenience than as a possible performance optimization.
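
(For context, the pattern I understand the proposal to be about — a sketch using the kernel macros; A and B stand for shared variables:)

/* today: a store control-dependent on a load; the hardware won't
 * make the store visible before the load, but the compiler might
 * break the dependency by optimizing the branch away */
if (READ_ONCE(A))
    WRITE_ONCE(B, 1);

/* the proposal: make the dependency explicit, so the compiler is
 * forced to keep a real conditional branch */
volatile_if(READ_ONCE(A))
    WRITE_ONCE(B, 1);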

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 2:16 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (9 responses)

> Are you sure?

Yes of course I'm sure. Why else would I write that I cannot emphasise this enough?

> I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions

Yes, most machines would need some explicit machine instructions to prevent the hardware from re-ordering memory accesses, if that was what you wanted to achieve. The Linux barrier() is a compiler barrier, not a CPU barrier. So it does not forbid the CPU from re-ordering memory accesses as it wishes; that is intentional. Use the right tool for the job.
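
(Roughly the distinction, using the arm64 encoding as the example — a sketch from memory, simplified from the kernel headers:)

/* compiler barrier: constrains the compiler only, emits nothing */
#define barrier()  __asm__ __volatile__(""        ::: "memory")

/* CPU barrier, arm64 flavour: emits a DMB instruction, which also
 * forbids the hardware from reordering accesses across it */
#define smp_mb()   __asm__ __volatile__("dmb ish" ::: "memory")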

> Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.

If you'd read just one sentence further, you'd have seen I more or less spelled this out. It's actually the opposite, with all that implies: simple MOV instructions have acquire/release semantics on x86. You are incurring the cost of an acquire or release (and sometimes both) for every access. Yet that still isn't quite enough for full-on sequential consistency, so naive x86 programs still have surprising races.
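
(The classic such race is store buffering — a sketch; X, Y, r0 and r1 are placeholders:)

int X, Y;   /* both initially 0 */

/* thread 0 */
X = 1;          /* plain MOV store */
int r0 = Y;     /* plain MOV load */

/* thread 1 */
Y = 1;
int r1 = X;

/* x86 permits r0 == 0 && r1 == 0: each store can still be sitting
 * in its CPU's store buffer when the other CPU's load executes, so
 * plain MOVs give acquire/release but not sequential consistency */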

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 3:00 UTC (Mon) by itsmycpu (guest, #139639) [Link] (8 responses)

> Yes, most machines would need some explicit machine instructions to prevent the hardware from re-ordering memory accesses, if that was what you wanted to achieve. The Linux barrier() is a compiler barrier, not a CPU barrier. So it does not forbid the CPU from re-ordering memory accesses as it wishes; that is intentional. Use the right tool for the job.

I'm afraid your answers are more confusing than clarifying.

If __asm__ __volatile__("":::"memory") results only in a CPU barrier even on "weaker" architectures, then in this case, on some architectures, that alone would not be sufficient. It is those architectures that my question is about.

That means something like the conditional branch instruction is required to complete the ordering.

So the question remains: why would that not also be enough to implement load_acquire, such that load_acquire has the same performance as barrier + conditional branch?

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 3:03 UTC (Mon) by itsmycpu (guest, #139639) [Link] (2 responses)

> If __asm__ __volatile__("":::"memory") results only in a CPU barrier even on "weaker" architectures,

Obviously, I meant: if it only results in a *compiler* barrier....

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 9:59 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (1 responses)

> in this case, on some architectures, that alone would not be sufficient. It is those architectures that my question is about.

On which architectures do you believe that the CPU exposes memory writes that only happen after a conditional branch, before actually evaluating that branch? Because that's the scenario under consideration here. The compiler barrier is to prevent compilers from doing the same re-ordering, which can look fairly reasonable to them because they've got a different view of the situation.

Linux is not a hypothetical program, so it isn't focused on executing correctly on hypothetical hardware, only on real hardware.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 11:56 UTC (Mon) by itsmycpu (guest, #139639) [Link]

> On which architectures do you believe that the CPU exposes memory writes that only happen after a conditional branch, before actually evaluating that branch?

On none; as a hardware optimization, I don't expect that to happen.

The question I'm asking is if and why (or why not), on any architecture (perhaps arm64), this:

volatile_if(READ_ONCE(A)) {
    WRITE_ONCE(B, 1);
}

might be faster than this:

if (load_acquire(A)) {
    WRITE_ONCE(B, 1);
}

especially if the latter is well optimized by a compiler. I tend to think that the second form, using load_acquire, can be just as fast, at least if the compiler does a good job. However, perhaps there are architectures on which that isn't true, and that is what my question is about.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 16:42 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.

Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.


# Thread 0
MOV R0, #42
MOV R1, #99
STR [addr3], R1
STR [addr2], R1
STR_REL [addr1], R0

# Thread 1
MOV R5, 1
MOV R6, 1
LDR R4, [addr1]
CMP R4, #42
BEQ load
JMP end
.load
LDR R5, [addr2]
.end
LDR R6, [addr3]

The question we're trying to answer is "what values are in R5 and R6 of thread 1 at the label end?".

On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.

On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.

ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr2. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.

If the load to R4 was a load-acquire, then all three processors would observe the same memory contents. However, both ARM and Alpha would be denied reordering opportunities by the load-acquire that they have without it, and hence could be slower.
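
(In C-ish terms, the acquire variant of thread 1 that would make all three architectures agree — addr1_val, addr2_val and addr3_val are placeholder names for the three locations:)

_Atomic int addr1_val;      /* placeholder for the word at addr1 */
int addr2_val, addr3_val;   /* placeholders for addr2 and addr3 */

int r5 = 1, r6 = 1;
/* acquire: no later load may be reordered before this one */
int r4 = atomic_load_explicit(&addr1_val, memory_order_acquire);
if (r4 == 42)
    r5 = addr2_val;   /* must observe 99 on Alpha, x86 and ARM */
r6 = addr3_val;       /* now also ordered after the load of r4 */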

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 22:43 UTC (Mon) by itsmycpu (guest, #139639) [Link] (3 responses)

> Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.

This is the claim I am interested in.

> Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.

Thanks for going into this detail.
I'm assuming that you also meant addr3 to be initialized to 0. It seems you changed the location of the ".end" label while writing your post. As it is, with the end label between the loads of R5 and R6, R6 couldn't be 1 (assuming addr3 is initialized to 0).

> On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.

R6 can be 1 only if the ".end" label is moved to the actual end. So here I am putting R6 aside, since then it will be like R5 (but see the discussion of R6 for ARM).

My understanding is that "volatile_if" is meant to prevent a re-ordering of the effective load of R5 to a point before the load of R4. In fact, that appears to be the whole point (I don't see any indication that it would be meant to affect only stores). If it does succeed in preventing that, then only (R5 = 99) is possible if R4 is 42, since 42 is written with a store_release.

So if Alpha behaves as you say, I don't see how the pseudo code would be sufficient to implement "volatile_if". And I wouldn't know what it takes to fix that, or how that would affect our discussion.

> On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.

Yes.
(R6 can be 1 only if the ".end" label is moved to the actual end.)

> ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr2. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.

Here it seems that you do want the ".end" label to be between the loads of R5 and R6, since you don't claim that R6 could be 1. That means R6 is meant to be outside the volatile_if statement, and since, also at the language level, volatile_if is meant to allow R6 to be 0 (regardless of R4's value) in that case, I can see that ARM would be able to take advantage of that and perform the load before or during the conditional, whereas x86 would not.

In situations where an R6 does not exist, it still seems the compiler could optimize load_acquire to produce equivalent code even without a construct like volatile_if (insofar as that would be a distinct advantage), since R5 must be 99 either way if R4 is 42.

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 3:38 UTC (Tue) by itsmycpu (guest, #139639) [Link] (2 responses)

A follow-up question for ARM would be:

What if the load of R6 appears in both the 'then' path and the 'else' path of the BEQ conditional branch, instead of after the conditional paths join?

Then there would be two load instructions, each dependent on the conditional. Would that mean that loading R6 is now ordered, or is the CPU allowed to internally join the two paths and remove the dependency? Is ARM's behavior defined in this case? I think it needs to be, and the straightforward definition would be that R6 is then ordered. (And that the compiler could use this for optimizations.)

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 9:47 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

First, thank you for reading my previous comment as intended, not as written - I did make a lot of mistakes trying to put together a case where ARM's ordering is weaker than store-acquire, but you caught my intentions.

Yes, if the load of R6 appears in both paths, then it's an ordered load, due to the dependency affecting both cases, and thus ARM can't reorder it under its memory model; in theory, the compiler can exploit this to get R6 ordered even though the load of R6 is unconditional, while still allowing future loads to be reordered before the load of R6.
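
(A sketch of that transformation in kernel-style C — whether a compiler would really perform it is another matter; A, C, r6 and the do_*() calls are placeholders:)

int r6;

/* unconditional load: not ordered against READ_ONCE(A) on ARM */
if (READ_ONCE(A))
    do_then();
else
    do_else();
r6 = READ_ONCE(C);

/* duplicated into both arms: per the guarantee described above,
 * each copy of the load now depends on the branch and is ordered */
if (READ_ONCE(A)) {
    r6 = READ_ONCE(C);
    do_then();
} else {
    r6 = READ_ONCE(C);
    do_else();
}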

The point of the ARM behaviour here is to give you a guarantee that code like the following C-ish pseudo-code does what you intended, without forcing full ordering between the two threads:


const int global42 = 42;
int global_int;
int *global_ptr;
void main() {
    global_ptr = &global42;
    full_memory_barrier();
    start_threads(thread0, thread1);
    wait_for_threads();
}

void thread0() {
    global_int = 123;
    atomic_store_explicit(&global_ptr, &global_int, memory_order_release);
}

void thread1() {
    int *i = global_ptr;
    int j = *i;
    if (i == &global_int) {
        assert(j == 123);
    } else {
        assert(j == 42);
    }
}

Because the store to global_ptr is a release store, any thread which observes that store also observes the store to global_int. thread1 can thus use the value of i to determine which value it loaded.

On Alpha's memory model (which is as weak as you can get), it is permissible for i to point at global_int while the load of *i still returns the old value of global_int (0 rather than 123), because Alpha does not order even dependent loads. ARM disallows this particular case.

Another way of looking at it would be that you can fix thread1 to work correctly on any standards-compliant C compiler by making it:


void thread1() {
    int *i = atomic_load_explicit(&global_ptr, memory_order_consume);
    int j = *i;
    if (i == &global_int) {
        assert(j == 123);
    } else {
        assert(j == 42);
    }
}

And on Alpha, the atomic_load_explicit needs to put appropriate memory ordering instructions into the instruction stream for thread1; on ARM, those ordering instructions are implicit in ordinary loads.

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 21:03 UTC (Tue) by itsmycpu (guest, #139639) [Link]

> The point of the ARM behaviour here ....

Makes sense, and gives a good idea of ARM (and Alpha) in this regard.

