Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 16:42 UTC (Mon) by farnz (subscriber, #17727)
In reply to: Protecting control dependencies with volatile_if() by itsmycpu
Parent article: Protecting control dependencies with volatile_if()
Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.
Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.
# Thread 0
MOV R0, #42
MOV R1, #99
STR [addr3], R1
STR [addr2], R1
STR_REL [addr1], R0

# Thread 1
MOV R5, #1
MOV R6, #1
LDR R4, [addr1]
CMP R4, #42
BEQ load
JMP end
.load
LDR R5, [addr2]
.end
LDR R6, [addr3]
The question we're trying to answer is "what values are in R5 and R6 of thread 1 at the label end?".
On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.
On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.
ARM's memory ordering is mostly weak, like Alpha's, but it adds one more wrinkle: dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons than on x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr2. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.
If the load to R4 was a load-acquire, then all three processors would observe the same memory contents. However, both ARM and Alpha would be denied reordering opportunities by the load-acquire that they have without it, and hence could be slower.
Posted Jun 21, 2021 22:43 UTC (Mon) by itsmycpu (guest, #139639)
This is the claim I am interested in.
> Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.
Thanks for going into this detail.
> On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.
R6 can be 1 only if the ".end" label is moved to the actual end. So here I am putting R6 aside, since then it will be like R5 (but see the discussion of R6 for ARM below).
My understanding is that "volatile_if" is meant to prevent a re-ordering of the effective load of R5 to a point before the load of R4. In fact, that appears to be the whole point (I don't see any indication that it would be meant to affect only stores). If it does succeed in preventing that, then only (R5 = 99) is possible if R4 is 42, since 42 is written with a store_release.
So if Alpha behaves as you say, I don't see how the pseudo code would be sufficient to implement "volatile_if". And I wouldn't know what it takes to fix that, or how that would affect our discussion.
> On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.
Yes.
> ARM's memory ordering is mostly weak, like Alpha's, but it adds one more wrinkle: dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons than on x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr2. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.
Here it seems that you want the ".end" label to really be between the loads of R5 and R6, since you also don't claim that R6 could be 1. That means you intend R6 to be outside of the volatile_if statement, and since, at the language level, volatile_if is meant to allow R6 to be 0 (regardless of R4's value) in that case, I can see that ARM would be able to take advantage of that and perform the load before or during the conditional, whereas x86 would not.
In situations where an R6 does not exist, it still seems the compiler could optimize load_acquire to produce equivalent code also without a construct like volatile_if (in so far as that might be a distinct advantage), since R5 must be 99 either way, if R4 is 42.
Posted Jun 22, 2021 3:38 UTC (Tue) by itsmycpu (guest, #139639)
What if the load of R6 appears in both the 'then' path and the 'else' path of the BEQ conditional branch, instead of after the conditional paths join?
Then there would be two load instructions, each dependent on the conditional. Would that mean that loading R6 is now ordered, or is the CPU allowed to internally join the two paths and remove the dependency? Is ARM's behavior defined in this case? I think it needs to be, and the straightforward definition would be that R6 is then ordered. (And that the compiler could use this for optimizations.)
Posted Jun 22, 2021 9:47 UTC (Tue) by farnz (subscriber, #17727)
First, thank you for reading my previous comment as intended, not as written - I did make a lot of mistakes trying to put together a case where ARM's ordering is weaker than store-acquire, but you caught my intentions.
Yes, if the load of R6 appears in both paths, then it's an ordered load, due to the data dependency affecting both cases, and thus ARM can't reorder in their memory model; in theory, the compiler can exploit this to get R6 ordered even though the load of R6 is unconditional, while still allowing future loads to be reordered before the load of R6.
The point of the ARM behaviour here is to give you a guarantee that code like the following Cish does what you intended, without forcing full ordering between the two threads:

    const int global42 = 42;
    int global_int;
    int *global_ptr;

    void main() {
        global_ptr = &global42;
        full_memory_barrier();
        start_threads(thread0, thread1);
        wait_for_threads();
    }

    void thread0() {
        global_int = 123;
        atomic_store_explicit(&global_ptr, &global_int, memory_order_release);
    }

    void thread1() {
        int *i = global_ptr;
        int j = *i;
        if (i == &global_int) {
            assert(j == 123);
        } else {
            assert(j == 42);
        }
    }

Because the store to global_ptr is a release store, any thread which observes that store also observes the store to global_int. thread1 can thus use the value of i to determine which value it loaded.

On Alpha's memory model (which is as weak as you can get), it is permissible for i to point at global_int while the load of *i still reads the old value 0: even though thread0 stored 123 before its release store, Alpha does not order dependent loads, so the assert(j == 123) can fire. ARM disallows this particular case.

Another way of looking at it would be that you can fix thread1 to work correctly on any standards-compliant C compiler by making it:

    void thread1() {
        int *i = atomic_load_explicit(&global_ptr, memory_order_consume);
        int j = *i;
        if (i == &global_int) {
            assert(j == 123);
        } else {
            assert(j == 42);
        }
    }

And on Alpha, the atomic_load_explicit needs to put appropriate memory-ordering instructions into the instruction stream for thread1; on ARM, those ordering instructions are implicit in ordinary loads.
Posted Jun 22, 2021 21:03 UTC (Tue) by itsmycpu (guest, #139639)
Makes sense, and gives a good idea of ARM (and Alpha) in this regard.
Protecting control dependencies with volatile_if()
I'm assuming that you also meant addr3 to be initialized with 0. It seems you changed the location of the label ".end" while you were writing your post. As it is, with the end label between the loads of R5 and R6, R6 couldn't be 1 (assuming addr3 is initialized with 0).