LWN: Comments on "An instruction-level BPF memory model"

An instruction-level BPF memory model

PaulMcKenney — Tue, 11 Jun 2024 17:08:02 +0000

Yes, I should have said "HOB", apologies for my confusion.

On the "exists" clause, no memory model that enforces single-object sequential consistency can have 0:r3=1 and 0:r5=0. I was therefore looking for some consequence of HOB beyond single-object sequential consistency, hence my "exists" clause matching the HOB definition.

A simpler approach, which I should have thought of to begin with, is to disable HOB in the Aarch64 memory model, for example, like this:

let haz-ob =
  (* Exp-haz-ob | *) TLBI-ob | IC-ob

(If you try this in a clone of the herdtools7 repo, don't forget to “make install” so that your herd7 command knows about the change. And don't forget to do so again after undoing this act of vandalism!)

With this change, running herd7 on catalogue/aarch64/tests/CoRR.litmus says Sometimes, which supports your analysis of HOB, and invalidates mine. So, like DOB, HOB is not optional from a hardware viewpoint.

I agree that LKMM should continue to ignore DOB, at least until someone comes up with a good use case. But given that leaving out HOB causes the ARMv8 memory model to say Sometimes for CoRR.litmus, HOB is not optional.

Thank you for keeping me honest!

An instruction-level BPF memory model

Alan.Stern — Fri, 07 Jun 2024 18:58:06 +0000

Responding to your later points first...

> I am not so sure that I agree with your suggested change to the "exists" clause, given that DOB requires that R3 be coherence-before E2. What am I missing here?

Here we are discussing a litmus test (on slide 69) for HOB, not DOB, so I'll assume your "DOB" above is a typo.

Yes, HOB requires that R3 is coherence-before E2. Hence the "exists" clause should test for the value of R3 (i.e., 0:r5) being the original value of y, namely 0. The conclusion of HOB is that R1 must also be ordered before E2, in other words, that the value of R1 (i.e., 0:r3) must also be the original value of y, namely 0. (Or to put it another way, HOB says that while R1 might be executed later than R3, out of program order, it can't be executed so _much_ later that it observes the value written by E2.) Therefore to test for a violation of HOB, the "exists" clause has to ask whether 0:r3 is different from 0 and 0:r5 is equal to 0. That's what my suggested change does. (Notice that none of the executions listed on slide 70 violate HOB in this way.)
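As a rough illustration of the argument above (a toy enumeration, not a herd7 model), the following sketch lists every outcome of two program-ordered reads racing with one write, assuming only that the single location behaves sequentially consistently. The event names R1, R3, and E2 follow the discussion; the helper names `outcomes` and `exists_clause` are made up for this sketch.

```python
from itertools import permutations

def outcomes():
    """Enumerate (r3, r5) results when two program-ordered reads of one
    location race with a single write, assuming the location behaves
    sequentially consistently (an assumption of this toy model)."""
    events = ["R1", "R3", "E2"]      # R1 -> 0:r3, R3 -> 0:r5, E2 writes 1
    seen = set()
    for order in permutations(events):
        if order.index("R1") > order.index("R3"):
            continue                 # keep the two reads in program order
        value, regs = 0, {}          # the location starts out as 0
        for ev in order:
            if ev == "E2":
                value = 1            # the write takes effect
            else:
                regs[ev] = value     # a read returns the current value
        seen.add((regs["R1"], regs["R3"]))
    return seen

def exists_clause(r3, r5):
    """The suggested condition: 0:r3 differs from 0 and 0:r5 equals 0."""
    return r3 != 0 and r5 == 0

results = outcomes()
hits = [o for o in results if exists_clause(*o)]
print(sorted(results), hits)
```

Under that assumption no outcome satisfies the clause, which matches the observation that the executions on slide 70 do not violate HOB in this way.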

> I do not see how current hardware with current restrictions on speculation could leave out DOB without undue violence to the rest of its memory model, which might well be a failure of imagination on my part. So what are your thoughts on how reasonable hardware could leave out DOB?

I agree with your point of view; DOB is a practical necessity. Another thing it is needed for, which the slides don't mention, is precise exceptions. Consider the Linux-kernel litmus test on slide 41; if the value of gp happened to be NULL so that the write to r1[i].value caused an addressing exception, the exception wouldn't be precise if the write to x (later in the program) had already been executed.

Nevertheless, even though I expect DOB will be enforced by all hardware that runs the Linux kernel, there doesn't seem to be any good reason for including it in the memory model. As far as I know, nobody has suggested any use cases for it or shown that any existing code in the kernel relies on it.

> Slide 69's attempt does show a difference between ARMv8 and PPC using R,

Indeed it does. And I would claim that this difference is _entirely_ due to R and has no connection with HOB at all. Despite what the slide says.

> I would of course be very happy for you to show me a better litmus test demonstrating the beyond-coherence-before effects of HOB!

I don't know what effects you're talking about. As far as I can tell, the only effect of HOB is related to coherence ordering; it says that the value read by R1 has to be earlier in the coherence order than the value written by E2.

An instruction-level BPF memory model

PaulMcKenney — Fri, 07 Jun 2024 15:40:46 +0000

No argument whatsoever with the confusing part! ;-)

Agreed, HOB has to be more than coherence-before. And the point of those slides was in fact to come up with a test for some ordering stronger than coherence-before. Slide 69's attempt does show a difference between ARMv8 and PPC using R, but you are quite right if you are claiming that I did not prove either: (1) That this difference was in fact a consequence of ARMv8 HOB, or (2) That there is not some better test for the effects of HOB beyond coherence-before. I would of course be very happy for you to show me a better litmus test demonstrating the beyond-coherence-before effects of HOB! I am with you in suspecting that other-multicopy atomicity has something to do with it, but I have not yet proved this to myself. (Nor have I yet asked the ARM folks because experience indicates that if I haven't beat my head pretty hard against the question, I won't understand their answer.)

I am not so sure that I agree with your suggested change to the "exists" clause, given that DOB requires that R3 be coherence-before E2. What am I missing here?

You are quite correct on the historical sequence of events leading to our deciding to leave DOB out of LKMM. I was instead giving a post-facto rationale for that decision. I do not see how current hardware with current restrictions on speculation could leave out DOB without undue violence to the rest of its memory model, which might well be a failure of imagination on my part. So what are your thoughts on how reasonable hardware could leave out DOB?

An instruction-level BPF memory model

Alan.Stern — Thu, 06 Jun 2024 18:43:43 +0000

The situation is very confusing; a more detailed explanation may help.

For one thing, consider the description of Hazard Ordered Before (HOB) on slide 60 of your talk. The slide isn't entirely clear; does it require that R1 is coherence-before W2, or does it require that R1 executes before R3? The first meaning (which seems to be the one you're using) follows directly from the read-read coherence rule, so it is a consequence of per-variable sequential consistency. Since you undoubtedly want the BPF memory model to include this, there's no need to consider HOB. And to be crystal clear, PowerPC does not allow an earlier read to observe a later value than a later read of the same variable in the same thread.
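The read-read coherence rule mentioned above can be stated as a small check. This is only a sketch: the function name `corr_violation` is hypothetical, and it assumes each value in the coherence order of the variable is distinct.

```python
def corr_violation(coherence_order, reads_in_program_order):
    """Return True if a later read observed an older value than an earlier
    read of the same variable in the same thread.

    coherence_order: the values of the variable, oldest first (assumed
    distinct for this sketch).
    reads_in_program_order: the values one thread's reads returned.
    """
    position = {value: i for i, value in enumerate(coherence_order)}
    seen = [position[v] for v in reads_in_program_order]
    # A violation is any step backwards in the coherence order.
    return any(later < earlier for earlier, later in zip(seen, seen[1:]))

# Initial value 0, then a write of 1: reading 1 and then 0 is forbidden.
print(corr_violation([0, 1], [1, 0]))
print(corr_violation([0, 1], [0, 1]))
```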

Adding to the confusion is the fact that, despite its title, the litmus test on slide 69 does not in fact test for HOB. That is, the condition in the "exists" clause isn't 0:r3=1 /\ 0:r5=0. The whole added business about y and the memory barriers is irrelevant for HOB. (Compare it to the Linux kernel code fragment on slide 62.) If you wanted to discuss the R pattern in slides 63-68, you shouldn't have gotten it all mixed up with HOB. (That could well be what confused Daroc.) It probably would have been better to avoid both topics and consider other-multicopy atomicity instead.

Another thing is worth pointing out, relating to slide 57. Why doesn't the LKMM include Dependency Ordered Before? Not because it knows the full execution a priori -- that's equally true for the hardware memory models. The real reason goes back to the early days of development of the LKMM. Originally we had two different versions, a strong and a weak model. The strong model _did_ include Dependency Ordered Before. But we ended up dropping the strong model; the current LKMM is a descendant of our weak model.

An instruction-level BPF memory model

PaulMcKenney — Thu, 06 Jun 2024 17:10:34 +0000

Looks good to me!

An instruction-level BPF memory model

daroc — Thu, 06 Jun 2024 17:06:08 +0000

Thank you for catching that; I had to read through Paul's slides again myself to figure out what was going on.

I think that I have made a correct correction to the article.

An instruction-level BPF memory model

comex — Thu, 06 Jun 2024 16:48:38 +0000

So Daroc indeed described this incorrectly in the article.

Which is quite understandable! The property actually being tested is subtle enough that even with the benefit of reading the slides repeatedly and then reviewing the ARM architecture reference manual, I still don't really understand it :)

An instruction-level BPF memory model

PaulMcKenney — Thu, 06 Jun 2024 13:53:07 +0000

That most certainly would be a novel claim. ;-)

But the question is instead: "In ARMv8's hazard-ordered-before (HOB), does the fact that R2 cannot read a value older than that read by R1 impose additional ordering consequences?"

The answer for PowerPC is "no". So to the extent that HOB does impose additional ordering consequences, the BPF memory model needs to avoid imposing similar ordering consequences. This is not a surprise, given that ARMv8 is other-multicopy atomic and PowerPC is non-multicopy atomic. So we should expect that there are constraints in the ARMv8 memory model that do not apply to the BPF memory model. This is all OK, because it is OK for hardware to provide stronger ordering than required, with x86 being the usual poster child for this extra strength.

For more detail, please see slides 59-77 of the presentation at https://drive.google.com/file/d/1zLkpBVL1chDfyqZATOyKb_1Q...

Perhaps one of your questions is addressed by slide 64 and later. The initial scenario involves only one variable, but the added scaffolding required to test for additional ordering does add another variable. Does that help?

An instruction-level BPF memory model

foom — Thu, 06 Jun 2024 12:25:14 +0000

Just to be clear, you are saying the article's statement "operation R1 is allowed to read the value written by W1, even though R2 doesn't." is an accurate reflection of your intended statement?

That, on powerpc CPUs, if the initial value is 0 and W1 stores a 1, you could have R1 read 1 (as written by W1), yet R2 (later in program order than R1, on the same thread) read the initial value of 0?

I'm pretty sure that would be a novel claim, if so.

An instruction-level BPF memory model

PaulMcKenney — Wed, 05 Jun 2024 23:45:34 +0000

You got it right, the "hazard ordered before" example has only the single variable "x".

Admittedly unconventional for a litmus test, but then again, this particular litmus test probes an obscure corner of the ARMv8 memory model. ;-)

An instruction-level BPF memory model

daroc — Wed, 05 Jun 2024 13:38:32 +0000

It is, of course, entirely possible that I have made an error. Looking at my notes, I am quite sure that McKenney's second example did involve only one memory location, but it's possible that I missed some nuance of the specific kind of loads and stores used. There was a substantial portion of the talk dedicated to explaining how the reordering could actually be detected which went far enough over my head (and into the weeds) that I glossed over it in the article.

An atomic instruction is a full barrier to everything?

farnz — Wed, 05 Jun 2024 11:14:08 +0000

Yes, and that's implicit in virtually all descriptions of memory ordering; what we normally talk about is "happens-before" ordering, where we say that A happens-before B if and only if we can deduce that A must have happened given that we can demonstrate that B has happened.

There's also sometimes the need to talk about whether a happens-before ordering is global (i.e. if thread X sees that A happens-before B, then thread Y must agree that A happens-before B) or local (if thread X sees that A happens-before B, thread Y could still see that B happens-before A, that A happens-before B, or that there's no ordering relationship between A and B). This is, I think, what you were reaching towards; an atomic instruction does not always guarantee a global ordering, but on x86, it's defined as doing so (LOCK prefixes are defined this way).

An atomic instruction is a full barrier to everything?

epa — Wed, 05 Jun 2024 10:57:27 +0000

Yes, I see your point. So a more pedantic way to state the property would be that “if the effect of the atomic instruction is visible then the effects of all instructions before it are visible”.

An atomic instruction is a full barrier to everything?

farnz — Wed, 05 Jun 2024 10:46:33 +0000

There are CPUs that are ultra-strict like that (all x86 CPUs, I believe, meet this requirement); in your example, though, the detail you're missing is that if no other code uses r9 or address 123, there's no way to tell if the swap instruction was executed.

And it's a one-way condition; if the swap instruction was executed, then you know that all instructions before the swap instruction were executed. But if you do not know whether or not the swap instruction was executed, you do not know if the instructions before it are visible or not - and if you neither look at r9, nor inspect address 123, how do you know it executed?
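A toy enumeration may make the one-way nature of this concrete. It assumes the ultra-strict behaviour under discussion (each thread's operations take effect in program order), and the names `data` and `flag` and the value 42 are invented for the sketch; `flag` stands in for address 123.

```python
from itertools import combinations

def observations():
    """Interleave a writer (store data; atomic swap on flag) with an observer
    (load flag; load data), keeping each thread in program order."""
    writer = [("store", "data", 42), ("swap", "flag", 1)]
    observer = [("load", "flag"), ("load", "data")]
    total = len(writer) + len(observer)
    seen = set()
    # Choose which time slots belong to the writer; the rest are the observer's.
    for writer_slots in combinations(range(total), len(writer)):
        mem = {"data": 0, "flag": 0}
        w, o = iter(writer), iter(observer)
        loaded = {}
        for slot in range(total):
            if slot in writer_slots:
                _, addr, value = next(w)
                mem[addr] = value
            else:
                _, addr = next(o)
                loaded[addr] = mem[addr]
        seen.add((loaded["flag"], loaded["data"]))
    return seen

seen = observations()
data_when_flag_set = {d for f, d in seen if f == 1}
data_when_flag_clear = {d for f, d in seen if f == 0}
print(data_when_flag_set, data_when_flag_clear)
```

In this model, an observer that sees the swap's effect always sees the earlier store, while an observer that does not see it may or may not see the store. Weakly ordered hardware is deliberately not modelled here.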

An atomic instruction is a full barrier to everything?

epa — Wed, 05 Jun 2024 05:59:19 +0000

> All CPUs and tasks agree that the side effects of all instructions that come before an atomic instruction are visible before it, and all effects of subsequent instructions are not visible until it has been executed.

Is that really true? I am sure there are CPUs that are ultra-strict like that. But I would have expected that only side effects of that instruction (and the earlier instructions which caused it to happen) are guaranteed visible.

If I do an atomic swap between register r9 and address 123, but no other code anywhere uses r9 or looks anywhere near that address, is it necessarily the case that my swap instruction makes visible the effect of all instructions before it? So I could do things with other registers and addresses, then do any swap instruction on any register to act as a barrier for everything done so far?

An instruction-level BPF memory model

foom — Wed, 05 Jun 2024 04:37:25 +0000

PPC definitely doesn't allow the reordering on a single memory location described here as example 2.

I suspect the description may have been oversimplified or mis-transcribed from the talk, and likely originally involved mutations on multiple memory locations.

An instruction-level BPF memory model

comex — Wed, 05 Jun 2024 01:01:55 +0000

Huh. Doesn’t the second example violate C++11’s read-read coherence requirement? [1] That is, assuming you translate the reads and writes described in the example into std::atomic loads and stores that use memory_order_relaxed. Yet the C++11 memory model was designed with the intent that relaxed loads and stores would compile into simple load and store instructions, without barriers, on the architectures common at the time – which includes POWER.

So what gives? Am I misunderstanding the example, am I misunderstanding the C++ memory model, or is there actually a conflict here?

[1] As described in: https://en.cppreference.com/w/cpp/atomic/memory_order