LWN: Comments on "Lockless patterns: full memory barriers"

Lockless patterns: full memory barriers

pbonzini — Fri, 16 Apr 2021 07:01:23 +0000

It's not intuitive at all, but only one memory barrier matters in each of the two cases. Both are still needed (one for each case) to ensure that x = 0 && y = 0 is not possible.
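For readers who want to try this, here is a minimal userspace sketch of the store-buffering pattern, assuming C11 atomics and pthreads rather than the kernel primitives. `atomic_thread_fence(memory_order_seq_cst)` stands in for `smp_mb()`; the names `run_sb_litmus`, `thread0`, `thread1`, `go`, `r0` and `r1` are made up for this illustration and do not come from the article.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared variables of the store-buffering (SB) pattern. */
static atomic_int x, y;
/* What each thread observed; read by the caller after pthread_join(). */
static int r0, r1;
/* Start gate so both threads run at about the same time (harness only). */
static atomic_int go;

static void *thread0(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&go, memory_order_acquire))
        ;                                        /* wait for the start gate */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* stand-in for smp_mb() */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&go, memory_order_acquire))
        ;                                        /* wait for the start gate */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* stand-in for smp_mb() */
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs the SB pattern `iterations` times and returns how often both
 * threads read 0, which the two full fences are supposed to forbid. */
int run_sb_litmus(int iterations)
{
    int forbidden = 0;

    for (int i = 0; i < iterations; i++) {
        pthread_t t0, t1;
        int rc;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        atomic_store(&go, 0);

        rc = pthread_create(&t0, NULL, thread0, NULL);
        assert(rc == 0);
        rc = pthread_create(&t1, NULL, thread1, NULL);
        assert(rc == 0);

        atomic_store_explicit(&go, 1, memory_order_release);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        if (r0 == 0 && r1 == 0)
            forbidden++;
    }
    return forbidden;
}
```

Calling `run_sb_litmus(2000)` should return 0. Removing either fence makes the both-zero outcome legal, although a short run is not guaranteed to actually hit it.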

Lockless patterns: full memory barriers

firolwn — Mon, 15 Mar 2021 14:35:53 +0000

After reading the 'Multicopy atomicity' section of the kernel's Documentation/memory-barriers.txt, I realize that I was wrong and that no additional smp_mb() is needed.
--
Firo

Lockless patterns: full memory barriers

firolwn — Fri, 12 Mar 2021 17:03:25 +0000

Hi Paolo, great article. I think you may have forgotten to add 'smp_mb();' to the two diagrams around the line starting with 'Due to transitivity'.
--
Firo

Lockless patterns: full memory barriers

PaulMcKenney — Tue, 09 Mar 2021 20:57:34 +0000

Or I haven't yet had a chance to read it thoroughly, your choice. ;-)

Lockless patterns: full memory barriers

pbonzini — Tue, 09 Mar 2021 20:31:23 +0000

If "that's all you have to say about that" (as Forrest Gump would put it), then I guess you didn't find any mistake, yay!

Lockless patterns: full memory barriers

jcm — Tue, 09 Mar 2021 05:00:41 +0000

Thing is you don’t even need to flush the sucker, just track age and ensure that they’ve all passed through. You can blow right through every barrier without cost provided you are willing to track enough cache lines in the process. I have been doing a lot of thinking lately about renaming SPRs and speculating through serializing instructions too. There’s no reason you couldn’t (provided you tracked everything, were willing to pay the cost, and also could precisely unwind the state to prevent side-channel crumbs, which might force you to add e.g. a side buffer). This has been dancing in my head for the past week solid.

Lockless patterns: full memory barriers

PaulMcKenney — Tue, 09 Mar 2021 01:41:25 +0000

Nice gentle introduction to the reads-from relation! ("~~~~~~~~") ;-)

Lockless patterns: full memory barriers

jcm — Mon, 08 Mar 2021 20:56:41 +0000

Yea, I did after. I agree, thanks :)

Lockless patterns: full memory barriers

pbonzini — Sun, 07 Mar 2021 20:56:55 +0000

Read the paragraph again, it mentions both scenarios. :)

Lockless patterns: full memory barriers

jcm — Sun, 07 Mar 2021 16:10:35 +0000

Btw, in the SB example, it's not that the store buffer is forwarding locally, "which would return zero", it's that each is writing into its local store buffer while reading the other variable from the initial state. The SB allows the stores to be delayed in terms of observation by other processors relative to the local one.

Lockless patterns: full memory barriers

jcm — Sun, 07 Mar 2021 16:02:25 +0000

x86 processors maintain the illusion of TSO ordering through the use of a MOB (Memory Ordering Buffer), which not only tracks the cache lines for ownership/invalidation but also replays at retirement if necessary in order to maintain ordering. So e.g. a load might be performed twice in order to ensure it retires correctly.

Lockless patterns: full memory barriers

pbonzini — Sat, 06 Mar 2021 17:44:20 +0000

Yep, I even mentioned (with a little bit of simplification) what Intel does for TSO. To some extent it should be possible to do the same for full barriers, by delaying invalidate messages and keeping them buffered until the store buffer has been flushed. After all in most cases there is no race and therefore the effect of the barrier is kind of invisible.

Lockless patterns: full memory barriers

jcm — Sat, 06 Mar 2021 15:28:07 +0000

The mental rabbit hole is separating the perceived observed order from reality. Reality might involve combinations of invalidation queue and store/load buffer tracking, but it might involve something very different :)

Lockless patterns: full memory barriers

jcm — Sat, 06 Mar 2021 15:25:07 +0000

One of the things that has been keeping me up at night recently is thinking about speculating through barriers. You do it by tracking cache ownership/state transitions as you go through, and rolling back as needed. The easiest way to think about it is probably as something close to the apparatus needed for transactions. And there are a few papers in this area. So anyway, the point is that draining various buffers isn’t the only way to pull this off. Think of how Intel does their MOB for TSO by tracking updates and replaying too.

Lockless patterns: full memory barriers

pbonzini — Fri, 05 Mar 2021 19:06:47 +0000

Yes, it's only about the observed order, and in different ways depending on whether the tricks come from the compiler or the processor.

The genius stroke of C++11 compared to e.g. the Java memory model was to treat the compiler+processor combo as weakly ordered even if the underlying architecture is TSO. I don't think any compiler uses very much of the leeway that it's given, but it does make for a nice and consistent model at the language level.

I do prefer the LKMM and its clear foundation on the behavior of hardware (which I tried to convey in this article) to C++11's slightly too handwavy "just use sequential consistency unless you need something else".
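To make the compiler+processor point concrete, here is a small message-passing sketch using C11 atomics, which share their memory model with C++11. On a TSO machine such as x86 the release store and acquire load typically compile to ordinary moves, but the annotations still stop the compiler from reordering the accesses around them. The names `run_message_passing`, `producer`, `consumer`, `payload` and `ready` are invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain, non-atomic data */
static atomic_int ready;   /* flag that publishes the payload */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: the write to payload cannot be moved after this store,
     * neither by the compiler nor by the processor. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;

    /* Acquire: the read of payload cannot be moved before this load. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                  /* spin until the flag is set */
    *out = payload;
    return NULL;
}

/* Runs one producer/consumer pair and returns what the consumer read. */
int run_message_passing(void)
{
    pthread_t p, c;
    int seen = -1;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&c, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);

    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

`run_message_passing()` returns 42: once the consumer sees `ready` set, the release/acquire pair guarantees that it also sees the payload. With both accesses relaxed, the language would no longer promise that, whatever the hardware does.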

Lockless patterns: full memory barriers

jcm — Fri, 05 Mar 2021 18:50:36 +0000

A key thing to remember with memory barriers is they are only about the observed order of memory operations. An implementation is actually free to do very different things with those barriers (including speculating right through them, which you can do as long as there is no circumstance under which someone can observe incorrect ordering).