Lockless patterns: full memory barriers

Posted Mar 5, 2021 19:06 UTC (Fri) by pbonzini (subscriber, #60935)
In reply to: Lockless patterns: full memory barriers by jcm
Parent article: Lockless patterns: full memory barriers

Yes, it's only about the observed order, and in different ways depending on whether the tricks come from the compiler or the processor.

The genius stroke of C++11 compared to e.g. the Java memory model was to treat the compiler+processor combo as weakly ordered even if the underlying architecture is TSO. I don't think any compiler uses much of the leeway it's given, but it does make for a nice and consistent model at the language level.
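To make the compiler+processor point concrete, here is a minimal message-passing sketch in C++11. The names (`Mailbox`, `publish`, `consume`, `message_passing_trial`) are made up for illustration and do not come from the article. On x86 the hardware would keep the two stores in order anyway, but the language still requires the release/acquire pair, because with `memory_order_relaxed` the compiler would be allowed to reorder the payload write past the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical type for illustration; not from the article or the kernel.
struct Mailbox {
    int payload = 0;                 // plain data, published via `ready`
    std::atomic<bool> ready{false};  // flag guarding the payload
};

// Producer: write the payload, then publish it with a release store.
// The release ordering forbids both compiler and CPU from moving the
// payload write after the flag write.
void publish(Mailbox &m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Consumer: spin until the flag is seen. The acquire load pairs with the
// release store, so the payload write is visible once the flag reads true.
int consume(Mailbox &m) {
    while (!m.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return m.payload;
}

// Runs one producer/consumer pair and returns what the consumer read.
int message_passing_trial(int value) {
    Mailbox m;
    int seen = -1;
    std::thread consumer([&] { seen = consume(m); });
    std::thread producer([&] { publish(m, value); });
    producer.join();
    consumer.join();
    return seen;  // safe to read: join() synchronizes with the consumer
}
```

Calling `message_passing_trial(42)` should always return 42 under the C++11 model, whatever the underlying architecture is.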

I do prefer the LKMM and its clear foundation on the behavior of hardware (which I tried to convey in this article) to C++11's slightly too handwavy "just use sequential consistency unless you need something else".
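For comparison, this is roughly what the "just use sequential consistency" default looks like on the store-buffering pattern, which is the case where a full barrier matters. It is only a sketch: the helper names `both_read_zero_once` and `count_forbidden_outcomes` are hypothetical, and the code uses the default `memory_order_seq_cst` operations rather than anything kernel-specific. Each thread stores to its own variable and then loads the other one; sequential consistency forbids the outcome where both loads return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one run of the store-buffering pattern.
// Returns true if both threads read 0, an outcome that sequentially
// consistent atomics do not allow.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r0 = -1, r1 = -1;

    // Default store()/load() use std::memory_order_seq_cst.
    std::thread t0([&] { x.store(1); r0 = y.load(); });
    std::thread t1([&] { y.store(1); r1 = x.load(); });
    t0.join();
    t1.join();

    return r0 == 0 && r1 == 0;
}

// Repeats the experiment and counts how often the forbidden outcome shows up.
int count_forbidden_outcomes(int trials) {
    assert(trials >= 0);
    int count = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_read_zero_once()) {
            ++count;
        }
    }
    return count;
}
```

With the default ordering, `count_forbidden_outcomes(n)` should be 0 for any `n`. If the stores and loads were relaxed instead, with no full fence between them, both loads returning 0 would become a permitted outcome, and it can show up even on TSO hardware because of the store buffer.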

Lockless patterns: full memory barriers

Posted Mar 6, 2021 15:25 UTC (Sat) by jcm (subscriber, #18262)

One of the things that has been keeping me up at night recently is thinking about speculating through barriers. You do it by tracking cache ownership/state transitions as you go and rolling back as needed. The easiest way to think about it is probably as something close to the apparatus needed for transactions, and there are a few papers in this area. So anyway, the point is that draining various buffers isn’t the only way to pull this off. Think of how Intel implements TSO in its MOB, by tracking updates and replaying, too.

Lockless patterns: full memory barriers

Posted Mar 6, 2021 15:28 UTC (Sat) by jcm (subscriber, #18262)

The mental rabbit hole is separating the observed order from reality. Reality might involve combinations of invalidation queue and store/load buffer tracking, but it might involve something very different :)

Lockless patterns: full memory barriers

Posted Mar 6, 2021 17:44 UTC (Sat) by pbonzini (subscriber, #60935)

Yep, I even mentioned (with a little bit of simplification) what Intel does for TSO. To some extent it should be possible to do the same for full barriers, by delaying invalidate messages and keeping them buffered until the store buffer has been flushed. After all, in most cases there is no race, and therefore the effect of the barrier is essentially invisible.

Lockless patterns: full memory barriers

Posted Mar 9, 2021 5:00 UTC (Tue) by jcm (subscriber, #18262)

Thing is, you don’t even need to flush the sucker; just track age and ensure that the entries have all passed through. You can blow right through every barrier without cost, provided you are willing to track enough cache lines in the process. I have been doing a lot of thinking lately about renaming SPRs and speculating through serializing instructions too. There’s no reason you couldn’t (provided you tracked everything, were willing to pay the cost, and could also precisely unwind the state to prevent side-channel crumbs, which might force you to add e.g. a side buffer). This has been dancing in my head for the past week solid.