What applications benefit from that single-threaded optimization?
Posted Aug 27, 2024 17:50 UTC (Tue) by NYKevin (subscriber, #129325)
In reply to: What applications benefit from that single-threaded optimization? by roc
Parent article: A review of file descriptor memory safety in the kernel
* Don't tear the result if it's bigger than a word (or if the architecture is otherwise susceptible to tearing values of this size).
* If there would be a data race involving this value, don't produce UB, and instead just pick an arbitrary order at runtime.
* In cases where there would be a data race, the arbitrary order is per-variable and does not need to be consistent with anything else, including relaxed atomic operations on other variables.
* Because arbitrary ordering is permitted, cyclic data dependencies between multiple variables can create temporal paradoxes ("out-of-thin-air" values) according to the spec (and, to a limited extent, on some real architectures), but it's still not UB for some weird reason. Reading through the literature, I'm having a hard time understanding why anyone would want to write code with such dependencies using relaxed atomics in the first place. But we don't care, because we're not doing that anyway.
* Atomic load-modify-store (including atomic increment/decrement) is still a single atomic operation, not three operations, and it still has to produce coherent results when consistently used across all callsites. Ordering is relaxed, but atomicity is not.
So, here's the thing. This is (or at least should be) very cheap if all atomic operations are simple reads and writes. You just emit basic loads and stores, perhaps decorate the IR with some very lax optimization barriers, and don't think about synchronization at all. The architecture will almost certainly do something that is permitted by the existing rules, even without any flushes, fences, interlocking, or other nonsense.

But I tend to expect that atomic read-modify-write (as seen in atomic reference counting) cannot assume away the ordering problem so easily. If you have two cores that both have cached copies of the variable, and they both concurrently try to do an atomic increment, the variable must ultimately increase by two, not one, and that implies a certain amount of synchronization between cores. They cannot both emit a simple (non-atomic) load/modify/store sequence and expect everything to be fine and dandy.