
What applications benefit from that single-threaded optimization?

Posted Aug 24, 2024 0:45 UTC (Sat) by roc (subscriber, #30627)
In reply to: What applications benefit from that single-threaded optimization? by pbonzini
Parent article: A review of file descriptor memory safety in the kernel

What person in their right mind would take an incredibly performance-sensitive problem and say "the best way to implement this would be to launch a huge number of short-lived single-threaded processes"?

I suppose dezgeg answered this --- `make` :-(.

Maybe it's time for someone to wrap gcc in a service that forks threads or processes after startup.



What applications benefit from that single-threaded optimization?

Posted Aug 24, 2024 0:59 UTC (Sat) by roc (subscriber, #30627) (3 responses)

Although, actually, I'm a little confused and don't see why `make` would be a problem here. Al Viro mentioned above that the performance issue is cache-line ping-ponging from bumping reference counts on shared file descriptors. But that can only happen if you're manipulating a file descriptor that is actually shared across multiple processes, so the file operations performed by the dynamic linker and the compiler itself would not trigger this.

What applications benefit from that single-threaded optimization?

Posted Aug 24, 2024 10:07 UTC (Sat) by intelfx (subscriber, #130118) (2 responses)

I'm assuming the atomics themselves are part of the issue — even relaxed, they are necessarily slower than non-atomic accesses.

So, as I understood it, the context here is "why not drop the refcounting-elision optimization entirely and save some complexity".

What applications benefit from that single-threaded optimization?

Posted Aug 24, 2024 11:21 UTC (Sat) by roc (subscriber, #30627) (1 response)

That's not what Viro said up above.

Generally I assume relaxed atomic operations that hit in cache are very cheap as long as you don't have a data dependency on the result, and for refcounts you don't. When decrementing there is a conditional branch, but it should be well predicted. Maybe I'm wrong... if so, I'd like to know more.
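
For concreteness, here is a minimal userspace sketch of the pattern in question, using C11 `<stdatomic.h>`; the `obj` type and its helpers are made up for illustration, and this is not the kernel's actual `struct file` refcounting. One wrinkle: the increment can indeed be fully relaxed, but the conventional pattern gives the decrement release semantics (plus an acquire fence before freeing) so that earlier writes to the object are ordered before its destruction.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
    atomic_uint refcount;
    /* ... payload ... */
};

static void obj_get(struct obj *o)
{
    /* Fully relaxed: nothing later depends on the old value. */
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

static void obj_put(struct obj *o)
{
    /* The branch on the decrement's result is the one mentioned
     * above; it is almost always "count was > 1" and predicts well.
     * Release on the decrement, plus acquire in the thread that hits
     * zero, orders all prior writes to *o before the free. */
    if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        free(o);
    }
}
```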

What applications benefit from that single-threaded optimization?

Posted Aug 27, 2024 17:50 UTC (Tue) by NYKevin (subscriber, #129325)

To my understanding, relaxed atomics mean:

* Don't tear the result if it's bigger than a word (or if the architecture is otherwise susceptible to tearing values of this size).
* If there would be a data race involving this value, don't produce UB, and instead just pick an arbitrary order at runtime.
* In cases where there would be a data race, the arbitrary order is per-variable and does not need to be consistent with anything else, including relaxed atomic operations on other variables (see the sketch after this list).
* Because arbitrary ordering is permitted, cyclic data dependencies between multiple variables can create temporal paradoxes according to the spec (the "out-of-thin-air" problem; to a limited extent this shows up on some real architectures), but it's still not UB for some weird reason. Reading through the literature, I'm having a hard time understanding why anyone would want to write code with such dependencies using relaxed atomics in the first place. But we don't care, because we're not doing that anyway.
* Atomic read-modify-write (including atomic increment/decrement) is still a single atomic operation, not three operations, and it still has to produce coherent results when consistently used across all callsites. Ordering is relaxed, but atomicity is not.
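
As a sketch of the second and third points (hypothetical `producer`/`consumer` functions, plain C11 atomics): both threads below are race-free under the model, yet because relaxed ordering is only per-variable, the reader may observe the flag without the data.

```c
#include <stdatomic.h>

static atomic_int data = 0;
static atomic_int flag = 0;

/* Runs in thread 1. */
static void producer(void)
{
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Runs in thread 2. */
static int consumer(void)
{
    if (atomic_load_explicit(&flag, memory_order_relaxed)) {
        /* Not UB, but this may legitimately return 0: the two
         * relaxed stores need not become visible in program order. */
        return atomic_load_explicit(&data, memory_order_relaxed);
    }
    return -1;  /* flag not seen yet */
}
```

Upgrading the flag store to memory_order_release and the flag load to memory_order_acquire would restore the guarantee that seeing flag == 1 implies seeing data == 42.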

So, here's the thing. This is (or at least should be) very cheap if all atomic operations are simple reads and writes. You just emit basic loads and stores, perhaps decorate the IR with some very lax optimization barriers, and don't think about synchronization at all. The architecture will almost certainly do something that is permitted by the existing rules, even without any flushes, fences, interlocking, or other nonsense.

But I tend to expect that atomic read-modify-write (as seen in atomic reference counting) cannot assume away the ordering problem so easily. If you have two cores that both have cached copies of the variable and they both concurrently try to do an atomic increment, the variable must ultimately increase by two, not one, and that implies a certain amount of synchronization between cores. They cannot both emit a simple (non-atomic) load/modify/store sequence and expect everything to be fine and dandy.
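
To make that last point concrete, here is a small self-contained test (the names are mine, plain C11 plus pthreads, built with -pthread): even with memory_order_relaxed, the fetch-and-add is one indivisible RMW, so the final count is always exact, whereas a plain non-atomic increment would lose updates under contention.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

static atomic_long counter = 0;

static void *bump(void *arg)
{
    (void)arg;
    for (long i = 0; i < NITERS; i++)
        /* Relaxed ordering, but still a single atomic RMW; on x86
         * this compiles to a LOCK-prefixed add, which is where the
         * cost lives. */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    /* Always prints exactly NTHREADS * NITERS. */
    printf("%ld\n", atomic_load_explicit(&counter, memory_order_relaxed));
    return 0;
}
```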

