Zapping pointers out of thin air
Paul McKenney gave a presentation at Kangrejos this year that wasn't (directly) related to Rust. Instead, he spoke about the work he has been doing in concert with many other contributors on improving the handling of subtle concurrency problems in C++. Although he cautioned that his talk was only an overview, and not a substitute for reading the relevant papers, he hoped that the things the C++ community is working on would be of interest to the Rust developers present as well, and potentially inform future work on the language. McKenney's talk was, as is his style, full of subtle examples of weird multithreaded behavior. Interested readers may wish to refer to his slides in an attempt to follow along.
Lifetime-end pointer zapping
He began with lifetime-end pointer zapping, which is a somewhat obscure topic, so he gave an example of when it could be relevant. Suppose that a C++ programmer has written a stack with two operations: push and clear. They want this to be multithreaded, so the idea is that multiple threads can push to this stack, and multiple threads can clear it, but only one clearing thread will "win" and get all of the pushed items. The natural way to model this situation is with a linked list and atomic compare-and-swap (CAS) operations to swap the pointer to the head of the list.
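The structure McKenney described can be sketched as a Treiber-style stack in C++; the names and types below are illustrative, not taken from the talk, but they show the two operations: a CAS loop for push, and a single atomic exchange for clear, so that exactly one clearing thread gets the list.

```cpp
#include <atomic>

// Illustrative sketch (not McKenney's actual code) of a stack where many
// threads may push() and any thread may clear() the whole list at once.
struct Node {
    int value;
    Node *next;
};

std::atomic<Node *> top{nullptr};

// Any number of threads may push concurrently.
void push(Node *n) {
    Node *old = top.load(std::memory_order_relaxed);
    do {
        n->next = old;  // the step where B is made to point at A below
    } while (!top.compare_exchange_weak(old, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed));
}

// Detach the entire list atomically; only one clearing thread "wins".
Node *clear() {
    return top.exchange(nullptr, std::memory_order_acquire);
}
```

The CAS loop is where the trouble described next arises: between loading `old` and the compare-and-swap, another thread may clear the stack and free the node that `n->next` now points to.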
So suppose that the top of the stack points to object A, and a thread is
trying to push a second object, B. The thread puts a pointer to A inside B, but
the top still points to A. Then, the thread is preempted, and a second thread
clears the whole stack. It processes the entries, and frees A. Now A is on
the free list, but B still points to it. Then the second thread
allocates and pushes object C. The malloc() implementation gives C the
same chunk of memory as A, so now B points to C — or so we might think. The first thread
resumes, the compare-and-swap operation sees the same pointer (to A or C), and
the push of B succeeds. See the linked slide for a graphical depiction of the process.
But all is not well. The pointer from B to C is an example of a "zombie pointer", McKenney said — it's a pointer that would be legal in assembly code, and the types match, but it's not allowed in C++. Pointers in C or C++ are not just the addresses of particular locations in memory, they also have provenance. And B's pointer to A is invalidated as soon as A is freed — even if C ends up being put back into the same location. So if the compiler can prove that A will be freed and B will hold onto a reference to it anyway, then it knows any attempts to read through that pointer are undefined behavior, and it is free to make optimizations that assume this won't happen. Plus, while the simple example of this stack is "ABA tolerant", not every data structure deals so well with problems like this.
Björn Baron agreed that this particular problem would occur in Rust as well. Even though Rust's provenance rules are not completely settled, it's still definitely undefined behavior to have a pointer to an allocation that has been freed. There has been some discussion about adding a compiler intrinsic to help annotate this case, he said. McKenney said that this matched his experience working on the problem — a lot of discussion for relatively little progress.
One problem is that compiler developers really like provenance, because it permits otherwise impossible optimizations. Tracking provenance is complicated, though. Provenance can be erased, such as by converting a pointer to an integer, but that isn't a complete solution to lifetime-end pointer zapping. Last year Davis Herring proposed "angelic provenance", McKenney said, which requires the compiler to pick the provenances that make a program valid if there is more than one possible interpretation. That helps simplify reasoning about provenance generally, but does not help with the example he gave.
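The pointer-to-integer escape hatch mentioned above can be sketched as a round trip through uintptr_t; this is only an illustration with made-up names, and whether the pointer that comes back may be used after the original object's lifetime has ended is precisely the contested question.

```cpp
#include <cstdint>

// Sketch: erase a pointer's provenance by converting it to an integer and
// back. The conversion to uintptr_t discards provenance; the conversion
// back forces the implementation to reconstruct it somehow.
int *roundtrip(int *p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<int *>(addr);
}
```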
One potential fix would be to make atomic<T*> values (C++'s atomic pointers) erase provenance, including pointers referenced by successful CAS operations. This proposal has gone to the C++ committee, and initial reviews have been positive. Benno Lossin said that since Rust is trying to develop its own provenance rules, it probably makes sense for the Rust developers to consider having similar APIs that remove provenance. Alice Ryhl mentioned that Miri (Rust's undefined behavior sanitizer) actually already found a problem similar to the one with the stack McKenney presented in the queue implementations in Rust's standard library earlier this year.
Out of thin air
Next, McKenney discussed out-of-thin-air (OOTA) cycles. He gave an example involving two processes and two variables (initially both zero). Suppose that one process reads the value 42 from X, using a relaxed read. Then it stores that value into Y. The other process does a relaxed read of Y — getting 42 — and then stores it into X, for the first process to read. If both variables were initially zero, where did the 42 come from? In C++ terminology, it came "out of thin air". The value does not have to be 42; a cycle like this could produce any value at all, which is a major problem for reasoning about programs, so we would like to be able to show that it does not happen. The "cycle" part of the name comes from the fact that if you draw a diagram of where the value came from, it makes a loop — X got it from Y, and Y got it from X.
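The scenario can be written down directly with C++ relaxed atomics; the variable names follow the example, and the thread functions are an illustrative harness. On real hardware both loads only ever observe zero; the puzzle is that the formal memory model struggles to say why.

```cpp
#include <atomic>
#include <thread>

// The two shared variables from McKenney's example, initially zero.
std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;

void threadA() {
    r1 = X.load(std::memory_order_relaxed);  // where could a 42 come from?
    Y.store(r1, std::memory_order_relaxed);  // forward whatever was read
}

void threadB() {
    r2 = Y.load(std::memory_order_relaxed);
    X.store(r2, std::memory_order_relaxed);
}
```

Each thread stores exactly the value it loaded, so a nonzero value in either variable could only have come from the cycle itself.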
OOTA cycles are intuitively impossible, and in fact do not actually happen on real hardware, but formalizing a description of why they cannot happen turns out to be surprisingly hard to fit into existing memory models. This is important because a language's memory model is part of determining which programs are valid, and therefore which optimizations are valid as well. The C language committee eventually gave up, McKenney said, just defining the memory model and then tacking on a note saying "and also don't do this". Java's designers ran into the same wall when they tried to formalize the problem around 2000, and eventually gave up as well.
Researchers from the University of Kent have a promising theoretical model that takes into account compiler optimizations, and could finally put the matter to rest. But their model is theoretical and not yet ready, he said. McKenney has a simpler explanation for why this isn't an issue in real life: reads take time, and instruction execution takes time. To form an OOTA cycle, he said, at least one of those steps must go backward in time in the real world. Either a read must somehow get a future value, or executing a read and a store must take negative time. The only way this can be broken is from bad hardware speculation (but bad hardware can do anything anyway), or from the compiler making it so that the operations take no time at all to execute (such as by optimizing them out).
The reason this is a poor match for existing memory models is because they don't have a notion of time — only a notion of causal relationships. And so expressing the reasons that an OOTA cycle is impossible in that timeless formal language is difficult.
Baron suggested that maybe a CPU could predict that there would be a store and later check whether that guess was correct. He wasn't aware of any CPUs that actually do that, but it could theoretically happen. McKenney was glad that someone had brought the idea up unprompted: "and I haven't paid that man," he said with a grin.
McKenney explained that there are really two kinds of links between events: temporal and atemporal. A store-to-load link is temporal; when a store is performed, it takes time for it to propagate across the system, because information takes time to propagate in the real world. But store-to-store links are atemporal: one store can "win" over another even if it happened earlier in time, because CPU caches don't keep a global clock. McKenney has done experiments on actual hardware showing that the winning store can happen more than a microsecond before the store that it overwrites. Finally, load-to-store links are also atemporal and tell you nothing about the ordering of the events, because a load might return an old value.
So, on real computers, the only kind of events that you can rely on to establish the actual order in which things happened in the real world are store-to-load links, where one CPU stores a value, and another loads it. How is this related to speculation? Well, suppose that a CPU decided to speculate a value. It would, at some point, need to check whether that speculation was correct. In doing so, it would need to perform a load. If that load sees a different value, then it needs to throw away the speculation. But if it sees the speculated value, it still can't take a negative amount of time, since it did eventually need to do that load. This means that an OOTA cycle would still be impossible, even with speculation, because speculation cannot actually get rid of a load — only move work that depends on that load earlier in time.
This sounds comforting, but what's the takeaway for compiler developers? The fact that OOTA can't occur on real hardware for physics-based reasons means nothing to compiler developers, who are forced to rely on the memory models that the C language specification provides. Ultimately, McKenney had one takeaway for compiler people: don't invent non-volatile atomic accesses. It is already against the rules to invent accesses to volatile atomics, but it is somewhat less clear why inventing accesses to non-volatile atomics would cause a problem; refraining from doing so prevents OOTA cycles.
The reason is semantic dependencies. The C abstract machine may not have a real notion of time, but it does have a notion of data dependencies, and that would be enough to prevent OOTA cycles, except that compilers are permitted to invent extraneous loads (for example, to reduce register pressure). McKenney showed an example where adding a second load of a non-volatile atomic boolean enabled a sequence of optimizations that removed a data dependency between two other variables. With the dependency removed, the compiler could then reorder the remaining statements to produce a value out of thin air.
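McKenney's actual example is more involved; what follows is only a sketch, with made-up names, of the essential step: duplicating a non-volatile atomic load is not a semantics-preserving transformation, because the two copies can observe different values.

```cpp
#include <atomic>

std::atomic<bool> flag{false};

// The programmer writes a single load and uses the result twice, so this
// function can only ever return 0 or 11:
int one_load() {
    bool f = flag.load(std::memory_order_relaxed);
    return (f ? 1 : 0) + (f ? 10 : 0);
}

// If the compiler instead "invents" a second load (say, to avoid keeping
// f in a register), a concurrent writer could make the two loads disagree,
// yielding 1 or 10: values the original code could never produce. Cutting
// value dependencies this way is the step that lets an OOTA cycle form.
int two_loads() {
    return (flag.load(std::memory_order_relaxed) ? 1 : 0)
         + (flag.load(std::memory_order_relaxed) ? 10 : 0);
}
```

With no concurrent writer, the two versions agree; the danger only appears once another thread can slip a store between the duplicated loads.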
So McKenney's final piece of advice on avoiding OOTA cycles is simply: don't do that. Inventing, duplicating, or repurposing non-volatile atomic loads is a necessary step in causing an OOTA cycle, so if the compiler doesn't do those things, then it doesn't need to worry about this class of errors. Whether the C++ committee will accept McKenney's paper saying as much remains to be seen, but since this issue has vexed the C++ community for some time, it seems likely.
