Hardware-level micro-op optimization
Posted Jul 12, 2018 18:49 UTC (Thu) by ncm (guest, #165)
In reply to: Hardware-level micro-op optimization by corbet
Parent article: Spectre V1 defense in GCC
Some background, for those catching up... In prehistory, each instruction mapped to a specific series of machine states, and you knew everything about the machine just from the instructions you could see. When we got microcode, at first each instruction mapped to a specific sequence of microcode operations. With various caches, register renaming, and out-of-order execution scheduling "functional units" opportunistically, the sequence of machine states is a matter of speculation. With speculative execution, we got even less determinism, because now operations not even asked for ("yet") happen.
Early on, the translation from instructions to microcode sequences lost its direct mapping. Now, that mapping results in micro-ops for various nearby instructions interleaved, operating on physical registers chosen by the scheduler according to data flow dependencies it tracks. The translation to micro-ops can take into account knowledge of the actual run-time state of the machine, invisible to programmer and compiler. For example, the chip can know a divisor in a register is a power of two, and is not updated during a loop, and so substitute a shift or mask operation for the division. memcpy is a frequent bottleneck in real programs, so the chip may watch for instruction sequences that compilers emit for it, and substitute something smarter, instead, maybe based on the actual number of bytes and the actual alignment of the pointers.
At issue here is that the micro-op optimizer also knows which micro-ops change status bits, and so could know that the micro-op sequence following a status-bit-controlled branch can be shortened. There's nothing speculative about this. Chip vendors don't typically reveal this sort of detail, so the best we can do is measure whether a plain move and a conditional move always seem to run at the same speed and, if they do, suppose there would be no performance reason for the chip to strip the conditionality. Of course, measurements tell us nothing about the next release.
Hardware-level micro-op optimization
Posted Jul 12, 2018 18:57 UTC (Thu) by corbet (editor, #1)
> the optimizer knows nothing has changed the status bit since its last use is not speculation.

If said "last use" was itself speculative, and thus the state of the condition code is speculative, then using that condition code for optimization *is* speculation. The whole point is what happens during speculative execution; the instruction is a no-op in the real world. But an instruction that is defined as not being executed speculatively cannot be elided as the result of a speculative branch prediction.
Hardware-level micro-op optimization
Posted Jul 12, 2018 23:27 UTC (Thu) by jcm (subscriber, #18262)
Jon is right in his summary. But the point about uop caching and optimization is still a good one. Multiple efforts are underway in the industry to analyze this part of the front end in more detail for side channels. There are quite a few interesting possibilities I can think of, in particular with abuse of value prediction. I've asked a few research teams to consider looking at how badly people screwed up value predictors.
Hardware-level micro-op optimization
Posted Jul 12, 2018 23:44 UTC (Thu) by ncm (guest, #165)
(*Speculation may pile upon speculation, up to the limit of microarchitectural resources.)
Ultimately we will need assurances from vendors that the conditional nature of the move is not, and won't ever be, optimized away. Later, we will want another version of conditional move that we specifically allow to be micro-optimized; but first things first.