Spectre V1 defense in GCC
Spectre V1 defense in GCC
Posted Jul 11, 2018 14:15 UTC (Wed) by nathan (subscriber, #3559)
Parent article: Spectre V1 defense in GCC
    if (index < structure->array_size) {
        correct = (index >= structure->array_size) ? all_zeroes : correct;
This requires that the compiler's value-range-propagation optimization not function here. After all, because we're inside the if body, C abstract machine semantics tells us that index is indeed less than the array size, so a test for it to be greater-or-equal must be false. Thus C language semantics tells us we can reduce that conditional assignment to 'correct = correct' (and then eliminate it entirely). That, of course, would defeat the whole point.
That's one of the horrible bits of these vulnerabilities. Not only do they confuse human programmers, but you often can't fix them without turning off optimizations, and you only want to do that as locally as possible. Hence the need for a compiler builtin that hides these semantics from the optimizers.
[The above deduction presumes the absence of volatile objects.]
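To make the folding concrete, here is a minimal sketch (the struct layout and the load_entry() helper are hypothetical, with names borrowed from the article's illustrative example):

    #include <stddef.h>

    struct table {
        size_t array_size;
        unsigned long array[64];
    };

    static const unsigned long all_zeroes = 0;

    /* Hypothetical helper showing the pattern under discussion. */
    unsigned long load_entry(const struct table *structure, size_t index)
    {
        unsigned long correct = all_zeroes;

        if (index < structure->array_size) {
            correct = structure->array[index];
            /*
             * On this path, value-range propagation knows that
             * index < structure->array_size, so the test below is
             * provably false; the statement folds to "correct = correct"
             * and is then deleted, removing the intended mitigation.
             */
            correct = (index >= structure->array_size) ? all_zeroes : correct;
        }
        return correct;
    }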

Compiler optimization
Posted Jul 11, 2018 15:13 UTC (Wed) by corbet (editor, #1)

That would indeed be the case if the defense were done in C code, but that code is there for illustrative purposes. The actual implementation is inserted by the compiler, as described further down in the article.
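For illustration only, a source-level use of such a builtin might look roughly like this (a sketch assuming the __builtin_speculation_safe_value() interface discussed in the article and reusing the hypothetical struct table from above; the exact name, signature, and semantics are subject to the final patches):

    /*
     * Sketch: the builtin returns its first argument on the architecturally
     * correct path, forces the second (safe) value when reached under
     * misspeculation, and is opaque to optimizations such as value-range
     * propagation, so it cannot be folded away.
     */
    unsigned long load_entry_safe(const struct table *structure, size_t index)
    {
        if (index < structure->array_size) {
            size_t safe_index = __builtin_speculation_safe_value(index, (size_t)0);
            return structure->array[safe_index];
        }
        return 0;
    }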

Compiler optimization
Posted Jul 12, 2018 7:06 UTC (Thu) by epa (subscriber, #39769)

Hardware-level micro-op optimization
Posted Jul 12, 2018 13:33 UTC (Thu) by ncm (guest, #165)

It is a strange world we live in, now, where we cannot have any confidence that the machine instructions we see correctly describe the machine behavior they will evoke.

Hardware-level micro-op optimization
Posted Jul 12, 2018 14:51 UTC (Thu) by corbet (editor, #1)

Instructions like CSEL are defined by the architecture to not execute speculatively. That is, as I understand it, a requirement to be able to do things like constant-time crypto operations. So its use of the condition code is different from the test immediately above, which can be speculated. Assuming the processor behaves as specified, the result should be correct.
Or that's how I understand it, at least.
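For reference, a rough sketch of the kind of conditional-select clamp being discussed (the clamp_index() helper is hypothetical, a compiler is not guaranteed to emit CSEL for this exact C, and the commented assembly is only an approximation):

    #include <stddef.h>

    /*
     * Clamp an index with a conditional select rather than a conditional
     * branch.  On AArch64 this can be lowered to roughly:
     *
     *     cmp   x0, x1            // compare index against size
     *     csel  x0, x0, xzr, lo   // keep index if lower, else zero
     *
     * CSEL consumes the condition flags directly; the point above is that
     * its result is not chosen by branch prediction the way a conditional
     * branch's target is.
     */
    size_t clamp_index(size_t index, size_t size)
    {
        return (index < size) ? index : 0;
    }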

Hardware-level micro-op optimization
Posted Jul 12, 2018 18:49 UTC (Thu) by ncm (guest, #165)

Some background, for those catching up... In prehistory, each instruction mapped to a specific series of machine states, and you knew everything about the machine just from the instructions you could see. When we got microcode, at first each instruction mapped to a specific sequence of microcode operations. With various caches, register renaming, and out-of-order execution scheduling "functional units" opportunistically, the sequence of machine states is a matter of speculation. With speculative execution, we got even less determinism, because now operations not even asked for ("yet") happen.
Early on, the translation from instructions to microcode sequences lost its direct mapping. Now, that mapping results in micro-ops for various nearby instructions interleaved, operating on physical registers chosen by the scheduler according to data flow dependencies it tracks. The translation to micro-ops can take into account knowledge of the actual run-time state of the machine, invisible to programmer and compiler. For example, the chip can know a divisor in a register is a power of two, and is not updated during a loop, and so substitute a shift or mask operation for the division. memcpy is a frequent bottleneck in real programs, so the chip may watch for instruction sequences that compilers emit for it, and substitute something smarter, instead, maybe based on the actual number of bytes and the actual alignment of the pointers.
At issue here is that the micro-op optimizer also knows which micro-ops change status bits, and so could know that the micro-op sequence following a status-bit-controlled branch can be shortened. There's nothing speculative about this. Chip vendors don't typically reveal this sort of detail, so the best we can do is measure whether the move and the conditional move seem always to happen at the same speed, and suppose that, therefore, the hardware has no reason to perform such an optimization. Of course, measurements don't tell us about the next release.
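A sketch of the kind of measurement being suggested (everything here, including the csel_u64() wrapper built on x86 CMOV, the loop shape, and the iteration count, is an assumption for illustration; a serious measurement would need far more care and inspection of the generated code):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS (1ULL << 27)

    /* Hypothetical wrapper: conditional select via x86 CMOV. */
    static inline uint64_t csel_u64(uint64_t a, uint64_t b, uint64_t cond)
    {
        __asm__("test %[c], %[c]\n\t"
                "cmovz %[b], %[a]"          /* if cond == 0, a = b */
                : [a] "+r" (a)
                : [b] "r" (b), [c] "r" (cond)
                : "cc");
        return a;
    }

    static double time_loop(int use_csel)
    {
        struct timespec t0, t1;
        uint64_t acc = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < ITERS; i++) {
            if (use_csel)
                acc = csel_u64(acc + 1, acc, i & 1);  /* conditional move */
            else
                acc = acc + 1;                        /* plain data path  */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Keep acc live so the loop is not optimized away. */
        volatile uint64_t sink = acc;
        (void)sink;
        return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("plain: %.3f s\n", time_loop(0));
        printf("cmov:  %.3f s\n", time_loop(1));
        return 0;
    }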

Hardware-level micro-op optimization
Posted Jul 12, 2018 18:57 UTC (Thu) by corbet (editor, #1)

> the optimizer knows nothing has changed the status bit since its last use is not speculation.

If said "last use" was speculative, and thus the state of the condition code is speculative, then using that code for optimization *is* speculation, instead. The whole point is what happens during speculative execution; the instruction is a no-op in the real world. But an instruction that is defined as not being executed speculatively cannot be elided as the result of a speculative branch prediction.

Hardware-level micro-op optimization
Posted Jul 12, 2018 23:27 UTC (Thu) by jcm (subscriber, #18262)

Jon is right in his summary. But the point about uop caching and optimization is still a good one. Multiple efforts are underway in the industry to analyze this part of the front end in more detail for side channels. There are quite a few interesting possibilities I can think of, in particular with abuse of value prediction. I've asked a few research teams to consider looking at how badly people screwed up value predictors.

Hardware-level micro-op optimization
Posted Jul 12, 2018 23:44 UTC (Thu) by ncm (guest, #165)

Ultimately we will need assurances from vendors that the conditional nature of the move is not, and won't ever be, optimized away. Later, we will want another version of conditional move that we specifically allow to be micro-optimized; but first things first.

(*Speculation may pile upon speculation, up to the limit of microarchitectural resources.)