|
|
Log in / Subscribe / Register

Another round of speculative-execution vulnerabilities

Another round of speculative-execution vulnerabilities

Posted Aug 10, 2023 9:40 UTC (Thu) by anton (subscriber, #25547)
In reply to: Another round of speculative-execution vulnerabilities by bartoc
Parent article: Another round of speculative-execution vulnerabilities

Actually speculative execution works so well because it turns out that a lot of execution is very predictable. It's so predictable that the branch predictor has an accuracy of ~99% (depending on the application). This means that the instruction fetcher can fetch ahead for hundreds of instructions, and the OoO execution engine can execute these instructions ahead in an order determined by the data dependencies, not otherwise by the program order. This allows modern CPUs to complete (i.e., execute) several instructions per cycle.

I don't see that this has much to do with the programming language. Rust is as vulnerable to Spectre and Downfall as C is AFAICS. The only influence I see is that for a language like JavaScript that always bounds-checks array accesses, you have an easier time adding Spectre-V1 mitigations. But for Rust, which tries to optimize away the bounds-check overhead, you end up either putting in Spectre-V1 mitigation overhead (can this be done automatically?), slowing it down to be uncompetetive with C, or it is still Spectre-V1 vulnerable. Admittedly adding mitigations cannot be done automatically in C, because the compiler has no way to know the bounds in all cases.

The way to go is that the hardware manufacturers must fix (not mitigate) Spectre. They know how to avoid misspeculated->permanent state transitions for architectural state (Zenbleed is the exception that proves the rule), now apply it to microarchitectural state!


to post comments

Another round of speculative-execution vulnerabilities

Posted Aug 10, 2023 11:52 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

Maybe it's more a combination of control flow being predictable and memory latency being unpredictable. Compilers (AOT and JIT) can have a go at predicting control flow using PGO, but I suspect it's largely impossible for them to predict which memory loads will come from L1 (~5 cycles) and which will come from RAM (~300 cycles) as that depends heavily on the dynamic state of the caches. That unpredictable latency prevents a compiler from doing good instruction scheduling by itself, so it has to rely on the CPU doing dynamic scheduling that can adapt to the actual latency of every single memory access. And the CPU can do that because the highly predictable control flow means that while it's waiting for RAM, it can speculatively gather hundreds of instructions to reschedule and execute out of order.

If memory latency is predictable then I think it's much easier for the compiler to statically schedule the instructions, and the CPU can be much simpler while maintaining decent performance. But that only seems practical with very small amounts of memory (e.g. microcontrollers with single-cycle latency to SRAM, but only hundreds of KBs) or very large numbers of threads (e.g. GPUs where each core runs 128 threads with round-robin scheduling, so each thread has 128 cycles between consecutive instructions in the best case, which can mask a lot of memory latency), not for general-purpose desktop-class CPUs.

Another round of speculative-execution vulnerabilities

Posted Aug 10, 2023 15:14 UTC (Thu) by farnz (subscriber, #17727) [Link]

Even with PGO, control flow still has largely unpredictable regions, which depend upon the details of user input, and can only be predicted at compile time if the exact input the user will use is provided at compile time. This was one component of why Itanium's EPIC never lived up to its performance predictions; as compilers got better at exploiting compile-time known predictability, they also benefited OoO and speculative execution machines, which could exploit predictability that only appears at runtime.

For example, in a H.264 encoder or decoder, black bars are going to send your binary down a highly predictable codepath doing the same thing time and time again; your PGO compiled binary is not going to be set up on the assumption of black bars, because that's just one part of the sorts of input you might get. However, at runtime, the CPU will notice that you're going down the same codepath over and over again as you handle the black bars, and will effectively optimize further based on the behaviour right now. Once you get back to the main picture, it'll change the optimizations it's applying dynamically, because you're no longer going down that specific route through the code.

Another round of speculative-execution vulnerabilities

Posted Aug 10, 2023 16:03 UTC (Thu) by anton (subscriber, #25547) [Link]

Compilers (AOT and JIT) can have a go at predicting control flow using PGO
Profile-based static prediction has ~10% mispredictions, while modern history-based hardware branch prediction has about 1% mispredictions (for real numbers check the research literature, but the tendency is in that direction; and it's actually hard to compare the research, because static branch prediction research stopped about 30 years ago).

Concerning memory latency, I also see very good speedups of out-of-order over in-order for benchmarks like the LaTeX benchmark which rarely misses the L1 cache.

Also, the fact that the Itanium II designers chose small low-latency L1 caches, while OoO designers went for larger and longer-latency L1 caches (especially in recent years, with the load latency from D-cache growing to 5 cycles in Ice Lake ff.) shows that the unpredictability is a smaller problem for compiler scheduling than the latency.

The dream of static scheduling has led a number of companies (Intel and Transmeta being the most prominent) to spend substantial funds on it. Dynamic scheduling (out-of-order execution) has won for general-purpose computing, and the better accuracy of dynamic branch prediction played a big role in that.

With regard to Spectre and company, compiler-based speculation would exhibit Spectre vulnerabilities just as hardware-based speculation does. Ok, you can tell the compiler not to speculate, but that severely restricts your compiler scheduling, increasing the disadvantage over OoO. Better fix Spectre in the OoO CPU.

Another round of speculative-execution vulnerabilities

Posted Aug 11, 2023 13:26 UTC (Fri) by atnot (guest, #124910) [Link] (13 responses)

> don't see that this has much to do with the programming language. Rust is as vulnerable to Spectre and Downfall as C is AFAICS.

You're thinking much too narrow here in terms of what "C" is in this context. It's has far less to do with the specific syntax and more with the general model of computation that derives from the original PDP11, i.e.:

Programs are a series of commands, whose effects become visible in order from top to bottom. The sequence of these commands can be arbitrarily replaced using a specific command, called a "branch". There is a singular, uniform thing called "memory", which is numbered from zero to infinity and you can create a valid read-write reference to any of it by using that number. And so on.

None of this is true internally for any modern compute device. It isn't even true for C anymore. But it was true for the creators of C, and as a result these assumptions were baked very deeply into the language, then tooling like gcc and LLVM, then languages that use that tooling like Rust, OpenCL, CUDA, and then architectures that wanted to be able to easily targetable by those tools like RISC-V and, most notably, AMD64. (As opposed to it's Itanic cousin). It's so established that people don't even recognize these as specific design choices anymore, it's just "how computers work".

Rust is definitely a step away from C, and one that has at least some potential to improve how chips are designed in the future, if the tooling allows for it. But it's not a very big step in the grand scheme of things.

Another round of speculative-execution vulnerabilities

Posted Aug 11, 2023 22:40 UTC (Fri) by jschrod (subscriber, #1646) [Link] (12 responses)

I wouldn't call this PDP11 semantics - I would call this semantic of any von-Neumann computer architecture that exists nowadays. (I exclude quantum computing from that statement.) In fact, this is the paradigm of the Turing Machine.

That you have statements that are executed from top to bottom, where preconditions, invariants, and postconditions exist, is basically the fundament of theoretical computer science. No proof about algorithmic semantics or correctness would work without that assumption.

If this, in your own words, isn't true any more - can you please point me to academic work that formalizes that new behavior and its semantics?

After all, I cannot believe that theoretical computer scientists have ignored this development. It is a to good opportunity to write new articles for archived journals.

If no computer science work is published on your claim, can you please explain why research is ignoring this development?

Another round of speculative-execution vulnerabilities

Posted Aug 12, 2023 11:50 UTC (Sat) by atnot (guest, #124910) [Link] (11 responses)

> I would call this semantic of any von-Neumann computer architecture that exists nowadays. [...] In fact, this is the paradigm of the Turing Machine.

I reject this outright. There is an absolute world of wonderful compute models in between a turning machine and C or a PDP11. Many of them are von neumann machines, even. This should be clear from the fact that the key problem of the C model is that it's hard to model formally (see e.g. the size of the C memory model specification), while the turing machine was purpose designed for formal modelling.

Let me just give some examples:

For a very soft start, we can look at something like the 6502, which is generally pretty boring apart from treating he first 256 bytes of memory specially. Largely because of this, it is not supported upstream in any of the big C compilers.

Then we can look at something like Itanium, in which bundles of instructions are executed in parallel and can not see the effects of each other, along with things like explicit branch prediction, speculation and software pipelining.

This is actually pretty similar to modern CPUs, except instead of having it explicitly encoded in the instruction stream, they try to re-derive that information at runtime, often by just guessing.

Then we have things like GPUs, which have multiple tiers of shared memory, primarily work with masks instead of branches. Although they are slowly becoming more C-like as people seek to target them with C and C++ code.

There's also a whole bunch of architectures like ARM CHERI and many with memory segmentation, where addresses and pointers are not the same thing.

We can also talk about various other things like lisp machines, Mill, Transmeta, EDGE and many more things I'm forgetting.

Then even further asea, you can find things like FPGAs, which are programmed using a functional specification of behavior much like TLA+. (The current fad is, of course, trying to run C on them, to limited success)

Now if you say "But most of these are all obscure architectures nobody uses", then yes that's the point. It's because they don't look enough like K&R's PDP11. Itanium is far from the only innovative architecture that C killed and as primarily a hardware person, that's deeply frustrating.

Another round of speculative-execution vulnerabilities

Posted Aug 12, 2023 15:05 UTC (Sat) by anton (subscriber, #25547) [Link] (6 responses)

What "C memory model specification" do you mean?

Why should the zero page of the 6502 be a reason not to support the 6502 in "big" C compilers? They can use the zero page like compilers for other architectures use registers (which leave no trace in C's memory model, either). Besides, there are C compilers for the 6502, like cc65 and cc64, so there is obviously no fundamental incompatibility between C and the 6502. The difficulties are more practical, stemming from having zero 16-bit registers, three 8-bit registers, only 256 bytes of stack, no stack-pointer-relative addressing, etc.

Concerning IA-64 (Itanium), this certainly was designed with existing software (much of it written in C) in mind, and there are C compilers for it, I have used gcc on an Itanium II box, and it works. C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it. IA-64, Transmeta and the Mill are based on the EPIC assumption that the compiler can perform better scheduling than the hardware, and it turned out that this assumption is wrong, largely because hardware has better branch prediction, and can therefore perform deeper speculative execution.

And the fact that OoO won over EPIC shows that having an architecture where instructions are performed sequentially is a good interface between software (not just written in C, but also, e.g., Rust) and hardware, an interface that allows adding a lot of performance-enhancing features under the hood.

Concerning Lisp machines, they were outcompeted by RISCs, which could run Lisp faster; which shows that they are not just designed for C. There actually was work on architectural support for LISP in SPUR, and some of it made it into SPARC, but one Lisp implementor wrote that their Lisp actually did not use the architectural support in their SPARC port, because the cost/benefit ratio did not look good.

Concerning GPUs, according to your theory C should have killed them long ago, yet they thrive. They are useful for some computing tasks and not good for others. In particular, let me know when you have a Rust or, say, CUDA compiler or OS kernel (maybe one written in Rust or CUDA) running on a GPU.

Another round of speculative-execution vulnerabilities

Posted Aug 14, 2023 9:50 UTC (Mon) by james (guest, #1325) [Link] (5 responses)

C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it.
It's an interesting theoretical exercise to consider what would have happened if Meltdown and Spectre had been discovered sometime around 2000. Presumably the software workaround for Meltdown would have had to have looked like Red Hat's 4G/4G split, which could:
cause a typical measurable wall-clock overhead from 0% to 30%, for typical application workloads (DB workload, networking workload, etc.). Isolated microbenchmarks can show a bigger slowdown as well - due to the syscall latency increase.
That would have made a big difference to the perceived advantages of Itanium.

Would the conservative and increasingly security-sensitive server world have adopted the position that OoO couldn't be trusted? (Once Itanium was released, Intel would almost certainly have made that part of their marketing message.)

In 2018, when in this timeline Meltdown and Spectre were discovered, the consensus of the security community was that more such attacks would be discovered, and time has sadly proven that to be correct — but we now have no other realistic option but to live with it. We had other options around 2000 — then-current in-order processors (from Sun, for example).

The triumph of OoO looks much more like an accident of history rather than something inherent to computer science to me.

Another round of speculative-execution vulnerabilities

Posted Aug 14, 2023 11:05 UTC (Mon) by anton (subscriber, #25547) [Link] (4 responses)

AFAIK a mitigation for Meltdown was indeed to not share the address space between kernel and user space, leading to TLB flushes on system calls. Intel fixed Meltdown relatively quickly in hardware, and AMD hardware has not been vulnerable to Meltdown AFAIK.

By contrast, neither Intel nor AMD (nor AFAIK ARM or Apple) has fixed Spectre in the more than 6 years since they have been informed of it. This indicated that these CPU manufacturers don't believe that they can sell a lot of hardware by being the first to offer hardware with such a fix (and making it a part of their marketing message). So they think that few of their customers care about Spectre. But if they thought that many customers care about Spectre, they would design OoO hardware without Spectre.

As for IA-64, it has architectural features for speculative loads, and is therefore also vulnerable to Spectre. This vulnerability can probably be mitigated by recompiling the program without using speculative loads (if we assume that the hardware does not perform any speculative execution, it's good enough to perform the speculative load and then not use the loaded data until the speculation is confirmed; for security the speculatively loaded data should be cleared in case of a failed speculation). This mitigation would reduce the performance of Itanium CPUs to be close to the performance of architectures without these speculative features, i.e., even lower than the Itenium performance that we saw.

OoO certainly has other options wrt. Spectre than to live with it. Just fix it. All the OoO hardware designers (the Zen2 designers are the exception that proves the rule) are able to squash speculative architectural state on a misprediction; they now just need to apply the same discipline to speculative microarchitectural state. E.g., if they had squashed the speculative branch predictor state on a miscprediction, there would be no Inception, and if they had squashed the speculative AVX load buffer state on misprediction, there would be no Downfall.

Another round of speculative-execution vulnerabilities

Posted Aug 15, 2023 4:26 UTC (Tue) by donald.buczek (subscriber, #112892) [Link] (3 responses)

> E.g., if they had squashed the speculative branch predictor state on a miscprediction, there would be no Inception

A branch predictor, which isn't allowed to learn, would't that just be a rather useless static branch predictor like "allways probably backwards" or "as hinted by machine code" ?

Another round of speculative-execution vulnerabilities

Posted Aug 15, 2023 11:01 UTC (Tue) by anton (subscriber, #25547) [Link] (2 responses)

What makes you think that this would mean "isn't allowed to learn"? The fact that architectural state is not changed on a misprediction does not mean that architectural state is immutable, either.

A straightforward way is to learn from completed (i.e. architectural) branches, with the advantage that you learn from the ground truth rather than speculation.

If that approach updates the branch predictor too late in your opinion (and for the return predictor that's certainly an issue), a way to get speculative branch predictions is to have an additional predictor in the speculative state, and use that in combination with the non-speculative predictor. If a prediction turns out to be correct, you can turn the part of the branch predictor state that is based in that prediction from speculative to non-speculative (like you do for architectural state); if a prediction turns out to be wrong, revert the speculative branch predictor state to its state when the branch was speculated on (just like you do with speculative architectural state).

Another round of speculative-execution vulnerabilities

Posted Aug 15, 2023 14:26 UTC (Tue) by donald.buczek (subscriber, #112892) [Link] (1 responses)

> If a prediction turns out to be correct, you can turn the part of the branch predictor state that is based in that prediction from speculative to non-speculative (like you do for architectural state); if a prediction turns out to be wrong, revert the speculative branch predictor state to its state when the branch was speculated on (just like you do with speculative architectural state).

Why wouldn't such a branch predictor always give the initial answer? If correct, it would be sensible to stick to it and if wrong, you want to ignore that and revert to the state of the last correct guess or the initial state.

Assuming you want to apply that to binary, taken/not taken branch predictor and not only target branch predictors?

Another round of speculative-execution vulnerabilities

Posted Aug 15, 2023 15:23 UTC (Tue) by anton (subscriber, #25547) [Link]

If the prediction is wrong, you throw away the speculative nonsense (and thus avoid Inception), but you record that the prediction was wrong. I had not written that earlier, sorry.

In more detail: If the mispredicted branch is non-speculative, you record it in the non-speculative predictor. If the mispredicted branch is still in the speculative part of execution (that would mean that you have a CPU that corrects mispredictions out-of-order; I don't know if real CPUs do that), you record it in the speculative part, and when this branch leaves the speculative realm, this record can also be propagated to the non-speculative predictor.

Another round of speculative-execution vulnerabilities

Posted Aug 12, 2023 16:51 UTC (Sat) by farnz (subscriber, #17727) [Link] (3 responses)

Itanium failed to outperform AMD64 on hand-coded assembly as well as on C code. It wasn't killed by the C model, it was killed by a failure to deliver performance greater other CPUs. VLIW CPUs like Transmeta failed because VLIW code is inherently low-density in memory, and our current bottleneck for performance tends to be L1 cache size. Mill has never reached a point where hand-written code in simulation outperforms hand-written code for AMD64 given the same simulated resources as AMD64. EDGE is an ongoing research project, and may (or may not) prove worthwhile - there's certainly not been an effort to build a good EDGE CPU that can be compared to something "C-friendly" like RISC-V.

Similar failures apply to Lisp Machines. While they had dedicated hardware to make running Lisp code faster, they lost out because RISC CPUs like SPARC and MIPS were even faster at running Lisp code for a given energy input than Lisp Machines were. Again, not about programming model, but about the Lisp Machines being worse hardware for running Lisp than MIPS or SPARC.

In terms of competing models of computation that have actually made it to retail sale, FPGAs are a commercial success, but are not programmed like CPUs, because they're defined as a sea of interconnected logic gates, and you are better off exploiting that via a Hardware Description Language than via something like C, FORTRAN or COBOL. GPUs are a commercial success; individual threads on a GPU are similar to a CPU with SIMD, with many threads per core (8 on Intel, more on others), and a hardware thread scheduler that allows you to have a pool of cores sharing thousands or even hundreds of thousands of threads.

None of this is about the "C model"; underpinning all of the noise is that humans struggle to coordinate concurrent logic in their heads, and prefer to think about a small number of coordination points (locks, message channels, rendezvous points, whatever) with a single thread of execution between those points. OoOE with speculative execution is one of the two local minima we've found for such a mental model of programming, and supports the case where a single thread of logic is the bottleneck. The other model that works well is the workgroup model used by GPU programming, where something distributes a very large number of input values to a pool of workers, and lets the workers build a large number of output values. Between the input and output values, there's very little (if not no) coordination between workers.

And while the 6502 is not supported upstream in any of the big C compilers, nor are many other CPUs of the same vintage. The Z80 is not supported in any of the big C compilers, nor is the 6809, for example, and both of those were big selling CPUs at the time the 6502 was current; the Z80 is also a lot friendlier to C than the 6502, since the Z80 does not limit you to a single 256 byte stack at a fixed location in memory, whereas the 6502 has a 256 byte stack fixed in page 1. I've never personally programmed a 6809 system, but I believe that it's also a lot more C friendly than the 6502.

Fundamentally, the thing that has killed every alternative to date is that the surviving processor types are simply faster for commercially significant problems than any competitor was, even with alternative programming models. This applies to VLIW, and to EPIC, and to Lisp Machines.

Another round of speculative-execution vulnerabilities

Posted Aug 14, 2023 20:08 UTC (Mon) by mtaht (guest, #11087) [Link] (2 responses)

I remain fond of the Mill set of ideas for many reasons, but was not aware of any benchmarks of the compiler, or public sim information? I have not kept track.

Weirdly enough I do not care about IPC, what I care about is really rapid context and priv switching, something that unwinding speculation on the TLB flush on spectre really impacted. I am tired of building processors that can only go fast in a straight line. And like everyone here, tired of all these vulnerabilities.

The mill held promise of context or priv switching in 3 clocks. The implicit zero feature and byte level protections seemed like a win. But it has been a long 10+ years since that design was announced, have there been any updates?

Another round of speculative-execution vulnerabilities

Posted Aug 14, 2023 21:52 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

I recently perused the forum and it seems that they're in another funding round and looking to go from startup to a proper company (salaries, etc.). Technical progress (well, at least publicizing it) is blocked on that. To be fair, they are apparently in it for the money (based on the Q&A in at least one of the talks that have been released).

Another round of speculative-execution vulnerabilities

Posted Aug 17, 2023 14:43 UTC (Thu) by farnz (subscriber, #17727) [Link]

It's a while since I saw the information (around 10 years), so I don't have links to hand, and it was investor-targeted. They seemed to be making the same mistake as Itanium designers, though - they compared hand-optimized code on their Mill simulator to GCC output on a then current Intel chip (Haswell, IIRC), showing that simulated Mill was better than GCC output on Haswell. The claim was that compiler improvements needed for Mill would bring Mill's performance on compiled code ahead of Haswell's performance; but it failed to take into account that, with a lot of human effort, I could get better performance from Haswell with hand-optimized code than they got with GCC output, using GCC's output as a starting point.

I am inherently sceptical of "compiler improvement" claims that will benefit one architecture and not another; while I'll accept that the improvement is not evenly distributed, until Mill Computing can show that their architecture with their compiler can outperform Intel, AMD, Apple, ARM or other cores with a modern production-quality (e.g. GCC, LLVM) compiler for the same language, I will tend towards the assumption that anything that they improve in the compiler will also benefit other architectures.

This holds especially true for compiler improvements around scheduling, which is what Mill depends upon, and what Itanium partially needed to beat OoOE - improvements to scheduling of instructions benefit OoOE by making the static schedule closer to optimal, leaving the OoOE engine to deal with the dynamic issues only, and not statically predictable hazards.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds