The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
Posted Jul 19, 2022 18:49 UTC (Tue) by atnot (guest, #124910) In reply to: The "Retbleed" speculative execution vulnerabilities by anton
Parent article: The "Retbleed" speculative execution vulnerabilities
It works the other way around too. Here goes: "Software people demand that hardware people make them faster and faster processors without changing the way it is programmed, and that they should just go and apply more architectural optimizations. The whole reason all of this is done, the reason x86 is at this point a declarative language for describing how to distribute parallel compute tasks over hardware resources using an abstract control-flow graph and dataflow labels (aka "registers"), is to keep alive for programmers the fiction that computers haven't changed since the 80s.
Hardware people offered solutions long ago: Itanium had explicit speculation instructions. They were perfectly aware of the troubles the current direction would bring. But software people rejected it in favor of a 64-bit PDP-11, because it didn't look familiar enough to them and their PDP-11 languages, and then made fun of them. To the point that nobody has dared to publish serious research on novel CPU architectures since around 2010."
This might not be completely fair, but it's not wrong either. Software developers are at least as culpable for the current situation as hardware vendors are. There are barriers, yes, but they need to be broken down in both directions.
Posted Jul 19, 2022 20:33 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
No. People rejected Itanic because it just Did Not Work. It's not possible to statically predict how a program will run, because many of the timings depend on the inputs. Even when caches are not in play, a good old divide instruction can take anywhere from 1 to 10 cycles to complete.
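A concrete illustration (my own sketch in C, not part of the comment above): pointer chasing is the classic case where no static schedule can win, because the latency of every load depends on where that particular node happens to sit in the memory hierarchy at run time.

    #include <stddef.h>

    struct node { struct node *next; long payload; };

    /* Each load's latency depends on whether the node is in cache, which
     * depends on the input list.  A VLIW compiler has to bake some fixed
     * latency into the schedule; an out-of-order core adapts to whatever
     * latency actually occurs. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->payload;  /* data-dependent load latency */
            n = n->next;        /* next address unknown until this load completes */
        }
        return sum;
    }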
Posted Jul 19, 2022 20:52 UTC (Tue) by deater (subscriber, #11746)
You should look into the later itanium, such as Poulson, which had speculation and out-of-order in an attempt to catch up.
Also look into modern x86 systems where many of the divide instructions have a constant latency.
Posted Jul 19, 2022 22:55 UTC (Tue) by atnot (guest, #124910)
Absolutely, targeting C-like languages at VLIW is very difficult and requires advanced scheduling that was not really available at the time. This was a huge factor. VLIW fared much better with GPUs, which were targeted with more easily parallelizable languages. Even those moved away from it eventually, though, coincidentally around the time CUDA and GPGPU came about.
Itanium was definitely far from perfect. The initial implementation was terrible, and the decision to encode many implementation details of the first CPUs directly into the ISA was a mistake they quickly recognized. But so was x86; we've just gotten used to it. Certainly, today's 12-wide CPUs would have a much easier time emulating a mediocre 2000s explicitly parallel VLIW CPU than a mediocre 80s microprocessor. Even with its flaws, Itanium is still significantly closer to what a modern CPU actually looks like.
Posted Jul 20, 2022 10:24 UTC (Wed) by farnz (subscriber, #17727)
The other issue with advanced scheduling is that an out-of-order execution design also benefits from a well-scheduled program. An out-of-order processor has a limited instruction window within which it can reschedule dynamically, and a well-scheduled program is set up so that all the rescheduling that can be done in that window is a consequence of the data the program is processing.
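For instance (my own C sketch, not from the comment): a reduction written as one long dependence chain gives the reorder window nothing to work with, while splitting it into independent chains leaves the hardware free to reorder only around the data-dependent events such as cache misses.

    #include <stddef.h>

    /* One serial dependence chain: every add waits for the previous one,
     * so even a wide out-of-order core finds little to overlap within its
     * instruction window. */
    double dot_naive(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* "Well scheduled" variant: four independent accumulators keep several
     * operations in flight, so the dynamic rescheduling left for the
     * hardware is mostly a consequence of the input data (cache misses and
     * the like).  Note this changes the floating-point summation order. */
    double dot_unrolled(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }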
GPUs are a different case because they're designed for the world where single threaded performance is not particularly interesting - as long as all threads complete their work in a millisecond or so, we don't care how long each individual thread took. It's thus possible to avoid OoOE in favour of having more threads available to hardware, and better hardware for switching between threads when one thread gets blocked. In contrast, the whole point of CPUs in a modern system (with GPUs as well as CPUs) is to deal with the code where the time for one thread to complete its work sets the time for the whole operation.
I suspect that, for the subset of compute where the performance of a single thread is the most important factor, an out-of-order CPU is the best possible option. The wide-open question is whether we can design an ISA that allows us to avoid unwanted speculation completely; Itanium had that, because it was designed around making all the possible parallelism explicit, but Itanium wasn't a good ISA for out-of-order execution, and had low instruction density.
The other issue that Itanium's explicit speculation didn't account for is that we're starting to see uses of value prediction, not just memory access prediction; do we want to be explicit about all the possible speculative paths (e.g. "you can speculate that the value in r2 is less than the value in r3", or "you can speculate if you believe that r2 is between -16 and +96"), or do we instead want to find a good way to block speculation completely where it's potentially dangerous?
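For the "block speculation where it's dangerous" option, the approach commonly used today on existing ISAs is to make the dangerous operation harmless even on a mispredicted path, for example by clamping the index arithmetically rather than relying on the bounds-check branch. A minimal C sketch in the spirit of the Linux kernel's array_index_nospec() (my own illustration; the kernel's real version additionally uses asm/compiler barriers to keep the mask from being optimized away):

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_LEN 256
    static uint8_t table[TABLE_LEN];

    /* All-ones when idx < size, zero otherwise, computed without a branch
     * so it still holds on a speculative path where the bounds check was
     * mispredicted.  Assumes 64-bit size_t and arithmetic right shift. */
    static inline size_t index_mask_nospec(size_t idx, size_t size)
    {
        return (size_t)(~(int64_t)(idx | (size - 1 - idx)) >> 63);
    }

    uint8_t read_element(size_t idx)
    {
        if (idx < TABLE_LEN) {
            idx &= index_mask_nospec(idx, TABLE_LEN);  /* clamp survives misprediction */
            return table[idx];   /* cannot load out of bounds, even speculatively */
        }
        return 0;
    }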
Posted Jul 20, 2022 18:56 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
Targeting ANY language at VLIW is difficult. The fundamental issue is that scheduling depends on input data, and no language can change that.
> Even those would move away eventually though, coincidentally around the time CUDA and GPGPU came about.
Yup. It's just not efficient to use VLIW for anything, even when OOO is not needed.
Posted Jul 20, 2022 6:52 UTC (Wed) by anton (subscriber, #25547)
Concerning IA-64 (later renamed to the Itanium processor family): that's probably not what paulj meant, because Itanium is not simple, and switching between parallel-running execution contexts was not envisioned for it in the 1990s (although it was implemented around 2010 or so). Poulson is not OoO AFAIK.
As for implementing ("emulating") IA-64 with the techniques used for today's OoO hardware (the widest of which is 8-wide AFAIK), I doubt that would be easier than implementing AMD64, AArch64, or RISC-V. I don't see anything in IA-64 that helps the implementation significantly, and one would have to implement all the special features like the ALAT that are better handled by the dynamic alias predictor in modern CPUs. Likewise, one would have to implement the compiler speculation support (based on imprecise static branch prediction) and, for performance, still implement the much better dynamic branch prediction and hardware speculation.
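To make the ALAT point concrete (my own C sketch, not part of the comment): both approaches are about loads that might alias earlier stores. IA-64 let the compiler hoist such a load speculatively and check for a conflicting store afterwards (ld.a/chk.a, tracked by the ALAT); a modern out-of-order core predicts the absence of aliasing dynamically and recovers when the prediction is wrong; and in C the programmer can make the no-alias promise explicit with restrict.

    #include <stddef.h>

    /* Without the 'restrict' qualifiers the compiler must assume that
     * out[] may overlap in[], so it cannot move the loads ahead of the
     * stores in a static schedule.  IA-64's answer was data speculation
     * via the ALAT; a modern OoO core handles the same case with dynamic
     * memory disambiguation. */
    void scale(float *restrict out, const float *restrict in, float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * k;   /* loads may now be scheduled well ahead of the stores */
    }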
Single-issue in-order (what you call a PDP-11) turns out to be a very good software-hardware interface (i.e., an architectural principle), even if today's implementations are pretty far from a straightforward single-issue in-order design.
The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
The "Retbleed" speculative execution vulnerabilities
> Software people demand that hardware people make them faster and faster processors without changing the way it is programmed
Not sure about "demand", but that's the way it works in those areas affected by the software crisis (with the classic criterion being that software costs more than hardware), i.e., pretty much everything but supercomputers and tiny embedded controllers. It would be just too costly to rewrite all software for something like the TILE or (more extreme) Greenarrays hardware, especially in a way that performs at least as well as on mainstream hardware plus Spectre fixes.
