
What about non-reclaimable performance losses?

Posted Mar 19, 2025 15:59 UTC (Wed) by intelfx (subscriber, #130118)
Parent article: Better CPU vulnerability mitigation configuration

I have to wonder how much performance has been lost in a non-runtime-selectable fashion: to mitigation mechanisms that have to be applied at compile time (such as retpolines), or defensive microcode patches (that probably do not offer any control over their effects), or — indirectly — to various hardened coding techniques that would have been unnecessary in a world without Spectre?
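
(Retpolines in userspace, for instance, are baked in by compiler flags such as gcc's -mindirect-branch=thunk or clang's -mretpoline, so a binary built that way pays the cost on every CPU it runs on, whether or not that particular CPU needs the mitigation.)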



What about non-reclaimable performance losses?

Posted Mar 19, 2025 16:47 UTC (Wed) by fraetor (subscriber, #161147) [Link] (14 responses)

An alternative take is to wonder how much additional performance has been gained from the techniques behind all of these vulnerabilities. To make up some numbers: if we gained 30% from, say, speculative execution, and the mitigations then cost us 10%, we are still ahead on performance.
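
(With those made-up numbers, read multiplicatively: 1.3 × 0.9 ≈ 1.17, i.e. still about 17% ahead of a hypothetical non-speculative baseline.)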

What about non-reclaimable performance losses?

Posted Mar 19, 2025 17:19 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (8 responses)

Speculative execution is an absolutely massive win for single-threaded performance. I don’t know exact numbers offhand, but I’m almost certain it is well over a factor of 2.

The hardware could obviate the need for software mitigations by implementing speculative taint tracking, which ensures that speculative execution does not create any new side channels (other than power consumption) that would not otherwise be present. This costs roughly 15%.

The only way that getting rid of speculative execution could be remotely feasible is if the die area saved was used to vastly increase the core count and every program whose performance mattered was able to use all of the extra cores.

What about non-reclaimable performance losses?

Posted Mar 19, 2025 19:36 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (7 responses)

> The only way that getting rid of speculative execution could be remotely feasible is if the die area saved was used to vastly increase the core count and every program whose performance mattered was able to use all of the extra cores.

...which, for those of you playing at home, is basically impossible in practice. There are real-world algorithms that are "embarrassingly parallel," meaning you can just throw more cores at them indefinitely, but they're fairly uncommon. Most algorithms have some portion that parallelizes well and some portion that has to be serial, and most problems do not admit a 100% parallel solution (you can't begin executing JavaScript until the page's DOM has been constructed, you can't construct the DOM until you finish parsing the HTML, etc., and even with idealized perfect lazy evaluation, things can still block on each other when specific data is actually required). Worse, due to economies of scale, the (useful) algorithms that can benefit from massive parallelism are already offloaded to huge data centers and the like, so the problems you're going to solve on one device at a time are mostly of the at-least-somewhat-serial variety (or else they are parallelizable, but too small to matter).
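
To put made-up numbers on it, Amdahl's law style: if 90% of a workload parallelizes perfectly and 10% is inherently serial, the speedup from N cores is 1 / (0.1 + 0.9/N). 64 cores buy you only about 8.8x, and no number of cores gets you past 10x.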

What about non-reclaimable performance losses?

Posted Mar 19, 2025 20:16 UTC (Wed) by excors (subscriber, #95769) [Link] (4 responses)

> you can't begin executing JavaScript until the page's DOM has been constructed, you can't construct the DOM until you finish parsing the HTML, etc

Not disagreeing with your general point, but this specifically is inaccurate: you *must* interleave JS execution, DOM construction and HTML parsing. Scripts must be executed as soon as the parser sees the "</script>", because the script can use document.write() to insert new characters immediately after the ">" - e.g.:

<script>document.write("<h")</script>1>test

will end up producing an `h1` element. And the script can read, and modify, the DOM of everything that has been parsed before it (including any complete elements produced by document.write() during the script's execution).
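
To make the interleaving concrete, here is a slightly fuller (contrived) example; the element ids are made up:

<p id="before">already parsed</p>
<script>
  // The element above exists by the time this runs:
  document.getElementById("before").textContent = "modified by the script";
  // Nothing after this script element has been parsed yet, so
  // document.getElementById("after") would return null here.
  document.write("<h");  // the parser continues with "1>test" below
</script>1>test
<p id="after">not parsed until the script has finished</p>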

That makes it even harder to implement with multithreading, because there's so much synchronisation between all the components. It would be nice if it were a sequential pipeline like you suggest, but it's far messier.

Despite that, browsers still do speculative parsing and speculative DOM construction in background threads - they keep parsing while downloading and executing scripts, so they can find more resources to preload and hopefully the scripts won't do anything funny that requires a rollback of the DOM changes. (https://hacks.mozilla.org/2017/09/building-the-dom-faster...)

What about non-reclaimable performance losses?

Posted Mar 20, 2025 12:32 UTC (Thu) by Baughn (subscriber, #124425) [Link] (3 responses)

That sounds utterly bonkers. Have there been no attempts to simplify the logic?

What about non-reclaimable performance losses?

Posted Mar 20, 2025 12:36 UTC (Thu) by intelfx (subscriber, #130118) [Link]

Backward compatibility (with the entirety of the Web, no less) says hi.

What about non-reclaimable performance losses?

Posted Mar 20, 2025 13:51 UTC (Thu) by excors (subscriber, #95769) [Link]

> That sounds utterly bonkers. Have there been no attempts to simplify the logic?

This is the simplified version. It used to be that every web browser had its own unique approach, based on some combination of reverse-engineering other browsers, reverse-engineering web pages that depended on the behaviour of other browsers, and just making it up as they went along. Sometimes their behaviour would depend on TCP packet boundaries. Sometimes they'd crash. None of it was documented.

Now there are standards that document it all in great detail, very carefully designed and tested to avoid breaking compatibility with billions of old web pages, and browsers have converged on those standards, so there's only one kind of bonkers behaviour instead of many.

If you're writing web pages you can avoid a lot of the complexity and performance issues by avoiding document.write(), using <script defer>, etc. But browsers can't avoid it, because the quickest way to lose users is to be incompatible with one web page that is important to them. Browsers, and CPU manufacturers, also need to compete on performance while supporting features that were designed a decade before the first dual-core desktop CPU, so it's really hard to avoid being bottlenecked by single-thread CPU performance.
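
For instance, a page author can keep the parser unblocked with something like this (file and id names made up); a deferred script only runs after the document has been fully parsed, so it manipulates the finished DOM instead of writing into the parser's input:

<script defer src="app.js"></script>
<p id="content">parsed without waiting for the script</p>

and in app.js:

// runs once parsing of the document is complete
document.getElementById("content").insertAdjacentHTML("afterend", "<h1>test</h1>");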

Speculatively assuming sanity

Posted Mar 20, 2025 14:05 UTC (Thu) by farnz (subscriber, #17727) [Link]

Because of backwards compatibility (you can't be sure that no web page anywhere does something bonkers), you have to be able to fall back to the interleaved serial execution model at any time.

That's why the general technique browsers use to handle this is to speculatively assume that the bonkers thing doesn't actually happen, and to start again on the slow interleaved serial route if they do observe the bonkers thing. This puts pressure on the wider ecosystem to let things run in parallel: the bonkers thing will still work (it has to!), but performance is much better if you stick to sanity. And if the browser has good tools for making your sites perform better, those tools will clearly flag up that you've done something bonkers that forces the browser to abandon the fast path and restart on the slow path.
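
In pseudo-JavaScript, the shape of it is roughly this (every name here is made up; it is not real browser code):

const checkpoint = snapshotParserAndDomState();
const result = parseAheadSpeculatively();   // optimistic fast path: assume no document.write() etc.
if (result.sawSomethingBonkers) {
  restoreFrom(checkpoint);                  // throw the speculative work away
  parseAndExecuteInterleaved();             // fall back to the slow, fully-compatible path
}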

The net effect is that bonkers stuff still works (even if the original author is long gone), so you can still look at a monstrosity from 1997 in your current browser and have it work, but most sites will go towards sane over time because sane is faster.

Something similar applies to CPUs, too - it is reasonable for a CPU to slow down if you do something that's technically allowed but difficult to implement in a modern design, but it is not reasonable to break backwards compatibility just because something is hard to implement in a high-performance fashion. After all, if the code ran "fast enough" on a cacheless 16 MHz 80386, then it'll run "fast enough" on a modern PC, too, even if it forces the CPU to behave like a 100 MHz part rather than a 3 GHz one.

What about non-reclaimable performance losses?

Posted Mar 20, 2025 10:23 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

Not impossible in practice at all. Indeed, it is sufficiently practical that Sun Microsystems built a product line around such a CPU - SPARC T1 to T3. Sucked for single-thread performance, but they were fairly good at highly-parallel serving workloads, particularly on perf/W.

What about non-reclaimable performance losses?

Posted Mar 21, 2025 18:46 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Sorry, I should have quoted even more selectively than I already did. I specifically meant that *this* is impossible:

> and every program whose performance mattered was able to use all of the extra cores.

What about non-reclaimable performance losses?

Posted Mar 19, 2025 17:34 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Speculative execution provides something like a 5-10x speed-up in single-threaded performance. It's utterly indispensable.

What about non-reclaimable performance losses?

Posted Mar 19, 2025 17:45 UTC (Wed) by fraetor (subscriber, #161147) [Link] (2 responses)

My knowledge about CPU performance is about 20 years out of date, so mostly revolves around optimising cache use. Do you know of any good resources for understanding modern CPU features?

What about non-reclaimable performance losses?

Posted Mar 19, 2025 18:00 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

This book is pretty great: https://products.easyperf.net/perf-book-2 (it's free, but you can pay for it)

Performance analysis book

Posted Mar 21, 2025 12:45 UTC (Fri) by amw (subscriber, #29081) [Link]

That looks like an excellent book recommendation, thank you!

The amazon link on that page didn't work for me though. Here's one that does: https://a.co/d/3r94Xq4

What about non-reclaimable performance losses?

Posted Mar 21, 2025 22:56 UTC (Fri) by anton (subscriber, #25547) [Link]

This chart shows the instructions per cycle (IPC) of a number of microarchitectures on a number of benchmarks, running a version of Gforth. The dashed lines are for in-order microarchitectures that do not speculate; the solid lines are for out-of-order (OoO) microarchitectures with speculative execution. Looking at the median, the best in-order microarchitecture has an IPC of about 0.9, while the best OoO microarchitecture has an IPC of almost 4. On top of that, the best OoO microarchitecture is available at about 2.5 times the clock rate of the best in-order one, resulting in a total speed difference of a factor of 10.
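
(Roughly: 4 / 0.9 ≈ 4.4 from IPC alone, times 2.5 from the clock rate, is about 11, i.e. on the order of a factor of 10.)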

An interesting pairing is Bonnell and Silvermont. Bonnell, the microarchitecture of the first Intel Atom, is two-wide and in-order; Silvermont, its successor, is also two-wide but uses out-of-order execution. Silvermont has an advantage of about 1.5 in IPC on the median of these benchmarks, and another factor of 1.5 in clock rate.

Concerning the cost of the mitigations, I have seen slowdowns of a factor of 2-9.5 from using retpolines with Gforth, and that mitigates only Spectre v2 (Gforth is, admittedly, unusually heavy on indirect branches). I have read about slowdowns by a factor of more than 2 from various "speculative load hardening" mitigations, and that mitigates only Spectre v1. The Linux kernel developers try to keep the costs in check by being smart about where to apply the mitigations, but that approach has huge development costs, and they only have to err once on the wrong side of that line for the kernel to become attackable through these CPU vulnerabilities.

The better approach is to design hardware without these vulnerabilities. The "invisible speculation" approach looks most promising to me, because it costs little performance (the papers give numbers between a 20% slowdown and a small speedup, depending on the implementation variant). There are various research papers on that, and I have dabbled in the area with an overview paper.

