What about non-reclaimable performance losses?

Posted Mar 21, 2025 22:56 UTC (Fri) by anton (subscriber, #25547)
In reply to: What about non-reclaimable performance losses? by fraetor
Parent article: Better CPU vulnerability mitigation configuration

This chart shows the instructions per cycle (IPC) of a number of microarchitectures on a number of benchmarks on a version of Gforth. The dashed lines are for in-order microarchitectures that do not speculate. The solid lines are for out-of-order (OoO) microarchitectures with speculative execution. Looking at the median, the best in-order microarchitecture has an IPC of about 0.9, while the best OoO microarchitecture has an IPC of almost 4. In addition, the best OoO microarchitecture is available at about 2.5 times the clock rate of the best in-order one, resulting in a total speed difference of about a factor of 10.

An interesting pairing is Bonnell and Silvermont. Bonnell is the microarchitecture of the first Intel Atom and is in-order. Silvermont, its successor, is also two-wide but uses out-of-order execution. Silvermont has an advantage of about 1.5 in IPC on the median of these benchmarks, and another factor of 1.5 in clock rate.
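The arithmetic behind the two comparisons above can be sketched as follows. This is only an illustration. The function name `combined_speedup` is made up for this sketch, "almost 4" is rounded to 4.0, and the sketch assumes that both machines execute about the same number of instructions for a benchmark, so that speed scales with IPC times clock rate.

```python
def combined_speedup(ipc_ratio: float, clock_ratio: float) -> float:
    """Speed ratio = IPC ratio * clock-rate ratio.

    Assumes the same instruction count on both machines (an assumption
    of this sketch, not a measured property).
    """
    return ipc_ratio * clock_ratio


# Best OoO vs. best in-order: median IPC of almost 4 (rounded to 4.0 here)
# vs. about 0.9, at about 2.5 times the clock rate.
best = combined_speedup(4.0 / 0.9, 2.5)

# Silvermont vs. Bonnell: about 1.5 in IPC and another 1.5 in clock rate.
atom = combined_speedup(1.5, 1.5)

print(f"best OoO vs. best in-order: about {best:.1f}x")
print(f"Silvermont vs. Bonnell:     about {atom:.2f}x")
```

With these rounded inputs the first ratio comes out slightly above 10 and the second at 2.25, which is consistent with the "factor of 10" stated above.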

Concerning the cost of the mitigations, I have seen a slowdown by a factor of 2-9.5 from using retpolines on Gforth, and that mitigates only Spectre v2 (but Gforth is unusually heavy on indirect branches). I have read about slowdowns by a factor of more than 2 from various "speculative load hardening" mitigations, and those mitigate only Spectre v1. The Linux kernel developers try to keep the costs in check by being smart about where to apply the mitigations, but that approach has huge development costs, and they only have to err once on the wrong side of that line for the kernel to be attackable through these CPU vulnerabilities.

The better approach is to design hardware without these vulnerabilities. The "invisible speculation" approach looks most promising to me, because it costs little performance (the papers give numbers between a 20% slowdown and a small speedup, depending on the implementation variant). There are various research papers on that, and I have dabbled in the area with an overview paper.