Grand Schemozzle: Spectre continues to haunt

By Jonathan Corbet
August 8, 2019

The Spectre v1 hardware vulnerability is often characterized as allowing array bounds checks to be bypassed via speculative execution. While that is true, it is not the full extent of the shenanigans allowed by this particular class of vulnerabilities. For a demonstration of that fact, one need look no further than the "SWAPGS vulnerability" known as CVE-2019-1125 to the wider world or as "Grand Schemozzle" to the select group of developers who addressed it in the Linux kernel.

Segments are mostly an architectural relic from the earliest days of x86; to a great extent, they did not survive into the 64-bit era. That said, a few segments still exist for specific tasks; these include FS and GS. The most common use for GS in current Linux systems is for thread-local or CPU-local storage; in the kernel, the GS segment points into the per-CPU data area. User space is allowed to make its own use of GS; the arch_prctl() system call can be used to change its value.

As one might expect, the kernel needs to take care to use its own GS pointer rather than something that user space came up with. The x86 architecture obligingly provides an instruction, SWAPGS, to make that relatively easy. On entry into the kernel, a SWAPGS instruction will exchange the current GS segment pointer with a known value (which is kept in a model-specific register); executing SWAPGS again before returning to user space will restore the user-space value. Some carefully placed SWAPGS instructions will thus prevent the kernel from ever running with anything other than its own GS pointer. Or so one would think.

There is a slight catch, in that not every entry into kernel code originates from user space. Running SWAPGS if the system is already running in kernel mode will not lead to good things, so the actual code in the kernel in most cases is the assembly equivalent of:

    if (!in_kernel_mode())
    	SWAPGS

That, of course, is where things can go wrong. If that code is executed speculatively, the processor may make the wrong decision about whether to execute SWAPGS and run with the wrong GS segment pointer. This test can be incorrectly speculated either way. If the CPU is speculatively executing an entry from user space, it may decide to forego SWAPGS and run with the user-space GS value. If, instead, the system was already running in kernel mode, the CPU might again speculate incorrectly and execute SWAPGS when it shouldn't, causing it to run (speculatively) with a user-space GS value. Either way, subsequent per-CPU data references would be redirected speculatively to an address under user-space control; that enables data exfiltration by way of the usual side-channel techniques.

That looks like a wide-open channel into kernel data structures, but there are some limitations. Only Intel processors will execute SWAPGS speculatively, so the already-in-kernel-mode case is limited to those processors. When entering from user mode, though, the lack of a needed SWAPGS instruction can obviously be speculated on any processor.

The other roadblock for attackers is that, while arch_prctl() can be used by unprivileged code to set the GS pointer, it limits that pointer to user-space addresses. That does not entirely head off exploitation, but it makes it harder: an attacker must find kernel code that loads a value via GS, then uses that value as a pointer that is dereferenced in turn. As Josh Poimboeuf notes in the mitigation patch merged into the mainline:

It's difficult to audit for this gadget in all the handlers, so while there are no known instances of it, it's entirely possible that it exists somewhere (or could be introduced in the future). Without tooling to analyze all such code paths, consider it vulnerable.

The use of supervisor mode access prevention will block this attack — but only on processors that are not vulnerable to the Meltdown problem, so that is only so helpful.

It is also worth noting that there is a longstanding effort to add support for the FSBASE and GSBASE instructions, which allow direct (and uncontrolled) setting of GS from user space. There are a number of performance advantages to allowing this, so the pressure to merge the patches is likely to continue even though they would make exploiting the SWAPGS vulnerability easier.

The mitigation applied in the kernel is relatively straightforward: serializing (LFENCE) instructions are placed in the code paths that decide to (or not to) execute SWAPGS. This, of course, will slow execution down, which is why the pull request for these fixes described them as coming from "the performance deterioration department". On systems where these attacks are not a concern, the new barriers can be disabled (along with all other Spectre v1 defenses) with the nospectre_v1 command-line option.

The Spectre vulnerabilities were so-named because it was assumed that they would haunt us for a long time. The combination of speculative execution and side channels leads to a huge variety of possible attacks and an equally large difficulty in proving that no such attacks are possible in any given body of code. As a result, the pattern we see here — slowing down the system to defend against attacks that may or may not be practical — is likely to be with us for some time yet.

Index entries for this article
Kernel	Security/Meltdown and Spectre
Security	Hardware vulnerabilities
Security	Meltdown and Spectre

Grand Schemozzle: Spectre continues to haunt

Posted Aug 10, 2019 5:22 UTC (Sat) by jcm (subscriber, #18262) [Link] (4 responses)

More generally, the SWAPGS vulnerability is about realizing that Spectre v1 concerns conditional branches and that includes BOTH paths of the branch - it's not just limited to bounds checks in nature. No hardware yet deployed has general mitigations for v1 - it's all assumed to be handled by software. That does need to change, but it's going to be years.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 20:56 UTC (Tue) by magnus (subscriber, #34778) [Link] (3 responses)

Question is if there is enough of a market demand to justify Intel and AMD investing to ever fix the issue in the hardware. If the main vulnerabilities for consumers (like a bad website extracting data via Javascript) can be mitigated by software efforts there is not the PC mass market, and business customers can always run their critical stuff on a dedicated server where there is no internal attacker to exploit the side channel. Maybe if it is seen as a national security threat they could get public money to do the development.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 20, 2019 13:02 UTC (Tue) by anton (subscriber, #25547) [Link] (2 responses)

There is no reliable way to mitigate Spectre V1 and the like (e.g., the vulnerability discussed here), and the reliable way to mitigate Spectre V2 is basically to cause mispredictions on indirect branches (so they could do away with indirect branch prediction hardware and its performance benefits, if they wanted to make this the recommended way).

Sure, maybe Intel can market a CPU without proper hardware to deal with these problems even in the face of AMD having CPUs with hardware fixes for them, by paying enough wise guys on the net that claim that Joe Normal does not need to worry about these vulnerabilities, but I think that it's a better business strategy to just build hardware with proper fixes.

It's not that hard. It just takes time. Spectre was reported to Intel and AMD 2 years ago. A typical time frame for developing a CPU is reported to be five years. If a fix is a minor change, one can see it faster, but if it requires deep changes, we have to wait for most of that time.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 21, 2019 0:44 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

>Spectre was reported to Intel and AMD 2 years ago.
The logo-and-website version of it was, but the underlying principle behind the Spectre vulnerabilities has been known since the Pentium Pro and yet Intel chose to ignore it for 25 years: https://lobste.rs/s/b4gl4k/intel_80x86_processor_architec...

Grand Schemozzle: Spectre continues to haunt

Posted Aug 21, 2019 6:25 UTC (Wed) by anton (subscriber, #25547) [Link]

The paper "The Intel 80x86 Processor Architectures: Pitfalls for Secure Systems" does not cover any CPU with speculative execution, so no, this does not show that underlying principle behind Spectre vulnerabilities has been known for 25 years.

Timing side-channels have been known for even longer, but before Spectre one thought that one could protect, e.g., secret keys by writing all code that handles secret keys in a way that makes the timing data-invariant. The sensation of Spectre is that this mitigation is not enough, and that basically all code in the same security domain has the potential for becoming a gadget for extracting such data.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 12, 2019 18:01 UTC (Mon) by naptastic (guest, #60139) [Link] (12 responses)

How much does speculative execution actually improve performance? Could that die space be put to better use?

Grand Schemozzle: Spectre continues to haunt

Posted Aug 12, 2019 20:53 UTC (Mon) by excors (subscriber, #95769) [Link] (8 responses)

I think speculative execution is a significant part of how CPUs benefit from large die space: it lets the CPU find enough runnable instructions to keep all the silicon busy. If you fill a core with superscalar execution units, huge caches, etc, but it can only run a dozen instructions before it has to wait a hundred cycles for the result of a data-dependent branch, most of those resources will be idle and won't improve performance. Without speculative execution, there's little reason to build a large CPU.

(You could use the silicon to make a massively multithreaded chip with non-speculative in-order execution, but then you've basically got a GPU, and we've already got GPUs yet programmers keep writing single-threaded code and wanting it to go fast.)

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 20:49 UTC (Tue) by magnus (subscriber, #34778) [Link] (7 responses)

It's not speculation by itself that's the problem though, it's that the speculated instructions can evict cache lines (and maybe TLB entries too?). You could keep the speculation if the cache state could be made to roll back just like the architectural state when the speculation fails.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 22:04 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

The trouble is that changing microarchitectural state (like caches and TLB entries) is part of the point of speculative execution.

For a modern high performance CPU, registers are zero-latency (data available as soon as it's asked for.). L1 cache hit is 3 to 6 CPU cycles latency delay, L2 cache hit is 10 to 20 CPU cycles, L3 cache hit is around 40 CPU cycles delay. Memory itself is even further away - at 2 GHz, a RAM access is on the order of 100 to 200 CPU cycles. Arguably, virtually all the gains from speculation come in because of the changes it causes to the caches; if you have to spend hundreds of CPU cycles undoing any failed speculation, you're going to lose a lot of performance on each failures.

Worse, I suspect that a reasonable number of speculation failures gain during recovery from the modified µarch state - the right data is in cache, the right TLB entries are hot etc, so that what would have been a 100 cycle delay re-reading from main memory becomes a 20 cycle delay recovering from L2 cache instead of L1 cache.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 22:34 UTC (Tue) by magnus (subscriber, #34778) [Link] (1 responses)

For sure you don't want to go all the way back to external RAM to restore the caches and TLB when rolling back the speculation, you would have to have some form of local buffer holding the entries that were evicted to quickly restore them as part of the roll back. Possibly an undo buffer integrated directly into the L1 cache RAM in some clever way. I'm not saying this is easy to implement efficiently, if it was then it would have been fixed already.

As for your last line that you may benefit from cache updates even when mis-speculating - that's an interesting point and would be interesting to have real-world measurements on this effect. I don't see how you could allow this without also having the side channel exploitable.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 20, 2019 13:45 UTC (Tue) by anton (subscriber, #25547) [Link]

The way to go is not to restore evicted cache lines, but to keep the speculatively fetched cache lines in separate buffers and only put them in the cache (and evict other lines) when the load is commited (at that point it is no longer speculative, because the speculation has been confirmed).

The same has to be done for all other state that can be seen from the outside, or, if that is impractical (e.g., open pages for main memory accesses), wait until the instruction is no longer speculative. The latter costs a bit of performance, but given that a branch prediction can usually be confirmed in tens of cycles or less, while a main memory access takes hundreds, it should not be a big issue.

As for the benefit of reusing the speculatively fetched stuff in the correct path, branch mispredictions are rare so I don't expect that to have much of an effect. Still, maybe by having a per-thread buffer for such cache lines (or extending the buffer I mentioned above with such capabilities), one could preserve that benefit, but I would be very wary of potential side-channel attacks with such an approach.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 14, 2019 9:50 UTC (Wed) by james (subscriber, #1325) [Link] (1 responses)

I have this horrible feeling that even if you did spend hundreds of cycles undoing failed speculation, that could be a side-channel in itself.

For example,

Time how long it takes on this system to get data from RAM.
Read an accurate clock.
Force a misprediction. The CPU needs data from RAM to resolve this.
Force another misprediction on a bounds check.
Carry out a compare based on data in memory you shouldn't have access to.
If the compare is true, read in lots of data from level 3 cache. That should have evicted data in level 1 and level 2 cache by the time the misprediction at stage 3 is resolved.
Eventually, the misprediction at stage 3 will be resolved, and everything will be rolled back -- including the evictions at stage 6, if they happened.
Read the clock again.
If the whole thing has taken little longer than a single read from RAM, the comparison at stage 5 was false. If it took as long as a read from RAM plus a number of reads from level 3 cache, the comparison was true.

And that's without having another friendly thread on another core watching what happened to the level 3 cache, or a network card that's reading this data...

Grand Schemozzle: Spectre continues to haunt

Posted Aug 14, 2019 10:34 UTC (Wed) by farnz (subscriber, #17727) [Link]

Indeed. And on top of that, most speculation side-channels don't matter - it only matters when a side channel can be used to read across a security boundary. So, for example, if a side channel lets JavaScript in my web browser read the DOM of the page it's part of, that's not an issue - the JavaScript has a direct route to getting that data, so a side channel that lets it get the data slowly is not a problem. The issue kicks in when the side channel lets you cross a software security boundary - e.g. by reading the DOM of the active tab, regardless of whether you're part of that page.

So, if you roll back all speculation, you're wasting effort most of the time; it's only when a security boundary is crossed while speculating that we need to worry. This isn't, however, just about userspace to/from kernelspace crossings; it's also about userspace to/from userspace crossings in VMs and anything else that handles untrusted data.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 22:39 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

I think that's too narrow a view of Spectre - cache timings are the most easily measurable side effect of speculated instructions, so that's what the original Spectre attacks used, but there are many other side channels that could be used to smuggle secret data from speculated code (which executes without the normal security guarantees) into non-speculative code.

There's loads of internal caches and memories (including all the complexity that supports OoO), memory bus bandwidth, execution unit utilisation, chip temperature, etc, which will be influenced by speculative instructions. It seems infeasible to hide all that from non-speculative code that is running concurrently on the same hardware. It might be much harder to turn those into practical attacks than the cache side channel, but attackers will have many years to figure it out, and it would be nice if we had a comprehensive solution instead of just plugging each side channel as it gets actively exploited.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 20, 2019 14:02 UTC (Tue) by anton (subscriber, #25547) [Link]

I think the state-based speculative side channels (e.g., caches, TLBs, AVX state, open pages) can be effectively fixed at the hardware level (by not propagating speculative state to structures shared with other code until it is no longer speculative).

Side channels through resource contention can be fixed by either not having untrusted code on the same core (i.e., restrictions on using SMT/Hyperthreading; it does not buy much in my experience anyway), or (for resources shared by multiple cores, e.g., the memory controllers) by not using the shared resource speculatively.

Concerning Power/Temperature, the hardware could be designed to consume the same power no matter what you do, but we probably do not want to go there (the CPUs would run quite a lot hotter and slower with such a fix), at least until it has been shown that such exploits are really practical.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 7:21 UTC (Tue) by k8to (guest, #15413) [Link] (1 responses)

As excors says, this is largely a product of the huge difference in speed between cpus and ram, and to a lesser extent cpus and cache.

Since stalling on memory fetches reduces your execution speed by a factor of over a million, it's worth the tremendous engineering effort we've made to try to find ways to optimize memory reads, including speculative execution. The only real alternative would be to write software that explicitly was designed around its memory fetches, such as laying out instructions in data in a very explicitly linear way or something along those lines. The problems with that are general purpose software doesn't tend to have logic like that, and none of our general software toolchains are built to do that. I'm sure there are research projects and specialized silicon designed for that sort of approach though.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 13, 2019 10:26 UTC (Tue) by farnz (subscriber, #17727) [Link]

FWIW, Itanium's design assumed two things would be true in the future, and would not have suffered from Spectre problems because it didn't have hardware speculation:

Memory throughput and cache size increased far in excess of latency; this would have ensured that the large instructions (128 bits for a bundle of 3 instructions) were not a performance penalty, as the I$ would have held a large number of bundles ready to execute, and the limiting factors would have been instruction group size, and software speculation instructions.
Hardware speculation introduced more problems than it solved, and software driven speculation was the way forward - thus Itanium's explicit speculation and advanced loads, combined with hardware support for both handling loops as prologue, kernel and epilogue and register rotation to indicate parallelism in tight loops.

As reality turned out, though, memory throughput didn't increase massively compared to latency (DDR beat Rambus), and hardware speculation turned out to be a far better use of silicon space than bigger caches. Not only that, but hardware speculation actually worked well enough that even the minimal software speculation AMD64 supports (the PREFETCHx family of instructions) was largely useless - most of the time, hardware prefetching and speculation beats manual prefetching.

Grand Schemozzle: Spectre continues to haunt

Posted Aug 20, 2019 13:25 UTC (Tue) by anton (subscriber, #25547) [Link]

How much does speculative execution actually improve performance?

From our LaTeX benchmark:

time system
2.368 Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9
0.602 Core i3-3227U 1900MHz, Lenovo Thinkpad e130, Ubuntu 13.10 64b

The Atom 330 is a low-power in-order CPU, the Core i3-3227U a low-power out-of-Order execution (OoO) CPU; the latter is a little younger, but I don't expect in-order performance would have improved much (there is a reason why Intel switched to OoO for the successor of the Atom 330). My guess is that an OoO CPU without speculation would not perform much better than the in-order CPU, because it would have to wait for branch resolution every few instructions. So my answer to your question is: A factor >3. Another in-order vs. OoO pairing is:

2.488 Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04
1.224 Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04

A53 is in-order, A73 is OoO. Note that the A75 and A76 are OoO cores with significantly higher performance than the A73, so I expect a factor >3 between A76 and A53.