Grand Schemozzle: Spectre continues to haunt
Segments are mostly an architectural relic from the earliest days of x86; to a great extent, they did not survive into the 64-bit era. That said, a few segments still exist for specific tasks; these include FS and GS. The most common use for GS in current Linux systems is for thread-local or CPU-local storage; in the kernel, the GS segment points into the per-CPU data area. User space is allowed to make its own use of GS; the arch_prctl() system call can be used to change its value.
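As a minimal illustration of that last point, user space can move its own GS base around with the raw system call; there is no glibc wrapper for arch_prctl(), and the constants come from <asm/prctl.h>:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <asm/prctl.h>      /* ARCH_SET_GS, ARCH_GET_GS */

    int main(void)
    {
        static long my_area[16];        /* stand-in for a per-thread area */
        unsigned long base;

        /* Point the GS base at our own memory. */
        if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)my_area))
            perror("arch_prctl(ARCH_SET_GS)");

        /* Read it back to confirm. */
        syscall(SYS_arch_prctl, ARCH_GET_GS, &base);
        printf("GS base is now %#lx\n", base);
        return 0;
    }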
As one might expect, the kernel needs to take care to use its own GS pointer rather than something that user space came up with. The x86 architecture obligingly provides an instruction, SWAPGS, to make that relatively easy. On entry into the kernel, a SWAPGS instruction will exchange the current GS segment pointer with a known value (which is kept in a model-specific register); executing SWAPGS again before returning to user space will restore the user-space value. Some carefully placed SWAPGS instructions will thus prevent the kernel from ever running with anything other than its own GS pointer. Or so one would think.
There is a slight catch, in that not every entry into kernel code originates from user space. Running SWAPGS if the system is already running in kernel mode will not lead to good things, so the actual code in the kernel in most cases is the assembly equivalent of:
    if (!in_kernel_mode())
        SWAPGS
That, of course, is where things can go wrong. If that code is executed speculatively, the processor may make the wrong decision about whether to execute SWAPGS and run with the wrong GS segment pointer. This test can be incorrectly speculated either way. If the CPU is speculatively executing an entry from user space, it may decide to forego SWAPGS and run with the user-space GS value. If, instead, the system was already running in kernel mode, the CPU might again speculate incorrectly and execute SWAPGS when it shouldn't, causing it to run (speculatively) with a user-space GS value. Either way, subsequent per-CPU data references would be redirected speculatively to an address under user-space control; that enables data exfiltration by way of the usual side-channel techniques.
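Those "usual side-channel techniques" are typically flush+reload cache probing. The following user-space sketch shows just the probe side; the array name and the cycle threshold are ours, and the speculative victim access that would touch one page of the array is deliberately elided:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>      /* _mm_clflush(), __rdtscp() */

    static uint8_t probe[256 * 4096];   /* one page per possible byte value */

    /* Time a single load; a fast load means the line was already cached. */
    static uint64_t time_access(volatile uint8_t *p)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;
        return __rdtscp(&aux) - start;
    }

    int main(void)
    {
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe[i * 4096]);   /* flush every probe line */

        /* ... a mis-speculated victim access would load probe[secret * 4096] ... */

        for (int i = 0; i < 256; i++) {      /* reload and time each line */
            uint64_t t = time_access(&probe[i * 4096]);
            if (t < 100)                     /* threshold is machine-dependent */
                printf("value %d looks cached (%lu cycles)\n",
                       i, (unsigned long)t);
        }
        return 0;
    }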
That looks like a wide-open channel into kernel data structures, but there are some limitations. Only Intel processors will execute SWAPGS itself speculatively, so the already-in-kernel-mode case is limited to those processors. When entering from user mode, though, any processor can speculate past the test and skip a needed SWAPGS.
The other roadblock for attackers is that, while arch_prctl() can be used by unprivileged code to set the GS pointer, it limits that pointer to user-space addresses. That does not entirely head off exploitation, but it does make it harder: as Josh Poimboeuf notes in the mitigation patch merged into the mainline, an attacker must find kernel code that loads a value via GS, then dereferences that value as a pointer in turn.
The use of supervisor mode access prevention (SMAP) will block this attack, but only on processors that are not vulnerable to the Meltdown problem, so that is only so helpful.
It is also worth noting that there is a longstanding effort to add support for the FSGSBASE instructions, which allow direct (and uncontrolled) setting of the GS base from user space. There are a number of performance advantages to allowing this, so the pressure to merge the patches is likely to continue even though they would make exploiting the SWAPGS vulnerability easier.
The mitigation applied in the kernel is relatively straightforward: serializing (LFENCE) instructions are placed in the code paths that decide whether or not to execute SWAPGS. This, of course, will slow execution down, which is why the pull request for these fixes described them as coming from "the performance deterioration department". On systems where these attacks are not a concern, the new barriers can be disabled (along with all other Spectre v1 defenses) with the nospectre_v1 command-line option.
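The kernel's fences live in the assembly entry paths, but the shape of the fix is the standard Spectre-v1 pattern and is easy to sketch in C; the following illustrates the barrier's effect and is not the kernel's actual code:

    #include <stddef.h>
    #include <x86intrin.h>      /* _mm_lfence() */

    static long table[256];

    long read_entry(size_t idx, size_t limit)
    {
        if (idx < limit) {
            /* Serializing barrier: loads after this point cannot be
               executed until the bounds check above has actually
               resolved, closing the speculation window. */
            _mm_lfence();
            return table[idx];
        }
        return 0;
    }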
The Spectre vulnerabilities were so-named because it was assumed that they
would haunt us for a long time. The combination of speculative execution
and side channels leads to a huge variety of possible attacks and an
equally large difficulty in proving that no such attacks are possible in
any given body of code. As a result, the pattern we see here — slowing down
the system to defend against attacks that may or may not be practical — is
likely to be with us for some time yet.
Index entries for this article:
Kernel: Security/Meltdown and Spectre
Security: Hardware vulnerabilities
Security: Meltdown and Spectre
Posted Aug 10, 2019 5:22 UTC (Sat) by jcm (subscriber, #18262)
Posted Aug 13, 2019 20:56 UTC (Tue) by magnus (subscriber, #34778)
Posted Aug 20, 2019 13:02 UTC (Tue) by anton (subscriber, #25547)
There is no reliable way to mitigate Spectre V1 and the like (e.g., the vulnerability discussed here), and the reliable way to mitigate Spectre V2 is basically to cause mispredictions on indirect branches (so they could do away with indirect branch prediction hardware and its performance benefits, if they wanted to make this the recommended way).
Sure, maybe Intel can market a CPU without proper hardware fixes for these problems, even in the face of AMD having CPUs with hardware fixes, by paying enough wise guys on the net to claim that Joe Normal does not need to worry about these vulnerabilities; but I think it's a better business strategy to just build hardware with proper fixes.
It's not that hard, it just takes time. Spectre was reported to Intel and AMD two years ago, and a typical time frame for developing a CPU is reported to be five years. If a fix is a minor change, we may see it sooner, but if it requires deep changes, we will have to wait for most of that time.
Posted Aug 21, 2019 0:44 UTC (Wed) by flussence (guest, #85566)
The logo-and-website version of it was, but the underlying principle behind the Spectre vulnerabilities has been known since the Pentium Pro, and yet Intel chose to ignore it for 25 years: https://lobste.rs/s/b4gl4k/intel_80x86_processor_architec...
Posted Aug 21, 2019 6:25 UTC (Wed) by anton (subscriber, #25547)
The paper "The Intel 80x86 Processor Architectures: Pitfalls for Secure Systems" does not cover any CPU with speculative execution, so no, this does not show that the underlying principle behind the Spectre vulnerabilities has been known for 25 years.
Timing side-channels have been known for even longer, but before Spectre one thought that one could protect, e.g., secret keys by writing all code that handles secret keys in a way that makes the timing data-invariant. The sensation of Spectre is that this mitigation is not enough, and that basically all code in the same security domain has the potential for becoming a gadget for extracting such data.
Posted Aug 12, 2019 18:01 UTC (Mon) by naptastic (guest, #60139)
How much does speculative execution actually improve performance?
Posted Aug 12, 2019 20:53 UTC (Mon) by excors (subscriber, #95769)
(You could use the silicon to make a massively multithreaded chip with non-speculative in-order execution, but then you've basically got a GPU, and we've already got GPUs yet programmers keep writing single-threaded code and wanting it to go fast.)
Posted Aug 13, 2019 20:49 UTC (Tue) by magnus (subscriber, #34778)
Posted Aug 13, 2019 22:04 UTC (Tue) by farnz (subscriber, #17727)
The trouble is that changing microarchitectural state (like caches and TLB entries) is part of the point of speculative execution.
For a modern high-performance CPU, registers are zero-latency (data is available as soon as it's asked for). An L1 cache hit has a latency of 3 to 6 CPU cycles, an L2 hit 10 to 20 cycles, and an L3 hit around 40 cycles. Memory itself is even further away: at 2GHz, a RAM access is on the order of 100 to 200 CPU cycles. Arguably, virtually all the gains from speculation come from the changes it causes to the caches; if you have to spend hundreds of CPU cycles undoing any failed speculation, you're going to lose a lot of performance on each failure.
Worse, I suspect that a reasonable number of speculation failures gain during recovery from the modified µarch state (the right data is in cache, the right TLB entries are hot, and so on), so that what would have been a 100-cycle delay re-reading from main memory becomes a 20-cycle delay reading from the L2 cache instead.
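Those latency numbers are easy to reproduce with a pointer-chasing loop, where each load depends on the previous one and so runs at full memory latency; this is a rough, self-contained sketch (all names and sizes are ours, chosen for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)             /* 16M pointers: far bigger than L3 */

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);

        /* Sattolo's algorithm: build one big random cycle so the
           hardware prefetcher cannot guess the next address. */
        for (size_t i = 0; i < N; i++)
            next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long i = 0; i < N; i++)
            p = next[p];                    /* chain of dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the loop cannot be optimized away. */
        printf("%.1f ns per dependent load (p=%zu)\n", ns / N, p);
        free(next);
        return 0;
    }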
Posted Aug 13, 2019 22:34 UTC (Tue) by magnus (subscriber, #34778)
As for your last point, that you may benefit from cache updates even when mis-speculating: that's an interesting point, and it would be good to have real-world measurements of the effect. I don't see how you could allow it without also leaving the side channel exploitable.
Posted Aug 20, 2019 13:45 UTC (Tue) by anton (subscriber, #25547)
The way to go is not to restore evicted cache lines, but to keep the speculatively fetched cache lines in separate buffers and only put them in the cache (and evict other lines) when the load is committed (at that point it is no longer speculative, because the speculation has been confirmed).
The same has to be done for all other state that can be seen from the outside, or, if that is impractical (e.g., open pages for main memory accesses), wait until the instruction is no longer speculative. The latter costs a bit of performance, but given that a branch prediction can usually be confirmed in tens of cycles or less, while a main memory access takes hundreds, it should not be a big issue.
As for the benefit of reusing the speculatively fetched stuff in the correct path, branch mispredictions are rare so I don't expect that to have much of an effect. Still, maybe by having a per-thread buffer for such cache lines (or extending the buffer I mentioned above with such capabilities), one could preserve that benefit, but I would be very wary of potential side-channel attacks with such an approach.
Posted Aug 14, 2019 9:50 UTC (Wed) by james (subscriber, #1325)
I have this horrible feeling that even if you did spend hundreds of cycles undoing failed speculation, that could be a side-channel in itself.
For example,
And that's without having another friendly thread on another core watching what happened to the level 3 cache, or a network card that's reading this data...
Posted Aug 14, 2019 10:34 UTC (Wed) by farnz (subscriber, #17727)
Indeed. And on top of that, most speculation side-channels don't matter - it only matters when a side channel can be used to read across a security boundary. So, for example, if a side channel lets JavaScript in my web browser read the DOM of the page it's part of, that's not an issue - the JavaScript has a direct route to getting that data, so a side channel that lets it get the data slowly is not a problem. The issue kicks in when the side channel lets you cross a software security boundary - e.g. by reading the DOM of the active tab, regardless of whether you're part of that page.
So, if you roll back all speculation, you're wasting effort most of the time; it's only when a security boundary is crossed while speculating that we need to worry. This isn't, however, just about userspace to/from kernelspace crossings; it's also about userspace to/from userspace crossings in VMs and anything else that handles untrusted data.
Posted Aug 13, 2019 22:39 UTC (Tue) by excors (subscriber, #95769)
There's loads of internal caches and memories (including all the complexity that supports OoO), memory bus bandwidth, execution unit utilisation, chip temperature, etc, which will be influenced by speculative instructions. It seems infeasible to hide all that from non-speculative code that is running concurrently on the same hardware. It might be much harder to turn those into practical attacks than the cache side channel, but attackers will have many years to figure it out, and it would be nice if we had a comprehensive solution instead of just plugging each side channel as it gets actively exploited.
Posted Aug 20, 2019 14:02 UTC (Tue) by anton (subscriber, #25547)
I think the state-based speculative side channels (e.g., caches, TLBs, AVX state, open pages) can be effectively fixed at the hardware level (by not propagating speculative state to structures shared with other code until it is no longer speculative).
Side channels through resource contention can be fixed by either not having untrusted code on the same core (i.e., restrictions on using SMT/Hyperthreading; it does not buy much in my experience anyway), or (for resources shared by multiple cores, e.g., the memory controllers) by not using the shared resource speculatively.
Concerning Power/Temperature, the hardware could be designed to consume the same power no matter what you do, but we probably do not want to go there (the CPUs would run quite a lot hotter and slower with such a fix), at least until it has been shown that such exploits are really practical.
Posted Aug 13, 2019 7:21 UTC (Tue) by k8to (guest, #15413)
Since stalling on memory fetches can slow execution by two orders of magnitude or more, it's worth the tremendous engineering effort that has gone into optimizing memory reads, including speculative execution. The only real alternative would be to write software explicitly designed around its memory fetches, such as laying out instructions and data in a very explicitly linear way, or something along those lines. The problem with that is that general-purpose software doesn't tend to have logic like that, and none of our general software toolchains are built to do that. I'm sure there are research projects and specialized silicon designed for that sort of approach, though.
Posted Aug 13, 2019 10:26 UTC (Tue) by farnz (subscriber, #17727)
FWIW, Itanium's design assumed two things would be true in the future, and it would not have suffered from Spectre problems because it didn't have hardware speculation.
As reality turned out, though, memory throughput didn't increase massively compared to latency (DDR beat Rambus), and hardware speculation turned out to be a far better use of silicon space than bigger caches. Not only that, but hardware speculation actually worked well enough that even the minimal software speculation AMD64 supports (the PREFETCHx family of instructions) was largely useless - most of the time, hardware prefetching and speculation beats manual prefetching.
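For the curious, that kind of manual prefetching looks like this in C; __builtin_prefetch() compiles down to the PREFETCHx instructions on x86-64, and, as noted above, the hardware prefetcher usually makes it a wash (the function and the lookahead distance are our own illustration):

    #include <stddef.h>

    long sum_with_prefetch(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Hint: start fetching roughly 8 cache lines ahead. */
            if (i + 64 < n)
                __builtin_prefetch(&a[i + 64]);
            sum += a[i];
        }
        return sum;
    }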
Posted Aug 20, 2019 13:25 UTC (Tue) by anton (subscriber, #25547)
From our LaTeX benchmark:
time   system
2.368  Intel Atom 330, 1.6GHz, 512K L2, Zotac ION A, Debian 9
0.602  Core i3-3227U, 1900MHz, Lenovo Thinkpad e130, Ubuntu 13.10 64-bit
The Atom 330 is a low-power in-order CPU; the Core i3-3227U is a low-power out-of-order (OoO) CPU. The latter is a little younger, but I don't expect in-order performance would have improved much (there is a reason why Intel switched to OoO for the successor of the Atom 330). My guess is that an OoO CPU without speculation would not perform much better than the in-order CPU, because it would have to wait for branch resolution every few instructions. So my answer to your question is: a factor >3 (2.368/0.602 ≈ 3.9).
Another in-order vs. OoO pairing is:
2.488  Odroid N2 (1896MHz Cortex-A53), Ubuntu 18.04
1.224  Odroid N2 (1800MHz Cortex-A73), Ubuntu 18.04
The A53 is in-order, the A73 is OoO. Note that the A75 and A76 are OoO cores with significantly higher performance than the A73, so I expect a factor >3 between the A76 and the A53.