Stuffing the return stack buffer

By Jonathan Corbet
July 22, 2022

"Retbleed" is the name given to a class of speculative-execution vulnerabilities involving return instructions. Mitigations for Retbleed have found their way into the mainline kernel but, as of this writing, some remaining problems have kept them from the stable update releases. Mitigating Retbleed can impede performance severely, especially on some Intel processors. Thomas Gleixner and Peter Zijlstra think they have found a better way that bypasses the existing mitigations and misleads the processor's speculative-execution mechanisms instead.

If a CPU is to speculate past a return instruction, it must have some idea of where the code will return to. In recent Intel processors, there is a special hidden data structure called the "return stack buffer" (RSB) that caches return addresses for speculation. The RSB can hold 16 entries, so it must drop the oldest entries if a call chain goes deeper than that. As that deep call chain returns, the RSB can underflow. One might think that speculation would just stop at that point but, instead, the CPU resorts to other heuristics, including predicting from the branch history buffer. Alas, techniques for mistraining the branch history buffer are well understood at this point.

As a result, long call chains in the kernel are susceptible to speculative-execution attacks. On Intel processors starting with the Skylake generation, the only way to prevent such attacks is to turn on the indirect branch restricted speculation (IBRS) CPU "feature", which was added by Intel early in the Spectre era. IBRS works, but it has the unwelcome side effect of reducing performance by as much as 30%. For some reason, users lack enthusiasm for this solution.

Another way

Gleixner and Zijlstra decided to try a different approach. Speculative execution of return calls on these processors can only be abused if the RSB underflows. So, if RSB underflow can be prevented, this particular problem will go away. And that, it seems, can be achieved by "stuffing" the RSB whenever it is at risk of running out of entries.

That immediately leads to two new challenges: knowing when the RSB is running low, and finding a way to fill it back up. The first piece is handled by tracking the current call-chain depth — in an approximate way. The build system is modified to create a couple of new sections in the executable kernel image to hold entry and exit thunks for kernel functions and to track them. When RSB stuffing is enabled, the entry thunk will be invoked on entry to each function, and the exit thunk will be run on the way out.

The state of the RSB is tracked with a per-CPU, 64-bit value that is originally set to:

    0x8000 0000 0000 0000

The function entry thunk "increments" this counter by right-shifting it by five bits. The processor will sign-extend the value, so the counter will, after the first call, look like:

    0xfc00 0000 0000 0000

If twelve more calls happen in succession, the sign bit will have been extended all the way to the right and the counter will contain all ones, with bits beginning to fall off the right end; this counter thus cannot reliably count above twelve. In this way it mimics the RSB, which cannot hold more than 16 entries, with a safety margin of four calls; the use of shifts achieves that behavior without the need to introduce a branch. Whenever a return thunk is executed, the opposite happens: the counter is left-shifted by five bits. After twelve returns, the next shift will clear the remaining bits, and the counter will have a value of zero, which is the indication that something must be done to prevent the RSB from underflowing.

That "something" is a quick series of function calls (coded in assembly and found at the end of this patch) that adds 16 entries to the call stack, and thus to the RSB as well. Each of those calls, if ever returned from, will immediately execute an int3 instruction; that will stop speculation if those return calls are ever executed speculatively. The actual kernel does not want to execute those instructions (or all of those returns), of course, so the RSB-stuffing code increments the real stack pointer past the just-added call frames.

The end result is an RSB that no longer matches the actual call stack, but which is full of entries that will do no harm if speculated into. At this point, the call-depth counter can be set to -1 (all ones in the two's complement representation) to reflect the fact that the RSB is full. The kernel is now safe against Retbleed exploitation — until and unless another chain of twelve returns happens, in which case the RSB will need to be stuffed again.

Costs

Quite a bit of work has been put into minimizing the overhead of this solution, especially on systems where it is not needed. The kernel is built with direct calls to its functions as usual; at boot time, if the retbleed=stuff option is selected, all of those calls will be patched to go through the accounting thunks instead. The thunks themselves are placed in a huge-page mapping to minimize the translation lookaside buffer overhead. Even so, as the cover letter comments, there are costs: "We both unsurprisingly hate the result with a passion".

Those costs come in a few forms. An "impressive" amount of memory is required to hold the thunks and associated housekeeping. The bloating of the kernel has a performance impact of its own, even on systems where RSB stuffing is not enabled. The extra instructions add to pressure on the instruction cache, slowing execution. That last problem could be mitigated somewhat, the cover letter says, by allocating the thunks at the beginning of each function rather than in a separate section. Gleixner has prepared a GCC patch to make that possible, and reports that some of the performance loss is gained back when it is used.

The cover letter contains a long list of benchmark results comparing the performance of RSB stuffing against that of disabling mitigations entirely and of using IBRS. The numbers for RSB stuffing are eye-opening, including a 382% performance regression for one microbenchmark. In all cases, though, RSB stuffing performs better than IBRS.

Better performance than IBRS is only interesting, though, if the primary goal of blocking Retbleed attacks has been achieved. The cover letter says this:

The assumption is that stuffing at the 12th return is sufficient to break the speculation before it hits the underflow and the fallback to the other predictors. Testing confirms that it works. Johannes [Wikner], one of the retbleed researchers, tried to attack this approach and confirmed that it brings the signal to noise ratio down to the crystal ball level.
There is obviously no scientific proof that this will withstand future research progress, but all we can do right now is to speculate about that.

So RSB stuffing seems to work — for now, at least. That should make it attractive in situations where defending against Retbleed attacks is considered to be necessary; hosting providers with untrusted users would be one obvious example. But nobody will be happy with the overhead, even if it is better than IBRS. For a lot of users, RSB stuffing will be seen as a clever hack that, happily, they do not need to actually use.

Index entries for this article
Kernel	Security/Meltdown and Spectre

Stuffing the return stack buffer

Posted Jul 22, 2022 18:27 UTC (Fri) by alonz (subscriber, #815) [Link] (1 responses)

I like that last sentence ("all we can do right now is to speculate about that"). Yeah, look where speculation has brought us…

Stuffing the return stack buffer

Posted Jul 29, 2022 13:48 UTC (Fri) by smitty_one_each (subscriber, #28989) [Link]

https://www.hyrumslaw.com/ tells us that someone, somewhere, is making use off all of these behavioral aspects of the chip.

And that is disquieting.

Stuffing the return stack buffer

Posted Jul 22, 2022 22:11 UTC (Fri) by developer122 (guest, #152928) [Link] (5 responses)

That T430 thinkpad is feeling like a really great investment right about now.

Stuffing the return stack buffer

Posted Jul 23, 2022 13:23 UTC (Sat) by mss (subscriber, #138799) [Link] (4 responses)

Current CPUs are far faster than an Ivy Bridge CPU from a decade ago, even when the necessary mitigations for speculative execution vulnerabilities are enabled.

And these mitigations can always be disabled if not applicable to one's threat model.

Stuffing the return stack buffer

Posted Jul 23, 2022 21:42 UTC (Sat) by kenmoffat (subscriber, #4807) [Link] (3 responses)

My understanding is that on intel (AMD retbleed is a different vulnerability), anything before Skylake is not affected. Certainly, my haswell claims to be not vulnerable. And recent intel (gen 10 and later) are apparently also ok, so this seems to be only for generations 6 to 9.

Stuffing the return stack buffer

Posted Jul 24, 2022 11:45 UTC (Sun) by kenmoffat (subscriber, #4807) [Link] (2 responses)

I've been told that Ice Lake and later, except Alder Lake are vulnerable, and that the mitigation will also run on Alder Lake until new firmware is applied. See INTEL-SA-00702

Stuffing the return stack buffer

Posted Jul 24, 2022 14:13 UTC (Sun) by mss (subscriber, #138799) [Link] (1 responses)

The Intel affected CPU model page says that Ice Lake models are not affected by INTEL-SA-00702.

eIBRS parts had their own vulnerability in March (the relevant paper is here), which apparently can also be used to mount Retbleed-style attacks.

Stuffing the return stack buffer

Posted Jul 24, 2022 20:24 UTC (Sun) by kenmoffat (subscriber, #4807) [Link]

So yes, not exactly the same vulnerability, but the same mitigation.

Stuffing the return stack buffer

Posted Jul 23, 2022 0:40 UTC (Sat) by scientes (guest, #83068) [Link] (6 responses)

The real another way is to have the time dimension part of the ISA:

Prevention of Microarchitectural Covert Channels on an Open-Source 64-bit RISC-V Core

Such as with this suggested new instruction, part of seL4 development.

Stuffing the return stack buffer

Posted Jul 23, 2022 14:41 UTC (Sat) by Paf (subscriber, #91811) [Link] (4 responses)

“ We show that the addition of a single-instruction extension to the RISC-V ISA, that flushes microarchitectural state,”

That doesn’t sound like including the time domain…? It’s just flushing all state at certain transitions?

Stuffing the return stack buffer

Posted Jul 23, 2022 14:47 UTC (Sat) by Paf (subscriber, #91811) [Link] (3 responses)

Specifically, at context switches. So this looks to be only truly applicable on a system where cores are entirely independent or there’s just one core. This looks like a very limited mitigation and not much applicable to the larger cores we work with. It also relies on knowing what “all micro architecture state” is - part of the steady march of these issues has been people finding things that weren’t previously recognized as “state” worthy of protecting.

Stuffing the return stack buffer

Posted Jul 23, 2022 16:26 UTC (Sat) by epa (subscriber, #39769) [Link]

...or somehow you arrange for all the cores sharing a cache to be running code from the same protection domain at the same time. When one context switches to a different user or to the kernel, they all must.

Stuffing the return stack buffer

Posted Jul 24, 2022 20:19 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)

So all your syscalls are now basically as expensive as you can possibly make them?

Stuffing the return stack buffer

Posted Jul 24, 2022 23:03 UTC (Sun) by Paf (subscriber, #91811) [Link]

I mean, they talk about how to make that specific operation more efficient and it could be worse, but …….. yes.

Stuffing the return stack buffer

Posted Aug 2, 2022 17:13 UTC (Tue) by immibis (subscriber, #105511) [Link]

"have the time dimension part of the ISA" is a rather pretentious way of saying "add an instruction to flush cached microarchitectural state"

Stuffing the return stack buffer

Posted Jul 23, 2022 3:39 UTC (Sat) by felixfix (subscriber, #242) [Link]

Holy mackerel! There is nothing new under the sun!

Waaaay back when, late 1970s, I worked on Datapoint 2200/5500/6600 8 bit computers, Datapoint's extension of the basic 8008 into Z80 level, but different: no IX IY regs, had system/user modes, other differences.

It also had a 16 level hardware stack with no overflow or underflow detection or warning.

Its only interrupt was every millisecond whether you wanted it or not. Everything was polled.

We threw in a push, push, pop, pop, to force always using 3 levels of stack, because none of our interrupt code was supposed to use more than two levels, and user code was expected to never use more than 13 levels.

Deja vu all over again!

Stuffing the return stack buffer

Posted Jul 23, 2022 14:15 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (5 responses)

What are the odds that a microcode update could disable falling back on "other heuristics" when the RSB is exhausted?

Stuffing the return stack buffer

Posted Jul 23, 2022 14:42 UTC (Sat) by tglx (subscriber, #31301) [Link] (3 responses)

Exactly zero. If that would be feasible we would not discuss such a workaround at all.

Stuffing the return stack buffer

Posted Jul 23, 2022 19:13 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (2 responses)

That isn't surprising... even if a heroic microcode fix was possible, there is obviously a strong financial incentive to push customers towards purchasing new chips (which already have a, presumably, well validated mitigation).

I genuinely appreciate the effort and wizardry going into this problem. I suspect many are in the position of considering risk and evaluating which hosts need retbleed mitigation... and which subset of those can't afford a 1/3 loss of capacity.

It is looking like the ultimate solution is either to buy more hardware to compensate for increased kernel overhead or to upgrade to Intel "12th" gen or newer cpus with eIBRS support? Either option is difficult with the current unprecedented lead times on IT equipment.

Of course, I accepted delivery of 5 pallets of zen3 based servers immediately prior to the public retbleed disclosure.

Stuffing the return stack buffer

Posted Jul 27, 2022 15:01 UTC (Wed) by anton (subscriber, #25547) [Link] (1 responses)

Actually it is somewhat surprising. CPU designer often put in "chicken bits" for disabling new microarchitectural features, in case they turn out to be buggy. You can then still sell the CPU instead of having to scrap it. And some of these chicken bits have been known to be used over the years (and probably many more were used before the CPUs were released, and the public never heard of them).

But when they designed this fallback from the return stack buffer to the history-based indirect branch predictor into Skylake, they apparently did not put a chicken bit for that in, probably because history-based indirect branch prediction had been present in Intel CPUs for many generations.

Stuffing the return stack buffer

Posted Jul 27, 2022 20:53 UTC (Wed) by izbyshev (guest, #107996) [Link]

> CPU designer often put in "chicken bits" for disabling new microarchitectural features, in case they turn out to be buggy.

Yeah, like this one :)

Stuffing the return stack buffer

Posted Jul 24, 2022 0:32 UTC (Sun) by developer122 (guest, #152928) [Link]

Microcode is not all-powerful. Certain key mechanisms must be hardwired for performance reasons.

Stuffing the return stack buffer

Posted Jul 24, 2022 6:33 UTC (Sun) by petkan (subscriber, #54713) [Link] (1 responses)

If you need _real_ security it just better to not use _any_ sort of computer. ;)

Stuffing the return stack buffer

Posted Jul 24, 2022 6:42 UTC (Sun) by petkan (subscriber, #54713) [Link]

... and of course big "F*** you, Intel" for making 3/4 of all machines that i own a rather expensive paperweight.

Stuffing the return stack buffer

Posted Jul 24, 2022 10:56 UTC (Sun) by fw (subscriber, #26023) [Link] (9 responses)

Huh. Why doesn't the CPU execute the conditional branch in shlq $5, PER_CPU_VAR(__x86_call_depth); jz 1f speculatively, defeating the mitigation?

Stuffing the return stack buffer

Posted Jul 24, 2022 14:36 UTC (Sun) by izbyshev (guest, #107996) [Link]

I guess it does, and that's why "safety margin of four calls" mentioned in the article is needed.

Stuffing the return stack buffer

Posted Jul 24, 2022 16:06 UTC (Sun) by mss (subscriber, #138799) [Link] (7 responses)

jz 1f is not an indirect branch.

Stuffing the return stack buffer

Posted Jul 24, 2022 17:12 UTC (Sun) by izbyshev (guest, #107996) [Link] (6 responses)

But it's directly followed by ret on the fallthrough path. If the speculation window could be arbitrarily large, I don't see what would prevent CPU from simply bypassing the RSB stuffing code by taking the fallthrough path N times where N is the size of the RSB, and then still using the attacker-controlled indirect branch predictor. So it seems that this mitigation relies on a certain upper bound on the size of the speculation window.

Stuffing the return stack buffer

Posted Jul 24, 2022 17:18 UTC (Sun) by izbyshev (guest, #107996) [Link] (5 responses)

And indeed, quoting the patch:

+ * The shift count might cause this to be off by one in either direction,
+ * but there is still a cushion vs. the RSB depth. The algorithm does not
+ * claim to be perfect and it can be speculated around by the CPU, but it
+ * is considered that it obfuscates the problem enough to make exploitation
+ * extremly difficult.

Stuffing the return stack buffer

Posted Jul 24, 2022 17:44 UTC (Sun) by Paf (subscriber, #91811) [Link] (1 responses)

Which seems fine, since if the CPU designers were involved, it’s probably ok for existing CPUs and fixed in future, right?

Stuffing the return stack buffer

Posted Jul 24, 2022 18:14 UTC (Sun) by izbyshev (guest, #107996) [Link]

I can't find any indications of Intel CPU designers being involved in this mitigation, but from what I could understand, the newest CPUs are not affected, so, indeed, the mitigation has to work only on a known set of CPUs.

Stuffing the return stack buffer

Posted Jul 26, 2022 0:24 UTC (Tue) by developer122 (guest, #152928) [Link] (2 responses)

In other words, "it's fixed until we discover otherwise"

Stuffing the return stack buffer

Posted Jul 26, 2022 13:47 UTC (Tue) by mss (subscriber, #138799) [Link] (1 responses)

Some knowledgeable people already say that:

Retpoline is not safe on Skylake-era CPUs, and we knew this before the Spectre/Meltdown embargo broke in Jan '18.

RSB stuffing relies on retpolines for Spectre v2 mitigation.

Stuffing the return stack buffer

Posted Jul 26, 2022 14:47 UTC (Tue) by izbyshev (guest, #107996) [Link]

> RSB stuffing relies on retpolines for Spectre v2 mitigation.

FWIW, it's vice versa: retpolines rely on RSB stuffing to make them less broken on Skylake.

But yeah, the general sentiment of that email is that apparently retpolines would be unsafe on Skylake even if RSB stuffing were added in all cases when the RSB might become empty.