Kernel support for control-flow enforcement

By Jonathan Corbet
June 25, 2018

As attackers have lost the easy ability to execute code stored in writable memory, they have increasingly turned to return-oriented programming (ROP) and related techniques to compromise vulnerable systems. ROP attacks use the code that is present in the program under attack and are hard to defend against in software. In response, hardware vendors are developing ways to defeat ROP-like techniques at a lower level. One of the results is Intel's Control-Flow Enforcement Technology (CET) [PDF], which adds two mechanisms (shadow stacks and indirect-branch tracking) that are intended to resist these attacks. Yu-cheng Yu recently posted a set of patches showing how this technology is to be used to defend Linux systems.

The patches adding CET support were broken up into four separate groups: CPUID support and documentation, some memory-management work, shadow stacks, and indirect-branch tracking (IBT). The current patches support 64-bit systems only, and they only support CET for user-space code. Future versions are supposed to lift both of those restrictions.

ROP attacks generally work by loading a set of fabricated call frames onto the stack, each of which "returns" into a carefully chosen fragment of code. By stringing these "ROP gadgets" together, the attacker is able to execute enough useful code to take control of the system. Gadgets are plentiful in any large program; the ability to "return" into the middle of a multi-byte instruction to get an entirely different sequence of operations makes them even more available on x86 systems. The stack is, of course, writable by the running program; it contains a mixture of control-flow information (return addresses, for example) and other data. It is that mixing that has made ROP attacks possible.

One way to thwart such attacks is to move the return addresses to another context where they are not so easy to mess with; that is the core idea behind the shadow-stack functionality. Briefly, when shadow stacks are enabled, a function call will push the return address onto both the regular stack and a special shadow stack. When a return instruction is encountered, the return address is popped from both stacks and compared; if they do not match, a fault results. Both the push and pop operations are handled by the processor. As long as the attacker is unable to tamper with the shadow stack, it should prevent the use of a return instruction to divert the flow of control.
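The push-and-compare behavior described above can be modeled in a few lines. This is only a toy sketch in Python, not how the hardware or the patches implement it; the `ToyCpu` class, its method names, and the addresses are invented for illustration.

```python
class ControlProtectionFault(Exception):
    """Raised when the two copies of a return address disagree."""


class ToyCpu:
    def __init__(self):
        self.stack = []         # ordinary stack: writable by the program
        self.shadow_stack = []  # shadow stack: only call() and ret() touch it

    def call(self, return_address):
        # A call pushes the return address onto both stacks.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A return pops both copies and compares them.
        address = self.stack.pop()
        expected = self.shadow_stack.pop()
        if address != expected:
            raise ControlProtectionFault(
                f"return to {address:#x}, shadow stack says {expected:#x}")
        return address


cpu = ToyCpu()
cpu.call(0x401000)
cpu.call(0x402000)
first_return = cpu.ret()          # untouched frame: returns normally

cpu.stack[-1] = 0xdeadbeef        # attacker overwrites the saved return address
try:
    cpu.ret()
    faulted = False
except ControlProtectionFault:
    faulted = True                # the mismatch is caught before control moves
```

The attacker in this model can scribble on `stack` but has no way to reach `shadow_stack`, which is the property the real mechanism depends on.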

Preventing that tampering requires some special treatment for the shadow stack. It is allocated from a virtual-memory range, and the base address is stored into a model-specific register (MSR). The pages within the shadow stack must have a strange combination of bits set: read-only but dirty. Until now, the dirty state has been used almost exclusively by the kernel to track pages that must be written to backing store; now that the processor gives the read-only-but-dirty combination a meaning of its own, ordinary read-only pages can no longer be marked that way. As a result, a new "software dirty" bit must be allocated in the page tables to fill the role that the hardware dirty bit handled previously for such pages.
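A small sketch may help show why a second dirty bit is needed. The present, writable, and dirty bit positions below are the usual x86 page-table-entry positions; the position of `PAGE_SW_DIRTY` and the helper names are assumptions made for illustration and are not taken from the patches.

```python
PAGE_PRESENT = 1 << 0    # x86 PTE: page is present
PAGE_RW = 1 << 1         # x86 PTE: page is writable
PAGE_DIRTY = 1 << 6      # x86 PTE: hardware dirty bit
PAGE_SW_DIRTY = 1 << 57  # hypothetical position for a software dirty bit


def looks_like_shadow_stack(pte):
    """Read-only but dirty is the combination reserved for shadow stacks."""
    return bool(pte & PAGE_PRESENT) and not (pte & PAGE_RW) and bool(pte & PAGE_DIRTY)


def write_protect(pte):
    """Make an ordinary page read-only without it resembling a shadow stack."""
    pte &= ~PAGE_RW
    if pte & PAGE_DIRTY:
        # Move the dirty state into the software bit.
        pte = (pte & ~PAGE_DIRTY) | PAGE_SW_DIRTY
    return pte


def is_dirty(pte):
    """The kernel must now consult both bits to know whether a page is dirty."""
    return bool(pte & (PAGE_DIRTY | PAGE_SW_DIRTY))


written_page = PAGE_PRESENT | PAGE_RW | PAGE_DIRTY
protected = write_protect(written_page)
```

After `write_protect()`, the page still counts as dirty for writeback purposes, but its entry no longer carries the read-only-plus-hardware-dirty pattern.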

The read-only protection on the shadow stack should prevent attackers from adding their own special entries — if that protection cannot be changed. To that end, shadow stacks are allocated in a special type of virtual-memory area (VMA) marked with the new VM_SHSTK flag. System calls like madvise(), mprotect(), mremap(), and munmap() will refuse to operate on a shadow-stack VMA. There is a new set of arch_prctl() operations that will operate on shadow stacks; they are described in this documentation patch. These calls, which are unprivileged, are meant to be used at program startup to set up the stack; one of them (ARCH_CET_LOCK) can be used to prevent disabling of shadow stacks (and IBT).

One interesting issue with shadow stacks is how they will interact with retpolines, which are used to thwart Spectre variant-2 attacks. Retpolines replace indirect function calls (those where the address of the function is determined at run time) with an instruction sequence that looks a lot like a ROP attack; they will not work when a shadow stack is in use. Intel claims (in section 4.3 of this document [PDF]) that retpolines will be unneeded on processors that support CET. Hopefully there will be no surprises that will force a choice between these two protective technologies.

Jump-oriented programming is a ROP-like technique that exploits indirect jumps and function calls. One way to severely restrict such exploits is to prevent jumps to any location that was not actually intended to be jumped to. IBT does this by adding a new pair of instructions (endbr32 and endbr64) that function as no-ops but which indicate a possible target for an indirect jump or call. These instructions will be treated as no-ops by older processors that lack CET support. When IBT is enabled, the processor will require that an endbr instruction is the first one encountered after an indirect jump or call; if something else is encountered, a fault will result.
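The rule can be pictured as a one-bit state machine in a toy interpreter. This is a simplified model, not the processor's implementation; the program encoding, the opcode names other than endbr64, and the `run()` function are invented for the example.

```python
class ControlProtectionFault(Exception):
    pass


def run(program, entry, ibt_enabled=True):
    """Execute a toy program: a dict mapping address -> (opcode, operand)."""
    pc = entry
    waiting_for_endbr = False
    trace = []
    while True:
        opcode, operand = program[pc]
        # Right after an indirect branch, only endbr64 is acceptable.
        if ibt_enabled and waiting_for_endbr and opcode != "endbr64":
            raise ControlProtectionFault(
                f"indirect branch landed on {opcode!r} at {pc}")
        waiting_for_endbr = False
        trace.append(pc)
        if opcode == "halt":
            return trace
        if opcode == "jmp_indirect":
            pc = operand              # operand stands in for a register value
            waiting_for_endbr = True
        else:
            pc += 1                   # endbr64 and "work" just fall through


program = {
    0: ("jmp_indirect", 10),
    10: ("endbr64", None),   # the intended landing point
    11: ("work", None),
    12: ("halt", None),
}

hijacked = dict(program)
hijacked[0] = ("jmp_indirect", 11)   # attacker redirects the jump past the marker
```

Here `run(program, 0)` completes, while `run(hijacked, 0)` raises the fault. With `ibt_enabled=False` the hijacked jump goes through unnoticed, which is also how a processor without CET would behave, since it treats endbr64 as a no-op.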

Shadow stacks should be largely transparent to any program that is not, itself, doing strange things with return addresses on the stack. IBT is different, though; if it is enabled, the entire program must have been compiled with the necessary options to insert the endbr instructions in the right places. If a program has been so compiled, but it requires a library that has not, then IBT cannot be enabled without breaking the program. One of the jobs of the ELF loader on a CET-enabled system will be to check the CET-readiness of each library and only enable CET if all components are ready for it.
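The loader's check amounts to taking the intersection of what every component supports. The sketch below assumes each object carries a set of feature flags; the flag names, the object names, and the per-feature granularity are illustrative assumptions rather than a description of the actual ELF markings.

```python
IBT = 1 << 0     # illustrative flag: object was built with endbr instructions
SHSTK = 1 << 1   # illustrative flag: object is compatible with shadow stacks


def usable_features(objects):
    """Return the features that every loaded object is ready for."""
    features = IBT | SHSTK
    for marked in objects.values():
        features &= marked   # one unprepared object clears the bit for everyone
    return features


ready = usable_features({
    "app": IBT | SHSTK,
    "libc.so": IBT | SHSTK,
})

mixed = usable_features({
    "app": IBT | SHSTK,
    "libc.so": IBT | SHSTK,
    "libold.so": SHSTK,      # hypothetical library built without IBT support
})
```

In the second case IBT drops out even though the program itself was compiled for it, which is the situation described above.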

That leaves one interesting case uncovered, though. A program may need only CET-ready libraries to get started, but it might at some later point call dlopen() to load a library that has not been built for CET. At that point, there are only two options: turn off CET for that process, or fail the operation. If the ARCH_CET_LOCK operation described above has been invoked, only the latter option will be available. So locking can only be done at the cost of introducing a real chance of breaking programs when IBT is enabled.
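The two outcomes can be written down as a small decision function. This is a hypothetical model of the policy, not the C library's or the kernel's code; `Process`, `toy_dlopen()`, and the library names are invented, and the `locked` attribute stands in for ARCH_CET_LOCK having been invoked.

```python
class Process:
    def __init__(self, ibt_enabled=True, locked=False):
        self.ibt_enabled = ibt_enabled
        self.locked = locked      # models ARCH_CET_LOCK having been invoked
        self.libraries = []


def toy_dlopen(process, name, built_for_ibt):
    """Load a library, applying the only two options available."""
    if process.ibt_enabled and not built_for_ibt:
        if process.locked:
            # Protection cannot be dropped, so the load has to fail.
            raise OSError(f"{name}: not built for IBT and CET is locked")
        # Otherwise the process quietly loses its protection.
        process.ibt_enabled = False
    process.libraries.append(name)


unlocked = Process()
toy_dlopen(unlocked, "libplugin.so", built_for_ibt=False)

locked = Process(locked=True)
try:
    toy_dlopen(locked, "libplugin.so", built_for_ibt=False)
    load_failed = False
except OSError:
    load_failed = True
```

Neither branch is attractive: the unlocked process keeps running without IBT, and the locked one cannot load the library at all.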

That led to a long discussion about whether ARCH_CET_LOCK makes sense at all. Kees Cook argued that, in its absence, attackers will focus all of their energies on finding a way to turn CET off before carrying out the real attack. Andy Lutomirski responded that, by the time an attacker can disable CET, they are already in control and there's not much CET can do anyway. How that will be resolved is unclear at this time.

Disagreements over details like that notwithstanding, there appear to be no concerns (outside of grsecurity land anyway) about the CET features overall. They should make the system far more resistant to some common attack techniques with, seemingly, little in the way of performance or convenience costs. Chances are, though, that this technology won't be accepted until it is able to cover kernel code as well, since that is where a lot of attacks are focused. So CET support in Linux won't happen in the immediate future — but neither will the availability of CET-enabled processors.

Kernel support for control-flow enforcement

Posted Jun 26, 2018 2:38 UTC (Tue) by TheJH (subscriber, #101155) [Link] (1 responses)

seems like LWN is a more reliable way to find out about userspace API changes than the linux-api list...

Kernel support for control-flow enforcement

Posted Jun 26, 2018 15:52 UTC (Tue) by JFlorian (guest, #49650) [Link]

seems like LWN is a more reliable way to find out about * than *...

fixed that for ya :-)

Kernel support for control-flow enforcement

Posted Jun 26, 2018 2:55 UTC (Tue) by mtaht (subscriber, #11087) [Link] (4 responses)

It is probably hopeless for me to point once again at the mill cpu design, which
has a segmented and inaccessible stack, no register rubble, and a PLB good to the byte + a TLB.

https://millcomputing.com/docs/security/

spectre vs the mill: https://millcomputing.com/white-papers/

No matter if the mill is never funded and built, there are many ideas left in it that I hope will see
incremental adoption in other cpus.

Kernel support for control-flow enforcement

Posted Jun 26, 2018 5:20 UTC (Tue) by alison (subscriber, #63752) [Link] (3 responses)

Has millcomputing.com ever communicated with the RISC-V folks?

We all know what TLB is, but PLB?

Kernel support for control-flow enforcement

Posted Jun 26, 2018 6:15 UTC (Tue) by cpitrat (subscriber, #116459) [Link]

Processor Local Bus, I guess?

Kernel support for control-flow enforcement

Posted Jun 26, 2018 11:11 UTC (Tue) by renox (guest, #23785) [Link]

> Has millcomputing.com ever communicated with the RISC-V folks?

Why would they? The Mill guys are building a set of patents to license; I doubt that they're interested in helping a "free ISA" CPU.

Kernel support for control-flow enforcement

Posted Jun 26, 2018 13:12 UTC (Tue) by farnz (subscriber, #17727) [Link]

In the context of Mill, it's "Protection Lookaside Buffer" - like a TLB, but no translation, just read/write/execute permissions.

Missing manipulation of return address

Posted Jun 26, 2018 3:15 UTC (Tue) by jreiser (subscriber, #11027) [Link] (1 responses)

If Intel (and AMD?) are going to the trouble of implementing CET, perhaps they could be persuaded to implement two related missing instructions? PUSHRA (push return address) and POPRA (pop return address) are the same as ordinary PUSH and POP, except for the explicit hint that the data item is a return address, and thus the side stack (cache) of return addresses should be manipulated accordingly. Such instructions would enable a measurable speed improvement (> 1%) for some mixed compiled+interpreted language runtime systems, by reducing the overhead from a mis-predicted branch when a compiled RET encounters a mismatch between the side stack and the real stack because the interpreter was in control at the time of the PUSHRA. PUSHRA is 0xFF/7 (hex 0xFF followed by modR/M of octal 0m7r) and POPRA is 0x8F/1 (hex 0x8F followed by modR/M of octal 0m1r).

Missing manipulation of return address

Posted Jun 26, 2018 5:10 UTC (Tue) by luto (guest, #39314) [Link]

Where does your 1% number come from?

Implementing PUSHRA may require special hardware. Consider PUSHRA mem; CALL. The CPU wants to speculatively execute CALL before the PUSHRA argument becomes available. To do that, either the CPU can push garbage to the return stack, in which case the instruction would be almost useless, or PUSHRA would need to reserve a return stack slot and fill it later. The latter might be fairly intrusive to a CPU design.

POPRA reg/mem seems silly. A POPRA that simply discards a return stack entry and has no effect on RSP seems better. Also, speculative execution of PUSHRA might not even be supportable by conventional return stack hardware.

Also, why are you assigning your hypothetical instructions opcodes at all, let alone single-byte opcodes?

Kernel support for control-flow enforcement

Posted Jun 26, 2018 3:18 UTC (Tue) by pabs (subscriber, #43278) [Link] (7 responses)

grsec also have a ROP solution called RAP:

https://www.grsecurity.net/rap_announce.php
https://www.grsecurity.net/rap_announce_full.php

Kernel support for control-flow enforcement

Posted Jun 26, 2018 5:41 UTC (Tue) by Lionel_Debroux (subscriber, #30014) [Link] (6 responses)

What's more, RAP is (much) more powerful than CET and doesn't require special, architecture-specific hardware support available on very few real-world processors at the time of this writing.
Likewise for PaX's KERNEXEC and MEMORY_UDEREF, which Intel and ARM eventually implemented years later as SMEP / PXN and SMAP / PAN . In PaX, these took advantage of PCID / INVPCID years before mainline integrated support for these to reduce the performance impact of KPTI.

But hey, CET will probably be implemented, because it's *something* to raise the low baseline of Linux to a slightly higher level, for a minority of computers :)

An alternative version of the grsecurity.net link posted at the end of the article should be https://forums.grsecurity.net/viewtopic.php?f=7&t=4490 .

Kernel support for control-flow enforcement

Posted Jun 26, 2018 6:02 UTC (Tue) by cpitrat (subscriber, #116459) [Link] (5 responses)

> RAP is (much) more powerful than CET

It would be interesting to detail why (rather than attempt to start a troll).

Kernel support for control-flow enforcement

Posted Jun 26, 2018 6:29 UTC (Tue) by Lionel_Debroux (subscriber, #30014) [Link] (4 responses)

Well, I'll just copy the first post of the very forum topic I linked (and re-read before linking), adding a bit of emphasis :)
The current LWN article says "which adds two mechanisms (shadow stacks and indirect-branch tracking)", so technically, PaXTeam's two-year-old post still holds, AFAICT.

The first post of https://forums.grsecurity.net/viewtopic.php?f=7&t=4490 reads:

Intel's recent announcement ([A1], [A2]) of their hardware support for a form of Control Flow Integrity (CFI) has raised a lot of interest among the expert as well as the popular press. As an interested party we've decided to look at some of the details and analyze the strengths and weaknesses of Intel's Control-flow Enforcement Technology (CET). Note that all the discussion below is based off Intel's published technology preview documents. As no processor with the claimed technology will exist for several years, the details are not complete and may change in small ways prior to production.

Full disclosure: we have a competing production-ready solution to defend against code reuse attacks called RAP, see [R1], [R2]. RAP isn't tied to any particular CPU architecture or operating system, and it scales to real-life software from Xen to Linux to Chromium with excellent performance.

Following typical CFI schemes ([P1]), CET provides two separate mechanisms to protect indirect control flow transfers: one for forward edges (indirect calls and jumps) and another for backward edges (function returns). As we'll see, they have very different characteristics, so we'll look at them each individually.

Indirect Branch Tracking

The forward edge mechanism is called Indirect Branch Tracking (IBT) and is designed to allow only designated code locations as valid targets for indirect calls and jumps ([N1]). This is no different from other approaches in the field. What does differentiate these schemes is their precision, that is, the number of allowed targets at each indirect control transfer instruction. Intuitively, the less locations an attack can target, the less likely that those locations will be useful for something. Without any CFI an attacker can target any executable byte in the program's address space. CFI, ideally, restricts this set to a minimum at each indirect control transfer instruction.

How does CET fare in this regard? Very badly unfortunately as CET implements the weakest form of CFI in that there's only a single class of valid targets. That is, any indirect control transfer can be redirected to any of the designated target locations (similar to what Microsoft's CFG allows). Such simplistic schemes have been proven to be fatally weak by both academic and industry researchers ([P2] [P3] [P4] [P5]).

In contrast, RAP's type hash based classification can create over 30.000 function pointer classes and 47.000 function classes for Chromium (this means among others that thousands of otherwise valid functions cannot be called indirectly at all).
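To make the precision argument concrete, here is a toy comparison of a single-class scheme with a type-based one. It is a sketch only and reflects neither CET's nor RAP's actual mechanism: the function names and signatures are invented, and a plain string comparison stands in for a type hash.

```python
# Hypothetical functions in a program, each with its C-style type signature.
functions = {
    "open_file": "int(const char *)",
    "run_shell": "int(const char *)",
    "draw_pixel": "void(int, int)",
    "log_value": "void(int)",
}


def coarse_targets(functions):
    """Single class: every marked function is valid for every indirect call."""
    return set(functions)


def typed_targets(functions, pointer_type):
    """Type-based classes: only functions matching the pointer's type."""
    return {name for name, signature in functions.items()
            if signature == pointer_type}


# An indirect call through a pointer of type void(int, int):
coarse = coarse_targets(functions)
typed = typed_targets(functions, "void(int, int)")
```

Under the coarse scheme a corrupted `void(int, int)` pointer may still reach `run_shell`; under the typed scheme only `draw_pixel` remains a valid target. Functions sharing a signature (`open_file` and `run_shell` here) stay interchangeable even in the typed scheme.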

Beyond the design flaw identified above, there are also implementation problems with CET. One of them is related to the fact that the hardware has not one but two state machines to keep track of the IDLE/WAIT_FOR_ENDBRANCH states for user and kernel mode code, respectively. Only one state machine is active at a time depending on the privilege level of the currently executing code; the other state machine is suspended. There is however no mention in the documentation how this hidden state is saved and restored when the privilege boundary is crossed by a system call, interrupts, etc.

This in particular seems to make it impossible for a kernel to switch contexts between threads since it may very well happen that the outgoing thread was interrupted in a different state than what the incoming thread would like to resume in, triggering an instant Control Protection fault upon returning from the kernel to userland. The same problem arises with UNIX style signal delivery and other similar asynchronous userland code invocations. Hopefully this is merely an oversight in the documentation and not the design itself.

Another problem is the support mechanism for compatibility with code that hasn't been recompiled for CET. The Legacy Code Bitmap (LCB) seems to be direct hardware support for Microsoft's CFG scheme and suffers from the same problems as a result identified by earlier research ([P6], [P7], [P8]).

Interestingly, this same compatibility mechanism could be used to fix the fatal flaw of the coarse-grained design. Namely, to simulate fine-grained CFI one could create a separate bitmap for each indirect call type and activate it for the call. The implementation would suffer from increased memory usage (one LCB per function pointer type) and it'd also have a large performance impact due to the slow access to the MSR storing the address of the LCB (this would be even worse for userland as the MSR doesn't seem to be writable directly from user mode code). Needless to say, RAP achieves fine-grained forward-edge CFI without this performance impact.

A third problem with IBT is that to mark valid indirect branch targets an otherwise useless instruction must be emitted at the target location which wastes instruction decoding bandwidth at least (and probably more on non-CET capable processors). In contrast, RAP's type hash based marking scheme was specifically designed to avoid this problem thus its only impact is on memory use.

Shadow Stack

Let's now look at CET's offering for protecting function returns. This mechanism is based on the well-known concept of shadow stacks that have been (re)invented and implemented many times in the past ([P9]).

Shadow stacks aim to provide secure storage for return addresses that can only be written by call instructions. This ensures that memory corruption bugs cannot be used to divert control flow at function returns, which used to be a widespread exploitation technique since the beginnings of time.

While the shadow stack design is sound as it provides precise enforcement of call/return pairs, implementing it in real life systems faces several problems such as protecting the shadow stack region itself from memory corruption attacks, performance overhead of instructions needed to read from and write to the shadow stack, and compatibility with programming constructs that intentionally violate the strict call/return pairing assumed by the shadow stack design.

Traditional shadow stack implementations all suffer from the problem that they're writable and thus subject to memory corruption themselves. Fixing this by changing memory protection rights on each function call and return is prohibitively expensive thus most designs either assume a weaker threat model or try to hide behind ASLR (which is vulnerable to more powerful threats itself).

Intel's shadow stack design solves the problem of writable shadow stacks by giving hardware support to separate the shadow stack memory from other data and allow only designated instructions to write there. This is a sound design but the particular implementation requires implementors to be careful.

Namely, the way shadow stacks are marked seems to make RELRO and text relocated pages look like shadow stacks as well (they're all read-only but have been written to thus dirty in the last level page table entries). This can become a problem if the actual shadow stack area is ever mapped directly next to such a mapping as overflowing or underflowing the shadow stack may go unnoticed and give rise to an attack. Speaking of which, the current document doesn't say anything about how shadow stack overflows/underflows are handled.

Finally, as already discovered by past implementors ([P10] [P11] [P12] [P13]), shadow stacks cannot be used through compiler modifications only. Each OS has their own exceptional cases that need special handling. On Linux and similar OSes, these exceptional cases include the setjmp/longjmp/makecontext/setcontext set of functions which can violate the assumption that a function will return to its call site. It also includes the default glibc behavior of lazy binding (done for performance reasons) as well as C++ exceptions and asynchronous signal handling.

Conclusion

In summary, Intel's CET is mainly a hardware implementation of Microsoft's weak CFI implementation with the addition of a shadow stack. Its use will require the presence of Intel processors that aren't expected to be released for several years. Rather than truly innovating and advancing the state of the art in performance and security guarantees as RAP has, CET merely cements into hardware existing technology known and bypassed by academia and industry that is too weak to protect against the larger class of code reuse attacks. One can't help but notice a striking similarity with Intel's MPX, another software-dependent technology announced with great fanfare a few years ago that failed to live up to its many promises and never reached its intended adoption as the solution to end buffer overflow attacks and exists only as yet another bounds-checking based debugging technology.

In comparison, RAP is architecture-independent, best of breed in performance and security, doesn't require the latest CPU, and gives software developers the powerful ability to easily make the protections from RAP even more fine-grained.

[A1] http://blogs.intel.com/blog/intel-innovating-stop-cyber-attacks/
[A2] http://blogs.intel.com/evangelists/2016/06/09/intel-release-new-technology-specifications-protect-rop-attacks/
[R1] https://pax.grsecurity.net/docs/PaXTeam-H2HC15-RAP-RIP-ROP.pdf
[R2] https://grsecurity.net/rap_announce.php
[N1] Note that in practice indirect calls are the interesting case as the typical use of indirect jumps is to implement high level switch/case constructs where the code addresses and the paths leading to them are already in read-only memory and thus not subject to memory corruption.
[P1] https://pax.grsecurity.net/docs/pax-future.txt
[P2] https://www.usenix.org/system/files/conference/usenixsecurity14/sec14-paper-davi.pdf
[P3] http://nsl.cs.columbia.edu/projects/minestrone/papers/outofcontrol_oakland14.pdf
[P4] https://people.csail.mit.edu/stelios/papers/jujutsu_ccs15.pdf
[P5] http://dl.acm.org/citation.cfm?id=2813671
[P6] https://blog.coresecurity.com/2015/03/25/exploiting-cve-2015-0311-part-ii-bypassing-control-flow-guard-on-windows-8-1-update-3/
[P7] https://www.blackhat.com/docs/us-15/materials/us-15-Zhang-Bypass-Control-Flow-Guard-Comprehensively-wp.pdf
[P8] http://xlab.tencent.com/en/2015/12/09/bypass-dep-and-cfg-using-jit-compiler-in-charkra-engine/
[P9] http://www.angelfire.com/sk/stackshield/
[P10] http://mosermichael.github.io/cstuff/all/projects/2011/06/19/stack-mirror.html
[P11] https://www.cs.utah.edu/plt/publications/ismm09-rwrf.pdf
[P12] http://seclab.cs.sunysb.edu/seclab/pubs/vee14.pdf
[P13] https://www.trust.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_TRUST/PubsPDF/ropdefender.pdf

Kernel support for control-flow enforcement

Posted Jun 26, 2018 16:43 UTC (Tue) by ju3Ceemi (subscriber, #102464) [Link] (3 responses)

Well, how may I say that ..

The pax stuff is not in a competition, nor is an alternative.
Because, with pragmatism:
- proprietary -> worthless, it will not be used by the mass
- not upstreamed -> ibid

Kernel support for control-flow enforcement

Posted Jun 27, 2018 8:55 UTC (Wed) by citypw (guest, #82661) [Link] (2 responses)

What do you mean "proprietary -> worthless"? Does Intel ever have PCT patents? SGX? CET? Big corps can have their patents and that's all right. Why shouldn't a small open source consulting company do the same? I mean, what's your point?

"it will not be used by the mass", could you plz give some data statistics? AFAIK, PaX's RAP is the only kernel CFI solution in the production environment.

Kernel support for control-flow enforcement

Posted Jun 28, 2018 1:19 UTC (Thu) by pabs (subscriber, #43278) [Link] (1 responses)

I assume they meant that since RAP (and the rest of grsec/PaX) is hidden behind grsecurity's support contracts, it will never be integrated into popular branches of Linux (like Android or mainline) and thus never reach the majority of systems that run Linux.

Kernel support for control-flow enforcement

Posted Jul 5, 2018 23:34 UTC (Thu) by nix (subscriber, #2304) [Link]

It doesn't seem impossible to reimplement, just fiddly. And CET is fiddly too, and, uh... weak. Distinctly weak. (Mind you, RAP requires thinking about every language you implement it for -- but not all that terribly much, and CET requires compiler modifications too, so that's a wash.)

Kernel support for control-flow enforcement

Posted Jun 28, 2018 1:20 UTC (Thu) by alkbyby (subscriber, #61687) [Link] (2 responses)

I guess I will have to read the docs, but perhaps someone could post a high-level overview of how this stuff is supposed to interact with features such as swapcontext, longjmp or even just throwing an exception.

longjmp and such

Posted Jun 28, 2018 13:04 UTC (Thu) by corbet (editor, #1) [Link]

I kind of skipped over that part, sorry. Some of the new arch_prctl() calls are there to let the C library handle things like that; they allow changes to be made to the shadow stack when needed.

Kernel support for control-flow enforcement

Posted Jun 29, 2018 6:18 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

For longjmp and exception handling you have to save and restore the shadow stack pointer. Swapcontext on the other hand needs to allocate a shadow stack in addition to the stack, and save/restore the shadow stack base in addition to the pointer. There are instructions that are used to switch shadow stacks. Windows will probably integrate them in the fiber API, Linux will probably put up a "some assembly required" note (pun intended).
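The longjmp case can be illustrated with a toy model in which both stacks are lists and the "pointers" are simply their lengths. This is a sketch under those assumptions and does not show the real instructions or the C library's code; `ToyThread`, `unwind()`, and the addresses are invented.

```python
class ControlProtectionFault(Exception):
    pass


class ToyThread:
    def __init__(self):
        self.stack = []
        self.shadow_stack = []

    def call(self, return_address):
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        address, expected = self.stack.pop(), self.shadow_stack.pop()
        if address != expected:
            raise ControlProtectionFault(hex(address))
        return address

    def setjmp(self):
        # Save both stack pointers, modeled here as list lengths.
        return len(self.stack), len(self.shadow_stack)

    def longjmp(self, saved, restore_shadow=True):
        stack_pointer, shadow_pointer = saved
        del self.stack[stack_pointer:]
        if restore_shadow:
            del self.shadow_stack[shadow_pointer:]


def unwind(restore_shadow):
    thread = ToyThread()
    thread.call(0x1000)                     # main calls f
    saved = thread.setjmp()                 # f calls setjmp()
    thread.call(0x2000)                     # f calls g
    thread.call(0x3000)                     # g calls h
    thread.longjmp(saved, restore_shadow)   # h jumps back into f
    return thread.ret()                     # f returns to main
```

`unwind(True)` returns to the caller normally. `unwind(False)` faults, because the shadow stack still holds the entries for the abandoned frames when f executes its return.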

Kernel support for control-flow enforcement

Posted Jun 28, 2018 7:37 UTC (Thu) by robert_s (subscriber, #42402) [Link]

But would this support be maintained or would we end up with another Intel MPX fiasco?