
Control-flow integrity in 5.13

By Jonathan Corbet
May 21, 2021
Among the many changes merged for the 5.13 kernel is support for the LLVM control-flow integrity (CFI) mechanism. CFI defends against exploits by ensuring that indirect function calls have not been redirected by an attacker. Quite a bit of work was needed to make this feature work well for the kernel, but the result appears to be production-ready and able to defend Linux systems from a range of attacks.

Protecting indirect function calls

The kernel depends heavily on indirect function calls — calls where the destination address is not known at compile time. Device drivers, filesystems, and other kernel subsystems interface with the generic, core code by providing functions to be called to carry out specific actions. When the time comes to, for example, open a file (which may be a special file corresponding to a device), the core kernel will make an indirect call to the appropriate open() function defined in the file_operations structure for the file in question. Indirect function calls allow for a clean separation between generic and low-level code.
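
As a userspace illustration of that pattern (the structure and functions here are simplified stand-ins, not the real kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

struct inode;                           /* opaque in this sketch */
struct file;

/* A cut-down file_operations: the driver fills in callbacks. */
struct file_operations {
	int (*open)(struct inode *inode, struct file *file);
};

static int null_open(struct inode *inode, struct file *file)
{
	(void)inode; (void)file;
	return 0;                       /* this "open" always succeeds */
}

static const struct file_operations null_fops = {
	.open = null_open,              /* low-level code supplies the target */
};

/* Generic code calls through the pointer without knowing the target —
 * this is the indirect call that CFI protects. */
static int do_open(const struct file_operations *fops,
		   struct inode *inode, struct file *file)
{
	return fops->open(inode, file);
}
```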

This mechanism is flexible and performs well, but it also makes those indirect calls into an attractive target for attackers. If an indirect call can be redirected to an attacker-chosen location, there are few limits to the disorder that can result. Changes over the years have made it hard for attackers to inject their own code into the kernel, but if they can force execution to an arbitrary location, that matters little. Note that an exploit need not redirect a call to the beginning of another function; it can jump to any arbitrary point within the kernel image. There is no shortage of useful targets for a corrupted indirect function call.

CFI attempts to block this sort of exploit by restricting indirect calls to locations that are plausible targets. In this case, "plausible" means that the call goes to the beginning of an actual function, and that the target function has the same prototype as the caller was expecting. That is not a perfect test; there may be functions with the same prototype that will perform some sort of useful action for an attacker. But the result is still a massive reduction in the set of available targets, which will often be enough.

This check is often called "forward-edge CFI", since it protects calls to functions. The corresponding "backward-edge" protection ensures that return addresses on the stack have not been tampered with. The patches merged for 5.13 are focused on the forward-edge problem.

LLVM CFI in Linux

This CFI implementation works by examining the full kernel image at link time; for this reason, link-time optimization must also be enabled to use it. Whenever LLVM finds a location where the address of a function is taken, it makes a note of the function and its prototype. It then injects a set of "jump tables" into the built kernel, one for each function prototype encountered. So, for example, the open() function mentioned above is defined as:

    int (*open) (struct inode *inode, struct file *file);

There are many functions in the kernel matching this prototype that have their address taken to stuff into a file_operations structure somewhere. LLVM will collect them all into a single jump table, which is essentially a list of the addresses of these functions.

The next step is to change all of the places where that function's address is taken, and replace the address with the corresponding location in the jump table. So an assignment like:

    func_ptr = my_open_function;

will result in assigning an address within the jump table to func_ptr.

Finally, whenever an indirect function call is made, control goes to a special function called __cfi_check(); this function receives, along with the target address, the address of the jump table matching the prototype of the called function. It verifies that the target address is, indeed, an address within the expected jump table, extracts the real function address from the table, and jumps to that address. If the target address is not within the jump table, the default action is instead to assume that an attack is in progress and immediately panic the system. A permissive mode, selectable at configuration time, simply logs the error and continues.
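
The check can be sketched in userspace C as follows. This is a conceptual model only: the real jump tables are compiler-generated blocks of branch instructions and __cfi_check() is emitted by LLVM, not hand-written, but the range check and the extra indirection are the same idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*open_fn)(int);

static int open_a(int fd) { return fd + 1; }
static int open_b(int fd) { return fd + 2; }

/* Stand-in for the jump table LLVM emits for this prototype. */
static const open_fn open_table[] = { open_a, open_b };

static bool in_table(const open_fn *target, const open_fn *table, size_t n)
{
	return target >= table && target < table + n;
}

/* Callers hold pointers into the table, not to the functions themselves;
 * anything outside the table is treated as an attack. */
static int checked_call(const open_fn *target, int arg)
{
	if (!in_table(target, open_table, 2)) {
		fprintf(stderr, "CFI failure: bad call target\n");
		abort();	/* the kernel's default is to panic */
	}
	return (*target)(arg);	/* extract the real address and call it */
}
```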

Kernel-specific quirks

That severe response may be justified, but it would be awfully annoying if there were situations where the kernel makes an indirect call to a function that doesn't exactly match the prototype of the pointer being used. So, naturally, the kernel did exactly that. In pre-5.13 kernels, list_sort() was declared as:

    void list_sort(void *priv, struct list_head *head,
		   int (*cmp)(void *priv, struct list_head *a, struct list_head *b))

The comparison function cmp() is passed in by the caller and is invoked, via an indirect call, to compare items in the list. Inside list_sort(), though, one sees this line:

    a = merge(priv, (cmp_func)cmp, b, a);

The cmp_func() type to which the function pointer is cast looks almost like the prototype of cmp(), except that the two list_head pointers have the const attribute. That is enough to change the prototype of the function and, at run time, to cause a CFI failure. The fix that was adopted was to propagate the const attribute to the callers of list_sort() so that the cast of the function pointer became unnecessary. That, however, required changing callers in 40 different files across the kernel source.
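
The mismatch can be shown with a self-contained sketch (the types and names here are invented for illustration). Calling through the const-qualified pointer type happens to work on common ABIs, which is exactly why the cast went unnoticed until CFI started checking prototypes at run time:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for struct list_head. */
struct list_head { struct list_head *next, *prev; };

/* The old cmp type, and the const-qualified type it was cast to —
 * distinct prototypes as far as CFI's matching is concerned. */
typedef int (*cmp_t)(void *priv, struct list_head *a, struct list_head *b);
typedef int (*cmp_func)(void *priv, const struct list_head *a,
			const struct list_head *b);

static int my_cmp(void *priv, struct list_head *a, struct list_head *b)
{
	(void)priv;
	return a == b ? 0 : 1;		/* arbitrary comparison */
}

/* The pre-5.13 pattern: a cast silences the compiler, but leaves an
 * indirect call whose expected prototype no longer matches the target's. */
static int call_as_cmp_func(cmp_func f, void *priv,
			    struct list_head *a, struct list_head *b)
{
	return f(priv, a, b);
}
```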

Another interesting quirk comes from the fact that the jump tables are built at link time. That works for the monolithic kernel, but loadable modules are linked separately. CFI in loadable modules works, but each module gets its own jump tables. Remember that function pointers are replaced by pointers into the jump tables; since modules have different jump tables, they will get different pointers as well. In other words, the values of two pointers to the same function will differ if one of them is in a loadable module.

For the most part, things will work anyway; calls to those two different pointers will end up in the same place. But consider this line in __queue_delayed_work():

    WARN_ON_ONCE(timer->function != delayed_work_timer_fn);

This test was added to the 3.7 kernel in 2012 as a way to "detect delayed_work users which diddle with the internal timer"; nearly nine years later one assumes that they have all been found, but the test remains. But, if CFI is in use, then the address for delayed_work_timer_fn() as seen from a loadable module will not be the same as the address seen from the core kernel; that will cause the test to fail. There are a couple of places in the kernel with tests like this; they have been "fixed" by simply disabling the test when CFI is configured in.
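
The effect can be modeled in userspace: here two hand-written trampolines stand in for the per-image jump tables. Both reach the same function, so calls behave identically, but comparing the pointers fails:

```c
#include <assert.h>

typedef int (*timer_cb)(int);

static int timer_fn(int x) { return x * 2; }

/* Trampoline in the core kernel's jump table... */
static int core_tramp(int x) { return timer_fn(x); }
/* ...and an equivalent entry in a module's separate jump table. */
static int module_tramp(int x) { return timer_fn(x); }

/* What "the address of timer_fn" looks like from each side. */
static const timer_cb core_ptr = core_tramp;
static const timer_cb module_ptr = module_tramp;
```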

Various other things needed to be fixed as well, including making provisions for parts of the code that absolutely must have a direct pointer to a function rather than to the jump table. CFI in the kernel only works for the arm64 architecture in 5.13; support for x86 is in the works but is not yet ready to be enabled. There doesn't seem to be much in the way of data regarding the performance impact of this feature, but the LLVM page describing CFI says that its cost is "less than 1%".

CFI looks like a new feature that could have some scary, sharp edges. It is worth noting though that Kees Cook, when he sent the pull request asking that the patches (which were written by Sami Tolvanen) be merged, said that CFI "has happily lived in Android kernels for almost 3 years". It is, in other words, already widely deployed in the real world and probably doesn't have many surprises left to offer — except, perhaps, for attackers, who will find that many of their exploits no longer work.

Index entries for this article
Kernel: Releases/5.13
Kernel: Security/Control-flow integrity
Security: Linux kernel



Control-flow integrity in 5.13

Posted May 21, 2021 16:38 UTC (Fri) by anton (subscriber, #25547) [Link] (7 responses)

Given that all targets of each indirect call are known, instead of a checked use of an indirect jump table, the indirect call could be replaced by a hard-coded binary search among the possible targets, and finally a direct call. The comparisons and conditional branches of this search cost something, but given that a retpoline costs a guaranteed misprediction (~20 cycles), in our Spectre-workaround world the binary search is probably cheaper in many cases.
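
The suggestion can be sketched like this (targets and dispatcher invented for illustration; a linear chain is shown for clarity, where a large target set would use an address-ordered binary search):

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*fn)(int);

static int t0(int x) { return x + 0; }
static int t1(int x) { return x + 1; }
static int t2(int x) { return x + 2; }
static int t3(int x) { return x + 3; }

/* Every call site ends in a direct call, so there is no indirect
 * branch left to protect with a retpoline. */
static int dispatch(fn target, int arg)
{
	if (target == t0) return t0(arg);
	if (target == t1) return t1(arg);
	if (target == t2) return t2(arg);
	if (target == t3) return t3(arg);
	abort();	/* unknown target: CFI-style failure */
}
```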

Control-flow integrity in 5.13

Posted May 22, 2021 3:54 UTC (Sat) by Paf (subscriber, #91811) [Link] (6 responses)

That seems ... extremely unlikely to me? Can you give numbers for the cost of this binary search? Does it involve absolutely no possibility of chaining cache misses from step to step?

Control-flow integrity in 5.13

Posted May 22, 2021 5:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

It's not a bad idea, actually. In many places you might only have just a few targets.

Control-flow integrity in 5.13

Posted May 22, 2021 14:45 UTC (Sat) by Paf (subscriber, #91811) [Link] (1 responses)

Ok, but - numbers? I’m struggling to see how multiple jumps is better than a single mispredicted execution branch.

Control-flow integrity in 5.13

Posted May 25, 2021 18:00 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

I can't imagine a binary search working well, but it's not hard to believe a few hot cases checked linearly could work out. A mispredicted call is pretty expensive.

I'm pretty sure that several compilers use profile-guided "optimistic" devirtualization, which basically ends up with code like:

    if (call_target == very_common_target)
        very_common_target();
    else if (call_target == also_common_target)
        also_common_target();
    else
        (*call_target)();

And I've seen code like that manually written in plenty of places, with good success.

Control-flow integrity in 5.13

Posted May 22, 2021 16:04 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

It's a good question, so I wrote a microbenchmark to answer it to some extent (and wrote a little discussion of the whole topic). Unfortunately, especially wrt branch prediction, but also wrt cache misses, actual behaviour depends very much on actual usage, which is hard to model in a microbenchmark. So the performance numbers I give are from the best case (everything in cache, and perfect branch prediction). I also give some (theoretical) analysis of the cache-miss cases.

Anyway, with a plain indirect call (through a register) each iteration of the microbenchmark costs 5-6 cycles on a variety of CPUs I measured (from Sandy Bridge to Zen3), and a table-using checked indirect call costs 7-9 cycles; that's both without retpolines. The binary-search variant (for 7 target functions) costs 6-12 cycles. If you turn on retpolines, the plain indirect call costs 24-57 cycles, the table-using variant 26-60, and the binary-search variant still costs 6-12 cycles (there are no indirect branches, so no retpolines). Binary-search cost will grow with the number of targets, but you can have many targets before you reach 26-60 cycles, at least with perfectly predictable branches.

Concerning cache misses, the binary search for 7 targets fits in one I-cache line and incurs fewer cache misses (when it's not in the cache) than the table-using code (which incurs two cache misses in this case). You can probably cover ~42 targets with a maximum of two cache misses: first select among 6 ranges in one cache line, then among 7 in the final cache line. If you have more targets, you can have more cache misses than the table-using version. But of course, the question is if you spend more time on the cache misses for cold code, or more time in the retpolines in hot code.

Control-flow integrity in 5.13

Posted May 23, 2021 8:50 UTC (Sun) by jpfrancois (subscriber, #65948) [Link] (1 responses)

What about 1000 functions? How does it scale? Here is a count of files using struct file_operations:
https://elixir.bootlin.com/linux/v5.12.6/C/ident/file_ope...

net_device_ops goes into the 500.

I suppose other heavily used abstraction will suffer the same penalty.

Control-flow integrity in 5.13

Posted May 23, 2021 11:47 UTC (Sun) by anton (subscriber, #25547) [Link]

The number of files that a struct is used in does not tell us anything about the number of targets that clang might see as potential targets for a given call through a function pointer. With several dozen file system types, I expect that VFS operations will have several dozen target functions (and hopefully with unique signatures).

As for scaling, the basic scaling properties of divide-and-conquer searches are well known. The search time (and the number of nodes accessed) increases logarithmically with the size of the search space. If your question is about the constant factors, I can give better answers today than yesterday:

If we try to minimize the number of cache lines accessed (important for cold code), we get a B-tree-like characteristic, where we consider each cache line to be a tree node with 8 subtrees or (for leaf nodes) 8 target functions; in some cases, a little more is possible, giving us 73 targets for a two-level tree. Measuring such a tree access, I see that this dispatch costs, e.g. on Zen3, 8-10 cycles rather than 6-8 for the binary tree with 7 targets, so every level costs roughly 2 cycles. So if we have a four-level tree for 4096 targets, the total cost will be about 12-14 cycles (in the well-predicted and cached case) and the search will access 4 I-cache lines. If there are branch mispredictions, that would cause a lot of slowdown, however.

reverse inlining indirect call

Posted May 21, 2021 18:08 UTC (Fri) by ballombe (subscriber, #9523) [Link] (1 responses)

What about inlining the function performing the indirect call, instantiating the function pointer?
When applicable, this would remove the indirect call and its associated performance penalty.

reverse inlining indirect call

Posted May 22, 2021 3:55 UTC (Sat) by Paf (subscriber, #91811) [Link]

This assumes the value of that specific function pointer is known at compile time, doesn't it? If that's the case, then none of this is necessary at all.

Control-flow integrity in 5.13

Posted May 22, 2021 15:08 UTC (Sat) by dxin (guest, #136611) [Link]

I thought only Pixel phones use Clang to build the kernel, hence only Pixels have CFI enabled?

Control-flow integrity in 5.13

Posted May 22, 2021 15:11 UTC (Sat) by ale2018 (subscriber, #128727) [Link] (3 responses)

I'm not clear how an attacker is supposed to redirect a call to some other address than the function it was meant to reach. The example shows the check carried out in the code near the location of the call itself. It does nothing to prevent, say, returning from an overflowed stack, does it?

CFI is meant to defend against an attacker who is able to fiddle with jump tables in kernel memory, but neither with the bit arrays nor with the code itself (still in kernel memory), right? Or maybe it merely tries to impede the attacker by requiring coordinated changes in the jump table and in the bit array?

And how about compiling with GCC?

Control-flow integrity in 5.13

Posted May 22, 2021 16:17 UTC (Sat) by corbet (editor, #1) [Link] (1 responses)

As noted in the article, this change provides forward-edge protection. Protecting against return-address corruption (backward-edge) requires different techniques like shadow stacks.

The jump tables will be in read-only memory, which makes them a lot harder to overwrite.

Control-flow integrity in 5.13

Posted May 25, 2021 1:33 UTC (Tue) by cypherpunks2 (guest, #152408) [Link]

PaX RAP uses a different technique which does not require shadow stacks and is more performant. Sadly it is available for customers only right now.

Control-flow integrity in 5.13

Posted May 27, 2021 5:35 UTC (Thu) by wahern (subscriber, #37304) [Link]

Take an object like

    struct foo {
        char buf[64];

        int (*fptr)(int);

        struct {
            int (*fptr)(int);
        } *vtable;
    };

If an attacker can overflow (struct foo).buf, then they can rewrite the address of fptr or vtable to point wherever. The latter takes extra leg work to exploit, unless they know the address of the (struct foo) object, in which case they can just point vtable back into an area they already wrote, reducing it to the former case. There are more complex cases (e.g. involving integer indices into tables rather than raw pointers) but the basic problem is the same: deriving a function pointer through loads from writeable memory regions.

Control-flow integrity in 5.13

Posted May 25, 2021 16:52 UTC (Tue) by marcH (subscriber, #57642) [Link] (2 responses)

> __cfi_check(); this function receives, along with the target address, the address of the jump table matching the prototype of the called function. It will verify that the target address is, indeed, an address within the expected jump table, extract the real function address from the table, and jump to that address.

I don't understand what there is to "extract"; isn't the real/target address a __cfi_check() argument already? Or is there some indirection that I missed?

__cfi_check()

Posted May 25, 2021 16:58 UTC (Tue) by corbet (editor, #1) [Link] (1 responses)

The argument to __cfi_check() is an address in the jump table. Sorry if that wasn't clear.

__cfi_check()

Posted May 26, 2021 7:11 UTC (Wed) by marcH (subscriber, #57642) [Link]

Ah yes of course: much faster to check that the argument is within the given range compared to looking for some value in the whole jump table.

> Sorry if that wasn't clear.

I read again trying to understand how I got that wrong and I think it's because I assumed that the "target address" was the address of the "target function" mentioned earlier.

Control-flow integrity in 5.13

Posted May 25, 2021 17:53 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

> There doesn't seem to be much in the way of data regarding the performance impact of this feature, but the LLVM page describing CFI says that its cost is "less than 1%".

I have quite a hard time believing that, tbh. Not in the sense that I don't believe there are workloads for which it is true (probably lots), but that it's true in all "common" workloads. The dcache footprint alone makes me doubt this. It's not helped by the subsequent sentence on the LLVM page:

"Note that this scheme has not yet been optimized for binary size; an increase of up to 15% has been observed for Chromium."

There's *lots* of code that is primarily bound by icache misses. A 15% increase is pretty substantial.

I assume that the code size increase in the kernel would be lower than for chromium, which probably has a lot more vtables than linux has "callback structs" like file_operations.

Control-flow integrity in 5.13

Posted Jul 18, 2021 17:49 UTC (Sun) by mcortese (guest, #52099) [Link]

If I understand correctly, this is only available when link-time optimisation is enabled, which is not common.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds