LWN: Comments on "Control-flow integrity in 5.13" https://lwn.net/Articles/856514/ This is a special feed containing comments posted to the individual LWN article titled "Control-flow integrity in 5.13". en-us Tue, 04 Nov 2025 21:21:24 +0000 Tue, 04 Nov 2025 21:21:24 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Control-flow integrity in 5.13 https://lwn.net/Articles/863335/ https://lwn.net/Articles/863335/ mcortese If I understand correctly, this is only available when link-time optimisation is enabled, which is not common. Sun, 18 Jul 2021 17:49:12 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857421/ https://lwn.net/Articles/857421/ wahern <p>Take an object like</p> <pre> struct foo { char buf[64]; int (*fptr)(int); struct { int (*fptr)(int); } *vtable; }; </pre> <p> If an attacker can overflow (struct foo).buf, then they can rewrite the address of fptr or vtable to point wherever. The latter takes extra leg work to exploit, unless they know the address of the (struct foo) object, in which case they can just point vtable back into an area they already wrote, reducing it to the former case. There are more complex cases (e.g. involving integer indices into tables rather than raw pointers) but the basic problem is the same: deriving a function pointer through loads from writeable memory regions. 
</p> Thu, 27 May 2021 05:35:05 +0000 __cfi_check() https://lwn.net/Articles/857268/ https://lwn.net/Articles/857268/ marcH <div class="FormattedComment"> Ah yes of course: much faster to check that the argument is within the given range compared to looking for some value in the whole jump table.<br> <p> <font class="QuotedText">&gt; Sorry if that wasn&#x27;t clear.</font><br> <p> I read again trying to understand how I got that wrong and I think it&#x27;s because I assumed that the &quot;target address&quot; was the address of the &quot;target function&quot; mentioned earlier.<br> <p> </div> Wed, 26 May 2021 07:11:07 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857230/ https://lwn.net/Articles/857230/ andresfreund <div class="FormattedComment"> I can&#x27;t imagine a binary search working well, but it&#x27;s not hard to believe a few hot cases checked linearly could work out. A mispredicted call is pretty expensive.<br> <p> I&#x27;m pretty sure that several compilers use profile-guided &quot;optimistic&quot; devirtualization, which basically ends up with code like<br> if (call_target == very_common_target) very_common_target(); else if (call_target == also_common_target) also_common_target(); else (*call_target)(). And I&#x27;ve seen code like that manually written in plenty of places, with good success.<br> </div> Tue, 25 May 2021 18:00:41 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857229/ https://lwn.net/Articles/857229/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; There doesn&#x27;t seem to be much in the way of data regarding the performance impact of this feature, but the LLVM page describing CFI says that its cost is &quot;less than 1%&quot;. </font><br> <p> I have quite a hard time believing that, tbh. Not in the sense that I don&#x27;t believe that there are no workloads in which that is true (probably lots), but that it&#x27;s true in all &quot;common&quot; workloads.
The dcache footprint alone makes me doubt this. It&#x27;s not helped by the subsequent sentence in the LLVM page:<br> <p> &quot;Note that this scheme has not yet been optimized for binary size; an increase of up to 15% has been observed for Chromium.&quot;<br> <p> There&#x27;s *lots* of code that is primarily bound by icache misses. A 15% increase is pretty substantial.<br> <p> I assume that the code size increase in the kernel would be lower than for chromium, which probably has a lot more vtables than linux has &quot;callback structs&quot; like file_operations.<br> <p> </div> Tue, 25 May 2021 17:53:47 +0000 __cfi_check() https://lwn.net/Articles/857221/ https://lwn.net/Articles/857221/ corbet The argument to <tt>__cfi_check()</tt> is an address <i>in the jump table</i>. Sorry if that wasn't clear. Tue, 25 May 2021 16:58:00 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857219/ https://lwn.net/Articles/857219/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; __cfi_check(); this function receives, along with the target address, the address of the jump table matching the prototype of the called function. It will verify that the target address is, indeed, an address within the expected jump table, extract the real function address from the table, and jump to that address.</font><br> <p> I don&#x27;t understand what there is to &quot;extract&quot;; isn&#x27;t the real/target address a __cfi_check() argument already? Or is there some indirection that I missed?<br> <p> <p> </div> Tue, 25 May 2021 16:52:39 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857160/ https://lwn.net/Articles/857160/ cypherpunks2 <div class="FormattedComment"> PaX RAP uses a different technique which does not require shadow stacks and is more performant. 
Sadly it is available for customers only right now.<br> </div> Tue, 25 May 2021 01:33:50 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857033/ https://lwn.net/Articles/857033/ anton The number of files that a struct is used in does not tell us anything about the number of targets that clang might see as potential targets for a given call through a function pointer. With several dozen file system types, I expect that VFS operations will have several dozen target functions (and hopefully with unique signatures). <p>As for scaling, the basic scaling properties of divide-and-conquer searches are well known. The search time (and the number of nodes accessed) increases logarithmically with the size of the search space. If your question is about the constant factors, I can give better answers today than yesterday: <p>If we try to minimize the number of cache lines accessed (important for cold code), we get a B-tree-like characteristic, where we consider each cache line to be a tree node with 8 subtrees or (for leaf nodes) 8 target functions; in some cases, a little more is possible, giving us 73 targets for a two-level tree. Measuring such a tree access, I see that this dispatch costs, e.g. on Zen3, 8-10 cycles rather than 6-8 for the binary tree with 7 targets, so every level costs roughly 2 cycles. So if we have a four-level tree for 4096 targets, the total cost will be about 12-14 cycles (in the well-predicted and cached case) and the search will access 4 I-cache lines. If there are branch mispredictions, that would cause a lot of slowdown, however. Sun, 23 May 2021 11:47:41 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/857025/ https://lwn.net/Articles/857025/ jpfrancois <div class="FormattedComment"> What about 1000 functions ? How does it scale ? 
Here is a count of files using struct file_operations.<br> <a href="https://elixir.bootlin.com/linux/v5.12.6/C/ident/file_operations">https://elixir.bootlin.com/linux/v5.12.6/C/ident/file_ope...</a><br> <p> net_device_ops goes into the 500s.<br> <p> I suppose other heavily used abstractions will suffer the same penalty.<br> </div> Sun, 23 May 2021 08:50:47 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856981/ https://lwn.net/Articles/856981/ corbet As noted in the article, this change provides forward-edge protection. Protecting against return-address corruption (backward-edge) requires different techniques like shadow stacks. <p> The jump tables will be in read-only memory, which makes them a lot harder to overwrite. Sat, 22 May 2021 16:17:49 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856980/ https://lwn.net/Articles/856980/ anton It's a good question, so I wrote <a href="http://www.complang.tuwien.ac.at/anton/function-dispatch/">a microbenchmark</a> to answer it to some extent (and wrote a little discussion of the whole topic). Unfortunately, especially wrt branch prediction, but also wrt cache misses, actual behaviour depends very much on actual usage, which is hard to model in a microbenchmark. So the performance numbers I give are from the best case (everything in cache, and perfect branch prediction). I also give some (theoretical) analysis of the cache-miss cases. <p>Anyway, with a plain indirect call (through a register) each iteration of the microbenchmark costs 5-6 cycles on a variety of CPUs I measured (from Sandy Bridge to Zen3), and a table-using checked indirect call costs 7-9 cycles; that's both without retpolines. The binary-search variant (for 7 target functions) costs 6-12 cycles. If you turn on retpolines, the plain indirect call costs 24-57 cycles, the table-using variant 26-60, and the binary-search variant still costs 6-12 cycles (there are no indirect branches, so no retpolines).
Binary-search cost will grow with the number of targets, but you can have many targets before you reach 26-60 cycles, at least with perfectly predictable branches. <p>Concerning cache misses, the binary search for 7 targets fits in one I-cache line and incurs fewer cache misses (when it's not in the cache) than the table-using code (which incurs two cache misses in this case). You can probably cover ~42 targets with a maximum of two cache misses: first select among 6 ranges in one cache line, then among 7 in the final cache line. If you have more targets, you can have more cache misses than the table-using version. But of course, the question is if you spend more time on the cache misses for cold code, or more time in the retpolines in hot code. Sat, 22 May 2021 16:04:44 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856975/ https://lwn.net/Articles/856975/ ale2018 <p>I'm not clear how an attacker is supposed to redirect a call to some other address than the function it was meant to reach. The <a href="https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html">example</a> shows the check carried out in the code near the location of the call itself. It does nothing to prevent, say, returning from an overflowed stack, does it? <p>CFI is meant to defend against an attacker who is able to fiddle with jump tables in kernel memory, but neither with the bit arrays nor with the code itself (still in kernel memory), right? Or maybe it merely tries to impede the attacker by requiring coordinated changes in the jump table and in the bit array? <p>And how about compiling with GCC? 
Sat, 22 May 2021 15:11:37 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856978/ https://lwn.net/Articles/856978/ dxin <div class="FormattedComment"> I thought only Pixel phones use clang to build the kernel, hence only Pixels have CFI enabled?<br> </div> Sat, 22 May 2021 15:08:56 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856974/ https://lwn.net/Articles/856974/ Paf <div class="FormattedComment"> Ok, but - numbers? I’m struggling to see how multiple jumps are better than a single mispredicted execution branch.<br> </div> Sat, 22 May 2021 14:45:25 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856961/ https://lwn.net/Articles/856961/ Cyberax <div class="FormattedComment"> It&#x27;s not a bad idea, actually. In many places you might only have just a few targets. <br> </div> Sat, 22 May 2021 05:00:15 +0000 reverse inlining indirect call https://lwn.net/Articles/856960/ https://lwn.net/Articles/856960/ Paf <div class="FormattedComment"> This assumes the value of that specific function pointer is known at compile time, doesn’t it? If that’s the case, then none of this is necessary at all.<br> </div> Sat, 22 May 2021 03:55:32 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856959/ https://lwn.net/Articles/856959/ Paf <div class="FormattedComment"> That seems ... extremely unlikely to me? Can you give numbers for the cost of this binary search?
Does it involve absolutely no possibility of chaining cache misses from step to step?<br> </div> Sat, 22 May 2021 03:54:19 +0000 reverse inlining indirect call https://lwn.net/Articles/856933/ https://lwn.net/Articles/856933/ ballombe <div class="FormattedComment"> What about inlining the function performing the indirect call, instantiating the function pointer?<br> When applicable, this would remove the indirect call and the associated performance penalty.<br> <p> <p> <p> </div> Fri, 21 May 2021 18:08:08 +0000 Control-flow integrity in 5.13 https://lwn.net/Articles/856930/ https://lwn.net/Articles/856930/ anton Given that all targets of each indirect call are known, instead of a checked use of an indirect jump table, the indirect call could be replaced by a hard-coded binary search among the possible targets, and finally a direct call. The comparisons and conditional branches of this search cost something, but given that a retpoline costs a guaranteed misprediction (~20 cycles), in our Spectre-workaround world the binary search is probably cheaper in many cases. Fri, 21 May 2021 16:38:02 +0000