Control-flow integrity in 5.13
Posted May 22, 2021 3:54 UTC (Sat) by Paf (subscriber, #91811)
In reply to: Control-flow integrity in 5.13 by anton
Parent article: Control-flow integrity in 5.13
Posted May 22, 2021 5:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
Posted May 22, 2021 14:45 UTC (Sat) by Paf (subscriber, #91811) [Link] (1 responses)
Posted May 25, 2021 18:00 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
I'm pretty sure that several compilers use profile-guided "optimistic" devirtualization, which basically ends up with code like: if (call_target == very_common_target) very_common_target(); else if (call_target == also_common_target) also_common_target(); else (*call_target)(). And I've seen code like that manually written in plenty of places, with good success.
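A minimal compilable sketch of that pattern (the function names, the hot/cold ordering that profiling would establish, and the driver in main() are all hypothetical):

    /* Sketch of profile-guided "optimistic" devirtualization: the hot
     * targets found by profiling are compared for first and called
     * directly; only cold targets fall back to the indirect call.
     * All names here are made up for illustration. */
    #include <stdio.h>

    typedef void (*op_fn)(void);

    static void very_common_target(void) { puts("hot target 1"); }
    static void also_common_target(void) { puts("hot target 2"); }
    static void rare_target(void)        { puts("cold target"); }

    static void dispatch(op_fn call_target)
    {
        if (call_target == very_common_target)
            very_common_target();   /* direct call; can be inlined */
        else if (call_target == also_common_target)
            also_common_target();   /* direct call; can be inlined */
        else
            call_target();          /* indirect call for the cold tail */
    }

    int main(void)
    {
        dispatch(very_common_target);
        dispatch(rare_target);
        return 0;
    }

The two hot targets become direct calls that the compiler can inline and the branch predictor handles cheaply; only the cold tail pays the full cost of the indirect call (and of a retpoline, where those are enabled).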
Posted May 22, 2021 16:04 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

It's a good question, so I wrote a microbenchmark to answer it to some extent (and wrote a little discussion of the whole topic). Unfortunately, especially wrt branch prediction, but also wrt cache misses, actual behaviour depends very much on actual usage, which is hard to model in a microbenchmark. So the performance numbers I give are from the best case (everything in cache, and perfect branch prediction). I also give some (theoretical) analysis of the cache-miss cases.
Anyway, with a plain indirect call (through a register) each iteration of the microbenchmark costs 5-6 cycles on a variety of CPUs I measured (from Sandy Bridge to Zen3), and a table-using checked indirect call costs 7-9 cycles; that's both without retpolines. The binary-search variant (for 7 target functions) costs 6-12 cycles. If you turn on retpolines, the plain indirect call costs 24-57 cycles, the table-using variant 26-60, and the binary-search variant still costs 6-12 cycles (there are no indirect branches, so no retpolines). Binary-search cost will grow with the number of targets, but you can have many targets before you reach 26-60 cycles, at least with perfectly predictable branches.
Concerning cache misses, the binary search for 7 targets fits in one I-cache line and incurs fewer cache misses (when it's not in the cache) than the table-using code (which incurs two cache misses in this case). You can probably cover ~42 targets with a maximum of two cache misses: first select among 6 ranges in one cache line, then among 7 in the final cache line. If you have more targets, you can have more cache misses than the table-using version. But of course, the question is if you spend more time on the cache misses for cold code, or more time in the retpolines in hot code.
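To make the measured dispatch concrete, here is a hedged C sketch of the binary-search idea for a hypothetical four-target set (the benchmark above used seven). Ordering function addresses through uintptr_t casts is implementation-defined in source-level C, which would not matter for a compiler emitting the compares at the machine level, and the trap stands in for whatever CFI-violation handling would actually be used:

    /* Sketch of binary-search dispatch over a known, sorted set of
     * call targets: conditional direct branches replace the indirect
     * call, so no retpoline is needed, and an address outside the set
     * is caught, combining the CFI check with the dispatch itself.
     * The targets f1..f4 are hypothetical, assumed sorted by address
     * (f1 < f2 < f3 < f4) as the compiler would arrange. */
    #include <stdint.h>

    typedef void (*op_fn)(void);

    extern void f1(void), f2(void), f3(void), f4(void);

    static void dispatch(op_fn target)
    {
        uintptr_t t = (uintptr_t)target;

        if (t < (uintptr_t)f3) {        /* left half: f1, f2 */
            if (t == (uintptr_t)f1)
                f1();                   /* direct call */
            else if (t == (uintptr_t)f2)
                f2();
            else
                __builtin_trap();       /* CFI violation */
        } else {                        /* right half: f3, f4 */
            if (t == (uintptr_t)f3)
                f3();
            else if (t == (uintptr_t)f4)
                f4();
            else
                __builtin_trap();
        }
    }

With more targets the tree simply gains levels, so the cost grows logarithmically, as discussed below.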
Posted May 23, 2021 8:50 UTC (Sun) by jpfrancois (subscriber, #65948) [Link] (1 responses)
https://elixir.bootlin.com/linux/v5.12.6/C/ident/file_ope...

net_device_ops goes into the 500s. I suppose other heavily used abstractions will suffer the same penalty.
Posted May 23, 2021 11:47 UTC (Sun) by anton (subscriber, #25547) [Link]

The number of files that a struct is used in does not tell us anything about how many functions clang might see as potential targets for a given call through a function pointer. With several dozen file system types, I expect that VFS operations will have several dozen target functions (and hopefully with unique signatures).
As for scaling, the basic scaling properties of divide-and-conquer searches are well known. The search time (and the number of nodes accessed) increases logarithmically with the size of the search space. If your question is about the constant factors, I can give better answers today than yesterday:
If we try to minimize the number of cache lines accessed (important for cold code), we get a B-tree-like characteristic, where we consider each cache line to be a tree node with 8 subtrees or (for leaf nodes) 8 target functions; in some cases, a little more is possible, giving us 73 targets for a two-level tree. Measuring such a tree access, I see that this dispatch costs, e.g. on Zen3, 8-10 cycles rather than 6-8 for the binary tree with 7 targets, so every level costs roughly 2 cycles. So if we have a four-level tree for 4096 targets, the total cost will be about 12-14 cycles (in the well-predicted and cached case) and the search will access 4 I-cache lines. If there are branch mispredictions, that would cause a lot of slowdown, however.
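A hedged sketch of one inner node of such a tree, with hypothetical separator values and child functions: three compare-and-branch steps select one of eight subtrees, and the whole node fits in a single 64-byte I-cache line (in real emitted code the separators would be immediates rather than loads):

    /* One inner node of the B-tree-like dispatch. s0..s6 are the
     * sorted separator addresses between the eight subranges of
     * target addresses; each child is another inner node, or a leaf
     * that ends in direct calls. All names are hypothetical. */
    #include <stdint.h>

    extern const uintptr_t s0, s1, s2, s3, s4, s5, s6;

    void child0(uintptr_t), child1(uintptr_t), child2(uintptr_t),
         child3(uintptr_t), child4(uintptr_t), child5(uintptr_t),
         child6(uintptr_t), child7(uintptr_t);

    void inner_node(uintptr_t t)
    {
        if (t < s3) {                   /* subtrees 0-3 */
            if (t < s1) { if (t < s0) child0(t); else child1(t); }
            else        { if (t < s2) child2(t); else child3(t); }
        } else {                        /* subtrees 4-7 */
            if (t < s5) { if (t < s4) child4(t); else child5(t); }
            else        { if (t < s6) child6(t); else child7(t); }
        }
    }

A leaf node has the same shape but ends in direct calls (or a trap for an address outside the valid set), so the four-level tree for 4096 targets described above touches at most four such cache lines on a cold path.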