Control-flow integrity in 5.13
Posted May 22, 2021 3:54 UTC (Sat) by Paf (subscriber, #91811)
In reply to: Control-flow integrity in 5.13 by anton
Parent article: Control-flow integrity in 5.13
Posted May 22, 2021 5:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
Posted May 22, 2021 14:45 UTC (Sat) by Paf (subscriber, #91811) [Link] (1 responses)
Posted May 25, 2021 18:00 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
I'm pretty sure that several compilers use profile-guided "optimistic" devirtualization, which basically ends up with code like: if (call_target == very_common_target) very_common_target(); else if (call_target == also_common_target) also_common_target(); else (*call_target)(). And I've seen code like that manually written in plenty of places, with good success.
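A minimal compilable sketch of that pattern (the function names, the hot/cold ordering that profiling would establish, and the driver in main() are all hypothetical):

    /* Sketch of profile-guided "optimistic" devirtualization: the hot
     * targets found by profiling are compared for first and called
     * directly; only cold targets fall back to the indirect call.
     * All names here are made up for illustration. */
    #include <stdio.h>

    typedef void (*op_fn)(void);

    static void very_common_target(void) { puts("hot target 1"); }
    static void also_common_target(void) { puts("hot target 2"); }
    static void rare_target(void)        { puts("cold target"); }

    static void dispatch(op_fn call_target)
    {
        if (call_target == very_common_target)
            very_common_target();   /* direct call; can be inlined */
        else if (call_target == also_common_target)
            also_common_target();   /* direct call; can be inlined */
        else
            call_target();          /* indirect call for the cold tail */
    }

    int main(void)
    {
        dispatch(very_common_target);
        dispatch(rare_target);
        return 0;
    }

The two hot targets become direct calls that the compiler can inline and the branch predictor handles cheaply; only the cold tail pays the full cost of the indirect call (and of a retpoline, where those are enabled).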
Posted May 22, 2021 16:04 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

It's a good question, so I wrote a microbenchmark to answer it to some extent (and wrote a little discussion of the whole topic). Unfortunately, especially wrt branch prediction, but also wrt cache misses, actual behaviour depends very much on actual usage, which is hard to model in a microbenchmark. So the performance numbers I give are from the best case (everything in cache, and perfect branch prediction). I also give some (theoretical) analysis of the cache-miss cases.
Anyway, with a plain indirect call (through a register) each iteration of the microbenchmark costs 5-6 cycles on a variety of CPUs I measured (from Sandy Bridge to Zen3), and a table-using checked indirect call costs 7-9 cycles; that's both without retpolines. The binary-search variant (for 7 target functions) costs 6-12 cycles. If you turn on retpolines, the plain indirect call costs 24-57 cycles, the table-using variant 26-60, and the binary-search variant still costs 6-12 cycles (there are no indirect branches, so no retpolines). Binary-search cost will grow with the number of targets, but you can have many targets before you reach 26-60 cycles, at least with perfectly predictable branches.
Concerning cache misses, the binary search for 7 targets fits in one I-cache line and incurs fewer cache misses (when it's not in the cache) than the table-using code (which incurs two cache misses in this case). You can probably cover ~42 targets with a maximum of two cache misses: first select among 6 ranges in one cache line, then among 7 in the final cache line. If you have more targets, you can have more cache misses than the table-using version. But of course, the question is if you spend more time on the cache misses for cold code, or more time in the retpolines in hot code.
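To make the measured dispatch concrete, here is a hedged C sketch of the binary-search idea for a hypothetical four-target set (the benchmark above used seven). Ordering function addresses through uintptr_t casts is implementation-defined in source-level C, which would not matter for a compiler emitting the compares at the machine level, and the trap stands in for whatever CFI-violation handling would actually be used:

    /* Sketch of binary-search dispatch over a known, sorted set of
     * call targets: conditional direct branches replace the indirect
     * call, so no retpoline is needed, and an address outside the set
     * is caught, combining the CFI check with the dispatch itself.
     * The targets f1..f4 are hypothetical, assumed sorted by address
     * (f1 < f2 < f3 < f4) as the compiler would arrange. */
    #include <stdint.h>

    typedef void (*op_fn)(void);

    extern void f1(void), f2(void), f3(void), f4(void);

    static void dispatch(op_fn target)
    {
        uintptr_t t = (uintptr_t)target;

        if (t < (uintptr_t)f3) {        /* left half: f1, f2 */
            if (t == (uintptr_t)f1)
                f1();                   /* direct call */
            else if (t == (uintptr_t)f2)
                f2();
            else
                __builtin_trap();       /* CFI violation */
        } else {                        /* right half: f3, f4 */
            if (t == (uintptr_t)f3)
                f3();
            else if (t == (uintptr_t)f4)
                f4();
            else
                __builtin_trap();
        }
    }

With more targets the tree simply gains levels, so the cost grows logarithmically, as discussed below.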
Posted May 23, 2021 8:50 UTC (Sun) by jpfrancois (subscriber, #65948) [Link] (1 responses)
https://elixir.bootlin.com/linux/v5.12.6/C/ident/file_ope...

net_device_ops goes into the 500s. I suppose other heavily used abstractions will suffer the same penalty.
Posted May 23, 2021 11:47 UTC (Sun) by anton (subscriber, #25547) [Link]

The number of files that a struct is used in does not tell us anything about how many functions clang might see as potential targets for a given call through a function pointer. With several dozen file system types, I expect that VFS operations will have several dozen target functions (and hopefully with unique signatures).
As for scaling, the basic scaling properties of divide-and-conquer searches are well known. The search time (and the number of nodes accessed) increases logarithmically with the size of the search space. If your question is about the constant factors, I can give better answers today than yesterday:
If we try to minimize the number of cache lines accessed (important for cold code), we get a B-tree-like characteristic, where we consider each cache line to be a tree node with 8 subtrees or (for leaf nodes) 8 target functions; in some cases, a little more is possible, giving us 73 targets for a two-level tree. Measuring such a tree access, I see that this dispatch costs, e.g. on Zen3, 8-10 cycles rather than 6-8 for the binary tree with 7 targets, so every level costs roughly 2 cycles. So if we have a four-level tree for 4096 targets, the total cost will be about 12-14 cycles (in the well-predicted and cached case) and the search will access 4 I-cache lines. If there are branch mispredictions, that would cause a lot of slowdown, however.
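A hedged sketch of one inner node of such a tree, with hypothetical separator values and child functions: three compare-and-branch steps select one of eight subtrees, and the whole node fits in a single 64-byte I-cache line (in real emitted code the separators would be immediates rather than loads):

    /* One inner node of the B-tree-like dispatch. s0..s6 are the
     * sorted separator addresses between the eight subranges of
     * target addresses; each child is another inner node, or a leaf
     * that ends in direct calls. All names are hypothetical. */
    #include <stdint.h>

    extern const uintptr_t s0, s1, s2, s3, s4, s5, s6;

    void child0(uintptr_t), child1(uintptr_t), child2(uintptr_t),
         child3(uintptr_t), child4(uintptr_t), child5(uintptr_t),
         child6(uintptr_t), child7(uintptr_t);

    void inner_node(uintptr_t t)
    {
        if (t < s3) {                   /* subtrees 0-3 */
            if (t < s1) { if (t < s0) child0(t); else child1(t); }
            else        { if (t < s2) child2(t); else child3(t); }
        } else {                        /* subtrees 4-7 */
            if (t < s5) { if (t < s4) child4(t); else child5(t); }
            else        { if (t < s6) child6(t); else child7(t); }
        }
    }

A leaf node has the same shape but ends in direct calls (or a trap for an address outside the valid set), so the four-level tree for 4096 targets described above touches at most four such cache lines on a cold path.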