Control-flow integrity in 5.13
Posted May 22, 2021 16:04 UTC (Sat) by anton (subscriber, #25547)
In reply to: Control-flow integrity in 5.13 by Paf
Parent article: Control-flow integrity in 5.13
It's a good question, so I wrote a microbenchmark to answer it to some extent (and wrote a little discussion of the whole topic). Unfortunately, the actual behaviour, especially of branch prediction but also of cache misses, depends very much on actual usage, which is hard to model in a microbenchmark. So the performance numbers I give are from the best case (everything in cache, and perfect branch prediction). I also give some (theoretical) analysis of the cache-miss cases.
Anyway, with a plain indirect call (through a register) each iteration of the microbenchmark costs 5-6 cycles on a variety of CPUs I measured (from Sandy Bridge to Zen3), and a table-using checked indirect call costs 7-9 cycles; that's both without retpolines. The binary-search variant (for 7 target functions) costs 6-12 cycles. If you turn on retpolines, the plain indirect call costs 24-57 cycles, the table-using variant 26-60, and the binary-search variant still costs 6-12 cycles (there are no indirect branches, so no retpolines). Binary-search cost will grow with the number of targets, but you can have many targets before you reach 26-60 cycles, at least with perfectly predictable branches.
Concerning cache misses, the binary search for 7 targets fits in one I-cache line and incurs fewer cache misses (when it's not in the cache) than the table-using code (which incurs two cache misses in this case). You can probably cover ~42 targets with a maximum of two cache misses: first select among 6 ranges in one cache line, then among 7 in the final cache line. If you have more targets, you can have more cache misses than the table-using version. But of course, the question is whether you spend more time on the cache misses for cold code, or more time in the retpolines in hot code.
