PLT Rewriting
Posted Feb 12, 2024 22:19 UTC (Mon) by andresfreund (subscriber, #69562)
In reply to: PLT Rewriting by anton
Parent article: GNU C Library version 2.39
It's pretty easy to end up being bottlenecked by something other than what you're trying to measure if you're not careful.
> In any case, if I got a penny for every unsubstantiated claim of a performance advantage from a compiler maintainer or compiler supremacy advocate, ... And even when the original claim is debunked, often someone feels the urge to bring up another possible, but equally unsubstantiated performance advantage. This is not about you or jrtc27 in particular, I just have been in too many of these kinds of discussions.
FWIW, I've seen pretty clear production performance benefits from using -fno-plt, even though that "just" eliminates a single direct call by moving the load from the GOT inline into the callers.
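To make the mechanism concrete (this is an illustration, not a measurement), here is a minimal C sketch of a call into a shared library. `parse_decimal` is a hypothetical wrapper invented for this example, and the instruction forms in the comment are what GCC typically emits for x86-64 ELF position-independent code; other compilers and targets will differ.

```c
/*
 * Assumed build commands (GCC, x86-64 ELF); inspect the generated assembly:
 *
 *   cc -O2 -fPIC -S demo.c
 *       the call goes through the PLT stub:   call/jmp strtol@PLT
 *       (the stub then jumps indirectly via the GOT entry)
 *
 *   cc -O2 -fPIC -fno-plt -S demo.c
 *       the GOT load is inlined in the caller: call/jmp *strtol@GOTPCREL(%rip)
 */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper: its only job is to call a libc function that
 * lives in a shared object, so the compiler has to pick a call sequence. */
long parse_decimal(const char *s)
{
    assert(s != NULL);
    return strtol(s, NULL, 10);
}
```

The behavior of the program is identical either way; only the call sequence differs, which is why any benefit has to be established by measuring a real workload.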
Posted Feb 14, 2024 14:06 UTC (Wed) by anton (subscriber, #25547)
My feeling is that -fno-plt yields a larger performance benefit than PLT rewriting, because the latter just replaces a perfectly predictable indirect jump with a direct jump (which is often equally expensive). But of course my feelings are just in the "unsubstantiated claims" department, and we need proper and reproducible measurements to get something better.
Concerning the supposedly hard-to-predict and therefore slow indirect jump that always jumps to the same place: I would like to see how "bottlenecked by something other" explains the original threaded-code microbenchmark results on CPUs with BTBs (e.g., the Pentium, compare with the BTB-less 486), or the v2 results on recent CPUs. There, for example, direct-threaded code achieves 1 indirect branch per cycle on Tiger Lake, even though only half of the executed indirect branches in v2 always jump to the same target, while the other half vary among 5 targets; modern history-based indirect branch predictors predict those perfectly, too.
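For readers unfamiliar with the term, the following is a minimal sketch of direct-threaded dispatch, assuming the GNU C labels-as-values extension (supported by GCC and Clang). It is not the benchmark referred to above; the three-instruction "VM" and the name `run_countdown` are hypothetical and exist only to show where the indirect branches sit.

```c
#include <assert.h>

/* Runs a tiny threaded-code program that counts n down to zero and
 * returns the number of VM instructions executed. */
long run_countdown(long n)
{
    /* Threaded code: the program is an array of handler addresses. */
    void *program[] = { &&op_dec, &&op_loop, &&op_halt };
    void **ip = program;
    long executed = 0;

    assert(n >= 0);

/* Every handler ends with its own indirect branch to the next handler. */
#define NEXT goto *(*ip++)

    NEXT;

op_dec:                 /* the branch after this handler always hits op_loop */
    n--;
    executed++;
    NEXT;

op_loop:                /* the branch after this handler varies: op_dec or op_halt */
    executed++;
    if (n > 0)
        ip = program;   /* restart the program from the first instruction */
    NEXT;

op_halt:
    executed++;
    return executed;

#undef NEXT
}
```

For example, `run_countdown(3)` executes `op_dec` and `op_loop` three times each and then `op_halt`, returning 7. One dispatch site here has a fixed target and another does not, which loosely mirrors the mix of indirect branches described above, although a real measurement needs a far longer program and hardware performance counters.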