Relief for retpoline pain
The way to make an indirect call faster is to replace it with a direct call; that renders branch prediction unnecessary. Of course, if a direct call would have sufficed in any given situation, the developer would have used it rather than an indirect call, so this replacement is not always straightforward. All of the proposed solutions to retpoline overhead strive to do that replacement in one way or another, though; they vary from the simple to the complex.
Speeding up DMA operations
The simplest method is often the best; that is the approach taken in Christoph Hellwig's patch set speeding up the DMA-mapping code. Setting up DMA buffers can involve a lot of architecture-specific trickery; the DMA mapping layer does its best to hide that trickery behind a common API. As is often the case in the kernel, the code in the middle uses a structure full of function pointers to direct a generic DMA call to the code that can implement it in any specific setting.
It turns out, though, that the most common case for DMA mapping is the simplest: the memory is simply directly mapped in both the CPU's and the device's address space with no particular trickery required. Hellwig's work takes advantage of that fact by testing for this case and calling the direct-mapping support code directly rather than going through a function pointer. So, for example, code that looks like this:
addr = ops->map_page(...);
is transformed into something like:
if (dma_is_direct(ops))
    addr = dma_direct_map_page(...);
else
    addr = ops->map_page(...);
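The dma_is_direct() test itself can be a cheap pointer check; a minimal sketch of how it might look (the exact form in Hellwig's patches may differ, and the dma_direct_ops name is an assumption here):

/* A sketch: the direct-mapped case is signaled by a well-known ops
 * structure (or by the absence of one). */
static inline bool dma_is_direct(const struct dma_map_ops *ops)
{
	return likely(!ops || ops == &dma_direct_ops);
}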
The cost of the if test is more than recouped in the direct-mapping case by avoiding the indirect function call (and it is tiny relative to the cost of that call in the other cases). Jesper Dangaard Brouer, who reported the performance hit in the DMA-mapping code, expressed his happiness at this change: "my XDP performance is back".
Barring problems, this change seems likely to be merged sometime soon.
Choosing from a list
In some situations, an indirect function call will end up invoking one out of a relatively small list of known functions; a variant of the above approach can be used to test for each of the known alternatives and call the correct function directly. This patch set from Paolo Abeni implements that approach with a simple set of macros. If a given variable func can point to either f1() or f2(), the indirect call can be avoided with code that looks like this:
INDIRECT_CALLABLE_DECLARE(f1(args...));
INDIRECT_CALLABLE_DECLARE(f2(args...));
/* ... */
INDIRECT_CALL_2(func, f2, f1, args...);
This code will expand to something like:
if (func == f1)
    f1(args);
else if (func == f2)
    f2(args);
else
    (*func)(args);
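The macros themselves need not be complicated; a minimal sketch of how INDIRECT_CALL_2() could be built (not necessarily Abeni's exact implementation) might read:

/* A sketch of the helpers; likely() is the kernel's branch-prediction
 * hint. Abeni's actual macros may differ in detail. */
#define INDIRECT_CALL_1(f, f1, ...)				\
	(likely((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

#define INDIRECT_CALL_2(f, f2, f1, ...)				\
	(likely((f) == (f2)) ? f2(__VA_ARGS__)			\
			     : INDIRECT_CALL_1(f, f1, __VA_ARGS__))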
Abeni's patch set is aimed at the network stack, so it contains some additional optimizations that can apply when the choice is between the IPv4 and IPv6 versions of a function. He claims a 10% or so improvement for a UDP generic receive offload (GRO) benchmark. Networking maintainer David Miller has indicated a willingness to accept this work, though the current patch set needs a couple of repairs before it can be merged.
Static calls
Sometimes indirect calls reflect a mode of operation in the kernel that is not often changed; in such cases, the optimal approach might be to just turn the indirect call into a direct call and patch the code when the target must be changed. That is the approach taken by the static calls patch set from Josh Poimboeuf.
Imagine a global variable target that can hold a pointer to either f1() or f2(). This variable could be declared as a static call with a declaration like:
DEFINE_STATIC_CALL(target, f1);
Initially, target will point to f1(). Changing it to point to f2() requires a call like:
static_call_update(target, f2);
Actually calling the function pointed to by target is done with static_call():
static_call(target, args...);
Since changing the target of a call involves code patching, it is an expensive operation and should not be done often. One possible use case for static calls is tracepoints in the kernel, which can have an arbitrary function attached to them, but which are not often changed. Using a static call for that attachment can reduce the runtime overhead of enabling a tracepoint.
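Putting the pieces together, a hypothetical user of this API (the tracing functions below are invented for illustration) might look like:

/* Hypothetical example of the proposed API; trace_nop() and
 * trace_verbose() are invented names. */
static void trace_nop(int event) { }
static void trace_verbose(int event) { pr_info("event %d\n", event); }

DEFINE_STATIC_CALL(trace_func, trace_nop);

static void handle_event(int event)
{
	/* Compiles to a direct call to the current target. */
	static_call(trace_func, event);
}

static void enable_tracing(void)
{
	/* Patches the call site(s) to invoke trace_verbose() instead. */
	static_call_update(trace_func, trace_verbose);
}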
This patch set has been through a couple of revisions so far. It implements two different mechanisms: the first tracks all call sites for each static-call variable and patches each of them when the target changes; the second stores the target in a trampoline that all call sites jump through. The motivations for the two approaches are not spelled out, but one can imagine that the direct calls will be a little faster, while the trampoline will be quicker and easier to patch when the target changes.
Relpolines/optpolines
A rather more involved and general-purpose approach can be seen in this patch set posted by Nadav Amit in October. Rather than requiring developers to change indirect call sites by hand, Amit adds a new mechanism that optimizes indirect calls on the fly.
The patch set uses some "assembly macro magic" to change how every retpoline injected into the kernel works; the new version contains both fast and slow paths. The fast path is a test and a direct call to the (hopefully) most frequently called target of any given retpoline, while the slow path is the old retpoline mechanism. In the normal production mode, the fast path should mitigate the retpoline overhead in a large fraction of the calls from that site.
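In C terms, an optimized call site through func behaves something like the following (a conceptual sketch; the patch set actually emits this logic as assembly, and "likely_target" is a name invented here for the patched-in target):

/* Conceptual shape of an optimized call site. The learning phase
 * (described below) chooses likely_target and patches it in. */
if (func == likely_target)
	likely_target(args);	/* fast path: direct call, no retpoline */
else
	(*func)(args);		/* slow path: via the old retpoline */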
What makes this work interesting is the selection of the target for the fast path. Each "relpoline" (a name that was deemed too close to "retpoline" for comfort and which, as a result, may be renamed to something like "optpoline") starts out in a learning mode where it builds a hash table containing the actual target for each call that is made. After a sufficient number of calls, the most frequently called target is patched directly into the code, and the learning mode ends. To follow changing workloads, relpolines are put back into the learning mode after one minute of operation, a period that Amit says "might be too aggressive".
This mechanism has the advantage of optimizing all indirect calls, not just the ones identified as a problem by a developer. It can also operate on indirect calls added in loadable modules at any point during the system's operation. The results, Amit says, are "not bad"; they include a 10% improvement in an nginx benchmark. Even on a system with retpolines disabled, simply optimizing the indirect calls yields a 2% improvement for nginx. The downside, of course, is the addition of a fair amount of low-level complexity to implement this mechanism.
Response to this patch set has been muted but generally positive. There are, though, lots of suggestions on the details of how this mechanism would work. There may be further optimizations to be had by storing more than one common target, for example. The learning mechanism can probably benefit from some improvement. There was also a suggestion to use a GCC plugin rather than the macro magic to insert the new call mechanism into the kernel. As a result, the patch set is still under development and will likely take some time yet to be ready.
What's next
Various other developers have been working on the indirect call problem as well. Edward Cree, for example, has posted a patch set adding a simple learning mechanism to static calls. Nearly one year after the Spectre vulnerability was disclosed, the development community is clearly still trying to do something about the performance costs the Spectre mitigations have imposed.
The current round of fixes is trying to recover the performance lost when the indirect branch predictor was taken out of the picture. As Cree put it: "Essentially we're doing indirect branch prediction in software because the hardware can't be trusted to get it right; this is sad". Merging four different approaches (at least) to this problem may not be the best solution, especially since this particular vulnerability should eventually be fixed in the hardware, rendering all of the workarounds unnecessary. Your editor would not want to speculate on which of the above patches, if any, will make it into the mainline, though.