Indirect function calls — calls to a function whose address is stored in a pointer variable — have never been blindingly fast, but the Spectre hardware vulnerabilities have made things far worse. The indirect branch predictor used to speed up indirect calls in the CPU can no longer be used, and performance has suffered accordingly. The "retpoline" mechanism was a brilliant hack that proved faster than the hardware-based solutions that were tried at the beginning. While retpolines took a lot of the pain out of Spectre mitigation, experience over the last year has made it clear that they still hurt. It is thus not surprising that developers have been looking for alternatives to retpolines; several of them have shown up on the kernel lists recently.
The way to make an indirect call faster is to replace it with a direct call; that renders branch prediction unnecessary. Of course, if a direct call would have sufficed in any given situation, the developer would have used it rather than an indirect call, so this replacement is not always straightforward. All of the proposed solutions to retpoline overhead strive to do that replacement in one way or another, though; they vary from the simple to the complex.
The simplest method is often the best; that is the approach taken in Christoph Hellwig's patch set speeding up the DMA-mapping code. Setting up DMA buffers can involve a lot of architecture-specific trickery; the DMA mapping layer does its best to hide that trickery behind a common API. As is often the case in the kernel, the code in the middle uses a structure full of function pointers to direct a generic DMA call to the code that can implement it in any specific setting.
It turns out, though, that the most common case for DMA mapping is the simplest: the memory is simply directly mapped in both the CPU's and the device's address space with no particular trickery required. Hellwig's work takes advantage of that fact by testing for this case and calling the direct-mapping support code directly rather than going through a function pointer. So, for example, code that looks like this:
    addr = ops->map_page(...);
is transformed into something like:
    if (dma_is_direct(ops))
        addr = dma_direct_map_page(...);
    else
        addr = ops->map_page(...);
The cost of the if test is more than recouped in the direct-mapping case by avoiding the indirect function call (and it is tiny relative to the cost of that call in the other cases). Jesper Dangaard Brouer, who reported the performance hit in the DMA-mapping code, expressed his happiness at this change: "my XDP performance is back". Barring problems, this change seems likely to be merged sometime soon.
In some situations, an indirect function call will end up invoking one out of a relatively small list of known functions; a variant of the above approach can be used to test for each of the known alternatives and call the correct function directly. This patch set from Paolo Abeni implements that approach with a simple set of macros. If a given variable func can point to either of f1() or f2(), the indirect call can be avoided with code that looks like this:
    INDIRECT_CALLABLE_DECLARE(f1(args...));
    INDIRECT_CALLABLE_DECLARE(f2(args...));
    /* ... */
    INDIRECT_CALL_2(func, f2, f1, args...);
This code will expand to something like:
    if (func == f1)
        f1(args);
    else if (func == f2)
        f2(args);
    else
        (*func)(args);
Abeni's patch set is aimed at the network stack, so it contains some additional optimizations that can apply when the choice is between the IPv4 and IPv6 versions of a function. He claims a 10% or so improvement for a UDP generic receive offload (GRO) benchmark. Networking maintainer David Miller has indicated a willingness to accept this work, though the current patch set needs a couple of repairs before it can be merged.
Sometimes indirect calls reflect a mode of operation in the kernel that is not often changed; in such cases, the optimal approach might be to just turn the indirect call into a direct call and patch the code when the target must be changed. That is the approach taken by the static calls patch set from Josh Poimboeuf.
Imagine a global variable target that can hold a pointer to either of f1() or f2(). This variable could be declared as a static call with a declaration like:
    DEFINE_STATIC_CALL(target, f1);
Initially, target will point to f1(). Changing it to point to f2() requires a call like:
    static_call_update(target, f2);
Actually calling the function pointed to by target is done with static_call():
    static_call(target, args...);
Since changing the target of a call involves code patching, it is an expensive operation and should not be done often. One possible use case for static calls is tracepoints in the kernel, which can have an arbitrary function attached to them, but which are not often changed. Using a static call for that attachment can reduce the runtime overhead of enabling a tracepoint.
This patch set has been through a couple of revisions so far. It implements two different mechanisms. The first tracks all call sites for each static-call variable and patches each of them when the target changes; the second stores the target in a trampoline through which all calls jump. The motivations for the two approaches are not spelled out, but one can imagine that the direct calls will be a little faster, while the trampoline will be quicker and easier to patch when the target changes.
A rather more involved and general-purpose approach can be seen in this patch set posted by Nadav Amit in October. Rather than requiring developers to change indirect call sites by hand, Amit adds a new mechanism that optimizes indirect calls on the fly.
The patch set uses some "assembly macro magic" to change how every retpoline injected into the kernel works; the new version contains both fast and slow paths. The fast path is a test and direct call to the most frequently called target (hopefully) from any retpoline, while the slow path is the old retpoline mechanism. In the normal production mode, the fast path should mitigate the retpoline overhead in a large fraction of the calls from that site.
What makes this work interesting is the selection of the target for the fast path. Each "relpoline" (a name that was deemed too close to "retpoline" for comfort and which, as a result, may be renamed to something like "optpoline") starts out in a learning mode where it builds a hash table containing the actual target for each call that is made. After a sufficient number of calls, the most frequently called target is patched directly into the code, and the learning mode ends. To follow changing workloads, relpolines are put back into the learning mode after one minute of operation, a period that Amit says "might be too aggressive".
This mechanism has the advantage of optimizing all indirect calls, not just the ones identified as a problem by a developer. It can also operate on indirect calls added in loadable modules at any point during the system's operation. The results, he says, are "not bad"; they include a 10% improvement in an nginx benchmark. Even on a system with retpolines disabled, simply optimizing the indirect calls yields a 2% improvement for nginx. The downside, of course, is the addition of a fair amount of low-level complexity to implement this mechanism.
Response to this patch set has been muted but generally positive. There are, though, lots of suggestions on the details of how this mechanism would work. There may be further optimizations to be had by storing more than one common target, for example. The learning mechanism can probably benefit from some improvement. There was also a suggestion to use a GCC plugin rather than the macro magic to insert the new call mechanism into the kernel. As a result, the patch set is still under development and will likely take some time yet to be ready.
Various other developers have been working on the indirect call problem as well. Edward Cree, for example, has posted a patch set adding a simple learning mechanism to static calls. Nearly one year after the Spectre vulnerability was disclosed, the development community is clearly still trying to do something about the performance costs the Spectre mitigations have imposed.
The current round of fixes is trying to recover the performance lost when the indirect branch predictor was taken out of the picture. As Cree put it: "Essentially we're doing indirect branch prediction in software because the hardware can't be trusted to get it right; this is sad." Merging four different approaches (at least) to this problem may not be the best solution, especially since this particular vulnerability should eventually be fixed in the hardware, rendering all of the workarounds unnecessary. Your editor would not want to speculate on which of the above patches, if any, will make it into the mainline, though.
Relief for retpoline pain
Posted Dec 20, 2018 13:25 UTC (Thu) by mp (subscriber, #5615)
This comment seems to nicely illustrate the fact that "relpoline" is indeed a name too close to "retpoline" for comfort.
Relief for retpoline pain
Posted Dec 15, 2018 8:39 UTC (Sat) by zev (subscriber, #88455)
While I was working on it the results weren't quite dramatic enough to justify pursuing it further, but this was well before Spectre -- perhaps it just wasn't timed right...
Relief for retpoline pain
Posted Dec 15, 2018 9:55 UTC (Sat) by ibukanov (subscriber, #3942)
The indirect branch prediction on CPU made that optimization largely unnecessary, but now we are back to it as the prediction turned out to be a security nightmare.
Relief for retpoline pain
Posted Dec 18, 2018 18:27 UTC (Tue) by anton (subscriber, #25547)
> Indirect function calls [...] have never been blindingly fast

Actually, in my measurements correctly predicted indirect calls have been as fast as direct calls on Intel-compatible CPUs for a decade or two. That obviated the need for inline caching, so it's not surprising that all the papers on inline caching are more than two decades old.
Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds