
The cost of inline functions

The kernel makes heavy use of inline functions. In many cases, inline expansion of functions is necessary; some of these functions employ various sorts of assembly language trickery that must be part of the calling function. In many other cases, though, inline functions are used as a way of improving performance. The thinking is that, by eliminating the overhead of performing actual function calls, inline functions can make things go faster.

The truth turns out not to be so simple. Consider, for example, this patch from Stephen Hemminger which removes the inline attribute from a set of functions for dealing with socket buffers ("SKBs", the structure used to represent network packets inside the kernel). Stephen ran some benchmarks after applying his patch; those benchmarks ran 3% faster than they did with the functions being expanded inline.

The problem with inline functions is that they replicate the function body every time they are called. Each use of an inline function thus makes the kernel executable bigger. A bigger executable means more cache misses, and that slows things down. The SKB functions are called in many places all over the networking code. Each one of those calls creates a new copy of the function; Denis Vlasenko recently discovered that many of them expand to over 100 bytes of code. The result is that, while many places in the kernel are calling the same function, each one is working with its own copy. And each copy takes space in the processor instruction cache. That cache usage hurts; each cache miss costs more than a function call.

Thus, the kernel hackers are taking a harder look at inline function declarations than they used to. An inline function may seem like it should be faster, but that is not necessarily the case. The "time/space tradeoff" taught in many computer science classes often turns out not to hold in the real world. Many times, smaller is also faster.

Index entries for this article
Kernel: Coding style
Kernel: Inline functions



The cost of inline functions

Posted Apr 29, 2004 2:10 UTC (Thu) by jamesm (guest, #2273)

It's interesting to also note that the HTTP workload was not affected in a meaningful way.

The cost of inline functions

Posted Apr 29, 2004 13:59 UTC (Thu) by alspnost (guest, #2763)

This is presumably the same reasoning behind the frequent observation that -Os optimised binaries are faster than -O2 ones in many cases?

Time/space tradeoff

Posted Apr 29, 2004 17:55 UTC (Thu) by jzbiciak (guest, #5246)

There still is a time/space tradeoff of sorts, it's just that time as a function of space is not a monotonically decreasing function.

That's been true for a long time. The difference is that the crossover between negative slope (bigger space == less time) and positive slope (bigger space == more time) keeps moving closer and closer.

One of the benefits of inlining, aside from eliminating the call/return, is that it opens new optimization opportunities by optimizing across the caller/callee boundary. In effect, it allows the called function to be specialized for the context from which it was called. For instance, one of the operands to a function might be a flag that enables/disables some feature controlled by the function. If that flag is a constant in the call, entire codepaths from the callee might become dead code.

It would be interesting to see GCC start specializing functions in this manner without having to inline them, so we keep this secondary benefit while avoiding code bloat. Of course, this is relevant only if GCC can see multiple callers that would benefit from the same specializations. For instance, how many times is kmalloc called with "GFP_KERNEL"? Many. Would an automatic specialization for kmalloc(size, GFP_KERNEL) result in a performance benefit? Possibly.

Time/space tradeoff

Posted May 6, 2004 6:41 UTC (Thu) by joib (subscriber, #8541) [Link]

gcc 3.4 has some optimizations in this area. From http://gcc.gnu.org/gcc-3.4/changes.html:

# A new unit-at-a-time compilation scheme for C, Objective-C, C++ and Java which is enabled via -funit-at-a-time (and implied by -O2). In this scheme a whole file is parsed first and optimized later. The following basic inter-procedural optimizations are implemented:

* Removal of unreachable functions and variables
* Discovery of local functions (functions with static linkage whose address is never taken)
  * On i386, these local functions use register parameter passing conventions.
* Reordering of functions in topological order of the call graph to enable better propagation of optimizing hints (such as the stack alignments needed by functions) in the back end.
* Call graph based out-of-order inlining heuristics which allows to limit overall compilation unit growth (--param inline-unit-growth).

Overall, the unit-at-a-time scheme produces a 1.3% improvement for the SPECint2000 benchmark on the i386 architecture (AMD Athlon CPU).
# More realistic code size estimates used by inlining for C, Objective-C, C++ and Java. The growth of large functions can now be limited via --param large-function-insns and --param large-function-growth.

The cost of inline functions

Posted Apr 30, 2004 14:14 UTC (Fri) by oak (guest, #2786) [Link]

An inline function can make the code smaller if the inlined body (e.g. a single struct lookup) is smaller than the instructions required to set up a function call.

The cost of inline functions

Posted May 3, 2004 8:20 UTC (Mon) by eru (subscriber, #2753)

Doesn't this also depend a lot on the processor architecture? I used to work with a SPARC and there I found that in the GCC of the time, turning on automatic inlining was usually a pessimization for my applications. I assumed that this is because non-inlined calls to simple functions on the SPARC chips are cheap. Most arguments are passed in registers, and the register window mechanism greatly reduces the instructions needed to save/restore caller's register-allocated variables.

I would argue that "inline" as a language feature is just like the "register" storage class. It should not be used unless inlining really is necessary for low-level reasons, and normally the compiler should be left to make inlining decisions based on its knowledge of the target processor trade-offs.

The cost of inline functions

Posted May 9, 2004 7:54 UTC (Sun) by hs (guest, #15495)

Not necessarily. It depends a lot on the function code and on what the optimizer does: in some situations, things like constant propagation and common subexpression elimination can make most of the inlined code go away.

With some optimizations, it takes good judgment to decide whether to activate them or not.


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds