
Link-time optimization for the kernel

By Jonathan Corbet
August 21, 2012
The kernel tends to place an upper limit on how quickly any given workload can run, so it is unsurprising that kernel developers are always on the lookout for ways to make the system go faster. Significant amounts of work can be put into optimizations that, on the surface, seem small. So when the opportunity comes to make the kernel go faster without the need to rewrite any performance-critical code paths, there will naturally be a fair amount of interest. Whether the "link-time optimization" (LTO) feature supported by recent versions of GCC is such an opportunity or not is yet to be proved, but Andi Kleen is determined to find out.

The idea behind LTO is to examine the entire program after the individual files have been compiled and exploit any additional optimization opportunities that appear. The most significant of those opportunities appears to be the inlining of small functions across object files. The compiler can also be more aggressive about detecting and eliminating unused code and data. Under the hood, LTO works by dumping the compiler's intermediate representation (the "GIMPLE" code) into the resulting object file whenever a source file is compiled. The actual LTO stage is then carried out by loading all of the GIMPLE code into a single in-core image and generating new, (presumably) better-optimized object code.
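
As a concrete (if simplified) illustration of the cross-file case, consider two hypothetical source files; the file names and functions here are invented for illustration:

    /* counter.c */
    int counter;

    void bump(void)
    {
            counter++;
    }

    /* main.c */
    extern int counter;
    extern void bump(void);

    int main(void)
    {
            bump();         /* compiled per file, this must be a real call */
            return counter;
    }

Built the usual way, the compiler sees main.c in isolation and has to emit a real call to bump(). Built with -flto on both the compile and link commands, the GIMPLE for both files is available at link time, and the call can be reduced to a single increment of counter.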

The LTO feature first appeared in GCC 4.5, but it has only really started to become useful in the 4.7 release. It still has a number of limitations; one of those is that all of the object files involved must be compiled with the same set of command-line options. That limitation turns out to be a problem with the kernel, as will be seen below.

Andi's LTO patch set weighs in at 74 changesets — not a small or unintrusive change. But it turns out that most of the changes have the same basic scope: ensuring that the compiler knows that specific symbols are needed even if they appear to be unused; that prevents the LTO stage from optimizing them away. For example, symbols exported to modules may not have any callers in the core kernel itself, but they need to be preserved for modules that may be loaded later. To that end, Andi's first patch defines a new attribute (__visible) used to mark such symbols; most of the remaining patches are dedicated to the addition of __visible attributes where they are needed.
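
Stripped of the compiler-version conditionals in the actual patch, the idea maps onto GCC's externally_visible attribute, roughly like this (the function name below is hypothetical):

    /* Sketch: tell GCC (and thus the LTO pass) that a symbol is
     * referenced from outside the code being compiled, so it must not
     * be optimized away or localized. */
    #define __visible __attribute__((externally_visible))

    /* A hypothetical exported function with no callers in the core kernel: */
    __visible void example_driver_hook(void)
    {
            /* ... */
    }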

Beyond that, there is a small set of fixes for specific problems encountered when building kernels with LTO. It seems that functions with long argument lists can get their arguments corrupted if the functions are inlined during the LTO stage; avoiding that requires marking the functions noinline. Andi complains "I wish there was a generic way to handle this. Seems like a ticking time bomb problem." In general, he acknowledges the possibility that LTO may introduce new, optimization-related bugs into the kernel; finding all of those could be a challenge.
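
The workaround amounts to annotations of roughly this form (the function and its arguments are invented for illustration):

    #define noinline __attribute__((noinline))  /* as in <linux/compiler.h> */

    /* Keep the LTO stage from inlining a function whose long argument
     * list is known to be corrupted when inlined: */
    static noinline long fragile_op(long a, long b, long c,
                                    long d, long e, long f)
    {
            return a + b + c + d + e + f;
    }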

Then there is the requirement that all files be built with the same set of options. Current kernels are not built that way; different options are used in different parts of the tree. In some places, this problem can be worked around by disabling specific optimizations that depend on different compiler flags than are used in the rest of the kernel. In others, though, features must simply be disabled to use LTO. These include the "modversions" feature (allowing kernel modules to be used with more than one kernel version) and the function tracer. Modversions seems to be fixable; getting ftrace to work may require changes to GCC, though.

It is also necessary, of course, to change the build system to use the GCC LTO feature. As of this writing, one must have a current GCC release; it is also necessary to install a development version of the binutils package for LTO to work. Even a minimal kernel requires about 4GB of memory for the LTO pass; an "allyesconfig" build could require as much as 9GB. Given that, the use of 32-bit systems for LTO kernel builds is out of the question; it is still possible, of course, to build a 32-bit kernel on a 64-bit system. The build will also take between two and four times as long as it does without LTO. So developers are unlikely to make much use of LTO for their own work, but it might be of interest to distributors and others who are building production kernels.

The fact that most people will not want to do LTO builds actually poses a bit of a problem. Given the potential for LTO to introduce subtle bugs, due either to optimization-related misunderstandings or simple bugs in the new LTO feature itself, widespread testing is clearly called for before LTO is used for production kernels. But if developers and testers are unwilling to do such heavyweight builds, that testing may be hard to come by. That will make it harder to achieve the level of confidence that will be needed before LTO-built kernels can be used in real-world settings.

Given the above challenges, the size of the patch set, and the ongoing maintenance burden of keeping LTO working, one might well wonder if it is all worth it. And that comes down entirely to the numbers: how much faster does the kernel get when LTO is used? Hard numbers are not readily available at this time; the LTO patch set is new and there are still a lot of things to be fixed. Andi reports that runs of the "hackbench" benchmark gain about 5%, while kernel builds don't change much at all. Some networking benchmarks improve as much as 18%. There are also some unspecified "minor regressions." The numbers are rough, but Andi believes they are encouraging enough to justify further work; he also expects the LTO implementation in GCC to improve over time.

Andi also suggests that, in the long term, LTO could help to improve the quality of the kernel code base by eliminating the need to put inline functions into include files.
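
To make that point concrete (with invented names): a small helper can currently be inlined by every caller only if it is defined in a header, whereas LTO can inline an ordinary function across files.

    /* Today, in a header (foo.h, say), so that every caller can inline it: */
    struct foo {
            int state;
    };

    static inline int foo_ready(const struct foo *f)
    {
            return f->state != 0;
    }

    /* With LTO, the same helper could move to foo.c as an ordinary
     * function and still be inlined into callers in other files. */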

All told, this is a patch set in a very early stage of development; it seems unlikely to be proposed for merging into a near-term kernel, even as an experimental feature. In the longer term, though, it could lead to faster kernels; use of LTO in the kernel could also help to drive improvements in the GCC implementation that would benefit all projects. So it is an effort that is worth keeping an eye on.



Link-time optimization for the kernel

Posted Aug 21, 2012 14:40 UTC (Tue) by Lionel_Debroux (subscriber, #30014) [Link]

Besides speed improvements, LTO is useful for PaX's GCC plugins aimed at reporting (e.g. k*alloc allocation size stats) and hardening (size overflow, KERNEXEC/amd64, constification of structures containing only function pointers unless they're marked __no_const or are detectably modified somewhere, etc.).

Link-time optimization for the kernel

Posted Aug 21, 2012 15:25 UTC (Tue) by ewan (subscriber, #5533) [Link]

"The fact that most people will not want to do LTO builds actually poses a bit of a problem."

As long as there are Gentoo users there will be someone to test massively resource intensive compiler optimisations of uncertain benefit.

Link-time optimization for the kernel

Posted Aug 21, 2012 16:29 UTC (Tue) by rvfh (subscriber, #31018) [Link]

You're trolling...

Link-time optimization for the kernel

Posted Aug 21, 2012 16:59 UTC (Tue) by Homer512 (subscriber, #85295) [Link]

But not far from the truth:
http://realnc.blogspot.de/2012/06/building-gentoo-linux-w...

(Gentoo user myself, so this is not meant mockingly)

Link-time optimization for the kernel

Posted Aug 23, 2012 0:53 UTC (Thu) by prometheanfire (subscriber, #65683) [Link]

Someone has to find the bugs.

Link-time optimization for the kernel

Posted Aug 21, 2012 17:08 UTC (Tue) by mikemol (subscriber, #83507) [Link]

Yet he's right. And I say this as an enthusiastic Gentoo user who's done some automation to find optimum configurations. :)

In reality, though, building a kernel with LTO isn't going to take as long as, say, building LibreOffice, Firefox or Chromium. Those take sufficiently long that more than a few Gentoo users prefer to use binary ebuilds rather than build them themselves.

What's really needed here, though, are automated regression tests to improve the confidence that the output of an LTO build isn't the most likely source of problems. That's the kind of thing that would need massive corporate sponsorship to get off the ground, though; there's a *ton* of code in the kernel, and while it's reasonably well-organized, that's still a lot of tests to write.

It's also necessary to understand that you can't be 100% confident you've caught all possible compiler-introduced bugs. The best you might be able to do is identify the portions of the kernel which the LTO pass twiddled the most, and write additional tests to exercise that code from multiple angles.

I wonder if such an "LTO hotspot" analysis could be used to do certain high-level reasoning about the code. Being able to answer questions like "given that LTO-active regions have some relation to a property of the APIs in that area, is that property something we want to see more or less of in our APIs?" would be interesting; it could help inform tradeoffs in future designs of APIs.

Partial LTO?

Posted Aug 21, 2012 17:19 UTC (Tue) by cesarb (subscriber, #6266) [Link]

The kernel already does a lot of partial linking (ld -r) in its build. Could it do partial LTO too? That is, LTO a few of the major directories separately (kernel, mm, fs, net, arch - probably everything but drivers), and do not LTO the final link step. This would give smaller gains, but would also compile faster.

Partial LTO?

Posted Aug 22, 2012 4:50 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246) [Link]

At what granularity is it partially linking? If it's of the sort where a library of functions gets partially linked separately from all the places that call that library, the impact of LTO should be noticeably lessened.

Partial LTO?

Posted Aug 22, 2012 13:23 UTC (Wed) by nix (subscriber, #2304) [Link]

It's just ld -r, and it's done in every directory in the kernel tree (that has any .o files in it at all), generating either module.o or built-in.o files that are then linked together to generate final modules or the kernel proper (though the kernel proper also has a bunch of other object files, most of the bulk is in built-in.o files). The LTO sections should get carried along with this, and then at full link time the linker plugin should pick them up and do a full LTO.

Partial LTO?

Posted Aug 22, 2012 14:22 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246) [Link]

I kinda figured it was something like per-directory (it certainly appeared that way last time I built a kernel). I was pretty sure the normal flow carried the serialized GIMPLE down to the end for the final LTO. Where the boundary becomes important is if you try to implement cesarb's partial-LTO suggestion.

If you did a "partial LTO", where you munged all the GIMPLE together and generated a new object in place of the partially linked library, and didn't pass the GIMPLE up to the next level, you'd be leaving many of the most interesting optimizations aside if many come from optimizing library and core code into the bodies of drivers and other leaves.

It doesn't make sense to re-codegen at partial link time unless your intent is to throw away the GIMPLE, or provide a library that could be linked both with and without further LTO (which seems... weird?).

Partial LTO?

Posted Aug 23, 2012 17:08 UTC (Thu) by rriggs (subscriber, #11598) [Link]

If you build and then load *any* kernel modules, it is, by definition, partial LTO.

Partial LTO?

Posted Aug 23, 2012 19:33 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246) [Link]

Fair enough, but only at the module boundary. Isn't this the same as any DSO? I mean, it's not as if LTO is bringing in libc and all the other shared libraries your other shared libraries use when you use it in user-space, is it?

The point is, to get the full benefit of LTO in the kernel, you want to be able to inline or optimize across boundaries such as arch/arm/ and kernel/, for example. (At least, if I understood correctly.) If those are both LTO'd at their respective directory boundaries, and only linked traditionally at the last step, you lose those opportunities.

Sure, drivers built as modules miss out. But, my gut feel (which may be wrong!) is that there's still plenty of things that are compiled in that ought to benefit that won't if you limit LTO to the partial link boundaries.

Link-time optimization for the kernel

Posted Aug 21, 2012 18:09 UTC (Tue) by pj (subscriber, #4506) [Link]

>Andi reports that runs of the "hackbench" benchmark gain about 5%, while kernel builds don't change much at all. Some networking benchmarks improve as much as 18%.

Does this help point out that the networking code could be better link-optimized, even if the rest of the kernel isn't? I mean, if the kernel is being built with different gcc commandlines anyway, does this mean that the network-code-building bits could perhaps be better optimized?

Link-time optimization for the kernel

Posted Aug 21, 2012 20:57 UTC (Tue) by maleadt (subscriber, #59198) [Link]

Similar work has been done in my research lab, performing LTO optimizations on 2.4 kernels, starting from ARM and x86 object files.

Link-time optimization for the kernel

Posted Aug 21, 2012 22:16 UTC (Tue) by nix (subscriber, #2304) [Link]

As a minor point, GCC can decompose a good bit of the memory-hungry bit of LTO into pieces and then run them in parallel via the -flto={number} option: -flto=jobserver will participate in GNU Make's jobserver feature and run as many LTO instances as that permits (it does this without needing to understand the jobserver protocol via the rather nifty approach of writing a makefile to do the work, running make, and passing it the appropriate jobserver fds).

This way, you can push the memory usage down about as far as a normal compilation for about 90% of the time. Some of the work is nonparallelizable, though, and that will still require a good bit of memory.

(The description of what LTO does is also somewhat unclear. It works by serializing the GIMPLE intermediate representation into special sections in the object files before carrying out most optimization, then, at link time (via the compiler driver, or, for .a files, via a special linker plugin supported by gold and by recent versions of GNU ld) reading the whole lot back in and then running the whole lot through the optimizer almost as if it were a single source file. But that 'almost' covers a multitude of sins, and in order to be useful in practice the thing had to be parallelizable as well. It was a *lot* of work to make GCC capable of this, and I for one believed it could never be made to work reliably. I was wrong.)

Link-time optimization for the kernel

Posted Aug 21, 2012 23:22 UTC (Tue) by andikleen (guest, #39006) [Link]

Thanks for the article

I should clarify that the syscall argument problem in
http://lwn.net/Articles/512685/
is not LTO specific. It's a general problem that the 32bit kernel violates the ABI. It uses some stack arguments for syscalls inside the kernel. The actual syscall interface is defined with registers, but the interface between the syscall entry code and the actual syscall has stack arguments.

The kernel entry code saves all registers on the stack. The same area also doubles as the stack arguments for syscalls.

According to the ABI it's legal for the compiler to change arguments on the stack. But the 32bit kernel cannot tolerate this because it corrupts the saved user space registers.
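
A simplified sketch of the situation (names invented; on 32-bit x86, asmlinkage forces all arguments onto the stack, and the entry code arranges for the saved user registers to serve as those stack slots):

    #define asmlinkage __attribute__((regparm(0)))  /* all arguments on the stack */

    asmlinkage long sys_example(unsigned int fd, unsigned long arg)
    {
            /* The stack slots holding 'fd' and 'arg' are the user
             * registers saved by the syscall entry code.  The ABI lets
             * the compiler reuse incoming argument slots as scratch
             * space, which here corrupts the registers restored on the
             * way back to user space. */
            arg = 0;        /* legal per the ABI, disastrous here */
            return (long)fd;
    }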

My 32bit LTO kernel just happened to hit this and the "noinline" is just a workaround to change the register allocation. But there's nothing wrong with inlining in LTO.

The problem could happen without LTO too.

64bit kernels don't have this problem because they pass all 6 syscall arguments in registers even inside the kernel.

It was also not fun to debug.

-Andi Kleen

Link-time optimization for the kernel

Posted Aug 22, 2012 1:56 UTC (Wed) by jhoblitt (subscriber, #77733) [Link]

Sounds like it's not worth trying to support LTO for x86 kernels until it's a proven win on amd64.

Link-time optimization for the kernel

Posted Aug 22, 2012 3:19 UTC (Wed) by andikleen (guest, #39006) [Link]

I don't think the problem is that common. We can work around it too.

Link-time optimization for the kernel

Posted Aug 23, 2012 8:07 UTC (Thu) by gb (subscriber, #58328) [Link]

Interesting experiment, but what's the point of building 32-bit kernels at this moment, when 64-bit installations are more common than 32-bit?

Link-time optimization for the kernel

Posted Aug 23, 2012 19:31 UTC (Thu) by stevenb (guest, #11536) [Link]

Is that really true in the mobile space? ;-)

Link-time optimization for the kernel

Posted Aug 24, 2012 12:14 UTC (Fri) by misiu_mp (guest, #41936) [Link]

In the mobile space we have arm and it has plenty of registers, doesn't it?

Link-time optimization for the kernel

Posted Aug 26, 2012 4:05 UTC (Sun) by theICEBear (subscriber, #23193) [Link]

It does, and AArch64 is not far away; given the rate of change within the mobile/pad fad-driven world, the new chips will start hitting the market sooner rather than later.

Link-time optimization for the kernel

Posted Aug 27, 2012 20:01 UTC (Mon) by stevenb (guest, #11536) [Link]

I was thinking more of 32-bit Atom.

ARM Thumb wouldn't have many registers either, but I don't know whether you could (or could want to) build an ARM Thumb linux kernel.

Link-time optimization for the kernel

Posted Aug 30, 2012 19:20 UTC (Thu) by cibervicho (subscriber, #52552) [Link]

> ARM Thumb wouldn't have many registers either, but I don't know whether you could (or could want to) build an ARM Thumb linux kernel.

Thumb and Thumb-2 have as many registers as ARM (32-bit). You can build a Thumb-2 kernel with CONFIG_THUMB2_KERNEL=y.

Link-time optimization for the kernel

Posted Aug 21, 2012 23:29 UTC (Tue) by cavok (subscriber, #33216) [Link]

"Andi also suggests that, in the long term, LTO could help to improve the quality of the kernel code base by eliminating the need to put inline functions into include files."

Looking at the quality of the kernel, also supporting LLVM would help in the long term.

On LTO builds with 32bit compilers.

Posted Aug 22, 2012 3:30 UTC (Wed) by andikleen (guest, #39006) [Link]

On 32bit LTO builds with 32bit compilers.

I only tried it with a 64bit compiler.

I didn't say it needs at least 4GB, actually, just that it needs not more than 4GB.

The limiting phase is the WPA step that merges all the types. That is a single process. The actual code generation runs with multiple processes later in the LTO-WHOPR build, so is just limited by how many -j* threads are used.

The 32bit compiler will need less memory than the 64bit compiler because a large amount of the compiler memory is pointers, and those are only half as big. So there's a good chance that a normal modular or reasonably sized monolithic build will fit in ~2-3GB or so, which is the limit on a normal 32bit kernel with 32bit compiler.

Very large monolithic builds like allyes will not work.

On LTO builds with 32bit compilers.

Posted Aug 22, 2012 13:19 UTC (Wed) by nix (subscriber, #2304) [Link]

allyes, or enterprise distro kernels. I don't think we can really call them 'reasonable' either (2000+ modules, great ghu). Still, the enterprise distro people can surely do 64->32-bit cross-compilation.

On LTO builds with 32bit compilers.

Posted Aug 22, 2012 13:21 UTC (Wed) by nix (subscriber, #2304) [Link]

And of course as soon as I post this I realise that having heaps of modules will pose no problem at all: LTO doesn't run over those and the kernel as a unit. I wish LWN let me delete things I posted for a few minutes after posting them...

On LTO builds with 32bit compilers.

Posted Aug 22, 2012 13:34 UTC (Wed) by andikleen (guest, #39006) [Link]

Distro modular kernels should be ok, each module is LTOed on its own and the vmlinux is not too big. The limit is only the largest binary.

That said I haven't actually tried it with a 32bit compiler. Testing welcome.

On LTO builds with 32bit compilers.

Posted Aug 22, 2012 18:25 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

This will create yet another advantage to having a monolithic kernel instead of using dozens of modules for your basic functions.

On LTO builds with 32bit compilers.

Posted Aug 24, 2012 15:49 UTC (Fri) by malor (subscriber, #2973) [Link]

Do monolithic kernels even work anymore? Back in the Stone Age, I used to compile my kernels statically to prevent module-load hacks, but as I recall, that support was mostly removed... my memory is unclear, but I was under the impression that it was no longer really possible to run Linux without using loadable modules.

Is that incorrect? Are static kernels actually reasonably possible?

On LTO builds with 32bit compilers.

Posted Aug 24, 2012 16:12 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

They are, there's just little reason to use them - especially since you'll probably need to build in any loadable firmware, which may make the result undistributable.

On LTO builds with 32bit compilers.

Posted Aug 24, 2012 23:01 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

That depends on the firmware and on your interpretation of the requirements for source for that firmware.

I haven't seen anyone sued for the source of a firmware blob if the firmware blob itself didn't include GPL code in it (as opposed to the firmware blob being used as data by GPL code and uploaded to a device).

A lot of the linux kernel developers consider the splitting of the firmware out of the source tree to be a waste of time from a technical and legal point of view, but they don't fight it because it shuts up the people who think that it does matter from a legal point of view.

Besides, the GPL only comes into play when you distribute the resulting binary. There's a huge amount of stuff that you can do (especially in a large company) without triggering this.

On LTO builds with 32bit compilers.

Posted Aug 25, 2012 4:44 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

"Which may make the result undistributable" - you appear to have just spent several paragraphs agreeing with me. What was your point?

On LTO builds with 32bit compilers.

Posted Aug 25, 2012 5:06 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

Ok, "MAY" make the result undistributable, heavy emphasis and lots of doubt on the word MAY

There are a LOT of people who don't think it would.

Also, even if it did, it wouldn't matter for lots of people, because they don't distribute the resulting binaries.

Link-time optimization for the kernel

Posted Aug 22, 2012 8:15 UTC (Wed) by jezuch (subscriber, #52988) [Link]

AFAIK there's another benefit of LTO: it can merge identical functions and data from multiple object files. This is especially beneficial for C++, where, for example, every time one #includes <cstream>, the compiler generates several stubs which turn out to be identical and unnecessarily duplicated. This and other optimizations usually result in about 10% reduction in the final binary's size.

That said, I notice that the kernel actually grows with LTO. The modules do get smaller - especially large ones like GPU and filesystem drivers - but not vmlinuz.

Oh, and the development version which will become GCC 4.8 reduces memory use even further. So much that it can actually build itself on my machine with 8 gigs of ram, although with some swapping :)

Link-time optimization for the kernel

Posted Aug 22, 2012 13:34 UTC (Wed) by jwakely (subscriber, #60262) [Link]

> it can merge identical functions and data from multiple object files

But in practice does it actually do so?

> every time one #includes <cstream>, the compiler generates several stubs which turn out to be identical and unnecessarily duplicated

I assume you mean <iostream>, and AFAIK LTO doesn't change that situation at all (see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44952 for a suggestion to make it do so, and other suggestions that wouldn't rely on LTO.)

Link-time optimization for the kernel

Posted Aug 22, 2012 13:37 UTC (Wed) by andikleen (guest, #39006) [Link]

I don't think that works currently with LTO

There were some linker based approaches for this though (it only really needs a checksum per function)

Link-time optimization for the kernel

Posted Aug 22, 2012 18:19 UTC (Wed) by stevenb (guest, #11536) [Link]

GCC doesn't do this, but AFAIU the gold linker's icf.cc does this. From binutils-2.22:src/gold/icf.cc:

// Identical Code Folding Algorithm
// ----------------------------------
// Detecting identical functions is done here and the basic algorithm
// is as follows. A checksum is computed on each foldable section using
// its contents and relocations. If the symbol name corresponding to
// a relocation is known it is used to compute the checksum. If the
// symbol name is not known the stringified name of the object and the
// section number pointed to by the relocation is used. The checksums
// are stored as keys in a hash map and a section is identical to some
// other section if its checksum is already present in the hash map.
// Checksum collisions are handled by using a multimap and explicitly
// checking the contents when two sections have the same checksum.
//
// However, two functions A and B with identical text but with
// relocations pointing to different foldable sections can be identical if
// the corresponding foldable sections to which their relocations point to
// turn out to be identical. Hence, this checksumming process must be
// done repeatedly until convergence is obtained.

Whether this works with LTO, I don't know. And I suppose it requires -ffunction-sections but I'm not sure about that either.

Not a very helpful post, sorry ;-)

Link-time optimization for the kernel

Posted Aug 22, 2012 22:43 UTC (Wed) by andikleen (guest, #39006) [Link]

gold unfortunately does not work with LTO kernel builds at the moment.
You could only use it without.

Link-time optimization for the kernel

Posted Aug 23, 2012 10:05 UTC (Thu) by jwakely (subscriber, #60262) [Link]

No need to apologise, I'd somehow missed that gold did ICF and thought Microsoft's was the only mainstream linker to support it. Thanks, Steven!

Link-time optimization for the kernel

Posted Aug 23, 2012 10:19 UTC (Thu) by mgedmin (subscriber, #34497) [Link]

Wasn't there recently a post about a kernel bug caused by the compiler merging two different functions that turned out to contain identical code? IIRC it was caused by a check of a function pointer in order to determine an object's type misfiring.
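
The hazard is easy to sketch (an invented example, not the actual bug):

    struct obj {
            int (*flush)(void *);
    };

    /* Two distinct callbacks that happen to have identical bodies; an
     * identical-code-folding pass may merge them into one function. */
    static int flush_a(void *p) { (void)p; return 0; }
    static int flush_b(void *p) { (void)p; return 0; }

    static int obj_is_type_a(const struct obj *o)
    {
            /* If flush_a and flush_b were folded, objects of both types
             * now hold the same pointer and this check misfires. */
            return o->flush == flush_a;
    }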

Link-time optimization for the kernel

Posted Aug 24, 2012 15:01 UTC (Fri) by jezuch (subscriber, #52988) [Link]

> But it turns out that most of the changes have the same basic scope: ensuring that the compiler knows that specific symbols are needed even if they appear to be unused; that prevents the LTO stage from optimizing them away. For example, symbols exported to modules may not have any callers in the core kernel itself, but they need to be preserved for modules that may be loaded later. To that end, Andi's first patch defines a new attribute (__visible) used to mark such symbols

One more thing... If I disable loadable modules (CONFIG_MODULES=n), does this also get disabled? I think I have nothing to fear from the symbol reaper if there's no way to use these symbols outside the core kernel :)
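
For what it's worth, the export macros are already conditional on CONFIG_MODULES, roughly like this (simplified from include/linux/export.h):

    #ifdef CONFIG_MODULES
    #define EXPORT_SYMBOL(sym)      __EXPORT_SYMBOL(sym, "")
    #else
    #define EXPORT_SYMBOL(sym)      /* expands to nothing */
    #endif

So with CONFIG_MODULES=n the exports themselves evaporate; whether the __visible annotations are made conditional in the same way is a question for the patch set.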

Link-time optimization for the kernel

Posted Aug 31, 2012 19:05 UTC (Fri) by dbarnes (guest, #75204) [Link]

Will existing patents on LTO (5966539, 5999737) affect this?
