Shrinking the kernel with link-time optimization
This is the second article in a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. The first article provided a short rationale for the topic and covered link-time garbage collection (the ld --gc-sections method). We saw that, while it is pretty straightforward, link-time garbage collection has issues of its own when applied to the kernel, making optimal results more trouble to achieve than they are worth. In this article we'll look at what the compiler itself can do using link-time optimization.
Please note that most examples presented here were produced using the ARM architecture; the principles themselves, however, are architecture-independent.
Dead-code elimination
Kernel developers often rely on a compiler feature called "dead-code elimination". This is an important optimization that results in unreachable code simply being dropped from the final binary. Unlike the linker garbage-collection feature, dead-code elimination can be performed by the compiler both within and across functions inside a compilation unit or file.
Let's reuse the example code we used previously as test.c to illustrate it:
int foo(void)
{
	return 1;
}

int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}
Again, the compiler generates the following (simplified) assembly output for that code:
.text
.type foo, %function
foo:
	mov	r0, #1
	bx	lr

.type bar, %function
bar:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #2
	pop	{r3, pc}

.type main, %function
main:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #4
	pop	{r3, pc}
Despite bar() not being called, it is still part of the compiled output because there is no way for the compiler to actually know whether code in some other file might call it. But the author of that code often knows that, and can tell the compiler about it with the static qualifier as follows:
static int foo(void)
{
	return 1;
}

static int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}
By marking foo() and bar() static, the developer renders them no longer reachable from other source files. The compiler is then free to perform more optimizations on the compiled code. Of course the entry point (main()) must remain externally accessible and therefore cannot be static.
The above compiles to this:
.text
.type main, %function
main:
	mov	r0, #5
	bx	lr
Boom! Not only did the compiler get rid of the unused bar() with dead-code elimination, but it also merged foo() directly into main() due to automatic inlining. In addition, it performed the arithmetic operation up front since all of the operands are constants, so that all we have left in the compiled code is the load of the resulting value and the return instruction. Instant code-size reduction that already works better than link-time garbage collection!
As mentioned, this dead-code elimination is heavily relied upon in the Linux kernel source tree, so that large portions of the code can be optimized away at compile time. For example, let's consider the following from include/linux/mmzone.h:
static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
	return (idx == ZONE_HIGHMEM ||
		(idx == ZONE_MOVABLE && zone_movable_is_highmem()));
#else
	return 0;
#endif
}
When CONFIG_HIGHMEM is not defined, is_highmem_idx() (and PageHighMem() derived from it) return zero unconditionally. Any code within functions that follows the "if (PageHighMem(page))" pattern will be automatically optimized away as dead code.
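To make this concrete, here is a hypothetical caller following that pattern; the function is invented for illustration and is not taken from the kernel source. With CONFIG_HIGHMEM unset, PageHighMem() evaluates to a constant zero at compile time, so the whole highmem branch, including the kmap() and kunmap() calls, never makes it into the object file:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Hypothetical example, not taken from the kernel source. */
static void copy_from_page(struct page *page, void *dst, size_t len)
{
	if (PageHighMem(page)) {
		/* dead code when CONFIG_HIGHMEM is not set */
		void *src = kmap(page);

		memcpy(dst, src, len);
		kunmap(page);
	} else {
		memcpy(dst, page_address(page), len);
	}
}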
But this works only because is_highmem_idx() is marked static; to avoid duplication of that function everywhere mmzone.h is included it has to be marked inline too. Those optimizations only work within a single compilation unit, missing out on opportunistic dead code elimination across different compilation units. So, what can we do short of concatenating all source files into a single one and making everything static to achieve the full benefit?
As mentioned previously, the core-kernel APIs are split into different C files for ease of maintenance. Those files may provide functions that are not called when some unwanted feature is configured out. It could be argued that the unused core functions should be #ifdef'd in or out along with their call sites, but this gets hairy when multiple features sharing the same core API may be configured in and out independently. To complicate things further, those core functions might be called within "if (PageHighMem(page))" blocks showing no directly visible relationship with a configuration option. So there are limits to how much unused code can be removed by the compiler; doing a more thorough job requires a tool like link-time optimization.
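A small made-up example shows the limitation; the file names, the FEATURE_X macro, and both functions are invented for illustration only:

/* core_helper.c: a "core API" function kept in its own file */
int expensive_fixup(int x)
{
	return x * 3 + 1;
}

/* feature_user.c: the feature is configured out in this build */
#define FEATURE_X 0

extern int expensive_fixup(int x);

int process(int x)
{
	if (FEATURE_X)		/* constant zero: this branch is dead code */
		x = expensive_fixup(x);
	return x;
}

Compiling each file on its own, the compiler happily removes the dead branch from process(), but expensive_fixup() survives in core_helper.o because nothing tells the compiler that its last caller has just disappeared. Removing it requires a whole-program view.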
Link-time optimization (LTO)
What is it? LTO is a compilation mode that instructs the compiler to parse the code into its abstract internal representation as usual, but to store that representation directly into the resulting object file, without optimization, rather than optimizing and assembling it into final machine instructions. Then, at link time, when all the different object files are gathered together, the compiler intercepts the link process and reloads that internal representation from all of the object files at once; only then does it perform its optimization passes, on the whole program, before the actual link. So it is basically as if all of the source files had been concatenated into a single one with everything marked static. Great, that is exactly what we wished for.
Let's see how this works in practice with our little example program by having each function in its own file:
$ gcc -O2 -flto -c foo.c
$ gcc -O2 -flto -c bar.c
$ gcc -O2 -flto -c main.c
$ gcc -O2 -flto -o test foo.o bar.o main.o
$ nm test | grep "foo\|bar\|main"
000102c0 T main
$ objdump -d test
[...]
000102c0 <main>:
   102c0:	e3a00005 	mov	r0, #5
   102c4:	e12fff1e 	bx	lr
As expected, the result is the same as our earlier test despite having separate source files containing non-static functions.
LTO and the kernel
LWN first covered LTO for the kernel more than five years ago. Since then, things have improved a lot. LTO still isn't supported in the mainline, but Andi Kleen's kernel LTO patch set has become much simpler as fixes for basic code-correctness issues, which LTO is pickier about, have been merged upstream, and many LTO bugs in GCC have been fixed.
One of the biggest LTO showstoppers for the kernel had to do with the fact that a special version of binutils was required. The kernel used to rely solely on partial linking (ld -r) when recursively gathering subdirectory build results, however ld -r doesn't support objects with LTO data unless binutils is patched to do so. And it was rather unlikely that the necessary patch would ever be merged in the upstream binutils tree. Nowadays the kernel build system can use thin archives instead of ld -r, making LTO of the kernel possible with the upstream tools that most distributions ship.
LTO also has a big advantage over link-time garbage collection: it does not require a separate linker section for each exception-table entry, and it does not suffer from the "backward reference" and "missing forward reference" problems described in the previous article. When the compiler optimizes code away, the exception-table entries instantiated by that code are simply dropped along with it.
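As a rough illustration (a minimal sketch using an invented section name and helper, not the kernel's actual uaccess macros), an accessor of this kind emits its table entry with a .pushsection directive right next to the instruction it annotates, so the entry only exists where the code does:

/* Hypothetical accessor: each emitted copy adds one entry to a
 * made-up "__my_ex_table" section pointing at the load instruction.
 * If LTO drops the function (or an inlined copy of it), the entry
 * is never emitted either. */
static inline int annotated_load(const int *p)
{
	int val;

	asm volatile(
	"1:	ldr	%0, [%1]\n"
	"	.pushsection __my_ex_table, \"a\"\n"
	"	.long	1b\n"
	"	.popsection\n"
	: "=r" (val)
	: "r" (p));

	return val;
}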
Numbers please!
Let's not forget that our end goal is to fit Linux into tiny systems. So it is about time we looked at actual kernel-size numbers. Let's pick the STM32 target, which represents the kind of tiny system we're aiming for. The advantage here is that mainline Linux already runs on most STM32 microcontrollers, albeit with external RAM. The baseline kernel version is v4.15-rc1 plus the LTO patches.
First, with LTO disabled:
$ make stm32_defconfig
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1704024  144732  117660 1966416  1e0150 vmlinux
And with LTO enabled:
$ ./scripts/config --enable CONFIG_LTO_MENU
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1281644  142492  112985 1537121  177461 vmlinux
This is a 22% size reduction right there. For completeness, let's see how link-time garbage collection as described in the previous article fares:
$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1304516  141672  113108 1559296  17cb00 vmlinux
Here we get a 21% size reduction. However, this comes with a big disclaimer due to the following hacks:
- No KEEP() statements were added to the ARM linker file as required. Worse: the ASSERT() statements about missing processor and architecture tables have been disabled for the sake of successful compilation. This means important pieces of code and data are missing from this kernel.

- The ARM unwinding facility needed for function backtraces has been forcefully disabled as it also contained a reference to every function, making garbage collection ineffective. So, unlike the LTO-built kernel, this one would lack an important debugging facility.
Of course those hacks produce a non-functional kernel. Still, the size reduction is slightly smaller than what LTO provides, and it would be smaller yet if proper link-time garbage-collection support were implemented. And optimal link-time garbage collection, as described in the previous article, is far more invasive than LTO. We therefore have a clear winner here.
One could wonder if size reduction could improve further by combining both link-time optimization and link-time garbage collection. The answer is no since, once LTO has removed every piece of dead code, there is simply nothing left to garbage-collect.
More numbers
So LTO seems to be the best thing since sliced bread, right? Well, it has drawbacks of its own. The most significant is build time. Let's repeat the above kernel compilation sequence to see what we get.
First with LTO disabled:
$ make clean
$ make stm32_defconfig
$ time make -j8 vmlinux
real    0m36.645s
user    3m59.252s
sys     0m21.026s
And with LTO enabled:
$ make clean
$ ./scripts/config --enable CONFIG_LTO_MENU
$ time make -j8 vmlinux
real    1m24.774s
user    8m4.143s
sys     0m31.902s
LTO roughly doubles the CPU time and takes 2.3 times as long in wall-clock time to build the kernel. Performing code optimization only at the very end creates a big serialization point; a traditional non-LTO build compiles and optimizes individual source files concurrently.
But the most annoying case, at least for a kernel developer, is partial rebuild time after some source-code modifications. Without LTO we get:
$ touch init/main.c
$ time make -j8 vmlinux
real    0m3.686s
user    0m5.803s
sys     0m1.819s
And with LTO enabled this becomes:
$ touch init/main.c
$ time make -j8 vmlinux
real    0m58.283s
user    5m6.089s
sys     0m12.732s
A partial build with LTO takes about 15 times longer than in the non-LTO case, and is not very far from the full build time. LTO is therefore clearly not suitable for frequent debug/rebuild/test cycles.
And for completeness:
$ make clean
$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ time make -j8 vmlinux
real    0m37.572s
user    3m58.826s
sys     0m21.616s
More or less the same result as our initial build. Clearly, link-time garbage collection is basically free in terms of build time, which is its biggest (and perhaps only) advantage.
Test-build environment details:

- GCC version 6.3.1 20170404 (Linaro GCC 6.3-2017.05)
- Intel® Core™ i7-4770R CPU @ 3.20GHz
- Samsung SSD 850 EVO 500GB
Conclusion
We have two approaches for automatic kernel-size reduction at our disposal, each with a different set of compromises. However, the advantage is clearly on the LTO side when considering maintenance costs and intrusiveness, and the build time becomes tolerable when building very small kernels anyway. But did we manage to get a "very small kernel"? A kernel still above the one-megabyte mark cannot realistically be qualified as "very small" or even "tiny" yet. Clearly, automatic size reduction alone won't be sufficient, so more assertive approaches will be required to achieve our goal. That will be the subject of the next article.
Meanwhile, anybody wanting to play with LTO on their own kernel in the short term should start with these instructions found in Kleen's patch set.
The next article in this series is Shrinking the kernel with an axe.
Index entries for this article:
Kernel: Build system
Kernel: Optimization tools
GuestArticles: Pitre, Nicolas
Posted Jan 18, 2018 21:18 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
How much less bad does this get when parallelizing LTO like new-ish GCCs can? Or is this already with parallelization enabled?
Posted Jan 18, 2018 22:18 UTC (Thu)
by npitre (subscriber, #5680)
[Link] (1 responses)
Posted Jan 18, 2018 22:25 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
Too bad. Would be interesting, but I definitely see why you wouldn't necessarily want to measure given it's yet another patchset, how that compares to LLVM's thinlto implementation.
Posted Jan 18, 2018 21:25 UTC (Thu)
by intgr (subscriber, #39733)
[Link] (1 responses)
Posted Jan 19, 2018 2:38 UTC (Fri)
by gutschke (subscriber, #27910)
[Link]
In other words, there almost certainly will be a performance improvement. But my gut feeling is that the numbers are going to be so small, most users would not be able to notice.
Posted Jan 19, 2018 12:11 UTC (Fri)
by AnimeLife (guest, #116014)
[Link] (1 responses)
Posted Jan 19, 2018 12:22 UTC (Fri)
by willy (subscriber, #9762)
[Link]

There are of course other optimisations for squeezing more systems into the same amount of hardware, but tinification is one tool.
Posted Jan 20, 2018 6:55 UTC (Sat)
by ndesaulniers (subscriber, #110768)
[Link]

https://lkml.org/lkml/2017/11/15/772
Posted Jan 21, 2018 14:30 UTC (Sun)
by ncm (guest, #165)
[Link] (3 responses)
(Just recalling how happy three of us programming on an LSI-11 were when we got a 20M "Winchester" disk. We had 128K RAM, plus (significantly) 4K in each terminal.)
It probably is also worth mentioning that, in a terrifyingly short time, every ordinary machine will have multiple TB of non-volatile RAM, and no disk or SSD, and all performance-tuned code will be obsolete. Any network packet propagation latency will suddenly become an eternity, and the only remaining performance bottleneck masking inefficiencies in query processing code.
Posted Jan 23, 2018 14:32 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (2 responses)

I managed to build chromium with LTO on my puny personal machine
Posted Jan 26, 2018 22:21 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (1 responses)
How puny? (Real memory, swap space)
Probably bigger than the server I usually use to build kernels.
Posted Jan 29, 2018 17:11 UTC (Mon)
by jezuch (subscriber, #52988)
[Link]
Building gcc itself, on the other hand...
Posted Jan 22, 2018 15:01 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (2 responses)
If I understand correctly, dead code elimination is hindered by C++. For example, any virtual function will automatically be referred to in the vtable and it is not possible to strip any virtual methods from the final executable. Or any virtual methods in derived types regardless of use if any part of the type hierarchy is referred to at all. I suppose it might be possible for a linker to rewrite vtables to drop the functions (or maybe leave a dangling pointer in the vtable...) but as I understand it, this isn't done unless the compiler has all the information - e.g. when writing basic examples where everything is in a single module.
So when I run basic examples using -O3 the aggressive inlining seems to handle it fine (e.g. https://godbolt.org/g/Rn8GH1). It's not clear if this is working ok in the small cases but wouldn't be expected to work in the larger case - or if my mental model is just incorrect.
Posted Jan 23, 2018 1:43 UTC (Tue)
by inphi (guest, #120554)
[Link]
Posted Feb 2, 2018 4:27 UTC (Fri)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 23, 2018 15:14 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (7 responses)

GCC developers seem to be disinterested in LTO bugs, and for its antecessor (-fwhole-program --combine) they outright said it won't get fixed.
Posted Jan 23, 2018 17:03 UTC (Tue)
by peter-b (subscriber, #66996)
[Link] (6 responses)
Posted Jan 24, 2018 16:15 UTC (Wed)
by ncm (guest, #165)
[Link] (5 responses)
It is a matter of respect. When one person puts in substantial efforts to improve the kernel in a way that is important to them and to many other users, a trivial effort not to massively break those improvements would hint that you have something better than contempt for the people who do the work.
To complain that LTO doesn't work would be to demand people work for you for free. Making LTO work is contributing. Improvements of any kind typically take far more effort than it is worth to the individual, who is doing the rest of us that favor. Minimal effort not to break others' work is necessary to a healthy project.
Posted Jan 25, 2018 12:58 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
Look up my contributions, if you so desire… if you find them all, I know I personally *don’t* even know all places I’ve had my fingers in over the last decades.
It’s just, compilers isn’t what I do well. I’ve patched bootloaders, I’ve got fixes in all Linux libcs except musl (dalias does a great enough job for me to not find any bugs in musl so far), and I’ve been doing tons of other work, and I’m even now expanding to other stuff.
There’s just not enough hours in a day, considering I have a regular, boring $dayjob. Do you wish to sponsor me for a year so I can take a sabbatical and work only on OSS?
Posted Jan 26, 2018 22:20 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (3 responses)
ncm didn't say you don't contribute enough. He said at most that it would be preferable for you to fix GCC (repeatedly, apparently) than to complain that it keeps breaking. (And that's not saying you should fix GCC).
And even that is based on some reading between the lines about the moral value of demanding versus contributing, and the idea that complaining about something is demanding that someone fix it. I don't view complaining that way; for example, I complain about the weather all the time without meaning to criticize anyone or demand that someone fix it. I complain about the presence of ads on Youtube the same way.
Posted Jan 26, 2018 22:46 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link] (2 responses)
I actually find the idea of using LTO to eliminate dead code, in the Linux kernel or elsewhere, great — I just wanted to point out that GCC might, with its current bugs, history of bugs, and history of attitude towards said bugs¹, be a tad too unreliable to do so without excessive tests that point out miscompiled builds.

① I read "low-hanging fruits" in an LWN article today. One of these, for the GCC/LTO problem, would be to make building mksh part of the usual pre-release tests; mksh has a history of spotting compiler, toolchain, libc, etc. bugs via its testsuite.

Now, with both the footnote #1 and the first paragraph, let's get to something: isolating the issue is *hard*. The mksh testsuite is a bunch of shell scripts together with flags and expected output, with a Perl driver, ran through the shell compiled with the to-be-tested compiler/toolchain/libc. That's a few levels of indirection. The latest LTO bug occurs in only one testcase: arith-ternary-prec-1, which is:

$ mksh -c 'typeset -i x=2; y=$((1 ? 20 : x+=2))'
mksh: 1 ? 20 : x+=2: += requires lvalue

Basically, ?: binds more than +=, so this is '(1 ? 20 : x) += 2', and a miscompiled shell silently accepts this. This is *very* hard to isolate.

GCC developers prefer isolated small test cases. Now, with LTO, isolating gets even more complicated. I can accept that not having a small isolated test case is not desirable.

On the other hand, a change in testsuite output between two different versions of the same compiler, ceteris paribus (i.e. you try the same version of the testsuite, shell, toolchain, libc, …), *does* indicate a problem (not necessarily in the compiler, but it's a prime suspect), and in the time of "git bisect" it's at least often possible, for someone with enough beefy hardware to actually build GCC that often, to figure out which compiler change introduced the breakage. (Then, it's still a matter of deciding whether the bug is actually in the compiler or elsewhere, but the GCC developers at least know their compiler, and each other on the development team.)

Oh, and: the Linux kernel does not have such a testsuite. Several GNU distributions' mksh package maintainers have come to me, independently, with a testsuite failure report about the above test, and the advice found after the first analysis (LTO is at fault, GCC miscompiled mksh) made them compile mksh without LTO, preventing their users from getting a faulty binary that might misbehave in other situations as well. Now, the Linux kernel, not so much.

Food for thought?
Posted Feb 8, 2018 10:49 UTC (Thu)
by dharding (subscriber, #6509)
[Link] (1 responses)
Posted Feb 8, 2018 20:03 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link]
And, yes, sorry, I don't have even the beginning of an answer for you, but someone with enough horsepower machine could certainly bisect this between GCC 5 and 6 I think…
Posted Jan 29, 2018 14:32 UTC (Mon)
by jezz (subscriber, #59547)
[Link] (1 responses)
I wrote a script few years ago to detect unused struct members. I extracted information from debug symbols. Unfortunately, this script seems lost.
Posted Feb 2, 2018 18:51 UTC (Fri)
by cesarb (subscriber, #6266)
[Link]
Posted Feb 25, 2019 20:05 UTC (Mon)
by praveenv98 (guest, #116175)
[Link]
In this passage :
"When CONFIG_HIGHMEM is not defined, is_highmem_idx() (and PageHighMem() derived from it) return zero unconditionally. Any code within functions that follows the "if (PageHighMem(page))" pattern will be automatically optimized away as dead code.
But this works only because is_highmem_idx() is marked static; ".
Why does this work only for static functions? The static keyword only implies internal linkage, right?
Thanks