Shrinking the kernel with link-time optimization
This is the second article in a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. The first article provided a short rationale for the topic and covered link-time garbage collection (the ld --gc-sections method). We saw that, while it is pretty straightforward, link-time garbage collection has issues of its own when applied to the kernel, making optimal results more trouble to achieve than they are worth. In this article we'll look at what the compiler itself can do using link-time optimization.
Please note that most examples presented here were produced using the ARM architecture; the principles themselves, however, are architecture-independent.
Dead-code elimination
Kernel developers often rely on a compiler feature called "dead-code elimination". This is an important optimization that results in unreachable code simply being dropped from the final binary. Unlike the linker garbage-collection feature, dead-code elimination can be performed by the compiler both within and across functions inside a compilation unit or file.
Let's reuse the example code we used previously as test.c to illustrate it:
int foo(void)
{
	return 1;
}

int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}
Again, the compiler generates the following (simplified) assembly output for that code:
.text
.type foo, %function
foo:
	mov	r0, #1
	bx	lr

.type bar, %function
bar:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #2
	pop	{r3, pc}

.type main, %function
main:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #4
	pop	{r3, pc}
Despite bar() not being called, it is still part of the compiled output because there is no way for the compiler to actually know whether code in some other file might call it. But the author of that code often knows that, and can tell the compiler about it with the static qualifier as follows:
static int foo(void)
{
	return 1;
}

static int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}
By marking foo() and bar() static, the developer renders them no longer reachable from other source files. The compiler is then free to perform more optimizations on the compiled code. Of course the entry point (main()) must remain externally accessible and therefore cannot be static.
The above compiles to this:
.text
.type main, %function
main:
	mov	r0, #5
	bx	lr
Boom! Not only did the compiler get rid of the unused bar() with dead-code elimination, but it also merged foo() directly into main() due to automatic inlining. In addition, it performed the arithmetic operation up front since all of the operands are constants, so that all we have left in the compiled code is the load of the resulting value and the return instruction. Instant code-size reduction that already works better than link-time garbage collection!
As mentioned, this dead-code elimination is heavily relied upon in the Linux kernel source tree, so that large portions of the code can be optimized away at compile time. For example, let's consider the following from include/linux/mmzone.h:
static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
	return (idx == ZONE_HIGHMEM ||
		(idx == ZONE_MOVABLE && zone_movable_is_highmem()));
#else
	return 0;
#endif
}
When CONFIG_HIGHMEM is not defined, is_highmem_idx() (and PageHighMem() derived from it) return zero unconditionally. Any code within functions that follows the "if (PageHighMem(page))" pattern will be automatically optimized away as dead code.
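To make this concrete, here is a hypothetical caller following that pattern; the function is invented for illustration and is not taken from the kernel source. With CONFIG_HIGHMEM unset, PageHighMem() evaluates to a constant zero at compile time, so the whole highmem branch, including the kmap() and kunmap() calls, never makes it into the object file:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Hypothetical example, not taken from the kernel source. */
static void copy_from_page(struct page *page, void *dst, size_t len)
{
	if (PageHighMem(page)) {
		/* dead code when CONFIG_HIGHMEM is not set */
		void *src = kmap(page);

		memcpy(dst, src, len);
		kunmap(page);
	} else {
		memcpy(dst, page_address(page), len);
	}
}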
But this works only because is_highmem_idx() is marked static; to avoid duplication of that function everywhere mmzone.h is included it has to be marked inline too. Those optimizations only work within a single compilation unit, missing out on opportunistic dead code elimination across different compilation units. So, what can we do short of concatenating all source files into a single one and making everything static to achieve the full benefit?
As mentioned previously, the core-kernel APIs are split into different C files for ease of maintenance. Those files may provide functions that are not called when some unwanted feature is configured out. It could be argued that the unused core functions should be #ifdef'd in or out along with their call sites, but this gets hairy when multiple features sharing the same core API may be configured in and out independently. To complicate things further, those core functions might be called within "if (PageHighMem(page))" blocks showing no directly visible relationship with a configuration option. So there are limits to how much unused code can be removed by the compiler; doing a more thorough job requires a tool like link-time optimization.
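A small made-up example shows the limitation; the file names, the FEATURE_X macro, and both functions are invented for illustration only:

/* core_helper.c: a "core API" function kept in its own file */
int expensive_fixup(int x)
{
	return x * 3 + 1;
}

/* feature_user.c: the feature is configured out in this build */
#define FEATURE_X 0

extern int expensive_fixup(int x);

int process(int x)
{
	if (FEATURE_X)		/* constant zero: this branch is dead code */
		x = expensive_fixup(x);
	return x;
}

Compiling each file on its own, the compiler happily removes the dead branch from process(), but expensive_fixup() survives in core_helper.o because nothing tells the compiler that its last caller has just disappeared. Removing it requires a whole-program view.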
Link-time optimization (LTO)
What is it? LTO is a compilation mode that instructs the compiler to parse the code into its abstract internal representation as usual, but to store that representation directly into the resulting object file, without optimization, rather than optimizing and assembling it into final machine instructions. Then, at link time, when all the different object files are gathered together, the compiler intercepts the link process and reloads that internal representation from all of the object files at once; only then does it perform its optimization passes, on the whole program, before the actual link. So it is basically as if all of the source files had been concatenated into a single one with everything marked static. Great, that is exactly what we wished for.
Let's see how this works in practice with our little example program by having each function in its own file:
$ gcc -O2 -flto -c foo.c
$ gcc -O2 -flto -c bar.c
$ gcc -O2 -flto -c main.c
$ gcc -O2 -flto -o test foo.o bar.o main.o
$ nm test | grep "foo\|bar\|main"
000102c0 T main
$ objdump -d test
[...]
000102c0 <main>:
   102c0:	e3a00005 	mov	r0, #5
   102c4:	e12fff1e 	bx	lr
As expected, the result is the same as our earlier test despite having separate source files containing non-static functions.
LTO and the kernel
LWN first covered LTO for the kernel more than five years ago. Since then, things have improved a lot. LTO still isn't supported in the mainline, but Andi Kleen's kernel LTO patch set has become much simpler as fixes for basic code-correctness issues, which LTO is pickier about, have been merged upstream, and many LTO bugs in GCC have been fixed.
One of the biggest LTO showstoppers for the kernel had to do with the fact that a special version of binutils was required. The kernel used to rely solely on partial linking (ld -r) when recursively gathering subdirectory build results, however ld -r doesn't support objects with LTO data unless binutils is patched to do so. And it was rather unlikely that the necessary patch would ever be merged in the upstream binutils tree. Nowadays the kernel build system can use thin archives instead of ld -r, making LTO of the kernel possible with the upstream tools that most distributions ship.
LTO also has a big advantage over link-time garbage collection: it does not require a separate linker section for each exception-table entry, and it does not suffer from the "backward reference" and "missing forward reference" problems described in the previous article. When the compiler optimizes code away, the exception-table entries instantiated by that code are simply dropped along with it.
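As a rough illustration (a minimal sketch using an invented section name and helper, not the kernel's actual uaccess macros), an accessor of this kind emits its table entry with a .pushsection directive right next to the instruction it annotates, so the entry only exists where the code does:

/* Hypothetical accessor: each emitted copy adds one entry to a
 * made-up "__my_ex_table" section pointing at the load instruction.
 * If LTO drops the function (or an inlined copy of it), the entry
 * is never emitted either. */
static inline int annotated_load(const int *p)
{
	int val;

	asm volatile(
	"1:	ldr	%0, [%1]\n"
	"	.pushsection __my_ex_table, \"a\"\n"
	"	.long	1b\n"
	"	.popsection\n"
	: "=r" (val)
	: "r" (p));

	return val;
}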
Numbers please!
Let's not forget that our end goal is to fit Linux into tiny systems. So it is about time we looked at actual kernel-size numbers. Let's pick the STM32 target, which represents the kind of tiny system we're aiming for. The advantage here is that mainline Linux already runs on most STM32 microcontrollers, albeit with external RAM. The baseline kernel version is v4.15-rc1 plus the LTO patches.
First, with LTO disabled:
$ make stm32_defconfig
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1704024  144732  117660 1966416  1e0150 vmlinux
And with LTO enabled:
$ ./scripts/config --enable CONFIG_LTO_MENU
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1281644  142492  112985 1537121  177461 vmlinux
This is a 22% size reduction right there. For completeness, let's see how link-time garbage collection as described in the previous article fares:
$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ make vmlinux
$ size vmlinux
   text    data     bss     dec     hex filename
1304516  141672  113108 1559296  17cb00 vmlinux
Here we get a 21% size reduction. However, this comes with a big disclaimer due to the following hacks:
- No KEEP() statements were added to the ARM linker file as required. Worse: the ASSERT() statements about missing processor and architecture tables have been disabled for the sake of successful compilation. This means important pieces of code and data are missing from this kernel.

- The ARM unwinding facility needed for function backtraces has been forcefully disabled as it also contained a reference to every function, making garbage collection ineffective. So, unlike the LTO-built kernel, this one would lack an important debugging facility.
Of course those hacks produce a non-functional kernel. Still, the size reduction is slightly smaller than what LTO provides, and it would be smaller yet if proper link-time garbage-collection support were implemented. And optimal link-time garbage collection, as described in the previous article, is far more invasive than LTO. We therefore have a clear winner here.
One could wonder if size reduction could improve further by combining both link-time optimization and link-time garbage collection. The answer is no since, once LTO has removed every piece of dead code, there is simply nothing left to garbage-collect.
More numbers
So LTO seems to be the best thing since sliced bread, right? Well, it has drawbacks of its own. The most significant is build time. Let's repeat the above kernel compilation sequence to see what we get.
First with LTO disabled:
$ make clean
$ make stm32_defconfig
$ time make -j8 vmlinux
real    0m36.645s
user    3m59.252s
sys     0m21.026s
And with LTO enabled:
$ make clean
$ ./scripts/config --enable CONFIG_LTO_MENU
$ time make -j8 vmlinux
real    1m24.774s
user    8m4.143s
sys     0m31.902s
LTO roughly doubles the CPU time and takes 2.3 times as long in wall-clock time to build the kernel. Performing code optimization only at the very end creates a big serialization point; a traditional non-LTO build compiles and optimizes individual source files concurrently.
But the most annoying case, at least for a kernel developer, is partial rebuild time after some source-code modifications. Without LTO we get:
$ touch init/main.c
$ time make -j8 vmlinux
real    0m3.686s
user    0m5.803s
sys     0m1.819s
And with LTO enabled this becomes:
$ touch init/main.c
$ time make -j8 vmlinux
real    0m58.283s
user    5m6.089s
sys     0m12.732s
A partial build with LTO takes about 15 times longer than in the non-LTO case, and is not very far from the full build time. LTO is therefore clearly not suitable for frequent debug/rebuild/test cycles.
And for completeness:
$ make clean
$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ time make -j8 vmlinux
real    0m37.572s
user    3m58.826s
sys     0m21.616s
More or less the same result as our initial build. Clearly, link-time garbage collection is basically free in terms of build time, which is its biggest (and perhaps only) advantage.
Test-build environment details:

- GCC version 6.3.1 20170404 (Linaro GCC 6.3-2017.05)
- Intel® Core™ i7-4770R CPU @ 3.20GHz
- Samsung SSD 850 EVO 500GB
Conclusion
We have two approaches for automatic kernel-size reduction at our disposal, each with a different set of compromises. However, the advantage is clearly on the LTO side when considering maintenance costs and intrusiveness, and the build time becomes tolerable when building very small kernels anyway. But did we manage to get a "very small kernel"? A kernel still above the one-megabyte mark cannot realistically be qualified as "very small" or even "tiny" yet. Clearly, automatic size reduction alone won't be sufficient, so more assertive approaches will be required to achieve our goal. That will be the subject of the next article.
Meanwhile, anybody wanting to play with LTO on their own kernel in the short term should start with these instructions found in Kleen's patch set.
The next article in this series is Shrinking the kernel with an axe.
Index entries for this article:
Kernel: Build system
Kernel: Optimization tools
GuestArticles: Pitre, Nicolas
Posted Jan 18, 2018 21:18 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
How much less bad does this get when parallelizing LTO like new-ish GCCs can? Or is this already with parallelization enabled?
Posted Jan 18, 2018 22:18 UTC (Thu)
by npitre (subscriber, #5680)
[Link] (1 responses)
Posted Jan 18, 2018 22:25 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
Too bad. Would be interesting, but I definitely see why you wouldn't necessarily want to measure given it's yet another patchset, how that compares to LLVM's thinlto implementation.
Posted Jan 18, 2018 21:25 UTC (Thu)
by intgr (subscriber, #39733)
[Link] (1 responses)
Posted Jan 19, 2018 2:38 UTC (Fri)
by gutschke (subscriber, #27910)
[Link]
In other words, there almost certainly will be a performance improvement. But my gut feeling is that the numbers are going to be so small, most users would not be able to notice.
Posted Jan 19, 2018 12:11 UTC (Fri)
by AnimeLife (guest, #116014)
[Link] (1 responses)
Posted Jan 19, 2018 12:22 UTC (Fri)
by willy (subscriber, #9762)
[Link]

There are of course other optimisations for squeezing more systems into the same amount of hardware, but tinification is one tool.
Posted Jan 20, 2018 6:55 UTC (Sat)
by ndesaulniers (subscriber, #110768)
[Link]

https://lkml.org/lkml/2017/11/15/772
Posted Jan 21, 2018 14:30 UTC (Sun)
by ncm (guest, #165)
[Link] (3 responses)
(Just recalling how happy three of us programming on an LSI-11 were when we got a 20M "Winchester" disk. We had 128K RAM, plus (significantly) 4K in each terminal.)
It probably is also worth mentioning that, in a terrifyingly short time, every ordinary machine will have multiple TB of non-volatile RAM, and no disk or SSD, and all performance-tuned code will be obsolete. Any network packet propagation latency will suddenly become an eternity, and the only remaining performance bottleneck masking inefficiencies in query processing code.
Posted Jan 23, 2018 14:32 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (2 responses)

I managed to build chromium with LTO on my puny personal machine
Posted Jan 26, 2018 22:21 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (1 responses)
How puny? (Real memory, swap space)
Probably bigger than the server I usually use to build kernels.
Posted Jan 29, 2018 17:11 UTC (Mon)
by jezuch (subscriber, #52988)
[Link]
Building gcc itself, on the other hand...
Posted Jan 22, 2018 15:01 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (2 responses)
If I understand correctly, dead code elimination is hindered by C++. For example, any virtual function will automatically be referred to in the vtable and it is not possible to strip any virtual methods from the final executable. Or any virtual methods in derived types regardless of use if any part of the type hierarchy is referred to at all. I suppose it might be possible for a linker to rewrite vtables to drop the functions (or maybe leave a dangling pointer in the vtable...) but as I understand it, this isn't done unless the compiler has all the information - e.g. when writing basic examples where everything is in a single module.
So when I run basic examples using -O3 the aggressive inlining seems to handle it fine (e.g. https://godbolt.org/g/Rn8GH1). It's not clear if this is working ok in the small cases but wouldn't be expected to work in the larger case - or if my mental model is just incorrect.
Posted Jan 23, 2018 1:43 UTC (Tue)
by inphi (guest, #120554)
[Link]
Posted Feb 2, 2018 4:27 UTC (Fri)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 23, 2018 15:14 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (7 responses)

GCC developers seem to be disinterested in LTO bugs, and for its antecessor (-fwhole-program --combine) they outright said it won't get fixed.
Posted Jan 23, 2018 17:03 UTC (Tue)
by peter-b (subscriber, #66996)
[Link] (6 responses)
Posted Jan 24, 2018 16:15 UTC (Wed)
by ncm (guest, #165)
[Link] (5 responses)
It is a matter of respect. When one person puts in substantial efforts to improve the kernel in a way that is important to them and to many other users, a trivial effort not to massively break those improvements would hint that you have something better than contempt for the people who do the work.
To complain that LTO doesn't work would be to demand people work for you for free. Making LTO work is contributing. Improvements of any kind typically take far more effort than it is worth to the individual, who is doing the rest of us that favor. Minimal effort not to break others' work is necessary to a healthy project.
Posted Jan 25, 2018 12:58 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
Look up my contributions, if you so desire… if you find them all, I know I personally *don’t* even know all places I’ve had my fingers in over the last decades.
It’s just, compilers isn’t what I do well. I’ve patched bootloaders, I’ve got fixes in all Linux libcs except musl (dalias does a great enough job for me to not find any bugs in musl so far), and I’ve been doing tons of other work, and I’m even now expanding to other stuff.
There’s just not enough hours in a day, considering I have a regular, boring $dayjob. Do you wish to sponsor me for a year so I can take a sabbatical and work only on OSS?
Posted Jan 26, 2018 22:20 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (3 responses)
ncm didn't say you don't contribute enough. He said at most that it would be preferable for you to fix GCC (repeatedly, apparently) than to complain that it keeps breaking. (And that's not saying you should fix GCC).
And even that is based on some reading between the lines about the moral value of demanding versus contributing, and the idea that complaining about something is demanding that someone fix it. I don't view complaining that way; for example, I complain about the weather all the time without meaning to criticize anyone or demand that someone fix it. I complain about the presence of ads on Youtube the same way.
Posted Jan 26, 2018 22:46 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link] (2 responses)
I actually find the idea of using LTO to eliminate dead code, in the Linux kernel or elsewhere, great — I just wanted to point out that GCC might, with its current bugs, history of bugs, and history of attitude towards said bugs¹, be a tad too unreliable to do so without excessive tests that point out miscompiled builds.

① I read "low-hanging fruits" in an LWN article today. One of these, for the GCC/LTO problem, would be to make building mksh part of the usual pre-release tests; mksh has a history of spotting compiler, toolchain, libc, etc. bugs via its testsuite.

Now, with both the footnote #1 and the first paragraph, let's get to something: isolating the issue is *hard*. The mksh testsuite is a bunch of shell scripts together with flags and expected output, with a Perl driver, ran through the shell compiled with the to-be-tested compiler/toolchain/libc. That's a few levels of indirection. The latest LTO bug occurs in only one testcase: arith-ternary-prec-1, which is:

$ mksh -c 'typeset -i x=2; y=$((1 ? 20 : x+=2))'
mksh: 1 ? 20 : x+=2: += requires lvalue

Basically, ?: binds more than +=, so this is '(1 ? 20 : x) += 2', and a miscompiled shell silently accepts this. This is *very* hard to isolate.

GCC developers prefer isolated small test cases. Now, with LTO, isolating gets even more complicated. I can accept that not having a small isolated test case is not desirable.

On the other hand, a change in testsuite output between two different versions of the same compiler, ceteris paribus (i.e. you try the same version of the testsuite, shell, toolchain, libc, …), *does* indicate a problem (not necessarily in the compiler, but it's a prime suspect), and in the time of "git bisect" it's at least often possible, for someone with enough beefy hardware to actually build GCC that often, to figure out which compiler change introduced the breakage. (Then, it's still a matter of deciding whether the bug is actually in the compiler or elsewhere, but the GCC developers at least know their compiler, and each other on the development team.)

Oh, and: the Linux kernel does not have such a testsuite. Several GNU distributions' mksh package maintainers have come to me, independently, with a testsuite failure report about the above test, and the advice found after the first analysis (LTO is at fault, GCC miscompiled mksh) made them compile mksh without LTO, preventing their users from getting a faulty binary that might misbehave in other situations as well. Now, the Linux kernel, not so much.

Food for thought?
Posted Feb 8, 2018 10:49 UTC (Thu)
by dharding (subscriber, #6509)
[Link] (1 responses)
Posted Feb 8, 2018 20:03 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link]
And, yes, sorry, I don't have even the beginning of an answer for you, but someone with enough horsepower machine could certainly bisect this between GCC 5 and 6 I think…
Posted Jan 29, 2018 14:32 UTC (Mon)
by jezz (subscriber, #59547)
[Link] (1 responses)
I wrote a script few years ago to detect unused struct members. I extracted information from debug symbols. Unfortunately, this script seems lost.
Posted Feb 2, 2018 18:51 UTC (Fri)
by cesarb (subscriber, #6266)
[Link]
Posted Feb 25, 2019 20:05 UTC (Mon)
by praveenv98 (guest, #116175)
[Link]
In this passage :
"When CONFIG_HIGHMEM is not defined, is_highmem_idx() (and PageHighMem() derived from it) return zero unconditionally. Any code within functions that follows the "if (PageHighMem(page))" pattern will be automatically optimized away as dead code.
But this works only because is_highmem_idx() is marked static; ".
Why does this work only for static functions? The static keyword only implies internal linkage, right?
Thanks