Kernel optimization with BOLT
A pair of talks in the toolchains track at the 2024 Linux Plumbers Conference covered different tools that can be used to optimize the kernel. First up was Maksim Panchenko to describe the binary optimization and layout tool (BOLT) that Meta uses on its production kernels. It optimizes the kernel binary by rearranging it to improve its code locality for better performance. A subsequent article will cover the second talk, which looked at automatic feedback-directed optimization (AutoFDO) and other related techniques that are used to optimize Google's kernels.
Panchenko began with a slide showing a handful of companies and projects that use BOLT, which can be seen in his slides or the YouTube video of the talk. It was initially designed for large applications at Meta, but it turned out to also accelerate compilers. So, for example, it has been used to build Python since version 3.12; it is also used by LLVM, Rust, and others.
If you look back to ten years ago in the open-source world, getting the maximum performance from an application was a matter of using GCC or Clang with -O3, profile-guided optimization (PGO), and link-time optimization (LTO). But applying PGO was "kind of painful", he said, so only those who cared a lot about performance were using it. Applying PGO to the kernel was even more painful, so most companies just used -O2 or -O3 on their kernels.
The main Meta application is a virtual machine running PHP code; around 2015, developers there came up with the idea of a binary optimizer that would work with both GCC and Clang to speed up the generated code. BOLT turned out to exceed developers' expectations; it gave large gains and was able to accelerate the compilers themselves. It has been used at Meta since 2016; "most of the cycles are spent in binaries optimized by BOLT" throughout the Meta fleet. BOLT was released as open source, under the Apache 2.0 license, in 2018 and has been part of LLVM since 2022.
![Maksim Panchenko](https://static.lwn.net/images/2024/lpc-panchenko-sm.png)
Generally, developers think about how the data structures for their programs will be arranged in memory; it is less common for them to consider how the code is arranged. There are exceptions, including the Linux kernel developers, but most times the focus is on the data cache. The instruction cache is much smaller than the data cache and has not grown much over time, maybe doubling from 32KB to 64KB on Intel CPUs over the last 20 years. But, for large applications that do not spend most of their time in tight loops, the layout of code in memory matters a lot, so BOLT can make a major difference, he said.
Compilers do not have enough information to optimally place code, even when they have profiling information; it turns out that inlining functions changes the profile. The profile information may point to a function foo() that is called a lot, but once it is inlined, that information is no longer present. BOLT operates on the binary to observe the code that is being frequently executed so that all of it can be placed close together in memory. Since the BOLT paper was released in 2019, he said, other efforts, including Propeller, have come about; they are generally tied to a single toolchain, though, while BOLT can be used with GCC or Clang and with different linkers.
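As a purely hypothetical illustration of that attribution problem (the function names below are invented), consider a small helper that the compiler inlines into its callers:

```c
/* Hypothetical example: once foo() is inlined, profile samples that would
 * have been attributed to foo() show up under bar() and baz() instead, so
 * a later compilation pass can no longer tell that foo()'s body is the
 * hot code. */
static inline int foo(int x)
{
	return x * x + 1;		/* the actual hot work */
}

int bar(int x) { return foo(x) + 2; }	/* foo() inlined here */
int baz(int x) { return foo(x) - 2; }	/* ... and here */
```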
He showed a sort of heat map of the memory layout of the HHVM runtime, which is what is used by Meta for most of its workloads. One image showed hot code spread all over, while the post-BOLT image showed all of the hot code confined to the same small region of memory. That drastically reduces instruction-cache misses, translation-lookaside buffer (TLB) misses, and CPU time; for HHVM, BOLT produced a 7% performance increase, Panchenko said.
BOLT is a post-link optimizer that runs on ELF binaries, such as vmlinux. Even though it is part of the LLVM project, BOLT still supports GCC-compiled code; it also supports the "most popular" architectures: x86_64, Arm64, and RISC-V.
Applying BOLT to the kernel took some time, mostly in the form of ensuring that the resulting kernel would run and not crash. One of the big problems encountered was in finding a good benchmark to use to measure the impact of BOLT. There are lots of different micro-benchmarks available, but "I couldn't find any scalable, large-scale benchmark". In the end, he used the RocksDB db_bench fillseq benchmark, which showed a 2.5% improvement just by switching to a BOLT-optimized kernel. He and others at Meta also ran one of the company's main services on a BOLT-optimized kernel, which produced a 2% queries-per-second (QPS) improvement, "which was quite significant".
BOLT only changes branches and the location of the basic blocks in the binary to achieve its improvements. Most of the time, there is no need to recompile applications in order to apply BOLT; the exception is programs built with split functions, a job that BOLT can do better than the compiler. The application does need to be relinked in order to produce relocation information. With that, and a profile of the kernel running a representative workload, BOLT will only take around four seconds to optimize vmlinux, he said.
The profile can be generated with a variety of mechanisms, including the last branch record (LBR) feature on Intel platforms and similar branch-sampling features on other architectures. If that is not available, the code can be instrumented to gather the needed information, but that has higher overhead than using LBR and the like. There are other options as well, but the profile quality, and thus the quality of the BOLT optimizations, will not be as good.
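For the curious, the sketch below shows, in C, the kernel interface that this kind of branch-stack sampling sits on top of; it is not BOLT code, and the event parameters are only illustrative. In practice, the profile is gathered with perf record and converted with the perf2bolt tool.

```c
/*
 * Minimal sketch (not BOLT's own code): open a hardware-cycles event that
 * asks the kernel for branch-stack (LBR-style) samples via
 * perf_event_open(2).  Tools like perf record sit on top of this interface.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100003;	/* sample roughly every 100k cycles */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_KERNEL;

	/* Monitor everything running on CPU 0; needs sufficient privileges. */
	int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* The samples (and their branch stacks) would be read from the
	 * event's mmap() ring buffer, which is omitted here. */
	close(fd);
	return 0;
}
```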
BOLT needs an unstripped vmlinux binary because it uses the symbol names for code discovery, Panchenko said. BOLT can easily identify the boundaries of code and data in the ELF binary using the text and data segments that are defined. The segments are further divided into sections and BOLT uses the symbol-table information to identify the individual functions therein.
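To give a rough idea of what that discovery step involves, here is an illustrative libelf sketch (not BOLT's implementation) that lists the function symbols, and therefore the function boundaries, in an unstripped binary:

```c
/*
 * Illustrative sketch using libelf (build with -lelf): print every STT_FUNC
 * symbol in the given ELF file along with its address and size, which is
 * the same information a binary rewriter needs to find function boundaries.
 */
#include <gelf.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2 || elf_version(EV_CURRENT) == EV_NONE)
		return 1;

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	Elf *e = elf_begin(fd, ELF_C_READ, NULL);
	if (e == NULL)
		return 1;

	Elf_Scn *scn = NULL;
	while ((scn = elf_nextscn(e, scn)) != NULL) {
		GElf_Shdr shdr;
		if (gelf_getshdr(scn, &shdr) == NULL ||
		    shdr.sh_type != SHT_SYMTAB)
			continue;

		Elf_Data *data = elf_getdata(scn, NULL);
		size_t nsyms = shdr.sh_size / shdr.sh_entsize;
		for (size_t i = 0; i < nsyms; i++) {
			GElf_Sym sym;
			if (gelf_getsym(data, i, &sym) == NULL)
				continue;
			if (GELF_ST_TYPE(sym.st_info) != STT_FUNC ||
			    sym.st_size == 0)
				continue;
			printf("%-40s 0x%llx %llu bytes\n",
			       elf_strptr(e, shdr.sh_link, sym.st_name),
			       (unsigned long long)sym.st_value,
			       (unsigned long long)sym.st_size);
		}
	}
	elf_end(e);
	close(fd);
	return 0;
}
```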
Then BOLT disassembles the functions, though, unlike objdump, it will "symbolize the operands of the instructions". Distinguishing between constants and addresses in the instructions can be a problem, however. The relocation information that is inserted by the linker helps with that; it can be used to "effectively do symbolic disassembly" of the code.
The resulting instruction stream can be optimized using normal techniques, such as peephole optimization, but "this is not very efficient". In order to do more optimizations on the code, an intermediate representation (IR) of some kind is needed. The IR used is "essentially a control-flow graph on top of MC instructions", Panchenko said; he was referring to the LLVM machine code (MC) project, which provides a number of tools and libraries that BOLT uses. The instructions look much like the assembly code, but some may be annotated or modified to identify tail calls versus other kinds of jumps, for example. "So if you look at BOLT disassembly, you will have a much better idea of what's happening in your application compared to regular objdump".
BOLT uses the profile information for the basic blocks in the control-flow graph in order to determine where the hot code resides. To make the best code-layout decisions, though, having weights on the edges, rather than just execution counts for the basic blocks, would be useful. The LBR profiling can provide that information, but BOLT can recover some information about edge weights even without it.
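As a sketch of why edge weights matter, the hypothetical data structures below (not BOLT's actual IR) attach execution counts to blocks and weights to edges, with a crude fallback estimate for the case where only block counts are available:

```c
/*
 * Hypothetical CFG data structures (not BOLT's IR): blocks carry execution
 * counts from the profile, edges carry how often each branch was taken.
 */
#include <stdint.h>
#include <stddef.h>

struct basic_block;

struct cfg_edge {
	struct basic_block *dest;	/* successor block */
	uint64_t weight;		/* times this branch was taken (from LBR) */
};

struct basic_block {
	uint64_t start_addr;		/* address of the first instruction */
	uint64_t exec_count;		/* profile samples landing in the block */
	struct cfg_edge *succs;		/* outgoing edges */
	size_t nsuccs;
};

/*
 * Without branch records, an edge weight can only be approximated from the
 * counts of its endpoints; with LBR-style samples it is measured directly.
 */
static uint64_t edge_weight_estimate(const struct basic_block *from,
				     const struct basic_block *to)
{
	return from->exec_count < to->exec_count ? from->exec_count
						 : to->exec_count;
}
```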
Then the graph is used to optimize the code; "The main optimization that gives us most of the gains is code reordering." The basic blocks can be grouped together to reduce the instruction-cache footprint of the code. That is done by breaking up the functions into fragments, some of which are hot code and others that are rarely or never executed (e.g. error-handling code). Compilers already do reordering, but on the function level; BOLT takes it one step further and reorders these function fragments.
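The made-up C function below shows the shape of code that benefits from this kind of splitting: a short hot path next to bulky error handling that almost never runs (the __builtin_expect() hint is what the kernel's likely()/unlikely() macros expand to):

```c
#include <stdio.h>

/* Illustrative only: the hot path is a couple of instructions, while the
 * error handling is large but rarely executed, so the two fragments can be
 * placed far apart without hurting the instruction cache. */
int process(int fd, const char *buf, int len)
{
	if (__builtin_expect(fd < 0 || buf == NULL || len <= 0, 0)) {
		/* Cold fragment: error handling. */
		fprintf(stderr, "process: bad arguments (fd=%d, len=%d)\n",
			fd, len);
		return -1;
	}

	/* Hot fragment: should be packed together with other hot code. */
	return len;
}
```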
Once the new code is generated, there is a question about where to put it, he said. Unless there is a big concern about disk space, it is more efficient to simply create a new text segment that contains the hot code, which will generally be quite a bit smaller (for a binary that is huge, "hundreds of megabytes", 20MB or less of hot code would end up in the new segment). So it is not much overhead, in terms of disk space, and "you get a much faster application".
Adding another text segment to the kernel binary may not be viable, he said, so he turned to an alternative that is "much more feasible". BOLT will simply rewrite the existing functions in the binary; those functions are already ordered by the compiler based on its analysis, so BOLT effectively uses that. BOLT could do a bit better function ordering based on the profile, but that small gain can be sacrificed in order to easily get the basic-block reordering, he said. It also helps avoid over-specializing the code for only the workload measured by the profile; for other workloads that were not measured, some of the cold code will be executed, but it will still be located nearby due to the compiler choices.
The kernel provides other challenges, because its code is modified at boot time and also while it is running. He spent a good chunk of the talk going into the details of how that type of code is handled. Things like static calls, SMP locks (that are patched out on uniprocessor systems), static keys, and alternative instructions for different subarchitectures are handled with annotations in the disassembled code, which is part of the metadata that BOLT uses to do its job. For some, BOLT simply does not operate on them, while others have mechanisms that the optimizer can use on them. All of that metadata can be dumped using BOLT tools.
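As one concrete example of that run-time patching, the kernel's static-key API compiles a condition down to a single instruction (a nop or a jump) that is rewritten in place when the key is toggled; the fragment below is a simplified, illustrative use of that API, with placeholder helper functions:

```c
#include <linux/jump_label.h>

/* Placeholder helpers for illustration only. */
extern void do_optional_work(void);
extern void do_common_work(void);

static DEFINE_STATIC_KEY_FALSE(my_feature_key);

void my_hot_path(void)
{
	/*
	 * This compiles to a nop (or an unconditional jump) that the
	 * jump-label machinery patches at run time; a binary rewriter has
	 * to know about such sites in order not to break them.
	 */
	if (static_branch_unlikely(&my_feature_key))
		do_optional_work();

	do_common_work();
}

/* Elsewhere, enabling the feature patches every such branch site:
 *	static_branch_enable(&my_feature_key);
 */
```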
Panchenko said that he was skipping over some topics, such as continuous profiling, which can be applied to BOLT-optimized binaries as they run; a new version of a binary can be produced that will reflect changes in the code and the workload. He also did not cover any of the other optimizations that BOLT applies. He finished by showing some of the output that running BOLT produces, noting that a four-second demo of it operating would not be all that interesting.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
| Index entries for this article | |
| --- | --- |
| Kernel | Optimization tools |
| Conference | Linux Plumbers Conference/2024 |
Grope
Posted Oct 26, 2024 0:33 UTC (Sat) by willy (subscriber, #9762)

https://lwn.net/1998/1029/als/rope.html

(Not sure why it never got released ...)

And, damn, that name and the "jokes" being made ... I think we're a bit better now.
Grope
Posted Oct 26, 2024 8:08 UTC (Sat) by atnot (subscriber, #124910)

yeah this whole thing is just one yikes after another. jesus christ. As uncomfortable as I still am visiting events like this today, at least it's no longer... whatever this is.
Grope
Posted Oct 26, 2024 10:01 UTC (Sat) by intelfx (subscriber, #130118)

> I'd like to please stay as far away from you as possible.

The feeling is thus mutual.
Grope
Posted Nov 7, 2024 15:15 UTC (Thu) by paulj (subscriber, #341)

I.e., the set of young people who enjoyed mildly vulgar/shocking-to-norms puns then, are mostly the same set of people as the older set who today find it juvenile.
Intriguing
Posted Oct 26, 2024 12:35 UTC (Sat) by jd (guest, #26381)
But one optimisation can potentially interact with another optimisation, and optimal binary reordering may be affected by compiler optimisation which may in turn be potentially affected by optimal binary reordering.
I'm trying to figure out from this article how, exactly, you get the most out of this.
I'd also be intrigued to know if this technique could be used effectively with the Verified Software Toolchain. VST is fine for producing provably correct binaries, but there's obvious drawbacks to this - there's not a whole lot of optimising you can do and still be certain the binaries are correct.
If you can greatly accelerate VST-produced binaries without impacting the proof of correctness in any way, I could imagine scenarios where this could actually be useful.
Other BOLT weirdness
Posted Oct 27, 2024 4:35 UTC (Sun) by kmeyer (subscriber, #50720)

https://github.com/llvm/llvm-project/blob/7b88e7530d4329f...

Also, this is ... kind of insane, for library source code? You can use the compile-time feature detection support and identify that the compiler target supports AVX512 (-mavx512 or whatever). You can use runtime cpuid feature-bit detection and identify that the running CPU supports AVX512. But if your binaries have been through BOLT with the -trap-avx512 flag, your AVX-accelerated function will just trap with a ud2 instruction.

If you find yourself needing to detect, at runtime, this BOLT bastardization of the binary, this ugly hack seems to work: https://github.com/facebook/folly/blob/d5e10f9d076838374f...
Other BOLT weirdness
Posted Oct 27, 2024 5:57 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)
AVX-512 is stupid. It doesn't work on efficiency cores on Alder Lake. Even though P-cores support it.
Other BOLT weirdness
Posted Oct 27, 2024 6:59 UTC (Sun) by intelfx (subscriber, #130118)
Perhaps it rather means that Alder Lake is stupid?
Other BOLT weirdness
Posted Nov 1, 2024 18:20 UTC (Fri) by anton (subscriber, #25547)

The P-cores of Alder Lake don't support AVX-512 (implemented but disabled), either, unless you are using some early firmware. It's a pity that Intel completely disabled that, even in Xeon-E24xx CPUs where the E-cores are disabled. But don't worry, buy an AMD CPU with a Zen4 or Zen5 core, and you will get AVX-512.
Cache sizes
Posted Nov 1, 2024 18:46 UTC (Fri) by anton (subscriber, #25547)

The claims made in the article (maybe in the talk) about cache sizes are mostly wrong.
I-cache and D-cache typically have similar size, and if they differ, it's not always the D-cache that is larger. E.g., Zen4 has 32KB I-cache and 32KB D-cache, Zen5 and Raptor Cove have 32KB I-cache and 48KB D-cache, and Gracemont has 64KB I-cache and 32KB D-cache.
The sizes of L1 caches generally have not grown in the last 20 years; e.g., the 2003 Athlon 64 has 64KB I-cache and 64KB D-cache, and the 2003 Pentium M has 32KB I-cache and 32KB D-cache. Instead, they have added an L3 cache since that time. A number of cores have a microoperation cache in addition to the I-cache, but the sizes are hard to compare.
Cache sizes
Posted Nov 5, 2024 8:06 UTC (Tue) by anton (subscriber, #25547)

The L2 cache of many cores is dedicated to the core, too (e.g., on Intel's P-cores for over a decade and on AMD's Zen-Zen5 cores).
The reason for keeping the L1 cache small is latency. If the cache grows, the miss rate decreases, but the latency increases. You can see the longer latency nicely in the comparison of L2 sizes and latencies in this article.
One reason is that the wires get longer, which increases the time that signals travel.
You also want to use a virtually-indexed physically-tagged (VIPT) cache as L1 cache, which allows to perform the TLB access and the cache access in parallel, i.e., with low latency. But that means that the size of a cache way is at most as large as a page; the number of ways is limited (you typically don't see more than 16-way set-associative caches, and a lower number of ways is common in L1 caches), the page size is 4KB on AMD64, which limits the L1 cache sizes to 64KB (and 32KB or 48KB is more common). Apple's Firestorm (M1 P-core) has larger caches (192KB I-cache, 128KB D-cache), but also 16KB pages, which allows a VIPT cache implementation with a 12-way (I) or 8-way (D) set-associative cache.