Kernel optimization with BOLT
A pair of talks in the toolchains track at the 2024 Linux Plumbers Conference covered different tools that can be used to optimize the kernel. First up was Maksim Panchenko to describe the binary optimization and layout tool (BOLT) that Meta uses on its production kernels. It optimizes the kernel binary by rearranging it to improve its code locality for better performance. A subsequent article will cover the second talk, which looked at automatic feedback-directed optimization (AutoFDO) and other related techniques that are used to optimize Google's kernels.
Panchenko began with a slide showing a handful of companies and projects that use BOLT, which can be seen in his slides or the YouTube video of the talk. It was initially designed for large applications at Meta, but it turned out to also accelerate compilers. So, for example, it has been used to build Python since version 3.12; it is also used by LLVM, Rust, and others.
If you look back to ten years ago in the open-source world, getting the maximum performance from an application was a matter of using GCC or Clang with -O3, profile-guided optimization (PGO), and link-time optimization (LTO). But applying PGO was "kind of painful", he said, so only those who cared a lot about performance were using it. Applying PGO to the kernel was even more painful, so most companies just used -O2 or -O3 on their kernels.
The main Meta application is a virtual machine running PHP code; around 2015, developers there came up with the idea of a binary optimizer that would work with both GCC and Clang to speed up the generated code. BOLT turned out to exceed developers' expectations; it gave large gains and was able to accelerate the compilers themselves. It has been used at Meta since 2016; "most of the cycles are spent in binaries optimized by BOLT" throughout the Meta fleet. BOLT was released as open source, under the Apache 2.0 license, in 2018 and has been part of LLVM since 2022.
![Maksim Panchenko](https://static.lwn.net/images/2024/lpc-panchenko-sm.png)
Generally, developers think about how the data structures for their programs will be arranged in memory; it is less common for them to consider how the code is arranged. There are exceptions, including the Linux kernel developers, but most times the focus is on the data cache. The instruction cache is much smaller than the data cache and has not grown much over time, maybe doubling from 32KB to 64KB on Intel CPUs over the last 20 years. But, for large applications that do not spend most of their time in tight loops, the layout of code in memory matters a lot, so BOLT can make a major difference, he said.
Compilers do not have enough information to optimally place code, even when they have profiling information; it turns out that inlining functions changes the profile. The profile information may point to a function foo() that is called a lot, but once it is inlined, that information is no longer present. BOLT operates on the binary to observe the code that is being frequently executed so that all of it can be placed close together in memory. Since the BOLT paper was released in 2019, he said, other efforts, including Propeller, have come about; they are generally tied to a single toolchain, though, while BOLT can be used with GCC or Clang and with different linkers.
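As a purely hypothetical illustration of that attribution problem (the function names below are invented), consider a small helper that the compiler inlines into its callers:

```c
/* Hypothetical example: once foo() is inlined, profile samples that would
 * have been attributed to foo() show up under bar() and baz() instead, so
 * a later compilation pass can no longer tell that foo()'s body is the
 * hot code. */
static inline int foo(int x)
{
	return x * x + 1;		/* the actual hot work */
}

int bar(int x) { return foo(x) + 2; }	/* foo() inlined here */
int baz(int x) { return foo(x) - 2; }	/* ... and here */
```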
He showed a sort of heat map of the memory layout of the HHVM runtime, which is what is used by Meta for most of its workloads. One image showed hot code spread all over, while the post-BOLT image showed all of the hot code confined to the same small region of memory. That drastically reduces instruction-cache misses, translation-lookaside buffer (TLB) misses, and CPU time; for HHVM, BOLT produced a 7% performance increase, Panchenko said.
BOLT is a post-link optimizer that runs on ELF binaries, such as vmlinux. Even though it is part of the LLVM project, BOLT still supports GCC-compiled code; it also supports the "most popular" architectures: x86_64, Arm64, and RISC-V.
Applying BOLT to the kernel took some time, mostly in the form of ensuring that the resulting kernel would run and not crash. One of the big problems encountered was in finding a good benchmark to use to measure the impact of BOLT. There are lots of different micro-benchmarks available, but "I couldn't find any scalable, large-scale benchmark". In the end, he used the RocksDB db_bench fillseq benchmark, which showed a 2.5% improvement just by switching to a BOLT-optimized kernel. He and others at Meta also ran one of the company's main services on a BOLT-optimized kernel, which produced a 2% queries-per-second (QPS) improvement, "which was quite significant".
BOLT only changes branches and the location of the basic blocks in the binary to achieve its improvements. Most of the time, there is no need to recompile applications in order to apply BOLT; the exception is programs built with split functions, a job that BOLT can do better than the compiler. The application does need to be relinked in order to produce relocation information. With that, and a profile of the kernel running a representative workload, BOLT will only take around four seconds to optimize vmlinux, he said.
The profile can be generated with a variety of mechanisms, including the last branch record (LBR) feature on Intel platforms and similar branch-sampling features on other architectures. If that is not available, the code can be instrumented to gather the needed information, but that has higher overhead than using LBR and the like. There are other options as well, but the profile quality, and thus the quality of the BOLT optimizations, will not be as good.
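For the curious, the sketch below shows, in C, the kernel interface that this kind of branch-stack sampling sits on top of; it is not BOLT code, and the event parameters are only illustrative. In practice, the profile is gathered with perf record and converted with the perf2bolt tool.

```c
/*
 * Minimal sketch (not BOLT's own code): open a hardware-cycles event that
 * asks the kernel for branch-stack (LBR-style) samples via
 * perf_event_open(2).  Tools like perf record sit on top of this interface.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100003;	/* sample roughly every 100k cycles */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_KERNEL;

	/* Monitor everything running on CPU 0; needs sufficient privileges. */
	int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* The samples (and their branch stacks) would be read from the
	 * event's mmap() ring buffer, which is omitted here. */
	close(fd);
	return 0;
}
```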
BOLT needs an unstripped vmlinux binary because it uses the symbol names for code discovery, Panchenko said. BOLT can easily identify the boundaries of code and data in the ELF binary using the text and data segments that are defined. The segments are further divided into sections and BOLT uses the symbol-table information to identify the individual functions therein.
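To give a rough idea of what that discovery step involves, here is an illustrative libelf sketch (not BOLT's implementation) that lists the function symbols, and therefore the function boundaries, in an unstripped binary:

```c
/*
 * Illustrative sketch using libelf (build with -lelf): print every STT_FUNC
 * symbol in the given ELF file along with its address and size, which is
 * the same information a binary rewriter needs to find function boundaries.
 */
#include <gelf.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2 || elf_version(EV_CURRENT) == EV_NONE)
		return 1;

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	Elf *e = elf_begin(fd, ELF_C_READ, NULL);
	if (e == NULL)
		return 1;

	Elf_Scn *scn = NULL;
	while ((scn = elf_nextscn(e, scn)) != NULL) {
		GElf_Shdr shdr;
		if (gelf_getshdr(scn, &shdr) == NULL ||
		    shdr.sh_type != SHT_SYMTAB)
			continue;

		Elf_Data *data = elf_getdata(scn, NULL);
		size_t nsyms = shdr.sh_size / shdr.sh_entsize;
		for (size_t i = 0; i < nsyms; i++) {
			GElf_Sym sym;
			if (gelf_getsym(data, i, &sym) == NULL)
				continue;
			if (GELF_ST_TYPE(sym.st_info) != STT_FUNC ||
			    sym.st_size == 0)
				continue;
			printf("%-40s 0x%llx %llu bytes\n",
			       elf_strptr(e, shdr.sh_link, sym.st_name),
			       (unsigned long long)sym.st_value,
			       (unsigned long long)sym.st_size);
		}
	}
	elf_end(e);
	close(fd);
	return 0;
}
```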
Then BOLT disassembles the functions, though, unlike objdump, it will "symbolize the operands of the instructions". Distinguishing between constants and addresses in the instructions can be a problem, however. The relocation information that is inserted by the linker helps with that; it can be used to "effectively do symbolic disassembly" of the code.
The resulting instruction stream can be optimized using normal techniques, such as peephole optimization, but "this is not very efficient". In order to do more optimizations on the code, an intermediate representation (IR) of some kind is needed. The IR used is "essentially a control-flow graph on top of MC instructions", Panchenko said; he was referring to the LLVM machine code (MC) project, which provides a number of tools and libraries that BOLT uses. The instructions look much like the assembly code, but some may be annotated or modified to identify tail calls versus other kinds of jumps, for example. "So if you look at BOLT disassembly, you will have a much better idea of what's happening in your application compared to regular objdump".
BOLT uses the profile information for the basic blocks in the control-flow graph in order to determine where the hot code resides. To make the best code-layout decisions, though, having weights on the edges, rather than just execution counts for the basic blocks, would be useful. The LBR profiling can provide that information, but BOLT can recover some information about edge weights even without it.
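As a sketch of why edge weights matter, the hypothetical data structures below (not BOLT's actual IR) attach execution counts to blocks and weights to edges, with a crude fallback estimate for the case where only block counts are available:

```c
/*
 * Hypothetical CFG data structures (not BOLT's IR): blocks carry execution
 * counts from the profile, edges carry how often each branch was taken.
 */
#include <stdint.h>
#include <stddef.h>

struct basic_block;

struct cfg_edge {
	struct basic_block *dest;	/* successor block */
	uint64_t weight;		/* times this branch was taken (from LBR) */
};

struct basic_block {
	uint64_t start_addr;		/* address of the first instruction */
	uint64_t exec_count;		/* profile samples landing in the block */
	struct cfg_edge *succs;		/* outgoing edges */
	size_t nsuccs;
};

/*
 * Without branch records, an edge weight can only be approximated from the
 * counts of its endpoints; with LBR-style samples it is measured directly.
 */
static uint64_t edge_weight_estimate(const struct basic_block *from,
				     const struct basic_block *to)
{
	return from->exec_count < to->exec_count ? from->exec_count
						 : to->exec_count;
}
```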
Then the graph is used to optimize the code; "The main optimization that gives us most of the gains is code reordering." The basic blocks can be grouped together to reduce the instruction-cache footprint of the code. That is done by breaking up the functions into fragments, some of which are hot code and others that are rarely or never executed (e.g. error-handling code). Compilers already do reordering, but on the function level; BOLT takes it one step further and reorders these function fragments.
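The made-up C function below shows the shape of code that benefits from this kind of splitting: a short hot path next to bulky error handling that almost never runs (the __builtin_expect() hint is what the kernel's likely()/unlikely() macros expand to):

```c
#include <stdio.h>

/* Illustrative only: the hot path is a couple of instructions, while the
 * error handling is large but rarely executed, so the two fragments can be
 * placed far apart without hurting the instruction cache. */
int process(int fd, const char *buf, int len)
{
	if (__builtin_expect(fd < 0 || buf == NULL || len <= 0, 0)) {
		/* Cold fragment: error handling. */
		fprintf(stderr, "process: bad arguments (fd=%d, len=%d)\n",
			fd, len);
		return -1;
	}

	/* Hot fragment: should be packed together with other hot code. */
	return len;
}
```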
Once the new code is generated, there is a question about where to put it, he said. Unless there is a big concern about disk space, it is more efficient to simply create a new text segment that contains the hot code, which will generally be quite a bit smaller (for a binary that is huge, "hundreds of megabytes", 20MB or less of hot code would end up in the new segment). So it is not much overhead, in terms of disk space, and "you get a much faster application".
Adding another text segment to the kernel binary may not be viable, he said, so he turned to an alternative that is "much more feasible". BOLT will simply rewrite the existing functions in the binary; those functions are already ordered by the compiler based on its analysis, so BOLT effectively uses that. BOLT could do a bit better function ordering based on the profile, but that small gain can be sacrificed in order to easily get the basic-block reordering, he said. It also helps avoid over-specializing the code for only the workload measured by the profile; for other workloads that were not measured, some of the cold code will be executed, but it will still be located nearby due to the compiler choices.
The kernel provides other challenges, because its code is modified at boot time and also while it is running. He spent a good chunk of the talk going into the details of how that type of code is handled. Things like static calls, SMP locks (that are patched out on uniprocessor systems), static keys, and alternative instructions for different subarchitectures are handled with annotations in the disassembled code, which is part of the metadata that BOLT uses to do its job. For some, BOLT simply does not operate on them, while others have mechanisms that the optimizer can use on them. All of that metadata can be dumped using BOLT tools.
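As one concrete example of that run-time patching, the kernel's static-key API compiles a condition down to a single instruction (a nop or a jump) that is rewritten in place when the key is toggled; the fragment below is a simplified, illustrative use of that API, with placeholder helper functions:

```c
#include <linux/jump_label.h>

/* Placeholder helpers for illustration only. */
extern void do_optional_work(void);
extern void do_common_work(void);

static DEFINE_STATIC_KEY_FALSE(my_feature_key);

void my_hot_path(void)
{
	/*
	 * This compiles to a nop (or an unconditional jump) that the
	 * jump-label machinery patches at run time; a binary rewriter has
	 * to know about such sites in order not to break them.
	 */
	if (static_branch_unlikely(&my_feature_key))
		do_optional_work();

	do_common_work();
}

/* Elsewhere, enabling the feature patches every such branch site:
 *	static_branch_enable(&my_feature_key);
 */
```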
Panchenko said that he was skipping over some topics, such as continuous profiling, which can be applied to BOLT-optimized binaries as they run; a new version of a binary can be produced that will reflect changes in the code and the workload. He also did not cover any of the other optimizations that BOLT applies. He finished by showing some of the output that running BOLT produces, noting that a four-second demo of it operating would not be all that interesting.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
| Index entries for this article | |
| --- | --- |
| Kernel | Optimization tools |
| Conference | Linux Plumbers Conference/2024 |
Grope
Posted Oct 26, 2024 0:33 UTC (Sat) by willy (subscriber, #9762)

https://lwn.net/1998/1029/als/rope.html

(Not sure why it never got released ...)

And, damn, that name and the "jokes" being made ... I think we're a bit better now.
Grope
Posted Oct 26, 2024 8:08 UTC (Sat) by atnot (subscriber, #124910)

yeah this whole thing is just one yikes after another. jesus christ. As uncomfortable as I still am visiting events like this today, at least it's no longer... whatever this is.
Grope
Posted Oct 26, 2024 10:01 UTC (Sat) by intelfx (subscriber, #130118)

> I'd like to please stay as far away from you as possible.

The feeling is thus mutual.
Grope
Posted Nov 7, 2024 15:15 UTC (Thu) by paulj (subscriber, #341)

I.e., the set of young people who enjoyed mildly vulgar/shocking-to-norms puns then, are mostly the same set of people as the older set who today find it juvenile.
Intriguing
Posted Oct 26, 2024 12:35 UTC (Sat) by jd (guest, #26381)
But one optimisation can potentially interact with another optimisation, and optimal binary reordering may be affected by compiler optimisation which may in turn be potentially affected by optimal binary reordering.
I'm trying to figure out from this article how, exactly, you get the most out of this.
I'd also be intrigued to know if this technique could be used effectively with the Verified Software Toolchain. VST is fine for producing provably correct binaries, but there's obvious drawbacks to this - there's not a whole lot of optimising you can do and still be certain the binaries are correct.
If you can greatly accelerate VST-produced binaries without impacting the proof of correctness in any way, I could imagine scenarios where this could actually be useful.
Other BOLT weirdness
Posted Oct 27, 2024 4:35 UTC (Sun) by kmeyer (subscriber, #50720)

https://github.com/llvm/llvm-project/blob/7b88e7530d4329f...

Also, this is ... kind of insane, for library source code? You can use the compile-time feature detection support and identify that the compiler target supports AVX512 (-mavx512 or whatever). You can use runtime cpuid feature-bit detection and identify that the running CPU supports AVX512. But if your binaries have been through BOLT with the -trap-avx512 flag, your AVX-accelerated function will just trap with a ud2 instruction.

If you find yourself needing to detect, at runtime, this BOLT bastardization of the binary, this ugly hack seems to work: https://github.com/facebook/folly/blob/d5e10f9d076838374f...
Other BOLT weirdness
Posted Oct 27, 2024 5:57 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)
AVX-512 is stupid. It doesn't work on efficiency cores on Alder Lake. Even though P-cores support it.
Other BOLT weirdness
Posted Oct 27, 2024 6:59 UTC (Sun) by intelfx (subscriber, #130118)
Perhaps it rather means that Alder Lake is stupid?
Other BOLT weirdness
Posted Nov 1, 2024 18:20 UTC (Fri) by anton (subscriber, #25547)

The P-cores of Alder Lake don't support AVX-512 (implemented but disabled), either, unless you are using some early firmware. It's a pity that Intel completely disabled that, even in Xeon-E24xx CPUs where the E-cores are disabled. But don't worry, buy an AMD CPU with a Zen4 or Zen5 core, and you will get AVX-512.
Cache sizes
Posted Nov 1, 2024 18:46 UTC (Fri) by anton (subscriber, #25547)

The claims made in the article (maybe in the talk) about cache sizes are mostly wrong.
I-cache and D-cache typically have similar size, and if they differ, it's not always the D-cache that is larger. E.g., Zen4 has 32KB I-cache and 32KB D-cache, Zen5 and Raptor Cove have 32KB I-cache and 48KB D-cache, and Gracemont has 64KB I-cache and 32KB D-cache.
The sizes of L1 caches generally have not grown in the last 20 years; e.g., the 2003 Athlon 64 has 64KB I-cache and 64KB D-cache, and the 2003 Pentium M has 32KB I-cache and 32KB D-cache. Instead, they have added an L3 cache since that time. A number of cores have a microoperation cache in addition to the I-cache, but the sizes are hard to compare.
Cache sizes
Posted Nov 5, 2024 8:06 UTC (Tue) by anton (subscriber, #25547)

The L2 cache of many cores is dedicated to the core, too (e.g., on Intel's P-cores for over a decade and on AMD's Zen-Zen5 cores).
The reason for keeping the L1 cache small is latency. If the cache grows, the miss rate decreases, but the latency increases. You can see the longer latency nicely in the comparison of L2 sizes and latencies in this article.
One reason is that the wires get longer, which increases the time that signals travel.
You also want to use a virtually-indexed physically-tagged (VIPT) cache as L1 cache, which allows to perform the TLB access and the cache access in parallel, i.e., with low latency. But that means that the size of a cache way is at most as large as a page; the number of ways is limited (you typically don't see more than 16-way set-associative caches, and a lower number of ways is common in L1 caches), the page size is 4KB on AMD64, which limits the L1 cache sizes to 64KB (and 32KB or 48KB is more common). Apple's Firestorm (M1 P-core) has larger caches (192KB I-cache, 128KB D-cache), but also 16KB pages, which allows a VIPT cache implementation with a 12-way (I) or 8-way (D) set-associative cache.