Topics from the LLVM microconference
A persistent theme throughout the LLVM microconference at the 2015 Linux Plumbers Conference was that of "breaking the monopoly" of GCC, the GNU C library (glibc), and other tools that are relied upon for building Linux systems. One could quibble with the "monopoly" term, since it is self-imposed and not being forced from the outside, but the general idea is clear: using multiple tools to build our software will help us in numerous ways.
Kernel and Clang
Most of the microconference was presentation-oriented, with relatively little discussion. Jan-Simon Möller kicked things off with a status report on the efforts to build a Linux kernel using LLVM's Clang C compiler. The number of patches needed for building the kernel has dropped from around 50 to 22 "small patches", he said. Most of those are in the kernel build system or are for little quirks in driver code. Of those, roughly two-thirds can likely be merged upstream, while the others are "ugly hacks" that will probably stay in the LLVM-Linux tree.
There are currently five patches needed in order to build a kernel for the x86 architecture. Two of those are for problems building the crypto code (the AES_NI assembly code will not build with the LLVM integrated assembler and there are longstanding problems with variable-length arrays in structures). The integrated assembler also has difficulty handling some "assembly" code that is used by the kernel build system to calculate offsets; GCC sees it as a string, but the integrated assembler tries to actually assemble it.
The goal of building an "allyesconfig" kernel has not yet been realized, but a default configuration (defconfig) can be built using the most recent Git versions of LLVM and Clang. Doing so currently requires disabling the integrated assembler for the entire build, but the goal is to disable it only for the files that it cannot handle.
Other architectures (including ARM for the Raspberry Pi 2) can be built using roughly half-a-dozen patches per architecture, Möller said. James Bottomley was concerned about the "Balkanization" of kernel builds once Linus Torvalds and others start using Clang for their builds; obsolete architectures and those not supported by LLVM may stop building altogether, he said. But microconference lead Behan Webster thought that to be an unlikely outcome. Red Hat and others will always build their kernels using GCC, he said, so that will be supported for quite a long time, if not forever.
Using multiple compilers
Kostya Serebryany is a member of the "dynamic testing tools" team at Google, which has the goal of providing tools for the C++ developers at the company to find bugs without any help from the team. He was also one of the proponents of the "monopoly" term for GCC, since it is used to build the kernel, glibc, and all of the distribution binaries. But, he said, making all of that code buildable using other compilers will allow various other tools to also be run on the code.
For example, the AddressSanitizer (ASan) can be used to detect memory errors such as stack buffer overflows, use after free, use of stack memory after a function has returned, and so on. Likewise, ThreadSanitizer (TSan), MemorySanitizer (MSan), and UndefinedBehaviorSanitizer (UBSan) can find various kinds of problems in C and C++ code. But all are based on Clang and LLVM, so only code that can be built with that compiler suite can be sanitized using these tools.
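To make that class of bug concrete, here is a minimal sketch in C. The function name dup_string() and its "buggy" switch are invented for illustration and do not come from the talk. When the file is built with Clang's -fsanitize=address option, the out-of-bounds store on the buggy path should be reported as a heap buffer overflow at run time; without the sanitizer it would typically go unnoticed.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a heap copy of s, or NULL if allocation fails.
 * With buggy != 0 the allocation leaves no room for the terminating NUL,
 * so the final store writes one byte past the end of the buffer.
 */
char *dup_string(const char *s, int buggy)
{
    size_t len = strlen(s);
    char *copy = malloc(buggy ? len : len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    copy[len] = '\0';   /* out of bounds when buggy != 0 */
    return copy;
}
```

Calling dup_string("abc", 1) in a program built with "clang -g -fsanitize=address" exercises the bad path; dup_string("abc", 0) is the corrected behavior.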
GCC already has some similar tools and the Linux kernel has added some as well (the kernel address sanitizer, for example), which have found various bugs, including quite a few security bugs. GCC's support has largely come about because of the competition with LLVM and still falls short in some areas, he said.
Beyond "best effort" tools like the sanitizers, though, there are other techniques. For example, fuzzing and hardening can be used, respectively, to find more bugs or to eliminate certain classes of bugs. He stated that coverage-guided fuzzing can be used to home in on problem areas in the code. LLVM's libFuzzer can be used to perform that kind of fuzzing. He noted that the Heartbleed bug can be "found" using libFuzzer in roughly five seconds on his laptop.
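A fuzzing harness for libFuzzer is a single entry point, LLVMFuzzerTestOneInput(), that the fuzzer calls repeatedly with generated inputs. The sketch below pairs it with a toy record parser; the record format, parse_record(), and MAX_PAYLOAD are hypothetical and only loosely echo the trusted-length-field pattern behind Heartbleed. It is not the code that was fuzzed in the talk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 64

/*
 * Hypothetical record format: byte 0 claims the payload length and the
 * payload follows. Copies the payload into out (MAX_PAYLOAD bytes) and
 * returns its length, or -1 if the record is malformed.
 */
int parse_record(const uint8_t *data, size_t size, uint8_t *out)
{
    size_t claimed;

    if (size < 1)
        return -1;
    claimed = data[0];
    /* Never trust the claimed length: check it against the real sizes. */
    if (claimed > size - 1 || claimed > MAX_PAYLOAD)
        return -1;
    memcpy(out, data + 1, claimed);
    return (int)claimed;
}

/* libFuzzer entry point: feed each generated input to the parser. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    uint8_t out[MAX_PAYLOAD];

    parse_record(data, size, out);
    return 0;
}
```

Current Clang releases can build this with "clang -g -fsanitize=fuzzer,address"; the setup available in 2015 may have differed. Removing either length check should let the fuzzer, with ASan's help, find the resulting overflow quickly.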
Two useful hardening techniques are also available with LLVM: control flow integrity (CFI) and SafeStack. CFI will abort the program when it detects certain kinds of undesired behavior, for example that the virtual function table for a program has been altered. SafeStack protects against stack buffer overflows by placing local variables on a separate stack. That way, the return address and those variables are not contiguous in memory.
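As a rough illustration of what the two schemes act on, consider the C sketch below; run_handler(), twice(), and handler_fn are made-up names. The array is a local whose address is handed to another function, which is the kind of variable SafeStack relocates to its separate stack. The call through the function pointer is the C analogue of the virtual-call case: current Clang releases can check such indirect calls under CFI.

```c
#include <stdio.h>

typedef int (*handler_fn)(int);

static int twice(int x)
{
    return 2 * x;
}

/*
 * Hypothetical example. label is an address-taken local array, a candidate
 * for SafeStack's separate stack; the return address stays on the regular
 * stack. The call through fn is an indirect call that CFI can check.
 */
int run_handler(const char *name, int arg)
{
    char label[16];
    handler_fn fn = twice;

    snprintf(label, sizeof(label), "%s", name);  /* truncates, never overflows */
    if (label[0] == '\0')
        return -1;      /* empty name: nothing to run */
    return fn(arg);
}
```

The relevant Clang options are -fsanitize=safe-stack for SafeStack and -fsanitize=cfi (which also needs -flto and -fvisibility=hidden) for CFI; the source code itself does not change.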
Serebryany said that it was up to the community to break the monopoly. He was not suggesting simply switching to using LLVM exclusively, but ensuring that the kernel, glibc, and distributions can all be built with it. Furthermore, he said that continuous integration should be set up so that all of these pieces can always be built with both compilers. When other compilers arrive, they should also be added into the mix.
To that end, Webster asked if Google could help get the kernel patches needed to build with Clang merged upstream. Serebryany said that he thought that, by showing some of the advantages of being able to build with Clang (such as the fuzzing support), Google might be able to help get those patches merged.
BPF and LLVM
The "Berkeley Packet Filter" (BPF) language has expanded its role greatly over the years, moving from simply being used for packet filtering to now providing the in-kernel virtual machine for security (seccomp), tracing, and more. Alexei Starovoitov has been the driving force behind extending the BPF language (into eBPF) as well as expanding its scope in the kernel. LLVM can be used to compile eBPF programs for use by the kernel, so Starovoitov gave a presentation on the language and its uses at the microconference.
He began by noting wryly that he "works for Linus Torvalds" (in the same sense that all kernel developers do). He merged his first patches into GCC some fifteen years ago, but he has "gone over to Clang" in recent years.
The eBPF language is supported by both GCC and LLVM using backends that he wrote. He noted that the GCC backend is half the size of the LLVM version, but that the latter took much less time to write. "My vote goes to LLVM for the simplicity of the compiler", he said. The LLVM-BPF backend has been used to demonstrate how to write a backend for the compiler. It is now part of LLVM stable and will be released as part of LLVM 3.7.
GCC is built for a single backend, so you have to specifically create a BPF version, but LLVM has all of its backends available using command-line arguments (--target=bpf). LLVM also has an integrated assembler, so the C code describing a BPF program can be turned into in-memory BPF bytecode that can be loaded into the kernel.
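The C accepted by the BPF backend is a restricted subset of the language. As a sketch of the flavor, the function below (is_ipv4() is an invented name) has no loops and calls no library functions. A program that is really loaded into the kernel would take a kernel-defined context argument and carry section annotations; both are omitted here so that the example also builds as ordinary C.

```c
/*
 * Hypothetical filter: return 1 if the Ethernet frame carries IPv4
 * (EtherType 0x0800 in bytes 12 and 13 of the header), otherwise 0.
 */
int is_ipv4(const unsigned char *frame, unsigned int len)
{
    if (len < 14)       /* shorter than an Ethernet header */
        return 0;
    return frame[12] == 0x08 && frame[13] == 0x00;
}
```

Assuming a Clang with the BPF backend enabled, something like "clang -O2 -target bpf -c filter.c -o filter.o" should produce a BPF object file from it; the file names are placeholders.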
BPF for tracing is currently a hot area, Starovoitov said. He called it a better alternative to SystemTap and said that it runs two to three times faster than Oracle's DTrace. Part of that speed comes from LLVM's optimizations plus the kernel's internal just-in-time compiler for BPF bytecode.
Another interesting tool is the BPF Compiler Collection (BCC). It makes it easy to write and run BPF programs by embedding them into Python (either directly as strings in the Python program or by loading them from a C file). Underneath the Python "bpf" module is LLVM, which compiles the program before the Python code loads it into the kernel. A simple printk() can easily be added into the kernel without recompiling it (or rebooting). He noted that Brendan Gregg has added a bunch of example tools to show how to use the C+Python framework.
Under the covers, the framework uses libbpfprog, which compiles a C source file into BPF bytecode using Clang/LLVM. It can also load the bytecode and any BPF maps into the kernel using the bpf() system call and attach the program(s) to various types of hooks (e.g. kprobes, tc classifiers/actions). The Python bpf module simply provides bindings for the library.
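The bytecode that ends up being handed to bpf() is an array of fixed-size, eight-byte instructions. The sketch below mirrors the layout of the kernel's struct bpf_insn in a local type (struct insn, an illustrative name) and encodes the smallest useful program, one that sets r0 to zero and exits. Real code would include <linux/bpf.h> and pass the array to the BPF_PROG_LOAD command; that step is left out because it needs privileges and a suitable kernel.

```c
#include <stdint.h>

/*
 * Local mirror of the kernel's struct bpf_insn layout (an assumption of
 * this sketch; real code would use <linux/bpf.h>). The uint8_t bit-fields
 * rely on a GCC/Clang extension, as the kernel header does.
 */
struct insn {
    uint8_t code;           /* opcode */
    uint8_t dst_reg : 4;    /* destination register */
    uint8_t src_reg : 4;    /* source register */
    int16_t off;            /* signed offset */
    int32_t imm;            /* signed immediate */
};

#define OP_MOV64_IMM 0xb7   /* BPF_ALU64 | BPF_MOV | BPF_K */
#define OP_EXIT      0x95   /* BPF_JMP | BPF_EXIT */

/* "r0 = 0; exit": a program that simply returns 0 */
static const struct insn return_zero[] = {
    { OP_MOV64_IMM, 0, 0, 0, 0 },
    { OP_EXIT,      0, 0, 0, 0 },
};

/* Number of instructions in the program above */
unsigned int return_zero_len(void)
{
    return sizeof(return_zero) / sizeof(return_zero[0]);
}
```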
The presentation was replete with examples, which are available in the slides [PDF] as well.
Alternatives for the core
There was a fair amount of overlap between the last two sessions I was able to sit in on. Both Bernhard Rosenkraenzer and Khem Raj were interested in replacing more than just the compiler in building a Linux system. Traditionally, building a Linux system starts with GCC, glibc, and binutils, but there are now alternatives to those. How much of a Linux system can be built using those alternatives?
Some parts of binutils are still needed, Rosenkraenzer said. The binutils gold linker can be used instead of the traditional ld. (Other linker options were presented in Mark Charlebois's final session of the microconference, which I unfortunately had to miss.) The gas assembler from binutils can be replaced with Clang's integrated assembler for the most part, but there are still non-standard assembly constructs that require gas.
Tools like nm, ar, ranlib, and others will need to be made to understand three different formats: regular object files, LLVM bitcode, and the GCC intermediate representation. Rosenkraenzer showed a shell-script wrapper that could be used to add this support to various utilities.
For the most part, GCC can be replaced by Clang. OpenMandriva switched to Clang as its primary compiler in 2014. The soon-to-be-released OpenMandriva 3 is almost all built with Clang 3.7. Some packages are still built with gcc or g++, however. OpenMandriva still needed to build GCC, though, to get libraries that were needed such as libgcc, libatomic, and others (including, possibly, libstdc++).
The GCC compatibility claimed by Clang is too conservative, Rosenkraenzer said. The GCC version that Clang reports through __GNUC__ and its companion macros is 4.2.1, but switching that to 4.9 produces better code. There were two thoughts on why Clang has chosen 4.2.1, though both are related: 4.2.1 was the last GPLv2 release of GCC, so some people may not be allowed to look at later versions; in addition, GCC 4.2.1 was the last version that was used to build the BSD portions of OS X.
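The claimed version is visible to any program through three predefined macros: __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__. The sketch below folds them into one number so that code can test against it; VERSION_CODE(), reported_gnuc_version(), and built_with_clang() are illustrative names, not an existing API.

```c
/* Fold a three-part version into one comparable integer, e.g. 4.2.1 -> 40201. */
#define VERSION_CODE(major, minor, patch) \
    ((major) * 10000 + (minor) * 100 + (patch))

/* The GCC version the compiler claims to be compatible with, or 0 if none. */
int reported_gnuc_version(void)
{
#if defined(__GNUC__)
    return VERSION_CODE(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
    return 0;
#endif
}

/* 1 when the compiler identifies itself as Clang, 0 otherwise. */
int built_with_clang(void)
{
#if defined(__clang__)
    return 1;
#else
    return 0;
#endif
}
```

With a stock Clang, reported_gnuc_version() would be expected to return 40201, so any code path guarded by a check for a newer GCC is skipped even if Clang could handle it. That is one plausible reading of why raising the claimed version "produces better code".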
There is a whole list of GCC-isms that should be avoided for compatibility with Clang. Rosenkraenzer's slides [PDF] list many of them. He noted that a number of bugs have been found via Clang warnings or errors when building various programs; GCC did not complain about those problems.
Another "monopoly component" that one might want to replace would be glibc. The musl libc alternative is viable, but only if binary compatibility with other distributions is not required. But musl cannot be built with Clang, at least not yet.
Replacing GCC's libstdc++ with LLVM's libc++ is possible but, again, binary compatibility is sacrificed. That is a bigger problem than it is for musl, though, Rosenkraenzer said. Using both is possible, but there are problems when libraries (e.g. Qt) are linked to, say, libc++ and a binary-only Qt program uses libstdc++, which leads to crashes. libc++ is roughly half the size of libstdc++, however, so environments like Android (which never used libstdc++) are making the switch.
Cross-compiling under LLVM/Clang is easier since all of the backends are present and compilers for each new target do not need to be built. There is still a need to build the cross-toolchains, though, for binutils, libatomic, and so on. Rosenkraenzer has been working on a tool to do automated bootstrapping of the toolchain and core system.
Conclusion
It seems clear that use of LLVM within Linux is growing and that growth is having a positive effect. The competition with GCC is helping both to become better compilers, while building our tools with both is finding bugs in critical components like the kernel. Whether it is called "breaking the monopoly" or "diversifying the build choices", this trend is making beneficial changes to our ecosystem.
[I would like to thank the Linux Plumbers Conference organizing committee for travel assistance to Seattle for LPC.]
