Bootstrappable builds
The idea of Reproducible Builds—being able to recreate bit-for-bit identical binaries using the same source code—has gained momentum over the last few years. Reproducible builds provide some safeguards against bad actors in the software supply chain. But building software depends on the tools used to construct the binary, including compilers and build-automation tools, many of which depend on pre-existing binaries. Minimizing the reliance on opaque binaries for building our software ecosystem is the goal of the Bootstrappable Builds project.
For example, GCC is written in C and C++, which means that it requires compilers for those two languages in order to be built from source. In practice, that generally means a distribution would use its existing binary executables of those tools to build a new GCC version, which would then be released to users. One of the concerns with that approach is described in Unix co-creator Ken Thompson's Turing Award lecture, "Reflections on Trusting Trust". In a nutshell, Thompson said that trusting the output of a binary compiler is an act of faith that no one has tampered with the creation of that binary, even if the source code is available.
The Bootstrappable Builds project was started as an offshoot of the Reproducible Builds project during the latter's 2016 summit in Berlin. A bootstrappable build takes the idea of reproducibility one step further, in some sense. The build of a target binary can be reproduced alongside the build of the tools required to do so. It is, conceptually, almost like building a house from a large collection of atoms of different elements.
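The bit-for-bit comparison that underlies reproducibility can be sketched in a few lines of Python. This is only an illustration, not tooling from either project; the function names `sha256_of` and `builds_match` are hypothetical, and the two paths stand in for artifacts produced by independent builds of the same source.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first: Path, second: Path) -> bool:
    """True when two independently built artifacts are bit-for-bit identical."""
    return sha256_of(first) == sha256_of(second)
```

A bootstrappable build would apply the same comparison not just to the final binary but to each tool built along the way.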
While it is obviously an interesting intellectual puzzle, bootstrapping a Linux distribution from the ground up is a lot of work—and the benefits may not be immediately apparent. The project has a web page outlining the benefits, which are largely about security and portability of the source code. For users, bootstrapping and reproducibility help provide protection against malicious backdoors, while distributions and tool developers will have an easier path in porting code to new architectures.
Since C is at the heart of much of the open-source ecosystem, having a way to bootstrap a C compiler, such as GCC, is among the projects that Bootstrappable Builds is pursuing. One such effort is maintaining a subset of GCC version 4.7, which is the last version that can be built with only a C compiler. GCC 4.7 will be easier to bootstrap from simpler C compilers, such as the Tiny C Compiler (TinyCC or tcc), without requiring a C++ compiler too.
A related effort revolves around GNU Mes, which is the combination of a Scheme interpreter written in C and a C compiler written in Scheme. The two parts are mutually self-hosting, so one can be built from the other (or from a separate binary C compiler or Scheme interpreter). This has been used to halve the size of the bootstrap binaries (or "seeds") required to create a version of the GNU Guix distribution.
While that has greatly reduced the amount of binary code that is needed to create a distribution from scratch, there are plans to go even further. Stage0 is a project aimed at bootstrapping from a truly minimal base: a less-than-500-byte hex monitor ("How you create it is up to you; I like toggling it in manually myself"). That monitor implements a simple hex-code-to-binary translator that can be used to build ever-more complex binaries, some of which are available from the project repository.
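To give a feel for what a hex-code-to-binary translator does, here is a minimal Python sketch. It is not the Stage0 implementation, which is written at a far lower level. The function name `hex0_translate` is hypothetical, and the input conventions (comments starting with `#` or `;`, all non-hex characters skipped) are assumptions made for illustration.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def hex0_translate(text: str) -> bytes:
    """Turn commented hex text into raw bytes, two hex digits per byte."""
    nibbles = []
    for line in text.splitlines():
        # Assumption: '#' and ';' start a comment that runs to end of line.
        for marker in "#;":
            line = line.split(marker, 1)[0]
        # Keep only hex digits; whitespace and anything else is skipped.
        nibbles.extend(ch for ch in line if ch in HEX_DIGITS)
    if len(nibbles) % 2:
        raise ValueError("odd number of hex digits")
    return bytes(int(a + b, 16) for a, b in zip(nibbles[0::2], nibbles[1::2]))
```

Because the input is just annotated hex, a person can audit it byte by byte, which is the point of starting from such a small seed.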
In a recent posting to the bootstrappable mailing list, Jan Nieuwenhuizen reported some progress. Mes has been eliminated as a bootstrap seed for Guix; it is now built starting from the hex-code tool (hex0). There is also an effort outside of Guix to bootstrap a system using just two small seeds (including the hex0 tool); it can currently build Mes, and building TinyCC is in progress.
In addition, at the 2019 Reproducible Builds summit, three distributions created bit-for-bit identical binaries of Mes using three different versions of GCC. Guix, NixOS, and Debian first built Mes with GCC, then built it again using that Mes, which resulted in identical binaries. As noted by David A. Wheeler, that exercise was a real-world application of his diverse double-compiling (DDC) approach to countering Thompson's "trusting trust" attack.
[...] The application described here shows that several different distributions with different executables produce the same underlying result. However, three of these applications are using the same compiler, specifically GCC (albeit different versions). These tests use similar and highly related distributions; they even use many of the same underlying components like glibc, the Linux kernel, and so on (though again, with different versions).
So while this does use DDC, and it does increase confidence, it increases confidence only to a limited extent because the checking systems are relatively similar. They hope to attempt to use an even more diverse set of compilers in the future, which would give even greater confidence.
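The double-compilation check can be illustrated with a toy model in Python. This is a simplification for intuition only, not Wheeler's tooling or the procedure used at the summit; `Binary`, `run_compiler`, and `ddc_check` are hypothetical names. The model assumes deterministic compilers, so a binary is fully described by the source whose behavior it implements and the source of the compiler that emitted it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    """Toy executable: what it does, and whose code generator emitted it."""
    behaves_like: str        # source whose semantics this binary implements
    emitted_by: str          # source of the compiler that generated it
    trojaned: bool = False   # hidden self-propagating backdoor (toy flag)


def run_compiler(compiler: Binary, source: str) -> Binary:
    """Model running `compiler` on `source` to produce a new binary."""
    if compiler.trojaned and source == compiler.behaves_like:
        # A "trusting trust" trojan reproduces itself when it is given
        # its own clean-looking compiler source.
        return compiler
    return Binary(behaves_like=source, emitted_by=compiler.behaves_like)


def ddc_check(source: str, suspect: Binary, trusted: Binary) -> bool:
    """Rebuild `source` twice from `trusted` and compare with `suspect`."""
    stage1 = run_compiler(trusted, source)  # right behavior, different bits
    stage2 = run_compiler(stage1, source)   # should equal a clean self-build
    return stage2 == suspect


if __name__ == "__main__":
    gcc = Binary("gcc-src", "gcc-src")
    clean = Binary("mes-src", "mes-src")
    bad = Binary("mes-src", "mes-src", trojaned=True)
    print(ddc_check("mes-src", clean, gcc))  # clean self-build
    print(ddc_check("mes-src", bad, gcc))    # trojaned self-build
```

In this model the second-stage result does not depend on which trusted compiler was used, which mirrors the identical Mes binaries obtained from three GCC versions. It also shows the limit Wheeler points out: the check is only as strong as the independence of the trusted compilers.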
By the sound of things, bootstrappability is super fiddly, low-level work. It is not for everyone, but it is important. If we can ensure that the foundations of our software ecosystem are sound, and build up from there, we can be reasonably certain that there is no backdoor hiding in our build tools and subverting everything else. That is a great outcome, but it only pushes the problem down a level, in truth. Some kind of hardware or firmware backdoor could still be lurking. Solutions to that problem will be rather more difficult.
[Thanks to Paul Wise for suggesting the topic.]
