
Bootstrappable builds

By Jake Edge
January 6, 2021

The idea of Reproducible Builds—being able to recreate bit-for-bit identical binaries using the same source code—has gained momentum over the last few years. Reproducible builds provide some safeguards against bad actors in the software supply chain. But building software depends on the tools used to construct the binary, including compilers and build-automation tools, many of which depend on pre-existing binaries. Minimizing the reliance on opaque binaries for building our software ecosystem is the goal of the Bootstrappable Builds project.
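The core check behind reproducibility is simple to state: two independently produced builds are reproducible exactly when the artifacts are byte-for-byte identical. As a minimal illustration (a sketch, not taken from any project's actual tooling), comparing cryptographic digests is enough:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def is_reproducible(build_a: str, build_b: str) -> bool:
    """Two independent builds are reproducible iff their digests match."""
    return sha256_of(build_a) == sha256_of(build_b)
```

In practice the hard part is not the comparison but making the builds deterministic in the first place (timestamps, file ordering, build paths, and so on).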

For example, GCC is written in C and C++, which means that it requires compilers for those two languages in order to be built from source. In practice, that generally means a distribution would use its existing binary executables of those tools to build a new GCC version, which would then be released to users. One of the concerns with that approach is described in Unix inventor Ken Thompson's Turing Award lecture "Reflections on Trusting Trust" [PDF]. In a nutshell, Thompson said that trusting the output of a binary compiler is an act of faith that someone has not tampered with the creation of that binary—even if the source code is available.
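Thompson's attack is easier to appreciate with a toy model. In the hypothetical Python sketch below, "compilation" is reduced to a string transformation; the names and patterns are invented for illustration. It shows the two tricks the lecture describes: the compiler binary injects a backdoor when it recognizes the login program, and re-inserts its own trojan when it recognizes clean compiler source, so no trace of either appears in any source code.

```python
# Toy model: "compiling" is just a text transformation on source strings.

LOGIN_SRC = "check_password(user, password)"
COMPILER_SRC = "translate(source)"

def clean_compile(source: str) -> str:
    # An honest compiler: the output faithfully mirrors the source.
    return f"BINARY[{source}]"

def trojaned_compile(source: str) -> str:
    # Thompson's attack: the compiler *binary* misbehaves on two
    # patterns, with no trace of either in any source code.
    if LOGIN_SRC in source:
        source = source.replace(
            LOGIN_SRC, LOGIN_SRC + " or password == 'backdoor'")
    if COMPILER_SRC in source:
        # Self-propagation: a compiler built from perfectly clean
        # source comes out trojaned again.
        return "TROJANED-" + clean_compile(source)
    return clean_compile(source)
```

Rebuilding the compiler from audited source (`trojaned_compile(COMPILER_SRC)`) still yields a trojaned binary, which is exactly why source availability alone does not help.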

The Bootstrappable Builds project was started as an offshoot of the Reproducible Builds project during the latter's 2016 summit in Berlin. A bootstrappable build takes the idea of reproducibility one step further, in some sense. The build of a target binary can be reproduced alongside the build of the tools required to do so. It is, conceptually, almost like building a house from a large collection of atoms of different elements.

While it is obviously an interesting intellectual puzzle, bootstrapping a Linux distribution from the ground up is a lot of work—and the benefits may not be immediately apparent. The project has a web page outlining the benefits, which are largely about security and portability of the source code. For users, bootstrapping and reproducibility help provide protection against malicious backdoors, while distributions and tool developers will have an easier path in porting code to new architectures.

Since C is at the heart of much of the open-source ecosystem, having a way to bootstrap a C compiler, such as GCC, is among the projects that Bootstrappable Builds is pursuing. One such effort is maintaining a subset of GCC version 4.7, which is the last version that can be built with only a C compiler. GCC 4.7 will be easier to bootstrap from simpler C compilers, such as the Tiny C Compiler (TinyCC or tcc), without requiring a C++ compiler too.

A related effort revolves around GNU Mes, which is the combination of a Scheme interpreter written in C and a C compiler written in Scheme. The two parts are mutually self-hosting, so one can be built from the other (or from a separate binary C compiler or Scheme interpreter). This has been used to halve the size of bootstrap binaries (or "seeds") required to create a version of the GNU Guix distribution:

Mes+MesCC can compile an only lightly patched TinyCC that is self-hosting. Using this tcc and the Mes C library we now have a Reduced Binary Seed bootstrap for the gnutools triplet: glibc-2.2.5, binutils-2.20.1, gcc-2.95.3. This is enough to bootstrap Guix for i686-linux and x86_64-linux.

While that has greatly reduced the amount of binary code that is needed to create a distribution from scratch, there are plans to go even further. Stage0 is a project aimed at bootstrapping from a truly minimal base: a less-than-500-byte hex monitor ("How you create it is up to you; I like toggling it in manually myself"). That monitor implements a simple hex-code-to-binary translator that can be used to build ever-more complex binaries, some of which are available from the project repository.
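A hex0-style translator is simple enough to sketch in a few lines. The Python below illustrates the idea only; it is not the stage0 code itself, and the comment characters are an assumption:

```python
def hex0(text: str) -> bytes:
    """Minimal hex0-style translator: strip comments (';' or '#' to
    end of line) and whitespace, then turn hex digit pairs into bytes."""
    digits = []
    for line in text.splitlines():
        for marker in (";", "#"):
            cut = line.find(marker)
            if cut != -1:
                line = line[:cut]
        digits.extend(c for c in line if c in "0123456789abcdefABCDEF")
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    return bytes(int(a + b, 16) for a, b in zip(digits[::2], digits[1::2]))
```

The point of keeping this first stage so small is that the whole translator can be audited by eye, or even hand-verified against the bytes it emits.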

In a recent posting to the bootstrappable mailing list, Jan Nieuwenhuizen reports on some progress. Mes was eliminated as a bootstrap seed for Guix by building it starting from the hex-code tool (hex0). There are also efforts outside of Guix to bootstrap a system using just two small seeds (including the hex0 tool); it can currently build Mes, and TinyCC support is in progress.

In addition, at the 2019 Reproducible Builds summit, three distributions created bit-for-bit identical binaries of Mes using three different versions of GCC. Guix, NixOS, and Debian first built Mes with GCC, then built it again using that Mes, which resulted in identical binaries. As noted by David A. Wheeler, that exercise was a real-world application of his diverse double-compiling (DDC) approach to countering Thompson's "trusting trust" attack.

They used three different distributions (GNU Guix, Nix, and Debian) with three different major versions of GCC to recompile GNU Mes. They later used the tcc compiler as well (though details about that are sketchy). In all cases they recreated a bit-for-bit identical result of the GNU Mes C compiler!

[...] The application described here shows that several different distributions with different executables produce the same underlying result. However, three of these applications are using the same compiler, specifically GCC (albeit different versions). These tests use similar and highly related distributions; they even use many of the same underlying components like glibc, the Linux kernel, and so on (though again, with different versions).

So while this does use DDC, and it does increase confidence, it increases confidence only to a limited extent because the checking systems are relatively similar. They hope to attempt to use an even more diverse set of compilers in the future, which would give even greater confidence.
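The logic of Wheeler's test can be modeled abstractly. In the sketch below (a toy under stated assumptions, with all names invented), "compilation" is a deterministic function of the source alone, so two honest compilers produce functionally identical stage-1 results, while a trojaned suspect shows up as a stage-2 mismatch:

```python
import hashlib

def compile_with(compiler: dict, source: str) -> str:
    """Toy model of compilation: a correct compiler's output depends
    only on the source, so any two correct compilers agree."""
    out = "bin:" + hashlib.sha256(source.encode()).hexdigest()[:12]
    if compiler.get("trojaned") and "compiler-source" in source:
        out += "+payload"          # a subverted compiler adds a payload
    return out

def ddc_check(suspect: dict, trusted: dict, source: str) -> bool:
    """Diverse double-compiling: rebuild the compiler from source with
    both the suspect and an independent trusted compiler, then use each
    stage-1 result to rebuild again; compare the stage-2 outputs."""
    stage1_suspect = compile_with(suspect, source)
    stage1_trusted = compile_with(trusted, source)
    # In this toy, a stage-1 binary behaves as trojaned iff it carries
    # the payload marker.
    s2a = compile_with({"trojaned": "+payload" in stage1_suspect}, source)
    s2b = compile_with({"trojaned": "+payload" in stage1_trusted}, source)
    return s2a == s2b
```

A clean suspect passes (`ddc_check` returns `True`); a trojaned one fails, because its payload survives into stage 2 while the trusted chain's output stays clean.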

By the sound of things, bootstrappability is super fiddly, low-level work. It is not for everyone, but it is important. If we can ensure that the foundations of our software ecosystem are sound, and build up from there, we can be reasonably certain that there is no backdoor hiding in our build tools and subverting everything else. That is a great outcome, but it only pushes the problem down a level, in truth. Some kind of hardware or firmware backdoor could still be lurking. Solutions to that problem will be rather more difficult.

[Thanks to Paul Wise for suggesting the topic.]


Bootstrappable builds

Posted Jan 7, 2021 0:53 UTC (Thu) by dvdeug (guest, #10998) [Link] (13 responses)

It's interesting, but it seems to be wandering into purely academic technicalities. Once you have bootstrapped a modern GCC from non-GCC source, you're done for GCC's part; you have a trusted compiler, assuming you can trust the source code, in the sense of Ken Thompson's article. The problem is not making the core smaller; the problem is that the GCC tarball, uncompressed, is 740 MB, and even after stripping away various documentation and non-C/C++ directories, it's still over 200 MB. How do you trust that?

In 2008, there was a problem with Debian SSH keys due to an actual patch to OpenSSH in Debian. That was accidental. It would have been harder, but possible, to do it intentionally via a patch to GCC, so that it would recognize OpenSSH and miscompile it as needed. It could be all written out in code, and nobody would be the wiser unless they knew what they were doing GCC-wise and were poking at that section of code.

It's not a bad concern, but it seems at this point to be more about something fun and interesting instead of something that provides any more trust in practice.

Bootstrappable builds

Posted Jan 7, 2021 3:13 UTC (Thu) by pabs (subscriber, #43278) [Link]

For trusting source code, there will always be too much code for one developer or one organisation to review. So we need distributed code review, which is being worked on in the Rust community.

https://github.com/crev-dev/crev
https://github.com/crev-dev/cargo-crev

Bootstrappable builds

Posted Jan 8, 2021 3:23 UTC (Fri) by goraxe (subscriber, #42374) [Link] (8 responses)

There has been malware in the wild that attacks toolchains, and software houses affected by this type of malware have shipped software with backdoors inserted. There is no guarantee that a non-GCC C compiler is trustworthy.

So the bootstrapping from tiny understandable principles is pretty interesting especially if the results are bit for bit comparable as this gives cryptographic verification options.

I could see this having utility in build farms like Travis CI, PaaS systems like AWS Lambda, Google App Engine, etc. If you need truly trusted binaries, this seems like a very viable way of getting them.

Bootstrappable builds

Posted Jan 8, 2021 6:48 UTC (Fri) by dvdeug (guest, #10998) [Link] (7 responses)

I'm not sure I understand what type of malware you are referring to.

GCC bootstraps itself, which means the final copy of the GCC binaries for a given architecture and GCC version should not depend on what compiler you started with. If you start with two different compilers, you don't need to absolutely trust them; if they came from different sources, any attack they carried would be different, so you can simply compare the final versions. If the binaries are the same, which starting compiler you used was truly irrelevant, and the "trusting trust" attack is moot.

I don't see how this has utility in build farms, either. The issue where bootstrapping matters is in compilers, where attacks can be hidden in the binaries. You're not going to build GCC fresh on every system, and there's a serious question whether downloading a trusted source and building it on a million systems is any safer than downloading a trusted binary and installing it on a million systems. If you can get a hacked binary into the pathway, you can get hacked source code into the pathway.

What about the linker?

Posted Jan 8, 2021 7:00 UTC (Fri) by eru (subscriber, #2753) [Link] (1 responses)

Should we not also take into account "ld", "ar", and the other binutils tools? I think the Thompson attack would also work with "ld" and any other tool that is used in the process of generating the final program.

So the bootstrap process should either start with a compiler that directly produces an executable, or also bootstrap the linker without depending on any existing linker.

What about the linker?

Posted Jan 10, 2021 1:11 UTC (Sun) by JoeBuck (subscriber, #2330) [Link]

The traditional gcc bootstrapping process can work with all of the binutils tools, built together in the same tree (this was pioneered by the Cygnus folks maybe 30 years ago); everything is built again with the new compiler, linker, and assembler to eliminate dependencies. We can demonstrate that the classic attacks in the Thompson paper either don't exist, or have affected every C compiler since the dawn of the language, by starting with unrelated compilers, doing the bootstraps, maybe going through a number of compiler versions and even throwing in cross-compilers, involving a mix of free and proprietary compilers, and verifying that in the end the binaries are the same (for some systems, timestamps have to be filtered out of object files when doing the comparison, but for most, we wind up with every byte identical).

However, we still inherit dependencies from system libraries, and this can include macros and inline functions. Someone could perhaps sneak an attack into a system library function that needs to be coded in assembler for optimal performance, and have this wind up in the compiler. So efforts like these that start with a tiny compiler and a tiny library can eliminate that threat as well.

But I think efforts like this, while fascinating, are a lot less important than they used to be because the real threat these days is in the microcode, the system under the system.

Bootstrappable builds

Posted Jan 8, 2021 20:46 UTC (Fri) by josh (subscriber, #17465) [Link] (4 responses)

> GCC bootstraps itself, which means the final copy of GCC binaries for a certain architecture and GCC version should not depend on what compiler you started with.

As long as the GCC binary didn't have something added that subverts subsequent GCC binaries.

> If you start with two different compilers, you don't need to absolutely trust them; if they came from different sources and any attack they'd be using would be different, you can simply compare the final versions and if the binaries are the same, which starting compiler you used was truly irrelevant, and the "trusting trust" attack is moot.

Only if you start with two different independent compilers, though. The top-level comment of this thread just said "bootstrapped a modern GCC from non-GCC source", which doesn't say anything about diverse double-compilation (using two different non-GCC compilers to compile GCC).

> If you can get a hacked binary into the pathway, you can get hacked source code into the pathway.

Source is harder, though, for multiple reasons.

First, "trusting trust"-style attacks would be difficult to obfuscate; it's one thing to hide a security hole, and quite another to hide code that detects a code pattern from a compiler and modifies it such that it affects subsequently compiled code.

The source of GCC or Clang might be huge, but any *one* change is much smaller and more reviewable.

And finally, malicious source code is more difficult to deny intent about. With a malicious binary, you could try to claim some internal process was subverted, or blame a random employee, or contractor, or other similar diversions. With malicious source code, you'll have a harder time blaming anything other than malice.

Bootstrappable builds

Posted Jan 12, 2021 23:49 UTC (Tue) by dvdeug (guest, #10998) [Link] (3 responses)

> Only if you start with two different independent compilers, though.

I was assuming that you compared to an existing GCC binary. You don't actually have to start from two different non-GCC compilers; if you start from one non-GCC compiler and compare the result to the product of an existing GCC binary, and the binaries are the same, then the attack isn't present. If you want to start from two different independent compilers, there are enough of them around.

Also, a "trusting trust" attack for GCC 2.7.2 released in 1995 that targets a chain of compilers eventually building GCC 10 for AMD64, an architecture released in 2000, is inconceivable. (Toss in a pass through Itanium if you think AMD64 is even mildly plausible.) It would be challenging enough to make the attack survive cross-compiling from GCC 10 for AMD64 to GCC 10 for MIPS/ARM/HPPA/PowerPC and back to GCC 10 for AMD64.

> Source is harder, though, for multiple reasons.

The OpenSSL bug was added through a patch. I'm not implying in any way it wasn't an accident, but it was a serious security hole added through source change. For our purposes, the patch fixed a latent bug; OpenSSL relied on reading uninitialized variables, and there's a large bit of rules lawyering on StackOverflow, enough that whatever the actual standard says, a change that detected such a problem and "accidentally" opened up a similar bug, even if limited to certain circumstances, could be plausibly denied to be malicious.

> The source of GCC or Clang might be huge, but any *one* change is much smaller and more reviewable.

A bad actor wouldn't post it for review upstream; you toss into Red Hat or Debian or FreeBSD's patches, or stick it into some insecure mirror's copy of the source. Or you use direct access to the git repository.

> And finally, malicious source code is more difficult to deny intent about.

If GCC is bootstrapping itself and producing a different binary from another GCC bootstrap started from a different compiler, there's almost certainly malicious action. (There have been cases where stage 2 and stage 3 won't match because the starting compiler miscompiled GCC, though not in an unsurvivable way; you can run a stage 4 and it will match stage 3 and the final stage from other builds.) Once you've discovered the "trusting trust" attack, you can disassemble the binary, and it will be obvious that malice was involved, because that couldn't happen by accident.

With source code, it'd be relatively easy to miscompile a bug into a target like OpenSSL to open a security hole in a plausibly deniable way. Once we've established malice, if Debian or Red Hat were shipping a compromised source or binary, it would trace back to the same paths, and much the same group of people could have slipped it into the supply chain.

Again, the base issue is real, but I think when you start toggling in bootloaders, you've left real-world concerns behind.

Bootstrappable builds

Posted Jan 13, 2021 0:56 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

> With source code, it'd be relatively easy to miscompile a bug into a target like OpenSSL to open a security hole in a plausibly deniable way.

Hasn't this already happened? Didn't somebody slip a "if (userid = 0) then" into some program a while back?

And a lot of people are wondering if the NSA or whoever it was deliberately chose a bunch of Elliptic Curve Cryptography constants that were flawed to slip into a standard...

Cheers,
Wol

Bootstrappable builds

Posted Jan 13, 2021 3:20 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

I remember hearing that too, but wasn't it caught in a code review?

Bootstrappable builds

Posted Jan 13, 2021 4:02 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

That was the Linux kernel. An attacker hacked the public CVS mirror to include this code, but it was caught when Larry McVoy noticed that the BitKeeper history didn't match.

Here's the fine article from LWN: https://lwn.net/Articles/57135/

Debian 2008 keys bug

Posted Jan 8, 2021 23:55 UTC (Fri) by aaronmdjones (subscriber, #119973) [Link] (2 responses)

The patch was to OpenSSL, specifically its random number generator, not OpenSSH. OpenSSH just happens to use OpenSSL for its RSA key generation, and RSA key generation requires a good source of random numbers.
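The practical effect of that patch was that the process ID became essentially the only entropy feeding key generation, so only a few tens of thousands of distinct keys were possible per key type and architecture. A toy Python model of why that is fatal (a hypothetical stand-in, not the real OpenSSL code path):

```python
import random

PID_MAX = 32768  # classic Linux default: valid PIDs are 1..32767

def weak_keygen(pid: int) -> int:
    # Hypothetical stand-in for key generation seeded only by the
    # process ID -- a model of the 2008 Debian OpenSSL bug, not the
    # actual RNG code.
    rng = random.Random(pid)
    return rng.getrandbits(64)

# An attacker can simply precompute every possible key, which is in
# spirit what the openssh-blacklist package shipped.
blacklist = {weak_keygen(pid) for pid in range(1, PID_MAX)}
```

With at most 32,767 candidates, matching a victim's public key against the precomputed set is instantaneous.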

Debian 2008 keys bug

Posted Jan 9, 2021 2:11 UTC (Sat) by plugwash (subscriber, #29694) [Link] (1 responses)

Any key generation requires random numbers, and AIUI openssh relied on openssl for all its random-number needs.

The key generation issue was awful, but at least you could recognise bad keys (Debian shipped an "openssh-blacklist" package for a long time because of this). Even worse, though, was that traditional implementations of DSA use random numbers during the signature process and can leak bits of the key if that randomness is not sufficiently random.

This meant that any DSA key that had been merely used with the bad openssl had to be considered compromised. Since there was no way of detecting such keys, this led to a ban on the use of DSA keys on Debian's infrastructure (no idea if other organisations followed suit).
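The nonce problem can be made concrete: with DSA, reusing the per-signature nonce k lets anyone recover the private key from just two signatures. The toy Python below (deliberately tiny parameters, textbook equations; requires Python 3.8+ for modular inverses via `pow`) shows the extreme case of full nonce reuse; partial bias leaks more slowly but is also exploitable:

```python
# Toy DSA parameters -- far too small for real use.  q divides p - 1
# and g has order q modulo p.
p, q, g = 607, 101, 64
x = 57                      # the private key
k = 30                      # the per-signature nonce, reused twice

def sign(h: int) -> tuple:
    """Textbook DSA signing with the fixed nonce k."""
    r = pow(g, k, p) % q
    s = pow(k, -1, q) * (h + x * r) % q
    return r, s

h1, h2 = 17, 88             # two message hashes, reduced mod q
r1, s1 = sign(h1)
r2, s2 = sign(h2)
assert r1 == r2             # nonce reuse is visible: r repeats

# Recovery from public data only: k = (h1-h2)/(s1-s2), then
# x = (s1*k - h1)/r  (all arithmetic mod q).
k_rec = (h1 - h2) * pow(s1 - s2, -1, q) % q
x_rec = (s1 * k_rec - h1) * pow(r1, -1, q) % q
```

Here `x_rec` comes out equal to the private key `x`, which is why every DSA key that had merely *signed* anything with the bad RNG had to be treated as compromised.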

Debian was very fortunate that while it is theoretically possible to transfer keys between the gnupg world and the openssl/openssh/x509 world it was enough of a PITA that people very rarely did. So gnupg (which is the root of identity/trust in the Debian project) could still be considered safe.

Debian 2008 keys bug

Posted Jan 10, 2021 13:34 UTC (Sun) by aaronmdjones (subscriber, #119973) [Link]

> Any key generation requires random numbers, and AIUI openssh relied on openssl for all its random-number needs.

Back then, it did, yes. OpenSSH 6.5 (adding support for Ed25519 keys) didn't arrive for another 6 years, and OpenSSH 6.8 (allowing it to be built without OpenSSL) didn't arrive for another year after that. These days you can build it without, and then it will use urandom(4) [Linux, among others] or arc4random(3) [OpenBSD].

Bootstrappable builds

Posted Jan 7, 2021 1:04 UTC (Thu) by pabs (subscriber, #43278) [Link] (1 responses)

It's sad to see that lots of programming languages and build tools aren't bootstrappable, or are only tortuously bootstrappable from a minimal Linux system.

Bootstrappable builds

Posted Jan 7, 2021 8:50 UTC (Thu) by andrewsh (subscriber, #71043) [Link]

E.g. Kotlin.

Bootstrappable builds

Posted Jan 7, 2021 12:05 UTC (Thu) by tsr2 (subscriber, #4293) [Link] (8 responses)

The hardware/firmware backdoor that means we can't trust any of this ever, on Intel hardware at least, is Intel ME.

https://en.wikipedia.org/wiki/Intel_Management_Engine

Bootstrappable builds

Posted Jan 7, 2021 12:36 UTC (Thu) by dgm (subscriber, #49227) [Link]

You can, to some extent. At this level of complexity trust is basically statistical, meaning that you trust because it would be rather difficult to tamper with all the pieces and go undetected for long. But you cannot be certain.

The only absolutely trustable computer is the one you create yourself from discrete logic and only runs software written by yourself.

Bootstrappable builds

Posted Jan 8, 2021 6:23 UTC (Fri) by marcH (subscriber, #57642) [Link]

Bootstrappable builds

Posted Jan 8, 2021 17:58 UTC (Fri) by jhhaller (guest, #56103) [Link] (5 responses)

It's not just the ME, there's firmware everywhere, in storage (both controller and drives), in the NIC, in the BIOS or other bootstrap code.

If one is trying to defend against state actors, there is no end to the potential attacks, especially if they are only attacking one entity. Once they know the defense, it's easier to discover other places to attack.

I remember a British effort to build a mathematically verified computer, so that the results could be provably correct. The problem, as I remember, is that the computer was a physical device which could have existing and new defects, even if the design was proved correct, yielding the provably correct program potentially giving an incorrect answer. There is no way to prove that the fabrication of the verified design was correct. I can't find the original source; I believe this was done in the '80s.

Bootstrappable builds

Posted Jan 8, 2021 21:09 UTC (Fri) by Wol (subscriber, #4433) [Link]

This to me is the perfect description of why Science IS NOT Mathematics.

Mathematics is a provably correct logical model of what we think the world should be.

Science is a description of what the world is. (Or rather, Science is the work involved in making sure reality and theory agree - most practitioners unfortunately try to make reality agree with theory, rather than the other way round :-)

Cheers,
Wol

Bootstrappable builds

Posted Jan 10, 2021 17:37 UTC (Sun) by eru (subscriber, #2753) [Link]

You are thinking of the VIPER. I recall reading a story about it in some magazine, possibly BYTE. Quick googling turned up the following 1987 paper from Royal Signals and Radar Establishment, with proper old military document vibe (marked UNCLASSIFIED, looks like an old photocopy)

https://apps.dtic.mil/dtic/tr/fulltext/u2/a194561.pdf

Bootstrappable builds

Posted Jan 18, 2021 3:54 UTC (Mon) by gdt (subscriber, #6284) [Link]

The fabrication is shown to be correct using traceability. That is, every part of the proof is expressed in matching parts in hardware, and there is no additional hardware. This leads to a very different hardware design, one which will not perform well (eg, it's desirable to have a very long instruction word, as that makes traceability easier, but there's a high cost to fetching such instructions from memory. Especially since instruction caches and pipelines are very difficult to model, and so are usually not present).

That's the design issue for responding to Spectre. We want mathematical proof that processor designs don't leak state between processes. But we don't want to pay the price for the extreme proof and traceability of cryptographic processors.

Bootstrappable builds

Posted Jan 19, 2021 17:57 UTC (Tue) by immibis (guest, #105511) [Link] (1 responses)

Build the computer out of relays, surely.

Of course, such a computer will occupy the size of at least a refrigerator, and execute perhaps 10 instructions per second.

But you cannot possibly introduce fabrication defects into a relay-based design that passes its tests.

Now you can use this to bootstrap your software for other computers that can actually run at practical speeds.

Bootstrappable builds

Posted Jan 20, 2021 10:15 UTC (Wed) by geert (subscriber, #98403) [Link]

Why not? It doesn't make a difference if the logic gates are implemented by relays or semiconductors.

Choosing hardware for bootstraping and diverse double-compiling

Posted Jan 17, 2021 8:55 UTC (Sun) by GNUtoo (guest, #61279) [Link]

There are many issues that need to be fixed to get a robust free-software and open-source infrastructure.

Not all the issues affect everybody in the same way, so it's still good to fix them.

For instance for the Management Engine or equivalent, even if it's present in most recent computers, in some cases it's possible to avoid it completely.

For storage-device (SSD, HDD, etc.) firmware, it's still possible to work around the problem by booting off an SPI flash or raw NAND and using LUKS on the mass storage device, in a way that makes it very difficult for the firmware to attack the host system through modification of the data, as everything on it is encrypted. For instance, Coreboot/Libreboot + GRUB, or u-boot/barebox + Linux + an initramfs, can achieve that pretty easily.

So being able to take the compiler, and what was needed to produce it (both the software and hardware), out of the equation also makes trusting the software much easier.

In the case of Mes, as I understand it, it still depends on the system used to do the compilation, which includes both the software and the hardware. In addition, getting the same binary out of a diverse double-compilation only ensures that either both have the same backdoor or both have no backdoor.

The issue is also that, while we have some information on real-world attacks (XcodeGhost) and we basically know what it takes to do very simple compiler modifications that propagate themselves into subsequently built compilers, it's hard to really understand the threat, as some of the companies and government agencies that work on offensive security have large budgets and try to keep a big part of their work secret. In addition, not everything they do is published (for instance, Edward Snowden probably didn't retrieve and give everything he had access to to the journalists, who probably didn't publish everything either).

That said, if we want to bootstrap a C compiler, we still need hardware and software.

If I understood correctly, with something like the stage0 implementation we won't need a kernel or an operating system, and given enough work it could be used to somehow bootstrap a compiler, kernel, and operating system.

So I wonder what type of hardware would make sense to run a stage0 implementation:
- If you use Coreboot / Libreboot on a desktop (to avoid the issue of the embedded controller) with an i945 chipset (as there is free code to initialize the GPU / display controller), and find peripherals that you can somehow trust, you still end up having to build Coreboot / Libreboot, so it's probably not the best option here. And you probably cannot review the assembly of the Coreboot / Libreboot images as they are way too big. As for writing a smaller version of them, it would probably still end up being quite complex if we need RAM or access to a display controller. We can use the CPU cache as RAM quite easily, but I'm unsure if that would be sufficient for the display-controller part of the GPU and a very basic stage0. Installing that code would also be quite challenging, as you'd need to trust some SPI flash programmer as well.
- Another approach would be to find an ARM SoC with a boot ROM that has been dumped and reviewed, where users can easily input code, and with a display controller that doesn't need complex software to be used. This still brings in the hardware as something users have to trust blindly.
- Yet another approach would be to use FPGAs like an ECP5 with something like LiteX and the free toolchain for it. Here you could review the HDL code but this brings in way more software dependencies as you need to actually produce the FPGA image.
- Another option would be to use very old and well known hardware (like an Altair 8800) and somehow manage to use that to bootstrap the stage0. Though they are probably not always easy to find.

There are also procedures in place that could increase trust through randomness: the key-signing ceremony, a procedure for setting up an HSM, could also be adapted for installing software, though not necessarily for producing it.

For instance, if you want to install Coreboot / Libreboot, trust your SPI flash programmer, and can build an image in a reproducible way, you can take random computers, remove the radios, install the software to talk to the SPI flash programmer, and have all of them read/write the image to the SPI flash in a random order, without giving any of them the ability to know whether another computer will do the same thing right after. This way any computer can detect if what's on the SPI flash is what is supposed to be there, or if another computer has modified it. None of the computers will be able to predict whether it is the last one to check that flash chip.

Denis.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds