Pulling Linux up by its bootstraps
A bootstrappable build is one that builds existing software from scratch; for example, building GCC without relying on an existing copy of GCC. In 2023, the Guix project announced that it had reduced the size of the binary bootstrap seed needed to build its operating system to just 357 bytes, not counting the Linux kernel required to run the build process. Now, the live-bootstrap project has gone a step further and removed the need for an existing kernel at all.
The live-bootstrap project was started in 2020 by Samuel Tyler (also known as "fosslinux") as a way to automate a complete bootstrap of a modern Linux system. Since then, Tyler has been joined by Andrius Štikonas and Gábor Stefanik as co-maintainers, along with 17 other contributors. The project's goal is to create a usable system "with only human-auditable, and wherever possible, human-written, source code". The project pulls in a number of other pieces of software, from bootstrapping tools like stage0-posix and GNU Mes, to historical versions of software that are necessary to build their more modern counterparts, such as GCC version 4.0.4 (the most recent version that does not require a C++ compiler to build). The whole process of bootstrapping a system is automated, making it possible to run automatic tests as new steps are added or software is updated.
The process
Running through the bootstrapping process is remarkably straightforward; the first step is to clone the project's Git repository in order to obtain the necessary sources. If Git is not available, it is also possible to download the release tarball for the project and its submodules from GitHub. The project does use a large number of submodules in order to incorporate other bootstrapping software, so using a recursive clone is the most convenient method:
git clone --recursive https://github.com/fosslinux/live-bootstrap
The repository contains a tool called rootfs.py that can be used to run the entire bootstrapping process. There are a few different configurations, to make contributing to or using the project easier: building in a chroot environment, building in a virtual machine using QEMU, or building on bare metal. For chroot and QEMU builds, rootfs.py oversees the whole process. For a bare-metal build, it assembles a bootable disk image. The simplest way to run everything is ./rootfs.py --qemu, but running a moderately complex script does not seem like the best way to build confidence in a bootstrapped system.
Luckily, using rootfs.py is entirely optional — the repository includes instructions for building an equivalent bare-metal disk image by hand. The first step is to run ./download-distfiles.sh, which downloads release tarballs for the pieces of software used in the bootstrap that are not included as Git submodules. Then, one hand-assembles the Builder-Hex0 minimal kernel, places it at the beginning of a new disk image, and concatenates all of the necessary files and downloaded sources onto the image in a particular format: for each file, a plain-text header of the form "src <number of bytes> <path of file>\n" followed by the content of the file. The result is a disk image consisting mostly of compressed source tarballs that should run on any x86 machine with enough memory. The project recommends having 4GB of memory and, ideally, multiple cores available.
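The record layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names src_record and build_image are hypothetical and do not come from the project, and the repository's own instructions remain the authority on details such as file ordering.

```python
from typing import Iterable, Tuple


def src_record(path: str, content: bytes) -> bytes:
    """Return one record: a plain-text header followed by the raw file content.

    The header has the form "src <number of bytes> <path of file>\n".
    """
    header = f"src {len(content)} {path}\n".encode("ascii")
    return header + content


def build_image(kernel: bytes, files: Iterable[Tuple[str, bytes]]) -> bytes:
    """Place the kernel first, then append one record per (path, content) pair.

    Hypothetical helper: the kernel bytes are assumed to be assembled already.
    """
    image = bytearray(kernel)
    for path, content in files:
        image += src_record(path, content)
    return bytes(image)
```

Because each header carries the byte count, a reader of the image can skip from one record to the next without needing any filesystem structure on the disk.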
A minimal kernel
The whole process starts with the Builder-Hex0 kernel. Started by Rick Masters in 2022, Builder-Hex0 is a minimal 32-bit kernel. Its sole purpose is to be small enough to be verified by hand, and yet able to run the shell scripts that direct the first phase of the live-bootstrap build. The full kernel runs to 2682 lines of commented machine code, but it can be built and loaded by a bootloader that fits in a single 512-byte disk sector.
The bootloader reads the human-readable sources from disk using BIOS calls, strips out comments, and converts hexadecimal digits into raw bytes. It assembles those bytes at a fixed address, and then jumps to it. The kernel itself has a built-in shell that it uses to interpret a shell script that comes after it on the disk. The kernel eschews niceties such as a disk-based filesystem or the ability to run multiple programs at once. Instead, the shell script manually creates an in-memory filesystem with the sources necessary for the next step, then assembles and runs them.
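The conversion the bootloader performs (drop comments, turn pairs of hexadecimal digits into bytes) is simple enough to sketch in Python. This is not the bootloader's actual code, which is hand-written machine code. The function name hex0_to_bytes is hypothetical, and treating both "#" and ";" as comment markers that run to the end of the line is an assumption about the source format.

```python
def hex0_to_bytes(text: str) -> bytes:
    """Convert commented hexadecimal source text into raw bytes.

    Assumption: "#" and ";" each start a comment that runs to end of line.
    """
    digits = []
    for line in text.splitlines():
        # Discard everything from the first comment marker onward.
        for marker in "#;":
            line = line.split(marker, 1)[0]
        # Drop whitespace so only hexadecimal digits remain.
        digits.append("".join(line.split()))
    # bytes.fromhex raises ValueError on stray characters or an odd digit count.
    return bytes.fromhex("".join(digits))


example = """
B8 01 00 00 00   # five bytes with a comment
CD 80            ; two more bytes
"""
machine_code = hex0_to_bytes(example)
```

The point of such a format is that the "source" is the machine code itself, annotated for human readers, so there is no separate translation step that needs to be trusted.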
Toward a larger kernel
At this point, the computer begins building stage0-posix, a set of increasingly capable assemblers and shells that eventually culminate in being able to build basic filesystem tools such as mkdir and chown. That is enough to build Mes — a Scheme interpreter written in C, and a C compiler written in Scheme. Mes's Scheme interpreter is written using only a handful of C features, so that it can be compiled by a macro assembler from stage0-posix. Mes's C compiler is not a fully conformant C compiler, but it can build Fabrice Bellard's Tiny C Compiler (tcc), which, in turn, can build Fiwix, an OS kernel that aims for compatibility with Linux 2.0 system calls.
Fiwix supports luxuries such as preemptive multitasking, filesystems, and virtual memory. The final steps of the Builder-Hex0 kernel's shell script are to build an ext2 filesystem image for Fiwix to use, place it in memory, and then jump to the Fiwix kernel.
A chain of legacy software
Equipped with a POSIX-compliant kernel and a mostly compliant C99 compiler, the rest of the build process is a matter of building older versions of various open-source projects until GCC can be built. The most complicated part is building musl. The C library linked to tcc is the one provided by Mes, which among other things cannot handle floating-point numbers. Fixing that is a multi-step process that involves building musl, rebuilding tcc, rebuilding musl, and then rebuilding tcc.
Eventually, however, the system can build Perl 5 (which is used by GCC's build system), and then GCC 4.0.4. GCC can build Linux 4.14.341-openela — a long-term support version maintained by the Open Enterprise Linux Association. Then Fiwix can kexec the new Linux kernel. From there, the project builds a series of successively newer versions of Perl, Python, GCC, and their dependencies, culminating in a minimal but usable Linux user space with GCC 13.1.0, Python 3.11.1, and a handful of other necessary tools and libraries.
Overall
In total, the process takes many hours and a good deal more CPU time than seems entirely reasonable. At one particularly frustrating point, the computer I was testing the bootstrapping process on ended up running a lengthy build using kaem, a minimal shell from the stage0-posix project that does not show any indication of progress. I had to take it on faith that the computer had not in fact hung, but it did eventually make progress.
With the Linux kernel removed from the set of bootstrapping requirements, it is finally possible to definitively lay to rest the worries raised by Ken Thompson's "Reflections on Trusting Trust" Turing award lecture. David Wheeler described a technique — diverse double-compilation — to use a trustworthy compiler (such as the one produced by the bootstrapping process) to check whether there was a trusting-trust backdoor in another compiler. But in reality, the increasing attention paid to reproducible and bootstrappable builds has made trusting-trust-based attacks persisting without notice increasingly unlikely over the past several years.
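The shape of the diverse double-compilation check can be illustrated with a toy model in Python. Everything here is an assumption made for illustration: compilers are stand-in byte strings, toy_run is a hypothetical function that pretends to execute a compiler binary on some source, and a compiler tagged "EVIL:" models a self-propagating backdoor. Real use requires deterministic builds and actual compilers.

```python
import hashlib
from typing import Callable

# run(compiler_binary, source) -> the binary that compiler produces from source
Run = Callable[[bytes, bytes], bytes]


def ddc_matches(run: Run, trusted: bytes, suspect: bytes, source: bytes) -> bool:
    """Return True if the suspect binary matches a trusted rebuild of its source."""
    stage1 = run(trusted, source)  # compiler source built by the trusted compiler
    stage2 = run(stage1, source)   # same source rebuilt by the stage-1 result
    return stage2 == suspect       # must be bit-for-bit identical


def toy_run(compiler: bytes, source: bytes) -> bytes:
    """Toy model: a compiler tagged EVIL taints everything it builds."""
    tag = b"EVIL:" if compiler.startswith(b"EVIL:") else b"GOOD:"
    return tag + hashlib.sha256(source).digest()


compiler_source = b"int main(void) { return 0; }"
trusted = b"GOOD:some-unrelated-compiler"
honest = toy_run(b"GOOD:seed", compiler_source)
backdoored = toy_run(b"EVIL:seed", compiler_source)

honest_ok = ddc_matches(toy_run, trusted, honest, compiler_source)
backdoored_ok = ddc_matches(toy_run, trusted, backdoored, compiler_source)
```

In this model the honest binary passes the check and the backdoored one fails, because the backdoor lives only in the binary and cannot be regenerated from the source by a compiler that never contained it.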
The real benefits of bootstrappable builds lie elsewhere. Like reproducible builds, they can make users more confident that the binary packages downloaded from a package mirror really do correspond to the open-source project whose source code they can inspect. Bootstrappable builds have also reduced the complexity of building a Linux distribution from scratch, for instance by convincing critical GNU projects to make releases that are compressed with gzip instead of XZ, cutting XZ out of the complicated web of interdependent software that underlies a modern Linux user space.
But most of all, bootstrappable builds are a boon to the longevity of our software ecosystem. It's easy for old software to become unbuildable. By having a well-known, self-contained chain of software that can build itself from a small seed, in a variety of environments, bootstrappable builds can help ensure that today's software is not lost, no matter where the open-source community goes from here.
Why not Forth?
Posted Jul 31, 2024 19:29 UTC (Wed) by Wol (subscriber, #4433)
I'm not sure how big the ROM in my Jupiter Ace was, but it wasn't much ...
Cheers,
Wol
Why not Forth?
Posted Jul 31, 2024 19:59 UTC (Wed) by daroc (editor, #160859)
There are quite small Forths available — I've come across sectorforth, which is less than 512 bytes, but I'm sure there are many others. But I admit it didn't occur to me to ask that question when putting the article together. I briefly corresponded with one of the maintainers, so I'll pass on the question and see if they're willing to provide an answer.
Why not Forth?
Posted Aug 1, 2024 10:16 UTC (Thu) by dottedmag (subscriber, #18590)
Some ideas to bounce that _could_ cut down on the amount of bootstrap work:
- Throw away problematic configuration/build systems, especially for old fixed version of software. Their complexity comes from their portability, and here the target is pretty much nailed down. A particular approach that worked well for me to trim down compilation dependencies is to run a configuration script, run the compilation, record all the compilation steps and make a shell file to play them back. This approach has a benefit of having zero logic in the resulting build script, and no maintenance burden for fixed software versions. Another benefit for C and especially C++ software is that a ton of separate compilation commands may be merged into one, and that improves compilation speed.
- Do the "good enough" implementation of various tools in assembly/whatever language is able to issue syscalls to short-circuit their dependencies.
Proof of DDC
Posted Jul 31, 2024 22:49 UTC (Wed) by Foxboron (subscriber, #108330)
To our knowledge that is the first non-academic proof of DDC.
https://reproducible-builds.org/news/2019/12/21/reproducible-bootstrap-of-mes-c-compiler/
Proof of DDC
Posted Jul 31, 2024 23:16 UTC (Wed) by ms-tg (subscriber, #89231)
* [Diverse Double Compiling](https://dwheeler.com/trusting-trust/dissertation/html/whe...)
Proof of DDC
Posted Jul 31, 2024 23:49 UTC (Wed) by rahulsundaram (subscriber, #21946)
>* [Diverse Double Compiling](https://dwheeler.com/trusting-trust/dissertation/html/whe...)
Yep, that's linked from the news post referenced.
Practicalities
Posted Aug 1, 2024 12:01 UTC (Thu) by Phantom_Hoover (subscriber, #167627)