Building secure images with NixOS

By Daroc Alden
November 6, 2024

Image-based Linux distributions have recently seen increasing popularity. They promise reliability and security, but pose packaging problems for existing distributions. Ryan Lahfa and Niklas Sturm spoke about the work that NixOS has done to enable an image-based workflow at this year's All Systems Go! conference in Berlin. Unfortunately, LWN was not able to cover the conference for scheduling reasons, but the videos of the event are available for anyone interested in watching the talks. Lahfa and Sturm explained that it is currently possible to create a NixOS system that cryptographically verifies the kernel, initrd, and Nix store on boot, although doing so still has some rough edges. Making an image-based NixOS installation is similarly possible.

Lahfa started by giving a brief overview of NixOS for those attendees who were unfamiliar with it. He described the distribution as a "standard systemd-based Linux", but with some differences mostly centered around the fact that it does not follow the filesystem hierarchy standard. In NixOS, all of the binaries on the system live in /nix/store, and are configured to use a path and library path that are tightly scoped to only their declared dependencies. This has a lot of benefits, Lahfa said, including NixOS's ability to run multiple versions of the same software. But it also has consequences for secure boot.

Lahfa explained that secure boot "controls who is allowed to run software on your computer". It relies on using signed binaries; the computer will only boot into the provided kernel if the signature on it is valid. On systemd systems, it is possible to use unified kernel images (UKIs), which package a unified extensible firmware interface (UEFI) boot stub, the kernel, and its initrd together. This has security benefits, because it means that secure boot validates the initrd as well as the kernel. But it causes problems for NixOS, which needs to present many more options in the bootloader than most other distributions in order to support its efficient rollback features.

NixOS's separation of binaries into individual paths under /nix/store — and ability to share libraries between different versions — allows the distribution to keep a large number of previous configurations around. Every time a NixOS system has its configuration changed, from a software update, for example, the complete state of the installed programs is saved as a "generation". In the bootloader, the user can select any previous generation they would like (at least until the old generations are cleaned up to reclaim their storage space), and the kernel will load the appropriate initrd for that generation, which in turn sets up all of the configuration files from that generation. This allows for fearless upgrades, since the previous configuration is available in the boot menu — a value proposition quite similar to image-based distributions. Unfortunately, this ability doesn't work well if the initrd needs to be bundled with the kernel, because that increases both the size of each kernel image, and the number of different kernel images that must be stored. Doing so will quickly fill up the EFI (Extensible Firmware Interface) system partition (ESP).
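
To get a rough feel for why bundling fills the ESP, the following sketch compares storing one bundled image per generation against storing each distinct kernel and initrd only once. The sizes, file names, and the four example generations are hypothetical, and the UEFI stub's own size is ignored; it is an illustration of the arithmetic, not a measurement of a real NixOS system.

```python
# Hypothetical sizes in MiB; real kernels and initrds vary widely.
KERNEL_MIB = 12
INITRD_MIB = 40


def bundled_usage(generations):
    """One bundled image per generation: kernel and initrd stored every time."""
    return len(generations) * (KERNEL_MIB + INITRD_MIB)


def shared_usage(generations):
    """Each distinct kernel and initrd is stored once and referenced by name."""
    kernels = {kernel for kernel, _ in generations}
    initrds = {initrd for _, initrd in generations}
    return len(kernels) * KERNEL_MIB + len(initrds) * INITRD_MIB


# Each generation is a (kernel, initrd) pair; the names are made up.
generations = [
    ("kernel-6.6.1", "initrd-a"),
    ("kernel-6.6.1", "initrd-a"),
    ("kernel-6.6.1", "initrd-b"),
    ("kernel-6.6.2", "initrd-c"),
]

print(bundled_usage(generations))  # 208
print(shared_usage(generations))   # 144
```

The gap widens as more generations accumulate without the kernel or initrd changing between them.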

So, to work around this, Lahfa, Sturm, and other contributors wrote Lanzaboote, a NixOS-aware reimplementation of systemd-stub, the UEFI secure-boot stub included in the systemd project. Lanzaboote is signed, for secure boot, and then separately verifies the hashes of the kernel and initrd that it loads. Now, generations can share kernels and initrds when possible, making it possible to have more generations available. Unfortunately, Lanzaboote is hard to upstream. The systemd project suggested using systemd-boot addons instead — binaries that are verified by secure boot but not directly run. If initrds and kernels could be made into addons, then Lanzaboote's functionality may not be needed, and NixOS could just use systemd's loader.conf to mix-and-match initrds and kernels. In the future, efforts to "denormalize" UKIs into their component parts like this without losing their attendant security benefits could potentially benefit image-based distributions as well, by enabling more sharing between images.
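
The idea of a signed stub that separately checks the hashes of what it loads can be sketched as follows. This is only an illustration in Python, not Lanzaboote's code: the choice of SHA-256, the `expected` table, and the `boot` function are assumptions made for the example.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_component(data: bytes, expected_hex: str) -> bool:
    """Return True only if data hashes to the expected digest."""
    return hmac.compare_digest(sha256_hex(data), expected_hex)


def boot(kernel: bytes, initrd: bytes, expected: dict) -> str:
    """Refuse to continue unless both components match their recorded hashes."""
    if not verify_component(kernel, expected["kernel"]):
        raise ValueError("kernel hash mismatch")
    if not verify_component(initrd, expected["initrd"]):
        raise ValueError("initrd hash mismatch")
    return "booting"


# Stand-ins for the real files on the ESP.
kernel = b"pretend kernel image"
initrd = b"pretend initrd"

# Hypothetical: digests recorded at build time and carried inside the signed stub.
expected = {"kernel": sha256_hex(kernel), "initrd": sha256_hex(initrd)}

print(boot(kernel, initrd, expected))  # booting
```

Because only the small stub needs a secure-boot signature in this model, several stubs can point at the same kernel or initrd file without weakening the check.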

Building images

For the second half of the talk, Sturm spoke about how NixOS could actually be used as an image-based distribution. He started by describing why — when NixOS already supports atomic rollbacks — users might want to use an image-based model. Ultimately, it comes down to security, he said. An image-based workflow enables more security features — "things we really want, but cannot offer by default to everyone". The core benefit is cryptographically enforced immutability, which makes a computer more tamper-resistant.

Sturm then explained how to use NixOS to build images. Nix (NixOS's package manager and build system) can produce ISO files directly as an output from the build process, but he instead suggested an approach based on changing how NixOS stores generations. Normally, generations are stored as files inside /nix/store. Any data that is shared between generations (such as a program that was not updated) can be referenced by multiple generations. To produce images based on NixOS, he suggested having a separate partition for each generation. This loses NixOS's ability to share data between generations, but it also brings the distribution closer to what existing image-based distributions do, meaning that existing tooling can be reused. Sturm specifically called out systemd-repart, ukify, and systemd-sysupdate as existing tools that fit naturally into this approach.

Not every tool can be directly adapted, though. Sturm said that mkosi — systemd's OS image builder — would "break the developer flow" on NixOS, and that it was better to build everything directly with Nix. Eventually, the folks working on image-based NixOS may want mkosi support to enable more comprehensive testing by the upstream systemd project, but for now it is not a goal.

Sturm then gave an example of the partition layout that a computer using this approach might use: two separate sets of partitions for generations, so that the one not in use can be updated, each protected by dm-verity. Each set of partitions has a partition for the dm-verity metadata, and one for the store itself. The second set of store partitions can be created upon first boot, so a system using this approach can be installed just by copying the EFI system partition (ESP) and first store partition into place. Finally, a persistent partition for user data takes up the rest of the disk.

In a computer set up like this, the Nix store could be mounted using nix-store-veritysetup-generator, or it could be mounted under /usr with the normal systemd-veritysetup-generator and bind-mounted into place. In either case, the Nix store would be protected by dm-verity. None of this work is enabled by default on NixOS, Sturm cautioned. A user needs to opt-in to get it. But it "lets you kind of build your own distro" in that you can potentially use Nix to produce system images that don't rely on any NixOS tools at run time with this method.
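
For readers unfamiliar with how dm-verity protects a read-only partition, the following is a deliberately simplified sketch of the underlying hash-tree idea: every data block is hashed, the hashes are packed into blocks and hashed again, and so on until a single root hash remains. It assumes 4096-byte blocks and SHA-256, omits the salt and the on-disk format, and will not produce the same root hash as the real veritysetup tool.

```python
import hashlib

BLOCK_SIZE = 4096


def split_blocks(data: bytes) -> list:
    """Split data into BLOCK_SIZE chunks, zero-padding the last one."""
    return [
        data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        for offset in range(0, len(data), BLOCK_SIZE)
    ]


def root_hash(data: bytes) -> str:
    """Hash each block, pack the digests into blocks, and repeat up to the root."""
    level = split_blocks(data)
    if not level:
        raise ValueError("cannot build a hash tree over empty data")
    while True:
        digests = [hashlib.sha256(block).digest() for block in level]
        if len(digests) == 1:
            return digests[0].hex()
        level = split_blocks(b"".join(digests))


# A stand-in "store partition" of 200 zero-filled blocks.
image = bytes(BLOCK_SIZE * 200)
trusted_root = root_hash(image)

# Flipping a single bit anywhere changes the root hash.
tampered = bytearray(image)
tampered[5000] ^= 1
print(root_hash(bytes(tampered)) == trusted_root)  # False
```

Only the root hash has to be trusted (for example, by being covered by the signed boot chain); everything below it can be checked lazily as blocks are read.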

One audience member asked whether it was possible to have the Nix store both verified by dm-verity and encrypted. Sturm said that it was "not supported by repart but doable". Specifically, the user would need to create an encrypted block device and then layer dm-verity over the top. Another audience member asked whether users would really have to give up on sharing data between generations to use this approach — wouldn't it be possible to have a shared baseline that multiple generations could reference? Sturm agreed that would be possible as well, but not while simultaneously using dm-verity. The tradeoff between efficiency and security is something that each user will need to consider when deciding which one is right for them.

NixOS breaks a lot of assumptions about Linux systems, starting with the filesystem hierarchy standard and going from there. Despite that, a lot of the same tools that are being used with other image-based distributions can be used to create an image-based NixOS installation. Users who want the benefits of an image-based distribution at the same time as NixOS's unique advantages may have to do a bit of tweaking, since support is still a work in progress, but should nonetheless find that combining the two is possible.

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 17:14 UTC (Wed) by Karellen (subscriber, #67644) [Link] (8 responses)

Sturm then gave an example of the partition layout that a computer using this approach might use: two separate sets of partitions for generations, so that the one not in use can be updated [...] Finally, a persistent partition for user data takes up the rest of the disk.

With such a system, how does one determine partition sizes that are a) big enough to be future-proof for whatever system software you might want to install later, while b) not taking up unneeded space that might be later wanted for user data?

I've tried a bunch of different partition strategies over the years, including separate /usr, /home, /boot, /srv and /var in various combinations (although not all at once, I think). Given the difficulties of resizing and repartitioning disks and filesystems while some of them are in use, and running out of space in various partitions I'd not given enough room for despite having plenty of room on the disk overall, I've gone with a unified encrypted root with only a separate /boot partition for probably about a decade now, and not had to worry about it again.

Is the solution just "have an order of magnitude more disk space than you're ever going to use"? Or what?

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 17:39 UTC (Wed) by RaitoBezarius (subscriber, #106052) [Link]

(I'm one of the speakers in said talk)

> With such a system, how does one determine partition sizes that are a) big enough to be future-proof for whatever system software you might want to install later, while b) not taking up unneeded space that might be later wanted for user data?

Usually, there are well-known sizes which are parameters of your deployment:

- ESP, fixed size (1GB or so)
- Nix store (so the code, the config, etc.), we put some reasonable limits here, let's say 200GB
- Verity partition for Nix store, answer: https://github.com/systemd/systemd/pull/34636 (another contributor of this whole project made this PR)
- User data, minimized size for initial startup and then automatic grow

(add the extra salt for A/B schema)
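
As a back-of-the-envelope illustration (not taken from the linked PR), the size of the hash tree for a given data partition can be estimated like this. The sketch assumes 4096-byte blocks and 32-byte SHA-256 digests, treats "200GB" as 200 GiB, and ignores the verity superblock, so the result is approximate.

```python
def verity_hash_blocks(data_bytes: int, block_size: int = 4096, digest_size: int = 32) -> int:
    """Estimate how many hash blocks a hash tree over data_bytes needs."""
    hashes_per_block = block_size // digest_size
    # Integer ceiling division: number of data blocks.
    blocks = -(-data_bytes // block_size)
    total = 0
    # Each level stores one digest per block of the level below it.
    while blocks > 1:
        blocks = -(-blocks // hashes_per_block)
        total += blocks
    return total


store_bytes = 200 * 2**30  # assumption: "200GB" read as 200 GiB
hash_blocks = verity_hash_blocks(store_bytes)
hash_bytes = hash_blocks * 4096

print(hash_blocks)                       # 412826
print(f"{hash_bytes / 2**30:.2f} GiB")   # 1.57 GiB
```

Under these assumptions the verity partition comes to a little under 1% of the store partition it protects.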

If that's not enough, I kind of asked this question during the image based summit, see: https://lwn.net/Articles/994704/ under "ESP resizing", the suggested solution is dm-linear.

So theoretically, if you exceed your 200GB budget, and you really need to do fancy things, I suppose a dm-linear based solution can do smart things if you can safely downsize the user partition.

Nonetheless, for an image-based system, `/srv` and `/home` don't really exist; I would map everything to `/var`, the only mutable partition in the setup, and bind mount `/home` to `/var/home`, etc. So it may look like your unified encrypted root setup, except that the Nix store doesn't require confidentiality as long as you don't put secrets in your code or configuration and use systemd-credentials (or similar concepts).

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 18:00 UTC (Wed) by geert (subscriber, #98403) [Link] (1 responses)

> I've gone with a unified encrypted root with only a separate /boot partition for probably about a decade now, and not had to worry about it again.

Same here. But I tend to run into /boot becoming too small after a while :-(

How do image-based systems figure out what partition sizes they need?

Posted Nov 7, 2024 10:08 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

>> I've gone with a unified encrypted root with only a separate /boot partition for probably about a decade now, and not had to worry about it again.

> Same here. But I tend to run into /boot becoming too small after a while :-(

Same same. I'm greedy about the firmwares that go into the initrd, so that's on me.

K3n.

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 18:21 UTC (Wed) by ballombe (subscriber, #9523) [Link] (1 responses)

I use LVM and keep 30% of the disk unallocated.
LVM allows you to extend mounted partitions without any hassle.

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 18:38 UTC (Wed) by arsen (subscriber, #161285) [Link]

images don't have to contain FSes, just archives suffice (e.g. mkosi can generate tar, cpio, and plain dirs besides GPT disks)

How do image-based systems figure out what partition sizes they need?

Posted Nov 6, 2024 18:54 UTC (Wed) by walters (subscriber, #7396) [Link] (2 responses)

> Is the solution just "have an order of magnitude more disk space than you're ever going to use"? Or what?

We're working on https://github.com/containers/composefs/ which doesn't have this issue; its tagline is "The reliability of disk images, the flexibility of files".

How do image-based systems figure out what partition sizes they need?

Posted Nov 7, 2024 6:40 UTC (Thu) by bof (subscriber, #110741) [Link] (1 responses)

Interesting.

I'm confused though about the separate content-addressed store. That gets populated only from mkcomposefs runs? And the EROFS images can only work if the content store in place, happens to contain all the content that was there when the image was created in the first place?

Won't that mean that any newer image is going to break if the content store ever is restored from older backup? And how would garbage collection in the content store work?

I feel like I'm missing something there.

How do image-based systems figure out what partition sizes they need?

Posted Nov 7, 2024 18:28 UTC (Thu) by walters (subscriber, #7396) [Link]

> That gets populated only from mkcomposefs runs?

Not necessarily.

Basically the composefs git repository is a low level unopinionated generic tool. Yes, `mkcomposefs` can write files into the object store, but so can any other tool and in practice we expect sophisticated higher level software to do exactly that.

> And the EROFS images can only work if the content store in place, happens to contain all the content that was there when the image was created in the first place?

Well yes, you are responsible for ensuring the object store is populated. But for example you could totally have higher level software, before invoking "mount.composefs", quickly check that the expected objects are there, and if not fetch them from a network object store on-demand. There are CLI tools and a C library that let you read and write the composefs EROFS for doing things like this.

> And how would garbage collection in the content store work?

That is again up to higher level tooling; I'll try to fill out the docs more but the simple baseline is:

- You maintain a set of "GC roots" (images) that are composefs blobs
- Iterate over those composefs blobs, parse them to find which objects they reference
- Iterate over the object store and remove unreferenced objects
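
The three steps above amount to a mark-and-sweep pass, which can be sketched as follows. This is not the composefs API: parsing the composefs blobs is replaced by a plain mapping from image name to referenced object IDs, the object store is a dictionary, and the function name is hypothetical.

```python
def collect_garbage(gc_roots: dict, object_store: dict) -> list:
    """Remove objects that no GC root references; return the removed IDs."""
    # Mark: gather every object ID referenced by any root image.
    referenced = set()
    for object_ids in gc_roots.values():
        referenced.update(object_ids)

    # Sweep: delete whatever is in the store but was never marked.
    unreferenced = sorted(set(object_store) - referenced)
    for object_id in unreferenced:
        del object_store[object_id]
    return unreferenced


# Hypothetical store contents and roots.
store = {"aa11": b"libc", "bb22": b"bash", "cc33": b"old-firefox"}
roots = {
    "image-current": ["aa11", "bb22"],
    "image-previous": ["aa11"],
}

removed = collect_garbage(roots, store)
print(removed)        # ['cc33']
print(sorted(store))  # ['aa11', 'bb22']
```

A real implementation would also need to guard against an image being added while the sweep is running, which this sketch does not attempt.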

We are working on a higher level Rust project that will be a much more opinionated interface to things like this, including direct integration with e.g. OCI.

But there are already others who have integrated composefs, e.g. there was a lightning talk at All Systems Go from https://rauc.io/ about using composefs.

Overlayfs ?

Posted Nov 7, 2024 16:00 UTC (Thu) by Lennie (subscriber, #49641) [Link]

I get the feeling that, as with containers, overlayfs could be used in some way to solve the partitioning problems? Not that I know exactly how I would do it yet...