Leading items
Welcome to the LWN.net Weekly Edition for February 16, 2023
This edition contains the following feature content:
- NASA and open-source software: a FOSDEM talk on how the US space agency uses — and supports — open-source software.
- Free software and fiduciary duty: a Bitcoin-related case creates worries in the free-software community.
- The extensible scheduler class: enabling CPU schedulers to be written in BPF and loaded into the kernel.
- A proposed threat model for confidential computing: what are confidential-computing efforts working to defend against?
- An overview of single-purpose Linux distributions: several small and focused distributions were in the spotlight at FOSDEM.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
NASA and open-source software
From the moon landing to the James Webb Space Telescope and many other scientific missions, software is critical for the US National Aeronautics and Space Administration (NASA). Sharing information has also been in the DNA of the space agency from the beginning. As a result, NASA also contributes to and releases open-source software and open data. In a keynote at FOSDEM 2023, Science Data Officer Steve Crawford talked about NASA and open-source software, including the challenges NASA has faced in using open source and the agency's recent initiatives to lower barriers.
Software has always been a big part of NASA's work. Who hasn't seen the photo of computer scientist Margaret Hamilton next to a hard-copy stack of the Apollo software she and her team at MIT produced? The stack of code is as tall as she is. In 2016, the original Apollo 11 Guidance Computer source code for the command and lunar modules was published on GitHub in the public domain. You can even compile the code and run it in a simulator.
Sharing its discoveries has also always been a part of NASA's heritage, Crawford emphasized. He showed section 203(a) of the National Aeronautics and Space Act of 1958, which created NASA. It states that the agency shall "provide for the widest practicable and appropriate dissemination of information concerning its activities and the results thereof".
From open development to spin-offs
In recent years, more and more of this sharing was also in the form of releasing software. For instance, when NASA's drone copter Ingenuity made its first flight on Mars in 2021 as part of the Perseverance mission, it used an open-source flight-control framework, F Prime. NASA's Jet Propulsion Laboratory (JPL) released the framework in 2017 under the Apache 2.0 license. One of the example deployments even runs on the Raspberry Pi. But the NASA mission also used a lot of open-source dependencies. To celebrate Ingenuity's first flight, GitHub recognized the more than 12,000 people who contributed to these dependencies with a badge on their profile.
Another high-profile mission that Crawford talked about is the James Webb Space Telescope, launched in December 2021. It's a collaboration with the European Space Agency (ESA) and the Canadian Space Agency (CSA), and the successor to the Hubble Space Telescope launched in 1990. Its calibration software is developed openly on GitHub, enabling scientists to test their projects. The software library is developed in Python and makes use of Astropy, which handles common astronomy tasks, such as converting between different coordinate systems and reading and writing files with astronomical data. NASA not only uses open-source projects in F Prime and the James Webb Space Telescope's calibration software, but in a lot of other projects too. Some of those projects can be seen in Crawford's slide above.
Crawford listed a lot of other open-source software NASA has released in its long history, calling it "just a small sampling". He also noted that some of these projects became spin-offs. For instance, in 2008 NASA started a project called Project Nebula to standardize its web sites. It then evolved into something that was addressing more general needs. This caused NASA to join forces with Rackspace to create the open-source cloud-computing infrastructure project OpenStack, which is currently managed by the non-profit OpenStack Foundation.
Astronomical challenges
While the previous examples may be some high-profile successes, open source at NASA doesn't come without its challenges. "Civil servants can't release anything copyrightable", Crawford said, referring to the fact that under US copyright law, a work prepared by an officer or employee of the United States Government as part of that person's official duties is in the public domain.
Of course NASA has contributed to many open-source projects, but according to Crawford people often do this "not in their official capacity as NASA employees". In 2003 NASA created a license to enable the release of software by civil servants, the NASA Open Source Agreement. This license has been approved by the Open Source Initiative (OSI), but the Free Software Foundation doesn't consider it a free-software license because it does not allow changes to the code that come from third-party free-software projects. "It isn't widely used in the community and complicates the reuse of NASA software with this license", Crawford said.
Another challenge is NASA's famous bureaucracy, Crawford admitted: "NASA does not always engage well with the open-source community." As an example, he showed how curl's main developer Daniel Stenberg received an email from NASA's Commercial IT Acquisition Team, asking him to supply country of origin information for curl, as well as a list of all "authorized resellers". Stenberg commented on the keynote (which he narrowly missed attending) in a recent blog post.
Lowering barriers for open-source software
There are some initiatives brewing at NASA that should make these challenges a thing of the past, though. In his talk, Crawford presented NASA's Open-Source Science Initiative (OSSI). Its goal is to support scientists to help them integrate open-science principles into the entire research workflow. Just a few weeks before Crawford's talk, NASA's Science Mission Directorate published its new policy on scientific information.
Crawford summarized this policy with "as open as possible, as restricted as necessary, always secure", and he made this more concrete: "Publications should be made openly available with no embargo period, including research data and software. Data should be released with a Creative Commons Zero license, and software with a commonly used permissive license, such as Apache, BSD, or MIT. The new policy also encourages using and contributing to open-source software." Crawford added that NASA's policies will be updated to make it clear that employees can contribute to open-source projects in their official capacity.
This new policy should lower the barriers between open-source software and NASA. Funding for external open-source projects has already improved: in 2021-2022 NASA selected 16 proposals to support 22 different open-source projects financially. This included the Python projects NumPy (used by Astropy), pandas, and scikit-learn, as well as the Julia programming language.
As part of its Open-Source Science Initiative, NASA has started its five-year Transform to Open Science (TOPS) mission. This is a $40-million mission to speed up adoption of open-science practices; it starts with the White House and all major US federal agencies, including NASA, declaring 2023 as the "Year of Open Science". One of NASA's strategic goals with TOPS is to enable five major scientific discoveries through open-science principles, Crawford said.
Keep contributing
Open-source software will clearly play an important role in open science, and was already instrumental in various breakthrough discoveries. When scientists created the first image of a black hole in 2019 from data generated by the Event Horizon Telescope, Dr. Katie Bouman, who led the development of the imaging algorithm, was explicit about it: "We're deeply grateful to all the open source contributors who made our work possible." This was also the message Crawford ended his talk with: "Keep contributing, building, and sustaining your code." After his "Thank you for your contributions", his words were followed by big applause from a room full of open-source developers.
Free software and fiduciary duty
Serial litigant Craig Wright recently won a procedural ruling in a London court that allows a multi-billion-dollar Bitcoin-related lawsuit to proceed. This case has raised a fair amount of concern within the free-software community, where it is seen as threatening the "no warranty" language included in almost every free-software license. As it happens, this case does not actually involve that language, but it has some potentially worrisome implications anyway.

Wright is known for, among other things, claiming to be Satoshi Nakamoto, the author of the original paper describing Bitcoin, and for filing numerous lawsuits within the cryptocurrency community. In the case at hand, he (in the form of his company "Tulip Trading Limited") claims to own about $4 billion in Bitcoin sitting in the blockchain — a claim that, like his others, is not universally acknowledged — but to have lost the keys giving access to that Bitcoin after his home network was broken into. It is, Wright claims, incumbent upon the maintainers of the Bitcoin network software to develop and merge a patch allowing the claimed Bitcoin to be transferred to a key that he controls.
The various Bitcoin developers, it turns out, are unconvinced by Wright's claim to that Bitcoin and even less convinced that the Bitcoin miners would accept a software update that included such a patch. Wright, allegedly backed by some deep pockets with eyes on part of a $4 billion prize, has taken 15 of these developers (and one organization) to court. The case fared poorly in its first round, but now an appeals court has issued a ruling allowing an appeal to proceed, saying that there are issues of interest to be litigated.
At a first look, this case appears to be a warranty issue, and many observers have seen it that way. Wright is asserting that a bug in the Bitcoin system is keeping him from getting his hands on his well-earned billions, and that the maintainers of that code owe him a fix. The code in question is covered by the MIT license, which explicitly disclaims the existence of any warranty; if the court were to find that a warranty obligation exists anyway, the resulting precedent could put free-software developers at risk worldwide. It is not surprising that people are concerned.
The appeals-court ruling, though, makes no mention of warranties. The question of whether Wright is entitled to a "fix" hinges on a different issue:
The essence of Tulip’s case is that the result of all this is that the developers, having undertaken to control the software of the relevant bitcoin network, thereby have and exercise control over the property held by others (i.e. bitcoin), and that this has the result in law that they owe fiduciary duties to the true owners of that property with the result that, on the facts of this case, they are obliged to introduce a software patch along the lines described above, and help Tulip recover its property.
In other words, it is not the software license that might possibly create an obligation here. It is the control over software that manages somebody else's assets that might. Note that this ruling does not reach any conclusions regarding whether Wright's claim to the Bitcoin is valid.
In your editor's opinion, though, the court misunderstands the nature of the control that the Bitcoin network developers have:
Without the relevant password (etc.) for the bitcoin software account in Github, no one else, such as a concerned bitcoin owner, could fix the bug. If a bitcoin owner identified a bug and wrote the code to fix it, that fix could still only be implemented if the developers agreed to do so in the exercise of their de facto power. In a very real sense the owners of bitcoin, because they cannot avoid doing so, have placed their property into the care of the developers. That is, in my judgment, arguably an "entrustment".
The crucial point that has been missed here, of course, is that nobody can prevent others from applying a fix to the code — that is part of the fundamental freedom that comes with free software. If the maintainers of a given repository refuse to apply needed fixes, the community can fork the project and route around those maintainers. Wright, as the alleged creator of Bitcoin, should certainly be capable of writing this patch and convincing the mining community — which is where the real power to decide which software is run lies — that it should be applied.
Similarly, and importantly, a code fork could also happen if those maintainers were to merge a hack that somehow forced a transaction into the blockchain despite the absence of the private key controlling the Bitcoin in question, handing control of a contested resource to an actor seen by many as, at best, a scammer. Such an act would be highly likely to cause those maintainers to lose control over the software entirely. Even if Wright were to somehow win a ruling compelling the developers to apply an unwanted patch to a specific repository, the chances of that change being deployed to the network of Bitcoin miners seem small.
The appeals court, seemingly, fails to understand that aspect of how free software works and attributes a power to the maintainers of a specific repository that they do not, in fact, have; therein lies the risk from this case. If, somehow, maintenance of a body of software that is used to maintain assets owned by others — whether those assets are cryptocurrency, social-media posts, cat videos, or indeed software repositories — can be seen to create some sort of fiduciary responsibility toward the users of that software, our community's maintainer shortage will get significantly worse. Maintainership can be a thankless job as it is; the prospect of being sued for failing to write or apply a given "fix" would certainly push many maintainers over the line and out the door.
This ruling is just a preliminary decision allowing the case to proceed; the real trial is yet to happen. Hopefully some sense will be applied there, preferably before the defendants are bankrupted by legal fees (the newly created Bitcoin Legal Defense Fund is now handling their defense). Yes, we place trust in our maintainers, but we do so in an environment that limits just how much trust is required and gives us recourse should a maintainer fail to live up to that trust. A ruling that maintainers owe us more than they already give will not do the community — or almost anybody else — any good.
[Thanks to Paul Wise for a heads-up on this topic.]
The extensible scheduler class
It was only a matter of time before somebody tried to bring BPF to the kernel's CPU scheduler. At the end of January, Tejun Heo posted the second revision of a 30-part patch series, co-written with David Vernet, Josh Don, and Barret Rhoden, that does just that. There are clearly interesting things that could be done by deferring scheduling decisions to a BPF program, but it may take some work to sell this idea to the development community as a whole.

The core idea behind BPF is that it allows programs to be loaded into the kernel from user space at run time; using BPF for scheduling has the potential to enable significantly different scheduling behavior than is seen in Linux systems now. The idea of "pluggable" schedulers is not new; it came up in this 2004 discussion of yet another doomed patch series from Con Kolivas, for example. At that time, the idea of pluggable schedulers was strongly rejected; only by focusing energy on a single scheduler, it was argued, could the development community find a way to satisfy all workloads without filling the kernel with a confusion of special-purpose schedulers.
Of course, the idea that the kernel only has one CPU scheduler is not quite accurate; there are actually several of them, including the realtime and deadline schedulers, that applications can choose between. But almost all work on Linux systems runs under the default "completely fair scheduler", which indeed does a credible job of managing a wide variety of workloads on everything from embedded systems to supercomputers. There is always a desire for better performance, but there have been almost no requests for a pluggable scheduler mechanism for years.
Why, then, is the BPF mechanism being proposed now? In clear anticipation of a long discussion, the cover letter for the patch series describes the motivation behind this work at great length. In short, the argument goes, the ability to write scheduling policies in BPF greatly lowers the difficulty of experimenting with new approaches to scheduling. Both our workloads and the systems they run on have become much more complex since the completely fair scheduler was introduced; experimentation is needed to develop scheduling algorithms that are suited to current systems. The BPF scheduling class allows that experimentation in a safe manner without even needing to reboot the test machine. BPF-written schedulers can also improve performance for niche workloads that may not be worth supporting in the mainline kernel and are much easier to deploy to a large fleet of systems.
Scheduling with BPF
The patch set adds a new scheduling class, called SCHED_EXT, that can be selected with a sched_setscheduler() call like most others (selecting SCHED_DEADLINE is a bit more complicated). It is an unprivileged class, meaning that any process can place itself into SCHED_EXT. SCHED_EXT is placed between the idle class (SCHED_IDLE) and the completely fair scheduler (SCHED_NORMAL) in the priority stack. As a result, no SCHED_EXT scheduler can take over the system in a way that would prevent, for example, an ordinary shell session running as SCHED_NORMAL from running. It also suggests that, on systems where SCHED_EXT is in use, the expectation is that the bulk of the workload will be running in that class.
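Since SCHED_EXT is unprivileged, opting in would be an ordinary sched_setscheduler() call. As a rough sketch (the numeric value 7 for SCHED_EXT is an assumption taken from the patch series; it is not exported by current C libraries or by Python, and the call will simply fail on a kernel without the patches):

```python
import os

# Assumed value from the patch series (SCHED_DEADLINE is 6); this
# constant is not defined in glibc or in Python's os module.
SCHED_EXT = 7

def try_sched_ext():
    """Attempt to move the calling process into the SCHED_EXT class."""
    try:
        # Like SCHED_NORMAL, SCHED_EXT uses a static priority of zero.
        os.sched_setscheduler(0, SCHED_EXT, os.sched_param(0))
        return "running in SCHED_EXT"
    except OSError:
        # Expected (EINVAL) on kernels without the SCHED_EXT patches,
        # or if the policy change is otherwise refused.
        return "SCHED_EXT not supported on this kernel"

print(try_sched_ext())
```

On an unpatched kernel this prints the fallback message; with the series applied (and a BPF scheduler loaded or not — see below), the process would simply join the new class.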
The BPF-written scheduler is global to the system as a whole; there is no provision for different groups of processes to load their own schedulers. If there is no BPF scheduler loaded, then any processes that have been put into the SCHED_EXT class will be run as if they were in SCHED_NORMAL instead. Once a BPF scheduler is loaded, though, it will take over the responsibility for all SCHED_EXT tasks. There is also a magic function that a BPF scheduler can call (scx_bpf_switch_all()) that will move all processes running below realtime priority into SCHED_EXT.
A BPF program implementing a scheduler will normally manage a set of dispatch queues, each of which may contain runnable tasks that are waiting for a CPU to execute on. By default, there is one dispatch queue for every CPU in the system, and one global queue. When a CPU is ready to run a new task, the scheduler will pull a task off of the relevant dispatch queue and give it the CPU. The BPF side of the scheduler is mostly implemented as a set of callbacks to be invoked via an operations structure, each of which informs the BPF code of an event or a decision that needs to be made. The list is long; the full set can be found in include/sched/ext.h in the SCHED_EXT repository branch. This list includes:
- prep_enable()
- enable()
- The first callback informs the scheduler of a new task that is entering SCHED_EXT; the scheduler can use it to set up any associated data for that task. prep_enable() is allowed to block and can be used for memory allocations. enable(), which cannot block, actually enables scheduling for the new task.
- select_cpu()
- Select a CPU for a task that is just waking up; it should return the number of the CPU to place the task on. This decision can be revisited before the task actually runs, but it may be used by the scheduler to wake the selected CPU if it is currently idle.
- enqueue()
- Enqueue a task into the scheduler for running. Normally this callback will call scx_bpf_dispatch() to place the task into the chosen dispatch queue, from which it will eventually be run. Among other things, that call provides the length of the time slice that should be given to the task once it runs. If the slice is specified as SCX_SLICE_INF, the CPU will go into the tickless mode when this task runs.
It's worth noting that enqueue() is not required to put the task into any dispatch queue; it could squirrel that task away somewhere for the time being if the task should not run immediately. The kernel keeps track, though, to ensure that no task gets forgotten; if a task languishes for too long (30 seconds by default, though the timeout can be shortened), the BPF scheduler will eventually be unloaded.
- dispatch()
- Called when a CPU's dispatch queue is empty; it should dispatch tasks into that queue to keep the CPU busy. If the dispatch queue remains empty, the scheduler will try to grab tasks from the global queue instead.
- update_idle()
- This callback informs the scheduler when a CPU is entering or leaving the idle state.
- runnable()
- running()
- stopping()
- quiescent()
- These all inform the scheduler about status changes for a task; they are called when, respectively, a task becomes runnable, starts running on a CPU, is taken off a CPU, or becomes no longer runnable.
- cpu_acquire()
- cpu_release()
- Inform the scheduler about the status of the CPUs in the system. When a CPU becomes available for the BPF scheduler to manage, a callback to cpu_acquire() informs it of the fact. The loss of a CPU (because, perhaps, a realtime scheduling class has claimed it) is notified with a call to cpu_release().
There are numerous other callbacks for the management of control groups, CPU affinity, core scheduling, and more. There is also a set of functions that the scheduler can call to affect scheduling decisions; for example, scx_bpf_kick_cpu() can be used to preempt a task running on a given CPU and call back into the scheduler to pick a new task to run there.
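The dispatch-queue flow that these callbacks implement can be mimicked with a toy user-space model. This is purely illustrative: the method names loosely follow the enqueue() and dispatch-related callbacks described above, but nothing here is the kernel's actual API.

```python
from collections import deque

class ToyScheduler:
    """Toy model of the SCHED_EXT dispatch-queue flow: enqueue() places a
    runnable task on a per-CPU or global dispatch queue, and a CPU looking
    for work drains its local queue before falling back to the global one."""

    GLOBAL = -1  # stand-in for the single global dispatch queue

    def __init__(self, nr_cpus):
        # By default there is one dispatch queue per CPU, plus one global queue.
        self.queues = {cpu: deque() for cpu in range(nr_cpus)}
        self.queues[self.GLOBAL] = deque()

    def enqueue(self, task, cpu=GLOBAL):
        # Roughly what the enqueue() callback does when it calls
        # scx_bpf_dispatch(): choose a dispatch queue for the runnable task.
        self.queues[cpu].append(task)

    def pick_next(self, cpu):
        # A CPU ready to run a task pulls from its own queue first; if that
        # is empty it falls back to the global queue (the point where the
        # real scheduler would invoke the dispatch() callback for more work).
        if self.queues[cpu]:
            return self.queues[cpu].popleft()
        if self.queues[self.GLOBAL]:
            return self.queues[self.GLOBAL].popleft()
        return None  # nothing runnable; the CPU would go idle

sched = ToyScheduler(nr_cpus=2)
sched.enqueue("worker-a", cpu=0)
sched.enqueue("worker-b")           # lands on the global queue
print(sched.pick_next(0))           # local queue first: worker-a
print(sched.pick_next(1))           # local empty, global fallback: worker-b
print(sched.pick_next(1))           # nothing left: None
```

The real scheduler's extra machinery — time slices, the watchdog that unloads a scheduler when tasks languish, and the full callback set — layers on top of this basic queue-selection loop.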
Examples
The end result is a framework that allows the implementation of a wide range of scheduling policies in BPF code. To prove the point, the patch series includes a number of sample schedulers. This patch contains a minimal "dummy" scheduler that uses the default for all of the callbacks; it also has a basic scheduler that implements five priority levels and shows how to stash tasks into BPF maps. "While not very practical, this is useful as a simple example and will be used to demonstrate different features".
Beyond that, there is a "central" scheduler that dedicates one CPU to scheduling decisions, leaving all others free to run the workload. A later patch adds tickless support to that scheduler and concludes:
While scx_example_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting.
As if that weren't enough, scx_example_pair implements a form of core scheduling using control groups. The scx_example_userland scheduler "implements a fairly unsophisticated sorted-list vruntime scheduler in userland to demonstrate how most scheduling decisions can be delegated to userland". The series concludes with the Atropos scheduler, which features a significant user-space component written in Rust. The cover letter describes one more, scx_example_cgfifo, which wasn't included because it depends on the still out-of-tree BPF rbtree patches. It "provides FIFO policies for individual workloads, and a flattened hierarchical vtree for cgroups", and evidently provides better performance than SCHED_NORMAL for an Apache web-serving benchmark.
Prospects
This patch set is in its second posting and has, so far, not drawn a lot of review comments; perhaps it is too big to bikeshed. Scheduler maintainer Peter Zijlstra responded to the first version, though, saying: "I hate all of this. Linus NAK'ed loadable schedulers a number of times in the past and this is just that again -- with the extra downside of the whole BPF thing on top". He then proceeded to review many of the component patches, suggesting that he may not intend to reject this work outright.
Even so, the BPF scheduler class will clearly be a large bite for the core kernel community to swallow. It adds over 10,000 lines of core code and exposes many scheduling details that have, thus far, been kept deep within the kernel. It would be an acknowledgment that one general-purpose scheduler cannot optimally serve all workloads; some may worry that it would mark an end to work on the completely fair scheduler toward that goal and an increase in fragmentation across Linux systems. The BPF-scheduling developers argue the opposite, that the ability to freely experiment with scheduling models would, instead, accelerate improvements to the completely fair scheduler.
How this will play out is hard to predict, other than to note that the BPF juggernaut has, thus far, managed to overcome just about every objection that it has encountered. The days of locking up core-kernel functionality within the kernel itself seem to be coming to an end. It will be interesting to see what new scheduling approaches will be enabled by this subsystem.
A proposed threat model for confidential computing
The field of confidential computing is still in its infancy, to the point where it lacks a clear, agreed, and established problem description. Elena Reshetova and Andi Kleen from Intel recently started the conversation by sharing their view of a potential threat model in the form of this document, which is specific to the Intel Trust Domain Extension (TDX) on Linux, but which is intended to be applicable to other confidential-computing solutions as well. The resulting conversation showed that there is some ground to be covered to achieve a consensus on the model in the community.
This security specification constitutes the first public draft of a concise threat model for confidential computing for the Linux kernel. The first few paragraphs of the threat model describe the key confidential-computing assumption: the guest kernel in virtualized environments cannot trust the hypervisor. This probably is no surprise to readers familiar with the subject, but Greg Kroah-Hartman expressed his reservations:
That is, frankly, a very funny threat model. How realistic is it really given all of the other ways that a hypervisor can mess with a guest? So what do you actually trust here? The CPU? A device? Nothing?
The threat model provides an answer; the trusted computing base (TCB) for Intel TDX is limited to the Intel platform, the TDX module, and the software stack running inside the TDX guest. We can therefore conclude that the confidential-computing TCB on any system features the platform hardware, the guest, and any intermediary that communicates between the platform and guest.
Hardening preexisting interfaces
Memory encryption and hardware attestation can help guarantee privacy from malicious software while verifying the integrity of both the guest memory and the trusted software accessing that memory. If the attestation process confirms that the confidential-computing system has not been tampered with, then it is guaranteed that the guest's private memory is unreadable to the host. These techniques offer strong security guarantees but are not enough to protect the guest from attacks that exploit the communication interfaces between the host and the guest. The threat model addresses this gap by defining a threat-mitigation matrix that lists potential interface entry points and their possible mitigations.
Non-robust device drivers are an example of a vulnerable interface that can be exploited to feed malicious input from the hypervisor side. Kroah-Hartman complained that the "hardening" terminology used in the threat model can be misleading; broken drivers should be fixed and not hardened, he said. Reshetova disagreed, stating that certain fixes apply to systems where the hardware is operating correctly, but where the hypervisor is malicious. The primary concern around this is that an untrusted host can use device interfaces to attack the guest, but the Linux device drivers were not developed with this potential threat in mind.
Regarding device drivers, the specification recommends the maintenance of a list of allowed devices. In practice, the virtualized guest needs little more than the virtio drivers. James Bottomley referred to this when he noted that virtio devices needed by a guest to boot are potentially the most dangerous. Christophe de Dinechin questioned just how much harm a malicious virtio device might cause to the guest kernel and whether such an attack could really disclose confidential information held by the guest. To date, this question remains open, but there have been efforts to mitigate virtio threats that have led to kernel patches.
The specification also explains that several subsystems could be used for fuzz-testing of the communication interfaces exposed to the malicious hypervisor. For example, Intel-specific TDVMCALL hypercalls communicate between the guest and the TDX module and can be used for fuzz testing. Randomness inside the guest also needs extra precautions. A failed RDRAND or RDSEED function must trigger an infinite loop, precluding the guest from using alternative options that the host can tamper with. The KVM clock (kvm-clock) also becomes untrusted and must be disabled inside the guest. ACPI tables are never mapped as shared with the host; a new interface is therefore introduced to allow the host to obtain the operating regions declared in those tables. The confidential-computing guest must acknowledge the private memory pages allocated by the host in order to be protected from attacks that affect the guest paging.
The confidential-computing threat model also addresses the interesting problem of how to panic the guest kernel. The host controls inter-processor interrupts, which thus cannot be trusted to safely stop other CPUs. Furthermore, some driver notifiers perform tasks that may involve waiting for some host action. Reshetova mentioned that denial-of-service (DoS) attacks that trigger a guest crash (which can be preceded by multiple oopses) are out of scope in this model, but that one cannot assume that all crashes are safe. She further explained that certain crashes, such as those related to memory corruption, can be a starting point for further security attacks, leading to privilege escalation, information disclosure, and data corruption — the sorts of outcomes that confidential computing seeks to prevent.
The Linux TDX software stack uses dm-crypt with LUKS to protect the guest's storage devices by providing encryption and authentication for the storage volumes. However, Richard Weinberger noted that the cryptography used in LUKS is meant to safeguard data in storage, but not data in transport. Reshetova responded that the disk encryption presumes that the attacker can observe all encrypted data on the disk — and the alterations that occur — when a new block is written; she is therefore uncertain of the potential for this type of attack.
Finally, a confidential-computing guest must be aware of transient execution attacks that exploit speculative CPU optimizations. For example, the kernel running inside the guest should take extra precautions to prevent any potential Spectre vulnerabilities associated with the above-mentioned host-controlled interfaces. The specification proposes using static analyzers like Smatch to identify potential attack surfaces. Nothing can replace manually inspecting identified lines of code, but this review time can be lessened by filtering against drivers that the guest kernel depends on.
In conclusion
Agreeing on a particular threat model is one of the most pressing challenges for confidential computing in the cloud, and that agreement will affect how confidential computing integrates with the larger kernel development community. For example, with regard to the efforts to harden drivers, some developers have argued that it would be easier to create confidential-computing-specific drivers than to retrofit the existing Linux drivers, which were not written with this threat model in mind. The fuzzing efforts conducted by those working on the Linux TDX software stack have already laid the groundwork for several patches, but these are still awaiting reviewer acceptance. Maintaining the hardening of the system, though, will require that the maintainers accept the model of what it is being hardened against.
An overview of single-purpose Linux distributions
Many people installing a Linux distribution for a single purpose, such as running containers, would prefer an install-and-forget type of deployment. At FOSDEM 2023 in Brussels, several projects offering this sort of minimal Linux distribution were presented. Fedora CoreOS, Ubuntu Core, openSUSE MicroOS, and Bottlerocket all tackle the problem in their own way, and the talks gave an interesting overview of how their approaches differ.
Fedora CoreOS
Akashdeep Dhar and Sumantro Mukherjee, who are both members of the Fedora Council and work at Red Hat as software engineers, explained how they use Fedora CoreOS as the base operating system to run multiplayer game servers in containers. As described in its documentation, Fedora CoreOS is "an automatically updating, minimal, monolithic, container-focused operating system".
Fedora CoreOS (sometimes abbreviated FCOS) provides the host operating system for these containers; it only includes those packages that are needed for a minimal networking-enabled and container-ready setup. At the time of this writing, the latest stable release had 415 packages. It supports the x86_64, aarch64 (including the Raspberry Pi 4), and s390x architectures; it runs on bare metal, virtualized, or on various cloud platforms.
A Fedora CoreOS machine is provisioned using Ignition, which is a tool that partitions disks, formats partitions, enables systemd units, and configures users. Ignition only runs once during the first boot of the system, from the initramfs. An Ignition configuration file is formatted as JSON, but for end users Fedora CoreOS recommends using a Butane configuration, which is a YAML file that Butane translates into an Ignition configuration. The "System Configuration" section in Fedora CoreOS's documentation shows some examples of how to configure storage, network, containers, users and groups, time zones, and more in a Butane configuration. In their talk, Dhar and Mukherjee showed a Butane configuration to set up a Minecraft server in a container, and they also published it in their GitHub repository.
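A minimal Butane configuration along these lines, with a hypothetical SSH key and a container run as a systemd unit (this is a sketch, not the speakers' exact file; the image name is illustrative), might look like:

```yaml
variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...   # hypothetical key
systemd:
  units:
    - name: game-server.service   # hypothetical unit name
      enabled: true
      contents: |
        [Unit]
        Description=Game server container
        After=network-online.target
        Wants=network-online.target

        [Service]
        ExecStart=/usr/bin/podman run --rm --name game -p 25565:25565 example.org/game-server:latest
        ExecStop=/usr/bin/podman stop game

        [Install]
        WantedBy=multi-user.target
```

Running this file through Butane (butane config.bu > config.ign) produces the Ignition JSON that the installer consumes.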
When installing Fedora CoreOS, you choose one of three update streams: "next" is for experimenting with new features, "testing" represents what is coming in the next stable release, and "stable" contains changes that have already spent time in the testing stream. Most users should choose the stable stream. You then point the installer at the Ignition file with your customizations; exactly how depends on the installation type. For instance, when installing from PXE, you append the coreos.inst.ignition_url=URL option to the kernel command line, referring to the location of the Ignition file on a web server.
After installation, the system is updated automatically when a new release is rolled out on the chosen stream. The Zincati agent checks for operating-system updates and applies them using rpm-ostree. Zincati can be configured as well; for example, one can configure how "wary" it is to update (that is, how early in the phased rollout cycle it receives updates) and how eager it is to reboot after applying an update (immediately or only within configured maintenance windows). If an update causes problems, the user is always able to manually roll back to the previous system state with:
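These tweaks go into drop-in TOML files under /etc/zincati/config.d/. A sketch (the file name and values are illustrative) that raises the rollout wariness and restricts reboots to a weekend maintenance window might look like:

```toml
# 55-update-strategy.toml (illustrative file name)
[identity]
# 0.0 = receive updates as early as possible, 1.0 = as late as possible
rollout_wariness = 0.5

[updates]
strategy = "periodic"

# Only reboot for updates on weekend nights
[[updates.periodic.window]]
days = [ "Sat", "Sun" ]
start_time = "02:00"
length_minutes = 120
```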
$ sudo rpm-ostree rollback -r
There are no dnf or yum commands in Fedora CoreOS. Extending the package set is done with rpm-ostree, which layers the packages on top of the current operating-system image. But, since Fedora CoreOS is a container-focused system, extra services would generally be installed as containers.
Ubuntu Core
Canonical's Valentin David talked about Ubuntu Core. According to the project's home page, it's "a secure, application-centric IoT OS for embedded devices". Ubuntu Core targets high-end embedded devices such as industrial computers for IoT gateways, signage, robotics, and automotive applications; at home it could be useful on a Raspberry Pi to run services such as Nextcloud or home-automation software. The distribution's software is based on Ubuntu's main operating-system builds, but without using deb packages or the dpkg and apt commands. Instead, it only uses snaps to install software. In essence, a snap package is a squashfs image with some metadata about how to install and run the software.
Snaps are isolated from other snaps and the underlying operating system. If a snap is run in strict confinement, it runs in a sandbox, making use of AppArmor, seccomp, and control groups. By default, snaps don't have access to resources outside of the sandbox, but they can get access to specific resources using interfaces.
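In a snap's snapcraft.yaml, an application declares the interfaces it wants as "plugs". An abbreviated sketch for a hypothetical strictly confined service (the snap and command names are made up; a real file also needs version, summary, and parts sections) could look like:

```yaml
name: my-service        # hypothetical snap name
base: core22
confinement: strict

apps:
  my-service:
    command: bin/my-service
    daemon: simple
    plugs:
      - network         # outbound network access
      - network-bind    # listen on sockets
```

The active connections can be inspected with "snap connections my-service"; interfaces that are not auto-connected can be wired up manually with "snap connect".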
David explained that there are five types of snaps in Ubuntu Core. The "gadget" snap contains device-specific or architecture-specific components such as the boot loader, device tree, board-specific packages, and configurations. The "kernel" snap comes with the Linux kernel, modules, firmware, and systemd stubs. The "base" snap contains the root file system for the Ubuntu Core operating system. The "snapd" snap has snapd, the daemon that installs and updates all snaps. And last but not least, each application is packaged in an application snap, which runs on top of the root file system provided by a base snap; it can also make services and commands available to the underlying operating system.
The gadget snap also describes the disk layout. Ubuntu Core typically has four partitions. On UEFI systems, the "seed" partition is the EFI System Partition (ESP), containing the configuration for the first-stage boot loader and at least one recovery system. The "boot" partition contains the second-stage boot loader, a kernel, and an initramfs that decrypts the "save" and "data" partitions. The latter two are LUKS2-encrypted. The save partition contains a backup of the device identity and other data to facilitate recovery, while the data partition stores the user and system data.
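This layout is declared in the gadget snap's gadget.yaml; a simplified sketch for a UEFI system (sizes are illustrative, and some required fields, such as partition-type identifiers, are omitted here) could look like:

```yaml
volumes:
  pc:
    schema: gpt
    bootloader: grub
    structure:
      - name: ubuntu-seed
        role: system-seed     # ESP with recovery systems
        filesystem: vfat
        size: 1200M
      - name: ubuntu-boot
        role: system-boot     # second-stage boot loader, kernel
        filesystem: ext4
        size: 750M
      - name: ubuntu-save
        role: system-save     # device identity backup (LUKS2)
        filesystem: ext4
        size: 16M
      - name: ubuntu-data
        role: system-data     # user and system data (LUKS2)
        filesystem: ext4
        size: 1G
```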
Most of the Ubuntu Core operating system is read-only. For instance, /etc and /var are read-only by default. However, specific paths, such as /etc/systemd, are bind-mounted from the data partition, which allows the system to add or change systemd unit files for services and to mount the snaps' squashfs images. Transactional updates are handled by snapd: if an update of a snap fails, the system automatically rolls back to the previous version of the snap.
openSUSE MicroOS
Ignaz Forster, research engineer at SUSE, described the design of openSUSE MicroOS. It's a rolling-release distribution based on openSUSE Tumbleweed, developed to run as a single-purpose system. A typical use would be hosting containers, but it can even be used to build a minimal desktop. As with Fedora CoreOS and Ubuntu Core, openSUSE MicroOS automatically updates itself and has a minimal package selection; in openSUSE MicroOS's case, the packages are just RPM packages from openSUSE's repositories. There's also an enterprise version, SUSE Linux Enterprise Micro, and a community version based on it, Leap Micro.
OpenSUSE MicroOS has a read-only root file system, using Btrfs. Transactional updates are handled by transactional-update, a SUSE-specific wrapper script around the zypper package manager. It creates a new Btrfs snapshot of the root file system and then performs the update inside it. If the installation succeeds, the script marks the new snapshot as the default; on errors, the snapshot is discarded and the previous one remains the default. A reboot activates the new snapshot. Forster announced that, since all of the read-only parts of openSUSE MicroOS have now been moved to /usr, the upcoming 4.2.0 release of transactional-update will also be able to apply new snapshots without rebooting. MicroOS runs a health-checker systemd service that checks whether the system boots as expected after an update; if the system isn't healthy, it starts an automatic rollback to the previous default snapshot of the root file system.
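For example, installing a package into a new snapshot and rebooting into it looks like this (a sketch; the package name is illustrative):

```
$ sudo transactional-update pkg install htop
$ sudo reboot
```

Should the new snapshot misbehave, "sudo transactional-update rollback" restores the previous default.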
The original transactional-update script has been evolving into a generic library for atomic system updates, libtukit. The current implementation only supports Btrfs with openSUSE's snapshot utility Snapper, as used in openSUSE MicroOS. But according to Forster, the API is developed to support other backends.
In contrast to Ubuntu Core, all of /var and /etc are writable, while /usr is read-only. For instance, the default system configuration is put in /usr. Only changes made by the administrative user are in /etc. OpenSUSE's libeconf merges the configuration files placed in several locations. Most of the default MicroOS packages have been changed to work with this. Only /etc/fstab does not follow this convention yet. Forster concluded that openSUSE MicroOS takes a pragmatic approach to use existing infrastructure and packages, and that it's "a functional read-only OS in an imperfect world".
Bottlerocket
While the previous three operating systems are derived from a general-purpose mother distribution, Bottlerocket was created by Amazon, tailored to host containers in its Amazon Web Services (AWS) cloud. In his talk, AWS software development engineer Sean McGinnis was quick to emphasize that the operating system is "backed by AWS, but not AWS-only". As an example, the project's GitHub repository has instructions to run it on bare-metal servers.
Bottlerocket was announced in March 2020 and made generally available in August 2020. To keep its footprint as small as possible, Amazon publishes variants for particular use cases. For instance, there's an aws-k8s variant with containerd and kubelet to run as a Kubernetes node on AWS, a vmware-k8s variant to do the same on VMware with Amazon Elastic Kubernetes Service (EKS), and a metal-k8s variant that supports Amazon EKS running on bare metal.
Bottlerocket runs two completely separate container runtimes: one runs host containers for operational tasks, while the other runs orchestrated containers, such as Kubernetes pods. The two runtimes have different security profiles.
Each container, be it a host container or an orchestrated container, runs an API client that talks over a Unix socket to an API server running on Bottlerocket. When Bottlerocket boots, its boot configuration (including user data) is loaded into the API server. User interaction also typically goes through this API to make runtime changes to the system configuration.
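The apiclient tool wraps this API on the command line; a sketch of a session (the motd value is illustrative) might look like:

```
$ apiclient get settings
$ apiclient set motd="managed via the Bottlerocket API"
$ apiclient update check
```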
One of the host containers is the "control" container, which is launched on boot. This container is used to configure the Bottlerocket host. Another host container is the "admin" container. This isn't launched by default: it should only be launched in exceptional circumstances to troubleshoot the host operating system. It has additional privileges and can use the root process namespace to access the other containers for troubleshooting purposes. The admin container runs an SSH server that is reachable through the host's primary network interface. A final type of host container is the bootstrap container: this bootstraps the host before services like Kubernetes or Docker start. It has additional permissions, for instance to provide access to the underlying host file system.
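Host containers are toggled through the same settings model. In the TOML user data passed at boot, enabling the admin container might look like this (a minimal sketch based on the documented settings.host-containers keys):

```toml
[settings.host-containers.admin]
enabled = true
```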
Security is one of the focal points of Bottlerocket. The root file system is read-only, and /etc is backed by a tmpfs file system that is regenerated on boot. For container images and volumes, a separate user partition is mounted. Moreover, there's no package manager, no shell, and no Python interpreter. "If an attacker is able to escape a container, there are not many tools to work with", McGinnis said.
To check the integrity of the block devices, Bottlerocket uses dm-verity. The kernel boots in lockdown mode, which prevents the root user from modifying the kernel. McGinnis explained that this increases assurance that the running kernel corresponds to the booted kernel. Another security feature he emphasized is that Bottlerocket runs with SELinux in enforcing mode.
For updates, Bottlerocket uses an image-based model. The kernel, system packages, and container-runtime packages are all stored inside an operating-system image. The first block device of the host has an active and an inactive partition. An upgraded image is downloaded to the inactive partition; upon reboot, the host boots into that partition, which is then made active. The previous Bottlerocket image remains in the now-inactive partition and can be rolled back to if required.
Conclusion
When looking at the different approaches of these single-purpose Linux distributions, it's clear that there's no one best way. Which one you choose depends on how they align with your goals and what tools you're comfortable with. Are you heavily invested in the API-first AWS or Kubernetes world? Then Bottlerocket seems to be the best fit. Do you prefer snaps to run your services? Then Ubuntu Core is a no-brainer. If you want to run containers on a host system without too much maintenance, then Fedora CoreOS or openSUSE MicroOS are for you. Whether they use rpm-ostree or Btrfs snapshots under the hood is probably less important when all of the workloads are running in containers anyway.
Page editor: Jonathan Corbet
