
Leading items

Welcome to the LWN.net Weekly Edition for July 27, 2023

This edition contains the following feature content:

  • A status update for U-Boot: Simon Glass on new features and the growing complexity of firmware.
  • A discussion on Linux in space: a panel on using Linux in aerospace systems.
  • Much ado about SBAT: a three-line patch leads to a long discussion about revocation.
  • Exceptions in BPF: a new mechanism aimed at in-program assertions.
  • Randomness for kmalloc(): hardening the slab allocator against heap-spraying attacks.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


A status update for U-Boot

By Jake Edge
July 26, 2023

EOSS

The U-Boot "universal boot loader" is used extensively in the embedded-Linux world. At the 2023 Embedded Open Source Summit (EOSS), Simon Glass gave a presentation (slides, YouTube video) on the status of the project, with a focus on new features added over the last several years. He also wanted to talk about complexity in the firmware world, which he believes is increasing, and how U-Boot can help manage that complexity. The talk was something of a grab bag of ideas and changes throughout the increasingly large footprint of the project.

Background

The idea behind U-Boot is its universality; "boot anything on anything", Glass said. As a project, it has been around for 20 years or so and is currently under active development with 6000 yearly commits. U-Boot is released on a regular schedule, four times per year.

It has around three million lines of C code, with some additional Python tools. U-Boot has a lot of similarities with the Linux kernel: it has the same code style, uses Kconfig for build-time configuration, and has some compatibility layers that allow porting Linux subsystems and drivers to U-Boot "without too much pain". U-Boot has a continuous-integration (CI) testing system that "covers a large subset of the features" of the boot loader.

U-Boot's popularity stems from its large set of features, but "it's also easy to hack"; it is not difficult to go in and add a new command or feature to the code base. It is a single-threaded application, which makes it easier to deal with because there is no locking needed; "it's not trying to be an operating system". The U-Boot developers are generally open to new ideas and features, which also helps.

Complexity

"Everything is getting more complex in this world", he said. The systems-on-chip (SoCs) that are being used in our devices have a lot more features and they need firmware to operate. The firmware needs to be packaged into an image, but that process is getting more complex as well. There are security and signing requirements for firmware these days, in order to be sure that the systems are using the correct firmware. Ten years ago, those kinds of problems were much less common.

[Simon Glass]

There is also a lot of diversity in the boot flow; systems are "now jumping through multiple projects, different binaries and so on, in order to get to an operating system". Beyond that, there is a proliferation of different device models that are all slightly distinct from each other, necessitating a different firmware image for each. It is not easy to deal with all of that diversity.

The U-Boot driver model deals with the SoC complexity "fairly well", Glass said. There can be a tree of devices, with devices being in different classes; there are relationships between the various devices "and devicetree is used to pull it all together".

One of the most complicated things to deal with is clocks; unlike in times past, the data sheets for boards do not even try to provide a clock diagram these days. But U-Boot makes a lot of that much easier with its infrastructure. "You can say 'please give me the MMC device'" and U-Boot will set up all of the clocks and pin multiplexing that is needed, turn on any power domains that are required, and return the device. "That's really really complicated to do manually", he said.
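In code, that request is a single driver-model call; the following is a minimal sketch using U-Boot's uclass_get_device() interface (the surrounding context and error handling are illustrative):

    #include <dm/uclass.h>

    struct udevice *dev;
    int ret;

    /* Ask for the first MMC device; probing it sets up the clocks,
       pin multiplexing, and power domains behind the scenes. */
    ret = uclass_get_device(UCLASS_MMC, 0, &dev);
    if (ret)
        return ret;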

The problem of multiple, slightly different, device models has traditionally required a different boot loader for each, but U-Boot can change its configuration at run time, so that only a single firmware image is needed. It does this using different devicetrees that can represent the various models. The devicetree for the specific model can be added into the boot image later, perhaps even as part of the manufacturing process, he said.

There are a lot of different ways to package up firmware; each person or project that does so probably has their own way. It does not seem difficult at the beginning, so some scripts are written, but eventually an entire build system evolves simply to package firmware. Binman is a tool that tries to solve this problem; it comes with U-Boot, but it is applicable to other projects, such as Zephyr, for example. It has a configuration file that describes all of the different pieces that need to be pulled into the firmware image and has support for various firmware formats so it can create the proper format for installation into the flash.

Binman solves a number of problems, he said. It allows easily changing the contents of the image, but it also provides documentation of what is present in the image. Instead of needing to decode the binary image using a variety of different tools, the binman configuration can be consulted. It also has mechanisms to fetch and build various types of tools that need to be run on bits of the firmware (e.g. for signing or vendor-specific formatting tasks).
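A binman image description lives in the devicetree; a minimal sketch (the entry names below are typical binman entry types, but this particular layout is invented for illustration) might look like:

    binman {
        filename = "flash.bin";

        u-boot-spl {
        };

        u-boot {
            /* U-Boot proper at a fixed offset within the image */
            offset = <0x20000>;
        };

        /* records the image layout so tools can decode it later */
        fdtmap {
        };
    };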

Standard boot

The "standard boot" feature is, as its name suggests, "an attempt to make booting more standard in U-Boot". The traditional U-Boot bootm command can be used to boot a variety of images in the Flattened Image Tree (FIT) format; it will handle things like signature verification and image decompression, but it lacks a way to figure out which images should be used and where they are located on the device. That is currently done using scripts. Over the last few years, the distro_bootcmd scripts that come with U-Boot have been used to provide the proper parts and pieces for handling the boot.

Standard boot adds a higher-level interface that will allow U-Boot to fully see what boot devices and images are available on the device; it can then offer a menu to users to choose between the options. A bootdev is a layer on top of a storage device (e.g. MMC) that can enumerate the bootable images it contains. A bootmeth can then enumerate the available bootflow files that describe a specific boot process for the system. The "distro" bootmeth looks through the filesystems that were discovered by the bootdevs for files named extlinux/extlinux.conf. Those files describe the bootflow for the particular distribution, such as Android, Fedora, Ubuntu, Chrome OS, and so on. It is a straightforward model, but can be a bit hard to wrap one's head around, he said, and U-Boot is still in the process of converting to it.
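The extlinux/extlinux.conf files themselves use the familiar syslinux-style syntax; a hypothetical example for a Fedora-like system (the paths and version numbers are invented for illustration) might read:

    default fedora
    menu title Boot menu

    label fedora
        kernel /vmlinuz-6.3.4
        initrd /initramfs-6.3.4.img
        fdtdir /dtbs-6.3.4/
        append root=UUID=1234-abcd ro quiet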

U-Boot has long had UEFI support and can boot distributions, such as Fedora and Ubuntu, directly via UEFI. It also supports UEFI Secure Boot and can handle things like using the Trusted Platform Module (TPM) for measurement and attestation of the boot process. U-Boot can also be run as an EFI application directly if desired.

There is an alternative to UEFI these days, though. Verified Boot for Embedded (VBE) is an effort to answer the question: "if we are not going to use UEFI, what would it look like?" He did not want to get into the details of VBE and recommended his talk from last year's Open Source Firmware Conference (OSFC) for more information. It is based on existing standards, such as the FIT format. Instead of having a bunch of callbacks from the boot image to U-Boot, as is done in UEFI Secure Boot, it provides the image with all that it needs up front, which is a simpler approach, he thinks.

The U-Boot project has moved to using Sphinx for its documentation over the past few years; there was an earlier U-Boot manual, but it "just sort of died" along the way. The good thing about the new documentation is that it lives in the source tree, so a patch that adds a new feature or command can contain the documentation (and, hopefully, tests) as well. Most of the existing documentation has been converted to reStructuredText at this point, but there are still commands and features that lack documentation; the hope is to close that gap before too long.

The testing and CI system for U-Boot is another area that has "expanded considerably in the last few years". Glass noted that there had been a Zephyr talk at EOSS about emulated devices for testing; U-Boot uses those as well, so you can test U-Boot from an emulated SPI flash, for example, on a Linux workstation. It allows writing all sorts of tests that can be run without the hardware; the emulators can be altered to make the "hardware" do things that need to be checked. Now, if someone sends a patch without a test, they can be asked to provide one and it is not a huge burden to do so.

Devicetree and beyond

U-Boot adopted devicetree in 2011 at roughly the same time that Linux did, which is a bit of a problem because the bindings (properties, nodes, etc.) are different between them. Those differences are being resolved over time. Another problem is that U-Boot has not been able to upstream its schema requirements to the kernel, though that is changing as well. The idea is that U-Boot can get its devicetree from Linux and, as long as it has all of the drivers for the hardware it needs from the devicetree, it can simply use it.

While U-Boot has used Kconfig for at least as long as Glass has been working on it (nearly ten years), it still had a lot of #define configuration statements scattered around in header files; that has all been cleaned up as of the beginning of the year. There is now a text file that describes the configuration, which means that there is a path toward having no board-specific config.h files at all; a board can be fully specified in the configuration text file.

The project has added link-time optimization (LTO) as well. "It makes your board take four times as long to build, but it is 5% smaller." It is generally a win, so it is turned on by default for most Arm boards. In addition, the events feature provides a way to spy on things like device creation; it is an alternative to using weak functions, which he is not a big fan of because it is hard to determine where they are being called from. An event gets published and other locations can indicate their interest in particular events, then process them when they occur; the spies can be easily listed using a tool, which provides more visibility.
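A spy is declared with a macro from U-Boot's include/event.h; the exact macro and event names have shifted between releases, so treat the following as a sketch of the idea rather than a current reference:

    #include <event.h>

    /* Called whenever the chosen event is notified */
    static int my_spy(void *ctx, struct event *event)
    {
        /* react to the event, e.g. inspect a newly created device */
        return 0;
    }
    EVENT_SPY(EVT_DM_POST_INIT, my_spy);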

Something that is going on behind the scenes is an effort to allow different parts of the firmware to hand off configuration data and other information to the next stage in the firmware. Currently, there can be multiple stages (including the verifying program loader (VPL), secondary program loader (SPL), trusted execution environment or trusted firmware, and U-Boot) that are separate projects, each with its own configuration mechanism. The Firmware Handoff specification is for a "simple, tagged data structure" that can be passed between these components.
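The structure described by the specification is deliberately minimal: a list of entries, each carrying a tag that identifies its contents, a length, and the data itself. Conceptually (with field names simplified from the specification):

    /* Conceptual shape of one firmware-handoff entry */
    struct handoff_entry {
        u32 tag;        /* what the payload is, e.g. a devicetree */
        u32 data_size;  /* payload length in bytes */
        u8  data[];     /* the payload itself */
    };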

There have "been quite a few changes" in the networking support for U-Boot, starting with the addition of TCP/IP; that means Wget can be used, "which is kind of handy". TFTP support has been around for a while, but it is much more limited. There is also IPv6 support now and a new PHY API. There are currently ongoing discussions about whether U-Boot should switch to using the lightweight TCP/IP stack, lwIP, or whether it should continue with its own implementation.

There is now support for 21 different RISC-V boards in U-Boot. Booting distributions on x86 is now just a few pending patches away from being fully supported. Using U-Boot as a coreboot payload on x86 has been enhanced. Coreboot can be used to do the initial boot of the system, setting up the hardware, including the ACPI tables, then jump into U-Boot to boot the operating system. That process is now "a little more polished".

Tracing can be used in U-Boot to track down boot-time bottlenecks. The tracer will record function entry and exit into a memory buffer that can be exported to another system for analysis. Tracing has been part of U-Boot for some time, but it was recently updated to use the trace-cmd interface.

He had mentioned that U-Boot is single-threaded, which makes it easier to work with, but it can be annoying too. If you do a USB scan, for example, you have to wait while the scan is being done, timeouts are being reached for ports, and so on; it can take several seconds if there are a lot of ports. It would be nice to have a way to start the scan process and to do other things, which is what the cyclic subsystem could help to provide.

Whenever U-Boot is idle, it jumps into the cyclic scheduler. The cyclic subsystem has been used to reset the watchdog timer for many years, but now it is available to other users in the system. He has a number of ideas for how it could be used (including for USB scanning), but looks forward to seeing what others come up with as well.
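Registering a cyclic callback is a single call; the signature has changed across U-Boot releases, so the following is only a sketch of the idea, with invented names:

    #include <cyclic.h>

    /* Invoked periodically whenever U-Boot is otherwise idle */
    static void usb_poll(void *ctx)
    {
        /* e.g. advance one step of a USB scan without blocking */
    }

    /* hypothetical registration: run every 100ms */
    cyclic_register(usb_poll, 100 * 1000, "usb-poll", NULL);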

A new, experimental feature is to add a GUI of sorts using expo menus. These menus are a set of screens, called scenes, that the user can navigate using the keyboard to set or change various parameters. They look and act much like simplified x86 BIOS menus.

Glass did a fairly fast-moving and somewhat hard-to-follow series of demos that were meant to show some of the different features he had mentioned in the talk. He started by showing the standard boot process in QEMU and then ran binman to display the contents of the boot image that he used. He also demonstrated how binman can be used to go out and fetch a missing tool (e.g. fiptool) so that it can be used to build the firmware image. Beyond that, he pushed a branch to GitLab and briefly poked around the CI system there, created a flame graph from the tracing output, and showed the expo GUI.

After that, he took a few audience questions; one of those was about priority in the standard boot. Is there a way to specify that certain bootdevs (or types of storage, such as removable or not) should have priority over others? Glass said that each bootdev type does have a priority that can be specified, but that there is no distinction based on whether the media is removable or not, so that would need to be added. The boot order can also be specified by the order that devices appear in an environment variable.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]


A discussion on Linux in space

By Jake Edge
July 25, 2023

EOSS

There was something of a space theme that pervaded the Embedded Linux Conference (ELC) portion of the 2023 Embedded Open Source Summit (EOSS), which is an umbrella event for various sub-conferences related to embedded open-source development. That may partly be because one of the organizers of EOSS (and ELC), Tim Bird, described himself as "a bit of a space junkie"; he made that observation during a panel session that he led on embedded Linux in space. Bird and four panelists discussed various aspects of the use of Linux in space-related systems, including where it has been used, the characteristics and challenges of aerospace deployments, certification of Linux for aerospace use, and more.

The panelists in the room were Lenka Třísková, a lecturer and researcher at the Technical University of Liberec (which is near Prague, Czechia, where the conference was held), who is also the main technical advisor for the Linux4Space project, and David VomLehn, who is the lead flight-software engineer at Astra. VomLehn gave a keynote about Linux in space (YouTube video) earlier in the conference. They were joined by two remote participants, both on the US west coast, thus joining the session at a pretty early hour for them: Steven VanderLeest, chief technologist for Boeing Linux, and Rob Bocchino, from the NASA Jet Propulsion Laboratory (JPL).

Projects

Bird started with what he hoped was "a softball question" about projects that the panelists had been involved with that used Linux for aerospace applications; was that work done for "production" deployments or simply research, he asked, and "were they successful?" He started with VomLehn, who quickly listed two rocket projects, two satellite projects (plus one where it sounds like only the military knows if it was successful), and a lunar lander. They all used Linux; the first series of satellites was successful, as was one of the rockets (he is awaiting the launch of the other). The second series of satellites was not successful, and the company behind the lunar lander unfortunately ran out of money before launch. Bird asked if any of the failures were Linux-related; "absolutely not", VomLehn said, the satellites all failed due to hardware problems.

[Bird, Třísková, VomLehn, VanderLeest, and Bocchino in Zoom]

VanderLeest said that his experience is "more on the air than the space side"; ten years ago he worked for a small company doing research on certifying Linux for safety-critical systems and for flight certification in particular. Now he is at Boeing, which has used Linux for both air and space applications. As far as he is aware, Linux has only been certified to level "D", which is one of the lower software assurance levels; he is currently working on a project to develop a Yocto-based Linux system for flight certification at the higher levels, up to and eventually including "A".

Bocchino said that he had worked on the ASTERIA low-earth-orbit satellite based on a 6U CubeSat, which was deployed in 2017 from the International Space Station (ISS). It ran Linux and was "a very successful mission". Its original mission was for exoplanet detection and it successfully demonstrated high-precision pointing of its space telescope; it is the smallest satellite to ever perform those maneuvers, he said. After the main mission was completed, the team was able to upload new software to the satellite to test out other capabilities.

While he was not directly involved with it, the Mars Ingenuity helicopter is another big Linux success story at JPL. He did help develop one of the components that is used by the helicopter; the F' (or F Prime) flight-software framework is used by ASTERIA and Ingenuity. Bird said that he had been watching Ingenuity with great interest; "it's like the little helicopter that could". Bird also pointed out that F' is open-source software (it is available under the Apache 2.0 license).

[Bird, Třísková, and VomLehn]

Třísková said that in early 2020 she did some consulting on the CubeSat instruments for the Hera Mission, which is an asteroid-investigation mission that will be launched by the European Space Agency (ESA) in 2024. The two CubeSats run Linux, but she noticed that there is no common Linux platform for them, which is why she became involved in Linux4Space, which is an effort to build that platform.

In his keynote, VomLehn had said that "space is hard", Bird said, which is something of a "standard mantra" in the industry; he asked the panelists what kinds of requirements there are for Linux in aerospace that make it particularly difficult. VomLehn said that generally there is both a mass budget and a power budget for a project; both of those have to be met. Radiation is an enormous problem in space; "I would love it if there were standard solutions for that, both hardware and software." Bandwidth is another problem area; the lunar-lander project he worked on had only 500bps of bandwidth available to it for parts of its mission. Even in low earth orbit, bandwidth is going to be a problem until satellites can plug into things like Starlink, he said.

Beyond the physical characteristics that these systems need to withstand (such as radiation, vibration, and temperature), there is a need to provide evidence that the system itself is reliable, VanderLeest said. Once you start adding human crew and passengers into the mix, the requirements for those assurances get quite rigorous. The system must be shown to be both reliable, meaning that it does do what it is meant to do, and safe, which means that it does not do what it is not meant to. Doing that is a costly process and every added feature means that the cost rises, so there is a focus on a small footprint of only the critical features for the system and mission.

Bird suggested that part of the reason that realtime operating systems (RTOS) have inertia for aerospace applications is due to the "provability of assurance". He had attended a flight-software workshop recently where people were talking about 100% test coverage for the lines of code; "that's somewhat shocking for a project that has 25 million lines of code". VanderLeest noted that, while Linux itself has that many lines of code, any given instance of it will only use a subset. The two largest pieces are drivers and architectures; only a limited number of drivers and a single architecture will be present in a running system. That may still add up to a million lines of code, though, "and that's a lot of lines to prove".

VomLehn noted that simply hitting the lines of code in testing is not necessarily sufficient; there is also "feature coverage" where you actually execute things in the sequences they would follow when "real commands" are being executed. VanderLeest bobbed his head in agreement. Bird said that "it's a hard problem", which both of the panelists acknowledged.

Why Linux?

That led Bird to ask: "Why are we here? Why should we be using Linux?" Třísková said that for the CubeSats, they had decided to use Linux for the non-mission-critical software, in part because of the amount of other software that can be used on Linux. For example, there is lots of code for doing data and image processing; in addition, the drivers for the different protocols used in space are all available for Linux. Beyond that, there are a lot of people who know and understand Linux, which makes it easier to hire people to work on the project. And "it's a lot of fun", she said.

Bird asked Bocchino why they chose Linux for ASTERIA. Like Třísková, Bocchino said that Linux is an easier environment to work in, with a familiar shell and set of tools. It turns out that if you are willing to tolerate some risk, the shell can be used during the mission. Both the ASTERIA and Ingenuity teams have used shell access in that way.

Bird raised the idea of doing AI processing on these systems, which was another topic from the workshop; he knows that it is fairly straightforward to get that kind of code running on Linux but wondered whether an RTOS has the services needed to do so. VanderLeest said that "Linux provides a collaborative environment for innovations, whether it's AI or anything else"; it gives access to "cutting-edge technology more quickly". In addition, when vulnerabilities are identified, they "tend to be fixed much quicker" in Linux, he said.

Another important thing to consider, VomLehn said, is that "there is just so much Linux out there"; in some sense, the functional coverage comes from the amount of use that Linux gets. In terms of assurance, "process is paramount when it comes to reliability"; the kernel-development process and, especially, the amount of varied testing it gets on lots of different platforms "really drives the reliability up". There are ways to mathematically prove the correctness of programs, which is not the approach the Linux community has taken, but it has come up with a different process that produces "demonstrably excellent results".

Realtime

Bird asked about the realtime requirements for aerospace and whether the realtime patch set for Linux would satisfy them; are there still concerns about whether Linux can hit realtime deadlines? Bocchino said that ASTERIA was using an earlier version of the kernel with the realtime patches, but he suspects that what he found then is still true today: "Linux is just never going to be a hard realtime OS." So if there are hard realtime deadlines in the system, which cannot be missed, "then don't use Linux for that".

On the other hand, soft realtime is often good enough, he said. If not, the system can be partitioned into a part that is running an RTOS and a part that is running Linux, as was done on the Mars helicopter. VomLehn said that he would "quibble a little bit about how hard 'hard' is in Linux realtime"; if you look at graphs from a well-tuned system, it "has pretty well-bounded latencies". A microcontroller with an RTOS has lower latencies, which is "nice if you are trying to prevent something like an explosion". But a lot of the other parts of the system, such as the guidance software, can tolerate an occasional missed deadline.

VomLehn stressed the importance of tuning the system properly; he said that using realtime Linux for things like controlling rockets is possible. VanderLeest said that a system is only realtime to him if he can prove deterministically that the worst-case latency is still within the requirements of the system. For many of the use cases he is looking at, those requirements are measured in milliseconds "and Linux can perform at that level".

Bocchino said that his experience had been that the worst-case latencies for realtime Linux "can often be pretty bad", though maybe that could be fixed with tuning. If you have an application that can "tolerate a missed deadline every now and then, then it works fine"; if you cannot miss any deadlines, though, his experience is that it would be hard to use Linux. Třísková agreed with that assessment; mission-critical pieces that are, say, preventing explosions are better done in an RTOS. There are plenty of other pieces that can benefit from the power and availability of software that Linux brings.

Cost

Bird wondered whether Linux provided an advantage in terms of development cost; his theory was that if there was less effort needed to work on the OS, that would allow for more time on the science and other aspects of the mission. VomLehn definitely agreed that Linux was an advantage due to "the sheer massive number of features" that come with Linux and its overall ecosystem. He is convinced that it is less costly to develop these systems using Linux.

As an example of the range of the ecosystem, VomLehn noted that there is a satellite out there running Python, which surprises a lot of people; its developers wanted to do as much processing on the satellite as possible because of severe bandwidth limitations for transmitting results. He has hired "a lot of sharp people" who are not from the aerospace industry, but who know Linux. In combination with other aerospace veterans, it works out quite well because of the capability of the Linux platform. Bocchino added that having developers who already know Linux makes it more cost-effective than trying to retrain them for another environment.

Bird then asked about certification, noting that he had seen requirements for a design-driven process that is quite different from what Linux uses; "does that mean it's impossible to certify Linux for certain space applications?" VanderLeest said that he definitely did not believe it was impossible; there are already examples of Linux being certified at fairly low assurance levels. Going further is "a very hard problem", but he and his team do not think it is impossible.

The certifications are based around a process where requirements are described and a design is created before code is written, which is decidedly different than the approach Linux takes. But there are guidelines for reverse-engineering the requirements and design; "there is a design to Linux, it's just emergent". There are also implied requirements for Linux. VanderLeest and his team think that can be successful with that approach "and we're driving towards that".

VomLehn pointed out that there is a large company that is launching people into space using a Linux-based system. Since that company, SpaceX, is a competitor of his, he did not want to say the name, though Bird filled it in with a chuckle. The SpaceX system is two-tiered, with a low-level microcontroller and RTOS tier and a Linux executive above that. VomLehn's company and others are using Linux as well, and SpaceX has gotten a crew rating for its systems, which is the highest level of assurance. It took years of work with NASA to get there, but it does show that it can be done.

[Tim Bird's alter ego]

The use of commercial off-the-shelf (COTS) products in space-based systems was next up. Bird noted that an earlier CubeSat talk (YouTube video) mentioned the use of COTS parts that were subjected to a higher level of testing; he was perhaps a bit surprised at that approach given that space missions cost so much already. But VomLehn pointed out that an FPGA that costs a few dollars may cost $30,000 for a space-rated version, so the cost difference can be considerable.

VanderLeest said that, 25 years ago or so, there was a shift in the thinking about processors; instead of designing special-purpose space processors that had redundancy and fault-tolerance built-in, the industry switched to standard processor designs and added the fault-tolerance at the system level. The purpose-built processors were hard to keep up-to-date with the latest innovations elsewhere, which he likened to the shift to Linux being seen today. By switching to a more commodity OS, the aerospace applications get the updates and innovations that continuously come from the community.

Bird asked for questions from the audience; the first (and only, as it turned out) was not that, exactly, but more of a lengthy comment from realtime Linux developer John Ogness. He wanted to clarify a misconception that he had already heard a few times at the conference and in this panel session: "The realtime Linux team is committed to hard realtime". They are not at all aiming for soft realtime, "so a single missed deadline is a completely failed system". If there are systems where the realtime patch set is not providing that, it is likely either a misconfiguration or a bug in the user-space code, Ogness said. In any case, the realtime team is looking for feedback on any problems that people are encountering, so he encouraged the space-Linux community to get in touch if the realtime patches were not meeting its goals. That was met with approval (thumbs up) from several panel members and a round of audience applause.

As time ran out on the session, Bird asked his final question; he wondered what impact the open-source licenses had on the use of open source in aerospace. In particular, were there things that the industry did in order to work with the GPL as the license of Linux? VomLehn said that the industry generally does not sell rockets, but sells "launch services" instead; that means that there is no real distribution of the code. He said that Astra employees are encouraged to work with upstream communities and there are parts of his kernel work that he "would be delighted to be able to give back"; he hopes that can happen eventually. There are various legal restrictions on what the industry can say and do, but he thinks there is still room for those in the industry to be good open-source community members.

A YouTube video of the panel session is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]


Much ado about SBAT

By Jonathan Corbet
July 20, 2023
Sometimes, the shortest patches lead to the longest threads; for a case in point, see this three-line change posted by Emanuele Giuseppe Esposito. The purpose of this change is to improve the security of locked-down systems by adding a "revocation number" to the kernel image. But, as the discussion revealed, both the cost and the value of this feature are seen differently across the kernel-development community.

The patch in question adds three lines to the kernel's x86 makefile, the end result of which is to add a new ELF section to the kernel executable, called ".sbat", containing these two lines of text:

    sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
    linux,1,The Linux Developers,linux,$(KERNELVERSION),https://linux.org

Where $(KERNELVERSION) is replaced with the current kernel version number. The first line describes the section as conforming to version 1 of the UEFI Secure Boot Advanced Targeting (SBAT) specification. The second line is the one that matters; it identifies the executable component as "linux", with version number 1 (also called the "generation number"); the rest of the line is just documentation.

The purpose of the SBAT entry is to give the bootloader a way to know whether the given executable is safe to run. Since we are dealing with secure boot, that executable should already be signed by a key known to the system, but it is possible that a security vulnerability has been found in that binary since it was signed. To protect against that possibility, the bootloader will contain a minimum acceptable generation number; if the generation found in the SBAT section is below that number, the bootloader will refuse to boot it. Whenever a vulnerability that will lead to a secure-boot compromise is fixed, the minimum generation will be incremented, with the effect of immediately blocking the loading of any vulnerable versions of the code.
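The bootloader-side logic is deliberately simple; conceptually, it amounts to something like this sketch (the names are invented for illustration, not taken from shim or GRUB):

    /* One parsed line from an image's .sbat section */
    struct sbat_entry {
        const char *component;    /* e.g. "linux" */
        unsigned int generation;  /* the image's generation number */
    };

    /* Refuse to boot images older than the platform's minimum */
    static bool sbat_permits(const struct sbat_entry *entry,
                             unsigned int min_generation)
    {
        return entry->generation >= min_generation;
    }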

The SBAT idea is not new; it has been supported by the GRUB bootloader since the 2.06 release. The developers of the unified kernel image concept are implementing ways to load a kernel directly, without using a separate bootloader; as a part of this work, they want to add SBAT checks for the kernel image itself. The addition of an SBAT version number to the build is a necessary part of this scheme.

The secure boot mechanism already contains ways to prevent the loading of old, vulnerable images through explicit revocation. There is just one little problem: given the diversity of the software used in the Linux world — and the large number of releases of each — the revocation lists quickly grow larger than the low-level firmware implementing secure boot can handle. SBAT is an attempt to overcome that problem by creating a single number that can be used to determine whether a given image is safe to boot.

Generation-number management

Much of the above was not really explained in the patch posting, which led to a number of questions on the list. Other signs of inattention (the inclusion of a linux.org URL, for example) also raised red flags. It is fair to say that this idea did not get the reception that its backers were hoping for.

At its core, the disagreement was over the management of that version number: how often would it change, and who would be responsible for deciding when it should be changed? Greg Kroah-Hartman suggested that the kernel's version number, which is already included in the image, could be used instead, but Luca Boccassi replied that the version number would not work, since there are too many of them for the system to track. In other words, kernel-version numbers present the same problem that revocation lists do — the problem that SBAT was developed to work around.

Kroah-Hartman later asked a number of questions about how the decision to update the generation number would be made, who would make it, who would change the number in the kernel, and so on. He suggested looking at past kernel history to come up with an idea of how many times that number would have changed over time. Boccassi's answers were not seen as entirely satisfactory; he said that the decision would be made by: "most likely those who understand the problem space". He suggested that the kernel project makes "3 releases a year", and that generation-number changes would be no more frequent than that. Kroah-Hartman replied that he does "a release or two a week across multiple stable kernel versions", rather more than three per year, and repeated his process questions.

Boccassi called the stable releases "irrelevant for the case at hand" and said that the process questions didn't matter: "the question here was about mechanism and storage. And it already works btw, it's just the kernel that's lagging behind, as usual". Kroah-Hartman was not moved:

To think that "let's add a security canary to the kernel image" is anything other than a process issue shows a lack of understanding about exactly how the kernel is released, how the existing kernel security response team works, and who does any of this work. To ignore that means that there is no way in the world this can ever be accepted.

Daniel P. Berrangé did make an attempt to address the process-related questions, saying that generation-number increments would be tied to CVE numbers and would likely be infrequent. He also acknowledged that there are some open questions with regard to how backports to the stable releases would be handled. Kroah-Hartman responded that most security-relevant kernel bugs are never assigned CVE numbers, and that he knows of many examples of bugs that could be used to break secure boot. He also admonished: "as the person running the stable releases, you BETTER be working with me to try to figure this all out".

One might not think that the management of a simple number would be so hard. But the question of when it should be incremented is not trivial. As Kroah-Hartman and others pointed out, the kernel project is fixing bugs that may have security implications almost every day. Nobody knows how often exactly, because monitoring the patch stream for possible security issues is a task that nobody has the resources to keep up with for any extended period of time. It is probably safe to say, though, that almost every mainline kernel release has at least one fix for a bug that could be used to attack secure boot. So perhaps the generation number would simply need to increment for each release.

There is a worse problem, though, in that almost nobody runs mainline releases; instead, most users are running kernels derived from the stable updates. It is far from clear how the mainline generation-number updates should be backported to the stable releases, which happen much more frequently than mainline releases. Each stable release may have a subset of the fixes that were identified as needing generation-number increments in the mainline; how should the generation number be calculated in such cases? If a given fix is not applicable to a specific kernel release, should that number be incremented anyway — thus causing older binaries to fail to boot, even though they lack the vulnerability in question?

Letting distributors do it

For these reasons and more, it was occasionally suggested that, if such a generation number is to be a part of a kernel build, it should be created and managed by the distributors who are building the kernels. As Ard Biesheuvel put it:

Therefore, I don't think it makes sense for the upstream kernel source to carry a revocation index. It is ultimately up to the owner of the signing key to decide which value gets signed along with the image, and this is fundamentally part of the configure/build/release workflow. No distro builds and signs the upstream sources unmodified, so each signed release is a fork anyway, making an upstream revocation index almost meaningless.

Boccassi's reply, after describing the linux-kernel list as "an open sewer", dismissed this idea as unworkable:

The 'owner of the signing key' is not good enough, because there are many of those - as you know, the kernel is signed by each distro. But the key here is that the revocation is _global_ (again: global means it applies to everyone using shim signed by 3rd party CA), so each distro storing their own id defeats the purpose of that.

If this global number is not stored in the mainline kernel source, he said, somebody would have to maintain an external registry to somehow map generation numbers to points in the kernel's development history.

Paolo Bonzini, though, thought that even a distributor-managed generation number is unworkable: "I'm quite positive that a revocation index attached to the kernel image cannot really work as a concept, not even if it is managed by the distro". That led to another missive from Boccassi stating that the mechanism has been shown to work elsewhere, and that "the kernel is not special in any way":

The only thing that matters is if, given a bug, somebody either observed it being used as a secure boot bypass by bad actors in the wild, or was bothered enough to write down a self-contained, real and fully working proof of concept for secure boot bypass. If yes, then somebody will send the one-liner to bump the generation id, and a new sbat policy will be deployed. If no, then most likely nobody will care, and that's fine, and I expect that's what will happen most of the time.

It is not clear that this approach will satisfy the developers who see the whole mechanism as a sort of security theater.

As of this writing, the discussion appears to be at an impasse, with little mutual understanding between the participants. The proponents of the SBAT mechanism see a way of addressing their revocation problems that only needs an occasional one-line kernel patch to maintain. Longtime kernel developers, though, see a raft of unresolved process issues and strongly doubt that a single integer value can describe the security status of the huge variety of kernels in the wild. The kernel is more complicated than that, as is the security environment it operates in; any sort of global revocation mechanism may have to be as well.


Exceptions in BPF

By Jonathan Corbet
July 21, 2023
The BPF virtual machine in the kernel has been steadily gaining new features for years, many of which add capabilities that C programmers do not ordinarily have. So, from one point of view, it was only a matter of time before BPF gained support for exceptions. As it turns out, though, this "exceptions" feature is aimed at a specific use case, and its use in most programs will be truly exceptional.

Kumar Kartikeya Dwivedi posted the BPF exceptions patch set on July 13. The API presented to BPF programs is simple, taking the form of two kfuncs. To raise an exception, a BPF program can call:

    void bpf_throw(u64 cookie);

A call to bpf_throw() will cause the program's entire call stack to be unwound, and the program will return to its caller with (by default) a return status of zero; the cookie value is ignored. There is no way for a program to catch an exception thrown further down the call stack. It is, however, possible to define a function to be called after the call stack has been unwound, but before control is returned to the caller:

    void bpf_set_exception_callback(int (*callback)(u64));

The given callback() will be called once unwinding is complete, and will be passed the cookie value given to bpf_throw(); its return value will then be returned to the original caller of the BPF program. There can be only one bpf_set_exception_callback() call in a program; once the callback is set, it cannot be changed.
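Put together, a program using these kfuncs might look like the following sketch; the "impossible" condition and the constant are invented for illustration, and the usual vmlinux.h/libbpf headers plus the kfunc declarations are assumed:

    #define MAX_EXPECTED_LEN 2048    /* invented for illustration */

    /* Runs after the stack is unwound; its return value goes to
       the BPF program's original caller */
    static int on_exception(u64 cookie)
    {
        return -1;
    }

    SEC("tc")
    int handle_packet(struct __sk_buff *skb)
    {
        bpf_set_exception_callback(on_exception);

        if (skb->len > MAX_EXPECTED_LEN)    /* "cannot happen" */
            bpf_throw(0xbad);

        /* ... normal processing ... */
        return 0;
    }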

One might thus be forgiven for thinking that this exception mechanism does not look like it does in other languages supporting the feature, and that bpf_throw() might better be spelled exit(). It clearly is not meant to allow BPF programs to catch and respond to unusual situations. The use case for exceptions, as it turns out, is different and unique to BPF.

BPF programs must, famously, convince the kernel's verifier that they are safe to run before they can be successfully loaded. Doing so requires handling every possible case — even cases that the programmer knows can never happen, but which the verifier is less certain about. So, for example, a BPF function far down the call stack might have to check that an integer value is within a given range, even though the developer knows that it must be, because the verifier does not know that. The check must do something reasonable in response to an out-of-bounds value and, perhaps, return a failure status all the way back up the call chain, all for a case that can never happen.

And, as we all know, developers are never wrong about cases that can never happen.

As Dwivedi described, exceptions are intended to address this problem:

The primary requirement was for implementing assertions within a program, which when untrue still ensure that the program terminates safely. Typically this would require the user to handle the other case, freeing any resources, and returning from a possibly deep callchain back to the kernel. Testing a condition can be used to update the verifier's knowledge about a particular register.

So, in other words, the real reason for exceptions is to provide a mechanism by which the verifier can be informed of invariants that the developer knows about while having an emergency exit mechanism for those times when the developer is wrong. There is a set of assertion macros provided to make this feature easily available in BPF programs. So, for example, a developer will be able to write:

    bpf_assert_lt(foo, 256);

This assertion will perform the indicated test and, should it fail, make a call to bpf_throw(). Meanwhile, the verifier will be able to use the knowledge that foo is, indeed, less than 256 as it evaluates the subsequent code.
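Conceptually, each assertion macro just pairs a test with a throw; a simplified sketch follows (the real macros in the patch set are more careful about keeping the comparison visible to the verifier):

    #define bpf_assert_lt(x, y)        \
        do {                           \
            if (!((x) < (y)))          \
                bpf_throw(0);          \
        } while (0)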

There is one notable problem still, as described in the changelog to this patch in the series:

For now, bpf_throw invocation fails when lingering resources or locks exist in that path of the program. In a future followup, bpf_throw will be extended to perform frame-by-frame unwinding to release lingering resources for each stack frame, removing this limitation.

Given that the verifier is now counting on bpf_throw() to prevent execution from proceeding past a failed assertion, this seems like a significant limitation indeed. It could probably be used by a sufficiently malicious developer to convince the verifier to accept a program that does something unpleasant. That suggests that implementing the frame-by-frame unwinding will be a prerequisite to getting this work merged.

Both BPF and Rust are intended to make kernel programming safer, but they take a different approach to the problem. A Rust program will, by default, panic if any of a large number of things goes wrong. BPF programs, instead, are intended to be verified as simply lacking that sort of wrong behavior before they are ever allowed to execute. BPF exceptions can be seen as an admission that the "prove correctness before loading" approach has its limits, and that sometimes it is necessary to just throw up your hands and bail out.


Randomness for kmalloc()

By Jonathan Corbet
July 24, 2023
The kernel's address-space layout randomization is intended to make life harder for attackers by changing the placement of kernel text and data at each boot. With this randomization, an attacker cannot know ahead of time where a vulnerable target will be found on any given system. There are techniques, though, that can be effective without knowing precisely where a given object is stored. As a way of hardening systems against such attacks, the kernel will be gaining yet another form of randomization.

"Heap spraying" attempts to fill the target system's heap with known data; it generally works by allocating large amounts of memory and filling it with the data of interest. A successful attack can fill much of the heap with a known pattern. If the target system can then be convinced to dereference an invalid pointer into the heap, chances are good that the access will land on attacker-controlled data.

As an example, consider a use-after-free vulnerability. Once an object has been freed, it can be allocated by the attacker and overwritten. Then it is just a matter of waiting for the targeted code to use its (now invalid) pointer, and the game is over; a vulnerability that simply let an attacker allocate and write some memory has been escalated into something rather more severe. Needless to say, attackers find that kind of capability attractive; heap spraying has been used in a number of successful exploits.

The heap, in the kernel, is a bit of a nebulous concept; one could think of it as comprising almost all of the physical memory in the system. But much of the memory management within the kernel is handled by the slab allocator. In theory, there are hundreds of independent slabs in the Linux kernel; /proc/slabinfo shows nearly 300 on your editor's system. This number reflects a large number of in-kernel users of slabs — subsystems that each need to allocate objects of different sizes. That separation should help to thwart heap-spraying attacks, since the ability to spray one slab should leave the others unaffected. An attacker cannot spray a heap that they cannot allocate from.

In practice, though, there is not as much isolation between users as it might seem. Calls to kmalloc(), which are how much of the memory in the kernel is allocated, all share a set of common slabs. There are tens of thousands of calls to kmalloc() (and variants like kzalloc()) in the kernel, so memory obtained that way is a fairly obvious target for spraying attacks. The slab allocator will also often merge slabs containing objects of similar sizes, again putting multiple users into the same "heaps". In summary, the Linux kernel, too, can be vulnerable to heap-spraying attacks.

As Ruiqi Gong notes in this patch, separating those slab users to thwart spraying attacks is not a practical alternative. All of those kmalloc() users are not going to change, and turning off slab merging entirely would have a performance cost. Instead, a lot of protection could, in theory, be had if all of those kmalloc() users were somehow separated from each other, at least to a degree.

The kmalloc() allocator uses a set of a dozen or so slabs for objects of different sizes; on your editor's system, the smallest is for eight-byte objects, while the largest is for 8KB objects, with the size (usually) doubling from one slab to the next. When kmalloc() is called, the requested size is rounded up to the nearest slab size, and the allocation is made from that slab. So, for example, a 36-byte request will result in the allocation of 64 bytes of memory. Just to make things more complicated, there are actually four sets of slabs as described here: one for "normal" allocations, one for memory to be used for DMA, one for memory marked as reclaimable (allocated with __GFP_RECLAIMABLE), and one for non-reclaimable memory charged to control groups.

Gong's patch set adds another dimension to this matrix by adding another 15 slabs for each size — but only for "normal" allocations. Whenever a kmalloc() call falls into the normal category (which is most of the time), one of the 16 slabs for the appropriate size will be chosen at random, and the allocation will be made from that slab. In this way, it is hoped, any memory that can be sprayed by an attacker will be separated from the memory used by the vulnerable code that is under attack.

To raise the chances that things turn out that way, some thought has gone into the selection of the random slab for the allocations. There are two values that are used in this calculation: a random seed generated at boot time and the address from which kmalloc() was called. As a result, any given kmalloc() call site will always allocate from the same slab, but the specific slab will vary from one boot to the next. So, for an attacker, it is not just a matter of performing more allocations to spray all of the relevant slabs; instead, a call that hits the correct slab for the current boot cycle must be found. It is not an absolute defense, but splitting the slabs in this way raises the bar for a successful attack considerably.
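In other words, the slab choice is a deterministic function of the call site, keyed by a per-boot secret; conceptually it is something like this sketch (the names are invented, and the mainline code differs in its details):

    /* Pick one of the 16 "normal" kmalloc slab sets for this call
       site: hash the caller's address with a boot-time random seed
       to get a 4-bit index (0-15). */
    unsigned int index = hash_64(caller_address ^ per_boot_seed,
                                 ilog2(16));
    slab = normal_kmalloc_slabs[index][size_index]; /* hypothetical */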

Benchmarks included with the patch show a small performance overhead and a bit of increased memory use when this feature is enabled; the cost seems low enough that it would not be noticed by most users. In response to a previous post, Kees Cook had remarked that it provided "a nice balance" between the various options that are available for hardening against heap-spraying attacks. The fifth revision, posted on July 14, was quickly applied by slab maintainer Vlastimil Babka; this work is now in linux-next and appears set to enter the mainline during the 6.6 merge window.


Page editor: Jonathan Corbet