LWN.net Weekly Edition for October 19, 2023
Welcome to the LWN.net Weekly Edition for October 19, 2023
This edition contains the following feature content:
- Improving C-library scalability with restartable sequences: the rseq() system call is perhaps one of the kernel's more obscure features, but it may be able to improve the scalability of the GNU C Library in a number of ways.
- Recent improvements in GCC diagnostics: better diagrams and more from the GCC analyzer.
- Finer-grained BPF tokens: another try at making it possible to hand out limited BPF capabilities.
- Defining open hardware: what "open hardware" really means.
- The 2023 Image-Based Linux Summit: a detailed report on this effort to rethink how distributions are built.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Improving C-library scalability with restartable sequences
The Linux kernel has supported restartable sequences (sometimes referred to as "RSEQ") since 2018, but it remains a bit of a niche feature, mostly useful to performance-oriented developers who do not mind writing assembly code. According to Mathieu Desnoyers, the developer behind the kernel's implementation of restartable sequences, this feature can be applicable to a much wider range of performance-sensitive code with proper library support. He came to the 2023 GNU Tools Cauldron to present the case for use of restartable sequences within the GNU C Library (glibc).
There are, he began, a number of approaches that are used to improve the scalability of user-space code; most of them revolve around partitioning the workload in one way or another. Use of thread-local storage to minimize contention for shared data is one example. Applications can also use read-copy-update, hazard pointers, or reference counting (which works best in the absence of frequent changes, he said). Another approach is per-CPU data structures; they are heavily used in the kernel, he said, but can be made to work in user space as well. The kernel can rely on techniques like disabling preemption to guarantee exclusive access to a per-CPU data structure, but user space has no such luxury. That is where restartable sequences can help.
The best approach for any given situation will depend heavily on the
workload and its data-access patterns. The choice should be based on
metrics — benchmarks, profiles, tracing, and the like — that clearly show
where the scalability bottlenecks lie and how a given technique improves
the situation. He stressed that the ideas he was presenting did not have
such a firm foundation; they were mostly based on code review, and would
need to prove their value before any work in that direction goes far.
Restartable sequences, he said, were added to the kernel in the 4.18 release. The feature has been used within glibc to implement sched_getcpu(), since it makes the current CPU number available to every thread. Code using restartable sequences starts by sharing a structure indicating an address range delineating a critical section, along with an abort address, with the kernel. In normal execution, the code in the critical section will run and commit its work with a single atomic store instruction at the end. If the thread is preempted while executing the critical section, though, it will have lost its exclusive access to the per-CPU data structure it was working with and cannot safely continue; in that case, it will be made to jump to the abort address, from which it can restart the operation.
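The shape of that handshake is visible in the kernel's uapi header. The following is a minimal sketch, assuming <linux/rseq.h> provides struct rseq and struct rseq_cs (with start_ip, post_commit_offset, and abort_ip fields); the signature constant is an arbitrary example, and on glibc 2.35 or later the library performs this registration itself, so a direct call like this would fail with EBUSY.

    /* Minimal sketch of rseq registration; illustrative only. */
    #include <linux/rseq.h>      /* struct rseq, struct rseq_cs */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <errno.h>

    #define RSEQ_SIG 0x53053053  /* must match the signature placed before the abort handler */

    /* One area per thread, shared with the kernel; cpu_id is kept current,
     * and rseq_cs points at the struct rseq_cs describing the critical
     * section (its address range and abort address) while one is active. */
    static __thread struct rseq rseq_area;

    static int rseq_register(void)
    {
        if (syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, RSEQ_SIG) < 0)
            return -errno;       /* EBUSY if glibc has already registered */
        return 0;
    }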
The 6.3 kernel added a NUMA ID field to the data shared between threads and the kernel, allowing glibc's implementation of getcpu() to be optimized. This release also added the concept of per-memory-map concurrency using virtual CPU numbers, allowing more efficient use of per-CPU data structures. A new, extensible API was added as well, but support for it has not yet landed in glibc; that will need to happen soon, he said, since there is little room left for expansion in the older API.
The librseq library is being developed to make it easier for developers to take advantage of restartable sequences. It implements a number of common data structures, including per-CPU counters, linked lists, and spinlocks, along with a number of low-level atomic operations. There is support for several architectures. This library is still in an early stage of development; there has been no proper release yet.
The implementation of per-CPU counters follows the usual pattern: a separate counter is maintained on each CPU and can be incremented without contention. The total value of the counter is obtained by adding up each of the per-CPU values. This algorithm is good, he said, in situations where counters are frequently updated but seldom read. There is, however, the problem of how to access "remote data" (counters on CPUs other than the current one) safely to get a precise sum or make changes to the counter array as a whole.
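As a rough sketch of that pattern (the names and fixed sizing here are invented, not librseq's actual API), such a counter might look like the following, with the fast-path increment being the single committing store of an rseq critical section:

    #include <stdint.h>

    #define NR_SLOTS 256   /* illustration only; a real library sizes this from the CPU count */

    struct percpu_counter {
        /* one padded slot per (virtual) CPU to avoid false sharing */
        struct { int64_t v; char pad[56]; } slot[NR_SLOTS];
    };

    /* Fast path: add to the current CPU's slot without atomics or locks.
     * In real code this runs inside an rseq critical section, so a thread
     * that is preempted restarts instead of racing on the same slot. */
    static inline void counter_add(struct percpu_counter *c, int cpu, int64_t n)
    {
        c->slot[cpu].v += n;
    }

    /* Slow path: sum all slots; precise only if concurrent updaters are
     * quiesced first, which is where the RSEQ fence described next comes in. */
    static inline int64_t counter_sum(const struct percpu_counter *c)
    {
        int64_t sum = 0;

        for (int i = 0; i < NR_SLOTS; i++)
            sum += c->slot[i].v;
        return sum;
    }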
The answer turns out to be an extension to the expedited membarrier() system call. The MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ option was added to the 5.10 kernel; it executes a memory barrier but also aborts any currently running RSEQ critical sections; Desnoyers called this operation an "RSEQ fence". For a data structure like a set of per-CPU counters, a manager thread can replace the entire structure, then issue the RSEQ fence. After that, no threads will be working with the older structure, and the manager will have exclusive access to it.
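A sketch of that manager-side pattern might look like this; the structure being swapped is the invented per-CPU counter from the sketch above, and the process is assumed to have registered the expedited-RSEQ membarrier command beforehand:

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    /* One-time setup: register for the expedited-RSEQ membarrier command. */
    static int rseq_fence_setup(void)
    {
        return syscall(__NR_membarrier,
                       MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
    }

    /* The "RSEQ fence": aborts any rseq critical section currently running
     * in this process, on any CPU. */
    static int rseq_fence(void)
    {
        return syscall(__NR_membarrier,
                       MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);
    }

    struct percpu_counter;                         /* from the previous sketch */
    static _Atomic(struct percpu_counter *) counters;

    /* Manager thread: publish a new structure, then fence; afterward no
     * critical section can still be using the old one, so the manager has
     * exclusive access to it. */
    static struct percpu_counter *counters_replace(struct percpu_counter *new_set)
    {
        struct percpu_counter *old = atomic_exchange(&counters, new_set);

        rseq_fence();
        return old;
    }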
There is an implementation of a per-CPU spinlock that is intended to be fast when accessed only by the local CPU; it can be taken remotely in a slower path. Each CPU's spinlock contains a bit describing how the lock is to be acquired; most of the time the bit is clear, indicating that no thread is trying to gain remote access. In this case, a thread can obtain the lock by reading its value, checking that it is free, and claiming it by setting its value to locked — all within an RSEQ critical section, of course. When a remote thread needs to take the lock, it begins by setting the remote-access bit in the lock, followed by an RSEQ fence. When that bit is set, an atomic compare-and-swap is required to obtain the lock; at that point, the local and remote thread can contend for it in the traditional (slower) way.
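A sketch of that lock, again with invented names and with the rseq mechanics reduced to comments (the fast-path load and store would sit inside an rseq critical section, and rseq_fence() is the helper from the previous sketch):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define LOCK_HELD   0x1u
    #define LOCK_REMOTE 0x2u   /* a remote thread wants access */

    struct percpu_lock {
        _Atomic unsigned int state;
    };

    extern int rseq_fence(void);   /* see the membarrier sketch above */

    /* Local fast path: with the remote bit clear, a load, a check, and a
     * single committing store are enough; preemption restarts the rseq
     * critical section rather than letting another thread race on this CPU. */
    static bool percpu_lock_try_local(struct percpu_lock *l)
    {
        unsigned int s = atomic_load_explicit(&l->state, memory_order_relaxed);

        if (s != 0)
            return false;          /* held, or remote access requested */
        atomic_store_explicit(&l->state, LOCK_HELD, memory_order_relaxed);
        return true;
    }

    /* Remote slow path: announce the remote access, fence out any fast path
     * still in flight, then contend with compare-and-swap as a normal
     * spinlock would. */
    static void percpu_lock_remote(struct percpu_lock *l)
    {
        unsigned int expected = LOCK_REMOTE;

        atomic_fetch_or(&l->state, LOCK_REMOTE);
        rseq_fence();
        while (!atomic_compare_exchange_weak_explicit(&l->state, &expected,
                                                      LOCK_REMOTE | LOCK_HELD,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
            expected = LOCK_REMOTE;   /* spin until the lock is free again */
    }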
Memory allocation was one of the original use cases for restartable sequences. One approach is to use per-CPU free lists, indexed with the virtual CPU number. Additions to and removals from the list can be done with fast push and pop operations within an RSEQ critical section. If a slow path is needed for some operations, an RSEQ fence or per-CPU spinlock can be used to gain the needed access.
There are a number of locks within glibc, he said, that could be optimized with either RSEQ or RCU; these include the dynamic loader lock, the dynamic loader stack cache lock, the default pthread attribute lock, gettext locks, and the timezone lock. The last one is acquired frequently, since it is needed by functions like localtime(), but the time zone changes infrequently (if at all). An RSEQ-based update mechanism could help here, he said.
POSIX condition variables are a harder problem; they use a mutex internally to serialize operations and can become a contention point when waits and wakeups are frequent. That is inherent in the design of POSIX threads, though, and difficult to change. As an alternative, he has created an API called urcu_wait(), implemented in liburcu. It implements a stack of waiting threads that can be accessed quickly, and falls back to futexes when the need arises.
At this point, time was running out. Desnoyers quickly mentioned the adaptive spinlock implementation that he is working on with André Almeida. With a small extension to the RSEQ API, user-space code is able to tell whether the thread holding a lock is currently running on a CPU, and thus whether it is worth actively spinning to wait for the lock to become free.
Overall, there was interest in the ideas presented there; glibc already has some support for restartable sequences, so there should be no real impediment to using the feature more extensively internally. It is all, as they say, just a matter of writing the code and verifying that it truly improves the situation.
Recent improvements in GCC diagnostics
The primary job of a compiler is to translate source code into a binary form that can be run by a computer. Increasingly, though, developers want more from their tools, compilers included. Since the compiler must understand the code it is being asked to translate, it is in a good position to provide information about how that code will execute — and where things might go wrong. At the 2023 GNU Tools Cauldron, David Malcolm talked about recent work to improve the diagnostic output from the GCC compiler.
Much of the talk was dedicated to improvements in the ASCII-art output created by the compiler's static analyzer. In the existing GCC 13 release, the compiler is able to quote source code, underline and label source ranges, and provide hints for improving the code. All of this output is created by a module called pretty-print.cc, which has a lot of nice capabilities but which is proving increasingly hard to extend. It does not create two-dimensional layouts well, is not good with non-ASCII text, and its colorization support falls short.
This module tries to explain potential code problems found by the analyzer using text, and "sort-of succeeds", he said. But it is lacking spatial information that would be helpful for developers. If the compiler is complaining about a potential out-of-bounds access, which direction is this access going? Is it before or after the valid area, or perhaps overlapping with it? To illustrate this point, Malcolm showed an example (taken from his slides) of the analyzer's existing text output.
That output describes a potential buffer overflow and provides useful information, but it still may not be enough for the developer to visualize what is really going on. So GCC 14 adds a diagram depicting the buffer and the out-of-bounds access relative to it.
More complex situations can be illustrated as well; see the slides for other examples. There will also be better diagrams for string operations that show, when possible, the actual string literal involved and which handle UTF-8 strings.
All of these pictures are the result of a new text-art module that can do everything provided by pretty-print.cc and quite a bit more. It handles two-dimensional layouts and the full Unicode character set. It has support for color and other text attributes, including "blink" — though he requested that the audience not actually use that feature. It is "round-trippable", meaning that its output can be parsed back into a two-dimensional buffer; this feature will be useful for future diagrams, he said. As a demonstration of what text-art can do, he put up the output from the "most useless GCC plugin ever" — a chessboard.
There is, naturally, still work to be done. One project is a new
#pragma operation to have GCC draw the in-memory layout of a
structure so that developers can see how the individual fields will be
packed. Another is to provide output in the SVG format, though he confided
that he is not sure about how useful that capability will be. "Crude
prototypes" of both features exist, he said.
Moving on to the GCC static analyzer, Malcolm talked about some new features for analyzing C string operations. He implemented a new warning for operations that might be passed an unterminated string, but then took it back out and created a more flexible module that is able to scan for an expected null byte. It can, for example, check format strings for proper null termination, and is able to detect uninitialized bytes in strings as well.
He has added an understanding of the semantics of a number of standard string functions — strcat(), strcpy(), strlen(), and the like. The analyzer is now able to detect operations that will overrun a string buffer, though it only works with fixed-size strings at the moment. More advanced analysis is in the works for the future. There is also a check for overlapping strings passed to strcat(); he said that he wanted to use the restrict keyword to indicate where such checks make sense, but "nobody really understands what restrict does". So, for now, the checker just looks for overlaps in situations where that is not allowed.
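As a contrived illustration (not one of the talk's examples) of the kind of overrun these new checks aim to catch, consider a string literal copied into a too-small fixed-size buffer; compiled with -fanalyzer, recent GCC should be able to warn about the out-of-bounds write:

    #include <string.h>

    void greet(char *out)
    {
        char buf[8];

        strcpy(buf, "GNU Cauldron");   /* 13 bytes (including the NUL) into an 8-byte buffer */
        strcat(out, buf);
    }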
Future plans, he said, include implementing a new function attribute to indicate the need for a null-terminated string as a parameter. The visualizations for the diagnostics produced by the analyzer can always use improvement. He would also like to add an understanding of the semantics of more standard-library functions so that their usage can be checked.
The analyzer currently only works with C code; adding the ability to handle C++ is a desired feature. Basic support for C++ does exist now, but it is unsupported, "don't use it". The biggest problem, he said, is that it has no concept of exceptions and is badly confused by code using them, but there are a number of other problems as well. There has been a Google Summer of Code student (Benjamin Priour) working on C++, focusing only on the no-exceptions case for now. The goal is to be able to use the analyzer on GCC itself (GCC has moved to C++, but does not use exceptions). A test suite has been added, and much of the analyzer code has been made able to work with either language. The handling of the C++ new keyword has been improved. There is still a lot to be done, though.
Another project Priour has worked on, also with regard to C++, is improving output when an error is detected deeply within a nested set of system header files. In such cases, a simple mistake can generate pages of output. A new compiler option, the concisely named -fno-analyzer-show-events-in-system-headers, makes all that output go away.
Despite these improvements, Malcolm said, an attempt to use the analyzer with non-trivial C++ code "will still emit nonsense".
Within the analyzer code itself, a new integration-testing suite has been established. Every analyzer patch is tested by building a whole set of projects, including coreutils, Doom, Git, the kernel, QEMU, and several others. The warnings emitted are captured and compared against a baseline to look for regressions (or improvements). The analyzer is now able to use the alloc_size function attribute to check accesses to objects returned by functions. Another feature that might make it into the GCC 14 release is a warning for potential infinite loops. This check is not ready yet; it generates false positives and runs in O(n²) time, neither of which is ideal.
Malcolm concluded with a longer-term goal: improving the handling of errors related to C++ templates. A simple typo in the wrong place can end up generating pages of useless error information. There are various groups trying to figure out what information is actually useful in such situations. The real problem, he said, is that the compiler is still stuck in the 1970s and the batch-mode interaction style that was established then. For more complex errors there really needs to be a more interactive way for developers to explore the situation.
[Thanks to the Linux Foundation, LWN's travel sponsor for supporting my travel to this event.]
Finer-grained BPF tokens
Programs running in the BPF machine can, depending on how they are attached, perform a number of privileged operations; the ability to load and run those programs, thus, must be a privileged operation in its own right. Almost since the beginning of the extended-BPF era, developers have struggled to find a way to allow users to run the programs they need without giving away more privilege than is necessary. Earlier this year, the idea of a BPF token ran into some opposition from security-oriented developers. Andrii Nakryiko has since returned with an updated patch set that significantly increases the granularity of the privileges that can be conferred with a BPF token.
In the early days, the ability to load most BPF programs was restricted to processes with the CAP_SYS_ADMIN capability. That capability, though, allows a user to do far more than load BPF programs; it is essentially equivalent to full root access. In the 5.8 release, the CAP_BPF capability was added to regulate access to most BPF operations; other capabilities may be required as well for some specific actions. CAP_BPF still allows a process to do a lot of things, though, probably more than an administrator would like.
As a result, there is a longstanding interest in finding ways to further confine what processes can do with BPF. Various approaches, including adding authoritative hooks to the Linux security module mechanism and a special BPF device have been tried and rejected. The BPF token, a sort of digital cookie conferring the right to load BPF programs, seemed like it could be headed toward a similar end. Nakryiko and other BPF developers remain convinced that the security needs for BPF are unique, though, and that a unique solution for those needs is required; they have not yet given up on the idea of a token as that solution.
Much of the time, the answer to the limited-privilege question is to run the process needing privilege within some sort of container. User namespaces can often be used for this purpose, perhaps combined with a properly constructed mount namespace. Many of the things that BPF programs can do, such as attaching to tracepoints, are inherently global in nature, though, and cannot be contained in this way. For this reason, simply giving a process CAP_BPF within a user namespace is not a solution to the problem; the kernel ignores CAP_BPF at the namespace level.
BPF tokens are a way to give a process within a container the equivalent of the CAP_BPF capability. One of the concerns expressed with the original proposal, though, was that a token might escape from the container it was intended for, causing privilege to leak into the rest of the system. In the current proposal, a BPF token is tied to both a specific instance of the BPF filesystem (which holds persistent BPF objects like maps) and a user namespace. Any given token should, as a result, be useless outside of the context it was intended for.
The first step in enabling this functionality is to augment the BPF filesystem with a new set of mount options controlling a specific instance's interaction with BPF tokens:
- delegate_cmds lists the commands that a BPF token associated with this mount can allow. Thus, for example, a BPF filesystem could be mounted to support tokens allowing reading elements from maps but nothing else, while another could allow map creation, loading programs, or the creation of tokens.
- delegate_maps controls the types of maps that a token can enable a process to create. This option only makes sense if BPF_MAP_CREATE is included in delegate_cmds.
- delegate_progs specifies which types of programs a token can enable a process to load; BPF_PROG_LOAD must also be in delegate_cmds for any type of program loading to be allowed.
- delegate_attachs (not attaches, alas) controls the attach types that a token can allow — once again, if the loading of programs is allowed at all. See this page for a list of program and attach types.
All of these values are bitmaps corresponding to the definitions in <uapi/linux/bpf.h>. Thus, for example, mounting a BPF filesystem with:
delegate_cmds=0x21
would enable BPF_PROG_LOAD (0x20) and BPF_MAP_CREATE (0x01). Nakryiko acknowledges that this is not the friendliest of interfaces, especially since the values are defined in enums and the user must carefully count to find the relevant bit numbers; the ability to use symbolic names will probably appear at some point in the future. Meanwhile, the special value "any" is equivalent to setting all of the bits for a given option.
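For the curious, the 0x21 in that example is simply the OR of the relevant bit positions from enum bpf_cmd; a small sketch of the arithmetic (the mount option itself is still written by hand):

    #include <linux/bpf.h>
    #include <stdio.h>

    int main(void)
    {
        /* Each bit position corresponds to a value of enum bpf_cmd. */
        unsigned long cmds = (1UL << BPF_MAP_CREATE)   /* value 0 -> bit 0x01 */
                           | (1UL << BPF_PROG_LOAD);   /* value 5 -> bit 0x20 */

        printf("delegate_cmds=0x%lx\n", cmds);         /* prints 0x21 */
        return 0;
    }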
Once a suitable BPF filesystem has been mounted, presumably within a container, a program with the right privileges can use the BPF_TOKEN_CREATE command to create a new BPF token. An open file descriptor indicating the BPF filesystem mount to use must be passed as a parameter; the resulting token will be forever associated with that BPF filesystem, which defines the operations that the token can allow. It also is attached to the user namespace associated with the BPF filesystem mount. That association prevents the token from being used outside the namespace, but has another important implication as well.
The return value from a BPF_TOKEN_CREATE operation is a new file descriptor representing the token; it can then be passed to the intended user via the usual mechanisms. That user can include the token with any bpf() calls that it makes by putting it into the new fields added to the sprawling bpf_attr union for each command. For example, when creating a new map with BPF_MAP_CREATE, the token can be placed in the new map_token_fd field. It's worth noting that, consistent with the BPF subsystem's conventions, zero is not considered to be a legitimate file-descriptor number, and so cannot be used for the token descriptor.
When the kernel is considering whether to allow a specific bpf() call to succeed, it will check for the presence of a token that allows the requested operation. Possession of the token is not sufficient to allow the operation, though; the calling process must also have the required capabilities (CAP_BPF, plus perhaps others for some operations). In current kernels, these capabilities must come from the global init namespace; with the patch applied, they can be granted by the user namespace containing the requesting process. As a result, a container can be given the ability to execute specific BPF operations without giving that privilege to every process within the container.
Thus, a process running within a user namespace will be able to carry out BPF operations, but only if it possesses a valid BPF token, that token allows the specific operation being requested, and the process has the requisite capabilities within its user namespace. This patch series does seem to provide the way to tightly control what can be done with BPF. For now, the abilities set in a BPF filesystem mount apply to all tokens created with that filesystem. In the future, there will likely be an additional operation that allows the holder of a token to remove specific abilities, further restricting what the token allows.
Thus far, there have been few comments on this version of the patch set, and none that question the core concept; most of the discussion has focused on the details of integrating Linux security module support. So it would seem that most of the objections to previous versions have been addressed. Should that situation hold, then the path into the mainline for this work seems reasonably clear. Token-based security mechanisms have not had a place in Linux until now, so some new ground is being broken here. That could be said to be unsurprising; BPF has been challenging established ways of doing things in Linux for some years at this point.
Defining open hardware
Open-source hardware (or open hardware) refers to hardware that is developed in a manner similar to open-source software. There's a widely accepted definition of open-source hardware, but it is probably not as well known as its open-source-software counterpart. In addition, there is a popular certification program that hardware makers can use to indicate which of their devices meet those criteria. But there are some vendors that are showing more enthusiasm than others in participating in the process—or in producing open hardware at all.
The leading organization advocating for open-source hardware is the Open Source Hardware Association (OSHWA). Established in 2012, it provides a definition of open-source hardware based on the Open Source Definition maintained by the Open Source Initiative. The OSHWA definition's introduction describes its main principle as:
Open Source Hardware (OSHW) is a term for tangible artifacts — machines, devices, or other physical things — whose design has been released to the public in such a way that anyone can make, modify, distribute, and use those things.
This principle is then made explicit in 12 criteria that are not unlike the ten criteria of the Open Source Definition.
OSHWA recommends eight licenses for open-source hardware in its FAQ. These are divided into two categories: copyleft licenses, which require derivative works to be released under the same license, and permissive licenses, which allow for proprietary derivatives. Recommended copyleft licenses include the non-hardware-specific GPL and Creative Commons Attribution-ShareAlike (CC-BY-SA), as well as the hardware-specific CERN Open Hardware license and TAPR Open Hardware License. Recommended permissive licenses include the non-hardware-specific FreeBSD license, MIT license, and Creative Commons Attribution (CC-BY), as well as the hardware-specific Solderpad Hardware license.
It's important to note that, as with open-source software, licenses that prohibit commercial use are not compatible with the OSHW definition. Since the creation of hardware invariably involves money, it's difficult to make use of a hardware design without some form of commercial activity.
OSHWA also has a set of best practices for creators of open-source hardware projects. For example, these recommend sharing the "original source files that you would use to make modifications to the hardware's design". While the best practices encourage using FOSS software for designing the hardware, they acknowledge the reality of proprietary programs and file formats in this domain and allow their use.
If we limit ourselves to printed-circuit boards (PCBs) such as microcontroller boards and single-board computers, then what needs to be shared includes mechanical drawings, electronic schematics, a bill of materials, and the design of the printed-circuit-board layout. If any of these components is lacking, the hardware can't be recreated.
One prominent example of hardware with limited information is the Raspberry Pi single-board computer. While some electronic schematics are published, they primarily show the pinouts of the connectors. These reduced schematics are useful for users or those who want to design add-on boards, but not for those who want to make their own version of a Raspberry Pi. According to the license information, the schematics are released under the Attribution-NoDerivatives 4.0 International (CC BY-ND) license, which is not an open-source hardware license because it does not allow derivative works.
Self-certification program
In 2016, OSHWA set up a certification program, which relies on creators to voluntarily self-certify their projects. Doing so allows them to use a logo indicating their compliance with the OSHW definition.
OSHWA has the authority to revoke a certification, as it has done on a few occasions. The first time was in 2018 for the Motedis XYZ 3D printer because the project's documentation link was no longer working. After OSHWA was unable to obtain a copy of the documentation from the contact person in the certification application, the organization revoked the certification. This also happened for the Atmel SAM D10C Breakout Board by San Antonio Technologies for the same reason.
In 2022, OSHWA started publishing documentation for decertified hardware. This was possible because OSHWA had been archiving documentation as part of the certification process for a few years. In cases where the documentation is no longer available on the manufacturer's web site, OSHWA now publishes the documentation as part of the decertification process.
Earlier this year, OSHWA revoked the certification of the SparkFun DataLogger IoT – 9DoF. This was done at the request of SparkFun "due to accidental filing". While the hardware for the project was open-source, the firmware was not. This came as a surprise to users, because SparkFun is considered a big proponent of open hardware. SparkFun CTO Kirk Benell explained in a GitHub issue: "The OSHWA logo/cert was a mistake made by our system when we build [sic] the board -- everything ran on automatic and it wasn't checked before the release".
The list of certified open-source hardware projects on OSHWA's web site includes over 2,500 projects. These encompass a wide range of devices, including Arduino boards, add-on boards for Arduino and Raspberry Pi, drone flight controllers, 3D printers, smart speakers like the Mycroft Mark 1, and even electric vehicle charging stations. Adafruit, with almost 700 certified products, and SparkFun, with almost 600, have a significant presence on the list. Many of these certifications are for Arduino-compatible boards. Olimex, a smaller player, has 68 certified products, including ESP32 boards and its OLinuXino line of single-board computers running mainline Linux.
Each product page on the certification web site provides a direct link to the product's documentation. However, the hardware and software files are not directly accessible. To find the hardware schematics, you need to visit the referenced project web site and search for the appropriate files on that page. For example, for Adafruit products the hardware schematics are linked under the Technical Details header, while for SparkFun products the Documents tab shows them. Olimex shows a link to the hardware schematics under the Hardware heading.
In its blog article about the first decertification, OSHWA explained that it is trying to reduce duplication of effort, which is why it doesn't just serve the hardware's documentation on its own web site:
Developing and maintaining a feature-complete documentation hosting solution is beyond OSHWA’s core competency. Many good solutions for developing and maintaining software and documentation already exist online. Requiring certifiers to update and maintain yet another repository of documentation in order to certify was determined to be unnecessarily burdensome.
Other places to find open hardware
There are various other places where you can find open hardware that isn't necessarily following the OSHWA certification program. For example, OpenHardware.io contains more than 500 projects. For each project, a page shows the license, photos, a description, a bill of materials to order the parts, the source code for the associated software, and all the necessary design files for the hardware. The web site hosts sensor boards, relays, LED controllers, remote controls, Arduino and Raspberry Pi add-on boards, adapters and more. A lot of projects are still indicated as a "work in progress", though.
Kitspace hosts a smaller selection of interesting electronic designs. Notable offerings include a tiny Arm microcontroller board that fits into a USB port and barely protrudes, a tiny Arduino-compatible board, a WiFi air-pollution sensor board, and boards for the ESP8266 and ESP32 microcontrollers. The web site is sponsored by some PCB manufacturers, which results in direct links to these manufacturers from the product page to order the PCB.
Open hardware doesn't have to be about electronics, though. Thingiverse, a well-known web site in the world of 3D printing, provides a diverse range of open designs for 3D objects. Anyone can produce these objects with their own 3D printer using the shared STL files. Thingiverse offers a lot of tools and ornaments, as well as designs for enclosures for various microcontroller boards or SBCs.
Arduino and OSHWA
One of the most well-known open-hardware projects is Arduino, which offers microcontroller boards with an accompanying open-source development environment. The electronic schematics and design files of its boards are available under the CC-BY-SA license. This allows anyone to recreate these Arduino boards, although without using the trademarked Arduino name.
What is somewhat strange, though, for such a notable company in the open-hardware ecosystem, is that OSHWA's list of certified open-source hardware doesn't contain any official Arduino boards. Those boards seem to tick all of the boxes of the OSHW definition, but Arduino has chosen not to certify its products. That's even more surprising if you know that the list of endorsements of the OSHW definition includes Arduino founders Massimo Banzi, David Cuartielles, David Mellis, and Tom Igoe. They even helped create the definition. But when Adafruit recently asked Arduino if it would consider certifying any of its boards, the company declined.
Still, aside from a few mistakes with missing design files and licenses, Arduino has been releasing all of its boards as open hardware. This changed with the introduction of the Arduino Pro hardware, as Adafruit pointed out in a 2021 blog post. The product page of the Portenta H7 board only lists schematics and a data sheet, omitting the design files necessary for manufacturing the board.
For a long time, Arduino's introduction page claimed: "All Arduino boards are completely open-source, empowering users to build them independently and eventually adapt them to their particular needs."
When Adafruit asked about this discrepancy, Arduino's Alessandro Ranellucci replied that, for the Arduino Pro line, the company wanted to "prevent counterfeiters from blindly downloading a file and manufacturing it without any R&D effort or contribution to the community". As a result, the company decided to publish the schematics without the design files necessary for manufacturing the board. The original statement on the introduction page has since been removed, and the page now says: "The plans of the Arduino boards are published under a Creative Commons license, so experienced circuit designers can make their own version of the module, extending it and improving it."
For now, Arduino seems to uphold its promise to keep its "products for makers" as open-source hardware (although not OSHWA-certified), but does not do the same for its Pro line; boards such as the Portenta C33 and the Portenta X8 have been released without design files. It's a bit concerning, though, that the newest non-Pro boards (like the Arduino Nano ESP32 and the Arduino UNO R4 WiFi) don't even mention "open" on their product pages or in their documentation. It's unfortunate that a big player like Arduino isn't taking a clearer position on open hardware.
As the OSHWA list of certified projects, along with other directories such as OpenHardware.io, Kitspace, and Thingiverse, shows, there is already a lot of open hardware. We can hope that Arduino has a change of heart and renews its commitment to keep its products for makers open, whether OSHWA-certified or not. Arduino plays a significant role in this space, not only with its hardware, but also with its software ecosystem. Fortunately, companies such as Adafruit, SparkFun, and Olimex make a big effort to certify their hardware. So, those who wish to build upon OSHWA-certified hardware have a lot of alternatives to Arduino boards to choose from.
The 2023 Image-Based Linux Summit
Following up from last year's first Image-Based Linux Summit, a second meeting was held in Berlin on September 12th, 2023, the day before All Systems Go! 2023, at the Microsoft office. The goal of these summits is to find common ground among stakeholders from various engineering groups around the topic of image-based Linux distributions, communicate progress, and attempt to build a strategy to tackle shared problems together. The organizers — Luca Boccassi, Lennart Poettering, and Christian Brauner — welcomed participants from the UAPI Group, which draws developers from a long list of companies with an interest in this area, and spent the full day discussing a variety of topics. Full minutes have been published on the UAPI Group’s web site.
Progress since last year
Progress achieved since the last summit was discussed first. The UAPI Group has been set up, with a GitHub organization and a new web site that is already gathering specifications relevant to image-based Linux; these include those for unified kernel images (UKIs) and discoverable disk images (DDIs). More specifications are being worked on, including a specification to formalize how to handle configuration files on a hermetic /usr system — the drop-ins, masking, and /etc/ -> /run/ -> /usr/ patterns already familiar to users of systemd and programs built on libeconf.
The systemd project has implemented a lot of changes, many of which were initially suggested at last year’s summit. Systemd-boot and systemd-stub gained several new features, including add-ons support (signed PE binaries for kernel command-line additions). UKIs can now be built with a new Python tool, ukify, that doesn't depend on objcopy and, thus, supports cross-architecture assembly of images; these can include many new metadata fields, such as signed PCR11 measurements from the TPM.
Several components of the machine will now be measured by systemd so that secrets, such as a disk encryption key, can be tied not just to a UKI vendor but also to specific system information, such as a disk UUID, and to a specific phase of the boot process. Systemd-repart and mkosi can now build images without privileges or loop devices. They can also be used to build initrds fully based on packages, with no dracut/initramfs-tools involvement (this used to be implemented by a separate mkosi-initrd project, but is now supported by mkosi itself).
On the provisioning side, SMBIOS is now supported to provide read-only, ephemeral configuration to a virtual machine; this data can include systemd credentials, which are now supported by most systemd components, including generators. The new "confext" type DDIs are also supported for dm-verity-protected images that deliver configuration data to be overlaid upon /etc. Sealing secrets against the TPM can now be done "offline" (from a different machine), having only the target's public key. Fully encrypted TPM sessions are used, creating and pinning the storage root key (SRK) if not already present. Last but not least, a new soft-reboot mechanism was added that only reboots user space, leaving the kernel running, which is useful in image-based systems for updating from one image version to the next with minimal latency or loss of connectivity and state.
Distributors have also done their fair share of work over the last year:
- NixOS is working on a Rust version of systemd-stub and ukify to boot and build UKIs; systemd-repart and systemd-sysupdate are available. NixOS now offers systemd-networkd in its initrd.
- Flatcar now uses systemd-sysext extensively for A/B updates of OEM software and provides a set of scripts to build third-party software and deploy it with systemd-sysupdate. Flatcar implements factory-reset functionality, and /etc is now an overlay on top of a read-only base.
- Ubuntu recently made the tech news with the announcement of a desktop flavor that enables TPM-backed disk encryption by default, which is one of the main goals for all of the summit’s attendees. Ubuntu uses systemd-stub for the Snap-based kernel updates, and to pre-calculate PCRs for TPM sealing.
- GNOME OS already supported systemd-boot, and now also uses systemd-repart for partitioning on first boot. GNOME OS deploys UKIs with an initrd built using dracut.
- SUSE is steadily working toward using systemd-boot in various OS flavors, including MicroOS, and the YaST installer was enhanced to support this. Aeon, formerly MicroOS Desktop, is a new image-based flavor that uses systemd-boot, image-based updates, and full-disk encryption by default, and is exploring adding support for systemd-homed for managing users and home directories. Tumbleweed has fully embraced hermetic /usr and no longer ships files under /boot; work is in progress toward moving the default configuration files from /etc to /usr.
- The Fedora installer, Anaconda, gained native support for installing systems using systemd-boot as an alternative to GRUB.
Finally, UKI support is spreading to other projects; patches were proposed to allow loading them from GRUB, and OSBuild has gained native support for building them.
As expected, most of the focus in the past year has been on improving the situation around boot security. Linux has long been left behind by Windows and macOS in this area, and it is refreshing to see such a renewed and concerted effort to close this embarrassing gap.
Hermetic /usr and sysext/confext
The discussion around sysext and confext has been gaining traction recently. These are two types of extension images (DDIs that provide read-only additions to a root or base filesystem), extending the /usr and /etc hierarchies, respectively. Currently, the sysext/confext overlay is read-only, prompting the question of whether an optional writable layer or mode should be added, though not set as the default. This mode could be either ephemeral or persistent. Additionally, there's a proposal to move the OS layer to the top of the stack; it currently resides at the bottom. Suggestions have been made to address these issues using symlinks, an approach that is currently being worked on, and there's an idea to introduce an ordering guarantee to the sysext specification, which was implemented shortly after the summit.
Configuration management for image-based systems
There are two sides to this discussion. First of all, there is the question of how to get a configuration into a virtual machine. A common mechanism is provided by cloud-init and a faux-network connection. This is far from optimal, as requiring a full network handshake is slow, cumbersome, and fraught with vendor-specific pitfalls. An idea to improve the general experience around this flow would be to use a network namespace plus virtual routing and forwarding (VRF) to let tools like cloud-init have their own private network connection to the local hypervisor/cloud fabric without affecting the rest of the system.
A better alternative would be to use something that is not network-based at all. Systemd gained support for SMBIOS Type 11 objects, which are already supported by QEMU and Cloud Hypervisor. These objects work well for a user’s virtual machine, but they are problematic for some cloud vendors to support as the SMBIOS strings need to be fixed some time before the required configuration data is available. A proposed alternative would be a new ACPI driver and pseudo-device provided by the firmware or hypervisor that generates the data on-the-fly when requested, in a blocking mode. Systemd would provide a synchronization point in the initrd that services can hook into and synthesize systemd credentials; then systemd would reload the configuration, proceed with the transition, and exit the initrd phase.
This would essentially amount to adding a third phase to the boot process, in the initrd, when additional resources become available as a consequence of the first phase doing the required actions, before transitioning to the rootfs. The latter part would be relatively easy to implement, with the ACPI driver being the most difficult piece of work. If a cloud vendor volunteers to do this work, then it could be easily integrated.
The second sub-topic concerns how configuration files are consumed by services; SUSE has been actively working on adding support for libeconf in upstream projects for many years. While progress has been made, certain applications, like apache2 and nginx, still rely on files in /etc to function properly. Complex configuration files, often in XML format, have also posed challenges. Fedora has introduced patches to address these issues, demonstrating the ongoing efforts to achieve hermetic /usr.
The main action item is to create a tracking issue to list upstream projects that still have to be updated to support this configuration model, so that contributors can collaborate to shrink the list. While the work is not finished, the situation on this topic is in a much better place than it was a few years ago, thanks to the work of many stakeholders. Finally, a specification detailing how libeconf and systemd handle configuration files, aimed at developers implementing their own configuration loading, was recently published.
Systemd credentials
Still on the topic of configuration, the question was raised on how to update systemd credentials. At the moment, they are static; a service receives its credentials at startup time, and they cannot change until the service itself is restarted. For a lot of use cases this is enough, but for some it is not; for example, certificates might be rotated for a service that is sensitive to interruptions. Normally the recommended pattern is to use the file descriptor store to achieve fast restarts (one issue raised was the lack of documentation around this systemd feature, which was promptly fixed just after the summit as a consequence), but in some cases the service interruption is too expensive to contemplate for such a configuration update.
Fortunately, work is already scheduled to integrate the confext feature into the "reload" mechanism, which is traditionally used to send a SIGHUP to a service, to also reload the stack of confexts in case some were updated. The same pattern could be used for credentials; on reload, credentials are opened again and updated, so that interrupt-free updates can be performed.
Another issue was raised: currently, the documentation states that the path to loaded credentials has to be derived from an environment variable, which is problematic for projects that do not support environment variables or search paths. But it turns out this was set up this way only because of user units, which depend on the user ID; for system services it is actually fixed. The documentation will be updated to clarify this, hopefully removing a (small) barrier to adoption.
Separately, it was also mentioned that there is currently no way to enumerate existing credentials; the proposal is to enhance the systemd-creds tool to do this job as well. Another future improvement that is already being worked on is asymmetric TPM-based encryption, so that credentials can be encrypted, away from the host, using only the target’s TPM’s public key. Currently only symmetric encryption is supported, which can be tedious to use as it requires key sharing.
/efi vs /boot vs /boot/efi vs /run/efi
The debate over the mount point for the EFI system partition (ESP) is ongoing. The issue is that when both a Linux extended boot (XBOOTLDR) partition and an ESP are present, it is unclear where each should be mounted by default. Generally speaking, the ESP is where you always want to store bootloaders, and XBOOTLDR is where you want to store kernels (and initrds), as they will likely require much more disk space. SUSE's RPM-based filesystem creates base directories, which can be problematic for top-level directories serving as mount points. A suggestion to use automount was discussed, questioning the necessity of manually mounting these directories.
Different tools, including bootctl, fwupd, kernel-install, and systemd-logind, require access to these locations for various purposes. The challenge is to ensure that these tools don't double-mount directories. There's a proposal to establish consistent standards and APIs for handling these paths, along with discussions about default directory locations and conflicts with other specifications. The first order of business was to reconcile the discoverable partitions specification and the bootloader specification so that they suggest the same approach; that was fixed shortly after the summit. Whether bootctl could be used to provide a unified interface to access the EFI partition will also be explored.
TPM
Work in the area of measured boot was one of the focal points of last year’s summit, and this year was no different. After a brief recap of all the work that has been done to implement sealing against signed policies, so that secrets can be entrusted to be decrypted only when booting images from the same vendor, attention focused on upcoming developments that will also allow sealing against the state of the local machine.
The upcoming systemd-pcrlock tool will allow sealing against a policy that takes into account PCRs zero to seven, which are owned by the firmware, but that policy can also be temporarily relaxed (if the system is in a known good state) when a firmware update is applied. Such updates can optionally provide a list of expected measurements (in the TCG CEL-JSON format) that will be used for the new policy. If, instead, those measurements are not provided, the next boot process will remeasure and add the new state to the policy, making it strict again. This ensures that attackers cannot simply relax the policy when booting their own systems, as the policy can be changed only when the system is in a known good state. If vendors collaborate by providing the measurements file, then security is never downgraded, not even temporarily. This feature represents a substantial step forward from the status quo, which requires re-sealing secrets on every update, thus changing a disk’s superblock, and making it essentially unfeasible to seal many objects (e.g. systemd credentials).
In order to develop this feature, an append-only event log had to be added to systemd, as measurements need to be replayed. This event log follows the TCG event log specification closely, so that it can be translated to or from that format. It was discussed whether to provide an API for it so that applications can also consume or append to the event log; the idea was deemed acceptable, if someone was willing to implement it.
There are a few corner cases that still need some work — for kexec, it is currently unclear what the best course of action would be. Current thinking is to change the policy to measure a nonce and expect it to be provided by the new kernel, so that only the next kernel can be successfully validated. Also, on factory reset, several machine-specific identifiers that are measured would be lost, so a solution is needed. Ubuntu Core stores an encrypted object on installation that allows such a reset, and systemd should be enhanced to provide the same capability.
Finally, non-disk-encryption usage of the TPM was discussed in order to gauge interest. The systemd credentials use case was already mentioned, but there are others, chiefly remote attestation. System Transparency provides OSS tools for it, and so does Keylime, which is integrated into SUSE MicroOS.
Unified kernels and pre-built initrd
The discussion about unified kernels and pre-built initrd was brief, as most of the work has already been done and embraced by participants. The main news is around the add-ons feature, which allows the platform owner’s optional enhancements to be added on top of the OS vendor’s images. This supports kernel command-line extensions for now, support for DTB is under review, and next on the to-do list is initrd add-ons. Finally, new sections will likely be documented in the UKI specification; these include support for embedded microcode, so that it can be loaded first in the on-the-fly generated initrd that is passed to the kernel, as that’s the established protocol.
systemd-sysusers and "user" users
The systemd-sysusers and "user" users discussion focused on the addition of a switch to copy /etc/skel so that it can be used for "normal" users too, and not just "system" users. This would be a lightweight integration, focusing exclusively on support for the home directory and /etc/skel.
Homed in openSUSE Aeon
The final topic discussed was systemd-homed, and the attempted integration into SUSE Aeon. This effort suffered from a number of paper cuts, but it seems that most are solved or are being solved.
The first issue is provisioning. Since partitions have to be sized accurately according to the number of users, this needs to be known ahead of time. There are two solutions for this problem: first of all, by using Btrfs subvolumes, the problem goes away entirely as there’s no need to resize partitions, since space is allocated dynamically as needed. However, native encryption of subvolumes is not supported by Btrfs yet, although it is being worked on. There is also ongoing work in GNOME to provide an interface to interactively resize partitions when needed.
The second issue is integration into the desktop GUI, which is currently lacking but, once again, GNOME is working to implement it so that homed user management is fully integrated into GNOME account management. Furthermore, the lack of an upstream SELinux policy for homed was another issue that was discussed, but work is ongoing in Fedora to add support for it.
Finally, how to properly size /home relative to the root filesystem was discussed. Android uses dm-linear to create live partition "extensions" without needing to reallocate data; systemd-repart and homed could be enhanced to use the same pattern, so that space could be cheaply reassigned between the two partitions. An alternative approach could be to only have one partition and rely on Opal self-encrypting drives to protect the contents of the root directory.
Conclusions
After a long day of discussions, participants were tired but happy. The summit was again positive and productive; lots of good ideas and action items came out of it. And now we have about a year for the hard part: actually implementing them. Keep an eye on our changelogs for further updates.
Page editor: Jonathan Corbet