LWN.net Weekly Edition for June 27, 2024
Welcome to the LWN.net Weekly Edition for June 27, 2024
This edition contains the following feature content:
- Making containers bootable for fun and profit: a DevConf.cz talk on the bootc utility.
- Programming in Unison: an experimental programming language with a unique approach to collaborative development.
- The GhostBSD in the machine: a look at the GhostBSD 24.04.1 release.
- A capability set for user namespaces: another attempt to make user namespaces a little bit less scary.
- Yet more LSFMM+BPF reporting:
- Updates to pahole: Poke-a-hole has grown far beyond its original parameters; now, it is being used to produce BTF debugging information for the kernel.
- Rust for filesystems: adding Rust abstractions for the VFS layer is proceeding, though there are still obstacles that need to be resolved; those obstacles were the topic of the discussion.
- Finishing the conversion to the "new" mount API: many kernel filesystems have still not converted to use the mount API that came in Linux 5.2 in 2019; the discussion considered some of the remaining issues to be resolved to finish the job.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Making containers bootable for fun and profit
Dan Walsh, Stef Walter, and Colin Walters all walk into a presentation and Walter asks, "why would you want to boot your containers?" This isn't the setup for some technology joke; it is part of the trio's keynote at DevConf.cz in Brno, Czech Republic on June 14 about bootable containers (bootc). The talk, which was streamed to YouTube for those of us who didn't attend DevConf.cz in person, provided a solid overview of bootc and the problems it is intended to solve. The idea behind bootc is to make creating operating-system images just as easy as creating application-container images while using the same tools.
Walters is the creator of libostree (originally called OStree) and rpm-ostree, tools that allow creation, deployment, and management of bootable filesystem trees using a Git-like interface. LWN first covered OStree in 2012, and more recently in covering Project Bluefin, which uses libostree and rpm-ostree to create its images. Making it easier to compose and update operating-system images is a problem, he noted during the talk, that he has been working on for a long time and is important to him: "we ended up making our society dependent on computers, so it's important to keep them updated".
The problems
They began by talking about some of the problems that bootc is designed to solve, or at least help to solve. The classic problem, Walters said, is managing Linux systems at scale. Managing one or two systems is easy, managing tens and hundreds is not. But the number of systems that are being deployed increases all the time. It continues to be a challenge, he said, and we cannot train qualified system administrators quickly enough to deal with all of the computers we keep creating.
Another problem is infrastructure drift. A system that has been deployed for a while is likely to have drifted from its original and expected configuration. Walters gave an example of a harried system administrator making one-off fixes to a system in response to emerging requirements without using proper configuration management, leading to a system without a repeatable configuration. That makes it harder to troubleshoot the system in the future, or to replace it if needed.
Walsh stepped up next to talk about the related problems of installing operating systems at scale. Currently, organizations end up trying to solve this problem (in part) with homegrown tools. If an organization has ten or more systems to manage, he said, administrators "tend to invent some kind of image mode" to make the process of installing and handling initial configuration easier. The lack of shared tooling and workflows between operations and developers is yet another problem, Walter pointed out. Organizations have, he said, "completely different workflows between ops and dev". The teams that build and deploy systems, and the developers who develop applications, "talk to each other, but they don't share a lot of common technology", he said. As an example, he noted that it can be difficult to manage different types of software content—such as software packaged using programming-language-specific tools—at the system level. It is easy to include Python packages in an application container using pip, but "for the host, it's completely non-standard".
There is also the problem of deploying system updates. System administrators, rightly, have concerns about system updates causing downtime or problems with production applications. That could be helped by providing system rollback, sort of a security blanket for system administrators. Walters said this is near and dear to his heart after so many years working on image-based systems. Administrators want to know that when they deploy an update, they can roll back to a known-good system if the update breaks something.
Finally, Walters said that administrators want to be able to verify that a system is running the software they expect to be running. Walters said that he also likes to consider security scenarios like ransomware attacks. "If I take my kid to the hospital, I don't want to discover they got hit by a ransomware attack, right?" One way to avoid that, he said, is to have an image with a verifiable cryptographic signature that stays with it through the boot process and run time of the system.
Why containers?
Having enumerated the problems that they would like to help solve, Walters turned the discussion to explaining why their focus was on bootable containers as a way to solve those problems. The reason to work on bootable containers, he said, is the ecosystem around containers. "I like to think about things in terms of 'centers of gravity' [...] and the container ecosystem is large, and there are so many tools that orbit that center of gravity". Many people are already well-versed with container tools, and there is so much energy focused on that ecosystem already, so why not adapt them to manage full Linux systems? That led to the first core principle of bootable-container images, which Walter led the audience in reading aloud from one of their slides:
Use standard container practices and tooling, such as the OCI standard, layering, container registries, signing, testing, and GitOps workflows to build Linux systems.
Walsh then moved on to a discussion of the container ecosystem and tools used to create and work with Linux containers such as Podman, Docker, and Kubernetes. "It's all about taking advantage of our tools to build the operating system the same way we build our applications." That includes distributing operating-system images using container registries, signing the images the same way that application-container images are signed, and using container scanning tools to examine images for known vulnerabilities. It also means being able to build operating systems using a format that many developers are already familiar with, Containerfiles.
Using bootc
In a standard Containerfile, developers declare a base image to start with, using the FROM directive. That does not change for bootable containers; developers simply need to specify a new kind of base image that contains the new bootc tool (also invented by Walters). From there, Walsh said, developers can install RPMs and Python packages, add configuration scripts, use Ansible for additional configuration, and more. "The whole center of gravity is around Containerfiles." At this point an audience member shouted "I hate Containerfiles! They're a terrible way to put things together!" Walsh responded that developers don't have to use Containerfiles; they can also use tools like Buildah or BuildKit that use other specification formats to create container images, or any other tools that can create Open Container Initiative (OCI) compliant container images.
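To make that concrete, here is a minimal sketch of what such a Containerfile might look like; it assumes the Fedora bootc base image published on quay.io, and the package and configuration choices are illustrative rather than taken from the talk:
# Illustrative sketch only; the base image, packages, and paths are examples.
FROM quay.io/fedora/fedora-bootc:40

# Add software with the distribution's package manager, as in any container build.
RUN dnf -y install httpd && dnf clean all

# Layer in site-specific configuration.
COPY httpd.conf /etc/httpd/conf/httpd.conf

# Make sure the service starts when the image is booted as a host.
RUN systemctl enable httpd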
From there, Walsh said, the image can be tested using Podman or using a full continuous-integration (CI) system. "We want to be able to test our operating system before we ever deploy it." Note that it is possible to start a bootc container as a regular container as well. In that case, the kernel in the image is simply ignored. Walters added that the goal is to shift left the creation and testing of operating systems. Instead of, for example, using SSH to log into a system and make a change in production, Walters said he hoped that administrators would make the change in the Containerfile and run the change through CI, push to a container registry, and then deploy the image to production.
The next part of the presentation was a recorded demonstration of using bootc to replace a running Linux system. The demonstration showed using Podman to start a privileged container in a system running on Google Cloud that mounted the host system's filesystems, and then used the bootc tool in the container to install itself to the system. The old filesystem, Walters said, is still there, but the user can now boot into the container's filesystem. After reboot, the system has a new bootc command that can display its status and be used to update and manage the system images.
One of the important things, he said, is that it allows administrators to know the state of the system. The "bootc status" command will show the SHA256 sum associated with the image of the running system. If it matches the sum of the container image, then the system administrator knows that it matches the image they expected to be deployed. "I know that the state of this cloud instance happens to be exactly what went through my CI/CD bit-for-bit".
In addition, bootc sets up automatic updates. After an image with bootc is installed on a system, it can check the image registry for new images and install those if found. "We can get the humans out of the system and let the robots take over", Walsh said.
Walters was quick to point out that this was just a demo, and that they did not expect that users would SSH into a system and run Podman commands to provision systems with bootc. It is possible, he said, to use various cloud-provisioning systems or bare-metal installers to install a system with bootc today.
Another way to create bootable bootc images, said Walter, is to use the bootc-image-builder tool, which (as the name suggests) creates disk images from bootc images. It can create a variety of disk image formats, such as qcow2 images for use with OpenStack or QEMU/KVM, AMIs for Amazon Web Services deployments, ISO images, and more. Walsh added that the functionality to create images with bootc-image-builder has been added to Podman Desktop, so users can use a graphical environment to build images. Yet another way, Walter added, is to use the Anaconda installer with a Kickstart file that references a bootc image.
Those features, said Walters, mean that "we've made a dent in the problem domain of managing systems at scale and infrastructure drift" as well as the other problems outlined at the beginning of the talk. Walter said that it also made it possible for developers and operations teams to use the same tools to build images, configure them, and layer applications into the images. And, Walters added, it makes rollbacks possible when (not if) something goes wrong. He said that, at the moment, rollbacks are not automatic "but the API is there" for the future.
Rollbacks may be less necessary, said Walsh, if developers are making good use of CI/CD to test images before deployment. If developers really test their code, and test their operating system, they have fewer chances of problems that would require rollback in the first place.
Get started today
Walter took over the floor to talk about bootc's availability. "Did you know that right now in Fedora those bootable base containers exist?" He pointed to the Fedora documentation for bootc, and mentioned that there are bootc images for CentOS Stream as well. "You can start using this for your systems and servers today." Users who want to use bootc on their laptop, he said, should check out Bluefin. "They have taken this concept all the way, you should see their Containerfile and all their tweaks they've made to make a focused workstation for developers". Finally, he noted that the technology is already available with Red Hat Enterprise Linux (RHEL) 9.4 as a tech preview, and in OpenShift (which uses RHEL CoreOS) since version 4.12.
It is difficult to get a sense of who uses a technology, Walters said, because developers usually only get to see "the tip of the iceberg" via bug reports, pull requests, and the like. Even so, he observed that there seems to be a lot of interest in bootc already. Walter spoke next to talk about ways to contribute. One obvious way is to add support for bootc containers, which will help the bootc project and its ecosystem. Another way, he said, is for upstream project developers to use systemd-sysusers or dynamic users features if a service requires adding a user to the system. He also encouraged RPM packagers to "get your shit out of %post" in RPM spec files (scripts that are run post-installation). During a container build, he said, there is no running system, so "it's going to have different behavior" than using a post-installation script on a running system. Walters commented that some people flash system firmware using %post scripts and that does not work using this tooling. Interested parties should also check out the Linux Userspace API Group (UAPI), a community for defining specifications for immutable and image-based Linux systems.
At that point, the talk ran out of time. The recording for the talk is available as part of the day two DevConf.cz stream on YouTube.
Programming in Unison
Unison is an MIT-licensed programming language, in development since 2013, that explores the ramifications of making code immutable and stored in a database, instead of a set of text files. Unison supports a greatly simplified model for distributed programming — one that describes the configuration of and communication between programs in the same language as the programs themselves. Along the way, it introduces a new approach to interfacing with programming languages, which is tailored to its design.
Every programming language, especially one that is just starting out, needs a niche. Unison's chosen niche is cloud computing — making it easier to build modern distributed systems, by radically simplifying some of the rough edges of existing technologies. While it is certainly possible to throw together simple, local scripts using the language, the core developers' focus is on making the development of distributed systems and web-based applications as seamless as possible. In support of this mission, the language employs a number of unusual features.
Naming
The feature that fundamentally sets the language apart is the way code is stored. Unlike most other programming languages, which store programs as text, Unison stores programs in a machine-readable format. There are other languages that have done this, including languages like Smalltalk with image-based persistence, or visual languages like LabVIEW. Unlike those languages, Unison programs are stored in an append-only, content-addressed database. The code is still displayed to the user for editing as text, using the editor of their choice, but it is only parsed once, and then stored internally in the database. Consider the following implementation of the factorial function:
factorial : Nat -> Nat
factorial n = match n with
    0 -> 1
    _ -> n * factorial (n - 1)
This function has the hash #in3bl5u64l (rendered in Unison's default base-32 hash format), using Unison's custom structural hash function. The hash is based on the structure of the code, not the variable names or formatting used to express it. Internally, the abstract syntax tree (AST) of the code is stored in Unison's database under that hash. If another person wrote the same function but decided to call it fac, it would have the same hash. When editing some other function that referenced it, Unison would display whatever name the user had defined for it; so one person might see factorial and the other might see fac. In this way, Unison names are a lot like Git tags: a human-readable name for an object that is primarily identified by a hash.
In general, the programmer interacts with Unison using an editor side-by-side with a terminal running the CLI interface, or a browser window running the graphical interface. When writing new code, the user types it in their editor, like any other language. On save, Unison is alerted by a filesystem watch, reads the code, and then presents any problems with it, or offers to update the database with it. When editing an existing function, Unison pretty-prints the stored definition into the user's editor, and watches for changes. This has the interesting effect of doing away with code formatting as a separate step — code is always formatted when the programmer goes to read or edit it. Overall, the approach ends up feeling much more like a collaboration with the compiler than conventional languages do: it asks for definitions, suggests changes, points out problems and failing tests, etc. Here's what it looked like when I added the above definition to my code:
I found and typechecked these definitions in /tmp/scratch.u. If you do
an `add` or `update`, here's how your codebase would change:
⍟ These new definitions are ok to `add`:
factorial : Nat -> Nat
Unison's approach to naming may seem like an interesting curiosity, but it has a few practical ramifications. For one thing, renaming a function, variable, or type can never break anything. Even causing a name collision won't cause problems — Unison tracks the underlying code by hash, so two items that are both named foo might be displayed to the user as foo#hash1 and foo#hash2, but the program would still compile and run without any problems. Another consequence is the ability to use different versions of the same library without issue — different versions of the same function have different hashes, so they can be treated just like different functions with the same name. This also means that the hash of a function encodes not only its code, but also its exact dependencies, which makes sharing code between computers much simpler.
Claiming that Unison code is immutable raises the question of how a function could actually be updated, once it has been written. Since Unison code is stored in a database, the language always knows exactly which code references a particular function. If an edit to a function does not change the type signature, the language can automatically produce a new version of each function that depends on the changed function. The old versions are not removed, but the names of any functions are updated to point to the new ones. This makes it possible to write behavior tests that compare one implementation to another, by referring to the old version of a function, for example.
If the changes to a function do not preserve its type, Unison uses the same knowledge to produce a "to do" list for the programmer, which it will track and automatically remove items from as conflicts are solved. Since the old code is still present in the database, there is never a moment where the code is "broken" by a change. The old version still exists, and can be run, built, inspected, and so on, while the programmer works on the new version. Once the new version has been completed, the programmer can switch over to it all at once.
Abilities
Unison's unique approach to naming may handle dependencies, but as the pervasive use of containers shows, there is more to getting a program to run on many computers than just ensuring that dependencies are bundled with a program. Code does not just depend on library functions or types, but also on the state of the computer outside of the program. Unison can't solve that problem entirely, but it does have a solution to help manage the complexity of code that relies on interfacing with the outside world: abilities.
Abilities are a kind of effect system — a way to track in the type system what a given piece of code needs in order to run. The most general ability is called IO, and represents the ability to do arbitrary I/O, including reading and writing files, opening network connections, or reading information about the state of the computer. Programmers could write their programs with every function requiring the IO ability, but the more usual approach would be to consider which concrete things each part of the program will need to be able to do, and then declare smaller, more restricted abilities. Programs with custom abilities can be run by providing a "handler" function that describes how to implement the ability, usually in terms of another ability. For example, a programmer might provide a ReadEnvironment ability that lets a program fetch the value of environment variables. In normal use, a handler would translate that into the IO ability, but there can be multiple handlers for an ability, so a test suite might use a handler that supplies pre-defined test values instead.
Since abilities are tracked by the type system, it is impossible for a function to use an ability it has not declared. This means that the programmer can get a list of every interface with the outside world that a piece of code expects to use by looking at the type signature, and mock them for testing by specifying a different handler. Overall, abilities can make writing testable distributed programs much simpler, since everything is described in one flexible language. The guarantees of the type system also mean that it is theoretically possible to run untrusted code, and be sure that it only accesses abilities that the programmer gives it. In practice, Unison is still in development, and there may be some lurking holes in the security guarantee.
Funding
The founders of Unison Computing — Paul Chiusano, Rúnar Bjarnason, and Arya Irani — have enough faith in Unison's security properties to make them the basis for a cloud computing offering. Unison Cloud is a platform that allows running Unison programs that use a custom Cloud ability on managed hardware for a monthly fee. That money goes to Unison Computing, a public benefit corporation that employs the core Unison developers, to keep working on the language. The project does accept outside contributions, however, and the language itself will remain open source.
The Cloud ability has facilities for storing arbitrary Unison values to a typed database, handling HTTP(S) requests, deploying new services, and other operations necessary for a program running in the cloud. Since it is an ability like any other, the Unison Cloud library provides mock handlers that can test the entire process of deploying multiple services, running health checks and integration tests, and tearing down the resulting deployment locally.
Drawbacks
Unfortunately, Unison's unique design comes with its share of problems. For one thing, modern programs are often not written in just one language, but Unison's greatest benefits only come when an entire program is written in it. Unison doesn't even have a stable foreign-function interface (FFI) that could be used to wrap libraries written in other languages. Because of this, existing Unison programs need to reimplement a lot of functionality that is already present in other languages.
Unison Share is a cross between a package registry, a code forge, and a code browser. Since Unison code is not stored as text files, but rather as a database, the community can't really reuse existing tooling. Tools like Unison Share must be written from scratch. There is support for pushing code to a Git repository, but since it isn't human-readable, it can't really be viewed without either using the local Unison tools, or hosting an instance of Unison Share. The community is actively encouraging people to develop and post new libraries there, but there's a long way to go to catch up with other languages. Still, Unison's ideas around dependency management make using libraries that do exist quite easy — just pull them into your local database and start calling functions, with no worries about dependency conflicts or where to obtain the code.
That approach does prompt the question of how upgrades to libraries are handled. The process described above for updating dependent code when a function changes relies on all of the affected code being locally available for development. There are three partial answers to this question: small libraries, abilities, and patches. Since Unison makes it easy to seamlessly depend on a library, many of the existing libraries are quite small; it is easier to break out some small functionality into a separate library than it would be in another language. Smaller libraries require less frequent updates, and may even become completely finished. Larger libraries can present their interface as an ability. This makes upgrading to a newer version of the library as simple as changing to a newer version of the handler. Finally, for cases where neither of those approaches apply, Unison produces a special kind of value called a patch — really, a record of what changes the developer of a library made while developing the new version, including a mapping of which new functions were produced by editing old ones. Unison uses that information to do the same kind of upgrade as during local development.
Unison is in active development, not yet having reached a 1.0 release. So quite aside from throwing out the familiar text-based workflow, it also has the normal challenges that any language must face: performance problems, occasional bugs in the runtime, unstable interfaces, etc. Despite that, the documentation is already quite comprehensive, and the project has a policy of not breaking existing programs on upgrade. In fact, the standard library is managed using the same process as other libraries, so it is quite possible for a program to use different versions of the standard library internally without conflict.
Unison is not yet widely packaged, but downloads are available from the project's releases page. Running ucm, the Unison codebase manager, will set up a database for the user's code in ~/.unison and provide some quick-start guidance on starting a project in Unison.
It remains to be seen whether Unison will overcome the hurdles necessary to become a widely-used, productive language. Even if it does not, however, it at least illustrates that a different approach to software development is possible — one that builds collaboration with the computer directly into the language itself, and provides an alternative to the many text-based programming languages.
The GhostBSD in the machine
GhostBSD is a desktop-oriented operating system based on FreeBSD and the MATE Desktop Environment. The goal of the project is to lower the barrier to entry of using FreeBSD on a desktop or laptop system, and it largely succeeds at this. While it has a few rough edges that make it hard to recommend for the average desktop user, it is a fine choice for users who want a desktop with FreeBSD underpinnings such as the Z File System (ZFS), and the Ports (source) and Packages (binary) software collections.
GhostBSD has been haunting users for some time now. The first GhostBSD release (1.0) was announced in 2010, and was based on FreeBSD 8.0 and GNOME 2.28.2. The name is a portmanteau of "GNOME hosted by FreeBSD", even though the project switched to MATE in 2013. The project also offers an unofficial community spin of GhostBSD featuring the Xfce desktop. The most recent release, 24.04.1, was announced on May 20, and is based on FreeBSD 14.
Off to the races
Installation is almost dead simple. The installer walks users through setting up the preferred language, keyboard layout, time zone, disk partitioning, boot manager, then creates a user account with administrative privileges. Disk partitioning can be as simple as pointing the installer at the disk to use; in that case, GhostBSD will allocate a small partition for swap space and use the rest to create a ZFS pool with separate datasets for the root filesystem (/), /home, /usr, /usr/ports, /var, /var/log, and others. The installer also allows custom partitioning and the creation of complex setups with multiple pools and disks. Unfortunately, the installer does not provide a help menu or documentation to guide users with this. That is a recurring theme while using GhostBSD—its homegrown tools have no accompanying help menus or inline documentation.
The initial set of software for GhostBSD is minimal, but enough to get started. It provides the MATE defaults for file manager, text editor, terminal, PDF viewer, and so forth. Users also get Firefox, the VLC media player, Shotwell photo manager, and Evolution for email.
If the default software selection is not enough, GhostBSD provides an application called Software Station to manage software without having to use the pkg command-line utility. It is a no-frills application that acts as a front-end for pkg. It lets users search the more than 34,000 available packages, or browse packages by one of the many categories, and install or remove packages. Unlike more full-featured desktop software managers, like GNOME Software or KDE Discover, Software Station does not provide screenshots, links to the web sites for the software, or much else.
Software Station does not act as a front-end for GhostBSD's Ports. If users want to install packages from source using Ports instead of the precompiled packages (.pkg files), they need to install the ports, src, and os-generic-userland-devtools packages.
Everything under the sun
Generally, installing from source should not be necessary. What Software Station lacks in sophistication, it makes up for in software selection. Most of the desktop software I would reach for on a Linux desktop is available on GhostBSD, such as LibreOffice, the Strawberry media player, Claws Mail, GNU Emacs, NeoVim, GIMP, and quite a bit more. Users can choose different desktop environments and window managers as well, of course. A Google Chrome package is available, but it is the Linux version of Google Chrome released on February 6, which is badly outdated at this point: stable Chrome releases happen roughly every four weeks. A native version of Chromium, however, is available and is up-to-date.
Updates and system upgrades are handled by the Update Station utility, but it is not entirely clear what the security-update policy is. GhostBSD founder Eric Turgeon said in a post on the FreeBSD forum that the GhostBSD ports tree is synced from FreeBSD's ports tree every week or two, and packages are built from the ports tree "about every two weeks unless there are CVEs in the default set of packages". Turgeon adds that he tries to stay on top of CVEs "as much as possible". GhostBSD users would be wise to subscribe to the FreeBSD security advisory notifications mailing list mentioned on the FreeBSD Security page, or at least keep a close eye on the advisories page.
The GhostBSD folks seem to like "Station" naming: there is also a Backup Station for manually creating boot environment snapshots, and a Station Tweak utility for configuring some of the desktop options such as the panel layout and whether window title-bar buttons are on the left or right side. The Backup Station utility is somewhat misnamed; it only allows users to create boot-environment snapshots. It does not expose other ZFS snapshotting capabilities, which is a pity. Those features are, of course, available at the command line but lack discoverability. It only took about five minutes with the documentation to be able to create snapshots and try some of the ZFS rollback features, but deeper understanding and mastery will clearly take a bit longer.
GhostBSD uses fish (the "friendly interactive shell", covered here in 2020) as its default shell, and Berkeley vi instead of Vim as the default editor. The fish shell's autocomplete features seem interesting, though the time I spent with GhostBSD is not nearly sufficient to fully explore its features or overcome 20-plus years of using GNU Bash. Further exploration is in order. Vim 9.1.404 is included in the default install, though it seems to be targeted for removal at some point. Likewise, Git is included but Turgeon says that most operating systems do not pre-install Git and has marked it for removal in a future release.
Hardware support with GhostBSD is somewhat lacking. For example, it required a system reboot for GhostBSD to discover an HDMI monitor the first time. Simply plugging the HDMI cable in was not sufficient. Bluetooth is, theoretically, supported—but it is not enabled by default. The documentation on the GhostBSD wiki points to a six-year-old forum post with a lengthy set of instructions to try to configure Bluetooth hardware. These are not insurmountable hurdles, but they are things that one would expect to simply work for a desktop-oriented system.
Another problem was that the MATE desktop panel would sometimes crash and disappear when installing new software. This happened, for example, while installing Google Chrome. (My best guess is that something goes awry while adding new menu items for applications to the MATE application menu.) The panel can be resurrected with "mate-panel --replace &", but it does not inspire confidence when a core component crashes frequently.
Live and let ghost
GhostBSD seems like a good option for those who already prefer FreeBSD and want a distribution that's customized for the desktop, or for users who want to get a first taste of FreeBSD. It is not as polished or full-featured as mainstream Linux desktop distributions, but it is user-friendly enough for experienced Linux and BSD users.
Intrigued users can find MATE and Xfce ISOs on the download page. The project also maintains a development tracker that provides insight into the features to expect and bugs that should be fixed in releases coming soon. The next release (24.07.1) seems to be expected in July.
A capability set for user namespaces
User namespaces in Linux create an environment in which all privileges are granted, but their effect is contained within the namespace; they have become an important tool for the implementation of containers. They have also become a significant source of worries for people who do not like the increased attack surface they create for the kernel. Various attempts have been made to restrict that attack surface over the years; the latest is user namespace capabilities, posted by Jonathan Calmels.
The core idea behind user namespaces is that a user runs as root within them, while the namespace as a whole is still unprivileged in the system that hosts it. A root process within the namespace has access to many root-only operations that can be used to configure and run the environment within the namespace. By design, that access cannot harm the system outside of the namespace, but there is a catch: the root user within the namespace can make many system calls that would be unavailable to that user outside of the namespace. That exposes much more of the kernel API to unprivileged users, increasing the severity of any security-relevant bugs in that API. A number of exploitable vulnerabilities have predictably emerged from that exposure.
Fear of new vulnerabilities has caused some distributors to disable user namespaces entirely in the past. A security-module hook was added in 6.1 to control the ability to create user namespaces, despite objections from the user-namespace maintainer. Out-of-tree patches to control user namespaces also apparently exist. In a world where the kernel was bug-free, user namespaces would not be a problem; in the world we actually inhabit, they continue to worry security-oriented developers.
Capabilities
While Linux appears to follow the traditional model where the root account has all privileges and non-root accounts have none, internally the implementation is rather more complicated. Privileges are represented by capabilities, a set of bits describing the various operations that a task is allowed to perform. For example, CAP_CHOWN allows a process to change the ownership of any file in the system, CAP_BPF gives access to the BPF virtual machine, and CAP_SYS_ADMIN covers a horrifyingly long list of privileged operations.
In the traditional model, a process running as root has all capabilities available to it; in a Linux system, it is possible to restrict capabilities to a smaller set. Of course, the world is complex; rather than having one set of capabilities, a thread in Linux has five of them. As if that were not enough, those sets interact with three other sets that can be associated with executable files. The thread capability sets are:
- The effective set, which describes the capabilities that the thread can actually exercise at the moment.
- The permitted set, containing the capabilities that the thread is empowered to exercise. A thread can add a new capability to its effective set with the capset() system call, but only if that capability exists in the permitted set.
- The bounding set, which contains the list of capabilities a thread can obtain by any means. If a capability is missing from the bounding set, the thread cannot obtain that capability even if it runs a privileged program that would otherwise enable that capability.
- The ambient set contains a set of capabilities that will be retained if the thread calls execve() to run an unprivileged program. Capabilities are normally cleared by execve(); the ambient set allows a task to pass a subset of its capabilities through that call.
- The inheritable set defines the capabilities that can be passed through execve() to an executable file that has its own inheritable set. A capability must appear in both sets to be permitted after execve().
A look at the unprivileged editor process in which this article is being written (as seen in /proc/pid/status) shows:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
All of the sets are zero (indicating no capabilities set) with the exception of the bounding set, where all capabilities are allowed.
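The same information can be read directly from a program; the following sketch (an illustration, not from the article) queries the calling thread's sets with the raw capget() system call and checks a single capability in the bounding and ambient sets with prctl():
/* Illustrative sketch: dump the calling thread's capability sets. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/capability.h>

int main(void)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,               /* zero means the calling thread */
    };
    struct __user_cap_data_struct data[2] = {{ 0 }};

    /* Raw capget() syscall; libcap provides a friendlier wrapper. */
    if (syscall(SYS_capget, &hdr, data))
        perror("capget");
    printf("CapInh: %08x%08x\n", data[1].inheritable, data[0].inheritable);
    printf("CapPrm: %08x%08x\n", data[1].permitted, data[0].permitted);
    printf("CapEff: %08x%08x\n", data[1].effective, data[0].effective);

    /* The bounding and ambient sets are queried one capability at a time. */
    printf("CAP_NET_ADMIN in bounding set: %d\n",
           prctl(PR_CAPBSET_READ, CAP_NET_ADMIN, 0, 0, 0));
    printf("CAP_NET_ADMIN in ambient set:  %d\n",
           prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, CAP_NET_ADMIN, 0, 0));
    return 0;
}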
An executable file can have its own permitted and inheritable sets that cause it to run with additional privilege (like a restricted form of setuid program), along with a single "effective" bit that causes the permitted set to also be established as the effective set. As described above, capabilities in the file's inheritable set are only enabled if they also appear in the inheritable set of the thread executing the file with execve(). In general, the interactions between the sets can be complex; see the above-linked capabilities man page for all the details.
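File capabilities can be examined with the getcap utility or programmatically with libcap; this sketch (again an illustration, not from the article) prints a file's capability sets in libcap's text form:
/* Illustrative sketch: print a file's capabilities using libcap (link with -lcap). */
#include <stdio.h>
#include <sys/capability.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    cap_t caps = cap_get_file(argv[1]);
    if (!caps) {
        perror("cap_get_file");   /* no capabilities set, or an error */
        return 1;
    }

    /* cap_to_text() renders something like "cap_net_raw=ep". */
    char *text = cap_to_text(caps, NULL);
    printf("%s: %s\n", argv[1], text);

    cap_free(text);
    cap_free(caps);
    return 0;
}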
Yet another capability set
At its core, Calmels's patch set works by adding another capability set — the userns capability set — to the above pile. During a thread's normal operation, this capability set is not consulted by the kernel. The thread can change the capabilities in that set with a new prctl() call, but setting new capabilities there will only succeed if either those capabilities already exist in the thread's permitted set or the thread holds the CAP_SETPCAP capability. Additionally, the operation will also only succeed if the requested capabilities appear in the thread's bounding set.
The new capability set comes into play, though, when a thread creates a new user namespace. At that point, the effective, permitted, bounding, and userns capability sets within the namespace will all be set to the creator's userns capability set. If the creator's set reflects a reduced set of capabilities, then root within the namespace will no longer be all-powerful there. Any system calls that need the missing capabilities will become off-limits, thus reducing (or so it is hoped) the attack surface that the kernel presents within the namespace.
By default, the userns capability set contains the full list of capabilities, so no restrictions will be applied within user namespaces. This default preserves the existing behavior of user namespaces.
The patch series also creates a new sysctl knob (kernel.cap_userns_mask) that is applied to all userns capability sets. By default this mask contains all capabilities; if the system administrator removes some capabilities from it, then no user namespace created within the system can have that capability internally. Finally, and somewhat controversially, there is a set of changes to allow BPF Linux security modules (LSMs) to adjust all of the capability sets (including the userns set) for a thread.
Mixed reception
While there was no opposition to the idea of reducing capabilities within a user namespace, the mechanism implemented in this patch has not been universally popular. Casey Schaufler called the first version of the series "a bad idea", adding that the interaction between the various capability sets is already too complex for user-space developers to deal with. He suggested a mechanism built into user namespaces directly, or perhaps a clone() flag, instead. John Johansen suggested that perhaps restricting capabilities within user namespaces should be implemented within the security-module mechanism; this idea may have led to the LSM hook added in the second version.
That hook, though, did not gain favor from LSM maintainer Paul Moore, who worried about giving LSMs the ability to modify a thread's capabilities. He pointed out the potential for bad interactions between security modules, any of which might be using the capability sets to make access-control decisions. LSMs are currently restricted to modifying their own internal state, he said, and that situation should not change; modification of capability sets should only be done within the capability LSM. He summarized that "this patch is not acceptable at this point in time".
On the other hand, Serge Hallyn, the current maintainer of the kernel's capability subsystem, has been generally favorable to the idea, saying: "I'm a container developer, and I'm excited about it". He has provided Reviewed-by tags for most of the series, with the exception of the LSM hook; his suggestion is that the series should move forward with everything except that hook.
That seems like the most likely outcome for this patch set. The capability-based solution did not find universal acclaim, but it does not appear that anybody is so opposed to it that they will fight its inclusion. While most users will never notice this feature, container developers may well take advantage of it to ratchet down the level of privilege (and vulnerability exposure) given to containers, and distributors may find that it helps them to get over their fear of user namespaces in general.
Updates to pahole
Arnaldo Carvalho de Melo spoke at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit about his work on Poke-a-hole (pahole), a program that has expanded greatly over the years, but which was relevant to the BPF track because it produces BPF Type Format (BTF) information from DWARF debugging information. He covered some small changes to the program, and then went into detail about the new support for data-type profiling. His slides include several examples.
BTF gradually evolves alongside BPF. Over time, Carvalho has been adding options to pahole to cope with the changes, but those options only make pahole more difficult to use. It is sometimes difficult to know what flag or combination of flags should be used for any given invocation. So recently he has added a --btf_features flag that takes a comma-separated list of features, in order to centralize the different flags. Any unknown features are ignored, which could make emitting BTF using older versions of pahole less painful. During development, the --btf_features_strict flag can be used to produce warnings about misspelled feature names. The new approach has slightly simplified the Makefile that the kernel uses to generate BTF information by replacing conditional statements with a static set of flags:
--btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func
Another recent change is the introduction of reproducible builds. A different developer had sent in a patch that disabled parallel BTF encoding, because the output could differ between runs. Ensuring that the output of pahole is reproducible is important because the BTF information gets encoded into the kernel image — so un-reproducible BTF meant un-reproducible kernel images. Now Carvalho has added code to ensure that parallel encoding threads emit information in the same order every time. New reproducibility tests confirm that users can now have both parallel encoding and reproducible builds, with minimal performance overhead.
BPF has had a "call by BTF ID" mechanism for kfuncs for some time, but previously there has not actually been a way to see which kfuncs are available in a given kernel. Now, pahole emits declaration information for kfuncs (when the feature decl_tag_kfuncs is enabled), so interested code can iterate over all the declarations. Carvalho has also been working on changing how BTF handles kernel modules. Right now, the debugging information in kernel modules references items in the kernel by BTF ID, so that debug information for the whole kernel doesn't need to be shipped with each module. This would not be a problem, except that the numbering changes for each build of the kernel. Normally, this necessitates recompiling modules alongside the kernel. But with a bit of extra effort, the kernel modules can be built with ELF relocations in the generated BTF, so that they don't always need to be rebuilt. An implementation of that is almost done and ready to be merged, he said.
Data-type profiling
BTF already provides an easy way for performance-monitoring tools like perf to trace which lines of code correspond to particular instructions. That doesn't always tell the full story, though. Modern CPUs have aggressive caching, and fetching values from memory is a serious performance hit; it can make sense to analyze performance issues in terms of the data a computation is interacting with, instead of the code being run.
Carvalho demonstrated two tools: perf mem, which displays the time spent accessing memory broken down by the individual members of each structure, and perf c2c, which tracks false sharing and cache evictions. For these tools to work, there needs to be some way to connect a memory access not just to a line of code, but to the particular type of the value involved. The original version of perf mem used DWARF debugging information to make that connection. Now, BTF has enough information to be used for that purpose as well. perf still prefers DWARF tables when they are available, but it does use BTF to display information on kfuncs.
Carvalho went into some detail about how perf handles disassembling programs; in short, it uses the Capstone library, with a fallback to objdump when Capstone is unavailable. Integration with objdump also means that it supports all of the architectures that are supported by GNU Binutils.
Daniel Borkmann asked whether these changes were already available, and Carvalho said that they were. He is continuing to work on improvements, but the basic functionality is usable now. José Marchesi asked whether it would cause a problem if the kernel changed to generate BTF directly, instead of being compiled with DWARF information and then using pahole to translate. Pahole has a BTF loader, Carvalho explained, so it would not need to change anything about how the tool is used. At that point, the session ran out of time.
Rust for filesystems
At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Wedson Almeida Filho and Kent Overstreet led a combined storage and filesystem session on using Rust for Linux filesystems. Back in December 2023, Almeida had posted an RFC patch set with some Rust abstractions for filesystems, which resulted in some disagreement over the approach. On the same mid-May day as the session, he posted a second version of the RFC patches, which he wanted to discuss along with other Rust-related topics.
Goals
After updating attendees on the status of his patches, Almeida listed some of the goals of the Rust-for-Linux project, which are embodied in the filesystem abstractions that he is proposing. The first is to express more of the requirements using Rust's type system in order to catch more mistakes at compile time. In addition, the project's developers want to automate some tasks, such as cleaning up resources, in ways that are not easily available to C code. The overall idea is to have a more productive filesystem-development experience, with less time spent on debugging problems that the compiler could find, and with fewer memory-related vulnerabilities overall.
Overstreet said that he had been a part of too many two-week bug hunts and has been trying to find ways to avoid those kinds of problems for bcachefs. The Rust language provides a lot more than what he can do in C; it eliminates undefined behavior and provides facilities to see what is happening inside the code. "You can't debug, if you can't see what's going on." He believes that kernel development "will get a whole lot easier over the coming decades" due to using Rust. It will be possible to prove the correctness of code written in Rust, which will mean that bugs that can derail feature development will be much less common.
From his slides, Almeida showed an example of how the Rust type system can eliminate certain kinds of errors. He noted that the iget_locked() function in current kernels has a complicated set of requirements. Callers must check to see if the return value is null and, if it is not, then the contents of the returned struct inode need to be checked to see if it is a new or existing inode. If it is new, it needs to be initialized before it can be used; if that fails, iget_failed() needs to be called, he said.
There was some discussion of the finer points of what callers of iget_locked() need to do, with Al Viro disagreeing with some of what Almeida had on his slide. That went back and forth, with Overstreet observing that it was exactly that kind of discussion/argument that could be avoided by encapsulating the rules into the Rust types and abstractions; the compiler will know the right thing to do.
Overstreet noted that Christian Brauner and Alice Ryhl have helped to improve the abstractions a great deal since the first posting; in particular, there are things he has learned about reference counts based on how they are being handled by the Rust code. "This is going to make all our lives so much easier", Overstreet said.
Almeida put up a slide with the equivalent of iget_locked() in Rust, which was called get_or_create_inode(). The important part is the return type, he said; as with C, callers must check for failure, but the success case is much different. If it is successful, the caller either receives a regular reference-counted inode to use (which has its reference count automatically decremented when the inode object is no longer referenced) or it receives a new inode, which will automatically call the equivalent of iget_failed() if it is never initialized. If it is ever initialized (which can only be done once), it becomes a regular inode with the automatic reference-count decrement. All of that is enforced through the type system.
Viro seemed somewhat skeptical of how that would work in practice. He wondered where in the source code those constraints would be defined. Almeida said that the whole idea is to determine what the constraints are from Viro and other filesystem developers, then to create types and abstractions that can enforce them.
Disconnect
Dave Chinner asked about the disconnect between the names in the C API and the Rust API, which means that developers cannot look at the C code and know what the equivalent Rust call would be. He said that the same names should be used or it would all be completely unfamiliar to the existing development community. In addition, when the C code changes, the Rust code needs to follow along, but who is going to do that work? Almeida agreed that it was something that needs to be discussed.
As far as the renamed functions go, he is not opposed to switching the names to match the C API, but he does not think iget_locked() is a particularly good name. It might make sense to take the opportunity to create better names.
There was some more discussion of the example, with Viro saying that it was not a good choice because iget_locked() is a library function, rather than a member function of the superblock object. Almeida said that there was no reason get_or_create_inode() could not be turned into a library function; his example was simply meant to show how the constraints could be encoded in the types.
Brauner said that there needs to be a decision on whether the Rust abstractions are going to be general-purpose, intended for all kernel filesystems, or if they will only be focused on the functionality needed for the simpler filesystems that have been written in Rust. There is also a longer-term problem in handling situations where functions like get_or_create_inode() encode a lot more of the constraints than iget_locked() does. As the C code evolves, which will happen more quickly than with the Rust code, at least initially, there will be a need to keep the two APIs in sync.
It comes down to a question of whether refactoring and cleanup will be done as part of adding the Rust abstractions, Overstreet said; he strongly believes that is required. But there is more to it than just that, James Bottomley said. The object lifecycles are being encoded into the Rust API, but there is no equivalent of that in C; if someone changes the lifecycle of the object on one side, the other will have bugs.
There are also problems because the lifecycle of inode objects is sometimes filesystem-specific, Chinner said. Encoding a single lifecycle understanding into the API means that its functions will not work for some filesystems. Overstreet said that filesystems which are not using the VFS API would simply not benefit, but Chinner said that a VFS inode is just a structure and it is up to filesystems to manage its lifetime. Almeida said that the example would only be used by filesystems that currently call iget_locked() and could benefit. The Rust developers are not trying to force filesystems to change how they are doing things.
Allocating pain
Part of the problem, Ted Ts'o said, is that there is an effort to get "everyone to switch over to the religion" of Rust; that will not happen, he said, because there are 50+ different filesystems in Linux that will not be instantaneously converted. The C code will continue to be improved and if that breaks the Rust bindings, it will break the filesystems that depend on them. For the foreseeable future, the Rust bindings are a second-class citizen, he said; broken Rust bindings are a problem for the Rust-for-Linux developers and not the filesystem community at large.
He suggested that the development of the Rust bindings continue, while the C code continues to evolve. As those changes occur, "we will find out whether or not this concept of encoding huge amounts of semantics into the type system is a good thing or a bad thing". In a year or two, he thinks the answer to that will become clear; really, though, it will come down to a question of "where does the pain get allocated". In his mind, large-scale changes like this almost always come down to a "pain-allocation question".
Almeida said that he is not trying to keep the C API static; his goal is to get the filesystem developers to explain the semantics of the API so that they can be encoded into Rust. Bottomley said that as more of those semantics get encoded into the bindings, they will become more fragile from a synchronization standpoint. Several disagreed with that, in the form of a jumble of "no" replies and the like. Almeida said that it was the same with any user of an API; if the API changes, the users need to be updated. But Ts'o pointedly said that not everyone will learn Rust; if he makes a change, he will fix all of the affected C code, but, "because I don't know Rust, I am not going to fix the Rust bindings, sorry".
Viro came back to his objections about the proposed replacement for iget_locked(). The underlying problem that he sees is the reliance on methods versus functions; using methods is not the proper way forward because the arguments are not specified explicitly. But Overstreet said that the complaints about methods come from languages like C++ that rely too heavily on inheritance, which is "a crap idea". Rust does not do so; methods in Rust are largely just a syntactical element.
There was some discussion of what exactly is being encoded in the types. Jan Kara said that there is some behavior that goes with the inode, such as its reference count and its handling, but there is other behavior that is inherent in the iget_locked() function. Overstreet and Almeida said that those two pieces were both encoded into the types, but separately; other functions using the inode type could have return values with different properties.
Viro went through some of his reasoning about why inodes work the way they do in the VFS. He agreed with the idea of starting small to see where things lead. Overstreet suggested that maybe the example used was not a good starting point, "because this is the complicated case". "Oh, no it isn't", Viro replied to laughter as the session concluded.
Finishing the conversion to the "new" mount API
Eric Sandeen led a filesystem-track session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit on completing the conversion of the existing kernel filesystems to use the mount API that was added for the 5.2 kernel in 2019. That API is invariably called the "new" API, which it is when compared to the venerable mount() system call, but it has been available for five years or so at this point without really pushing its predecessor aside. Sandeen wanted to discuss the status of the conversion process and some other questions surrounding the new API.
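As a quick refresher, mounting with the new API is a multi-step affair driven from user space; the sequence below is a minimal, illustrative sketch using the fsopen(), fsconfig(), fsmount(), and move_mount() system calls, with the filesystem type, source device, and target path chosen arbitrarily and most error handling omitted.

```c
/* Illustrative sketch of the new mount API's syscall sequence. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

int main(void)
{
	/* Create a filesystem context for ext4. */
	int fsfd = syscall(SYS_fsopen, "ext4", FSOPEN_CLOEXEC);
	if (fsfd < 0) { perror("fsopen"); return 1; }

	/* Feed in the source device and options, one at a time. */
	syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
	syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);

	/* Ask the filesystem to create (or find) the superblock. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		perror("fsconfig");
		return 1;
	}

	/* Turn the context into a mount object and attach it at /mnt. */
	int mfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
	if (mfd < 0) { perror("fsmount"); return 1; }
	if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/mnt",
		    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
		perror("move_mount");
		return 1;
	}

	close(mfd);
	close(fsfd);
	return 0;
}
```

Splitting the operation up this way is what makes the per-option configuration and the logging discussed below possible; the old mount() system call had to take everything in one string and one set of flags.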
He began by saying that the session was "not really a rocket-science talk"; instead, it was more of a "let's get that thing that we said we were going to do, done" talk. The original idea was to finish the conversion to the new API, then deprecate and remove the internal API that is used by the old mount API. But, after an initial push, there were few conversions until the pace picked up somewhat during the last two releases.
Of the 56 or so kernel filesystems, around 30 still remain to be converted, Sandeen said, so he has been joking that the effort will be completed in 2026. Mount-API support for a batch of filesystems had just been merged during the 6.10 merge window, which happened during the conference. The two most prominent filesystems that still need to be converted are fat, which has patches floating around the list, and bcachefs, which he looked at briefly but did not tackle.
He encouraged the maintainers of any of the filesystems that still need conversion to "go for it"; the maintainers should have a better idea what mount support and options are needed for users. But, he noted, some of the kernel filesystems are abandoned. There may not be user-space tools or even a filesystem image to work with, he said, so whoever takes on the task of converting those is just going to have to do their best.
Logging
Another part of the API that he wanted to talk about was the message logging that filesystems can use to communicate warnings and errors during the mount process to user space. There are three functions (infof(), warnf(), and errorf()) that allow returning text strings to the callers of the API. When he started looking at converting filesystems, he first thought that printk() calls should be changed to use those logging functions, but he has since changed his mind because there are "different audiences" for those messages.
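To give a flavor of how those helpers might be used, here is a hedged sketch of an option-parsing callback for an imaginary "myfs" filesystem; a real conversion would normally go through fs_parse() with a parameter table rather than open-coded string comparisons.

```c
/*
 * Illustrative only: "myfs" and its options are invented. The infof(),
 * warnf(), errorf(), and invalf() helpers attach messages to the
 * fs_context so that user space can read them back.
 */
#include <linux/fs_context.h>
#include <linux/string.h>

static int myfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
	if (strcmp(param->key, "debug") == 0) {
		infof(fc, "myfs: debug output enabled");
		return 0;
	}
	if (strcmp(param->key, "oldopt") == 0) {
		warnf(fc, "myfs: 'oldopt' is deprecated and ignored");
		return 0;
	}
	/* invalf() logs via errorf() and returns -EINVAL. */
	return invalf(fc, "myfs: unknown mount option '%s'", param->key);
}
```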
He asked David Howells, who developed the new mount API, to describe the original intent for the logging functionality. Howells said that there were two main purposes; first it provides a channel to report what went wrong during a mount operation, which is especially useful when the user-space process cannot access dmesg. It also provides a way for filesystems to ask questions, such as for passwords.
Amir Goldstein said that there is more to it than just access to the dmesg log, because there is no way to know if the user-space tools will actually print any messages logged using the new API. Christian Brauner said that the util-linux tools have added support for these messages, but Goldstein pointed out that the kernel still has no way to know that the user will see them. Brauner agreed that they should still be sent to dmesg, but he also noted that information sent to the kernel logs can become part of the user-space API of the kernel, so there is a need for caution.
Ted Ts'o said that part of the problem is that random user-space programs are scraping the dmesg information via the log files or perhaps the serial console. If the information that gets sent to dmesg changes, "there may be some random system-administration script that gets cranky". Those tools are arguably wrong to do so, he said, but users will complain to the filesystem developers if it happens.
The mount API puts log messages into a struct fc_log context, Brauner said. User space can then read the data that gets logged by using the file descriptor returned from an fsopen() system call. Currently, just the errors (or maybe errors and warnings, no one seemed to be sure) from that stream are written to dmesg.
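From the user-space side, draining that log might look something like the sketch below, which assumes the documented format where each read() on the context file descriptor returns one message prefixed with "e ", "w ", or "i " for errors, warnings, and informational messages.

```c
/*
 * Hedged sketch: drain the messages that the kernel attached to an
 * fsopen() context, for example after a failed fsconfig() or fsmount().
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void drain_mount_log(int fsfd)
{
	char buf[4096];
	ssize_t n;

	/* Each read() returns one logged message until the buffer is empty. */
	while ((n = read(fsfd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		if (buf[n - 1] == '\n')
			buf[n - 1] = '\0';	/* strip any trailing newline */
		if (strncmp(buf, "e ", 2) == 0)
			fprintf(stderr, "mount error: %s\n", buf + 2);
		else
			fprintf(stderr, "mount log: %s\n", buf + 2);
	}
}
```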
Sandeen said that whatever is going to dmesg today should continue to go there, in order to avoid complaints. Another reason not to change existing behavior, Brauner said, is that systemd uses a known-bad mount option to probe the kernel to see how it can find out about illegal options; to a monitoring tool, those could look like a continuing stream of errors to be reported. Similarly, Goldstein said that overlayfs has a lot of different fallbacks that it tries that could be misinterpreted if they are handled differently.
Ts'o suggested that the conversation get more concrete. He wanted to try to define the log level that would be used for invalid mount options in dmesg, rather than try to guess whether programs will complain or get confused. "When we say 'we can't log to dmesg', that may not be true" because it depends on the log level of the message. Existing practice will have to be accommodated, but filesystem developers need to define the goal for these messages.
Dave Chinner said that existing filesystems already have their own mechanisms for reporting things like invalid mount options, which cannot really change, so he suggested not changing what is sent to dmesg at all. User-space programs already need to access that information and can continue to do so. Goldstein said that overlayfs had no mechanism of its own, so it uses dmesg to report problems; now there is a better way to do that reporting, but there is no way to know whether user space will actually print those messages so that users can find them. Brauner said that the lack of documentation has hampered the adoption of the mount logging that came with the new API; since developers do not know about it and cannot find out much about it, there is no real discussion or agreement on how to use it.
Steve French noted that he had experienced similar problems in the early 2000s when he was working on SMB; "I needed to be able to tell user space things and all I had were like ten return codes that were valid". There were thousands of different things that might have gone wrong, he said. Jeff Layton agreed with that, "an integer return is not expressive enough", which is the rationale for the mount-logging feature.
Another problem that occurs, Layton said, is that on a busy system there may be lots of mounting going on, which makes it difficult to determine the correspondence between messages and mount operations. Brauner said that the mount logging API has a prefix that can be associated with each message. While it is true that existing reporting mechanisms need to be maintained, they are not consistent between filesystems. "How beautiful would it be", he asked, if all of the error messages from VFS had a "vfs:" prefix—likewise for "bcachefs:" and the other filesystems.
Kent Overstreet said that bcachefs has a system of error codes that allows developers to track down exactly where in the code a problem occurred, which has been extremely useful. Chinner said that XFS only reports certain kinds of information to user space, while reporting the details and exact location in the code when filesystem corruption is found, for example, to dmesg. The user-space report just tells the user to unmount and run fsck; the details should not be sent to user space, and the two types of information should be kept separate, he said.
Unknown options
Sandeen said that, when he went to add support for the mount API to tracefs and debugfs, he encountered a comment about ignoring unknown mount options. But, as filesystems have been converted, most of them have come to reject unknown mount options regardless of their earlier behavior. One exception is NFS, which has a sloppy mount option that means unknown options should not cause an error. He wondered if there was a need to maintain the previous behavior when converting.
The sloppy option is important for network filesystems, French said, because new options may get added but not be available on every server. Sandeen agreed, but noted that sloppy has gotten even more complicated because it is positional; it needs to be specified before any potentially unknown option. Layton said that it is also needed for automount maps; in the past, there were sites that were administering both Solaris and Linux systems using the same maps, but the mount options did not line up between the two.
Brauner said that switching to using the new mount API is a conscious decision, so changes to the behavior for things like unknown options would be acceptable. But, eventually, the plan is for the existing mount command to use the new API, which will need to preserve the existing behavior. So there will need to be a way to do that.
"Remount is even worse", Brauner said. Most filesystems will accept any options for remount then silently ignore those that they do not care about, Sandeen said. An attendee asked if the kernel should care; "should we tell user space 'you can't remount that thing'?" While he could not think of an example, Brauner said that he could imagine a security-sensitive mount option being passed to remount; in that case, user space would want the operation to fail if the option could not be handled.
The (re)mount-option handling has been inconsistent and broken forever, Brauner said, but there is a need to be able to express the intent of a given mount operation with respect to unknown-option handling. That will be needed for users of the new mount API and, eventually, for mount itself. There are other cases that are messy as well. The new API allows options to be specified one at a time, but if a new option conflicts with an earlier one, there is no way to say which of the earlier ones caused the conflict.
XFS has a table that tracks all of the possible options and which of them cannot be used together, so that conflicts can be tied together in error messages, Chinner said. Ts'o said that ext4 has something similar. Chinner did not think it was important to handle that problem in a generic way, since it is a corner case that does not arise frequently. While Howells said that handling conflicting options was part of his initial proposal for the API, Al Viro had him remove it. As time ran out on the session, Lennart Poettering agreed with Chinner that user-space tools likely did not really want to handle that level of detail.
Page editor: Jonathan Corbet