
LWN.net Weekly Edition for May 9, 2024

Welcome to the LWN.net Weekly Edition for May 9, 2024

This edition contains the following feature content:

  • Securing Git repositories with gittuf: adding a security and policy layer on top of Git.
  • Systemd heads for a big round-number release: a look at what is coming in systemd 256.
  • Modernizing accessibility for desktop Linux: the Newton project for GNOME.
  • Inheritable credentials for directory file descriptors: a proposed openat2() extension.
  • The file_operations structure gets smaller: retiring read() and write() after 32 years.
  • A proposal to switch Fedora Workstation's desktop: KDE Plasma as the default?

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Securing Git repositories with gittuf

By Joe Brockmeier
May 8, 2024

OSSNA

The so-called software supply chain starts with source code. But most security measures and tooling don't kick in until source is turned into an artifact—a source tarball, binary build, container image, or other method of delivering a release to users. The gittuf project is an attempt to provide a security layer for Git that can handle key management, enforce security policies for repositories, and guard against attacks at the version-control layer. At Open Source Summit North America (OSSNA), Aditya Sirish A Yelgundhalli and Billy Lynch presented an introduction to gittuf with an overview of its goals and status.

[Billy Lynch]

Lynch began the talk with an overview of the state of the world for security tools. He observed that there are many tools used to generate provenance and attestation documents for artifacts, as well as tools to address threats against software at run time. "But we don't often talk about the very first part of this. How do we start securing our source code and source repositories [...] where everything originates?"

More and more frequently, he said, organizations are pulling source directly into continuous-integration / continuous-delivery (CI/CD) pipelines without the kind of protections we have for, say, container images or software packages. But a compromise in the repository itself could snowball into a compromise of everything built from the source code. "So it's equally important to protect and [use] all the same protection mechanisms" for repositories.

Different projects will need different policies, so there needs to be a way to define policies for common scenarios. For example, Lynch said that projects may want to enforce a policy that requires code to be reviewed and pass CI before being merged into a main branch. Applying a tag to the repository might require more stringent policies, since they often correspond with releases and things people are most likely to use. For organizations that have a monorepo, it might be necessary to have a folder-level policy to ensure that certain teams or people review and approve changes to specific parts of the repository.

Current state

Git, today, does not provide these features. Lynch pointed out that Git has "some amount of integrity checks" as well as commit and tag signing, but it is "fairly simple" overall. There are many other operations on Git repositories that need validation beyond what Git provides on its own.

The various forges may layer additional features on top of Git, including security features such as protected branches. But these, he said, "are forge-specific features and not actually part of the repository". Even if a project or organization uses one of the forges exclusively, the question becomes "is that enough?"

Especially when it comes to security metadata you know it's very nice to be able to say "hey, if we have a commit we can associate that with a pull request but how do we actually know that that pull request has gone through all the checks?" And, more importantly, how do we know six months down the line, a year down the line, how can we look back and verify and ensure that all of those checks have happened in the past? Sometimes that can be very difficult.

Even when a forge like GitHub provides more information about how teams have interacted with a repository, that information may not be visible to people outside the organization. Ideally, he said, "we want to get to a state where anyone can verify this metadata". Perhaps just as importantly, how do users verify that policies were enforced for a repository without taking a forge's word for it? This led to thinking about "what sort of security properties do we care about, or might people care about, when consuming Git repositories?"

Goals

Next Lynch moved into a discussion of security goals that have guided the project. First, verification that policies have been followed should be possible by any party, and not just members of an organization or users of specific Git forges. It should also be possible to verify the state of the repository at any point in its development. "It shouldn't just be what is latest, we should be able to go back even to the first commit, ideally."

Signing-key distribution is another hard problem that the project hopes to solve. How do users make sure they're getting the right keys, or which keys they should be checking against? On top of that, "how do we rotate and revoke keys?"

Of course, a security tool needs to be flexible and must guard against insider threats. Lynch said that feature branches might have more lax policies, while main and release branches have stricter policies that require multi-user signoffs to account for possible insider attacks. "We want to make sure that you know whenever a security policy changes, multiple people have to sign off on it. Even if one account is compromised, it's harder to compromise the entire repository."

Finally, any tool that works on top of Git is going to need to provide backward compatibility. "If somebody is using more security tools on top of an existing repo, that shouldn't break the workflow of everyone else using that repo." It has to be possible to adopt stricter security policies "in an incremental way, without having to completely annihilate the history of the repo and all the metadata" prior to using the new tool. All of those goals, Lynch said, brought them to the idea behind gittuf. Here, he handed the talk over to Yelgundhalli, to talk about the project and the implementation so far.

Gittuf

Yelgundhalli said that gittuf takes concepts from The Update Framework (TUF), a Cloud Native Computing Foundation (CNCF) project that provides a specification for securing software-update systems. "It gets a lot of things right in [the] context of handling key distribution, rotation, and revocation" as well as providing a model for delegating trust from one user to another.

[Aditya Sirish A Yelgundhalli]

Another important concept for gittuf is the reference state log (RSL). This idea comes from a 2016 paper on preventing Git metadata tampering. The RSL is similar to the Git reflog, but "actually embedded in the repository" and "authenticated using signatures on each individual entry". These are implemented in gittuf using a custom namespace of Git refs under refs/gittuf. The RSL is stored under refs/gittuf/reference-state-log and the policy metadata is stored under refs/gittuf/policy. Every time a reference's state changes in Git, new entries are added to these logs. This means that the RSL records not only "the main branch went from commit A to commit B" but also when owners of a repository update policies.

Git servers need not be aware of gittuf, since its metadata is stored under the custom namespace. If the server is gittuf-enabled, then it can perform verification on a change when it is pushed to the server. "This is great because now you have the ability to reject changes from making their way to other clients" even if those clients aren't using gittuf. But if they are, "they get the changes on the branch as well as the signed statement" of changes that they can verify.

Gittuf is using the in-toto project to provide attestations, which allow a project to make claims about its software. For example, he said that Git only allows a commit to be "meaningfully" signed by a single person, but using in-toto it is possible to record multiple signatures. It would also be useful for answering other questions like "did this test run, and pass" or other policy questions that are not covered by Git itself or even Git forges. The project is also considering how it can authenticate users who are not using gittuf, so that they are still able to record evidence that they have authenticated.

Ultimately, Yelgundhalli said that step one in a project's or organization's security policy could be to require source code to have an attestation from gittuf to verify that it has followed policies and is suitable to be "plugged into other parts of our supply chain". As a CI pipeline receives changes to source code, it could inspect each change to ensure that it meets policy.

With the basics of gittuf covered, Yelgundhalli moved on to a short demo of the tool. This included using gittuf to list policy rules for the main branch of a repository that requires two authorized signatures to push to the branch, and demonstrating what happens when a commit meets or violates that policy.

As expected, if a commit has the two required signatures, gittuf reports success. If not, the gittuf command-line tool will emit an error. The demo also illustrated that using gittuf adds several steps to the process to record and commit the reference log. For those who would like to follow along at home, the project provides a demo repository to test out gittuf.

What's next

After the demo, Lynch took the lead again to talk about the status of gittuf and its roadmap. Currently, it is considered to be in alpha status. The project has "only just joined" the OpenSSF as a sandbox project in the Supply Chain Integrity Working Group. So far, the project has been focused on functionality, but making it easier to use is on the agenda as well:

There's all these multiple refs that we have to worry about, ideally that should just be one command, and ideally that should have command compatibility with existing Git commands so you don't really have to think about it.

He said that the project is looking at "things like repository hooks" to help automate gittuf operations "so you can just use your normal Git workflow". Another feature that Lynch mentioned for the future is signed pushes, but he said that "is not the top priority" because it is not yet supported by any of the Git forges. The project also has a roadmap that mentions a number of interesting features planned for gittuf. This includes adding support for roles and teams, so that policies can require things like "a change must be signed off by two members of a development team and one member of a security team". And, of course, the project should "dogfood" itself by using gittuf to protect the project's source code, thus demonstrating its viability for use with other projects.

During the Q&A an attendee asked about Kubernetes CI. The audience member said that CI for the project cost "between $100,000 and $200,000 a month"; they wondered whether gittuf could allow developers to run CI tests locally and just submit proof that they had been run to save on costs. Yelgundhalli said that it is "something we've been talking about but it's early days". Actually collecting proof that CI jobs had run would involve a lot of moving parts, but it would be possible to allow developers to attest that they had run the jobs locally if a project is willing to extend trust that far.

Even though gittuf is not yet ready for prime time, the tool and thinking behind it show a lot of promise. If the project can build out the planned functionality and improve usability, it may well find its way into securing source code for many projects and organizations. (The video of the talk is available on the Linux Foundation's YouTube channel.)

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event.]

Comments (14 posted)

Systemd heads for a big round-number release

By Daroc Alden
May 7, 2024

The systemd project is preparing for a new release. Version 256-rc1 was released on April 25 with a large number of changes and new features. Most of the changes relate to security, easier configuration, unprivileged access to system resources, or all three of these. Users of systemd will find setting up containers — even without root access — much simpler and more secure.

Lennart Poettering chose to experiment with a new format for announcing features this year: posting a series of Mastodon threads that cover features that he's excited about in more detail. Poettering said that he found it easier to get ideas out on Mastodon than in a more official venue, and invited anyone who wished to consolidate his thoughts as a long-form article to do so. One thread — on systemd's new run0 tool — has already generated substantial commentary.

The first thread describes the new way that systemd finds configuration files. Currently, many tools, systemd included, support reading multiple configuration files from a directory (whose name typically ends in .d) and combining them to produce the final configuration. As Poettering points out in his thread, this approach is useful for package managers, because it lets individual packages add to the configuration while keeping those contributions separate.

There are some situations where it's less important to have files from many packages than from different versions of the same package: for example, a container runtime needing to deal with versioned images. Ideally, existing containers could continue using an older version, while new containers would seamlessly use the newest version. Systemd now supports this use case by reading files from a directory whose name ends in ".v". When a systemd tool goes looking for a particular file — example.ext, for example — it will now accept a directory called example.ext.v/ with files example_[version].ext inside. Of the available files, the tool will pick the one with the highest semantic version number.
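As a purely hypothetical illustration (the file names here are invented), a tool looking for example.ext would also accept a directory laid out like this and would pick the entry with the highest version:

    example.ext.v/
        example_1.0.2.ext
        example_1.1.0.ext
        example_2.0.1.ext    <- highest semantic version; this file is used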

The rest of the changes Poettering has chosen to highlight are a bit larger. Systemd has had support for encrypted credentials for some time. In systemd terms, a credential is a named blob that an application may interpret however it likes. Credentials are locked to a computer's trusted platform module (TPM), or stored on an encrypted disk if no TPM is available. These credentials have only been usable by system services, however, not by per-user services. Poettering shared that systemd version 256 would support making credentials available to user services. This is useful in its own right, but other improvements make this feature more useful than it might initially appear.

The release also includes support for working with discoverable disk images (DDIs) in an unprivileged context. DDIs are disk images with embedded metadata that systemd uses for various purposes. DDIs are often used as filesystem images for systemd-nspawn containers. Letting unprivileged users work with DDIs was the last step required to permit unprivileged systemd-nspawn containers.

Finally, systemd also supports configuring some settings by adding encrypted credentials — even if these things are not traditional "credentials", but rather just a useful way to pass configuration parameters into a service using an interface that already existed. For example, systemd-firstboot looks for a credential called firstboot.locale and uses its value as the system's locale. On a physical computer or a virtual machine, those credentials can be passed in via the BIOS or UEFI ESP. In a container, they can be passed in via a mount under /run/host. The number of settings that can be configured this way has been greatly expanded in the new release:

Thus, a regular systemd system will now allow you to configure via credentials: keymap, locale, timezone, issue file, motd file, hosts file, .link files, .network files, .netdev files, DNS servers, DNS search domains, root passwords, root shell, SSH key of root, additional SSH address/port to listen on, sysuser.d/ additions, tmpfiles.d/ additions, sysctl.d/ additions, fstab additions, console font, additional TTYs to spawn gettys on, socket to forward journal data to, socket for sd_notify() messages from the system, machine ID, hostname, systemd-homed users to create, cryptsetup passwords and pins, additional unit files and drop-ins for unit files, udev rules, and more.

The combination of these features means that it is now possible for an unprivileged user to configure their own systemd-nspawn containers — or even entire hierarchies of such containers — using encrypted credentials that are protected from other users on the host system.

That isn't the only feature designed to make interacting with containers or virtual machines more pleasant, however. Many readers may be aware of the sd_notify() protocol that systemd uses to get information from system services about their status. Less well-publicized is the fact that systemd actually sends sd_notify() messages to whatever started it. This is useful for running systemd under another init system, but it also means that systemd can signal the host of a container this way. Since version 253, systemd has also supported the AF_VSOCK option for sending sd_notify() messages, letting it send messages to the virtual machine manager responsible for more traditional virtual machines.
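The notification protocol itself is simple; as background (this example is not from the article), a process sends newline-separated key=value assignments as a datagram to the socket named in the NOTIFY_SOCKET environment variable, most conveniently through libsystemd's sd_notify() call:

    /* Minimal sketch of a service speaking the sd_notify() protocol.
     * Build with: gcc notify.c $(pkg-config --cflags --libs libsystemd) */
    #include <systemd/sd-daemon.h>
    #include <unistd.h>

    int main(void)
    {
        /* ... perform whatever initialization the service needs ... */

        /* Tell whatever started us (systemd, a container manager, or a
         * virtual-machine manager over AF_VSOCK) that we are ready. */
        sd_notify(0, "READY=1\nSTATUS=Up and serving requests");

        pause();        /* stands in for the service's main loop */
        return 0;
    }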

Version 256 adds a new message that systemd will send when a given target is fully activated: X_SYSTEMD_UNIT_ACTIVE=[unit name]. Poettering calls this "both a progress notification and a feature notification". One example use is letting the host system of a virtual machine know when the SSH socket (which systemd sets up before starting SSH, and then hands over when the service is up in socket-activated configurations) is bound, and therefore it can connect without errors or retries. Other uses include discovering what services are running on a virtual machine, or providing a more granular view of how far into starting up the machine is.

Another feature that existed previously in a smaller form, but which is now available to the whole system, is a configuration option called ProtectSystem. Services with this option run in a separate mount namespace where important system directories — particularly /usr — are mounted read-only. Since few programs need to write to /usr, this is a fairly seamless way to make the system more secure.

With version 256, this option can now be applied to the entire system instead of on a service-by-service basis. While this is not practical for most systems, since tools like package managers do still need to write to /usr on occasion, there is one place where enabling the option by default makes sense: the system's initial ramdisk.

When a Linux system starts up, it begins by creating a temporary, in-memory filesystem and unpacking the initial ramdisk into it. Then it starts the init process from that filesystem, and leaves the task of actually setting up all the expected filesystem mounts and so on to user space. Often, this setup involves talking to the network, receiving encryption secrets to unlock the hard disk, or both. Exposing trusted code to the network is always risky, but the code to handle both of those things can also write to the temporary filesystem, opening an even larger attack surface. With the new version, however, ProtectSystem becomes the default for systemd in the ramdisk, causing it to remount the temporary filesystem as read-only before proceeding with the rest of the boot. Early tests revealed few problems with this change, Poettering said. The only distribution to have a serious problem with it was Fedora; dracut (the tool Fedora uses to create an initial ramdisk) had problems writing hook files with the new protection in place, but that has since been fixed.

The final feature that Poettering has discussed at the time of writing (although more threads seem sure to follow) is a quality-of-life improvement for users of systemd-homed — a service that encrypts users' home directories until they log in. Unfortunately, encrypted home directories don't work with SSH because it doesn't include a mechanism to ask for encryption secrets before trying to start a shell (systemd-homed loads SSH authorized keys from outside the home directory, so that is not a barrier to SSH logins). Currently, users must log in locally at least once (in order to be prompted to unlock their home directory) in order for SSH logins to work correctly. With the new update, systemd has added a shim that will intercept SSH logins for a user with an encrypted home directory and prompt them to enter encryption credentials over the network.

New systemd versions don't just bring new features, however. They also bring the deprecation of old features. In this case, the most noticeable deprecation is that systemd is finally dropping support for version 1 control groups (cgroups) in favor of the newer version 2 cgroups. A system that boots with version 1 control groups will cause systemd to fail loudly with an error, although version 1 cgroups can still be turned on with an option on the kernel's command line, for now.

There are other, less notable additions and deprecations with the release as well, including changes to nscd caching, configuration file locations, and many others. Interested readers can find the full list in the project's NEWS file. Systemd releases usually have three or four release candidates approximately a week apart, so it is reasonable to expect that systemd version 256 will be fully released in approximately a month, and make its way into distributions from there.

Comments (32 posted)

Modernizing accessibility for desktop Linux

By Joe Brockmeier
May 6, 2024

OSSNA

In some aspects, such as in gaming, the Linux desktop has made enormous strides in the past few years. In others, such as accessibility, things have stagnated. At Open Source Summit North America (OSSNA), Matt Campbell spoke about the need for, and an approach to, modernizing accessibility for desktop Linux. This included a discussion of Newton, a fledgling project that may greatly improve accessibility on the Linux desktop.

Campbell has a long history with accessibility. As he wrote in a post on the GNOME Accessibility blog, he has been working on accessibility tools for more than 20 years and is visually impaired himself. He is the lead developer of AccessKit, a project written in Rust that's designed to allow developers to implement accessibility features once in their application and have them work cross-platform with Windows, macOS, and Linux.

Overview

He began the talk with a quick overview of accessibility and assistive technologies (ATs) for the audience. When talking about accessibility, Campbell said, "we're talking about making applications accessible to disabled people who depend on assistive technologies" such as screen readers and alternative input methods like speech recognition.

[Matt Campbell]

Assistive technologies on Linux desktops, like GNOME's Orca screen reader, communicate with applications through interprocess communications (IPC). Usually this works via an accessibility API such as the Assistive Technology Service Provider Interface (AT-SPI) for open-source systems. AT-SPI was first implemented by Sun for GNOME on top of its CORBA object-request broker, and eventually ported to D-Bus in 2008.

The key concept of AT-SPI, he said, is the accessibility tree—similar to the document object model (DOM) for HTML. The tree contains a hierarchy of nodes, beginning with the application window at the root, then layout containers such as GTK's HBox, then text labels and controls as leaf nodes in the structure. When something happens, such as a change in keyboard focus or text selection, the user-interface toolkit will emit events to AT-SPI that will then be passed on to the assistive technology in use.

There are a number of problems today with AT-SPI according to Campbell, such as "the rise of Wayland and security sandboxing such as Flatpak". These have severed the direct connection between the accessibility tree and the windowing system, so applications like Orca can't verify that an event is actually coming from the application that has focus.

A deeper problem, Campbell said, is "chatty IPC". As implemented today, a screen reader "doesn't immediately have all the information that it needs locally, so it has to keep going back and forth doing multiple IPC round trips". This may lead to latency and can cause a screen reader to be unresponsive in situations where the application would be fine for sighted users.

This approach to IPC in accessibility protocols leads to one cause of unequal access for blind users compared to sighted users [...] it's as if whenever the application was busy the screen went blank before the application started responding to events on that main thread again. But, no, that's not what happens—a sighted person can continue to look at whatever was last drawn on the screen and make their decision about what they're going to do next when the application is ready for them again.

Introducing Newton

Having to work around the IPC limits what features are available for assistive technologies. That brought him to introducing Newton, a Wayland-native accessibility architecture that he is working on as a contractor for the GNOME Foundation. (A high-level overview is available on GNOME's GitLab instance, though it does not mention the name Newton.) The moniker comes from Wayland's convention of choosing names from locations in New England. But Newton is not just a random choice from cities in Massachusetts, it is the home of the Carroll Center for the Blind.

He outlined three high-level design goals for Newton. First, it should have no resource impact when assistive technologies are inactive. Second, it should make the compositor "the final source of truth" rather than trusting applications to provide updates. Finally, it should shift complexity from applications or user-interface toolkits to assistive technologies and Newton's client libraries.

It's the accessibility developers [who are] most invested in getting it right, so we want to keep things as simple as we can for the developers of the applications and the toolkits and the compositors, and make it easy for them to give us accessibility developers what we need to do our job.

Campbell said that Newton builds on top of AccessKit. "It's as if I designed AccessKit from the start with something like Newton in mind, and that's because I did."

The project is being designed from the beginning to work with sandboxed applications, and to use a push architecture rather than AT-SPI's pull model. An application will push updates to the compositor and then to the assistive technology. "When a user issues a command to the screen reader or other assistive technology, then all the information the AT needs to respond to that command by looking at what is on the screen is already there." This, Campbell said, will be more resilient to hung or busy applications. It is also a proven approach. He pointed out that the Firefox and Chromium browsers have already implemented this model for their internal accessibility architecture. The accessibility APIs are implemented in the main browser process rather than the sandboxed processes that render each page. The renderer processes push an accessibility tree to the main process, which then caches the tree and responds to assistive technology clients.

Current status

Right now, the protocols for Newton are not yet finalized but the project is far enough along that Campbell has prototype implementations for AccessKit, Orca, and GNOME's Mutter display server. He said that he is currently working on integrating AccessKit into GTK, which would make GTK applications accessible on Windows and macOS for the first time.

The next step after completing GTK integration is to start testing with real-world applications and stress-testing them with the kinds of use cases that give existing technologies trouble. That will help identify where optimizations are needed. "We're not going to do premature optimizations, we're going to figure out where we actually need to optimize."

He said that the architecture also needs to be implemented within the GNOME shell itself, and that would involve "a bunch of review" with all of the involved stakeholders. Though his work is focused on GNOME, "I should note that I personally want to make sure that developers from other desktop environments are included". Once everything is finalized it would be time to make sure it is documented well enough that it could be implemented in other environments and "so this project doesn't have a bus factor of one".

"At this point, some of you might be thinking 'show me the code'", he said. The audience murmured its agreement. Rather than linking to all of the repositories, he provided links to the prototypes for Orca and GTK AccessKit integration. Campbell said these would be the best way to start exploring the stack.

If all goes well, Newton would not merely provide a better version of existing functionality, it would open up new possibilities. Campbell was running out of time, but he quickly described scenarios of allowing accessible remote-desktop sessions even when the remote machine had no assistive technologies running. He also said it might be possible to provide accessible screenshots and screencasts using Newton, because the accessibility trees could just be bundled with the image or pushed along with the screencast.

The conclusion, he said, was that the project could provide "the overhaul that I think that accessibility in free desktop environments has needed for a little while now". Even more, "we can advance the state-of-the-art not just compared to what we already have in free desktops like GNOME", but even compared to proprietary platforms.

He gave thanks to the Sovereign Tech Fund for funding his work through GNOME, and to the GNOME Foundation for coordinating the work.

There was not much time for questions, but I managed to sneak one in to ask about the timeline for this work to be available to users. Campbell said that he was unsure, but it was unlikely it would be ready in time for GNOME 47 later this year. It might be ready in time for GNOME 48, but "I can't make any promises". He pointed out that his current contract ends in June, and that he plans to make as much progress as possible before it ends. Beyond that, "we'll see what happens".

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event.]

Comments (6 posted)

Inheritable credentials for directory file descriptors

By Jonathan Corbet
May 2, 2024
In Unix-like systems, an open file descriptor carries the right to access the opened object in specific ways. As a general rule, that file descriptor does not enable access to any other objects. The recently merged BPF token feature runs counter to this practice by creating file descriptors that carry specific BPF-related access rights. A similar but different approach to capability-carrying file descriptors, in the form of directory file descriptors that include their own credentials, is currently under consideration in the kernel community.

Linux systems allow a process to open a directory with any of the variants of the open() system call. The resulting "directory file descriptor" can be used to read the contents of the directory; it is also useful, when passed to system calls like openat(), to specify the starting directory for the pathname lookup of the file to be opened. A privileged process can open a directory and give the file descriptor to a less-privileged process (or simply drop its own privileges), and that descriptor will continue to be usable to access the directory, even if the owning process would otherwise be unable to do so.

That access does not, however, extend to any files contained within that directory.

Stas Sergeev recently proposed a change to that situation in the form of a new flag (OA2_INHERIT_CRED) for the openat2() system call. If a process uses that flag while opening a file, and that process provides a directory file descriptor, the file will be opened using the credentials that were in effect when the directory was opened. So, if a privileged process created the directory file descriptor, any other process owning that descriptor could open files in the reference directory using the privileged process's user and group IDs.

In other words, when this flag is used, a directory file descriptor grants more than just access to the directory itself; it also provides credentials to access files within the directory. This feature can be used, according to Sergeev, to implement a sort of lightweight sandboxing mechanism to restrict a process (or a container) to a specific directory tree. Such restrictions can be implemented now, but they are rather more cumbersome to set up.

Andy Lutomirski said that he liked the idea; "it's a sort of move toward a capability system". He added, though, that turning a directory file descriptor into this sort of capability should require an explicit act — it should not just happen by default. Not every process providing a directory file descriptor to another will want to hand over its rights to access objects in the directory as well. He also worried about potential mischief resulting from directory file descriptors opened in special filesystems like /proc.

As a result of these comments, a number of changes had been made by the time that the patch series got to version 6. To be usable with the (renamed) OA2_CRED_INHERIT flag, a directory file descriptor must have been opened with the new O_CRED_ALLOW flag. An attempt to use the OA2_CRED_INHERIT flag on a directory file descriptor created without O_CRED_ALLOW will just result in an EPERM error. The kernel will also reject OA2_CRED_INHERIT opens that involve /proc or symbolic links that lead out of the directory. Any file descriptors opened using OA2_CRED_INHERIT will be automatically closed in an execve() call.
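To make the intended workflow concrete, here is a rough C sketch of how the feature might be used if it were merged. Only the flag names come from the patch discussion; the numeric values below, and the assumption that OA2_CRED_INHERIT is passed in struct open_how's flags field, are placeholders invented for illustration.

    /* Sketch only: O_CRED_ALLOW and OA2_CRED_INHERIT are proposed flags that
     * exist in no released kernel; the values here are made-up placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/openat2.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define O_CRED_ALLOW      0x40000000   /* placeholder value */
    #define OA2_CRED_INHERIT  0x80000000   /* placeholder value and placement */

    int main(void)
    {
        /* Privileged setup: open the directory, explicitly opting in to
         * credential inheritance with the proposed O_CRED_ALLOW flag. */
        int dfd = open("/srv/sandbox", O_RDONLY | O_DIRECTORY | O_CRED_ALLOW);

        /* ... drop privileges here (setresuid() and friends) ... */

        /* Later, the now-unprivileged code opens a file relative to dfd
         * using the credentials captured when the directory was opened. */
        struct open_how how = {
            .flags = O_RDONLY | OA2_CRED_INHERIT,  /* placement is an assumption */
        };
        int fd = syscall(SYS_openat2, dfd, "data.txt", &how, sizeof(how));

        /* On success, fd refers to /srv/sandbox/data.txt, opened with the
         * privileged credentials; it will be closed across execve(). */
        return (dfd >= 0 && fd >= 0) ? 0 : 1;
    }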

Meanwhile, O_CRED_ALLOW directory file descriptors cannot be passed to any other process over a Unix-domain socket. This would appear to be the only case where the SCM_RIGHTS mechanism restricts the type of file descriptor that can be passed in this way. This restriction prevents a container from giving its special permissions to a process outside of the container, but it will also block attempts to pass an O_CRED_ALLOW file descriptor into a container. For the intended use case (where a privileged process sets up the file descriptor before dropping privileges) this restriction will not be a problem, but it could possibly impede other use cases.

Sergeev notes in the series that, if this idea is accepted, there are more patches to come:

This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat().

Whether things will, in fact, go well is yet to be determined; this sort of security-related change to a core system call tends to need a high degree of review. And, of course, there will be people with other ideas of how this functionality could be provided. For example, Lutomirski proposed a somewhat more elaborate mechanism where credentials could be attached using open_tree() (which is part of the new(ish) mount API); a process could then mount the given subtree as a separate filesystem. This would allow him to "pick a host directory, pick a host *principal* (UID, GID, label, etc), and have the *entire container* access the directory as that principal".

Lutomirski was seeking comments on this approach and did not include an implementation of this idea. The comment he got came from filesystem-layer maintainer Christian Brauner, who pointed out that ID-mapped mounts can already provide most of the functionality that Lutomirski appeared to be looking for. Lutomirski has not yet responded to indicate whether he agrees.

It may take some time to see whether this work is accepted, and in which form. Adding new security features to an operating-system kernel needs to be done with care; there can often be surprising interactions with existing features, and they may be used in surprising ways. Serious vulnerabilities have resulted from file descriptors passed into containers in the recent past; developers would want to be sure that this feature would not lead to similar problems. But, regardless of how this specific patch set is ultimately received, it does demonstrate a direction — toward more capability-oriented systems — that many developers would like to pursue.

Comments (14 posted)

The file_operations structure gets smaller

By Jonathan Corbet
May 3, 2024
Kernel developers are encouraged to send their changes in small batches as a way of making life easier for reviewers. So when a longtime developer and maintainer hits the list with a 437-patch series touching 859 files, eyebrows are certain to head skyward. Specifically, this series from Jens Axboe is cleaning up one of the core abstractions that has been part of the Linux kernel almost since the beginning; authors of device drivers (among others) will have to take note.

The origin of struct file_operations

In the beginning, the Linux kernel lacked any sort of virtual filesystem layer. See, for example, the 0.01 implementation of read(), which contained explicit checks for each possible file-descriptor type. That approach worked to get an initial kernel to boot but, before long, Linus Torvalds realized that it would not scale well. As developers sought to add more device types, and to implement more than one filesystem type, the need for an abstraction layer became more urgent.

The Linux 0.95 release, which came out in March 1992, brought a number of changes, including a switch to the GPL license. It also added the first pieces of what was to become the kernel's virtual filesystem layer. A core piece of that layer was the first file_operations structure, defined, in its entirety, as:

    struct file_operations {
	int (*lseek) (struct inode *, struct file *, off_t, int);
	int (*read) (struct inode *, struct file *, char *, int);
	int (*write) (struct inode *, struct file *, char *, int);
    };

This structure contains the pointers to the functions needed to implement specific system calls on anything that can be represented by a file descriptor. Rather than use an extended if-then-else sequence to determine which type of file was being operated on, the kernel could just do an indirect call to the appropriate file_operations member. As might be expected, the most fundamental operations — reading, writing, and seeking — showed up here first. In early versions of the kernel, there wasn't much else that one could do with a file descriptor.

The file_operations structure grew from there. The 1.0 version of this structure included ten members, implementing system calls like readdir(), ioctl(), and mmap(). The 2.0 version of struct file_operations had 13 members, and 2.2 added two more. Through all of this history, the read() and write() members remained the way to read from and write to a file descriptor, though their prototypes changed somewhat.

The plot thickens

The 2.4 release, made at the beginning of 2001, included a version of struct file_operations with these new members:

    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

User-space developers often needed the ability to perform scatter/gather I/O — operations involving multiple segments of memory that needed to be transferred in a single operation. In response, the kernel gained support for readv() and writev() but, to properly support these system calls, the kernel needed to pass them down to the underlying implementations. The new members, which took an array of iovec structures containing an address (in user space) and size for each segment, were added for this purpose. For device drivers or filesystems that did not implement the new functions, the kernel would emulate them with a series of read() or write() calls instead.
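For reference, the user-space side of this interface looks like the following; this small illustrative program gathers two separate buffers into a single writev() call:

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char header[] = "header: ";
        char body[]   = "payload\n";
        struct iovec iov[2] = {
            { .iov_base = header, .iov_len = strlen(header) },
            { .iov_base = body,   .iov_len = strlen(body)   },
        };

        /* One system call transfers both segments, in order. */
        if (writev(STDOUT_FILENO, iov, 2) < 0)
            perror("writev");
        return 0;
    }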

Subsequent work added many more members to struct file_operations, including other variants of read() and write(). aio_read() and aio_write(), used to implement the kernel's somewhat unloved asynchronous I/O mechanism, went into the 2.5.33 development release. splice_read() and splice_write(), implementing the splice() system call, were added for 2.6.17. Removals of file_operations members, like the removal of kernel code in general, were rare, but readv() and writev() were removed in 2.6.19 after all users were switched to use aio_read() and aio_write() instead.

The 3.16 version of struct file_operations had grown to 27 members, including these additions indicating a new approach to I/O within the kernel:

    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);

Increasingly, I/O operations were being initiated from the kernel, not just from user space; they often involved multiple segments and needed to be executed asynchronously. The data buffers involved could be referenced in a number of ways. The iov_iter structure used to describe these more complex I/O operations looked like this at the time:

    struct iov_iter {
	int type;
	size_t iov_offset;
	size_t count;
	union {
	    const struct iovec *iov;
	    const struct bio_vec *bvec;
	};
	unsigned long nr_segs;
    };

The key distinguishing feature of this structure is related to the type field. If it was ITER_IOVEC, then the iov union member contained an array of segments using user-space addresses. If it was, instead, ITER_KVEC, then the addresses were in kernel space. And if type was ITER_BVEC, then the bvec field pointed to an array of bio_vec structures (the segment descriptors used for block-I/O requests). An I/O API defined in this way could be called from a number of contexts and would work regardless of whether the operation was initiated from user space or from within the kernel.

The kiocb structure is used by the kernel to coordinate asynchronous I/O operations. Drivers are not required to implement asynchronous I/O (though they may not perform as well if they don't), but if they do implement it, they need the information in this structure. The use of struct kiocb reflects the fact that, among other goals, the new methods were intended to replace aio_read() and aio_write(), which were duly removed for the 4.0 release.

struct iov_iter everywhere

Over time, struct iov_iter has evolved and become rather more complex; see the 6.8 version for the details. The kernel has also accumulated a set of helpers that free code from dealing with that complexity much of the time. Meanwhile, struct file_operations in 6.8 is up to 32 callable members. But, through all of this change, read() and write() have remained essentially unchanged, even though they only handle the simplest of I/O operations in what has become a complicated world.

Axboe has decided that, perhaps, those two members have reached the end of their useful life:

10 years ago we added ->read_iter() and ->write_iter() to struct file_operations. These are great, as they pass in an iov_iter rather than a user buffer + length, and they also take a struct kiocb rather than just a file. Since then we've had two paths for any read or write - one legacy one that can't do per-IO hints like "This read should be non-blocking", they strictly only work with O_NONBLOCK on the file, and a newer one that supports everything the old path does and a bunch more.

Since read_iter() and write_iter() can do everything that read() and write() can do, it makes sense to simply remove the older members. The only problem is, of course, there is a lot of code that only implements read() and write() in the kernel; much of it is in drivers that may not have seen significant development (or even use) in years. Some of them surely are being used, though, and breaking them would undoubtedly increase the (already high) level of grumpiness on the net.

Many modules that use the older interface can, with some effort, be converted to use read_iter() and write_iter() instead, perhaps gaining functionality in the process. But there are a lot of these modules, and trying to understand every one of them well enough to do such a conversion is a path to madness, with little benefit. So, instead, Axboe started by implementing a set of helpers that emulates the new functions with a series of calls to read() or write(); that minimizes the amount of change to any given module while maximizing the chances that the results will be correct. See this patch as an example of what the simplest conversions look like.
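As a rough illustration (this is not code from Axboe's series), a driver serving data from an in-memory buffer might implement ->read_iter() along these lines, using the iov_iter helpers instead of copying to a raw user-space pointer; the demo_dev structure and its fields are invented for this sketch:

    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/uio.h>

    static ssize_t demo_read_iter(struct kiocb *iocb, struct iov_iter *to)
    {
        struct demo_dev *dev = iocb->ki_filp->private_data;
        size_t avail, copied;

        if (iocb->ki_pos >= dev->size)
            return 0;                       /* end of "file" */
        avail = min_t(size_t, iov_iter_count(to), dev->size - iocb->ki_pos);

        /* copy_to_iter() handles every kind of destination segment:
         * user-space iovecs, kernel kvecs, bio_vecs, and so on. */
        copied = copy_to_iter(dev->buf + iocb->ki_pos, avail, to);
        if (copied == 0 && avail != 0)
            return -EFAULT;
        iocb->ki_pos += copied;
        return copied;
    }

    static const struct file_operations demo_fops = {
        .owner     = THIS_MODULE,
        .read_iter = demo_read_iter,
    };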

The final patch in the series removes read() and write() with a surprising lack of ceremony, given that they have been there for 32 years.

There have not been a lot of comments on the series; perhaps many developers are still waiting for the whole thing to download into their inboxes. Al Viro noted that some of the conversions might need to be done a bit more carefully. But nobody has objected to the overall concept, thus far.

For a series like this to be accepted, it will need to be split into more manageable chunks — which Axboe acknowledged at the outset. This set of changes does simplify the kernel, though, and it removes a fair amount of old code, so chances are that it will happen in some form, sooner or later. At that point, there will likely be a lot of out-of-tree modules that will need to be updated before they can be built on newer kernels. The good news is that developers can make those changes now and get ahead of the game.

Comments (10 posted)

A proposal to switch Fedora Workstation's desktop

By Jake Edge
May 7, 2024

A proposal to switch the default desktop for Fedora Workstation from GNOME to KDE Plasma largely went over like the proverbial lead balloon—unsurprisingly. But the conversation about the proposal did surface some areas where the distribution could perhaps be more inclusive with regard to the other desktop choices available. The project believes that it benefits from being opinionated and not requiring users to make multiple decisions before they can even install the distribution, but there is a balance to be found.

For Fedora 42

The change proposal was posted to the Fedora devel mailing list on behalf of the feature's owners (Joshua Strobl, Marc Deop i Argemí, Troy Dawson, Steve Cossette, Alessandro Astone) by Fedora operations architect Aoife Moloney on April 2. In short, it proposes to "switch the default desktop experience for Workstation to KDE Plasma" for Fedora 42, which will come in roughly a year. As one might expect, it reads like an advocacy piece about the Plasma desktop, extolling its virtues while not denigrating GNOME at all. The idea would be to swap the positions of Plasma and GNOME, keeping the GNOME edition as a separate version that would still be release-blocking; new installs would get Plasma by default, while upgrading existing systems would not switch the desktop.

The date of the post did not help with its initial reception. It was first posted on the Fedora wiki April 1, but was announced a day later on the list. That led Richard Hughes to wonder if it was an April Fools' Day joke; if so, "it's a weird one, and a day too late". Tomas Torcz thought the proposal made sense because Plasma seems "more technically advanced than GNOME", thus he did not think it is a joke. Feature owner Cossette agreed that it was not a joke; despite the timing, "the proposal is 1000% serious". He followed that up with some more information about the thinking of its proponents. For one thing, it was never meant to knock down GNOME; the real goal is rather different:

The overall spirit of the CP [change proposal] is that we think KDE, and to some extent the other spins too, need a bit more visibility on the website. At the very least, Gnome and KDE should be up front on the frontpage.

[...] We've been discussing it in Matrix, and we can't seem to reach a consensus as to what is the best way to initiate the discussion procedure. Figured a change proposal was probably a decent way to "kick the hornet's nest", so to speak.

But Kevin Fenzi objected that giving the two desktops equal billing would simply lead to confusion; what would be needed is a way to describe the differences to new users "in a quick enough way that they won't decide it's all confusing and go do something else". Kevin Kofler thought there was a fairly straightforward way to raise the visibility of Plasma and other spins without confusing users. He suggested that the first "option" be a big button that users who hate options can click (hyperbolically: "I HATE OPTIONS, JUST GIVE ME SOMETHING WITH NO OPTIONS!"); it would download the GNOME workstation edition for x86_64. Below that would be alternative desktops for the various architectures, then specialty choices, such as mobile versions and Fedora Labs, and so on. The advantage is that the big button at the top will cater to the users Fenzi is concerned about and "will give them a desktop environment designed exactly for them".

Fenzi said that could simply be turned on its head, so that the "Download Workstation" button was at the top, followed by other options—which is more or less what is there now. The current Fedora home page (from the Wayback Machine, since it may be changing) shows the five editions, Workstation, Server, IoT, Cloud, and CoreOS, toward the top, each with its own logo, short description, download link, and "Learn More" button. After that come the other options, Atomic desktops, Spins, Labs, and Alternative (ALT) downloads, each with a description and "Learn more" link. Kofler said that the arrangement places Plasma (and other desktops) behind editions, such as IoT, Cloud, or Server, that may well be irrelevant to the users Fenzi mentioned.

Cossette acknowledged Fenzi's point about confusing users, but suggested that choosing between two desktops was not such a huge barrier, especially in comparison to the decision on which of a huge number of Linux distributions to try. Adam Williamson pointed out, however, that the outcome of the Fedora.next initiative back in 2014 had specifically overhauled the distribution to make it "much more focused and less of a choose-your-own-adventure, specifically including making the download page much more opinionated". Michael Catanzaro said that while the changes made have been "key to the success of Fedora over the past 10 years", there may still be room to raise the profile of Plasma on Fedora:

But there is a continuum of strategies we can use to promote our default desktop over other options, and I wonder if we've erred too far in favor of Fedora Workstation and against Fedora KDE Plasma Desktop here. The Plasma spin is much "bigger" than the other spins, it's of comparable quality to Fedora Workstation, and it is release blocking. It just seems strange to relegate it to a secondary downloads page regardless of how popular it is, while the non-desktop editions (some of which are frankly relatively niche) get featured very prominently.

Edition?

He suggested that since the Fedora KDE Plasma Desktop spin occupied a singular position among the spins (and various other kinds of Fedora releases), it could perhaps become an edition of its own. The "Workstation" name and branding should not be used for it, and the distribution would "continue to steer undecided users towards Fedora Workstation", but it would make Plasma easier to find and present it "more prominently than it is today". Beyond that, the Fedora Spins could be positioned higher on the home page—since those options are not mutually exclusive, both could be done.

Neal Gompa, who is a member of the Fedora Engineering Steering Committee (FESCo) and the KDE SIG, wondered why the options could not be "Fedora GNOME Workstation" and, reusing the current name, "Fedora KDE Plasma Desktop". But Andreas Tunek pointed out that using GNOME in the name of the Workstation edition is concerning because it may imply the existence of other Workstation editions to some, which is not the case at all. Kofler said that he was not sure that he bought that argument, however.

FESCo member Zbigniew Jędrzejewski-Szmek agreed with Catanzaro's idea that the Plasma spin become an edition. He did not see that having a second choice would be disruptive to the Fedora Workstation edition. Like others, he thought that the web site needed some reorganization. Gompa seemed concerned that the change would simply move the KDE version in with the editions, but Williamson noted that is actually a big change:

Being an Edition is a very significant thing, though, as we conceive of Fedora more widely than just the download page. We put a bunch of hoops in the way of IoT and CoreOS becoming editions, and there are hoops in the way of Silverblue becoming one (or, you know, wherever we go with that path in the end).

Jędrzejewski-Szmek said that his assumption was that the proposal would be changed to create a KDE Plasma edition, following the Edition Promotion Policy. Overall, it seems that KDE Plasma would qualify, with one possible exception:

The only sticky point is whether KDE desktop serves a different purpose than Workstation with GNOME. I'd say it does: desktop preferences are like religion, and people don't just switch (except when they do).

More discussion

A parallel discussion of the change proposal took place on the Fedora discussion forum after one of the owners, Joshua Strobl, posted it there. That discussion progressed on similar lines, with some highly in favor of a switch, while others were strongly opposed—still others wondered whether it was an April Fools' joke. Cossette clarified that the proposal was not aimed at removing GNOME, but, of course, the proposal wording itself seemed to advocate in that direction, which was confusing.

Fedora project leader Matthew Miller suggested a path for the change proposal owners—and the wider KDE SIG that they are members of—to take, starting with contacting the Fedora Workstation working group to see if there is any interest in switching to, or better supporting, Plasma. In the likely event that does not go far, looking into a promotion to an edition would be the next step, he said. In the meantime, Miller asked that the change proposal be withdrawn, or that FESCo defer action on it, until that process could play out. Gompa, who is also a member of the Workstation working group, said that he would rather see the discussion continue. Since the proposal targets Fedora 42, "there's a very long timeframe to figure things out".

Yet another proposal owner, Troy Dawson, filed an issue with the Workstation group on April 12. As with the change proposal, the issue suggested replacing GNOME with KDE Plasma in the Workstation edition for Fedora 42. If that was not of interest: "we would like to talk with the Fedora Workstation Group about possible ways to promote KDE to Edition level status in Fedora". That set off a lengthy discussion, in the issue thread and over several Workstation meetings, that continues as of this writing. On May 6, Catanzaro summarized the status:

I think we have a rough consensus that:
  • We do not want to use Fedora Workstation branding for KDE
  • We still want Workstation to be the "default" choice (i.e. we don't want them to be viewed as equal) (Neal [Gompa] does not agree with this)

But, even after spending the entirety of the May 7 meeting discussing the issue, the group has not come up with an official response. Catanzaro said: "I know you've been waiting a while (sorry!) and we want to finish this soon, but this is also too important to rush."

That's where things stand now. The discussion has mostly run its course at this point; along the way it included various comparisons of the two desktops and their ease of use for newcomers (as opposed to the Linux-savvy), rehashing the decision on continuing X11 support for Plasma, and more. Based on what we know, a switch to Plasma for Fedora Workstation in Fedora 42 (or any release in the foreseeable future) seems vanishingly unlikely. On the other hand, more prominence for the Plasma spin (or, probably, edition) is something we are likely to see—perhaps even well before a year goes by.

Comments (50 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Linux 6.9-rc7; GCC 14.1; Go 1.22 randomness; 2023 PSF report; Rust 1.78.0; curl up; 2023 Free Software Awards; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.
Next page: Brief items>>

Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds