LWN.net Weekly Edition for June 12, 2025
Welcome to the LWN.net Weekly Edition for June 12, 2025
This edition contains the following feature content:
- Nyxt: the Emacs-like web browser: an attempt to bring the Emacs philosophy to a fully functional web browser.
- Open source and the Cyber Resilience Act: what does the new European legislation mean for the open-source community?
- Fending off unwanted file descriptors: Unix-domain sockets have, since times before Linux, allowed one process to pass a file descriptor to another. But what if that descriptor is malicious? A new kernel feature allows processes to prevent the receipt of unwanted file descriptors.
- Slowing the flow of core-dump-related CVEs: kernel-produced core dumps have long been associated with security problems; the 6.16 kernel will offer a better API for processes that handle core dumps.
- The second half of the 6.16 merge window: the remainder of the changes for the next kernel release.
- An end to uniprocessor configurations: has the time come to stop special-casing uniprocessor systems in the CPU scheduler?
- Finding locking bugs with Smatch: a Linaro Connect session on this static-analysis tool.
- Zero-copy for FUSE: LSFMM+BPF discussion on improving the performance of the filesystems in user space feature.
- Improving iov_iter: the kernel's abstraction for block I/O needs some changes.
- Improving Fedora's documentation: a Flock session on how to make Fedora's documentation better.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Nyxt: the Emacs-like web browser
Nyxt is an unusual web browser that tries to answer the question, "what if Emacs was a good web browser?". Nyxt is not an Emacs package, but a full web browser written in Common Lisp and available under the BSD three-clause license. Its target audience is developers who want a browser that is keyboard-driven and extensible; Nyxt is also developed for Linux first, rather than Linux being an afterthought or just a sliver of its audience. The philosophy (as described in its FAQ) behind the project is that users should be able to customize all of the browser's functionality.
Background
Nyxt was started in 2017 by John Mercouris, and is currently sponsored as a project by Atlas, which seems to be a two-person business focusing on Common Lisp development. The team consists of Mercouris and André A. Gomes. The post about Nyxt's origins states that it was built by Emacs users and developed to provide "a good Emacs experience while using the Internet". It is meant to enable user freedom not only through its license, but also by focusing on the browser's "hackability" so users can fully control the browser:
Nyxt and Emacs take a different approach than Unix. Instead of doing one thing and doing it well, Nyxt and Emacs share a core foundation built upon extensibility. They are designed to be introspected, changed, and hacked on the fly.
With Emacs being a heavy influence on Nyxt, one might wonder why it isn't developed as an Emacs package or browser extension rather than as a standalone project. In 2021, contributor Pedro Delfino and Mercouris addressed that question in a blog post. In short, despite all of its merits, Mercouris felt that Emacs had too much technical debt to make a good basis for a web browser, and wanted to start with a clean slate. He also wanted to make Nyxt welcoming to non-Emacs users, which meant that it would be a bad idea to require people to use Emacs in order to run Nyxt. It should be noted that Nyxt has support for vi-like keybindings as well as common user access (CUA) keybindings.
According to their FAQ, it was not possible to develop Nyxt as a browser extension, either:
It would not be able to utilize Lisp as a powerful and modern language (a source of many of Nyxt's powerful features). Additionally, many of Nyxt's features are simply not possible due to restrictions in plugin-architecture.
The current stable version of Nyxt is 3.12.0, which was released in October 2024. The 3.x series uses WebKitGTK as its rendering engine, with experimental support for Blink. The project's security policy is terse and to the point: only the latest stable version of Nyxt will receive security updates. There is no announcement list specifically for security issues, and the project requests that vulnerability reports be sent to hello@atlas.engineer.
The 3.x series appears to be in maintenance mode at this point, with work focused on a 4.0 release series that was unveiled at the end of December. The 4.0 pre-releases support two renderers: WebKitGTK and Electron. According to the blog post by Gomes announcing the first preview release, 4.0 will mark "a new era for Nyxt" as a "renderer-agnostic web browser". The project is adding Electron due to shortcomings in WebKitGTK's performance. This has required the project to improve its API "in order to achieve render agnosticism". The move to Electron, said Gomes, will provide better performance as well as support for macOS and Windows.
The latest preview release (4.0.0-pre-release-8) is only available as an AppImage with Electron as the rendering engine. Users who wish to use the WebKitGTK version will need to compile from source. Users should not expect a bunch of new features or functionality in the first stable 4.0 release; the bulk of the work seems to be refactoring for Electron support, bug fixes, and user-interface improvements.
Getting Nyxt
The recommended way to install the stable release of Nyxt for Linux is via Flatpak. (Despite offering an AppImage for preview releases, there is no AppImage for the stable series.) The project also maintains a list of packages for a variety of Linux distributions, but asks that any bugs found in third-party packages be reported to the distribution rather than the Nyxt project.
Compiling from source is also an option, of course, and might even be a necessity. The Nyxt Flatpak package would not run on Fedora 41 on a system with an NVIDIA card without disabling WebKit compositing and sandboxing using the following command:
    $ flatpak run --env=WEBKIT_DISABLE_COMPOSITING_MODE=1 \
          --env=WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1 \
          engineer.atlas.Nyxt
Starting Nyxt with a variable that explicitly informs the user "this is dangerous" seemed unwise over the long term, so I went ahead and compiled Nyxt myself for that machine. The developer manual has information on dependencies that might be needed and that are unlikely to be installed by default. To enable copy and paste functions under Wayland, it will be necessary to install the Wayland clipboard utilities in addition to any dependencies needed to compile Nyxt. On Fedora, this is the wl-clipboard package.
For this article, I primarily used Nyxt 3.12, though I did spend some time with the 4.0.x preview releases as well. As one might expect, they are still too unstable to use full-time.
Getting started
When Nyxt is started, it displays a page with four buttons: Quick-Start, Describe-Bindings, Manual, and Settings. Unlike Chrome, Firefox, and other popular web browsers, Nyxt does not have a point-and-click interface for all (or even most) of its features; the expectation is that users are going to do almost everything from the keyboard, and much of Nyxt's functionality is only accessible by using key combinations or entering commands. Users can still use the mouse to click links, etc., but there are no buttons to open new windows or add bookmarks, and no URL bar or location bar to type "lwn.net" into.
The quick start introduces some of the concepts that set Nyxt apart from other browsers. Instead of tabs, Nyxt has buffers. In practice, buffers are similar to tabs, except that a Nyxt buffer can have its own behavior and settings. For instance, users can set a buffer's keybindings individually to use a different set than the global default. Likewise, a buffer can use different modes—which are similar to Emacs modes.
Nyxt commands are invoked with keybindings or by bringing up the prompt buffer and typing the command to be used. Users can summon the prompt buffer with Ctrl-Space and Alt-x if using Emacs mode or : in vi mode. Users can see the modes that are enabled in a buffer by bringing up the prompt buffer and using the toggle-modes command.
The Settings page, which can be opened from the Nyxt start page or by running the common-settings command in the prompt buffer, has several tabs for configuring high-level Nyxt options. These include the browser's default keybindings, the theme (dark or light), its privacy settings, and the text editor. Nyxt comes with an ad-blocker mode, but it needs to be enabled in the privacy settings. Users can also set the cookie policy, turn on the reduce-tracking mode, and turn off JavaScript if desired with the no-script mode. And, of course, each of these settings can be enabled or disabled on a per-buffer basis as well.
Even though Nyxt developers have a strong preference for Emacs, it is set to use CUA keybindings by default. Folks who prefer Emacs or vi-like bindings will want to change the keybinding setting and restart the browser for that to take effect. (Note that users can change the keybindings (or other modes) in a buffer without having to restart—a restart is only required for the global setting.) It's best to do this early, rather than committing the CUA keybindings to memory and then relearning them later for vi or Emacs. It will likely make Nyxt more intuitive as well—I found that it was relatively easy to make the move from Firefox with Vimium to Nyxt. The CUA bindings, aside from some familiar ones like Ctrl-l to enter a URL to browse to, were much harder (for me, anyway) to commit to memory.
Over the years, a lot of work has been done to reduce the amount of space that is taken up by browser "chrome"—that is, the user-interface components (UI) such as window title bars, toolbars, tabs, and so forth—to maximize the amount of space available for the web pages being viewed. Nyxt solves this by having almost no UI at all while browsing, though there are a few tiny buttons in a small toolbar at the bottom of the Nyxt window. The toolbar has elements for navigating backward or forward, reloading the page, raising the command prompt, and for displaying the modes enabled in each buffer. Nyxt also has a minimal right-click menu that lets users move one page backward or forward in a buffer, reload a buffer, or bring up the WebKit Inspector tools.
One feature Nyxt has that other browser makers should copy is the history tree. Running the history-tree command will bring up a navigable tree-type visualization of the browser's history across all buffers. Not only does this let users trace their steps and quickly hop back (or forward) to pages they've visited, but Nyxt can also operate on the history with other commands, such as bookmark-buffer-url to bookmark pages that are currently open in the browser.
Just as with Emacs and Vim, users can operate on multiple buffers at once. After a long browsing session, for instance, one might have 20 pages open from Wikipedia after going down a rabbit hole about a topic. Open the command buffer, enter the delete-buffer command, and Nyxt will list all open buffers. Type "wikipedia" and it will list only those buffers that match the term; select each one and hit Enter, and they will all be closed. The same is true for other commands, of course. Instead of closing all buffers, a user could choose to bookmark all pages matching certain terms, or to use the print-buffer command to send them all to the printer.
Extending Nyxt
Firefox and Chrome allow developers to add features to the browser as extensions, but there are limits to what developers can do within the boundaries of the extension frameworks for each browser. Nyxt, on the other hand, is designed to be entirely customizable and extensible via Lisp. Nyxt targets the Steel Bank Common Lisp (SBCL) implementation. Users can create their own URL-dispatchers to configure applications to handle certain types of URLs, or even create custom URL scheme handlers to deal with URL schemes that Nyxt doesn't know about. Users can also create custom commands, add new menu entries to the right-click menu, and (of course) add and modify keybindings. Nyxt also features a built-in REPL to run Lisp code in the browser.
Nyxt will automatically load configurations and code from the user's config.lisp file, usually found under ~/.config/nyxt. As a short example, this code will add an item to the right-click menu to bookmark the current page:
    (define-command-global my-bookmark-url nil
      "Query which URL to bookmark."
      (let ((url (prompt :prompt "Bookmark URL"
                         :sources 'prompter:raw-source)))
        (nyxt/mode/bookmark:bookmark-current-url)))

    (ffi-add-context-menu-command 'my-bookmark-url "Bookmark")
One of the great things about both Emacs and Vim is that each editor has a large community of users, many of whom like to share their knowledge about using and extending their editor of choice. New users can find copious amounts of documentation and examples online to learn from or copy and modify as needed. The odds are that anything one might want to do with Emacs or Vim has already been done and blogged about with examples for others to copy.
That is decidedly not true of Nyxt, at least not yet. There are not many people blogging about Nyxt and few examples online that I could find. Few, but not zero. Artyom Bologov has a repository with a treasure-trove of Nyxt configuration files, as well as a separate repository for adding search engines to Nyxt. These are bound to be helpful for many new Nyxt users, but the bad news is that Bologov stopped using Nyxt in favor of the surf browser from suckless.org. The examples are still useful today, but will become less so as Nyxt evolves.
Nyxt also lacks the kind of extension ecosystem that other browsers enjoy. It is possible to create extensions for Nyxt, but the project only lists two extensions currently—and both are created by Atlas.
Notes on Nyxt
It takes some time to get used to using Nyxt after using a browser like Chrome or Firefox full-time. It took several sessions with Nyxt before I felt productive with it, and a few more to really appreciate its features.
Nyxt is notably slower than Firefox or Chrome on some sites. For example, using the date-picker widget on GitHub, posting content to Mastodon, and browsing Codeberg were all sluggish compared to Firefox. Nyxt is probably not suitable as a primary browser replacement if one's work (or play) involves a lot of JavaScript-heavy web applications. It also lacks WebRTC support, so users need to look elsewhere for video conferencing.
On the other hand, it's quite usable for most of my day-to-day work with LWN and performs well on sites, like sourcehut, that have minimal JavaScript. It seems likely it would be a suitable option as a primary browser for many system administrators and developers.
The Electron port should offer a better experience once it has stabilized. I tried it on GitHub and Mastodon and didn't experience the same slowdowns that I ran into with the stable series.
Nyxt has an online manual that is also included with the browser to help users dig more deeply into its functionality. It is only fully usable when viewed with Nyxt; links to commands and functions are prefixed with nyxt: to point to internal documents, so they do not work when the manual is viewed on the Nyxt web site with other browsers. Unfortunately, the manual is outdated in spots, and that can lead to frustration. As an example, it explains how users can customize settings with the "Configure" button, but that feature was removed.
It's not entirely surprising that the documentation has fallen out of date in places: Nyxt is an ambitious project and only has a few active developers. Only seven people have contributed to the Nyxt repository on GitHub since June 4 last year, with Mercouris and Gomes responsible for all but 13 of the 633 commits; one of the other contributors has nine commits, the rest only have one each.
The 4.0 series does not have a target date for a stable release, but a list of GitHub issues tagged for the 4.0 series suggests that the project is making good progress toward it. There are currently 17 open issues and 46 closed issues, and there have been 4.0.0 development versions released about every two to three weeks since December.
Nyxt has received some funding from the European Union's Next Generation Internet initiative and has a plan to raise additional funding by selling applications that use Nyxt as an application framework. The first application is an RSS feed reader called Demeter. It is not open source, but it is offered under a "pay-what-you-can" model, with a suggested price of $10—and a link to a direct download of the source for users who cannot afford to pay. As a business strategy, Atlas's approach is fairly user-friendly but unlikely to generate a huge revenue stream.
Like Emacs and Vim, Nyxt is not for everyone—it takes more time than most would want to invest to really explore the features it has to offer and even longer to start making it one's own through customization. Also, like Emacs and Vim, it has the promise of letting users mold the application exactly to their specifications if they are willing to expend the time and effort to do so.
Open source and the Cyber Resilience Act
The European Union's Cyber Resilience Act (CRA) has caused a stir in the software-development world. Thanks to advocacy by the Eclipse Foundation, Open Source Initiative, Linux Foundation, Mozilla, and others, open-source software projects generally have minimal requirements under the CRA — but nothing to do with law is ever quite so simple. Marta Rybczyńska spoke at Linaro Connect 2025 about the impact of the CRA on the open-source ecosystem, with an emphasis on the importance of understanding a project's role under the CRA. She later participated in a panel discussion with Joakim Bech, Kate Stewart, and Mike Bursell about how the CRA would impact embedded open-source development.
Rybczyńska is not a lawyer. She's a security professional and a developer, but "we cannot leave law to the lawyers". A company in need of legal advice should go to its lawyer; for the rest of us, we have to rely on summaries from interested non-lawyers, or our own research.
The CRA has already become law, but does not come completely into force until 2027, Rybczyńska said. Some provisions start earlier than others; as of September 2026, vendors will need to report exploited vulnerabilities. "Basically everything" is affected: any software or hardware that is or can be connected to the Internet and is sold in Europe. There are specific exceptions for web sites, for products with existing regulations, and for hobby projects (including many open-source projects). Open-source stewards, organizations that guide an open-source project but don't qualify as manufacturers, also have reduced requirements.
![Marta Rybczyńska](https://static.lwn.net/images/2025/marta-rybczynska-linaro-small.png)
So, if hobby projects are an exception to the law, why does anyone without access to a corporate legal team need to care? Rybczyńska laid out two possible futures: either CRA compliance becomes another regulation for lawyers to work around with paperwork, self-assessments, and calculated risks of being caught, or software developers take the opportunity that the CRA offers to persuade companies to employ the best practices "that engineers have always wanted."
If someone is simply a developer of open-source software, which they don't monetize, they have no obligations under the CRA. But they can help vendors who do have those obligations choose real change over paperwork-only "compliance" by having a clear reporting channel for security vulnerabilities and a way to announce to users when those vulnerabilities are discovered. This helps consumers, but another provision of the law directly helps the open-source project itself. Manufacturers that monetize their products are legally responsible for all included software in their products, even if it's open source. If a manufacturer uses 1,000 open-source projects, it is responsible for fixing bugs in those 1,000 projects, Rybczyńska said.
Historically, companies have often demanded security fixes from open-source projects. The CRA inverts that relationship: companies are required to fix security problems in the open-source software they use, and report security problems to the upstream project. This obligation lasts for the entirety of the CRA's support period, five years after a consumer buys the end product. The companies are, unfortunately, not required to actually share their bug fixes (except as compelled to do so by a project's license) — but if an open-source project makes it easy to do so, they can likely be convinced to contribute back, if only so that they don't have to maintain a fix out-of-tree.

[As pointed out in a comment, the CRA does actually require companies to share bug fixes with the upstream project.]
That isn't the only obligation companies have under the CRA, Rybczyńska continued. Companies will also be required to report security incidents to the government, and perform a risk analysis of their software-development process, although the CRA doesn't mandate a framework to perform that risk analysis. It does require companies to use encryption for user data, encrypted communication, and mechanisms to ensure code integrity, such as signed images, in their products.
Rybczyńska finished her talk by inviting people again to consider the two possible worlds. Open-source developers can ignore the CRA, in which case companies will likely stick to working around the CRA with paperwork, or fixing bugs without sharing. Or open-source developers can embrace the CRA, make it easy for corporate users of their software to contact them with information about vulnerabilities, cooperate with risk analyses, and receive an army of paid engineers to fix security-related bugs for them.
Discussion
Bech, an employee at Linaro, led a later panel discussion about the CRA with Rybczyńska, Stewart, and Bursell. Stewart works at the Linux Foundation on dependable embedded systems; she gave a related talk earlier in the week. Bursell serves as the executive director of the Confidential Computing Consortium.

Bech opened with a simple question for Rybczyńska: "Marta, if I'm a small business, what should I do?" Her answer was: "Figure out who you are, under the CRA". Manufacturers, open-source stewards, and contributors all have different obligations, she explained. Bursell added that there are specific provisions for small businesses as well, so company size can also play a role. "If you fancy going to sleep one night, reading the CRA is a great way to do that," he said. Rybczyńska and Stewart disagreed, saying that the law has many interesting parts. Stewart was particularly interested in the classification of operating systems as "important" components.
Bursell briefly explained about the different levels of products defined in the CRA (in paragraphs 43 through 46, primarily). By default, products can be self-certified for compliance; their manufacturers only need to provide supporting materials on request. "Important" products, a category that includes everything from baby monitors to operating systems, are held to a higher standard, and may need to have paperwork filed in advance. "Critical" products are the highest category, with additional compliance obligations. He advised people to err on the side of caution, or ask the EU for clarification if unsure about the status of a specific product.
A concern that applies regardless of product classification, however, is the mandate that companies which sell a product retain documentation about its CRA compliance for 10 years. Rybczyńska urged everyone to generate that documentation in advance and save it in a safe place; trying to come up with a software bill of materials (SBOM) at the time of a request is likely to be problematic. Stewart agreed, saying: "Yeah, don't do that."
Bech asked what kind of documentation was covered by the requirement. Rybczyńska gave a long list: processes for software development, evidence that they were followed, a product's SBOM, and a complete history of security updates for the product. She emphasized that companies should really have a complete history of security updates for their products already. "We all know many cases where something went wrong in a product after a sequence of updates; if you don't have them, you can't debug."
Stewart advised that companies should be generating a new SBOM along with each of those security updates, as well, which can help with reproducing problems. A lot of the challenges of CRA compliance will come during mergers and acquisitions, she said, when trying to reconcile processes for these things across companies. Stewart was also worried about the relationship between datasets and trained machine-learning models, which the CRA doesn't cover. Rybczyńska agreed, noting that machine-learning models are increasingly used in security-critical applications such as firewalls.
Bech asked the panel members what they thought about the requirement that companies provide security fixes for their dependencies — "won't that result in a kind of fragmented 'fixed' ecosystem?" Rybczyńska agreed that it could happen, but called it an opportunity for vendors to review their whole supply chain and minimize their dependencies, focusing on dependencies with good security policies. If a company relies on abandoned projects, she said, that's going to cause a nightmare eventually, so it's better to find that out up front. In her opinion, the next thing SBOM tooling needs is a way to track projects' security policies as well as their licensing requirements.
"I'd go further,
" Bursell asserted. If a vendor's product relies on an
open-source project, the company should be involved in the project's
development, or at least pay for its support, he said. He expressed the hope that the CRA
would push more companies in that direction. Bursell also wondered how much
information about the software running in a product's build environment, rather
than direct dependencies, the CRA requires.
Stewart answered that the CRA leaves that undefined, just requiring "an SBOM
".
What exactly that means is not clear, with US, German, and Japanese agencies all
publishing different
definitions and requirements. Bech asked what parts of the definition were
missing from the CRA.
The industry currently focuses too much on documents, Stewart answered, which
provide a snapshot in time. The definition of an SBOM would ideally handle
keeping that information in a database. Rybczyńska added that the tooling simply
isn't there yet — there are multiple SBOM standards, multiple SBOM-generating
tools, and "you are expected to make sense of all of that
".
One member of the audience asked whether the panelists thought that the CRA would harm open-source adoption. Stewart said that the Linux Foundation had a survey done which showed that 46% of manufacturers passively rely on upstream projects for security fixes. "The way the CRA looks at it is, the people making money should have skin in the game." Ultimately, she doesn't think the CRA will hurt open source. Bursell also suggested that if an open-source developer is worried about this, it's "a great chance to figure out who your users are".
Rybczyńska pointed out that under the CRA, open-source projects are not required to attach a "CE mark" to their project (the mark that claims compliance with the CRA's security requirements) — but nothing is stopping them from doing that paperwork voluntarily. If a project is concerned with increasing adoption, it could do the work to obtain a CE mark as a marketing tactic.
The panel session ended with a few more questions about the exact scope of the CRA. The panelists clarified that it applies to "any product with a digital element", including firmware. Rybczyńska advised that if someone were really concerned about whether they could get away with not complying for a specific product, they should "speak to your lawyers, not three semi-experts".
It seems clear that the CRA is going to have an impact on the open-source ecosystem, if only because of its impact on the entire software sector. While open-source contributors don't have direct obligations under the CRA, they will still need to be aware of its implications in order to effectively guide their projects.
[Thanks to Linaro for funding my travel to Linaro Connect.]
Addendum: Videos of Rybczyńska's keynote and the panel discussion are available on YouTube, and slides for the talks are available on Linaro's website.
Fending off unwanted file descriptors
One of the more obscure features provided by Unix-domain sockets is the ability to pass a file descriptor from one process to another. This feature is often used to provide access to a specific file or network connection to a process running in a relatively unprivileged context. But what if the recipient doesn't want a new file descriptor? A feature added for the 6.16 release makes it possible to refuse that offer.

Normally, a Unix-domain connection is established between two processes to allow the exchange of data. There is, however, a special option (SCM_RIGHTS, documented in unix(7)) to the sendmsg() system call that accepts a file descriptor as input. That descriptor will be duplicated and installed into the receiving process, giving the recipient access to the file as if it had opened it directly. SCM_RIGHTS messages can be used to give a process access to files that would otherwise be unavailable to it. It is also useful for network-service dispatchers, which can hand off incoming connections to worker processes.
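For readers who have not used the mechanism, a minimal sketch of the sending side (not taken from any particular program; error handling omitted) looks roughly like this:

    /* Minimal sketch of sending a file descriptor with SCM_RIGHTS;
     * "sock" is assumed to be a connected AF_UNIX socket. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_fd(int sock, int fd)
    {
        char data = 'x';	/* at least one byte of real data must be sent */
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        union {			/* buffer sized and aligned for one cmsg */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;	/* the fd-passing control message */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* The kernel duplicates fd into the receiving process. */
        return sendmsg(sock, &msg, 0);
    }

The receiving side gets the new descriptor by calling recvmsg() and walking the control messages in the same way.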
The SCM_RIGHTS feature is not exactly new; it was added to the 1.3.71 development kernel by Alan Cox in 1996, but existed in Unix prior to that. Interestingly, it seems that, in the long history of this feature, nobody has ever considered the question of whether the recipient actually wants to acquire a new file descriptor. In retrospect, it seems like a bit of a strange omission. Developers tend to take care with the management of the open-file table in their programs, closing files that are no longer needed, and ensuring that file descriptors are not passed into new processes or programs unnecessarily. Injecting an unexpected file descriptor into a process has the potential to interfere with those efforts.
A specific problem with unexpected file descriptors, as pointed out by Kuniyuki Iwashima in this patch series, is their denial-of-service potential. If a file descriptor that is somehow hung — consider a descriptor for an attacker-controlled FUSE filesystem or a hung NFS file — is installed into a process, the recipient may be blocked indefinitely while trying to close it. This situation is similar to dumping a load of toxic waste on somebody's lawn; the victim may find themselves unable to get rid of it. In the SCM_RIGHTS case, this sort of toxic file descriptor can prevent the recipient from getting work done (or exiting).
The solution, as implemented by Iwashima, is to provide a new option to disable the reception of file descriptors over a given socket. That is done with a setsockopt() call, using the new SO_PASSRIGHTS flag, like:
    int zero = 0;

    ret = setsockopt(fd, SOL_SOCKET, SO_PASSRIGHTS, &zero, sizeof(zero));
If this option is used as above to disable the reception of file descriptors, any attempt to transfer a descriptor over that socket will fail with an EPERM error. Of course, the reception of SCM_RIGHTS file descriptors remains enabled by default; to do otherwise would surely break large numbers of programs. If SCM_RIGHTS were being designed today, it would likely require an explicit opt-in, but that ship sailed decades ago, so developers wanting to protect a process against unwanted file descriptors will need to disable SCM_RIGHTS explicitly for any socket that might be passed to recvmsg().
The SO_PASSRIGHTS option found its way into the mainline kernel (as part of the large networking pull) on May 28 and will be available as of the 6.16 kernel release.
Slowing the flow of core-dump-related CVEs
The 6.16 kernel will include a number of changes to how the kernel handles the processing of core dumps for crashed processes. Christian Brauner explained his reasons for doing this work as: "Because I'm a clown and also I had it with all the CVEs because we provide a **** API for userspace". The handling of core dumps has indeed been a constant source of vulnerabilities; with luck, the 6.16 work will result in rather fewer of them in the future.
The problem with core dumps
A core dump is an image of a process's data areas — everything except the executable text; it can be used to investigate the cause of a crash by examining a process's state at the time things went wrong. Once upon a time, Unix systems would routinely place a core dump into a file called core in the current working directory when a program crashed. The main effects of this practice were to inspire system administrators worldwide to remove core files daily via cron jobs, and to make it hazardous to use the name core for anything you wanted to keep. Linux systems can still create core files, but are usually configured not to.
An alternative that is used on some systems is to have the kernel launch a process to read the core dump from a crashing process and, presumably, do something useful with it. This behavior is configured by writing an appropriate string to the core_pattern sysctl knob. A number of distributors use this mechanism to set up core-dump handlers that phone home to report crashes so that the guilty programs can, hopefully, be fixed.
This is the "**** API
" referred to by Brauner; it indeed has a
number of problems. For example, the core-dump handler is launched by the
kernel as a user-mode helper, meaning that it runs fully privileged in the
root namespace. That, needless to say, makes it an attractive target for
attackers. There are also a number of race conditions that emerge from this
design that have led to vulnerabilities of their own.
See, for example, this recent Qualys advisory describing a vulnerability in Ubuntu's apport tool and the systemd-coredump utility, both of which are designed to process core dumps. In short, an attacker starts by running a setuid binary, then forcing it to crash at an opportune moment. While the core-dump handler is being launched (a step that the attacker can delay in various ways), the crashed process is killed outright with a SIGKILL signal, then quickly replaced by another process with the same process ID. The core-dump handler will then begin to examine the core dump from the crashed process, but with the information from the replacement process.
That process is running in its own attacker-crafted namespace, with some strategic environmental changes. In this environment, the core-dump handler's attempt to pass the core-dump socket to a helper can be intercepted; that allows said process to gain access to the file descriptor from which the core dump can be read. That, in turn, gives the attacker the ability to read the (original, privileged) process's memory, happily pillaging any secrets found there. The example given by Qualys obtains the contents of /etc/shadow, which is normally unreadable, but it seems that SSH servers (and the keys in their memory) are vulnerable to the same sort of attack.
Interested readers should consult the advisory for a much more detailed (and coherent) description of how this attack works, as well as information on some previous vulnerabilities in this area. The key takeaways, though, are that core-dump handlers on a number of widely used distributions are vulnerable to this attack, and that reusable integer IDs as a way to identify processes are just as much of a problem as the pidfd developers have been saying over the years.
Toward a better API
The solution to this kind of race condition is to give the core-dump handler a way to know that the process it is investigating is, indeed, the one that crashed. The 6.16 kernel contains two separate changes toward that goal. The first is this patch from Brauner adding a new format specifier ("%F") for the string written to core_pattern. This specifier will cause the core-dump handler to be launched with a pidfd identifying the crashed process installed as file descriptor number three. Since it is a pidfd, it will always refer to the intended process and cannot be fooled by process-ID reuse.
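A handler configured through a pipe-style core_pattern entry reads the dump from its standard input; with the new specifier, it can also hold onto the pidfd on descriptor three. A hypothetical sketch (made-up handler path, no real processing) of what that looks like:

    /* Hypothetical handler for a core_pattern entry such as
     * "|/usr/local/sbin/core-handler %F" (the path is made up). The core
     * dump arrives on standard input; per the new "%F" specifier, a pidfd
     * referring to the crashed process is installed as file descriptor 3. */
    #include <unistd.h>

    #define CRASHED_PIDFD 3

    int main(void)
    {
        char buf[65536];
        ssize_t n;

        /* Consume the core dump; a real handler would store or analyze it. */
        while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
            ;

        /* CRASHED_PIDFD always names the process that actually crashed, so
         * any further queries about it cannot be fooled by PID reuse. */
        close(CRASHED_PIDFD);
        return 0;
    }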
This change makes it relatively easy to adapt core-dump handlers to avoid the most recently identified vulnerabilities; it has already been backported to a recent set of stable kernels. But it does not change the basic nature of the core_pattern API, which still requires the launch of a new, fully privileged process to handle each crash. It is, instead, a workaround for one of the worst problems with that API.
The longer-term fix is this series from Brauner, which was also merged for 6.16. It adds a new syntax to core_pattern instructing the kernel to write core dumps to an existing socket; a user-space handler can bind to that socket and accept a new connection for each core dump that the kernel sends its way. The handler must be privileged to bind to the socket, but it remains an ordinary process rather than a kernel-created user-mode helper, and the process that actually reads core dumps requires no special privileges at all. So the core-dump handler can bind to the socket, then drop its privileges and sandbox itself, closing off a number of attack vectors.
Once a new connection has been made, the handler can obtain a pidfd for the crashed process using the SO_PEERPIDFD request for getsockopt(). Once again, the pidfd will refer to the actual crashed process, rather than something an attacker might want the handler to treat like the crashed process. The handler can pass the new PIDFD_INFO_COREDUMP option to the PIDFD_GET_INFO ioctl() command to learn more about the crashed process, including whether the process is, indeed, having its core dumped. There are, in other words, a couple of layers of defense against the sort of substitution attack demonstrated by Qualys.
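A rough sketch of those checks is shown below; the helper name is hypothetical, and the PIDFD_GET_INFO step is described only in a comment, since its structure fields are new in 6.16.

    /* Sketch of the checks a socket-based core-dump handler can perform
     * after accept()ing a connection from the kernel; "conn" is the
     * accepted socket. SO_PEERPIDFD needs reasonably recent headers. */
    #include <sys/socket.h>

    static int crashed_process_pidfd(int conn)
    {
        int pidfd;
        socklen_t len = sizeof(pidfd);

        /* Ask for a pidfd referring to the peer -- here, the crashed
         * process on whose behalf the kernel opened this connection. */
        if (getsockopt(conn, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
            return -1;

        /* With the pidfd in hand, the handler can issue the PIDFD_GET_INFO
         * ioctl() with PIDFD_INFO_COREDUMP in the request mask to confirm
         * that the process really is having its core dumped before
         * trusting anything else about it. */
        return pidfd;
    }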
The end result is a system for handling core dumps that is more efficient (since there is no need to launch new helper processes each time) and which should be far more resistant to many types of attacks. It may take some time to roll out to deployed systems, since this change seems unlikely to be backported to the stable kernels (though distributors may well choose to backport it to their own kernels). But, eventually, this particular source of CVEs should become rather less productive than it traditionally has been.
The second half of the 6.16 merge window
The 6.16 merge window closed on June 8, as expected, containing 12,899 non-merge commits. This is slightly more than the 6.15 merge window, but well in line with expectations. 7,353 of those were merged after the summary of the first half of the merge window was written. More detailed statistics can be found in the LWN kernel source database.
As usual, the second half of the merge window contained more bug fixes than new features, but there were many interesting features that made their way in as well:
Architecture-specific
- The getrandom() system call is now much faster on RISC-V. It is now handled entirely within the vDSO.
- RISC-V kernels now support new vendor extensions from SiFive, as well as the Zicbop, Zabha, and Svinval extensions. They also include the supervisor binary interface (SBI) firmware features (FWFT) extension, which is needed for SBI 3.0, the latest version.
- LoongArch now supports up to 2048 CPUs, the maximum that the architecture can handle. The architecture now also has multi-core scheduling.
Core kernel
- Unix-domain sockets can be used to transfer file descriptors; it is now possible for a program to opt out of that ability, which may be important for preventing denial-of-service attacks.
- The ring buffer used for kernel tracing can now be mapped into memory in user space.
- A new API will let virtual memory allocations persist across kexec handovers.
- Crash-dump kernels (the special kernel that runs after a kernel crash to produce a report) can now reuse existing LUKS keys. This allows crash dumps to be written to encrypted filesystems, which was not previously possible.
- The kernel memory accounting done by the memory control-group code can now be performed in a non-maskable interrupt (NMI) context. This is important because BPF programs can run in NMI contexts, and may need to allocate memory in the kernel, which in turn needs to be accounted for.
- NUMA weighted interleaving is now automatically tuned, providing better utilization of memory bandwidth in systems with data striped across multiple NUMA nodes.
Filesystems and block I/O
- OrangeFS now makes use of the new mount API, as does UFS.
- The limit on read and write sizes for NFS filesystems has been raised to 4MB. "The default remains at 1MB, but risk-seeking administrators now have the ability to try larger I/O sizes with NFS clients that support them."
- Users with the CAP_SYS_ADMIN capability in a user namespace (and no privileges in the root namespace) can now watch filesystems and mounts with fanotify.
- The ext2 filesystem has deprecated support for DAX, since it isn't widely used. The ext2 filesystem itself isn't widely used either, but it does serve as a stable reference implementation of a filesystem. Since persistent memory has not become as widely used as once expected, supporting it in a reference implementation doesn't make much sense. DAX support in ext2 is expected to be completely removed at the end of 2025.
- FUSE filesystems can now invalidate all existing cached directory entries (dentries) in a single operation.
- The overlayfs filesystem now supports data-only layers with dm-verity in user namespaces. This allows trusted metadata layers to be combined with untrusted data layers in unprivileged namespaces.
Hardware support
- Clock: SpacemiT K1 SoCs, Sophgo SG2044 SoCs, T-HEAD TH1520 video-output clocks, Qualcomm QCS8300 camera clocks, Allwinner H616 display-engine clocks, Samsung ExynosAutov920 CPU cluster clock controllers, Renesas RZ/V2N R9A09G056 SoCs, Sophgo CV1800 clocks, and NXP S32G2/S32G3 clocks.
- GPIO and pin control: Mediatek MT6893 and MT8196 SoCs, Renesas RZ/V2N SoCs, MediaTek Dimensity 1200 (MT6893) I2C, Sophgo SG2044 I2C, Renesas RZ/V2N R9A09G056 I2C, Rockchip RK3528 I2C, and NXP Freescale i.MX943 SoCs.
- Graphics: Amlogic C3 image-signal processors.
- Hardware monitoring: Dasharo fans and temperature sensors, KEBA fan controllers and battery monitoring controllers, MAX77705 ICs, MAXIMUS VI HERO and ROG MAXIMUS Z90 Formula motherboards, SQ52206 energy monitors, lt3074 linear regulators, ADPM12160 DC/DC power modules, and MPM82504 and MPM3695 DC/DC power modules.
- Industrial I/O: DFRobot SEN0322 oxygen sensors.
- Input: ByoWave Proteus game controllers and Apple Magic Mouse 2s.
- Media: ST VD55G1 and VD56G3 image sensors and OmniVision OV02C10 image sensors.
- Miscellaneous: FSL vf610-pit periodic-interrupt timers, SGX vz89te integrated sensors, Maxim max30208 temperature sensors, TI lp8864 automotive displays, MT6893 MM IOMMUs, Sophgo CV1800 and SG2044 SoCs, Qualcomm sm8750 SoCs, Amlogic c3 and s4 SoCs, and Renesas RZ/V2H(P) R9A09G057 DMA controllers.
- Networking: Renesas RZ/V2H(P) SoC, Broadcom asp-v3.0 ethernet devices, AMD Renoir ethernet devices, RealTek MT9888 2.5G ethernet PHYs, Aeonsemi 10G C45 PHYs, Qualcomm IPQ5424 qusb2 PHYs, IPQ5018 uniphy-pcie devices, Mediatek MT7988 xs-PHYs, and Renesas RZ/V2H(P) usb2 PHYs.
- Sound: Fairphone FP5 sound card.
Miscellaneous
- Support for the STA2x11 video input port driver has finally gone away.
- The documentation generation script scripts/bpf_doc.py can now produce JSON output about BPF helpers and other elements of the BPF API. This change makes it easier for external tools to keep their knowledge of the BPF interface up to date.
- Writing "default" to the sysfs trigger of an LED device will now reset the trigger to that device's default.
- Compute express link (CXL) devices now support the reliability, availability, and serviceability (RAS) extensions. Most importantly, these let CXL devices participate in various error detection and correction schemes.
- This release includes a number of improvements to perf, including support for calculating system call statistics in BPF, better demangling of Rust symbols, more granular options for collecting memory statistics, a flag to deliberately introduce lock contention, and several more.
- USB audio devices now support audio offloading. This allows, for example, audio from a USB device to continue to flow even when the rest of the system is sleeping. In the pull request, Greg Kroah-Hartman said: "I think this takes the record for the most number of patch series (30+) over the longest period of time (2+ years) to get merged properly."
Networking
- The contents of device memory can now be sent via TCP, allowing zero-copy transmission from a GPU to the wire.
- BPF can be used to implement traffic-control queueing disciplines (qdiscs) with a struct_ops program.
- Support for the datagram congestion control protocol (DCCP) is being removed following a long deprecation and no signs of having any users. DCCP was intended to prevent problems with UDP's lack of rate control, which have largely failed to materialize. It was originally added in 2005. The hope is that this removal will enable cleanup of the parts of the TCP stack that are currently shared with DCCP.
- The kernel now supports using the generic security services application programming interface (GSSAPI) for the AFS filesystem, allowing it to manage the encryption of connections to YFS and OpenAFS servers.
- OpenVPN now has a virtual driver for offloading some operations to the kernel, which should make it faster, especially for large transfers.
Security-related
- The randstruct GCC plugin, which makes it harder for attackers to access kernel data structures by randomizing their layout, is now working again, and has tests to keep it that way. The ARM_SSP_PER_TASK GCC plugin, which lets different tasks use different stack canaries, has been retired, since its functionality is available in upstream GCC.
- Integrity Measurement Architecture (IMA) measurements can now be carried across kexec invocations. A new kernel-configuration option, IMA_KEXEC_EXTRA_MEMORY_KB, determines how much memory is set aside for new IMA measurements on a soft reboot.
- The measurements made by the trusted security manager (TSM; part of Intel's trust domain extensions, also known as TDX) are now exposed as part of sysfs. This gives user space the opportunity to make decisions based on attestations from the hardware.
- The performance overhead of SELinux has been reduced by adding a cache for directory-access decisions and support for wildcards in genfscon policy statements.
- The kernel's EFI code has been extended to allow emitting a .sbat section with UEFI SecureBoot revocation information; the upstream kernel project won't maintain the revocation information, but individual distributions now have the access they need to be able to ship their own revocation databases.
- The .static_call_sites section in loadable modules is now made read-only after module initialization.
Virtualization and containers
- 64-bit Arm now supports transparent huge pages on non-protected guests when protected KVM is enabled.
- Nested virtualization support on 64-bit Arm is also working, although it remains disabled by default.
- x86 virtual machine hosts on KVM now support TDX, enabling the use of confidential guests on Intel processors. This change "has been in the works for literally years", and includes a large number of patches.
- KVM support on RISC-V is no longer experimental.
Internal kernel changes
- The power-management subsystem has gained Rust abstractions for managing CPU frequency, operating performance points (OPPs), and related power-management APIs.
- The kernel's minimum supported GCC version has been updated to GCC 8 for all architectures; the update allows for two of the five remaining GCC plugins used in kernel builds to be removed. The corresponding minimum version of binutils is 2.30.
- A bevy of memory-management changes includes more folio conversions, Rust abstractions for core memory-management operations, better support for memory compaction, and the removal of VM_PAT.
- Rust test error messages are now more tightly integrated into KUnit when using assertions and results. Rust code can now also make use of XArrays.
The 6.16 kernel now goes into the stabilization period, with the final release expected July 27 or August 3.
An end to uniprocessor configurations
The Linux kernel famously scales from the smallest of systems to massive servers with thousands of CPUs. It was not always that way, though; the initial version of the kernel could only manage a single processor. That limitation was lifted, obviously, but single-processor machines have always been treated specially in the scheduler. That longstanding situation may soon come to an end, though, if this patch series from Ingo Molnar makes it upstream.

Initially, Linus Torvalds's goal with Linux was simply to get something working; he did not have much time to spare for hardware that he did not personally have. And he had no multiprocessor machine back then — almost nobody did. So, not only did the initial version of the kernel go out with no SMP support, the kernel lacked that support for some years. The 1.0 and 1.2 releases of the kernel, which came out in 1994 and 1995, respectively, only supported uniprocessor machines.
The beginnings of SMP support found their way into the 1.3.31 development release in late 1995; the associated documentation file included the warning: "This is experimental. Back up your disks first. Build only with gcc2.5.8". It took some time for the SMP work to stabilize properly; the dreaded big kernel lock, which ensured that only one CPU was running within the kernel at any time, wasn't even introduced until 1.3.54. But, by the time 2.0 was released in June 1996, Linux worked reasonably well on two-CPU systems, for some workloads, at least.
At that time, though, SMP systems were still relatively rare; most people running Linux did not have one. The majority of Linux users running on uniprocessor systems had little patience for the idea that their systems might be made to run slower in order to support those expensive SMP machines that almost nobody had. The tension between support for users of "big iron" and everybody else ran strong in those days, and a two-CPU system was definitely considered to be big iron.
As a result, the addition of SMP support was done under the condition that it not regress performance on uniprocessor systems. This is a theme that has been seen many times over the history of Linux kernel development. Perhaps most famously, the realtime preemption code was not allowed to slow down non-realtime systems; in the end, realtime preemption brought a lot of improvements for non-realtime systems as well. In the case of SMP, this rule was implemented with a lot of macro magic, #ifdef blocks, and similar techniques.
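As a purely illustrative sketch (made-up function names, not actual scheduler code), that kind of conditional compilation typically looks something like this:

    /* Illustrative only: the sort of uniprocessor special-casing that
     * CONFIG_SMP leads to; these are made-up functions, not kernel code. */
    #ifdef CONFIG_SMP
    void balance_runqueues(void)
    {
        /* Walk the other CPUs' runqueues and migrate tasks as needed. */
    }
    #else /* !CONFIG_SMP */
    static inline void balance_runqueues(void)
    {
        /* Only one CPU: nothing to balance, so compile to nothing. */
    }
    #endif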
It is now nearly 30 years after the initial introduction of SMP support into the Linux kernel, and all of that structure that enables the building of special kernels for uniprocessor systems remains, despite the fact that one would have to look hard to find a uniprocessor machine. Machines with a single CPU are now the outlier case; in 2025, we all are big-iron users. Many of the uniprocessor systems that are in use (low-end virtual servers, for example) are likely to be running SMP kernels anyway. Maintaining a separate uniprocessor kernel is usually more trouble than it is worth, and few distributors package them anymore.
As Molnar pointed out in his patch series, there are currently 175 separate #ifdef blocks in the scheduler code that depend on CONFIG_SMP. They add complexity to the scheduler, and the uniprocessor code often breaks because few developers test it. As he put it: "It's rare to see a larger scheduler patch series that doesn't have some sort of build complication on !SMP". It is not at all clear that these costs are justified at this point, given how little use there is of the uniprocessor configuration.
So Molnar proposes that uniprocessor support be removed. The 43-part patch series starts with a set of cleanups designed to make the subsequent surgery easier, then proceeds to remove the uniprocessor versions of the code. Once it is complete, the SMP scheduler is used on all systems, though parts of it (such as load balancing) will never be executed on a machine with a single CPU. Once the work is done, nearly 1,000 lines of legacy code have been removed, and the scheduler is far less of a #ifdef maze than before.
Switching to the SMP kernel will not be free on uniprocessor systems; all that care that was taken with the uniprocessor scheduler did have an effect on its performance. A scheduler benchmark run using the SMP-only kernel on a uniprocessor system showed a roughly 5% performance regression. There is also a 0.3% growth in the size of the kernel text (built with the defconfig x86 configuration) when uniprocessor support is removed. This is a cost that, once upon a time, would have been unacceptable but, in 2025, Molnar said, things have changed:
But at this point I think the burden of proof and the burden of work needs to be reversed: and anyone who cares about UP performance or size should present sensible patches to improve performance/size.
He described the series as "lightly tested", which is not quite the standard one normally wants to see for an invasive scheduler patch; filling out that testing will surely be required before this change can be accepted. But, so far, there have been no objections to the change; there are no uniprocessor users showing up to advocate for keeping their special configuration — yet. Times truly have changed, to the point that it would be surprising if this reversal of priorities didn't make it into the kernel in the relatively near future.
Finding locking bugs with Smatch
Smatch is a GPL-licensed static-analysis tool for C that has a lot of specialized checks for the kernel. Smatch has been used in the kernel for more than 20 years; Dan Carpenter, its primary author, decided last year that some details of its plugin system were due for a rewrite. He spoke at Linaro Connect 2025 about his work on Smatch, the changes to its implementation, and how those changes enabled him to easily add additional checks for locking bugs in the kernel.
Video of the talk is available, and Carpenter's slides can be found on Linaro's website.
Carpenter began by apologizing for the relative complexity of this talk, compared to some of his presentations about Smatch in prior years. "We're running out of easy checks to write," he explained. Smatch is designed to permit writing project-specific checks; over the years, a large number of kernel-specific checks have been added to the code, so the latest work has moved on to more complicated topics, such as locking.

One of the things that sets Smatch apart from other static-analysis tools, Carpenter said, is its support for control-flow analysis and cross-function analysis. He frequently uses both of these features to understand new subsystems; Smatch can "tell you where a variable is set, where callers of a function are, and what a function can return," among other things. For example, Smatch might show that a particular function has three callers, all of which hold a particular lock when they call it. From that, the programmer can infer the implicit locking requirements of the function.
![Dan Carpenter](https://static.lwn.net/images/2025/dan-carpenter-linaro-small.png)
That kind of report requires cross-function analysis, to trace nested calls with a lock held, but it also requires some starting knowledge of which functions acquire or release a lock. In the kernel, Smatch obtains that information from a hand-written table of every lock-related function, and then propagates that information through the different call trees. Rebuilding the complete database of calls takes multiple passes and five to six hours; he said he does this every night.
Once the database is built, however, it allows files to be easily checked for common locking problems. The most common mistake is to fail to unlock a lock in a function's error path, he said. Smatch finds places where this has occurred by looking for functions that acquire a lock, and then have some possible control-flow path along which it is not released. That is slightly more complicated than it sounds.
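As an illustration of that most common mistake, here is a minimal sketch of the error-path pattern that Smatch looks for; the structure and helper are invented for the example, but spin_lock() and spin_unlock() are the usual kernel primitives:

```c
#include <linux/spinlock.h>
#include <linux/errno.h>

/* Hypothetical device structure, used only for this illustration. */
struct my_dev {
	spinlock_t lock;
	bool broken;
	int counter;
};

static int my_dev_bump(struct my_dev *dev)
{
	spin_lock(&dev->lock);
	if (dev->broken)
		return -EIO;		/* bug: returns with dev->lock held */
	dev->counter++;
	spin_unlock(&dev->lock);
	return 0;
}
```

Smatch flags the early return because there is a control-flow path that acquires dev->lock but never releases it.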
"It's harder than you might think to know what the start state is,
" he
explained. If a function makes a call to spin_lock(), one might
reasonably assume that the lock was not held before that point. But some
functions behave differently depending on whether a lock is already held, so
that is control-flow-sensitive as well. Also, sometimes a lock is referred to by
multiple names, being locked by one name and unlocked via another. This
complexity had resulted in Smatch's lock-tracking code slowly becoming an
unreadable mess. "And I'm the one who wrote it.
"
So, in the past year, Carpenter has rewritten everything. The reimplementation of the locking checks provides a blueprint for how to write modular Smatch checks, he said. Checks can now call add_lock_hook() and add_unlock_hook() to be informed when Smatch finds that a function call acquires or releases a lock somewhere in its call tree. Locks are also now tracked by type, instead of by name, in order to reduce problems with one lock being referred to by more than one name.
There's a slight wrinkle with code that uses C's
cleanup attribute to
automatically unlock locks. On the one hand, it mostly eliminates bugs related
to forgotten unlocks; on the other hand, it's a "headache for static
analysis
" since it makes the lock object harder to access and track.
Ultimately, since cleanup-based locks avoid many locking bugs in the first place, Smatch can
"mostly just ignore them
".
Carpenter has used the new structure to write checks for double lock and unlock
bugs as well. Unlike other static-analysis projects, Smatch focuses
less on "universal static properties
" and more on "the actual bugs
people are writing
". Smatch will not catch every possible double lock,
double unlock, or forgotten unlock. That increases the number of false negatives
from the tool, but it results in a much bigger reduction in false positives,
he said.
I asked how he found the classes of bugs that people actually write, in order to target them with Smatch checks. He explained that he reviews patches sent to the linux-stable mailing list in order to find bugs that could have been found earlier with static analysis. He encouraged other people to try the same thing, as he has found it educational.
In the future, Carpenter wants to extend Smatch's double-lock checks to operate across function boundaries, to take advantage of Clang's upcoming support for tracking the relationship between locks and their data, and to handle lock-ordering bugs.
As time wound down, one member of the audience wanted to know how Smatch compared to Cppcheck, Coccinelle, and other static-analysis tools. Other open-source tools do not have good control-flow analysis, Carpenter said, and have practically no cross-function analysis. Smatch does those things better, but it has its own weaknesses. The main problem is that it hasn't really been tested outside the kernel, he explained, so it's not clear how well Smatch will handle other styles of code. Smatch is also relatively slow.
Coccinelle is fast, and can generate fixes for many problems. Sparse is good at
finding endianness bugs and user-space pointers dereferenced in kernel space. "But in
terms of flow analysis, Smatch is the only tool that we have in open
source.
"
After Carpenter's talk, I ran the tool against my own copy of the kernel; in that process, I learned that Smatch is best run from source. The last release predates Carpenter's rewrite, and there are a number of useful scripts included in the source distribution that are not present in the distribution packages. Smatch's terse documentation covers how to build its analysis database and run existing checks, but not how to query the database in a more free-form manner. The smatch_data/db/smdb.py script included in the source distribution can be used for that purpose.
[Thanks to Linaro for sponsoring my travel to Linaro Connect.]
Zero-copy for FUSE
In a combined storage and filesystem session at the 2025 Linux Storage,
Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Keith Busch led
a discussion about zero-copy operations for the Filesystem
in Userspace (FUSE) subsystem. The session was proposed
by his colleague, David Wei, who could not make it to the summit, so Busch
filled in, though he noted that "I do not really know FUSE so
well
". The idea is to eliminate data copies in the data path to and
from the FUSE server in user space.
Busch began with some background on io_uring. When an application using
io_uring needs to do read and write operations on its buffers, the kernel
encapsulates those buffers twice, first into an iov_iter (of type ITER_UBUF)
and from that into a bio_vec, which
describes the parts of a block-I/O request. It does that for every such
operation; "if you are using the same buffer, that's kind of costly and
unnecessary
". So io_uring added a way for applications to register a
buffer; the kernel will create an iov_iter with the
ITER_BVEC type just once when a buffer is registered. Then the
application can use the io_uring "fixed" read/write operations, which will
use what the kernel created rather than recreating it on each call.
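As a concrete (and hedged) illustration of that mechanism, here is a minimal user-space sketch using liburing; the file being read and the buffer size are arbitrary choices for the example:

```c
/* Build with: gcc -o fixed-read fixed-read.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd = open("/etc/hostname", O_RDONLY);

	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	iov.iov_len = 4096;
	iov.iov_base = malloc(iov.iov_len);

	/* Register the buffer once; this is the point at which the kernel
	 * builds its long-lived ITER_BVEC description of the memory. */
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return 1;

	/* A "fixed" read refers to the registered buffer by index (0 here)
	 * instead of describing the memory again on every submission. */
	sqe = io_uring_get_sqe(&ring);
	if (!sqe)
		return 1;
	io_uring_prep_read_fixed(sqe, fd, iov.iov_base, iov.iov_len, 0, 0);
	io_uring_submit(&ring);

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read %d bytes\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```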
He then turned to ublk, which is a block device that is implemented by a user-space server. When an application writes to the device, the ublk driver in the kernel will notify the ublk server that new data has been written to it, but the application's user-space buffer where the data lives cannot be read directly by the server. Instead, the ublk server needs to allocate a bounce buffer and ask the ublk driver to copy the data into it, which is pretty expensive. Ublk was changed in Linux 6.15 to allow the server to use the io_uring buffer registration that he had just described, so that it can do fixed read/write operations and a copy operation is not needed.
Busch has just started looking at the FUSE code, but he thinks that the same idea could be applied to the user-space FUSE server. Now that the FUSE server (or daemon) has io_uring support, this technique could just work, even though the target is a file in a filesystem rather than a block device. Busch thinks that idea is different from what Wei is proposing; instead of referencing the buffers, Wei was thinking of the application cooperatively sharing memory with the daemon. Using the registration mechanism, though, would mean that the FUSE daemon would not be able to directly read the data; it would only be able to reference it for fixed io_uring operations.
Josef Bacik agreed that Wei is looking for a way to share memory between the application and the FUSE daemon; Busch did not see why FUSE would be needed at all in that case. Bacik said that FUSE provides files, permissions, and the like, which applications already know how to work with. The FUSE daemon may need to be able to read the data, though, so the registration mechanism is not sufficient. Christoph Hellwig suggested using layout leases as a way to give clients direct access to the buffer while allowing that access to be revoked if the FUSE daemon exits; Bacik and Busch thought that made sense.
Jeff Layton asked how applications would access the functionality and if a
new io_uring command would be needed;
Bacik thought it would just be an extension to existing io_uring commands for
zero-copy networking. Stephen Bates asked if that would allow FUSE on top
of a ublk device "and that it is going to zero-copy all the way
through
"; Busch said that it would.
There was some discussion of how the memory would get pinned and whether it could be migrated. In addition, there were questions about how that memory would be accounted for and, thus, how it would interact with memory control groups. Some of that discussion happened without a microphone, so I was unable to fully follow it, but the attendees all seemed satisfied that those concerns were being considered.
As the session wound down, with some banter and
laughter, Bates asked what people were using ublk for. Busch said that his employer, Meta, had a blog post about one use case, which is for quad-level
cell (QLC) SSDs that are not NVMe devices. "We are doing all the
fancy stuff in user space
", so there is no out-of-tree kernel driver
being used to support those devices.
Improving iov_iter
The iov_iter interface is used to describe and iterate through buffers in the kernel. David Howells led a combined storage and filesystem session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF) to discuss ways to improve iov_iter. His topic proposal listed a few different ideas, including replacing some iov_iter types and possibly allowing mixed types in chains of iov_iter entries; he would like to make both the interface itself and the uses of iov_iter in the kernel better.
Howells began with an overview. An iov_iter is a stateful description of a buffer, which can be used for I/O; it stores a position within the buffer that can be moved around. There is a set of operations that is part of the API, which includes copying data into or out of the buffer, getting a list of the pages that are part of the buffer, and getting its length. There are multiple types of iov_iter. The initial ones were for user-space buffers, with ITER_IOVEC for the arguments to readv() and writev() and ITER_UBUF for a special case where the number of iovec entries (iovcnt) is one.
There are also three iov_iter types for describing page fragments: ITER_BVEC, which is a list of page, offset, and length tuples; ITER_FOLIOQ, which describes folios and is used by filesystems; and ITER_XARRAY, which is deprecated and describes pages that are stored in an XArray. The problem with ITER_XARRAY is that it requires taking the read-copy-update (RCU) read lock inside iteration operations, which means there are places where it cannot be used, he said. An ITER_KVEC is a list of kernel virtual address ranges, such as regions allocated with kmalloc(). Finally, the ITER_DISCARD type is used to simply discard the next N bytes without doing any copying, for example on a socket.
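To make the API a little more concrete, here is a minimal sketch of how kernel code might wrap a kmalloc() buffer in an ITER_KVEC iterator and copy data into it; the fill_buffer() function is invented for the example, and the direction constant has been spelled differently (READ/WRITE versus ITER_DEST/ITER_SOURCE) in different kernel versions:

```c
#include <linux/uio.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Copy len bytes from msg into a freshly allocated kernel buffer by
 * way of an ITER_KVEC iterator (illustration only). */
static int fill_buffer(const void *msg, size_t len)
{
	struct kvec kv;
	struct iov_iter iter;
	void *buf = kmalloc(len, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;

	kv.iov_base = buf;
	kv.iov_len = len;
	/* ITER_DEST marks this iterator as the destination of copies. */
	iov_iter_kvec(&iter, ITER_DEST, &kv, 1, len);

	if (copy_to_iter(msg, len, &iter) != len) {
		kfree(buf);
		return -EFAULT;
	}

	kfree(buf);
	return 0;
}
```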
![David Howells](https://static.lwn.net/images/2025/lsfmb-howells-sm.png)
One of the big problems with iov_iter, and buffer handling in general, is that as buffers are passed down into lower layers, those layers want to take page references on the buffer's pages. There is a pervasive view that all buffers have pages that references can be taken on, but that is no longer true in the folio world. There are also different lifetime rules for different kinds of memory that might be used in an iov_iter; pages might be pinned via get_user_pages(), there is slab memory (from kmalloc()), vmalloc() and vmap() memory, as well as device memory and other memory types, all of which have their own lifetimes. For example, user space could allocate GPU memory and do a direct read or write to it, which mixes several types. The bottom line is that a function that receives a buffer should not assume that it can take page references on it.
Beyond that, an array of pages may contain mixed types, Howells said. That means that cleaning up should not be done at the lower layers. Cleanup should instead be the responsibility of the caller.
A filesystem that does direct I/O will use an iov_iter to pass its buffers to a
lower layer, but that layer does not know what that memory is. It is "a
random set of user addresses and you don't know that you can pin them
". In
addition, readahead and writeback do not know how many pages or folios
there are in an iov_iter that references the page cache. Those operations have to iterate through the list to count them. Things are even worse if writeback_iter() is used, he said: it needs to traverse the page-cache pages once to flip the dirty bits, again to create an ITER_BVEC iov_iter, and a third time to copy the data there.
Christoph Hellwig said that he did not really follow the problem for writeback as described, which may be because he comes from a block-layer perspective. Howells, Hellwig, and Matthew Wilcox had a rapid-fire discussion about the problems reported; Howells said that he is encountering the problems with network filesystems. Both Hellwig and Wilcox suggested that Howells was trying to optimize for a corner case, which is something that should be avoided; if the code works correctly, it can be slow for cases that rarely happen.
Howells then turned to the crypto API, which uses scatter-gather
lists; he would like to switch that to use iov_iter. Wilcox said that was a good
idea, since kernel developers want to get rid of scatter-gather lists. Howells's idea
is to add a temporary ITER_SCATTERLIST type for iov_iter as a bridge to
convert crypto drivers. Hellwig strongly recommended avoiding that
approach, saying that previous experience shows that other developers sometimes
start using a transitional feature, which makes it hard to remove it down the
road. He was concerned that direct-rendering-manager (DRM) or dma-buf
developers would start using it; "I don't want to give them that rope to
hang themselves.
"
Duplicating the crypto APIs using iov_iter and slowly
converting all of the crypto pieces to use the new ones was a better
approach, Hellwig said. It is only needed for parts of the crypto layer that
are implementing the asynchronous APIs, "which is actually not that
much
". Howells disagreed, saying there were lots of places in the
crypto subsystem that needed the changes and that not all of it was in C
code. Hellwig said that the assembly code operated at a lower level so it
was not really a concern; he offered to lend a hand with the conversion.
Howells and Hellwig went back and forth about problems that Howells is
trying to solve in the interaction between various subsystems, including
networking, crypto, block, and memory management, which have led to all of
the different ITER_* types and to some developers wanting (or
needing) to add more. Hellwig said that the underlying problem is that the
various subsystems cannot agree on a single common way to "describe
chunks of physical memory
", because "in the end, that's what all of
the kernel operates on
". Most of that is RAM, but there are other
types as well. Without a kernel-wide agreement on what that
description should be, there will be a need to convert between all of the
different representations.
Most people seem to think that representation should be pairs of physical
addresses and lengths, perhaps with a flag, he said. That is not quite
what a bio_vec is yet, but that is
"the structure we think we can turn into that soon-ish
". Then there
will be a need to get all of the subsystems to use that; in some cases, it
may make sense to "sugarcoat that in an iov_iter
", but most of the
low-level code should be operating on bio_vec (or whatever the
name ends up being) objects.
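For reference, the structure being discussed currently looks roughly like this (paraphrased from include/linux/bvec.h); the point of the discussion is that it still carries a struct page pointer, rather than the bare physical address and length that Hellwig described:

```c
/* Paraphrased from include/linux/bvec.h in current kernels. */
struct bio_vec {
	struct page	*bv_page;	/* still a page pointer today */
	unsigned int	bv_len;		/* length of the fragment in bytes */
	unsigned int	bv_offset;	/* offset of the fragment in the page */
};
```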
Howells did not really see the path to getting to that point and wanted to talk about less-long-term solutions. He and Wilcox went back and forth for a while without seeming to make any real progress in understanding each other. Along the way, it became clear that there is some unhappiness because it seems like the networking-subsystem developers are unwilling to work with other parts of the kernel to solve these big-picture problems; it was unclear where things go from here—at least to me.
Improving Fedora's documentation
At Flock, Fedora's annual developer conference, held in Prague from June 5 to June 8, two members of the Fedora documentation team, Petr Bokoč and Peter Boy, led a session on the state of Fedora documentation. The pair covered a brief history of the project's documentation since the days of Fedora Core 1, the challenges the documentation team faces, and plans to improve Fedora's documentation by enticing more people to contribute.
I did not attend Flock in person, but watched the talk the day after it was given via the recording of the live stream from the event. The slides for the talk were published in PowerPoint format, but I have converted them to PDF for readers who prefer that format.
One piece of important information about the documentation team was not addressed until late in the talk—namely, the fact that the team's charter is to coordinate content and maintain tooling for Fedora documentation, with contributions from the larger community. Many people might assume that the team is responsible for writing all of the documentation, but that is not the case. The team is essentially there to help facilitate other people's work creating documentation, and to help with publishing of the documentation on docs.fedoraproject.org. This is not to imply that some team members don't do both—Boy also contributes to documentation for Fedora Server, for instance—but creation of documentation is not the team's mission. The Fedora wiki is also outside the team's scope.
Boy started the talk off by introducing himself as a social scientist who works at the University of Bremen. He said that he has been using Fedora since its first release and that he joined the documentation team in 2022, during an initiative led by Ben Cotton to revitalize the documentation effort. Bokoč, a Red Hat employee, said he had been working on Fedora documentation since about 2013.
The roller coaster
Bokoč said that Fedora documentation had a strong start "for
about ten releases
" because it was based on Red Hat's
documentation. After that, it was downhill because "Red Hat's
publishing toolchain was insane
". Red Hat was using the DocBook XML-based
markup language for its documentation and Publican
to publish it. New contributors, he said, would join IRC to ask
questions and be given little guidance on how to get started. For
instance, a newcomer might be told "OK, well, install Vim
" and
be expected to adapt to writing documentation in XML with tools that
are not suited for new users. That was, unsurprisingly, not hugely
successful in attracting or retaining contributors.
The Red Hat employees who had contributed to documentation started leaving for greener pastures or switching roles inside the company. When Bokoč joined in 2013, there were only three people actively contributing to the documentation, and that remained the case for quite a while. In 2018, Bokoč moved to working on Fedora documentation full-time, and in 2022 Cotton—who was Fedora's program manager at the time—led an effort to improve the documentation toolchain. The tools and processes are now much more lightweight, Bokoč said. Instead of DocBook, contributors can work in AsciiDoc and use Antora to preview the documentation, which he described as faster and less annoying than working with Publican. Having dabbled with DocBook and Publican myself, I can certainly agree that they would not be my tools of choice in 2025.
After some initial excitement and interest, contributions once again
tapered off, picked back up again, and tapered off once more. "This is
the story of Fedora docs, right? It's a roller coaster, and we would
like to get off the roller coaster.
" The slides for the
presentation say that, currently, "docs are near dormant except for
a few long-time contributors
".
Things are not all bad, though. In 2022, the Fedora documentation site was
redesigned and gained Fedora Quick
Docs for "micro-articles focused on a specific thing, like 'my
drivers don't work, I have an NVIDIA card, how do I fix
this?'
". The quick docs are now the primary area for contribution,
he said.
In the past, much of Fedora's documentation was basically adapted
from Red Hat Enterprise Linux. "You know, somebody
just search and replaced RHEL for Fedora, which sucks because the
target audience for Fedora is wildly different.
" Now, however,
the teams producing Fedora editions are the main drivers of
documentation. This is especially true of the IoT, Server, and Workstation editions, whereas
the Cloud edition's documentation
"needs a lot of love
" according to the slides.
Problems
What is missing, Bokoč said, is an overarching strategy. The
project has a big gap in documentation for "specific,
simple-but-common tasks that people ask about a lot
". The quick
docs cover some of that, but many things users look for are
missing. The Fedora forum
helps somewhat, but it requires users to create a Fedora account to
log in just to ask a drive-by question. Most people, he said, don't
want to do that. "They just want to google it and find an
answer
".
The problem is the high attrition rate, he said. Most people are
"drive-by contributors
" who don't stay long. They show up
and make a few contributions, "which is awesome
", but they
don't stay. That is a problem. He attributed some of that to the
scattered nature of Fedora's documentation repositories. Some are
hosted on GitHub, some on GitLab, and some on Fedora's
infrastructure. There is a way to find the source; each page of
documentation has a button on the right-hand side of the page
that leads to its repository. "But nobody notices those. I don't
know what to do about those, like make them flash red or
something?
"
Another problem is that when someone does come in with a
contribution, it may take a while for a member of the docs team to
review it. "This is mostly my fault, and I'm a lazy
bastard
". That is, he admitted, demotivating for
contributors. After doing the work to make a first contribution, a
person's pull request just sits there "and it looks like nobody actually
cares
". Then they go away, and there is a good chance that they do
not come back. He said that he had been getting better at reviewing
pull requests more quickly, but that it had been a problem for a
while.
Finally, there is a ton of existing content, but almost no one knows everything that is available or which parts have become outdated. The team, he said, lacks the people to go through all of the existing documentation and throw out what is no longer valuable.
Solutions
Boy said the project needed to give up its "purely technical
perspective
" on solving contributor problems. "We have to take
care of group-building and community-building tools
". He noted
that the team needed to be able to integrate the work of many people
with different skills, over a longer period of time. Though he didn't
say this as concisely, the obvious point was that Fedora cannot depend
on single individuals willing to do heroic amounts of work to deliver
complete documentation. It has to find ways to organize and assemble
smaller contributions into coherent works.
He also noted that Fedora should provide more automation for
tedious work, and Bokoč agreed that with few people stepping up as new
contributors, "we should really do a better job on like trying to
keep them in the project as long as possible, which means making them
happy, right?
" One example of that, he said, might be a live
preview for pull requests so that people would see what their
contribution would look like live on the site when it is merged.
Bokoč also said that (ironically) the documentation team's
contributor documentation needed improvement. There is some,
he said, but it was outdated and unnecessarily long, "so nobody
wants to read it
".
Another thing that could provide some motivation would be to offer badges for contributions. The impact would be limited because nobody would write
tons of documentation just to get badges, he said, "but it
definitely can't hurt
". Bokoč added that there had been
documentation badges available beginning in 2015, but "they broke
almost immediately and nobody fixed them
".
With that, the presenters turned the floor over to questions. One
audience member wanted to know what areas the documentation team most
wanted help with. "If we had 20 people willing to work on docs,
what would be your priority to fix?
" Boy reminded the audience
that the documentation team were generally the editors, not owners, of
the documentation. People should write about the areas that interest
them. He did note that the team is responsible for some
general-purpose documentation, such as the Fedora administration
guide and kernel
documentation, which are in need of help.
Another person asked if there were any plans to have a chatbot that
Fedora users could simply ask questions to rather than searching the
documentation. Bokoč said that there are no plans for
that. "Unfortunately, it took five years to implement proper search on
the site, so I hope that answers your question.
" The session wound
down shortly after that.
Important but unloved
Fedora's documentation woes are hardly unique to the project. Documentation for most Linux distributions and open-source projects is lacking—and it is not difficult to see why. Even when developers are paid to work on open source, documentation is often left as an exercise for volunteers. Fedora's documentation problems are made even worse by the fact that it has a rapid development cycle with many people paid to develop software, and hardly anyone paid to help document that work. In that scenario, even when documentation is written, it often ages poorly as things evolve. Good documentation is like a garden; it has to be continually tended to in order to feed the community.
It is nice to see projects take interest in improving tooling and processes for documentation volunteers; one of the Fedora foundations is "friends", and it's not friendly to make people work with XML. But it would be even better if companies that benefit from open source (not just Red Hat) took a greater interest in funding the documentation that accompanies the software.
People who have the requisite knowledge to produce high-quality documentation are in short supply. People who have the skills and wish to deploy them in a volunteer role over the long haul are in even shorter supply. As long as documentation is treated as second-class to code, community building and tooling improvements will only go so far; projects like Fedora will probably have to keep riding the roller coaster.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Android tracking; /e/OS 3.0; FreeBSD laptops; Ubuntu X11 support; Netdev 0x19; OIN anniversary; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.