Leading items
Welcome to the LWN.net Weekly Edition for June 5, 2025
This edition contains the following feature content:
- OpenH264 induces headaches for Fedora: patent concerns prevent the release of an update for an important OpenH264 security fix.
- Safety certification for open-source systems: two Linaro Connect sessions on the use of open-source software in safety-critical systems.
- Out of Pocket and into the wallabag: an alternative "read it later" service.
- The first half of the 6.16 merge window: what's coming in the next major kernel release.
- Block-layer bounce buffering bounces out of the kernel: removing some ancient legacy code from the kernel's block subsystem.
- Hardening fixes lead to hard questions: an inadvertently edited repository creates a momentary panic.
- Device-initiated I/O: developers at LSFMM+BPF 2025 discuss letting devices initiate peer-to-peer I/O transfers.
- Two sessions on faster networking: LSFMM+BPF discussions on optimizing the network stack.
- Reports from OSPM 2025, day three: the final day of reports from the 2025 Power Management and Scheduling in the Linux Kernel Summit.
- The importance of free software to science: why open code and file formats are key to high-quality science.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OpenH264 induces headaches for Fedora
Software patents and workarounds for them are, once again, causing headaches for open-source projects and users. This time around, Fedora users have been vulnerable to a serious flaw in the OpenH264 library for months—not for want of a fix, but because of the Rube Goldberg-style process for distributing the library to Fedora users. The software is open source under a two-clause BSD license; the RPMs are built and signed by Fedora, but the final product is distributed by Cisco, so the company can pick up the tab for license fees. Unfortunately, a breakdown in the process of handing RPMs to Cisco for distribution has left Fedora users vulnerable, and inaction on Fedora's part has left users unaware that they are at risk.
OpenH264 background
The Advanced Video Coding (AVC) video codec, often referred to as H.264, is a video-compression standard meant to provide reasonable video quality at lower bitrates than previous standards. It is widely used for video encoding and playback, not only for watching cat videos online but also for video conferencing. Various patents that apply to the standard were held by MPEG LA, which was later acquired by the Via Licensing Alliance. One public presentation from 2013 indicates that the going rate to acquire a license to appease the patent holders was $0.20 per unit (after 100,000 units) or $25 million a year, whichever was cheaper.
In 2013, Cisco announced the OpenH264 project, which not only provides source and binaries that implement the H.264 standard, but also pays the licensing royalties through an arrangement that Cisco made with MPEG LA. However, there's a catch: a project or user has to use the binaries that are distributed by Cisco to take advantage of its deal with the license holders.
Projects cannot simply take the source and ship it with their software—at least, they cannot if they want to be certain that the patent holders won't come looking for a payout. If, say, Fedora actually hosts the repository with OpenH264 packages for its users, then it—or its users—will be responsible for the licensing fees. Since Fedora is not an independent entity, Red Hat would be responsible for the fees as the sponsor of Fedora. As it happens, Red Hat has (quite reasonably) been unwilling to roll the dice and ship OpenH264 in the hopes that it would escape the gaze of the license holders. It has been equally reluctant (also reasonably) to open its wallet for a fee that may well run in the millions of dollars each year.
In 2014, LWN covered some of the early problems that Fedora had with inclusion of OpenH264, and noted that Christian Schaller was working with Cisco on a way to build OpenH264 on Fedora's infrastructure for its users. Ultimately, as described on Fedora's OpenH264 page, Fedora was able to include the fedora-cisco-openh264 repository metadata, starting with Fedora 24, that would point users to Cisco-hosted packages built by Fedora. That included packages to use H.264 with GStreamer and Fedora's Firefox package. The repository was enabled by default with the Fedora 33 release.
Given the age of the standard, one might hope that the patents that apply to it would have expired or be close to expiring—but it's not that simple. As Michael Catanzaro explained in a 2023 discussion about OpenH264, it is hard to be certain that any current implementation of the standard is clear of patent encumbrances:
Unfortunately H.264 is extremely complicated because there are so many revisions to the specification, so even once the patents covering the original specification have expired, figuring out whether a particular decoder is legal or not still requires substantial technical expertise in addition to legal expertise. And that can change if the decoder implements any newer features in the future, which makes for an extremely challenging problem. That is to say, do not expect other H.264 decoders to be allowed in Fedora even when all patents covering the original spec have expired.
Thus, despite the fact that it is more than 22 years old, H.264 remains a patent headache, and Fedora is still having to perform a delicate dance with Cisco to avoid having liability for patent fees. Unfortunately, it takes two to tango—or, in this case, release OpenH264 packages.
Delays
CVE-2025-27091 was issued on February 20. It describes a vulnerability in the decoding function of OpenH264 that could allow a remote attacker to trigger a heap overflow in the library. If an attacker can entice a user into playing a video that exploits the vulnerability, it would be theoretically possible for the attacker to execute arbitrary commands on the victim's system. The CVE was given a severity rating of 8.6 out of ten. The upstream project had released OpenH264 2.6.0, with a quiet fix for the vulnerability, on February 12.
Patrik Polakovič opened a ticket to update OpenH264 the same day that 2.6.0 was released. Unfortunately, there was a mismatch in the shared-library version. The Makefile for OpenH264 specified SHAREDLIB_MAJORVERSION=8, but the meson.build for the project was set to major_version = '7'. That stalled things on the Fedora side while trying to sort out the right library version—and before it was understood that there was a security vulnerability at play. Catanzaro bumped the ticket with a comment about the security advisory on February 24.
Following that, several tickets were opened to track progress on properly building the packages with the security fix; those were ultimately consolidated in a single ticket on February 27. The first set of packages was built the same day, but Catanzaro asked to hold off on sending them to Cisco for distribution while a 2.5.1 version release with the fix was created to avoid bumping the ABI version from 7 to 8. But it was not possible for Fedora to create its own version due to the patent issues.
Wim Taymans updated the ticket on March 12 to say that there were completed builds for Fedora 40, 41, and 42, as well as EPEL 10. The builds are visible in Koji, Fedora's build system, but they are not available for download. If users try to access the builds using Koji's download links, they will be directed to the Non-distributable-rpms page that explains that Fedora cannot distribute the RPMs for "various legal reasons".
After some discussion, Polakovič reported that he had tried to send the packages to Cisco as email attachments but received an error that "Red Hat Mail does not allow you to use this type of file as it violates Google policy for executables and archives". He asked later if Catanzaro knew how to get the RPMs to Cisco:
These licensing issues are the source of so much frustration. Ordinarily we'd just get the build and that's it. It's insane how many hoops we have to go through.
More discussion and complaints about the legal silliness followed. No one seemed to know how to achieve the task of alerting Cisco to the builds and ensuring that only Cisco employees could download the builds. In early April, Neal Gompa offered the contact information he used when communicating with Cisco from the openSUSE project, which followed Fedora's lead in distributing OpenH264. On April 21, Polakovič said that he had still not received a reply from Cisco and was not sure how to proceed.
Remove OpenH264?
More time passed and, on May 8, Catanzaro said that it might be time to remove OpenH264 from Fedora "due to the high risk level and inability to release updates". Kevin Fenzi replied that he finally had contact with Cisco, "so I am hoping that this will be unblocked soon". Another update from Polakovič on May 13—three months after OpenH264 2.6.0 was released with a fix for the vulnerability—indicated that Cisco had provided Fedora with a link for uploading and that the packages were successfully submitted.
On May 28, Fenzi updated the ticket status to affirm that Fedora was still waiting on Cisco to update its repository.
The problem has also been raised and discussed on the fedora-devel mailing list. Chris Adams first asked at the end of April how to remove or replace the openh264 package, due to the vulnerability, without removing other packages. In late May, Jonathan Schleifer thanked Adams for raising the topic: "I had no idea Fedora would let something so serious be unfixed for so long". Stephen Smoogen responded that this was a predictable problem:
There is no incentive for the "partners" to fix these sorts of problems and there will always be a lot of incentive to put it off another day for whatever internal fires are going on. At this point, I think we should acknowledge that we as a project made a mistake and figure out how to fix this for our users.
Catanzaro said that Fedora was between a rock and a hard place. Without OpenH264, Fedora would have to "point to RPM Fusion and hope [users] can figure out how to get what they need from there". He later said that Fedora would need to remove OpenH264 if it cannot fix security issues in a timely manner: "If we knew that it would take this long to update, we would probably have done that already."
Current status
To date, it is unclear when updated packages will be available to Fedora users. The Fedora project has no good options here. If it does not ship an H.264 implementation, its users are left to fend for themselves, which leads to complaints about complexity and a lack of parity with other Linux distributions or operating systems. It does not have the luxury of thumbing its nose at the patent cartel, nor a benefactor that wants to appease the cartel with bags of cash so that Fedora could supply OpenH264 to users directly. The Cisco arrangement, on paper, would seem to be the best option.
In practice, though, it has left Fedora unable to push an update to protect its users. But what is within Fedora's control is how it communicates with its users. It is mystifying that Fedora has not issued an advisory to warn its users that they are exposed to a security vulnerability. There may not be an easy fix for Fedora to provide, but it could advise of workarounds or at least ensure that users are aware that they are vulnerable and allow them to seek their own remedies.
This situation demonstrates, once again, the fragility of depending on a corporate benefactor to provide a service. Just because a team at a company is well-staffed and offering to lend a hand today does not guarantee that will be the case tomorrow or the day after. Management can have a change of heart, priorities will shift, people may get busy or leave for other jobs, institutional knowledge can be lost, and then things fall through the cracks.
This story has been replayed in open-source projects enough times to be familiar to all but the newest folks in the community. We've all seen this movie before, and it does not improve with repeated viewings or remakes.
Safety certification for open-source systems
This year's Linaro Connect in Lisbon, Portugal featured a number of talks about the use of open-source components in safety-critical systems. Kate Stewart gave a keynote on the topic on the first day of the conference. In it, she highlighted several projects that have been working to pursue safety certification and spoke about the importance of being able to trace software's origins to safety. In a talk on the second day, Roberto Bagnara shared his experience with working on one of those projects, the Xen hypervisor, to conform to a formal set of rules for safety-critical code.
Automotive open source
Stewart is the owner of a 1996 Volvo car. Its air conditioning failed earlier this year, she explained, so with no replacement parts to be found and summer in Texas fast approaching, she recently bought a new car. Her original car is just hardware, she said, but now software is playing an increasingly important role in modern automobiles.
The difference is obvious just from comparing the dashboards: Stewart's original Volvo has physical dials and knobs, but her new Honda has displays and touchscreens. The obvious differences aren't what worry her, though. New cars also have backup cameras, radar, assistive lane following, and many more electronic features that have a real impact on the safety of the vehicle. A car is, after all, "a couple of ton piece of equipment that could easily kill someone". All of this new hardware needs to be powered by software — and a lot of that software is open source.
So how can we ensure that a complicated system like a modern car is safe? Traditional supply chains for safety-critical systems have ensured this using "needs" that are fulfilled with "requirements", Stewart explained. A component is designed out of smaller components. Assuming that the smaller components meet some standard, the engineer designing the larger component is responsible for making that component safe. The assumptions that a larger component makes about its constituents are called needs; the standards that a smaller component meets are called requirements. Every need is matched with a requirement, to produce a graph of components where, at every step, it is clear exactly what is required for the overall system to be safe.
This doesn't really work for open source. 59% of the software in Stewart's new car is open source, she said, according to a disclosure by Honda. That's great, but those components may not be designed for use in safety-critical systems. Even if they are, they're exceedingly unlikely to have reliable documentation, certified by an engineer, that lays out their safety needs. That's not something that open source has really been designed to handle.
So how can these gaps be addressed? The first step, Stewart said, is just to know what components are actually going into a system. For physical hardware, this is usually called a bill of materials (BOM); a software bill of materials (SBOM) is similar. The SBOM lists the exact version of every dependency used by a piece of software.
Safety standards for software are mostly concerned with trying to "minimize and mitigate systemic faults in the code base for an application". In order to do that, they need some way to track what the components in a system are, what faults have been discovered with them, which bugs are known, and how all of this changes over time. Today, "it's being done, quite frankly, with spreadsheets", which is "really inefficient".
Since 2009, Stewart has been involved with the System Package Data Exchange (SPDX) project, which creates standardized formats and tools for exchanging and processing SBOM information. For version 3.0 of the SPDX format, she has been specifically working to include the information needed by someone trying to comply with a safety standard. In particular, SPDX 3.0 focuses less on a static document describing a particular component, and more on relationships between components. This allows SBOM tooling to capture all kinds of information about source code, build information, licensing, datasets, AI models, provenance, and everything else that goes into understanding what is part of a software system.
The next hurdle is actually persuading open-source projects to generate SBOM information. Stewart has been trying to make this as easy as possible with SPDX tooling, and many build systems can be configured to generate SBOMs with few changes. Still, like with any change in open source, it does require someone to actually do the work. Once a project has an SBOM, there is existing tooling, from the SPDX project and elsewhere, that can process the SBOM to produce various useful reports or track changes over time.
There are more potential next steps. Several projects, especially in the embedded space, have been working to be formally certified under various safety standards, including ELISA (where Stewart serves on the technical steering committee), Zephyr, Xen, and Yocto. But adopting SBOMs is a good place to start — and Stewart concluded her talk with the hope that "we'll get there soon."
MISRA compliance in Xen
MISRA, the Motor Industry Software Reliability Association, is perhaps best known for its extensive set of rules on how to produce high-quality software: the MISRA C guidelines. Bagnara is a member of the MISRA committee, and works with bugseng, a company that produces tools to help with MISRA compliance, among other things.
The biggest benefit of following MISRA's coding guidelines comes when a project enforces them from the beginning, Bagnara said, but that is often not possible, especially when an open-source project is started independently and only later incorporated into a safety-critical system.
Open-source maintainers generally take pride in their software; many projects have existing standards for their code. The problem is that these existing standards rarely match up with what a functional-safety standard mandates. In many cases, artifacts showing that these standards have been followed — such as tests or documentation on requirements — are also missing. This is understandable, since open-source projects are largely driven by volunteers, only a subset of whom will care about safety qualification.
There are also more minor problems. For example, open-source software is often highly configurable, and any given project only cares about the safety of its specific configuration. Also, development of open-source software often moves much faster than traditional safety-critical software does, which makes keeping up to date difficult.
So, faced with this dilemma, with the obvious benefit and utility of open-source software on the one hand, and its attendant challenges on the other, what can be done? Well, we could give up on certifying the software itself, Bagnara said. For example, a project could instead qualify some kind of monitor or hypervisor that ensures the correct operation of the rest of the software. Better than that would be to fork the project. Everybody wants to avoid permanent forks, but a temporary fork is probably inevitable when working on a large change like safety qualification.
The most challenging and most rewarding approach, however, is "retrofitting safety", a topic that he has been studying for some time. Projects should create or adapt a coding standard after the fact, he said, and let the subset of volunteers who care about such things do the work of getting the existing code up to snuff.
That's what the Xen hypervisor project chose to do. As a hypervisor, Xen's safety underlies the safety of many other components in products that incorporate it. In order for Xen to be used in safety-critical systems, Bagnara has been working to bring Xen's existing coding guidelines in line with the MISRA C standard.
There are a lot of misconceptions about MISRA, Bagnara said. Although he is a member of the MISRA committee, he emphasized that his personal responses to these common misconceptions should not be understood as the official position of the committee. People frequently think that the MISRA rules are too strict, he said, to the detriment of actually making code that works. That's not the case; for MISRA, code quality always comes first. If a rule makes it harder to write quality code for some reason, there is a "deviation" process that lets the developer justify departing from the MISRA guidelines.
An example that came up while working on Xen was MISRA rule 10.1 ("Operands shall not be of an inappropriate essential type"). Among other things, the rule disallows some implicit casts, which are a frequent source of subtle errors in C programs. But there are implicit casts that are safe; in Xen, this mostly appeared in the form of value-preserving conversions of integer constants. Rather than pointlessly requiring all of these places in the code to be marked with a manual cast, "rule 10.1 can be tailored by regarding those instances as safe", Bagnara explained, as long as there is a tool that can be used to check the places where the implicit casts are used.
Another common complaint he sees is that MISRA rules mix too many concerns under a single rule, and that this makes them complicated. "We have more than 200 rules", he explained. "We don't want 2,000 rules." So each rule has to be fairly general, to cover all of the relevant cases.
The general nature of the rules is also a reason some projects will need to deviate from them. Xen uses a number of bit-twiddling tricks, as might be expected from low-level software. For example, there were four places in the Xen code base that used this trick, which contravenes a MISRA rule, without explanation:
variable &= -variable
This construction isolates the least-significant set bit of a value by using the fact that in the common two's complement representation of numbers, the least-significant set bit is the only one shared by both a number and its negation. Bagnara moved it into a macro with a meaningful name and an explanatory comment. That macro is covered by a global deviation that allows the project to use it wherever it is needed; the code is now easier to read for anyone who hasn't seen this particular construct before.
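As a sketch, the cleaned-up version might look something like this (the macro name and comment here are invented for illustration; Xen's actual macro differs):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Isolate the least-significant set bit of a value. In two's
     * complement, negating a value flips every bit above its lowest
     * set bit, so x & -x leaves only that bit standing.
     */
    #define LOWEST_SET_BIT(x) ((x) & -(x))

    int main(void)
    {
        uint32_t v = 0x58;                    /* binary 0101 1000 */
        printf("0x%x\n", LOWEST_SET_BIT(v));  /* prints 0x8 */
        return 0;
    }

A single deviation attached to the macro then covers every use, rather than each raw occurrence of the expression needing its own justification.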
This kind of thing is generally needed in order for an upstream project to accept changes, he said. Maintainers don't want to accept lots of tiny changes in the name of MISRA compliance; they would much rather accept some documentation and additional tooling to justify and check the needed deviations from MISRA rules, instead.
In some cases, that did require making improvements to the proprietary static analyzer that Bagnara works on and with, ECLAIR. For existing high-quality code such as Xen, he said, someone working on MISRA compliance can expect to create more deviations than in a green-field project, because the code is likely to be correct already. Once those deviations have been accounted for and checked, however, the remaining problems highlighted by MISRA rules stand out even more. In the end, attempting to adopt MISRA for Xen did actually improve the code.
Today, Xen is "nearly 100% MISRA compliant". As always, the last portion of the work will be the hardest part. Bagnara is optimistic, however, that Xen will soon be completely MISRA compliant, and that this will make its inclusion in products with functional-safety requirements much easier.
[Thanks to Linaro for covering my travel expenses to Linaro Connect.]
Addendum: Videos from Stewart's keynote and Bagnara's talk are now available. The slides can also be found on Linaro's website.
Out of Pocket and into the wallabag
Mozilla has decided to throw in the towel on Pocket, a social-bookmarking service that it acquired in 2017. This has left many users scrambling for a replacement for Pocket before its shutdown in July. One possible option is wallabag, a self-hostable, MIT-licensed project for saving web content for later reading. It can import saved data from services like Pocket, share content on the web, export to various formats, and more. Even better, it puts users in control of their data long-term.
About wallabag
Social bookmarking was first made popular by a service launched in 2003, called del.icio.us (later just "Delicious"), that allowed people to save, store, and share their web bookmarks online. It was followed by a handful of similar services, including Read It Later, which launched in 2008 and upped the ante a bit: it not only saved a page's title and URL, it stashed the content on its servers for later access (hence the name). Read It Later became Pocket in 2012.
In 2013, Google announced that it was killing off Google Reader, its RSS-feed-aggregator service. This alarmed Nicolas Lœuillet, who worried that the same thing might happen to Pocket. To ensure he had a home for his saved articles and data he began work on a project for self-hosting saved web content called "poche", which is French for "pocket". He renamed it to wallabag in 2014 following some trademark unpleasantness.
The project consists of a web application written using the Symfony PHP framework, as well as a number of client applications and browser extensions to save data to wallabag or fetch articles for reading. The ecosystem page on GitHub has a full list of applications provided by the project, as well as "unofficial" clients that are written by others.
Users should have little trouble finding clients to fit their needs on the desktop or a mobile device. The project has Firefox and Chrome extensions as well for saving pages directly from the browser. Folks who use unsupported web browsers can use the JavaScript bookmarklet for wallabag to save pages without using an extension. Wallabag has official Android and iOS clients which are open source, though the Android app is licensed under the GPLv3 rather than the MIT license. There are several e-reader clients for reading content that has been saved to wallabag, a GNOME application called Read It Later, a command-line client, and even an emacs client. If none of the existing clients quite meet one's needs, the wallabag API seems well-documented for those who would like to write their own client, and there are API wrappers in Go, Java, JavaScript, and Rust.
Wallabag server
The project has installation instructions for running wallabag on shared hosting (where one might not have root access) or on a dedicated server. The project supports MySQL/MariaDB, PostgreSQL, or SQLite as the backend database. There is also a container image for use with Docker, Podman, or other container managers. Users who do not want to mess with self-hosting, at least not right away, can opt for paid hosting from third-party providers, including wallabag.it which is run by Lœuillet.
For my testing, I chose to run the container image with Podman, with SQLite as the database using this command:
$ podman run -p 8282:80 \
      -e "SYMFONY__ENV__DOMAIN_NAME=http://localhost:8282" \
      -v /home/user/wallabag/data:/var/www/wallabag/data:z \
      -v /home/user/wallabag/images:/var/www/wallabag/web/assets/images:z \
      wallabag/wallabag
This starts the container and forwards localhost port 8282 to port 80 in the container (-p 8282:80). The documentation for the wallabag container image unfortunately omits the port number from the environment variable that specifies the domain name for Symfony (-e "SYMFONY__ENV__DOMAIN_NAME=http://localhost"). If the container is started without that, it will mostly work, except wallabag will be missing its CSS—which is not great if one wants to use the web interface.
The -v options create persistent storage volumes for the container; users should replace /home/user with the appropriate path on their system. Note the :z suffix at the end of the path specification for the volume mounts. That tells Podman to set the correct SELinux labels for the directory. Without that, if SELinux is enabled, the container will not be able to write into those directories and the container will not start.
Configuring wallabag
Once started, the wallabag container uses "wallabag" as the default administrative username and password. The first order of business should be to change the username and password after logging in. Clicking the "My account" menu, with the person icon at the top right of the web interface, will bring up the menu for managing wallabag. Go to "Users Management" to edit the username if desired. The password can be changed by going to "Config" and then "Password".
The first user created has full administrative privileges on the wallabag server, so it is a good idea to create a second user for day-to-day use. Additional users do not have administrative privileges, and wallabag has no way to add additional administrative users via the web interface. However, it is possible to bump up a user's privileges via console commands.
Wallabag's other server settings, such as public URL sharing and export formats, can be tweaked by going to the "Internal Settings" page from the menu. All of the sharing services that wallabag supports—such as diaspora*, Shaarli, and Unmark—are turned on by default. Unfortunately, wallabag does not have a feature specifically to share via ActivityPub/Mastodon yet, though there has been a GitHub issue open on the topic since 2017.
Using wallabag
To get this out of the way early: yes, wallabag's web interface does have a dark (and light) theme. Users have the option of choosing one or the other, or having the theme automatically follow system settings. I found the light theme more pleasant and usable, but that's likely because I prefer light themes in general.
Once wallabag is set up, it's time to start adding bookmarks. In the web interface there is a plus (+) icon at the top of the page for adding new bookmarks: just click that and copy the URL to add as a bookmark. The navigation menu for existing bookmarks is on the left-hand side of the page. It has entries for unread articles, archived (read) articles, and so forth.
Wallabag has a tagging system for categorizing saved pages. Each page can have multiple tags, and wallabag allows users to browse saved pages by tag. Naturally, wallabag has search and filtering features to help users sort through their pile of bookmarked pages. Search and filter are located in the top toolbar; search is invoked with the ubiquitous magnifying glass icon, and the filter menu icon is the upside-down triangle. Pages can be filtered by status (read, unread, starred, or annotated), tags, language, domain name, creation date, and more.
It is much more convenient to save pages directly from one's web browser than to navigate to wallabag and copy the URL over manually, so most folks will want to set up the Firefox or Chrome extension right away. Wallabag requires more than a username and password pair to connect clients to the server, though: users have to create a client ID and client secret pair as well. This can be found from the menu item "API clients management".
Wallabag attempts to fetch content to cache it for later reading, but results can be mixed. As one might expect, sites that are more straightforward HTML are saved without much trouble. On the other hand, sites with complex layouts, or that are festooned with ads and JavaScript, tend to produce poor results. Wallabag can also run into problems fetching content if the page is not found, access is forbidden, or otherwise unavailable.
The fetch errors documentation explains how to enable debug logs and troubleshoot problems with incomplete or incorrect content. The project encourages users to create issues on GitHub if something is breaking the wallabag parser and the problem is reproducible.
Assuming that content has been fetched and saved correctly, wallabag not only allows reading the content in the web interface; saved content can also be exported to EPUB, PDF, XML, JSON, CSV, and plain text. It's not entirely clear to me why one might want to save a page to CSV, but the feature is there for anyone who wants it.
Wallabag's web interface also has an annotation feature that allows highlighting text and adding notes to pages. It is nice to be able to add notes to saved content, but the feature is somewhat limited in that annotations are not exported when one saves to EPUB, PDF, etc. It would be ideal if users could export annotations with their content, and separately. For example, many e-reader applications let users save annotations or highlighted phrases separately—which can be quite useful when gathering content while writing an article or report.
Importing
Wallabag can import data from Chrome or Firefox bookmarks, as well as several social-bookmarking services including Instapaper, Omnivore, Pinboard, and Pocket. With bookmarks, it is as simple as exporting one's bookmark data from Firefox or finding the bookmarks file from Chrome (or Chromium) and then going to the Import page in Wallabag to upload the file. For some services, like Pocket, it may be necessary to create an API key to import data. Wallabag's import page (go to the "My account" menu and click "Import") has short instructions for each service that it supports, and more detailed instructions in its documentation. The project also has a script for importing data from the command line.
I tested imports from Firefox bookmarks, Pocket, Instapaper, and from another instance of wallabag. I have not been a heavy user of Pocket or Instapaper, so I only had about 300 saved pages between the services. Even so, it took wallabag a few minutes to complete the import. In its documentation, the project acknowledges that "imports can take ages", and it has developed an optional asynchronous tasks system that users can set up if a server is going to be handling large imports. That is probably overkill for single-user systems where importing data is likely to be a one-time or infrequent task.
Other than speed, the import process seems to work well enough. It would be better if users had the option to add a tag to the content to identify it as imported content. For example, it'd be nice to be able to add a "firefox" tag to pages imported from Firefox bookmarks. It does allow users to mark imported content as read when importing, but it is an all-or-nothing affair.
Project
The most recent release of the wallabag server was version 2.6.12, a minor update pushed out on April 10. The last major feature release was version 2.6.0, which came out in June 2023. The most active contributors in that time frame are Lœuillet, Yassine Guedidi, and Jeremy Benoist. The project does not have a security list for announcing security vulnerabilities—they are simply announced along with releases that fix vulnerabilities. The project's security policy on GitHub is just a notice asking for vulnerabilities to be sent to the general-purpose "hello@wallabag.org" email address.
Generally, the project seems to be mature and active enough to give users confidence that it will be around for the long haul. However, it is also more informally run than some might like. Right now, many of the standard governance questions are unanswered. For instance, the ownership of trademarks, the wallabag.org domain, the GitHub organization, and its code repositories is a bit fuzzy. Also, what might happen if project creator Lœuillet were suddenly out of the picture—whether the project has a "bus factor" larger than one, in other words—is unclear from the outside.
Since Lœuillet runs a for-pay hosted wallabag service, it would be reassuring if the project had a clear statement on the separation between the project and his product. I emailed Lœuillet to ask about those topics, and he replied that these were things that the project had not previously considered. He indicated that he would follow up "very soon" with a response, and I will update the article if that reply arrives after publication.
The project maintains a set of milestones for upcoming releases on GitHub. The 2.7.0 release has no due date, but it appears to be down to a handful of unresolved issues before it is ready for release. That release will remove MOBI export, which was deprecated in 2024. 2.7.0 will also add the ability to search annotations.
For many people, wallabag will be a suitable replacement for Pocket. It has most of the same features as Pocket, though arranged differently and perhaps not as attractively. In fact, it surpasses Pocket in several ways, such as allowing export to EPUB, PDF, and other formats, which is not something in Pocket's feature set. The fact that users can run their own server is, of course, also a major selling point over any offering that depends on the whims of a company or organization that can shut down the service at any time.
The first half of the 6.16 merge window
As of this writing, 5,546 non-merge changesets have been pulled into the mainline kernel repository for the 6.16 release. This is a bit less than half of the total commits for 6.15, so the merge window is well on its way. Read on for our summary of the first half of the 6.16 merge window.
As always, the LWN kernel source database provides summary statistics and historical breakdowns for subscribers. Here are the most interesting commits of the 6.16 merge window so far:
Architecture-specific
- Five-level page-table support is now unconditionally enabled for x86_64. "Both Intel and AMD CPUs support 5-level paging, which is expected to become more widely adopted in the future. All major x86 Linux distributions have the feature enabled."
- PowerPC now supports dynamic preemption (i.e., changing the kernel's preemption settings at boot time).
- The intel_pstate driver now registers an energy model for use with energy-aware scheduling on hybrid platforms without symmetric multithreading (also known as hyperthreading).
- Users can now control C1 demotion (the process whereby a CPU can independently decide to remain in a higher power state when the kernel tries to enter a lower one) with a sysfs knob, and retrieve information on the capacity of different CPUs.
- Arm64 lazy-preemption support and scalable-matrix-extension support have both been merged.
Core kernel
- The kernel will now hand out pidfds for processes that have already exited: when the SO_PEERPIDFD socket option is used to request a pidfd for a thread-group leader that has already been reaped, the request succeeds anyway. Since user-space code needs to handle processes dying while it holds a pidfd in any case, this should simplify error handling; see the sketch after this list.
- Core dumps can now be sent to an existing Unix socket, instead of being written to a file or spawning a user-mode helper. Christian Brauner hopes that this will reduce the number of CVEs related to the user-space core-dumping API.
- io_uring can now be used to create pipes.
- Futexes are now NUMA and mempolicy aware. This gives greater control over where futexes are placed in memory, so that they can be located close to the processes that will use them.
- There is now a command-line option for enabling or disabling group scheduling of realtime tasks.
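A minimal sketch of how user space might use the SO_PEERPIDFD behavior mentioned above (the helper name is mine, and sock is assumed to be a connected Unix-domain socket):

    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef SO_PEERPIDFD
    #define SO_PEERPIDFD 77   /* asm-generic value; older C libraries may not define it */
    #endif

    /* Ask the kernel for a pidfd referring to the peer's thread-group
     * leader. As of 6.16, this succeeds even if the peer has already
     * exited and been reaped, removing one special-case error path. */
    int peer_pidfd(int sock)
    {
        int pidfd;
        socklen_t len = sizeof(pidfd);

        if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0) {
            perror("getsockopt(SO_PEERPIDFD)");
            return -1;
        }
        return pidfd;   /* the caller must close() it eventually */
    }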
Filesystems and block I/O
- The bfs and omfs filesystems now use the new mount API. The API, added in 2019, has slowly been adopted by the kernel's many filesystems.
- A new sysctl knob, vfs_cache_pressure_denom, indirectly controls the number of dentry cache entries ('dentries') that are preserved while the system is experiencing memory pressure. Specifically, the minimum proportion of dentries reclaimed during a memory reclamation event is 1/vfs_cache_pressure_denom, so setting a higher value will reduce the number of cache entries that are guaranteed to be collected, leaving more cache entries to be evicted or retained on their merits.
- Zoned loop block devices, which emulate a zoned block device using multiple files on an existing file system, are now available and documented.
- Among many bcachefs changes, the filesystem now supports only performing rebalance operations when the system has AC power.
- XFS supports atomic writes. LWN recently covered discussions about support for atomic writes in more filesystems.
- EROFS can now make use of Intel QAT hardware acceleration.
- "Stupendous" performance improvements on ext4 from a number of optimizations.
- The ancient uselib() system call, which has been deprecated for some time, has now been removed, hopefully without breaking any user-space applications in the process. The system call was used to map dynamic libraries with writing disabled, so that they could be shared between different programs. That use case is served today by calling mmap() with appropriate flags, as the sketch after this list shows.
- The block-layer maintainers have finally eliminated bounce buffering; see this article for details.
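As a hedged illustration of the mmap()-based replacement mentioned in the uselib() item above (the helper is invented for illustration; a real dynamic linker maps each ELF segment separately, with per-segment permissions):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a shared library read-only and executable. MAP_PRIVATE file
     * pages are shared between processes as long as nothing writes to
     * them, which is the property uselib() provided. */
    void *map_library(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return MAP_FAILED;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return MAP_FAILED;
        }

        void *addr = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                          MAP_PRIVATE, fd, 0);
        close(fd);   /* the mapping keeps the file's pages referenced */
        *lenp = st.st_size;
        return addr;
    }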
Hardware support
- Clock: System Timer Modules on S32G NXP SoCs, EcoNet HPT timers, and Analog ADP5055 digital to analog converters.
- GPIO and pin control: EcoNet EN751221 SoCs, SG2044 SoCs, loongson, mc33xs2410 high-side switches, rzg2l-gpt pulse-width-modulation controllers, max77759 companion PMICs, VeriSilicon BLZP1600 GPIO interfaces, and Spacemit K1 SoCs.
- Hardware monitoring: PTC support on int340x, Airoha EN7581s, and IPQ5018 SoCs.
- Input: AMD HID2, Renesas RZ/G3Es, Rockchip RK3528s, and Samsung Exynos Autov920 processors.
- Media: MT8192 Spherion and MT8186 Corsola SoCs.
- Miscellaneous: SDHCI OF on the SpacemiT K1 SoCs.
- Networking: Qualcomm IPQ5018 WiFi chipsets.
- Power: Pegatron Chagall batteries, Maxim MAX8971 battery chargers, Huawei Matebook E Go chargers, Dimensity 1200 MT6893s, SM4450 power domains, RK3562 SoCs, Allwinner H6/H616 PRCM PPUs, and TI TPS65214 integrated power management chips.
- Sound: AMD ACP 7.x, Cirrus Logic CS35L63 amps and CS48L32 audio processors, Everest Semiconductor ES8375s and ES8389s, Loongson-1 AC'97 audio codecs, NVIDIA Tegra264 SoCs, Richtek ALC203 and RT9123 codecs, Rockchip SAI controllers, Intel WCL, and DJM-V10 mixers.
Miscellaneous
- Filesystems and EFI variables can now be frozen (and later unfrozen) on a best-effort basis during suspend and hibernate operations.
Virtualization and containers
- Control group shared tracking for recursive statistics turned out not to scale well to large numbers of control groups and has been removed.
- Virtual machines (VMs) can now communicate with a TPM device emulated by a Secure VM Service Module.
Internal kernel changes
- The timer API has undergone some significant renaming. Functions with irregular names have been converted to use the timer_ prefix. For example, init_timer_key() is now timer_init_key(). There is another large refactoring expected near the end of the merge window.
- Rust modules can now use configfs.
- Stub drivers for the nova DRM driver continue to make their way upstream.
- The virtual filesystem (VFS) interface has been cleaned up to reduce the proliferation of confusing names.
- The writepage() method has been completely removed from struct address_space_operations, along with its remaining uses.
- The resctrl filesystem interface has now been moved to its own directory, as the next step toward letting it be used on multiple architectures.
- The kernel-doc script, the origins of which predate the Git era, is used to extract documentation from the kernel source during the documentation-build process. Prior to 6.16, it was a horrifying Perl script full of impenetrable regular expressions. That script has been replaced with a Python version that is better integrated into the Sphinx build system. The regular expressions are no more penetrable than before, but the script as a whole will be far more maintainable.
- The kernel's build scripts had a separate option to handle specifying compiler flags that disable warnings, due to some inflexibility in the option-parsing code. The cc-disable-warning option is no longer required; it can be replaced by the normal cc-option.
- There is a new DMA mapping API, intended to provide an alternative to scatterlists. While it has taken some time to be finalized, the new API should provide better performance for some high-bandwidth DMA devices.
- The idle-CPU-selection logic for sched_ext can now apply topology-based optimizations in more cases, including to tasks with CPU affinities.
There are 5,379 non-merge commits currently waiting in linux-next, so there are certainly more new kernel features to come. The merge window is expected to close on June 8. As usual, we will post a summary of the changes from the second half of the merge window once it closes.
Block-layer bounce buffering bounces out of the kernel
As the end of the 1990s approached, a lot of kernel-development effort was going into improving support for 32-bit systems with shockingly large amounts of memory installed. This being the 1990s, having more than 1GB of memory in such a system was deemed to be shocking. Many of the compromises made to support such inconceivably large systems have remained in the kernel to this day. One of those compromises — bounce buffering of I/O requests in the block layer — has finally been eased out for the 6.16 release, more than a quarter-century after its introduction.

A 32-bit pointer can only address 4GB of memory, putting a hard limit on the size of the address space that a program on such a system can use. Linux, though, includes a couple of architectural choices that limited the amount of useful memory much more severely. The 4GB virtual address space contained both user and kernel-space memory; separating those spaces would have made more virtual address space available, but at a huge performance cost. The kernel also mapped all of physical memory into its portion of the address space, making it easy for the kernel to directly access every page in the system. That meant that the kernel's portion of the address space had to be larger than the amount of physical memory that the kernel managed.
Most configurations in those days set aside the uppermost 1GB of virtual address space for the kernel, leaving 3GB for user space. But, since the kernel required that all of physical memory be mapped into its space, it could only make use of rather less than 1GB of physical memory — some of its 1GB had to be used for the kernel itself, the vmalloc area, and so on. By the late 1990s, systems with more memory than that were becoming widely available; indeed, through some hardware hackery, it was possible to put more than 4GB of physical memory into a 32-bit system. But Linux could not use that memory; this was generally seen as a bad thing.
The solution that was adopted at the time was a concept that was termed "high memory". Any memory beyond that which the kernel could address directly was placed in this class; it could be mapped into user space, but the kernel could not access it directly without creating an explicit (and temporary) mapping. As a result, high memory was generally unusable for any sort of kernel data structures. Making that memory available to user space solved a pressing problem, but it did not take long for the kernel's inability to use high memory for its own purposes to create problems of its own; there just wasn't enough low memory for the system to operate efficiently.
The 2.3.27 kernel release in November 1999 included a significant step toward mitigating this problem; specifically, it moved the page cache (a large kernel data structure) into high memory. There were many challenges that had to be handled to make this move work, one of which was that, at that time, there were many block drivers in use, and none of them had been written with high memory in mind. If one of those drivers were to be presented with a buffer that was not in directly accessible memory, the resulting explosion would not be pretty. Even for a development kernel (as 2.3.27 was), that was generally seen as a bad thing.
Fixing all of the drivers was not something that could be done in short order, so a workaround was required. Thus, with this release, the block layer gained the concept of bounce buffering. If an I/O request involved a buffer in high memory, and the driver destined to handle that request could not cope with high memory, the data would be copied ("bounced") into a low-memory buffer. For writes, the bouncing would happen prior to handing the buffer to the driver; for reads, the driver would be given a low-memory buffer that would be copied back to high memory after the operation completed. The data could live (long-term) in high memory, and the driver could continue to work unmodified.
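In outline, the write side of that logic looked something like the following conceptual sketch (the helper names are invented; the kernel's actual implementation worked in terms of its block-I/O structures, not bare pointers):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Invented stand-ins for the driver and allocator machinery. */
    extern bool driver_can_access(const void *buf);
    extern void *alloc_low_memory(size_t len);
    extern void driver_do_write(const void *buf, size_t len);

    void submit_write(const void *data, size_t len)
    {
        if (driver_can_access(data)) {
            driver_do_write(data, len);   /* fast path: no copying */
            return;
        }

        /* Slow path: bounce the data into a low-memory buffer that
         * the driver can reach, then submit that buffer instead. */
        void *bounce = alloc_low_memory(len);
        memcpy(bounce, data, len);        /* the copy nobody likes */
        driver_do_write(bounce, len);
        free(bounce);
    }

Reads ran the same dance in reverse: the driver filled a low-memory buffer, and the data was copied up to high memory afterward.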
Bounce buffering is the sort of solution that nobody likes. Copying data before or after I/O operations can only make those operations slower. The bounce buffer also increases the memory used by the operation, which can be especially problematic if the system is under memory pressure and the data is being written so that the pages it occupies can be reclaimed and reused. But, bounce buffering made it possible to move the page cache out of low memory, so it was a cost that the development community was willing to pay.
Amusingly, the bounce-buffering code added to 2.3.27 included a comment that it would be "moved to the block layer in 2.5", when a big rewrite of the block code was planned. That rewrite was indeed done as one of the first changes in the 2.5 development series (which began in late 2001), but this particular move only happened for the 3.16 kernel in 2014.
In 2025, it is rare to find a system that needs the high-memory concept; developers have been talking about removing high memory support altogether for some years, but there is still a need for it on some 32-bit systems. There is no use for high memory on 64-bit systems, though, since the kernel is once again able to fit even shockingly large amounts of memory into its address space.
Beyond that, the number of block drivers in active use has fallen over the years, and those that remain tend to not need bounce buffering. A modern driver that uses the kernel's DMA API need not worry about where a given buffer is placed; the driver will almost certainly not access that buffer directly, and the DMA layer ensures that buffers are properly placed. Drivers for older hardware might still need to operate directly on a buffer; they can be taught to create temporary mappings in the high-memory case, avoiding the need for a bounce buffer. So the number of users for the bounce-buffering machinery has dropped, but the block layer has continued to support that functionality.
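For a driver that must touch buffer contents directly, the temporary-mapping approach looks roughly like this (kmap_local_page() and kunmap_local() are real kernel APIs; the surrounding helper and its context are invented):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Copy data out of a page that may live in high memory by
     * wrapping the access in a short-lived kernel mapping. */
    static void copy_from_possible_highmem(void *dst, struct page *page,
                                           size_t offset, size_t len)
    {
        void *vaddr = kmap_local_page(page);   /* temporary mapping */

        memcpy(dst, vaddr + offset, len);
        kunmap_local(vaddr);                   /* tear it down promptly */
    }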
When Christoph Hellwig looked at bounce buffering in early May, he found that only four drivers still used it; all but one are for ancient devices (two are parallel-port drivers, for example) that may no longer exist in the wild. Even the bounce-buffer-using driver that is still in heavy use — the USB storage driver — only needs bounce buffers for archaic host controllers that use programmed I/O. So Hellwig put together a patch set causing all four drivers to fail to load on systems where high-memory support is configured, dropping the number of bounce-buffer users to zero. At that point, the bounce-buffering code itself could be (and was) removed.
Even after Hellwig's series, the kernel is not devoid of bounce buffering; there are times when it cannot be avoided. The remaining bouncing is generally handled in the DMA and swiotlb layers, though, and is available beyond the block layer when it is needed.
The patch series, which was included in one of the first merges for the 6.16 release, removes nearly 300 lines of code from the kernel and, perhaps more importantly, takes an annoying special case out of the block-I/O paths. An old relic from the high-memory era has been removed, perhaps getting the kernel one step closer to the point where high-memory support can be removed entirely. These are all generally seen as good things.
Hardening fixes lead to hard questions
Kees Cook's "hardening fixes" pull request for the 6.16 merge window looked like a straightforward exercise; it only contained four commits. So just about everybody was surprised when it resulted in Cook being temporarily blocked from his kernel.org account amid fears of malicious activity. When the dust settled, though, the red alert was canceled. It turns out that Git is a tool with which one can inflict substantial self-harm in a moment of inattention.
Linus Torvalds reacted strongly to Cook's pull request after noticing that many of the commits found within it had been modified in strange ways. Git tracks both the author of a commit (the person who wrote the code) and the committer (the person who put that code into the repository). In this case, there were changes that claimed to have been committed by Torvalds, but they were actually rewritten (but unmodified beyond the metadata) versions of his commits with different SHA IDs. Torvalds said: "You seem to have actively maliciously modified your tree completely", implying that some sort of deliberate, underhanded change had been made. He copied kernel.org maintainer Konstantin Ryabitsev, asking that Cook's account there be disabled; Ryabitsev duly complied. News quickly spread around the Internet, along with a lot of speculation about possible supply-chain attacks or other malicious activity.
While use of kernel.org is not mandatory, most kernel maintainers do keep their repositories there. Banishment therefrom will, thus, leave a maintainer unable to do their work; unable, in this case, to even fix the problems that caused that banishment in the first place. It has never been explicitly said that a request from Torvalds is enough to cause a kernel.org account to be disabled, but it is not surprising in retrospect. Still, it must have come as a shock, even without the suggestions of possible malicious activity.
Cook, though, reacted calmly to his banishment, saying that he had not created the problematic repository intentionally; "I think I have an established track record of asking you first before I intentionally do stupid things with git". He went through the exercise of recreating that repository, showing all the steps along with data from the Git reflog. In the end, he was able to reproduce the problem with an invocation of the b4 tool's trailers subcommand.
B4 is a tool that has made life far easier for kernel developers and (especially) maintainers. It handles many of the tasks of applying patches, ensuring that all offered tags ("Reviewed-by" and such) are applied, and more. The b4 trailers command, in particular, will look for replies to a set of already-committed patches containing new tags, then rewrite the commit history to include those tags in the changelogs. It is, at its core, a rebasing operation. Those should always be approached with care, but they do not ordinarily lead to this kind of problem.
In this case, b4 trailers advised Cook that it was about to modify 39 commits. By his own admission, Cook missed that warning and told it to proceed, then used a forced push to upload the resulting repository to kernel.org. Ryabitsev, who is the b4 maintainer, was willing to share the blame:
Well, that's the point where the user, in theory, goes "this is weird, why is it 39 commits," and does Ctrl-C, but I'm happy to accept blame here -- we should be more careful with this operation and bail out whenever we recognize that something has gone wrong.
He added that he was "100% convinced" that there was no malicious activity involved. Cook's account was reactivated; he promptly put together a new pull request for the hardening fixes, which was pulled by Torvalds on June 1.
There will be some changes to b4 to try to prevent this kind of mistake from happening again. Torvalds has asked that it refuse to rewrite commits that were committed by anybody other than the user running the command; Ryabitsev has agreed to make that change. There will probably be others as well, once the developers involved understand why b4 decided to modify so many commits in this case.
So this episode appears to have run its course. The real moral of the story, perhaps, is that powerful tools can sometimes have powerfully adverse effects. It can take time — and hard experience with those effects — to learn where the pitfalls are and what types of guard rails need to be installed. We have just seen an example of how that experience is gained.
Device-initiated I/O
Peer-to-peer DMA (P2PDMA) has been part of the kernel since the 4.20 release in 2018; it provides a framework that allows devices to transfer data between themselves directly, without using system RAM for the transfer. At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Stephen Bates led a combined storage, filesystems, and memory-management session on device-initiated I/O, which is perhaps what P2PDMA is evolving toward. Two years ago, he led a session on P2PDMA at the summit; this year's session was a brief update on P2PDMA with a look at where it may be heading.
He began by looking at where P2PDMA is today. It started as an in-kernel API that enabled DMA requests between PCIe devices; one of the first users of the API was the NVMe-over-fabrics target, which allowed data to flow directly into an NVMe drive via remote DMA (RDMA). Access to the feature from user space was added, so that mmap() could be used to map device memory. That capability is being used by some companies, sometimes in conjunction with out-of-tree patches expanding the functionality.
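As a rough sketch of that user-space capability, a program might map peer memory from a device's p2pmem sysfs file and use it as an O_DIRECT buffer; the PCI address and device paths below are placeholders, and error handling is minimal, so this is an illustration of the idea rather than a recipe:

    /* Hedged sketch: allocate peer-to-peer memory by mapping a device's
     * p2pmem sysfs file, then read from an NVMe drive directly into it
     * using O_DIRECT. The PCI address and device paths are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1 << 20;    /* 1MB of peer memory */

        int pfd = open("/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate",
                       O_RDWR);
        if (pfd < 0) { perror("p2pmem"); return 1; }

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         pfd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        int dfd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (dfd < 0) { perror("nvme"); return 1; }

        /* The kernel recognizes the peer buffer and can set up a
         * device-to-device transfer instead of bouncing through RAM. */
        ssize_t n = pread(dfd, buf, len, 0);
        printf("read %zd bytes into peer memory\n", n);
        return 0;
    }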
Bates admitted to having stepped away from P2PDMA for a bit due to a job change and other distractions, so he was not entirely sure about the status of the feature in some respects. He wondered about Arm 64-bit support; Christoph Hellwig said that it is supported, which means that architecture support is in pretty good shape, Bates said.
Trying to do P2PDMA in the presence of an I/O memory-management unit (IOMMU) had been difficult, he said, so initially users turned off IOMMUs; that is not required any longer. There are a handful of PCIe features that have gained support, including page request interface (PRI) and address translation services (ATS); those features are meant to improve things but also end up making everything more confusing, he said.
Device initiation
His goal for the session was to talk about "what the next natural step for peer-to-peer is, and that is device-initiated I/O". The first two days of the summit had been interesting, Bates said; he had been talking with other attendees about the speed of progress on PCIe devices, as well as the increases in I/O operations per second (IOPS) on newer NVMe SSDs. The block layer is also able to do more IOPS, with people reporting that a "hacked" io_uring can do up to 60-million IOPS per core; Bates noted that the exact number should be taken with a grain of salt, but IOPS are increasing overall.
People are reporting that the NVMe driver can support eight-million IOPS per core, he said, in IOMMU pass-through mode, but Jens Axboe said that his testing shows around 10-12 million IOPS on a particular Threadripper-based system. The numbers vary widely and depend on other factors such as the temperature of the system (thus whether the CPUs are thermally limited), Bates said. The NVMe driver can only sustain around two-million IOPS when establishing a DMA mapping and programming windows into the IOMMU is required. He noted that there were some sessions at LSFMM+BPF about improving the DMA-mapping code, which may help reduce some of that overhead.
But handling eight-million IOPS consumes an entire CPU core, and there are SSDs coming that can do up to ten-million IOPS. It seems a shame that people are buying fast, expensive processors that spend all of their cycles doing I/O. There are two ways to improve that situation, he said: either reduce the number of CPU instructions needed per I/O operation or have the devices themselves issue the I/O. There is an Intel CPU instruction, which Matthew Wilcox called the "NVMe-queue-submission instruction", that might help, though Hellwig pointed out that "number of instructions" is not necessarily the same as time spent, since instructions take a variable number of cycles.
Since you can already do things like DMA from PCIe devices, Bates said, "an accelerator or some other kind of I/O device that has enough intelligence to have code on it that generates NVMe-submission-queue entries and rings doorbells" could handle its own I/O—or I/O for other devices. The smart device would run enough of the NVMe driver to do I/O requests. He noted a paper that reported on what NVIDIA and the University of Illinois had done using GPUs for NVMe I/O, though it was only a proof of concept. Hellwig pointed out that Mellanox (which is now part of NVIDIA) had been doing similar things for RDMA well before that paper was written. Bates said there had been patches for that at one point, but he did not think they were merged; Jason Gunthorpe said that the feature was part of a shipping product at this point.
Bates would like to see an open, vendor-neutral framework for device-initiated I/O "where anyone who wants to can play in this space". He thinks there would need to be a way to request and reserve NVMe hardware queues as part of that, though Hellwig does not think that is workable. Bates said that he does not want to take control of the device completely away from the kernel (as with SR-IOV or VFIO), for error-handling reasons and because there may be a need to tie it into filesystems. The administrative side would also remain in the kernel, he thought.
If the feature were added, some kind of protection would obviously be needed so that devices cannot simply read and write wherever they want. There was talk of "protection domains" for NVMe at one point, but he did not think they were added to the specification. It would be useful for NVMe to have the ability to restrict the kinds of operations that specific queues are allowed to do. For example, they might be restricted to using a particular namespace or a logical-block-address (LBA) range within a namespace. That way, the controller would "act like a guard if the device got a little crazy".
Hellwig said that doing any of that requires a separate NVMe controller that can be used to shut down misbehaving devices. That is an expensive option, which is why people are looking for other solutions. Bates said that he is "not 100% convinced" that is the only way forward, however.
Another thing that a "naughty device" might do would be to provide an invalid destination address for a DMA operation, either maliciously or mistakenly. Gunthorpe said that IOMMUs already handle that kind of problem and that code to use them is available in the kernel.
Hellwig went into some detail about what was done for Parallel NFS (pNFS) that would be applicable to this use case. It requires the use of layout leases, which reserve the block layout of the file on the storage medium, but the lease is revocable by the kernel any time the layout changes. He said that he and some colleagues are considering writing a paper on using that technique for GPUs; "you should give us some good hardware and you'll get a mention", he said with a laugh.
Bates said that sounded promising as a way forward for device-initiated I/O; he then turned to a slightly different topic that he wanted to discuss in the last few minutes of his slot. There is a large accelerator company that is claiming that "AI workloads need a lot of IOPS", around 200-million IOPS per GPU, which is huge. The I/O operations are small, less than 512 bytes—less than a block in size—and the workload is read-only. "Trying to do that via standard NVMe commands seems very foolish."
There is an NVMe base address register (BAR) for a controller memory buffer (CMB) that might be used to allow the AI accelerator to do load and store operations as with persistent memory. CXL has something similar, Bates said. The NVMe buffer could be mapped to a particular namespace or range in a namespace and allow the accelerator to do CXL-like memory accesses to the data. Hellwig said that using the CMB was not the right approach because it would be slow, with too much command overhead; instead the BAR should be used to create a direct mapping from the namespace that is read-only. He used an NVMe device ten years ago that could do that, so a proof of concept is probably easy to put together, but getting it added to NVMe will take longer.
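As an illustration of the kind of access being discussed, user space can already map a PCI BAR through its sysfs resource file. This sketch assumes, purely for illustration, that the CMB sits in BAR 4 of a placeholder device; on real hardware the location is advertised by the controller's CMBLOC register:

    /* Hedged sketch: map an NVMe controller-memory-buffer BAR via the
     * sysfs resource file and touch it like ordinary memory. Both the
     * PCI address and the BAR number are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1 << 20;    /* map 1MB of the CMB */

        int fd = open("/sys/bus/pci/devices/0000:04:00.0/resource4",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open BAR"); return 1; }

        volatile uint32_t *cmb = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (cmb == MAP_FAILED) { perror("mmap"); return 1; }

        /* loads and stores now go straight to device memory */
        cmb[0] = 0xdeadbeef;
        printf("read back: 0x%x\n", cmb[0]);
        return 0;
    }

This load/store style of access is what Bates was suggesting the accelerator could use, rather than building and submitting NVMe commands for every small read.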
At that point, the conversation split into several smaller discussions as time expired.
Two sessions on faster networking
Cong Wang and Daniel Borkmann each led a session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit about their respective plans to speed up networking in the Linux kernel. Both sessions described ways to remove unnecessary operations in the networking stack, but they focused on different areas. Wang spoke about using BPF to speed up socket operations, while Borkmann spoke about eliminating the overhead of networking operations on virtual machines.
sk_msg
Wang began by explaining that struct sk_msg is a data structure used internally for socket-layer messaging. He compared it to the more widely used struct sk_buff, but said that sk_msg was much simpler. BPF programs can access sk_msg structures through socket maps, where they are primarily used to let BPF programs redirect messages between sockets.
There are a few use cases for redirections like this; one example is bypassing the TCP stack when sending messages between a client and a server on the same machine. This can avoid unnecessary overhead, Wang explained, but it's only helpful if forwarding the messages in BPF is actually faster. After questioning from Borkmann, Wang clarified that this use case is speculative; it is not actually being used in production.
Depending on the types of sockets involved, exactly how messages are redirected can vary. When redirecting from a transmitting socket to a receiving socket, for example, the BPF program doesn't need to make any changes to the data at all, resulting in a fast transfer. When redirecting from a receiving socket to a transmitting socket, on the other hand, the BPF program needs to execute a series of conversions to convert the received data into the right format.
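A minimal sk_msg program of the kind Wang was describing might look like the sketch below; the two-entry map layout and the choice of key 0 are invented for illustration, and the map itself would be populated with socket file descriptors from user space:

    /* Hedged sketch of an sk_msg redirection program; built with a BPF
     * target compiler, attached to the socket map from user space. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 2);
            __type(key, __u32);
            __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("sk_msg")
    int redirect_msg(struct sk_msg_md *msg)
    {
            /* Slot 0 is assumed to hold the peer socket; BPF_F_INGRESS
             * queues the data on the peer's receive path, so the message
             * never traverses the TCP stack. */
            return bpf_msg_redirect_map(msg, &sock_map, 0, BPF_F_INGRESS);
    }

    char _license[] SEC("license") = "GPL";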
The TCP stack has had a lot of optimization over the years. For example, it efficiently batches short messages. As a result, BPF redirection is actually slower than traversing the whole TCP stack for short messages. That has been partially corrected by work from Zijian Zhang to add a buffer to batch short messages in sockets. Wang thinks that performance can be improved further, however, by reusing the code for Nagle's algorithm from the TCP stack.
Wang then presented a number of different, more speculative ideas for improving performance. These included introducing new, more efficient interfaces for BPF programs manipulating socket messages, removing locks where possible, and simplifying the transformations needed for the receiving-socket-to-transmitting-socket case.
There was some discussion of where sk_msg structures are used throughout the kernel and how those areas would be impacted. Wang closed out the session with the observation that TCP sockets are widely used; increasingly, containerized workloads use TCP sockets to communicate within the same physical machine. Any work to speed up local sockets will undoubtedly be generally useful.
Netkit for virtual machines
Virtual machines (VMs) provide comprehensive isolation from the physical hardware, at the cost of additional overhead. Where possible, it would be nice to reduce that overhead. Borkmann spoke about his work to remove some of the overhead of networking in VMs, as part of a larger plan to try to make VM workloads and container workloads use the same underlying tooling in Kubernetes.
Today, a VM running under Kubernetes runs inside a container with QEMU. This odd state of affairs is because Kubernetes started as a container-management engine, so putting the virtual machine manager inside a container lets Kubernetes reuse many existing tools. Borkmann shared a slide illustrating what this does to the networking stack.
In short, a network packet destined for an application running in a virtual machine must be received by the physical hardware, handled by the host kernel, forwarded to the virtual container bridge network, given to the host side of QEMU's virtual network device, passed into the virtual machine, and finally handled by the guest kernel.
This is a lot of unnecessary work, Borkmann said. About a year ago, QEMU got a new networking backend based on AF_XDP sockets; he suspected that AF_XDP sockets could be used to bypass the steps above. The change is not trivial because express data path (XDP) is not supported inside network namespaces (which are used in containers). Borkmann's idea was to reserve a set of queues on the physical network card, bind those to Cilium's netkit (a kernel driver that is designed to help reduce the overhead for network namespaces), and dedicate those queues to the network namespace of the container.
This would let traffic go directly from the physical hardware, to QEMU's AF_XDP networking backend, to the VM's kernel. This is about as minimal as the overhead could be, because the host system still needs to be in control of the actual hardware. The design would also let BPF programs running on the host intercept and modify traffic as normal.
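For a sense of the mechanics, here is a minimal sketch of the building block this design relies on: binding an AF_XDP socket to a specific hardware queue with the libxdp helpers. The interface name and queue number are placeholders, and real code needs fill-ring management and error handling that are elided here:

    /* Hedged sketch: create an AF_XDP socket bound to one hardware
     * queue. "eth0" and queue 4 are placeholders. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <xdp/xsk.h>

    #define NUM_FRAMES 4096
    #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

    int main(void)
    {
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        struct xsk_ring_prod fq, tx;
        struct xsk_ring_cons cq, rx;
        void *bufs;

        /* UMEM: the packet-buffer region shared with the kernel */
        if (posix_memalign(&bufs, getpagesize(),
                           NUM_FRAMES * FRAME_SIZE))
            return 1;
        if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
                             &fq, &cq, NULL))
            return 1;

        /* Bind to queue 4 on eth0; traffic the NIC steers to that
         * queue bypasses the rest of the host stack. */
        struct xsk_socket_config cfg = {
            .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
            .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        };
        if (xsk_socket__create(&xsk, "eth0", 4, umem, &rx, &tx, &cfg))
            return 1;

        /* ... populate the fill ring, then consume from rx ... */
        return 0;
    }

In Borkmann's design, netkit would make queues like this usable from within a container's network namespace, which plain XDP does not allow today.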
Just before the summit, Borkmann got a proof-of-concept implementation working. The code is not too complicated, he said, but there are still several APIs that he would like to slightly tweak in order to simplify the idea. In particular, the XDP API is fairly limited, compared to a hardware networking device; Borkmann wants to extend that API with support for various kinds of hardware offload.
Although that session was not the last in the BPF track, it does mark the completion of LWN's coverage for this year. The last session in the BPF track was already covered in the same article as Mahé Tardy's earlier session.
Reports from OSPM 2025, day three
The seventh edition of the Power Management and Scheduling in the Linux Kernel Summit (known as "OSPM") took place on March 18-20, 2025. Topics discussed on the third (and final) day include proxy execution, energy-aware scheduling, the deadline scheduler, and an evaluation of the kernel's EEVDF scheduler. As with the coverage from the first and second days, each report has been written by the named speaker.
Aren't you tired of proxy-execution talks?
Speaker: John Stultz (video)
Continuing what has been a persistent topic at OSPM and the Linux Plumbers Conference (LPC) over the past several years, John Stultz gave a talk on proxy execution, titled "Aren't you tired of proxy-execution talks?". While some in the audience kindly expressed feelings to the contrary, Stultz prefaced the talk by mentioning that, after a few years of doing these talks, he was at least a little bit tired of the topic.
The talk started with a quick summary of why proxy execution has been worth this extended effort: it resolves the priority inversion problems seen commonly with SCHED_NORMAL tasks when methods like cpu.shares, CFS bandwidth throttling, cpusets, or SCHED_IDLE are used to restrict background tasks so they don't affect foreground tasks. If one of those background tasks calls into the kernel and acquires a kernel mutex before it is preempted, it can block important foreground tasks from running if they also need that kernel mutex, causing unacceptable stalls that are visible to the user.
Proxy execution solves this using a form of priority inheritance. Instead of tracking a linear priority order (which only works with realtime tasks), it keeps tasks waiting for a mutex on the run queue; if the important mutex-blocked task is selected to be scheduled, the scheduler finds the lock owner and runs that task instead. This safely enables the use of the above-mentioned features to restrict background tasks and, in a number of example tests, has shown an impressive reduction in outlier delays to the foreground tasks, providing more consistent behavior for users.
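The core idea can be sketched in a few lines of C-flavored pseudocode; this is purely conceptual, not the kernel's actual implementation, and the helper names are invented:

    /* Conceptual sketch of proxy execution: blocked tasks remain on
     * the run queue; if one is picked, the CPU runs the mutex owner
     * in its place. Cross-run-queue migration is ignored here. */
    struct task_struct *pick_next_task_proxy(struct rq *rq)
    {
            struct task_struct *selected = pick_best_candidate(rq);
            struct task_struct *run = selected;

            /* walk the blocked-on chain to the task actually holding
             * the lock */
            while (task_is_blocked(run))
                    run = mutex_owner(run->blocked_on);

            /* run the lock owner (the execution context) while
             * charging its time to the selected task (the scheduling
             * context) */
            return run;
    }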
Stultz highlighted some progress that has been made since his 2024 LPC talk. The preparation patches that had been submitted repeatedly over months were finally merged. The next chunk of the series, focusing on proxying only if the lock owner is on the same run queue as the waiter, had been submitted a few times and received helpful review feedback. Finally, the full proxy-execution patch series (as of v14) was merged into the android16-6.12 branch (disabled by default) so vendors could experiment and update their vendor hooks to support the concept of the split scheduling and execution contexts.
As the full patch series is complex, Stultz has been trying to submit the series in smaller, reviewable chunks. He outlined the current stages of the full patch series, with the current one under review being the "single-run-queue proxying" step. That is followed by "proxy-migration and return-migration" then "sleeping owner enqueuing and activation" and finally "chain migration".
Stultz covered the highlights from the most recent (at the time of the talk) release (v15) of the patch set, some of the near-term to-do items, and the larger plans left for the year. Then, for discussion, he covered a topic he wasn't able to get to at the LPC talk last year, which is the complex logic required to handle sleeping-owner activation. When the proxy-execution logic walks the blocked-on chain and finds the lock owner sleeping, there's no way to effectively donate time and make the owner run. So the scheduler dequeues the donor and, instead, adds it to a list attached to the owner of the lock it's waiting on. When the lock owner wakes up, the scheduler can enqueue the blocked donor task(s) onto the same run queue, so proxying can commence.
However, when wait/wound mutexes are involved, we can have "mid-chain" wakeups, which require waking up and enqueuing just the tasks waiting on that mid-chain wounded mutex. Thus we have to maintain a full tree structure of blocked-on tasks. There is an additional complication: the relationship is task-to-task, protected by the task->blocked_lock, and we need to take the task->pi_lock to wake a task. The locking is particularly complex, requiring that we drop and reacquire all the locks in each step. Making it all the more complicated, we can have standard unlock wakeups and multiple mid-chain wakeups happening in the tree in parallel. Stultz suggested that, when this chunk is submitted, he would welcome close review and suggestions for alternative approaches.
Peter Zijlstra raised the point that the scheduler could just keep a single list and wake everything found there; that wouldn't be optimal, but hopefully this is a rare case in general. Steven Rostedt found this simpler approach attractive as well. Stultz was hesitant, as early versions of the patch used that approach, but it was a source of numerous bugs around wait/wound mutexes that have seemingly disappeared with the more complex approach. But he said he would evaluate it further.
Thomas Gleixner pointed out that the rt_mutex logic has similar handling, so that should be looked at. Specifically, he was referring to the way the logic is documented, as it is also complex, and trying to add the documentation inline with the code makes it messy. Instead, Gleixner suggested explaining it all up front in a big comment, then using numbered annotations that can be added to the specific code. Others in the audience agreed that this was a nice way to document the complexity.
Stultz was then able to move on to a grab bag of other topics that he has run into on the Android Systems team.
The first was that the quality-of-service (QoS) efforts that were much discussed at last year's OSPM were still important. He highlighted that Qais Yousef's QoS API and ramp-up multiplier patches had been shown to provide a nice (5-15%) performance improvement for Chrome on Android. In fact, when applied to an unoptimized 6.12 kernel, the results were similar to what is seen on a productized 6.1 kernel that includes all the device-specific vendor hooks and heuristics to improve scheduling behavior. More evaluation is needed to understand the relative power impact, but it seemed promising.
Dietmar Eggemann raised the point that the series has quite a number of different changes together, and that it might be nice to split these apart to better understand which of the changes provided the benefit. He has mentioned this before, and Stultz clarified that he has made an attempt at pulling some of the changes apart, but found dependencies through the series that made this difficult. But he acknowledged the feedback and asked that folks try to help review the patches next time they are submitted.
Another topic Stultz covered was Android's increasing use of the freezer to restrict background tasks. Similar to proxy execution, this is a solution to prevent background tasks from impacting foreground tasks, but does so in a way that freezes background tasks in user space, where they can't hold kernel resources. The freezer is commonly used for suspend/resume and checkpoint/restore, but less frequently used on a per-task basis.
This approach has shown promising results, but does have side effects. These include problems like synchronous binder calls to frozen tasks, which won't return. Binder buffers sent asynchronously have to be kept around and cannot be freed while the receiver is frozen. Efforts are ongoing to address those, but one other issue that has come up is problems with recurring timers and watchdog timers. Since a task can't run while it is in the freezer, when it is released from the freezer, any recurring application timers will fire repeatedly. This is similar to an issue seen in the early days with suspend and resume, which motivated the change causing the CLOCK_MONOTONIC clock ID to halt in suspend. Similarly, things like watchdog timers may fire while a task is frozen so, when it wakes up, it might panic because it wasn't runnable for such a long time.
So there is a desire for a new, per-task clock ID that would not account time while the task was frozen. Stultz is hesitant about the idea, as it is not generalized yet, but wanted to raise the idea so folks were aware this type of use of the freezer was going on. There was some discussion in the crowd that, maybe, time namespaces could be used, though that seemed to get argued down. Gleixner highlighted that having a clock ID wouldn't be terrible, but the timer side would be difficult to manage.
The next topic he covered was CPU-frequency-management trouble seen with realtime audio; he walked through a trace from the Android mainline kernel on a Pixel 6 device where a realtime task changed its CPU needs, but the CPU frequency didn't change quickly enough to avoid underruns (and more strangely the frequency only increased after the task had migrated and the CPU had gone idle). Vincent Guittot pointed out that, with mainline kernels, the CPU frequency should already be at a maximum and, despite android-mainline not having the scheduler vendor hooks enabled, there has to be some other logic in the Android patches causing this.
Zijlstra asked why deadline scheduling isn't being used. Stultz clarified that efforts were made years ago to use it, but they were unsuccessful; he didn't have enough context on the details as to why. He agreed to dig further into the problem to understand the problems Android had with the deadline scheduler and to look deeper into the Android patches to see what has been done to frequency scaling on realtime tasks.
Finally, he ended by suggesting that folks in the room try out Perfetto, which provides a great visualization tool for system behavior. Its traces (or images of the trace) can become an efficient sort of shorthand for describing problems or issues, which he thinks would be valuable for the community. Attendees asked how its functionality differs from KernelShark; Stultz explained that, while the visualization provided by Perfetto is similar to what KernelShark offers, the real power of Perfetto is that it ingests the trace into a database, which can then be queried with SQL and the results rendered into visualizations. This makes it easy to identify where in a trace a problem occurred, and enables adding custom visualizations for metrics that aren't there by default. He mentioned that a lot of the documentation is Android or ChromeOS focused, but he has created a document showing how to use Perfetto against mainline kernels; he hoped folks would take some time to try it out, as it has a lot of potential to improve communication in the community.
Improving energy-aware scheduling through better utilization signals
Speaker: Pierre Gondois (video).
Utilization signals, which tell the scheduler how much CPU time any given task requires, are impacted by tasks sharing the same run queue. Indeed, in this case, the measured utilization doesn't necessarily represent the actual size of a task; it represents how much time the scheduler gave to that task. In particular, if two tasks, A and B, are in the same queue, task A is perceived as idling when task B is running.
On a fully utilized CPU, a task's util_avg value underestimates the actual size of the task; it is bounded by the relative nice value of the task. A task with a low nice value will receive more running time and appear bigger than if it had a higher nice value.
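For context, util_avg comes from PELT: roughly, a geometrically decaying average of the task's recent running time. A simplified sketch of the sum (the kernel uses fixed-point arithmetic over segments of about 1ms; this continuous form is an approximation):

    \bar{u} \;=\; 1024\,(1 - y)\sum_{i=0}^{\infty} y^{i}\,u_i,
    \qquad y^{32} = \tfrac{1}{2}

Here u_i is the fraction of window i that the task spent running. Two equal, always-runnable tasks sharing one CPU each get u_i of about one half, so each converges to a util_avg near 512 rather than the 1024 either would show when running alone.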
A prototype signal, util_burst, has been created to address this problem. This signal accumulates the amount of time a task ran while it was enqueued, then accounts the contributions all at once upon dequeuing. This prevents co-scheduled tasks from being perceived as idling while another task runs. On a fully utilized CPU, the util_burst signal of a task is not bounded by the task's nice value relative to other tasks on the CPU; it will slowly grow to reach the maximum value of 1024.
This estimation of the size of tasks is related to the UCLAMP_MAX feature. UCLAMP_MAX aims to put CPUs in a fully-utilized state; this skews the estimation of the size of UCLAMP_MAX tasks and leads to inaccurate task placement.
Indeed, UCLAMP_MAX tasks are now placed by the energy-aware scheduler (EAS), which uses utilization values to place tasks. On fully utilized CPUs, the load should be balanced among CPUs and utilization should be ignored. Thus the load balancer should be used in this case instead, or UCLAMP_MAX tasks should not be scheduled through EAS.
Rethinking overutilized — when to bail out of EAS?
Speaker: Christian Loehle (video).
The Linux scheduler's energy-aware scheduling mode makes per-task CPU placement decisions, based on per-entity load tracking (PELT) utilization signals, to place and pack tasks onto the most energy-efficient CPU. It's rather obvious that this isn't always desirable, so, when a run queue's utilization exceeds 80% of the CPU's capacity, EAS deactivates itself and falls back to traditional capacity-aware load balancing. This is motivated by the utilization data no longer being trustworthy when compute demand isn't met and potential energy savings being minimal in those scenarios.
However, this mechanism is showing signs of age. The 80% threshold is static, independent of CPU type, and indifferent to workload behavior. On modern, heterogeneous systems, particularly mobile SoCs with complex core topologies or laptops with many big cores, this results in the system frequently bailing out from EAS. These transitions are often triggered by short-lived spikes that don't reflect sustained demand, but the global overutilized state impacts the entire system nonetheless.
On the Pixel 6, which features clusters of little, mid-size, and big cores, the little cores, with capacities of 160 (out of the normalized 1024), frequently cross the overutilization threshold under modest workloads. Since the 80% threshold translates to just 128 capacity on those CPUs, a transient bump is enough to deactivate EAS, even when big cores are idle and available. This leads to task migration from little to mid-size or big cores, then back again once demand drops, forming a cycle of unstable placements that undermines energy savings.
On desktop-class hardware, like the Apple M1 variants, which have more big cores than little cores, the problem manifests differently. A single compute-heavy thread on one big core can push it past the threshold, deactivating EAS across the entire system. For a mobile SoC with, traditionally, no more than two big CPUs (which can't both be sustained thermally), this was acceptable and rare; for a laptop use case, this can be improved upon.
The fundamental issue around overutilization is that the scheduler cannot trust PELT's util_avg (and its derivatives) when compute demand exceeds the provided compute capacity. If a task is placed on an underpowered CPU and throttled, the task's load will be overestimated.
Several proposals were discussed to address these issues. One is to make the overutilization threshold dynamic, as suggested by Yousef. Rather than a hard-coded 80%, the threshold would consider how much utilization a task could accumulate assuming it ran continuously until the next tick. This can account for differences between a 160-capacity little core and a 1024-capacity big core. While this introduces useful topology awareness, it risks backfiring on little cores by reducing their safe operating headroom, potentially triggering the overutilized state even sooner under normal bursts.
A more conservative approach is to introduce a linger period before setting or clearing the overutilized flag. Since most overutilization events are short-lived, often under 1ms, delaying state transitions can filter out noise. Experimental results show that introducing a one-tick linger before clearing the flag reduces overutilized events by nearly an order of magnitude without significantly increasing time spent in the overutilized state overall. This could improve EAS stability with minimal side effects on responsiveness.
Specifically, for topologies like those seen in laptops, if a big CPU is marked overutilized due to a single task, that core could be excluded from the global overutilized evaluation, since trustworthiness of PELT signals isn't impacted and maximum throughput for that task is already guaranteed. While this idea has limited value on mobile SoCs, it becomes relevant on desktops and laptops where lone-task domination is more common, and the performance budget is less constrained by thermal design.
A more radical idea is to redefine the overutilized condition entirely, moving away from utilization metrics and, instead, basing the decision on observed idle time. If a CPU has not been idle within a recent window, say 32ms (the current PELT half-life), it may be assumed to be overutilized. This approach reflects actual compute demand more directly.
Two issues arise. The approach is no longer DVFS-independent, meaning slow DVFS reactions or temporary DVFS restrictions may trigger the overutilized state. The bigger issue is that this is incompatible with UCLAMP_MAX since, whenever UCLAMP_MAX is actually taking effect, the condition is expected to be met (since provided compute capacity is artificially restricted).
Here lies a deeper problem. The existence of UCLAMP_MAX, used on Android to restrict task-utilization estimations to save energy, undermines any mechanism that relies on utilization or idle inference. A task may run at 50% of CPU capacity, not because of limited demand, but because it has been constrained by user-space policy. In such cases, its utilization data is effectively garbage. Worse, any additional task scheduled on the same CPU contaminates its own utilization signal. The presence of UCLAMP_MAX therefore limits the design space of alternative overutilization definitions.
One potential fix is to redefine UCLAMP_MAX as a hard cap on util_avg, effectively throttling the task by blocking or scheduling it out once it hits its maximum. This would make PELT trustworthy again, although at the cost of changing the semantics of UCLAMP_MAX in a way that may not be acceptable to existing users. It's unclear whether this is viable, especially given how heavily Android vendors depend on these interfaces through custom vendor hook implementations.
The takeaway is clear: the current overutilized mechanism does not scale well with modern topologies or different use cases. Its reliance on a single global threshold and its entanglement with UCLAMP_MAX both contribute to misbehavior across platforms. While approaches have been discussed previously, none of the proposals has gone anywhere so far.
Discussion revolved around the proposals presented and the limitations of PELT signals, but also around the UCLAMP_MAX issue: specifically, how much a user-space policy with no obvious mainline users (only indirect downstream users with modified functionality) should be allowed to restrict future developments around the overutilized state.
How to verify EAS and uclamp functionality?
Speaker: Dietmar Eggemann (video).
This talk addressed the lack of review for several EAS and uclamp patch sets on the mailing list. High commercial interest in Android, fragmented test environments, and vendor-specific customizations have hindered collaboration in this area for some time now. Traditional performance benchmarks are insufficient, as EAS and uclamp aim to balance both performance and energy efficiency. The workload generator rt-app, ftrace, and EAS/PELT kernel tracing were mentioned as tooling for this job instead.
A proposed solution is to share rt-app configuration files that trigger the behavioral changes introduced by patches, along with instructions for adapting them to different hardware (e.g., CPU count, capacity, task parameters). This allows consistent testing across diverse heterogeneous platforms, while maintaining flexibility in data analysis.
Arm has already adopted this method for the uclamp sum aggregation patch series, using JupyterLab notebooks and the LISA (Linux Integrated System Analysis) toolkit. It should be clear to all of us that relying solely on testing during Android release cycles is insufficient and risks introducing regressions into mainline EAS/uclamp code.
Making user space aware of current deadline-scheduler parameters
Speaker: Tommaso Cucinotta (video).
The SCHED_DEADLINE scheduler allows reading its statically configured run-time, deadline, and period parameters through the sched_getattr() system call. However, there is no immediate way to access, from user space, the current parameters used within the scheduler: the instantaneous run time, as well as the current absolute deadline. This information can tell a task how much more CPU time it has available to it in the current computation cycle.
The talk described the need for this data in the context of adaptive realtime applications, with two use cases. The first deals with imprecise computations, which are quite common in control applications, where a realtime task, after performing its most important computation, may decide to execute an optional computation refining its output if enough time is available in the leftover run time until the deadline. The second use case deals with a pool of SCHED_DEADLINE threads serving jobs from a shared queue, where each thread needs precise information on its own leftover run time and absolute deadline to estimate whether a new job can be picked and processed.
The talk compared a few ways to read the current SCHED_DEADLINE parameters.
- With user-space heuristics, running the risk of "open-loop" measurements that might go out-of-sync with the real values used by the kernel.
- Using the values exposed through the /proc/<pid>/task/<tid>/sched special file, which contains lines with the dl.runtime and dl.deadline values for SCHED_DEADLINE tasks. This approach is inefficient because of the need to format numbers in decimal notation in kernel space and parse them back in user space. It also suffers from the problem of correctly interpreting, in user space, the absolute-deadline information, which is expressed relative to the run-queue clock.
- Using a kernel module that exposes the needed numbers more efficiently, in binary format, through a /dev/dlparams special device. However, this module does not have direct access to the deadline-scheduler API to update the leftover run time, so it may call schedule() for the purpose, which is probably overkill considering what is actually needed. Also, this would not work if the parameter were to be read from a different CPU than the one where the monitored SCHED_DEADLINE task is running.
Finally, the talk presented a small kernel patch to sched_getattr(); it uses the flags parameter (which is mandated to contain zero at the moment) to enable a process to request its leftover run time and absolute deadline (converted to a CLOCK_MONOTONIC reference), instead of the statically configured parameters.
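As a concrete sketch of the interface being discussed: sched_getattr() has no glibc wrapper, so it must be invoked through syscall(), and the flag name below is purely hypothetical, standing in for whatever identifier the patch ends up using:

    /* Hedged sketch: read SCHED_DEADLINE parameters via sched_getattr().
     * SCHED_GETATTR_DL_CURRENT is a made-up name for the proposed flag. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/sched/types.h>    /* struct sched_attr */

    #define SCHED_GETATTR_DL_CURRENT 0x1    /* hypothetical */

    int main(void)
    {
            struct sched_attr attr;

            /* today: flags must be zero, and the statically configured
             * parameters are returned */
            if (syscall(SYS_sched_getattr, 0, &attr, sizeof(attr), 0))
                    return 1;
            printf("runtime %llu deadline %llu period %llu\n",
                   (unsigned long long)attr.sched_runtime,
                   (unsigned long long)attr.sched_deadline,
                   (unsigned long long)attr.sched_period);

            /* with the proposed patch, passing a nonzero flag in the
             * last argument would return the leftover run time and the
             * absolute deadline (in CLOCK_MONOTONIC terms) instead */
            return 0;
    }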
Questions for the OSPM audience included:
- The suitability of the user-space interface, especially the addition of a flag to sched_getattr(); the feedback seemed positive in this regard.
- A possible option to return the current absolute deadline relative to CLOCK_REALTIME in addition to CLOCK_MONOTONIC: there seemed to be little value in adding such an option, as any realtime application should be designed not to rely on CLOCK_REALTIME.
- The possibility that the absolute deadline value returned could exhibit some nanosecond-level variability if multiple calls are made within the same deadline cycle.
- The possible drawback that reading the current parameters would cause a SCHED_DEADLINE task that exhausted its run time to be immediately throttled. Without such a call, with the default kernel configuration, throttling would have happened at the next clock tick. This didn't seem such a big problem, in view of the fact that a reasonable use of SCHED_DEADLINE would need to enable the HRTICK_DL scheduler feature, so in such conditions there would be no such difference in behavior.
- Whether the fact that sched_getattr() grabs the rq->__lock of the run queue where the monitored SCHED_DEADLINE task resides, in order to call update_rq_clock() and update_curr_dl(), might constitute a problem when the task executing the system call is running on a different CPU.
The EEVDF verifier: a tale of trying to catch up
Speaker: Dhaval Giani (video)
The talk started with the promise of (Swiss) chocolates for audience participation. Giani began by talking a bit about the EEVDF technical report, which directed the implementation of the kernel's EEVDF scheduler. He pointed out that this is the first time that the SCHED_NORMAL class of the CPU scheduler has been based on an academic paper; it brings a few interesting ideas with it, chief among them the possibility of functional tests of the scheduler. For the longest time, the scheduler has been a bunch of heuristics around a core algorithm.
He started by walking through the EEVDF algorithm with a simple case — showing how lag, deadline, and virtual time are calculated. With this background, the mathematical guarantees offered by the algorithm were discussed. These guarantees will form the basis of the functional tests for EEVDF. When prompted, Zijlstra agreed this was an adequate explanation of the paper for the basis of future discussions.
With this, the focus shifted to how the Linux implementation differs from the paper. One of the major changes comes from the fact that EEVDF doesn't consider SMP (support for which was introduced in Linux in 1996, while the technical report came out in 1995). Another major difference is the existence of a hierarchy of control groups. For these reasons, it is not possible for there to be a global virtual clock, which is what the paper relies on. Instead, for each scheduler run queue, there is a "zero-lag" vruntime value that is used to keep track of where the virtual clock of the CFS run queue is. The Linux implementation also takes some liberties by using some of the mathematical properties to clamp lag and reduce the numerical instability.
Next, Giani talked about using the results expected from the lemmas and theorems introduced in the paper to test if the algorithm is still faithful to the paper, and if not, then document the differences introduced. Two test cases were covered, and there was consensus that they made sense, even if they would always pass. As previously mentioned, this is testing the functionality — and either of them failing would mean something fundamental has been changed.
Zijlstra suggested that, for the purpose of testing, some of the clamps can be relaxed to see how unstable the math is around the vruntime updates. It will also help document in which cases the clamping is needed.
It is still to be decided how these tests would be triggered, but they will need to be implemented as part of the kernel, since they look into deep scheduler internals and there is no appetite to expose those internals outside the scheduler.
As the talk wound down, Giani pointed out that the two tests he had finished developing work quite well on the current implementation, congratulated Zijlstra, and offered him an additional chocolate for the hard work.
The importance of free software to science
Free software plays a critical role in science, both in research and in disseminating it. Aspects of software freedom are directly relevant to simulation, analysis, document preparation and preservation, security, reproducibility, and usability. Free software brings practical and specific advantages, beyond just its ideological roots, to science, while proprietary software comes with equally specific risks. As a practicing scientist, I would like to help others—scientists or not—see the benefits from free software in science.
Although there is an implicit philosophical stance here—that reproducibility and openness in science are desirable, for instance—it is simply a fact that a working scientist will use the best tools for the job, even if those might not strictly conform to the laudable goals of the free-software movement. It turns out that free software, by virtue of its freedom, is often the best tool for the job.
Reproducing results
Scientific progress depends, at its core, on reproducibility. Traditionally, this referred to the results of experiments: it should be possible to attempt their replication by following the procedures described in papers. In the case of a failure to replicate the results, there should be enough information in the paper to make that finding meaningful.
The use of computers in science adds some extra dimensions to this concept. If the conclusions depend on some complex data massaging using a computer program, another researcher should be able to run the same program on the original or new data. Simulations should be reproducible by running the identical simulation code. In both cases this implies access to, and the right to distribute, the relevant source code. A mere description of the algorithms used, or a mention of the name of a commercial software product, is not good enough to satisfy the demands of a meaningful attempt at replication.
The source code alone is sometimes not enough. Since the details of the results of a calculation can depend on the compiler, the entire chain from source to machine code needs to be free to ensure reproducibility. This condition is automatically met for languages like Julia, Python, and R, whose interpreters and compilers are free software. For C, C++, and Fortran, the other currently popular languages for simulation and analysis, this is only sometimes the case. To get the best performance from Fortran simulations, for example, scientists often use commercial compilers provided by chip manufacturers.
Document preparation and preservation
The forward march of science is recorded in papers, which are collected on preprint servers (such as arXiv) and on the home pages of scientists, and published in journals. It's obviously bad for science if future generations can't read these papers, or if a researcher can no longer open a manuscript after upgrading their word-processing software. Fortunately, the future readability of published papers is enabled by the adoption, by journals and preprint servers, of PDF as the universal standard format for the distribution of published work. This has been the case even with journals that request Microsoft Word files for manuscript submission.
PDF files are based on an open, versioned standard and will be readable into the foreseeable future with all of the formatting details preserved. This is essential in science, where communication is not merely through words but depends on figures, captions, typography, tables, and equations. Outside the world of scientific papers, HTML is by far the dominant markup language used for online communication. It has advantages over PDF in that simple documents take less bandwidth, HTML is more easily machine-readable and human-editable, and by default text flows to fit the reader's viewport. But this last advantage is an example of why HTML is not ideal for scientific communication: its flexibility means that documents can appear differently on different devices.
The final rendering of a web document is the result of interpretation of HTML and CSS by the browser. The display of mathematics typically depends on evolving JavaScript libraries, as well, so the author does not know whether the reader is seeing what was intended. The "P" in PDF stands for "portable": every reader sees the same thing, on every device, using the same fonts, which should be embedded into the file. The archival demands of the scientific record, combined with the typographic complexity often inherent to research papers, requires a permanent and portable electronic format that sets their appearance in stone.
To aid collaboration and to ensure that their work is widely readable now and in the future, scientists should distribute their articles in the form of PDF files, ideally alongside text-based source files. In mathematics and computer science, and to some extent in physics, LaTeX is the norm, so researchers in these fields will have the editable versions of their papers available as a matter of course. Biology and medicine have not embraced the culture of LaTeX; their journals encourage Word files (but often accept RTF output). Biologists working in Word should create copies of their drafts in one of Word's text-based formats, such as .docx or .odt; though these files may not be openable by future versions of Word, their contents will remain readable. Preservation of text-based, editable source files is essential for scientists, who often revise and repurpose their work, sometimes years after its initial creation.
Licensing problems
Commercial software practically always comes with some form of restrictive license. In contrast with free-software licenses, commercial ones typically interfere with the use of programs, which often throws a wrench into the daily work of scientists. The consequences can be severe; software that comes with a per-seat or similar type of license should be avoided unless there is no alternative.
One sad but common situation is that of a graduate student who becomes accustomed to a piece of expensive commercial analytical software (such as a symbolic-mathematics program), enjoying it either through a generous student discount or because it's paid for by the department. Then the freshly-minted PhD discovers the real price of the software, and can't afford it on their postdoc salary. They have to learn new ways of doing things, and have probably lost access to their past work, which is locked up in proprietary binary files.
A few months ago, an Elsevier engineering journal retracted two papers because their authors had used a commercial fluid-dynamics program without purchasing a license for it. The company behind the program regularly scans publications looking for mentions of its product in order to extract license fees from authors. In these cases, the papers had already been cited, so their retraction is disruptive to scholarship. Cases such as these are particularly clear examples of the potential damage to science (and to the careers of scientists) that can be caused by using commercial software.
In addition, certain commercial software products with per-seat licensing "call home" so that the companies that sell them can keep track of how many copies of their programs are in use. The security implications of this should be obvious to anyone, yet government organizations, while adhering minutely to security rituals with questionable efficacy, permit their installation. While working at a US Department of Defense (DoD) lab, I was an occasional witness to the semi-comical sight of someone running around knocking on office doors, trying to find out who was using (or had left running) a copy of the program that they desperately needed to use to meet some deadline—but were locked out of.
Software rot
Ideally scientists would only use free software, and would certainly avoid "black box" commercial software for the various reasons mentioned in this article. But there is another category that's less often spoken of: commercial software that provides access to its source code.
When I joined a new project at my DoD job, the engineer that I was supposed to work with was at a loss because a key software product had stopped working after he upgraded the operating system (OS) on his workstation. The operating system couldn't be downgraded and the company was no longer supporting the product. I got a thick binder from him with the manual and noticed a few floppy disks included. These contained the source code. Right at the top of the main program was a line that checked the version of the OS and exited if it was not within the range that the program was tested on. I figured we had nothing to lose, so I edited this line to accept the current OS version. The program ran fine and we were back in business.
The point of this anecdote is to illustrate the practical value of access to source code. Such proprietary but source-available software occupies an intermediate position between free software and the black boxes that should be strictly avoided. Source-available software, although more transparent, practical, and useful than a black box, still fails to satisfy the reproducibility criterion, because the scientist who uses it can't publish or distribute the source; therefore other scientists can't repeat the calculations.
Software recommendations
The following specific recommendations are for free software that's potentially of use to any scientist or engineer.
Scientists should, when practical, test their code using free compilers, and use these in preference to proprietary options when performance is acceptable. For the C family, GCC is the venerable standard, and produces performant code. A more recent but now equally capable option is Clang.
For Fortran, GFortran (which is a front-end for GCC) is a high-quality compiler and the standard free-software choice. Several more recently developed alternatives are built, as is Clang, on LLVM. To avoid potential confusion, two of these are called "Flang". Those interested in investigating an LLVM option should follow the project called (usually) "LLVM Flang", which is written from scratch in C++ and was renamed to "Flang" once it became part of the LLVM project in 2020. Its GitHub page warns that it is "not ready yet for production usage", but this is probably the LLVM Fortran compiler of the future. Another option to keep an eye on is the LFortran compiler. Although still in alpha, this project (also built on LLVM) is unique in providing a read-eval-print loop (REPL) for Fortran.
For those scientists not tied to an existing project in a legacy language, Julia is likely the best choice for simulation and analysis. It's an interactive, LLVM-based, high-level expressive language that provides the speed of Fortran. Its interfaces to R, gnuplot, and Python mean that those who've put time into crafting data-analysis routines in those languages can continue to use their work.
Although LaTeX is beloved for the quality of its typesetting, especially for mathematics, it is less universally admired for the inscrutability of its error messages, the difficulty of customizing its behavior using its arcane macro language, and its ability to occasionally make simple things diabolically difficult. Recently a competitor to LaTeX has arisen that approaches that venerable program in the quality of its typography (it uses some of the same critical algorithms) while being far easier to hack on: Typst. Like LaTeX, Typst is free software that uses text files for its source format, though Typst does also have a non-free-software web application. Typst is still in alpha, and so far only one journal accepts manuscripts using its markup language, but its early adopters are enthusiastic.
A superb solution for the preparation of documents of all types is Pandoc, a Haskell program that converts among a huge variety of file formats and markup languages. Pandoc allows the author to write everything in its version of Markdown and convert into LaTeX, PDF, HTML, various Word formats, and more. Raw LaTeX, HTML, and others can be added into the Markdown source, so the fact that Markdown has no markup for mathematics (for example) is not an obstacle. The ability to have one source and automatically create a PDF and a web page, or to produce a Word file for a publication that insists on it without having to touch a "what you see is what you get" (WYSIWYG) abomination, greatly simplifies the life of the writer/scientist. Pandoc can even output Typst files, so those who use it are ready for that revolution if it comes.
Conclusion
The goals of the free-software movement include ensuring the ability of all users of software to form a community enriched and liberated by the right to study, modify, and redistribute code. The specific needs of the scientific community bring the benefits of free software into clear focus and they are critical to the health and continued progress of science.
The free-software movement has an echo in the "open-access movement", which is centered around scientific publication and began in the early 1990s. It has its origins in the desire of scientists to break free of the stranglehold of the commercial scientific publishers. Traditionally, those publishers have interfered with the free exchange of ideas, while extracting reviewer labor without compensation and attaching exorbitant fees to the access of scientific knowledge. Working scientists are aware of the movement, and most support its aims of providing free access to papers while preserving the curation and quality control inherited from traditional publishing. It is important to also continue to nourish awareness of the crucial role that free software plays throughout the scientific world.
Page editor: Jonathan Corbet