
LWN.net Weekly Edition for April 17, 2025

Welcome to the LWN.net Weekly Edition for April 17, 2025

This edition contains the following feature content:

  • What's new in APT 3.0: the user-interface, solver, and signature-verification changes in the latest APT.
  • Don't panic: Fedora 42 is here: a look at the new Fedora release.
  • Atomic writes for ext4: an LSFMM+BPF session on untorn-write support in ext4.
  • Topics from the virtual filesystem layer: anonymous mount namespaces, ID-mapped mounts, unprivileged mounts, and more.
  • Parallel directory operations: reducing lock contention for directory modifications.
  • Preparing DAMON for future memory-management problems: a DAMON status update and a look at what comes next.
  • Management of volatile CXL devices: the challenges of managing memory that comes and goes.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

What's new in APT 3.0

By Joe Brockmeier
April 16, 2025

Debian's Advanced Package Tool (APT) is the suite of utilities that handle package management on Debian and Debian-derived operating systems. APT recently received a major upgrade to 3.0 just in time for inclusion in Debian 13 ("trixie"), which is planned for release sometime in 2025. The version bump is warranted; the latest APT has user-interface improvements, switches to Sequoia to verify package signatures, and includes solver3—a new solver that is designed to improve how it evaluates and resolves package dependencies.

The APT suite includes the high-level apt command-line tool as well as a number of lower-level utilities for configuring APT, managing packages, or querying its internal database (known as the cache). It also includes the libapt-pkg package-management library that is used by alternative frontends to APT, such as Aptitude and Nala. In turn, APT relies on dpkg to actually work with Debian package files (.debs). The relationship between all of Debian's package-management tools is well-described in the Debian FAQ.

Better interface

The first version of APT, 0.0.1, was released in 1998 by Scott K. Ellis. Current development for APT is primarily driven by Julian Andres Klode. APT 3.0 is the first major release since version 2.0 was released in March 2020. Version 3.0 entered unstable on April 4, and made its way into testing on April 10. Klode has dedicated the 3.0 series to Debian and Ubuntu developer Steve Langasek ("vorlon"), who passed away at the beginning of the year.

Development of the 3.0 series began in April 2024, with version 2.9.0. That update included some apt interface changes that closed a decade-old bug from Joey Hess. The bug was a request for apt to provide a clearer message to users about which packages it would remove when performing an operation like dist-upgrade. He complained that messages about package removals would be "buried in the middle of masses of other data that are liable to be skimmed at best, and scroll right off the terminal at worst".

With 3.0, apt organizes its output into more readable sections, and adds columnar display of packages as well as colorized output. For example, as shown in the screenshot below on the bottom pane, prior versions of apt would clump all of the information together in a block of text that may be hard to read. The output for apt 3.0, shown in the top pane, puts information in logical blocks and highlights package removal last—in red—to help ensure that the user is aware they're about to perform a potentially destructive operation.

[apt remove output]

The 3.0 branch also automatically invokes a pager for commands such as "apt search", "apt show", or "apt policy" that may have longer output to the terminal.

DEB822 and apt modernize-sources

Debian has been, quite slowly, moving to a new format for APT data sources. Most Debian users are familiar with the /etc/apt/sources.list file and its format, which puts all information about a package archive on one line, such as this example:

    deb https://deb.debian.org/debian/ bookworm contrib main non-free non-free-firmware
    deb-src https://deb.debian.org/debian/ bookworm contrib main non-free non-free-firmware

That specifies the archive type (deb for binary packages, deb-src for source packages), the repository URL, the distribution release name (bookworm here), and the components such as contrib, main, non-free, and so on. In 2015, APT 1.1 added support for a new format, DEB822, taken from the RFC 822 standard for internet text messages (mail). (The most recent version of the Internet Message Format is RFC 5322.) In the new format, the same archive information would be in the file /etc/apt/sources.list.d/debian.sources and look like this:

    Types: deb deb-src
    URIs: https://deb.debian.org/debian
    Suites: bookworm
    Components: main contrib non-free non-free-firmware
    Architectures: amd64
    Enabled: yes
    Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

This has the advantage of making the source format much more readable, and it adds the ability to have options. In this example there is "Enabled", which allows the user to specify whether a source should be enabled or not without having to comment out each line to disable it. "Architectures" is straightforward: it specifies the CPU architectures for which packages should be downloaded. "Signed-By" specifies the GnuPG key used to sign the Release files in the archive. As the sources.list manpage notes, the intent has been to gradually make DEB822 the default format "as it is easier to create, extend and modify for humans and machines alike" and eventually deprecate the old format.

Klode had lobbied to move to DEB822-type source files in Debian 12, but that change did not take place. Likewise, at least for current Debian testing/trixie installs, it has not taken place for Debian 13, either. However, apt has gained a modernize-sources command that will convert sources.list and other APT data source files to the DEB822 format. Running "apt modernize-sources" as root, if there are unconverted files, will display a message that details any files that will be converted and asks the user whether it should continue. If yes, it will back up the old file or files with a .bak extension and write out a file to /etc/apt/sources.list.d/debian.sources. If not, the command will write to the console what it would write out in the new format rather than replacing the file or files.

Ubuntu users may already be familiar with the DEB822 format, as Ubuntu started using it in the 24.04 "Noble Numbat" release.

In any case, the old format is likely to be with us for quite some time. In the discussion about using DEB822 sources by default in Debian 12, Klode said that the timeline for removing support for the sources.list format would be 2030 at the earliest. That was in 2021 and would have included deprecating the format in trixie first. At this point, the earliest the format would even be deprecated is with the Debian 14 ("forky") release, and that schedule is not set yet; it is unlikely to debut sooner than 2027.

New solver

APT handles determining dependencies, conflicts, and interactions between packages behind the scenes, so that users can install software without worrying about that information. If a user wants to install Git, they do not need to know that it requires the GNU C Library, Perl, the zlib compression library, PCRE2, and several other packages. They do not need to know about the dependencies those packages have, nor do they need to know ahead of time if software already installed on their system conflicts with the software they would like to install.

APT uses its solver to sort out the relationships between packages and provides a solution—if one exists in the data sources available to it—to installing all of the packages needed to provide Git or whatever software the user had asked to install. The user does not need to know about the dependencies those packages have, or the dependencies that the dependencies have, and so on.

In 2023, Klode wrote about upgrade approaches for APT and how dependency solving should be done differently between upgrades within a release (e.g., package updates to an installed Debian 12 system), and upgrades from one release to the next (e.g., Debian 12 to Debian 13). Within a release, APT should minimize package changes—but on upgrades to a new release, APT should minimize the divergence between an upgraded system and a fresh install of the release. His proposal was that the APT solver should "forget" which packages were installed automatically and instead aim for normalization.

For example, a user might use apt to install their favorite editor on a system running Debian bookworm. If that operation pulls in automatic dependencies, prior versions of APT would seek to keep those dependencies when the system is upgraded to trixie—even if there are different options that would be installed if the user had installed the editor on a fresh install of trixie. The new solver would opt instead to replace automatically installed dependencies with the preferred options for trixie so that the install does not diverge so much from other trixie installs.

The first version of solver3 appeared in APT 2.9.3. On his blog, Klode described the new solver as a Davis–Putnam–Logemann–Loveland (DPLL) solver without pure-literal elimination.

In practical terms, Klode said that the "most striking difference" to the classic solver is that solver3 always keeps manually installed packages, including those marked as obsolete. In the future, he said, this policy will be relaxed to allow obsolete packages to be replaced if there are packages that can replace them. During his talk at DebConf24 last year, he estimated that the new solver would be about 40% faster than the classic version because it does not evaluate every single package in a repository when coming up with a solution.

APT 3.0 also aims to be more helpful in tidying up users' systems. The apt autoremove command is useful for getting rid of packages that are no longer needed. This includes packages that have been installed as a dependency of another package that has since been removed, or packages that an upgrade has made unnecessary. Solver3's autoremove will be more aggressive about removing unneeded packages because, Klode said, it only knows about the strongest dependency chain for each package—and will no longer keep packages that are only reachable via weaker chains.

Since a system can have multiple APT repositories enabled, and a package may exist in more than one of those repositories, APT uses priorities for repositories and individual packages to decide which to install. The default priority is 500, with higher values representing higher priorities. When packages have the same priority, APT will choose the package with the highest version number. If the packages have different priorities, APT will choose the one with the highest priority. Pinning can be used to change the priorities of packages or repositories; it might be used if a user wants to install some packages from third-party repositories or to prefer packages from Debian backports over the ones in Debian stable. The "apt-cache policy" command displays the priority of repositories and any pinned packages.
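
As an illustration only (the repository and priority values here are not from the article), a pin dropped into /etc/apt/preferences.d/ to prefer packages from bookworm-backports over those in stable might look like this:

    # /etc/apt/preferences.d/99-backports (illustrative)
    Package: *
    Pin: release a=bookworm-backports
    Pin-Priority: 990

Running "apt-cache policy" afterward should show the raised priority next to the backports repository.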

In 3.0, apt has a new option, --no-strict-pinning. This tells apt to consider all versions of a package and not just the best version ("candidate" version, in APT terminology). Klode gives the example of trying to install package foo version 2.0:

    # apt install foo=2.0 --no-strict-pinning

That would install foo 2.0 and upgrade or downgrade other packages that are needed to satisfy its dependencies, even if the repositories containing the needed dependencies are pinned at a lower level and it means downgrading other packages to satisfy foo 2.0's dependencies. This can be useful, for example, to install one or two packages from Debian's experimental repository without getting all packages from that repository.

sqv by default

Historically, APT has used GnuPG as its OpenPGP implementation to verify the signature on the Release.gpg file included with a repository. In 3.0, APT now uses the Sequoia PGP project's sqv verification tool by default on supported platforms. (LWN covered Sequoia in January.)

Klode explained in November last year that there were several reasons why he had worked to replace GnuPG. He referred to GnuPG's upstream becoming incompatible with other OpenPGP implementations, which LWN covered in 2023. He also said that GnuPG's implementation had quality issues, such as silently ignoring some options, not producing errors on expired signatures, and some unsafe features:

For example, the new --assert-pubkey-algo feature accepts <operator><name><size> as the syntax, so it looks at >=ed448 and accepts ed25519 as being stronger because 25519 >= 448, whereas it is the weaker curve.

He also noted that sqv is written in a memory-safe language (Rust), which could make parsing OpenPGP safer. He did not mention speed as a reason for the transition, but Ian Jackson thanked all the people who had contributed to the change, and said that it had sped up running the dgit test suite "by a factor of nearly 2".

According to a message Klode sent to debian-devel last year, sqv became the default in unstable for all of Debian's release architectures and some of its ports (work-in-progress architectures that have not been promoted to release status). Since APT 3.0 has now migrated into testing, and therefore trixie, sqv will be the default going forward.

apt install 3.0

Users of the Debian unstable and testing releases should already have 3.0 at their fingertips, assuming they've done an upgrade since April 10. It will also be in Ubuntu 25.04, which is scheduled for release on April 17. Per the release notes for 25.04, the new solver will be used automatically "if the classic solver cannot find a solution to either find a solution or add more context to the failure" or to evaluate its performance.

It will be interesting to see how the new APT is received by users when Debian trixie is released and it starts being widely used. In my brief and not entirely rigorous testing, APT 3.0 seems to be an improvement all around—but real-world usage will no doubt shake out some interesting issues.

Comments (46 posted)

Don't panic: Fedora 42 is here

By Joe Brockmeier
April 15, 2025

Fedora Linux 42 has been released with many incremental improvements and updates. In this development cycle, the KDE Plasma Desktop has finally gotten a promotion from a spin to an edition, the new web-based user interface for the Anaconda installer makes its debut, and the Wayland-ification of Fedora continues apace. In all it is a solid release with lots of polish.

The ultimate answer

The Fedora Project has, as a one-time exception, resurrected its tradition of release names. Fedora 42 is named "Adams" to honor Douglas Adams, the creator of The Hitchhiker's Guide to the Galaxy, who passed away far too soon in 2001. Prior to this, the last Fedora release to have a release name was Fedora 20, which was called Heisenbug; the use of release names was discontinued by the Fedora Board in October 2013. The Fedora Board itself was replaced by the Fedora Council in November 2014.

As many LWN readers no doubt already know, 42 is, in the story, the answer to the ultimate question of life, the universe, and everything. In addition to the name, the default wallpaper for the release also has a little Easter egg from the series that some have found puzzling.

Installer changes

Fedora Workstation users who do a fresh installation of 42 will get a chance to try out the Anaconda web-based installer. (Other editions and spins still offer the "classic" Anaconda experience.) Web-based installers seem to be all the rage these days: the openSUSE project is taking a similar approach with its Agama installer.

In addition to replacing the technology used under the hood for the user interface, which is now based on the Cockpit framework rather than GTK, Anaconda drops the "hub and spoke" model in favor of a wizard-based installation approach. The rationale for this, according to the change proposal, is that the traditional installer presents "too much information at once" and is harder to use if the user does not know what they need. It's hard for me to say, objectively, whether that's true or not—especially since I've been using the old-style Anaconda installer workflow for many years and have performed hundreds of installs with it. Good, bad, or indifferent, the new installer is going to take some getting used to.

However, in presenting less information, the new installer does make it harder to find custom partitioning. The installer has four main steps: choosing the language and keyboard layout, installation method, storage configuration, and confirming the choices before starting the installation. One might expect that "storage configuration" would be the step to modify partitions, etc. However, that step is only for users to choose whether or not they want to encrypt disks. To modify partitions, users need to select "Launch storage editor" from the three-dot menu (known as a kebab menu) in the right-hand corner of the installer while in step two. Before being taken to the storage editor, users are presented with a warning dialog and advised that the editor is "an advanced utility and not intended to be used in a typical installation".

[Anaconda installer]

Presumably, a "typical installation" is one in which Fedora is installed on a system with a single disk that it gets to own entirely. That is probably the most common scenario, but many users have more advanced needs. For example, users might want to dual-boot Fedora with Windows, or to use more than one disk (e.g., putting /home on a separate disk). This change does users something of a disservice. Instead of making complicated things easier to do, it seems to try to dissuade users from doing them instead.

That said, users who don't want to step off the path paved by the new installer should find it to be simple and unintimidating. I ran through the installation a few times in virtual machines, trying different options, and it worked flawlessly.

KDE edition

KDE is finally being presented on equal footing with GNOME as a Fedora edition, but what does that mean for those who have used the KDE spin all along? Not very much, really. The change is, as the proposal notes, "mainly a marketing change". It means that KDE will, eventually, show up on the front page of the Fedora site alongside Fedora Workstation, Fedora Server, Fedora IoT, Fedora Cloud, and Fedora CoreOS.

This is not to say that there are no goodies in store for KDE users in this release. Fedora 42 updates KDE Plasma to the 6.3 series, which has some minor improvements over the version in Fedora 41, Plasma 6.2.

The KDE edition still uses the classic Anaconda installer, so users doing a fresh install will not notice any differences from previous installs. After the installation, KDE pops up the Welcome Center dialog that provides a short overview of KDE's features and asks the user to share anonymous usage information. Data sharing is off by default, so if one speeds through the Welcome Center without paying close attention (as many users are no doubt likely to do if they have used KDE previously), they will not inadvertently give up their usage data. Users are also asked if they would like to enable third-party package repositories.

KDE 6.3 offers better fractional scaling, following work to overhaul how KWin places content on the screen's pixel grid. I had a chance to tinker with scaling shortly after installing the Fedora KDE edition on a test system, when I couldn't shake the feeling that things weren't quite right. Specifically, I had the impression that the interface had been set up by someone with an affinity for large-print books. My first thought was that KDE had not detected my monitor's resolution properly. Instead, I found that the display resolution was correct, but KDE had defaulted to 115% scale.

KDE allows the user to set fractional scaling to a value between 50% and 300%, so I spent some time trying different values to see what worked best. Anything below 75% resulted in text too small and fuzzy for my taste, but perhaps there are users who have the right combination of hardware and eyesight to find that 50% is usable. Likewise, scaling at 150% or 200% is usable on a 13-inch Framework laptop display, with its oddball 2256x1504 resolution, but 300% is too much of a good thing. In contrast, GNOME also supports fractional scaling—but only for values of 100% or above and in 25% increments. That is, one can set GNOME to 100%, 125%, 150%, and so forth, but not 115%. KDE supports smaller increments, so users can try 80% if 75% isn't quite right, or 120% rather than 125%, and so on. Ultimately, likely to the disappointment of the KDE developers who have been working hard on fractional scaling, I set it at 100%.

KDE e.V. has a process for selecting goals to focus on every three years. Last year, it announced several goals to take it through 2026—one of which is to improve support for input devices. 6.3 delivers on this with improvements for users with drawing tablets and touchpads. Sadly, I do not have the appropriate hardware to test KDE's improvements for configuring drawing tablets (see this site for a rundown), but I was pleased to see that KDE developers have added the ability to switch off the touchpad if a laptop has a mouse plugged in. This is one of those small changes that offers a major quality-of-life improvement.

[KDE Plasma Desktop 6.3]

One change that may make a small number of users very happy: the full set of KDE packages is now available for the Power architecture (ppc64le), and there are installable live images for OpenPOWER-based systems.

KDE Gear, the collection of standard KDE applications such as the Dolphin file manager, Okular document viewer, and Kdenlive video editor, has a different release schedule than Plasma. While KDE's desktop is updated twice a year, Gear's feature releases are on a quarterly cycle. Fedora's KDE special interest group (SIG) generally maintains Gear applications in a rolling-release fashion across supported Fedora releases. For example, Okular, Dolphin, and Kdenlive packages for Fedora 40, 41, and 42 are all from the 24.12 feature release of KDE Gear. (Specifically, Gear version 24.12.3 released on March 6.)

There are a few interesting new features in 24.12, such as improved keyboard navigation in Dolphin, but it is mostly polish. That extends to the overall KDE experience in Fedora 42—a number of small enhancements and polish, but nothing that is likely to have a great impact on day-to-day use. That goes both ways, of course: there are no changes I noticed in 42 that detract from the KDE experience.

Wayland-ification continues

The slow deprecation of X11 throughout Fedora continues in this release. Last year, the KDE SIG dropped support for X11 with the move to Plasma 6. Kevin Kofler stepped up to maintain the required packages to allow users to install X11 support post-installation, after some requisite drama and intervention by the Fedora Engineering Steering Committee (FESCo).

This release takes a few more steps toward chiseling X11 out of Fedora. The first change is Anaconda shipping as a native Wayland application and dropping X11 packages from the installation images when possible. Spins that still depend on X11, such as LXDE and Xfce, include the X11 packages for now.

Beware of leopard

The GNOME Display Manager (GDM) has, somewhat stealthily, dropped support for X11 sessions in Fedora 42 as well. Dominik 'Rathann' Mierzejewski raised an objection to this on the fedora-devel mailing list on April 9. He noted that there was nothing in the change list about it, and suggested that it be reverted for this release and moved to Fedora 43. Kevin Fenzi said that it could have been noted as a change to raise awareness, but noted that GNOME upstream plans to remove X11 support in the next cycle–and that the change itself was made in Rawhide in September 2024. Re-adding X11 support would only be useful for trying to run X11 GNOME sessions, which, he suspects, will be increasingly difficult.

Michael Catanzaro said that it does seem "a little petty" for Fedora to disable X11 support in GDM before upstream GNOME disables it. But if the project is confident X11 support will be removed in upstream, "then it seems reasonable to do this in Fedora first". He added that he did not have a strong opinion but observed that it was "long past time for you to figure out a path forward" for those still using GNOME on X11.

GNOME and Fedora Workstation both switched to Wayland by default in 2016; it's been 9 years now, a *really* long time to not be ready for this.

The removal was proposed as a blocker bug, which would have delayed the Fedora 42 release and required restoration of X11 support, but it was voted down.

Something for everybody

All Fedora editions and spins share the same base set of packages, meaning that each version has the same kernel, libraries, and utilities. Fedora 42 brings Linux 6.14.0, and the usual GNU toolchain updates—GNU C Library (glibc) 2.41, Binutils 2.44, and the GNU Debugger (gdb) 16.2. Python has been updated to 3.13.3. The Python 3.8 package, which was available in earlier Fedora releases for compatibility with other Linux distributions, has been retired. Go 1.24.2, LLVM 20, OpenJDK 21.0.6, Perl 5.40.1, PHP 8.4.6, Ruby 3.4.2, Rust 1.86.0, and Tcl/Tk 9.0 are also available.

One under-the-hood change in this release worth noting is the unification of /usr/bin and /usr/sbin. This may go entirely unnoticed by many Fedora users but required a fair amount of work on the part of Fedora contributors. The primary benefits of the change are to simplify packaging—developers don't have to consider whether things should be installed in bin or sbin—and to make Fedora more compatible with other Linux distributions by using the same paths for utilities like ip, nstat, and ping. This change follows the /usr move that was finalized in Fedora 17.
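
For the curious, the effect is easy to check from a shell; on a Fedora 42 system one would expect /usr/sbin to simply point at bin (a quick sketch, not output from the article's testing):

    $ readlink /usr/sbin
    bin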

Packages that require the git binary, but not everything that comes with Git, can now depend on the git-core package. As described in the switch to git-core change proposal, the git-core package only has nine packages as dependencies and consumes 8MB on install media or 32MB when installed. The git package requires 76 packages as dependencies (including a slew of Perl modules) and takes up 19MB on install media or 75MB when installed.
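
For packagers, the change mostly amounts to picking the smaller dependency where it suffices; a hypothetical spec-file line might read:

    # Only the git binary is needed, not the full Git suite:
    Requires:  git-core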

Even though KDE has graduated to edition status, the number of Fedora spins remains the same. The project has added a new spin, Fedora COSMIC, which is based on the alpha 6 release. LWN covered COSMIC in August 2024.

Microsoft has offered the Windows Subsystem for Linux (WSL) 2 on Windows 10 and Windows 11 to run Linux in a virtual machine with some integration into the desktop. One of the changes in Fedora 42 was to start producing Fedora WSL images that Windows folks can install so they have Fedora as an option. Currently, Fedora images are not available in the Windows Store, so users need to grab an image and install it manually. Images are available for x86_64 and Arm (aarch64) and instructions are on the wiki.

With Adams out the door, Fedora developers can now turn their full attention to Fedora 43, which is due in October. Some of the changes proposed for 43 so far include 99% package reproducibility, moving to RPM 6.0, and disabling support for building OpenSSL engines. Naturally, 43 will also have the usual flurry of software updates as well.

Fedora 42 is almost certainly not the answer to the ultimate question, but it is a solid update with quite a few minor improvements. Users on Fedora 41 won't see a drastic change, but that is not a bad thing—it's a stable release that keeps users up to date with the most recent open-source applications without too many sharp edges.

Comments (19 posted)

Atomic writes for ext4

By Jake Edge
April 10, 2025

LSFMM+BPF

Building on the discussion in the two previous sessions on untorn (or atomic) writes, for buffered I/O and for XFS using direct I/O, Ojaswin Mujoo remotely led a session on support for the feature on ext4. That took place in the combined storage and filesystem track at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit. Part of the support for the feature is already in the upstream kernel, with more coming. But there are still some challenges that Mujoo wanted to discuss.

For ext4, a single filesystem block can be written atomically; that support was merged for 6.13, he said. There is work in progress on doing multi-block atomic writes in ext4. There are two main allocation challenges that need to be handled for multi-block, though: unaligned extents that do not match the hardware alignment requirements and ranges with mixed mappings, for example those that cover both unwritten data and hole sections.
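
As a rough illustration of how the feature is used from user space (this is a sketch, not code from the session, and it assumes kernel and libc headers new enough to provide the RWF_ATOMIC and STATX_WRITE_ATOMIC additions from the 6.11 untorn-write work), a program can query the supported atomic-write sizes with statx() and then issue an untorn write with pwritev2():

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_ATOMIC              /* values from the kernel uapi, for older libc headers */
    #define RWF_ATOMIC 0x00000040
    #endif
    #ifndef STATX_WRITE_ATOMIC
    #define STATX_WRITE_ATOMIC 0x00010000
    #endif

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file on an atomic-write-capable fs>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDWR | O_DIRECT);  /* untorn writes currently need direct I/O */
        if (fd < 0) { perror("open"); return 1; }

        /* Ask what atomic-write sizes this file supports. */
        struct statx stx;
        if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0) {
            perror("statx"); return 1;
        }
        if (stx.stx_atomic_write_unit_min == 0) {
            fprintf(stderr, "no atomic-write support reported for this file\n");
            return 1;
        }
        printf("atomic write unit: min %u, max %u bytes\n",
               stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);

        /* Issue one untorn write of the minimum supported size at offset 0. */
        size_t len = stx.stx_atomic_write_unit_min;
        void *buf;
        if (posix_memalign(&buf, len, len)) return 1;
        memset(buf, 0xab, len);

        struct iovec iov = { .iov_base = buf, .iov_len = len };
        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
            perror("pwritev2(RWF_ATOMIC)");

        close(fd);
        return 0;
    }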

The ext4 bigalloc feature eliminates the problem with unaligned extents because the cluster size for the filesystem can be set to, say, 16KB, so everything will be aligned on those boundaries. But it is a filesystem-wide setting, even if atomic writes are only needed for a few files, and it requires that any existing filesystem be reformatted to use the feature. Reformatting may not be desirable for all use cases, but multi-block writes with bigalloc is working now.
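
For reference, enabling bigalloc is a format-time decision; a sketch of creating a fresh filesystem with 16KB clusters (the device name is illustrative, and this of course destroys any existing data) would be:

    # mkfs.ext4 -O bigalloc -C 16384 /dev/vdb1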

Currently, without bigalloc, ext4 does not have a way to guarantee the needed alignment; if an atomic write is done on an unaligned extent, ext4 has no fallback, it simply returns an error to the user. In order to ensure the alignment, Mujoo is exploring a combination of extsize and forcealign. Extsize is a per-inode alignment "hint" to the allocator that is set with an ioctl() command; it will try to allocate all extents to the boundary specified, but can fail. The forcealign extended attribute can be set on a file that has an extsize specified in order to require that allocation alignment; it can be seen as a per-file bigalloc.
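
The exact ext4 interface is still being worked out; as a point of reference, the sketch below uses the existing FS_IOC_FSSETXATTR ioctl() and its extent-size-hint fields, which XFS honors today. Whatever ext4 ends up exposing for extsize and forcealign may well look different.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR */

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <file> <extent-size-bytes>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct fsxattr fsx;
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) { perror("FS_IOC_FSGETXATTR"); return 1; }

        /* Ask for extents allocated in (and aligned to) fixed-size chunks. */
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = (unsigned int)strtoul(argv[2], NULL, 0);  /* e.g. 16384 */

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) { perror("FS_IOC_FSSETXATTR"); return 1; }

        close(fd);
        return 0;
    }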

Luis Chamberlain said that he has done some analysis of ext4 using bigalloc with a 16KB cluster size and noticed that some writes are not aligned on 16KB boundaries; he wondered why that was. Ted Ts'o said that bigalloc guarantees that data blocks are aligned to the cluster size, but not metadata blocks, which are still 4KB-aligned. Journal updates, inode updates, and bitmap-allocation-block updates could all cause writes that are not aligned to the cluster size.

Chamberlain wondered if there was any way to support 16KB writes for ext4 metadata; Ts'o said that it would require ext4 support for filesystem block sizes larger than the page size. The buffered I/O path for ext4 would probably need to switch to using iomap, he said; the ext4 developers are interested in getting patches that make that switch and he understands that large-block-size support is fairly straightforward once that happens.

The idea is to not require any reformatting of the filesystem with extsize and forcealign, Mujoo said. That will require fallbacks for files that are not properly aligned when forcealign is set for them. A "compat" feature flag can be added that can be set on existing filesystems; that will allow older kernels to mount the filesystem. An ioctl() command can be added to fix files that are not properly aligned. The forcealign feature might also have use cases outside of atomic writes; for example, it might help with getting properly aligned blocks for use with a direct-access (DAX) filesystem.

The problem of mixed mappings affects both bigalloc and non-bigalloc ext4; avoiding atomic writes with mixed mappings should be the goal, but it may not always be met. If a mixed mapping is used for an atomic write, there are three solutions that he sees. The first is to return an error, which might be popular for those who do not want a fallback path. Another is to zero the holes and write them with the rest. Finally, ext4 can do something similar to what XFS is doing: write the data in a new place and atomically change the extent mappings. Ext4 has no infrastructure to support the XFS-like solution, however, so it would add complexity to the solution.

Mujoo described the roadmap for ext4 atomic-write support. The patch sets for multi-filesystem-block writes using bigalloc and for adding extsize and forcealign support to ext4 are being targeted for Linux 6.16. Subsequent features, including using extsize and forcealign for multi-block atomic writes, exploring an extent-swapping fallback, and enabling buffered atomic writes for ext4, will come later.

Chamberlain asked if the idea of using the multi-index feature of XArray had been evaluated as a means to more generically support atomic writes for all filesystems. Mujoo agreed that it would be nice to have VFS support for atomic writes that could be used by more filesystems, but has not really looked at that.

Comments (8 posted)

Topics from the virtual filesystem layer

By Jake Edge
April 16, 2025

LSFMM+BPF

In the first filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), virtual filesystem (VFS) layer co-maintainer Christian Brauner had a few different topics he wanted to talk about. Issues on the agenda included iterating through anonymous mount namespaces, a needed feature for ID-mapped mounts, the perennial unprivileged mounts topic, potentially using hazard pointers for file reference counting, and Rust bindings. He did not expect to get through all of them in the 30 minutes allotted, but the session did move along pretty quickly to at least introduce them to the assembled filesystem developers.

He noted that one of the accomplishments for the filesystem community over the last few years was in reworking the mount process and API. He was hopeful that mount notifications would be merged during the 6.15 merge window that was taking place during the summit—and they were. That feature will be useful for getting notifications of changes to mount trees, rather than having to frequently query the kernel or /proc files to keep track.

Anonymous mount namespaces

[Christian Brauner]

In the VFS, there is a notion of anonymous mount namespaces ("or 'detached mount trees', however you want to look at it") where a mount namespace exists, but processes cannot use setns() to enter it; it is not attached to anything else. Oddly, though, a process can chroot() to a directory in it, but then cannot list the mounts in the namespace. These mount namespaces do not have a representation in /proc/PID/mountinfo, which makes them "completely opaque to user space"; there is no way to interact with them, he said. That is a problem because, for example, a process can have a file descriptor for such a namespace, which pins various things into memory, but there is no way for user space to figure out what is responsible for pinning the memory.

So there is a need to expand the listmount() and statmount() system calls to be able to interact with anonymous mount namespaces. There is now a way to iterate through all mounts in all mount namespaces, except those attached to anonymous mount namespaces. Container workloads, under Kubernetes in particular, can have hundreds of mount namespaces; prior to the addition of listmount() and statmount(), listing all of the mounts would have required concatenating the output from all of the /proc/PID/mountinfo files. Extending that to the anonymous mount namespaces will help complete the API, he said.

Jeff Layton asked if the anonymous mount namespaces are collected onto a list in the kernel somewhere, but Brauner said they are not currently. There is a red-black tree for the mount namespaces, but that is indexed by the sequence number assigned to the namespace; anonymous mount namespaces all have a sequence number of zero. Instead of using the zero to recognize anonymous entries, a flag or something should be used, then a regular sequence number can be assigned and the entries can be added to the tree, Brauner said.

Layton said that he has been running into a similar problem with network namespaces; NFS can take a reference to a disconnected network namespace that then stays alive with no way to track down what has the reference. Brauner said that he had discussed this problem with Josef Bacik recently and they agreed that all namespaces should follow the plan for mount namespaces and assign a sequence number for each; the struct ns_common that holds namespaces would have the sequence number (or some kind of identifier) added to it, so that namespaces can be added to data structures and operated upon. It is a generic problem for namespaces that should be solved in a unified way, Brauner said.

ID-mapped mounts

Another thing he has been working on is a follow-on to the ID-mapped mounts feature that was merged into the 5.12 kernel in 2021. In 6.15, the ability to change UID and GID mappings for an ID-mapped filesystem by performing another ID-mapped mount on it was added. Now there is a need for various squashing options, where, for example, a range of IDs all map to a single ID.
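
For readers who have not used the feature, the sketch below shows the general shape of creating an ID-mapped mount with mount_setattr(), the interface merged in 5.12. The paths and the user namespace chosen here are purely illustrative; the operation needs appropriate privileges over the mount and the namespace, and a libc new enough to define the SYS_* numbers used.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/mount.h>  /* struct mount_attr, MOUNT_ATTR_IDMAP, OPEN_TREE_*, MOVE_MOUNT_* */

    int main(void)
    {
        /* A user namespace (a container's, say) whose UID/GID mappings define
         * the mapping to apply; the PID here is purely illustrative. */
        int userns_fd = open("/proc/1234/ns/user", O_RDONLY | O_CLOEXEC);
        if (userns_fd < 0) { perror("open userns"); return 1; }

        /* Take a detached copy of the mount tree at /srv/data. */
        int tree_fd = syscall(SYS_open_tree, AT_FDCWD, "/srv/data",
                              OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
        if (tree_fd < 0) { perror("open_tree"); return 1; }

        /* Attach the ID mapping to the detached mount. */
        struct mount_attr attr = {
            .attr_set = MOUNT_ATTR_IDMAP,
            .userns_fd = userns_fd,
        };
        if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0) {
            perror("mount_setattr"); return 1;
        }

        /* Make the ID-mapped mount visible at /mnt/idmapped. */
        if (syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, "/mnt/idmapped",
                    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
            perror("move_mount"); return 1;
        }
        return 0;
    }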

One of the problems with the existing ID-mapping mechanism is that every ID present in the filesystem needs to be explicitly mapped to another ID; a file whose ID is unmapped effectively has no ID and cannot be interacted with at all. For efficiency reasons, there may need to be limits on the number of squashed ranges that can be supported, but some kind of simple range squashing is needed. In addition, a way to say that any unlisted IDs map to a single specific ID should be supported.

The interface for specifying mappings currently follows the model that user namespace mappings use, where mapping information is written to /proc files. That works, but Brauner does not think it will scale to the needs of ID-mapped mounts. His immediate focus is to add a way to squash all of the unmapped IDs to a specified ID; he has a proof-of-concept implementation in his tree that passes all of his tests. It will make things much simpler for use cases like mapping all of the IDs on a filesystem to a single ID as no lookup will be needed.

Unprivileged mounts

Brauner informally polled the audience for which topic the attendees wanted him to talk about next: unprivileged mounts or file reference counts. Several spoke up for the mount topic and any reference-count fans were notably silent. "Dammit, I wanted to pitch hazard pointers", he said with a laugh, though he did eventually get to the topic.

There are two aspects to the unprivileged-mount problem; the first is to allow unprivileged users to mount "a random USB stick", which is "a really terrible idea". The other is to mark a specific filesystem as being mountable inside of a user namespace, "which adds a bit more protection, at least in terms of setuid binaries and all that kind of stuff", he said, but it does not help with the problems of malicious filesystem images. He does not think that there is any solution for allowing unprivileged users to mount random filesystem images; "I don't believe that Rust will solve this problem, I think that's a pipe dream".

Brauner pointed to the solution described in an LSFMM+BPF session from 2023, which allows user space to "safely delegate mounting of filesystems to unprivileged users" using systemd-mountfsd. That solution does not work for network filesystems, but Layton said that was simply a policy decision; the same mechanism could be used but network filesystems would need to add some capabilities to enable it.

For the USB-stick case, Brauner said, the solution should be to use Filesystem in Userspace (FUSE) and "don't mount untrusted stuff". Over the remote link, Jan Kara said that the solution for USB mounts does not lie in the kernel. OpenSUSE has started looking into mounting USB sticks using the Linux Kernel Library (LKL), which is somewhat similar to User Mode Linux; it has a FUSE daemon that uses LKL to mount the filesystem and expose it to the kernel, which "provides additional isolation". For USB sticks, performance is not particularly important, he said, so this "seems like a promising solution".

Reference counts

Last year, Brauner said, he added a reference-count mechanism for struct file; the patch set uses something similar to rcuref with dead zones so that an unconditional increment can be done when taking a reference to an entry in the files table via the file descriptor. It provides a 3-5% performance increase when there is a lot of contention, "which is great obviously", but he thinks the scalability problem has just been pushed out further.

So there is a need to explore other options. There was a patch set implementing hazard pointers for the kernel that he has been experimenting with, but that implementation is not suitable for struct file because it does background scanning and memory allocation. If hazard pointers were to be used, the file-reference path may need to allocate memory, which would add another possible error path.

He would like to explore the idea further, but it is "a very vague idea" at this point; it might lead to regressions in the single-threaded case, though, which would not be desirable. Amir Goldstein asked where the scalability problem lies; Brauner said that it comes from contention with socket file descriptors in highly threaded workloads. "It's not fantasized, it's actually an issue", but is not likely to be one for highly threaded writes, because there are other synchronization activities present for those workloads.

Wrapping up in his final 30 seconds, Brauner said that the Rust inode bindings should be discussed at some point. He plans to pick up those patches in the hopes of "getting something like that merged" within the next two years.

Comments (11 posted)

Parallel directory operations

By Jake Edge
April 16, 2025

LSFMM+BPF

Allowing directories to be modified in parallel was the topic of Jeff Layton's filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF). There are certain use cases, including for the NFS and Lustre filesystems, as mentioned in a patch set referenced in the topic proposal, where contention in creating multiple files in a directory is causing noticeable performance problems. In some testing, Layton has found that the inode read-write semaphore (i_rwsem) for the directory is serializing operations; he wanted to discuss alternatives.

He and Christian Brauner had worked on a patch set that addressed some of the low-hanging fruit from the problem. It would avoid taking the exclusive lock on the directory if an open with the O_CREAT flag did not actually create a new file. It provided a substantial improvement for certain workloads, Layton said.

But what is really desired, he said, is to be able to do all sorts of parallel directory operations (mkdir, unlink, rmdir, etc.). Neil Brown developed the patch set referenced earlier (which was up to v7 as of the summit) that would stop taking the exclusive (write) lock for every operation on the directory; it would take the shared (read) lock instead. There are filesystems, such as NFS, that do not need to have those operations serialized, Layton said.

Brauner and Al Viro reviewed Brown's patch set and gave some good feedback, Layton said. The stumbling block for those patches is that they take the directory i_rwsem as a shared lock, rather than as an exclusive one, but take an exclusive lock on the directory entry (dentry), which opens up a lot of possibilities for deadlock. There are lots of corner cases, so the problem is with making it "provably correct, which is going to be an order of magnitude more difficult than the existing exclusive lock".

[Jeff Layton]

Brown has been working on the patches for a while. The most recent version adds many new inode operations with _async added, "which is a little ugly", Layton said, but that is "cosmetic stuff". The hard part is getting the locking right.

That is where things stand, he said, and wondered what filesystem developers could do to help with the "long slog", because it is important for scalability in directories. The workloads where the problems are seen are "big untar-type workloads", where there are many threads all creating files in parallel in the same directories.

Brauner asked if individual filesystems would have to opt into supporting the feature. Layton said that he did not see any other possibility. One path might be to, say, create a mkdir_async() inode operation, slowly convert all of the filesystems to use that, and then replace the mkdir() operation. Brauner said that he did not like adding lots of new inode operations, but thought that if the idea was viable, the existing operations could add an asynchronous form that would only be used for filesystems that opt in. Layton suggested starting slow, with a single inode operation for unlink, say; Brauner thought that sounded like a reasonable approach.

Josef Bacik wondered if it made sense to push the locking down to individual filesystems; "I hate doing this" but it would mean that filesystems could try alternative locking that would allow more parallelism. There would need to be code that embodies the existing locking for every filesystem (except those trying out alternatives) to use.

Mark Harmstone asked about batching the operations for these workloads, which could be done, but requires rewriting applications. Brauner pointed out that io_uring could probably be used to do the batching today if that was desirable, but Harmstone said that he had been using io_uring for directory-creation (mkdir) operations, which still take the lock for each one. Layton said that a fast-path option could perhaps be added to io_uring, but there is a more general problem to be solved.

For example, if multiple files are created in the same directory using NFS, those operations are all serialized on the client side, which means that it requires lots of network round trips before anything can happen. Fixing that would still end up serializing many of the operations on the server side, but that would be better. Bacik concurred, noting that you can take the untar workload out of the picture; creating a file in a directory takes an exclusive lock, so any lookups that involve the directory are held off as well.

Part of the problem is that it is not entirely clear what i_rwsem is protecting, Brauner said. For example, the setuid exit path recently started taking that lock to avoid some potential security problems. David Howells said that it is also used by network filesystems to ensure that direct and buffered I/O do not conflict.

Chris Mason asked which local filesystems would be able to handle parallel creation in directories. Layton did not know of any, but Bacik said that Btrfs would be able to; he agreed when someone else suggested that bcachefs might as well. Timothy Day from the Lustre filesystem project said that it could do so as well.

Bacik noted that once parallel creates are in place, "for sure we are going to run into the next thing"; there will be other constraints that will be encountered. He said that network filesystems are likely to be able to more fully take advantage of the feature, while local filesystems will only see modest gains. He has an ulterior motive in that Filesystem in Userspace (FUSE) servers may be able to find better locking schemes in user space if the kernel is not taking exclusive locks on their behalf.

Comments (2 posted)

Preparing DAMON for future memory-management problems

By Jonathan Corbet
April 10, 2025

LSFMM+BPF
The Data Access MONitor (DAMON) subsystem provides access to detailed memory-management statistics, along with a set of tools for implementing policies based on those statistics. An update on DAMON by its primary author, SeongJae Park, has been a fixture of the Linux Storage, Filesystem, Memory-Management, and BPF Summit for some years. The 2025 Summit was no exception; Park led two sessions on recent and future DAMON developments, and how DAMON might evolve to facilitate a more access-aware memory-management subsystem in the future.

A status update

The first session was dedicated to updating participants on what is happening with DAMON; Park started by thanking all of the members of the development community who have contributed changes. DAMON, he reminded the group, provides access-pattern information for both virtual and physical address spaces, and allows the specification of access-aware operations to be performed. Thus, for example, an administrator can set up a policy to reclaim all pages that have been idle for at least two minutes. DAMON is increasingly used for this sort of proactive reclaim by large providers like Amazon's AWS, and in various memory-tiering settings.
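
As a concrete (if minimal) sketch of that proactive-reclaim case, the DAMON_RECLAIM module can be pointed at a two-minute idle threshold through its module parameters; the parameter names below are those documented in the kernel's admin guide and may vary between kernel versions:

    # echo 120000000 > /sys/module/damon_reclaim/parameters/min_age   # two minutes, in microseconds
    # echo Y > /sys/module/damon_reclaim/parameters/enabled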

[SeongJae Park]

Major changes to DAMON since the 2024 update include a new tuning guide. Page-level monitoring has been enhanced to expose information about huge pages that can be used in policy filters. Shakeel Butt asked whether this filter would select specific huge pages, or those within a given region; Park said that the filter can be applied to any address range. DAMON can provide more detailed information, including what proportion of a given region has passed a filter. There are filters for huge-page size and whether pages are on the active least-recently-used (LRU) list.

DAMON now has some automatic tuning for monitoring intervals. Getting the interval right is important; if it is too short, all pages will look cold, but making it too long makes all pages look hot. The default interval (100ms) is "not magic", he said. DAMON users have all been doing tuning in their own ways, duplicating a lot of effort. The tuning knobs were finally documented, in the above-mentioned guide, in 6.14, which is a start; documentation is good, he said, but doing the right thing automatically is better.

The important question is: how many access events should each DAMON snapshot capture? The answer is the "access event ratio" — the number of observed events divided by the number that could have been observed. The automatic tuner runs a feedback loop aiming for a specified ratio; in the real world, it tends to converge near 370ms for a normal load, but four or five seconds on a light load. The result is good snapshots with lower overhead. Given how well the feature seems to work, Park wondered whether it should be enabled all the time.

Michal Hocko said that DAMON is not enabled in the SUSE distributions because nobody has asked for it. So, while he has no problem with enabling this tuning by default, he does not like the idea of it being active by default.

In the 2024 discussion on using DAMON for memory tiering, a simple algorithm was proposed:

  • On each NUMA node lacking a CPU, look at how that node sits in the hierarchy:

    • If the node has lower nodes (hosting slower memory), then demote cold pages to those lower nodes, aiming for a minimum free-memory threshold.
    • If the node has upper (faster) nodes, promote the hot pages, aiming for a minimum used-memory threshold on the upper node.

The core patches enabling this algorithm were merged for the 6.11 kernel release. A DAMON module has been implemented to manage the tiering, and an RFC posting has been made. When asked whether this module could make use of access information not collected by DAMON itself, Park answered that there is no such capability yet, but agreed that it would be useful.

In Park's testing system, use of the DAMON tiering module improved performance by 4.4%, but the existing NUMA-balancing-based tiering degraded performance instead. So there is a need for more investigation, but he concluded that, sometimes at least, DAMON can improve performance for tiering workloads.

Park ended the session with a brief discussion of where DAMON is going next. He wants to merge his work on generating NUMA utilization and free-space goal metrics, and implement a "just works" tiering module, suitable for upstream, that is able to identify a system's tiers automatically. Thereafter, this module could be extended to more general heterogeneous memory-management tasks. Taking CPU load and available memory bandwidth into account when performing migration was the last item on the list.

Future requirements

DAMON remained on the agenda for the next session, though, which was focused on requirements in a future where the memory-management system has more visibility into access patterns. There are a number of proposals circulating for ways to acquire that data, Park said, including working-set reporting, hotness monitoring from CXL controllers, accessed-bit scanning (as was discussed earlier in the day), data from AMD's Instruction Based Sampling feature, data from the multi-generational LRU, and more. The time is clearly coming when it will be necessary to handle multiple sources of access information to better manage memory. With that information, it should be possible to provide better control to administrators, automatically unblock a primary workload's progress, or achieve better fairness between tasks.

DAMON provides an "operations set" layer called DAMOS that handles the implementation of fundamental memory-management operations. That includes managing sources of information, which is then fed into the DAMON core and distilled into region-specific info. DAMOS operates under quotas that bound its resource usage; they can be specified by the administrator or tuned automatically. There are also filters that can narrow its attention to specific pages.

In the future, Park said, he would like to reduce the overhead of DAMON while improving its accuracy. If it can be made more lightweight, "everything will be solved". New data sources will help toward that goal. He plans to add an optimized API for access reporting; specifically, there will be a new function, damon_report_access(), to provide memory-access information to the kernel. It will take the ID of the accessing process, the address of interest, the number of accesses, and the NUMA node from which the access is made. The plan is for this function to be callable from any context.
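
Since the patches have not been posted yet, the prototype below is purely hypothetical; it simply restates the parameters Park described in C form.

    /* Hypothetical sketch only; the posted patches may differ. */
    void damon_report_access(pid_t pid,                /* ID of the accessing process */
                             unsigned long addr,       /* address that was accessed */
                             unsigned int nr_accesses, /* number of accesses observed */
                             int nid);                 /* NUMA node the access came from */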

There may also be a new function, damon_add_folios(), to indicate that specific folios are the highest-priority target for memory-management operations. It can be used to identify folios for promotion or demotion, for example.

He will be posting damon_report_access() soon, he said, so that DAMON can take advantage of access information from sources beyond its own scanning. The need for damon_add_folios() is less clear at the moment; he mostly just imagines that there might be a use for it. Park would like to hear from people with use cases for this functionality.

Jonathan Cameron said that the fact that a hardware-based access-reporting system does not report on a specific range is also significant; he suggested adding an interface for pages that are known not to have been accessed. He also said that, as finer-grained access data comes from more sources, the regions that DAMON considers will be broken up more finely in response. That will lead to denser data overall, and more for DAMON to track; he wondered if Park had thoughts on reducing the resulting tracking overhead. Park answered that DAMON already lets users set limits on the number of regions, and that this keeps DAMON in check now; it should work in the future as well.

The last question came from John Groves, who wondered if the access-reporting interface should be tied to virtual memory areas (VMAs). A mapped file will be represented by a single VMA, he said, even if it has a lot of shared users, so that could be a logical way to manage it. Park answered that a VMA-based interface could perhaps make sense, but he would need to hear more about the use cases for it.

Comments (7 posted)

Management of volatile CXL devices

By Jonathan Corbet
April 10, 2025

LSFMM+BPF
Compute Express Link (CXL) memory is not like the ordinary RAM that one might install into a computer; it can come and go at any time and is often not present when the kernel is booting. That complicates the management of this memory. During the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Gregory Price ran a session on the challenges posed by CXL and how they might be addressed.

When people think about the management of CXL memory, they often have tiering in mind, he said. But there are more basic questions that need to be answered before it is possible to just put memory on the PCI bus and expose it to the page allocator. There needs to be a way, he said, to avoid putting some applications into some classes of memory. The mempolicy API is the obvious tool to use, but he is not sure if it is the right one.
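
As a rough sketch of what the mempolicy API offers today (node numbers are illustrative), an administrator can keep a latency-sensitive workload off a CXL-backed node with numactl:

    $ numactl --membind=0 ./latency-sensitive-app     # DRAM node only
    $ numactl --preferred=0 ./batch-job               # prefer DRAM, fall back to other nodes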

[Gregory Price]

A running system, he said, has three broad components: the platform, the kernel, and the workloads. The platform (or its firmware) dictates how the kernel objects representing the system, such as NUMA nodes, memory banks, and memory tiers, are created; these objects are then consumed by the mempolicy machinery, by DAMON, and by the rest of the system. But the question of how and when those objects are created is important. For example, can the contiguous memory allocator (CMA) deal properly with NUMA nodes that are not online at boot time? Getting the answers wrong, he said, will make it hard to get tiering right.

Adam Manzanares suggested that it would be possible to create QEMU configurations that would match the expected hardware and asked if that would be helpful. Price said it would, enabling developers to figure out some basic guidelines governing what works and what does not.

The system BIOS, Price said, tells the kernel where the available memory ends up; the kernel then uses that information to create NUMA nodes describing that memory. A suitably informed kernel could create multiple nodes with different use policies if needed to use the memory effectively. Jonathan Cameron interjected that NUMA nodes are "garbage", at least for this use case; Price agreed that NUMA might not be the right abstraction for the management of CXL memory.

At a minimum, he continued, somebody should create documents describing what the kernel expects. He has done some work in that direction; links can be found in his proposal for this session. Any vendors that go outside those guidelines would be expected, at least, to provide an example of how they think things should work. Dan Williams said that a wider set of tests would also be helpful here; the current tests are mostly there to prevent regressions rather than to ensure that the hardware is behaving as expected.

Price would like to isolate the kernel from CXL memory as much as possible — if kernel allocations land in CXL memory, the performance and reliability of the system as a whole could suffer. There are a lot of tests to ensure that specific tiering mechanisms work well, but not so many when it comes to landing kernel allocations in the right place. There is no easy way to say which kernel memory, if any, should be in CXL memory. Configuring that memory as being in ZONE_MOVABLE would solve that problem (kernel allocations are not movable, so cannot be made from that zone), but it creates other problems. For example, hugetlb memory cannot be in ZONE_MOVABLE either.

John Hubbard suggested creating a more general mechanism; perhaps there is a need for a ZONE_NO_KERNEL_ALLOC. He, too, asked whether NUMA is the right abstraction for managing this memory. Williams suggested just leaving the memory offline when it is not in use, but Hubbard said that would run into trouble when the memory is put back into the offline state after use. If any stray, non-movable allocations have ended up there, it will not be possible to make the memory go offline. Williams said that the moral of that story is to never put memory online if you want to get it back later; Hubbard answered that this is part of the problem. The rules around this memory do not match its real-world use.

Alistair Popple suggested that this memory could be set up as device-private; that, however, would prevent the use of the NUMA API entirely, so processes could not bind to it. Williams said that device-private memory is kept in a driver that would manage how it is presented to user space. Popple said that approach can work, but is a problem for developers who want to use standard memory-management system calls. Hubbard said that he would rather not be stuck with a "niche" approach like device-private memory.

Price continued, noting that the standard in-kernel allocation interface is kmalloc(). He wondered whether developers would be willing to change such a widely used interface to add a new memory abstraction. If they do, though, perhaps adding something for high-bandwidth memory at the same time would make sense. Michal Hocko suggested that it would be better to make a dedicated interface than to overload kmalloc() further.

User-space developers want something similar; there tend to be a number of different workloads running on a system, some of which should be isolated from CXL memory. Page demotion can ignore memory configurations now, with the result that it can force pages into slower memory even if the application has tried to prevent that. He has, he said, even seen stack pages demoted to slow memory, which is "an awful idea". That said, he added, there are some times when it might make sense to demote unused stack memory.

An important objective is "host/guest kernel parity", where both a host system and any guests it is running have full control over memory placement. The guests should be able to do their own tiering, and should be able to limit what ends up in slower memory, just like the host does. He would like to define some KVM interfaces to pass information about memory layout and needs back and forth.

A big problem is the current inability to place 1GB huge pages in ZONE_MOVABLE. Allocating those pages to hand to guests can improve performance, but this limitation keeps them from being allocated in CXL memory. He has heard suggestions to use CMA, but the CXL memory is not online when CMA initializes, so CMA cannot use it.

There are other problems as well, he said at the end. If the placement of a bank of CXL memory is not properly aligned, that memory will be trimmed to the page-block size, wasting some of that memory. There are a lot of decisions that can ripple through the system and create problems later on. Hocko commented that the page-block size is arbitrary, and that the memory-management developers would like to get rid of it. The fact that it is exported via sysfs makes that change harder, though. A lot of these problems, he said, are self-inflicted and will be fixed eventually.

Price concluded with an observation that memory tiers, as currently implemented, do not make much sense. A common CXL use case involves interleaving pages across multiple devices; that forces those devices into a single NUMA node. That creates confusion on multi-socket systems and really needs to be rethought, he said.

Price has posted the slides for this session.

Comments (34 posted)

Managing multiple sources of page-hotness data

By Jonathan Corbet
April 11, 2025

LSFMM+BPF
Knowing how frequently accessed a page of memory is (its "hotness") is a key input to many memory-management heuristics. Jonathan Cameron, in a memory-management track at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, pointed out that the number of sources of that kind of data is growing over time. He wanted to explore the questions of what commonality exists between data from those sources, and whether it makes sense to aggregate them all somehow.

Cameron's own focus is on the CXL "hotness monitoring unit", which can provide detailed data on which pages in a CXL memory bank have been accessed, but there are many other data sources as well. He fears that it may be crazy to try to combine them all, but hopes that it makes sense to do so at least some of the time.

[Jonathan Cameron] Hotness, he said, is a proxy for something that cannot be measured ahead of time: the performance cost of putting data in the wrong place. This data will drive actions like promotion, demotion, swapping, or reclaim. Different sources give different data, which may or may not include a virtual address, the NUMA node that made the access, or when the access happened.

Combining this information from different providers can be challenging. The hotness monitoring unit, for example, does not see accesses that are resolved from the CPU's caches; that results in data that works reasonably well for tiering, but less so for the management of least-recently-used (LRU) lists. CPU-based hardware-supported methods do not include accesses from prefetch operations, making it harder to use that data to balance memory bandwidth use. Every way we have of measuring hotness, he said, is wrong in some way.

Cameron mentioned the aggregation and promotion threads proposed by Bharata Rao, which were discussed earlier in the day. Rao, over the remote link, said that he was working to accumulate information from different sources and to provide an API that various subsystems could use. Cameron said that it was necessary to start with something, and that Rao's proposal seemed as good as any.

Getting back into the meaning of "hotness", Cameron said that it is a guess at the future cost of moving (or not moving) memory. The kernel cannot see into the future, and cannot measure past costs, but it can measure access frequency to some extent. Some measurement methods, though, sample infrequently and can miss accesses. Techniques like access-bit scanning or tracking page faults can capture individual events; there is some commonality between sources of this type. What is needed is an efficient data structure to aggregate the data they produce.

The hotness monitoring unit, though, only provides lists of hot pages, separated from the events that were observed. Data from all these sources goes into some sort of "hot list", he said, that may be used for page promotion. But the size of the hot list is constrained, he said, so the kernel cannot track everything. The entries in that list, as a result, are a sort of random subset of the hot pages, and the list can change frequently. There may be ways to improve this data, perhaps by tracking pages that were previously considered hot, but any solution will be complex and involve a lot of heuristics.

Davidlohr Bueso asked if the aggregation of this data could be centralized in the DAMON subsystem. It has the APIs to do this aggregation and could be a good place to try things out to see what works. Cameron agreed, saying that there are a lot of good features in DAMON, but that its regions abstraction does not work well for some use cases. In the worst case, it can end up with each page in its own region.

Cameron concluded by saying that it may be a while yet before the community knows how to unify these data sources; Bueso answered that it would be better to do something now to learn what works. Gregory Price asked for an asynchronous bulk page migrator that would run in its own thread. Making hotness data available in an API would help to make that happen, he said, and to evaluate the usefulness of each of those sources.

Comments (2 posted)

The state of the memory-management development process, 2025 edition

By Jonathan Corbet
April 14, 2025

LSFMM+BPF
Andrew Morton, the lead maintainer for the kernel's memory-management subsystem, tends to be quiet during the Linux Storage, Filesystem, Memory-Management, and BPF Summit, preferring to let the developers work things out on their own. That changes, though, when he leads the traditional development-process session in the memory-management track. At the 2025 gathering, this discussion covered a number of ways in which the process could be improved, but did not unearth any significant problems.

Morton started with the "usual whinges", primarily that many memory-management changes take too long to finalize. He accepted responsibility for much of that problem, saying that he needs to send more emails to move things along.

[Andrew Morton] He has heard complaints that he is overly eager to merge things, with the result that buggy code is fed into linux-next. Since memory-management bugs affect everybody, those complaints can come from far outside this subsystem. He has responded by being slower in accepting changes from submitters if he does not know them well, sending them a private note saying that he is waiting for reviews. But some unready code is still getting through, he said; he will seek to be more skeptical going forward. That said, problems are fairly rare; it is really a matter of fine-tuning.

Dan Williams (Morton said) has suggested that the mm-unstable branch, which holds the rawer code, not be fed into linux-next. The mm-stable branch, instead, is more fully cooked and almost never requires fixing. But, if mm-unstable is excluded from linux-next, the patches contained within it will not get wider testing, and that would eventually cause mm-stable to be less fully cooked as well. So he plans to create another branch, called mm-next, as an intermediate step. Then, mm-unstable (which could perhaps be renamed "mm-crazy") would be kept private. The problem with this approach is that it will add latency to the process if patches have to work their way through all three branches; that means that nothing posted after -rc5 would be ready for the merge window. Matthew Wilcox indicated that this result would be just fine with him.

Morton said that some changes might go directly to mm-next, while mm-unstable will be the first stop for larger or unreviewed work. It will be his judgment call. One nice effect will be that he can keep piling changes into mm-unstable without breaking any of the linux-next rules — and also take them out. The plan had always been that code could be removed from mm-unstable if it proved to have problems, but that has not worked well in practice; there is a perception that, once Morton has accepted a patch, it is well on its way to the mainline.

Lorenzo Stoakes said that the biggest problem for him is the uncertainty about what will happen with patches as they enter the process. He would like to see more rules laid down; perhaps patches would require a suitable Reviewed-by tag before they could move into mm-next, for example.

A separate problem Morton raised is that nobody is reviewing David Hildenbrand's patches; there are just too many of them. (My own feeling is that Hildenbrand tends to take on some of the hardest problems — example — and that many people do not feel smart enough to review the results). Michal Hocko said that Morton's expectations regarding who will review specific patches are not always clear. There need to be developers with responsibility for all areas in the subsystem, not a single "superman" who handles everything. Morton should be asking for people to stand up and take responsibility, Hocko said.

In general, Morton said, if anybody wants him to do something specific, they should just ask him. He will often get requests to hold off on merging a patch series for a while, for example, and he generally does so. Williams wondered why Morton should be finding reviewers for other developers' patches, though. Everybody should be asking themselves why they aren't reviewing some developers' work.

Morton suggested that all developers should also be scanning the linux-mm list for things needing attention. Hocko said that he had been doing that, but he just doesn't have time anymore; keeping up with the list is not a sustainable activity.

Wilcox asked about the problem of "zombie maintainers". There are, he said, six developers listed as maintainers of the slab allocator, three of whom have not been seen for 20 years. (That problem is now being addressed). He also complained about contributors who "produce churn", saying that Morton is too nice to such people. Liam Howlett said that he will normally not accept cleanup patches except as part of a larger project. Those cleanups can often break subtle things, he said.

As the session wound down, Morton said that getting replacement patches from developers makes his life harder. He doesn't really know which reviews apply at that point, or which comments have been addressed. He would rather get small delta patches on top of the previous work, especially once that work has landed in mm-unstable.

He closed the session by saying "see you next year!"

Comments (2 posted)

Automatic tuning for weighted interleaving

By Jonathan Corbet
April 15, 2025

LSFMM+BPF
It is common, on NUMA systems, to try to allocate all memory on the local node, since it will be the fastest. That is not the only possible policy, though; another is weighted interleaving, which seeks to distribute allocations across memory controllers to maximize the bandwidth utilization on each. Configuring such policies can be challenging, though. At the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Joshua Hahn ran a session in the memory-management track about how that configuration might be automated.

The core purpose of memory-allocation policy, he began, is to determine which node any given virtual memory area (VMA) should allocate pages from. One possible policy is simple round-robin interleaving to distribute allocations evenly across the system. Weighted interleaving modifies the round-robin approach to allocate more heavily from some nodes than others. Properly distributing allocations can maximize the use of the available memory bandwidth, improving performance overall.

[Joshua Hahn] The question Hahn had is: how can this interleaving be made to work out of the box? The system, he said, should provide good defaults for the interleaving weights. He had a couple of heuristics to help with the setting of those defaults. The weight ratios, intuitively, should be similar to the bandwidth ratios for the controllers involved. Bandwidth is the ultimate limit on the performance of the system, he said; it is more important than the latency of any given memory access. Weights should also be small numbers; the weight of each node is, in the end, the number of pages to allocate from that node before moving on, so smaller weights will lead to faster distribution of allocations.
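
As a purely illustrative example of that heuristic (not the algorithm from Hahn's patches, and with made-up bandwidth figures), dividing the measured per-node bandwidths by their greatest common divisor yields small weights in the same ratio:

    // Illustrative only: reduce per-node bandwidths by their greatest
    // common divisor to get small, proportional interleave weights.
    static unsigned int gcd(unsigned int a, unsigned int b)
    {
        while (b) {
            unsigned int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static void set_default_weights(void)
    {
        unsigned int bw[2] = { 256, 64 };    // GB/s: DRAM node, CXL node
        unsigned int g = gcd(bw[0], bw[1]);  // 64
        unsigned int w_dram = bw[0] / g;     // weight 4
        unsigned int w_cxl = bw[1] / g;      // weight 1
        // ... apply the weights to the interleave policy ...
    }

With those weights, the allocator would place four pages on the DRAM node for each page placed on the CXL node before starting over.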

A problem arises, though, when new memory is added to the system; the kernel has to respond and recalculate all of the weights. How that should be done is not entirely clear, especially if the administrator has adjusted the defaults. The administrator should be able to tell the system what to do in that case, he said, with the available options being to recalculate all of the weights from the beginning, or to just set the weight for the new memory to one.

Reprising a theme from an earlier session, Hahn brought up the sort of complications that hardware can bring. Given a system with two host bridges, each of which has two CXL nodes, how many NUMA nodes should the system have? The hardware can present the available resources in a few different ways, with effects that show up in the configuration problem at the kernel level.

Ideally, of course, the tuning of the weights should be dynamic, based on some heuristic, but Hahn said that he is not entirely convinced that bandwidth use is the right metric to optimize for. He wondered if the kernel should be doing the tuning, or whether it should be delegated to user space, which might have more information. Liam Howlett said that the responsibility for this tuning definitely belongs in user space; the kernel cannot know what the user wants. Gregory Price (who did the original weighted-interleaving work) pointed out that there is currently no interface that allows one process to adjust another's weights; that would be needed for a user-space solution.

Michal Hocko said that problems like this show that the kernel's NUMA interfaces are not addressing current needs. That problem needs to be addressed; it presents a good challenge that can lead to the creation of better interfaces. Jonathan Cameron said that user space currently does not have enough data to solve this problem.

Price said that users may want to interleave a given VMA from the moment it is instantiated, and wondered whether the NUMA abstraction is able to handle that. Hocko answered in the negative, saying that the NUMA interface was designed for static hardware, and falls short even on current systems. The kernel's memory-policy interface is constraining; it is really time to create a new NUMA API, hopefully one that will handle CXL as well.

Howlett said that kernel developers were never able to get the out-of-memory killer right, so now that problem is usually handled in user space. He was not convinced that the kernel community would be any more successful with interleaving policy. Hocko responded that user-space out-of-memory killers did not work well either until the pressure-stall information interface was added; before then, nobody had thought that it would be the necessary feature that would enable a solution to that problem.

The session ran out of time; it ended with a general consensus that a better interface for controlling memory policies is needed. Now all that is needed is for somebody to actually do that work.

Comments (2 posted)

Improvements for the contiguous memory allocator

By Jonathan Corbet
April 16, 2025

LSFMM+BPF
As a system runs, its memory becomes fragmented; it does not take long before the allocation of large, physically contiguous memory ranges becomes difficult or impossible. The contiguous memory allocator (CMA) is a kernel subsystem that attempts to address this problem, but it has never worked as well as some would like. Two sessions in the memory-management track at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit looked at how CMA can be improved; the first looked at providing guaranteed allocations, while the second addressed some inefficiencies in CMA.

In current kernels, CMA works by reserving a large area of physically contiguous memory at boot time; it uses that area to satisfy requests for large allocations later on. In the absence of such allocations, the CMA reservation can be allocated for other purposes, but only movable allocations can be placed there. The intent is that, should a need for a large buffer arise, those other allocations can be migrated elsewhere in memory, freeing a physically contiguous range. The hoped-for result is that large allocations are always possible, but that the memory used for those allocations is not wasted when it is not needed.

Guaranteed CMA

Suren Baghdasaryan ran a session to discuss the guaranteed contiguous memory allocator patches that he had posted shortly before the conference. This series, he said, was mostly based on older work from Minchan Kim and SeongJae Park, along with the "cleancache" abstraction first proposed by Dan Magenheimer in 2012.

[Suren Baghdasaryan] The kernel has long provided two options for physically contiguous allocations, he said. The first is CMA, which has the advantage that the reserved memory can be used for movable allocations when it is not needed. The downside of CMA is that, sometimes, as was discussed two days earlier, those "movable" allocations prove not to be movable after all; that can cause CMA allocations to fail. Even when allocations succeed, the time required is nondeterministic, since how long it takes to move pages out of the way can vary widely. Both of these problems can be serious. The Android face-unlock application needs to be able to allocate a buffer from CMA; since a user is waiting to access their device, the application cannot tolerate slow allocations or the possibility of not getting its memory at all.

The alternative is carve-outs — setting aside memory at boot but not allowing it to be used for any other purpose. Carve-outs are guaranteed to work quickly, but they also waste memory to provide a buffer that may almost never be used.

To create a better solution, Baghdasaryan set a rule that the CMA area can only be used to cache recoverable content that can be dropped immediately on demand. That memory should otherwise be inaccessible. It can be used to cache useful data, but not to hold anything that might impede its immediate use when a large buffer is needed.

The tool he used to implement his solution is cleancache. As with the original cleancache design, guaranteed CMA will store clean, file-backed pages that can be dropped on demand. It becomes a sort of extension to the page cache that can avoid the need to perform I/O when reclaimed pages are faulted back in. In the new implementation, though, the invasive filesystem hooks required by the original are gone; instead, there are some simple hooks in the memory-management subsystem's fault, eviction, and invalidation paths. It has a simple API that allows pages to be donated to the cache and gotten back quickly when they are needed.
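
The names below are invented for illustration and are not the API from Baghdasaryan's patches; conceptually, though, the interface described amounts to a store, restore, and invalidate operation keyed by file and offset:

    // Hypothetical sketch only; the real guaranteed-CMA/cleancache API
    // may use different names and signatures.
    int  cleancache_store_folio(struct inode *inode, struct folio *folio);
    bool cleancache_restore_folio(struct inode *inode, pgoff_t index,
                                  struct folio *folio);
    void cleancache_invalidate_folio(struct inode *inode, pgoff_t index);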

Guaranteed CMA is a backend for cleancache; like CMA, it reserves its region at boot time. The existing devicetree entries used to configure CMA now work for the new version as well. The reserved region is donated to the cleancache, where it can be used until needed. The result is not perfect utilization of the reserved memory; in Baghdasaryan's tests, about 40% of that memory was holding cached data at any given time. The hit rate for queries to the cache was 42%, though, indicating that a lot of I/O had been avoided.

David Hildenbrand asked whether the biggest problem with CMA as it exists now is the latency on allocation requests, or the possibility of allocation failures; Baghdasaryan said that both were big problems for the face-unlock case. Hildenbrand asked what the cause of the failures was, surmising that it might be pages that have been pinned by other subsystems. Baghdasaryan said that might be the source of the problem; direct I/O or writeback might also play into it. Hildenbrand said it would be better to make existing CMA more reliable if possible; Baghdasaryan's approach is good, he said, but may not work well on systems where the workload is dominated by anonymous pages (which cannot be dropped, and so cannot be stored in the cache).

After acknowledging that suggestion, Baghdasaryan concluded with a few loose ends. There may be other uses for guaranteed CMA, which could perhaps manage memory that is reserved for crash-dump kernels, for example. He wondered about security and, specifically, when pages stored in the cache should be zeroed. A bad kernel module could snoop around in the cache, he said, but bad modules can already do that now. He also wondered if guaranteed CMA needs some sort of NUMA awareness; Michal Hocko suggested keeping the implementation simple until a need for more complexity makes itself clear.

As the session closed, Brendan Jackman asked whether this feature could be made available to user space. Baghdasaryan said that DMA buffers could perhaps live in the CMA region, but Jackman was interested in more generalized user-space caching. Hildenbrand suggested that regions mapped with MAP_DROPPABLE (which have similar "can be dropped at any time" semantics) could perhaps be stored in the CMA area.

Optimizing CMA layout

[Juan Yescas] Later that day, Juan Yescas ran a session on a problem he has been experiencing with CMA. In short, on systems with large page sizes, huge areas have to be dedicated to CMA to be able to use it. Currently, the CMA region must be aligned to the kernel's page-block size which, in turn, is driven by a number of parameters, including the system's base-page size. The size of the region must also be a multiple of the page-block size. On systems with 64KB pages and transparent huge pages enabled, the page-block size can be 512MB (a PMD-level huge page on such a system maps 8,192 entries of 64KB each); that becomes the effective minimum size of the CMA region. If only a fraction of that space is needed, the rest is set aside needlessly.

Yescas was wondering why this alignment requirement exists. Hildenbrand pointed out that CMA has to work on page blocks to be able to migrate pages out when they are in the way of an allocation.

Yescas had some solutions to the problem that he has been exploring. First would be to sum up all of the CMA requirements for the system, then allocate a single CMA region to hold all of them. That does not help, though, on systems with a single CMA user. An alternative is to set the ARCH_FORCE_MAX_ORDER configuration parameter to a smaller value like seven. That will result in smaller page blocks, minimizing the memory waste, but it also makes the allocation of huge pages harder. Finally, one could have CMA just allocate its memory from the kernel's buddy allocator when the reservation size is relatively small.

Vlastimil Babka said that the CMA reservation is not really wasted, even if it exceeds the required size, because movable allocations can be placed there. Hildenbrand said that, in the long term, work should be done to ensure that page blocks are reasonable in size; there could be some sort of "super blocks" abstraction for larger groupings if needed. That is not an option for now, though. He suggested that Yescas could use guaranteed CMA, which does not use migration and, thus, does not need page-block alignment.

On the other hand, Hildenbrand said at the end of the session, setting ARCH_FORCE_MAX_ORDER to a smaller value is not a good idea. Instead, it would be better to find a way to let the page-block size be smaller, as is apparently done with the PowerPC architecture now. That, he concluded, might be the cleanest short-term solution to the problem.

Comments (2 posted)

Inlining kfuncs into BPF programs

By Daroc Alden
April 11, 2025

LSFMM+BPF

Eduard Zingerman presented a daring proposal that "makes sense if you think about it a bit" at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. He wants to inline performance-sensitive kernel functions into the BPF programs that call them. His prototype does not yet address all of the design problems inherent in that idea, but it did spark a lengthy discussion about the feasibility of his proposal.

The justification for inlining, as always, is optimization. The BPF verifier's primary role is to analyze the safety of BPF programs. But it does also use the information learned during that analysis to eliminate unnecessary bounds-checks at run time. The same information could potentially eliminate conditional branches in kfuncs (kernel functions callable from BPF) that are used by frequently-invoked BPF programs.

Zingerman first proposed kfunc inlining in November 2024. In his initial request for comments, he focused on bpf_dynptr_slice() and showed that inlining it could eliminate jumps from a switch statement and an if statement, providing a 1.53x speedup on his synthetic benchmark.
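
From the BPF program's point of view, bpf_dynptr_slice() is an ordinary kfunc: it is declared with the usual __ksym convention and called like any other function. The sketch below omits the dynptr setup and error handling:

    // Sketch only; dptr is assumed to have been initialized elsewhere.
    extern void *bpf_dynptr_slice(const struct bpf_dynptr *p, __u32 offset,
                                  void *buffer__opt, __u32 buffer__szk) __ksym;

    char buf[8];
    void *data = bpf_dynptr_slice(&dptr, 0, buf, sizeof(buf));
    if (data) {
        // up to 8 bytes are readable at data
    }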

bpf_dynptr_slice() is 40 lines of code with 10 conditionals spread across the function, Zingerman said, so inlining it by hand was "a bit tedious". An automated solution that inlines specific kfuncs that benefit from the transformation could potentially be useful.

To do that, he proposed compiling those functions to BPF and embedding the resulting BPF code into the kernel binary. The verifier could inline the code into loaded BPF programs during verification. The problem with that approach is that, currently, BPF is not architecture specific — the same programs will run unmodified on any architecture that supports BPF — but kfuncs can have architecture-specific code. In fact, many kernel headers have architecture-specific functions that need to be taken into account.

One potential workaround could be to compile the kfuncs for the host architecture, but then use some LLVM-specific tooling to retarget the resulting binary to BPF. Another solution could be to only support inlining functions (such as bpf_dynptr_slice()) that don't contain architecture-specific code. Alternatively, any architecture-specific code could be pulled out into a separate kernel function callable from BPF, and the remaining code could be compiled directly. Zingerman didn't really seem happy with any of those approaches, though.

José Marchesi questioned why it was necessary to build for a particular architecture and then retarget the result to BPF in the first place; after all, any architecture-specific assembly code could not be retargeted anyway. Zingerman explained that, first, the build would fail because the kernel doesn't have a "BPF" architecture in its build system and, second, some data structures vary in layout depending on the architecture and need to match.

Andrii Nakryiko pointed out that 32-bit architectures need 32-bit pointers, at least, even if more complex structures could be made to work somehow. Marchesi conceded the point, but suggested that the compiler could pretend to be the host architecture to the preprocessor, and still emit BPF code. Alexei Starovoitov explained that there were parts of the build system that would need to be adapted too. Zingerman summarized the situation by saying that they could try it, but there would be obstacles.

I suggested that having multiple architectures (BPF+x86, BPF+risc, etc.) might help. Zingerman agreed that, in the future, it might come to that. But, for his initial proposal, the workaround with LLVM works well enough to prove out the concept.

He was more certain about the right approach for embedding any generated BPF into the kernel: adding them to a data section and using the ELF symbol table to find them when needed. The verifier will need to handle applying relocations inside the kfuncs' bodies, which will allow them to call other functions in the kernel transparently.

The inlining itself is also fairly simple: before verifying the user's program, the verifier should make a copy of each inlinable kfunc for each call in the program. When it reaches a call to the function during verification, it sets up a special verifier state to pass information about the arguments and BPF stack into the code that verifies the instance of the function. Then it verifies the body of the kfunc, ideally using its contextual information to do dead-code elimination. When the BPF code is compiled, the body of the kfunc can be directly included by the just-in-time compiler.

Having a separate instance of the kfunc for each call is not strictly necessary, Zingerman said, but doing it that way keeps the impact on the verifier minimal. To share verification of one kfunc body between call sites, the verifier would need to track additional information about the logical program stack and actual program stack that it does not currently handle. That representation would be complex and harder to reason about, he explained, which is why he favored the more isolated approach.

All of this discussion of the mechanism behind inlining was predicated on being able to choose which functions would benefit from inlining, however. Currently, Zingerman is considering focusing on functions for manipulating dynptrs and some iterator functions, although he's open to expanding the set of inlinable kernel functions over time.

Nakryiko asked about how Zingerman intended to check the types of arguments passed to an inlined kfunc during verification. For the initial version, he didn't worry about that, Zingerman said. He just assumed that the kernel function was compiled correctly. But in the future, the build process could embed BTF debugging information alongside the compiled kfunc and it could be checked that way.

Arnaldo Carvalho de Melo wanted to know about conditional inlining — that is, inlining only the calls to kfuncs that are most used. Starovoitov replied that the BPF subsystem does not currently have any kind of profile-guided optimization. Zingerman said that it was another thing to be explored, but that it wasn't part of his initial proposal. Starovoitov suggested tracking which branches are taken at run time, and reoptimizing the BPF program on the fly. "We aren't doing that, but it would be cool," he said.

Nakryiko also wanted to know why this kind of inlining needed to be done automatically. Zingerman said that making it automatic ensures that inlining doesn't introduce any mistakes, and can be done for more complex functions. Nakryiko suggested that it doesn't really make sense to inline something that's already complicated. Zingerman agreed, saying that was one reason he wanted to lead a session at the summit — to see which other functions, beyond the few he had focused on, people were interested in inlining.

Daniel Borkmann, one of the organizers for the BPF track, suggested that it would be interesting to evaluate the impact of inlining some functions for handling BPF maps. But then he advised people that the session had run out of time, and brought things to a close there.

Comments (none posted)

In search of a stable BPF verifier

By Daroc Alden
April 14, 2025

LSFMM+BPF

BPF is, famously, not part of the kernel's promises of user-space stability. New kernels can and do break existing BPF programs; the BPF developers try to fix unintentional regressions as they happen, but the whole thing can be something of a bumpy ride for users trying to deploy BPF programs across multiple kernel versions. Shung-Hsi Yu and Daniel Xu had two different approaches to fixing the problem that they presented at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit.

BPF in stable kernels

Yu presented remotely about the problem of running BPF programs on stable kernels. He began by recapping the process by which a patch ends up in a stable kernel: by including a "CC: stable@vger.kernel.org" tag in the patch, by being picked up by the AUTOSEL patch-selection process, or when the developer explicitly asks the stable team to include it. A patch that has been identified for inclusion in a stable kernel by one of those means needs to meet three additional criteria: the patch is present in mainline, the patch applies cleanly to the stable tree, and the stable tree builds after applying it.
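
The first of those routes is just a line among the patch's tags; the trailing comment indicating which trees to target is optional:

    Cc: stable@vger.kernel.org # 6.12.x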

That whole process encodes some hidden assumptions, Yu said. For one, it assumes that a patch will work as-is when applied to an older code base, or at least that a patch which doesn't will be caught by the stable-release testers. But "the elephant in the room is that it doesn't really work." Patches taken from the current code base and applied to an older code base are not guaranteed to work. Patches sent to stable get less review, and stable-kernel testers do not systematically exercise BPF functionality.

Yu proposed adding the stable trees to the BPF continuous-integration testing, starting with the 6.12 long-term-support kernel and continuing from there. That prospect isn't as simple as just running the tests on a new branch, however; if an error is found, where should it be sent? Yu listed three possibilities: the stable mailing list, a group of volunteers, or the BPF mailing list. Of the three options, he would most prefer to have a dedicated group of volunteers, but that may not be possible — it depends on whether anyone is willing to step up.

Even if the stable tree is added to BPF's automated tests, the way the tests themselves are updated poses a problem. Usually, a fix and a test verifying the fix are submitted as part of the same patch set. If a fix is backported to a stable kernel, though, its accompanying test might not go back along with it. Yu spoke with Greg Kroah-Hartman, who agreed that he would accept BPF selftests into the stable trees, but he needs someone to identify them and ask for them to be backported.

Finally, Yu said, not all fixes can be backported. That's fine — stable trees miss out on a lot of fixes in the name of stability — but it does pose a problem for security-relevant fixes. Changes to the BPF verifier often qualify as security-relevant.

None of these problems are really BPF specific, Yu emphasized. Maybe there is some low-hanging fruit that can make the experience of using BPF on stable kernels less painful, though, and maybe the BPF developers can find someone willing to step up and take on that work.

The assembled BPF developers pointed out some additional challenges that Yu might have overlooked, such as more load on the test machines and on the maintainers, but seemed generally in agreement that something needed to be done.

Modularizing the BPF verifier

Xu had a different vision of how to enable stability for users of BPF. Xu's long-term goal is to ship BPF changes more quickly; the BPF subsystem changes quickly, but long-term support kernels stick around for a long time. The effort of trying to support BPF programs across a number of kernel versions in the face of that reality makes BPF painful to use, he said. His plan is to try to make the BPF subsystem into a kernel module that can be patched, shipped, and updated independently of the main kernel.

That's a long and difficult project, however, so he would like to start with just the verifier. Separating the verifier out into a kernel module is a good place to start, Xu said, because it is already architecturally a pure function — that is, a function that transforms an input into an output without otherwise affecting the state of the system. The verifier takes in information about a BPF program, and outputs a judgment on whether it is safe, plus some information used by the just-in-time compiler — it doesn't call (many) other kernel functions. Also, changes to the verifier are one of the things that causes the most frustration for users.

Xu put together a proof of concept, which he described as a pretty simple change, except for some complexity around supporting out-of-tree builds. From his testing, the modularized verifier seems to work. Currently, his proof of concept only encapsulates a single file: verifier.c. He still thinks that's a useful starting point, however.

[Daniel Xu]

Xu examined the commits between kernel versions 6.3 and 6.13 to get some statistics. Most commits that touch verifier.c don't change the rest of the kernel. These commits could, theoretically, be easily ported to a standalone stable verifier. Of the commits that affected only verifier.c, there were at least 73 bug fixes, based on the "Fixes" tags in the commits.

So modularizing the verifier will, at least, allow users to receive some bug fixes on an otherwise unchanging kernel. The next question Xu attempted to address with his examination was: are there any other files that could be moved into his kernel module to further decrease the reliance on the rest of the kernel? To answer that question, he looked at files that were edited in the same commits as the verifier.

The most commonly edited was bpf_verifier.h, which is "conceptually private", but in actuality is referenced by a few other files. After that, there was a long tail of other files that were less commonly modified. Xu admitted that this analysis wasn't as rigorous as it could be — for one thing, it should really consider patch sets as well as commits, since applying a single commit from a patch set is occasionally problematic — but he thought that this was still a useful exercise.

For the next steps, Xu wants to move the code to build the modularized verifier out of tree and add continuous-integration testing across many different kernel versions. From his explanation, I believe his intended workflow is for the BPF developers to continue maintaining the BPF verifier in-tree, with the out-of-tree version functioning similarly to stable kernels: as a more stable alternative that still receives cherry-picked bug fixes.

If that goes well, and the modularized verifier is actually helpful, he plans to look at implementing a well-defined interface to make it easier to maintain the verifier out-of-tree. He also intends to modularize more components of the BPF subsystem, work with distributions on distributing appropriate versions of the verifier, and eventually support running the verifier in user space as part of compilers targeting BPF.

Someone in the audience asked why, if Xu wanted to make a version of the verifier that could run across multiple kernels, he didn't use BPF's struct_ops mechanism. Xu replied that it was a cool idea, but that the verifier probably executes a lot more than a million instructions (the current limit for BPF programs). Daniel Borkmann wanted to know whether the verifier was really what was giving users who need to support multiple kernel versions problems — isn't the changing set of kernel functions available to be called by BPF programs a bigger problem? Xu didn't think so, pointing out that functions are either there or they're not, so dealing with missing functions is relatively straightforward. But currently he is dealing with "a bunch" of different verifiers in production, and would really like to get it down to just two or three.

Eduard Zingerman asked whether Xu expected to be able to reduce the entanglement between the existing verifier code and the rest of the kernel further; Xu thought that it was probably possible, and shared a list of kernel symbols that the verifier currently refers to for the assembled developers to ponder. Another person wanted to know whether a modularized verifier would allow for more thorough testing. Xu thought so. Fuzz testing, especially, would be easier without having to run the whole kernel. BPF development would also be nicer if tweaking the verifier did not require rebuilding the whole kernel.

Alexei Starovoitov asked whether Xu had thought about how to handle changes to the verifier that actually change the memory layout of the verifier state. Xu replied that it would need to be addressed on a case-by-case basis. Other developers expressed skepticism that a modular verifier would actually be helpful to users, given that users often need to wait on new kernel functions to become available to BPF programs. David Faust, on the other hand, was enthusiastic about the idea of being able to run the verifier code in user space, since he has repeatedly asked for a way to let GCC verify the BPF code it generates.

Whether the modular verifier will be adopted remains to be seen — and many BPF developers are skeptical — but Xu's working proof of concept suggests that it may not be as daunting a task as it first appears.

Comments (12 posted)

Taking BPF programs beyond one-million instructions

By Daroc Alden
April 16, 2025

LSFMM+BPF

The BPF verifier is not magic; it cannot solve the halting problem. Therefore, it has to err on the side of assuming that a program will run too long if it cannot prove that the program will not. The ultimate check on the size of a BPF program is the one-million-instruction limit — the verifier will refuse to process more than one million instructions, no matter what a BPF program does. Alexei Starovoitov gave a talk at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit about that limit, why correctly written BPF programs shouldn't hit it, and how to make the user experience of large BPF programs better in the future.

One million was "just a number that felt big in 2019", Starovoitov said. Now, in 2025, programs often hit this limit. This prompts occasional requests for the BPF developers to raise the limit, but most programs won't hit the limit if they're written to take advantage of modern BPF features. Starovoitov's most important advice to users struggling with the limit was to remove unnecessary inlining and loop unrolling.

[Alexei Starovoitov]

Guidance still circulating on the internet advises annotating BPF functions with always_inline. That hasn't been necessary — or helpful — since 2017. The same is true of using #pragma unroll on loops; the BPF verifier has supported bounded loops since 2019. Removing those two things is the most common way that Starovoitov has been able to help users shrink the sizes of their BPF programs. It may not always help, but it definitely lets the verifier process the program more quickly, he said.

If those don't help, there are a few other changes that users can make with more care. One thing to consider is the use of global functions in place of static functions — a distinction introduced to BPF in kernel version 5.6 in order to reduce time spent doing verification and allow for more flexible usage of BPF programs. Static functions are functions marked with the static keyword in C, whereas global functions are all other functions. Global functions are verified independently of the context in which they are called, whereas static functions take that context into account. When the verifier is considering static functions, it handles every call site separately. This is good and bad — it means that complex type and lifetime requirements are easily verified, but it also means that every call to a static function counts the entire body of the function against the one-million-instruction limit. If a static function is called in a loop, this can quickly eat up a program's instruction budget.

    // Verified up to 100 times.
    for (int i = 0; i < 100; i ++) static_function(i);

One member of the audience asked whether Starovoitov's example would really result in the static function being verified multiple times, or whether the path pruner would prune later verifications as being redundant. Starovoitov explained the pruner would only kick in if nothing depending on the loop index were passed to the static function.

Global functions, on the other hand, are only verified once — meaning the cost of verifying the body is only paid once, and each further call only counts as one instruction as far as the one-million-instruction limit is concerned. Because global functions are verified without having any existing context on what registers contain what types of values, requirements for any arguments to the function need to be specified explicitly in the types, and the verifier cannot take per-call-site information on bounds into account. That said, if the argument to a function is something like a pointer to a BPF arena, the verifier doesn't need to do bounds checks on it anyway, so that's not as much of a problem. There are also various annotations, such as __arg_nullable or __arg_ctx, that the user can add to help the verifier understand the arguments to a global function.
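
Following the conventions of the earlier example (function() and arr are placeholders, and this sketch is not taken from Starovoitov's slides), a global-function version of the same loop might look like this:

    // Global (non-static) function: verified once, so each of the 100
    // calls below counts as a single instruction against the limit.
    // __noinline keeps the compiler from inlining it into the caller;
    // the bounds check must live here, since the verifier has no
    // call-site context for a global function.
    __noinline int global_function(int i)
    {
        if (i < 0 || i >= 100)
            return 0;
        function(arr[i]);
        return 0;
    }

    for (int i = 0; i < 100; i++) global_function(i);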

Another trick that is sometimes useful is changing how loops are written. Bounded loops are similar to static functions, in that they "just work", Starovoitov said, but they require the verifier to process the body multiple times. Loops with large numbers of iterations can easily run up against the limit. BPF does have several alternatives available, however. Depending on the situation, either iterators or run-time checks are an option:

    bpf_for (i, 0, 100) // open-coded iterator
    // See below for why this loop uses "zero" in place of "0"
    for (int i = zero; i < 100 && can_loop; i++) // Loop with run-time check

The primary difference between bounded loops and the other options is how the verifier reacts to determining that two iterations of the loop lead to the same symbolic state. In a bounded loop, if an iteration ends in the same verifier state that it started in (i.e., without any tracked variables getting a different value), the verifier rejects it as a potentially infinite loop. In a loop using an iterator or the can_loop macro (the eventual result of the BPF may_goto work), the verifier knows the runtime will halt a loop that continues too long, so it doesn't need to reject the loop.

Starovoitov then went into a long list of correct and incorrect ways to write BPF loops. The logic behind why each option was correct or incorrect required a certain amount of background on the BPF verifier to understand, and people did not find the consequences of each kind of loop obvious.

    // Ok, only processes loop body once:
    bpf_for (i, 0, 100) { function(arr[i]); }

    // Bad, verifier can't prove that the access is in bounds:
    int i = 0;
    bpf_repeat(100) { function(arr[i]); i++; }

The bpf_repeat() macro, an earlier predecessor of bpf_for(), doesn't give the verifier enough information to know exactly how many times the loop will be repeated. So the verifier ends up concluding that i could potentially be any number, and might therefore be out of bounds for arr. Adding a check that i is less than 100 helps a little bit, but the verifier still doesn't know how many loop iterations will happen, so it tries to check every value of i to make sure the bounds check is right:

    // Bad. Accesses are in-bounds, but it hits one million instructions
    // trying to be sure of that:
    int i = 0;
    bpf_repeat(100) {
        if (i < 100) function(arr[i]);
        i++;
    }

Adding an explicit break lets the verifier break out of the loop after a limited number of iterations, but it still has to process the body of the loop 100 times to reach that point.

    // Ok, but the loop body is processed many times:
    int i = 0;
    bpf_repeat(100) {
        if (i < 100) function(arr[i]); else break;
        i++;
    }

A workaround for that is to initialize the loop variable from a global variable, instead of from a constant. The trick is that because the verifier assumes a global variable can be changed by other parts of the code, initializing a loop variable from a global variable makes the verifier track the value of the loop variable as being an unknown value in a range, instead of a known constant. In turn, that means that after incrementing the loop variable, the verifier notices that its knowledge about the variable hasn't changed: an unknown number plus one is still an unknown number. Therefore, the verifier determines that this would be an infinite loop, if it were not for the use of the bpf_repeat() macro, which lets the BPF runtime terminate the loop if it continues too long. The end result is that the verifier only needs to examine the loop body once, instead of examining it with different known values of i:

    // Ok, loop body only processed once, because the global variable
    // changes how the verifier processes bounds in the loop:
    int zero = 0; // Global variable

    int i = zero;
    bpf_repeat(100) {
        if (i < 100) function(arr[i]); else break;
        i++;
    }

These behaviors are "not obvious to anyone, even me", Starovoitov reassured everyone. Eduard Zingerman said that Starovoitov's examples hadn't even discussed widening, to which Starovoitov replied: "I will get into it. You're doing advanced math." The pitfalls of can_loop were no less esoteric:

    // Ok, but processes loop body multiple times because the verifier
    // sees each loop as having a different state:
    for (int i = 0; i < 100 && can_loop; i++) {
        function(arr[i]);
    }

Even adopting the same global-variable-based workaround doesn't always help, because compilers have some optimizations that apply to basic C for loops that don't apply to the bpf_for() macro:

    // Bad. Could work or could fail to prove the bounds check depending
    // on how the compiler handles the loop:
    int zero = 0; // Global variable

    for (int i = zero; i < 100 && can_loop; i++) {
        function(arr[i]);
    }

The solution is to prevent the compiler from applying the problematic optimizations, at least for now. Eventually, Starovoitov hopes to expand the verifier's ability to understand transformed loop variables to render this unnecessary.

    // Ok, only processes loop body once:
    int zero = 0; // Global variable

    for (int i = zero; i < 100 && can_loop; i++) {
        // Tell the compiler not to assume anything about i, so the loop
        // is left in a format the verifier can handle.
        barrier_var(i);
        function(arr[i]);
    }

"I know this is overwhelming, and it's bad", Starovoitov said. The point is that it's hard for experts too, he continued. The situation with arenas is a little better. Because the verifier doesn't need to do bounds checks on arena pointers for safety, its state-equivalence-checking logic can generalize more inside loops. Combining the global-variable trick, can_loop, and arena pointers results in the best loop that he shared, which can always be verified efficiently:

    // Verified in one pass, effectively lets the user write any loop:
    int __arena arr[100];

    for (int i = zero; i < 100 && can_loop; i++) {
        function(arr[i]);
    }

So users have ways to write loops without worrying about the one-million-instruction limit — but there are still clearly parts of the BPF user experience that can be improved. Ideally, users would not need to care about the precise details of how they write their loops.

One approach that Starovoitov tried was to attempt to widen loop variables (that is, generalize the verifier's state from "i = 0" to "i = [some unknown number within a given range]") when checking loops with can_loop. That would have allowed the verifier to recognize when two passes through the loop are essentially the same (differing only by the loop variable's initial value), and remove the need for the global-variable trick. Unfortunately, implementing that idea turned out to be hard. "I bashed my head against the wall", Starovoitov said, before giving up on it.

Another idea was to use bounds checks inside of loops to split the verifier state into two cases, which could potentially permit the same kind of analysis. That turns out not to work well because of a compiler optimization called scalar evolution, which turns i++ loops into i += N loops, which have discontiguous ranges of possible values for the loop variables. John Fastabend had a patch in 2017 that was supposed to cope with scalar evolution, Starovoitov said, so that idea is still viable. That's the path that he currently intends to work on.

Making things better

There is already a lot of specialized code for handling loops in the verifier, Starovoitov said. For example, the code dealing with scalar precision: its job is to make states look as similar as possible, to allow pruning to happen more effectively. It turns out not to really work for loops, so there's special detection logic in the verifier to work around problems with calculating whether different registers are still alive in the presence of loops. Starovoitov called the resulting design a "house of cards" and said that it's time to step back and think about how to rewrite the verifier.

Right now, the verifier has two ways of determining whether a register remains alive; the old one is path-sensitive (that is, it depends on the path the verifier took to reach a particular place in the code), while the new one is not. The new one is a lot more similar to the kind of compiler algorithm you find in the Dragon Book or in the academic literature. But the new analysis only works for registers, not the stack, so the old one has to be kept around. If the new analysis could be made to work for the stack, the old analysis could be removed, Starovoitov said, simplifying the logic around loops and making it easier to fix some of the bad user experience.

One way to do that might be to eliminate the stack in its current form (at least inside the verifier). The BPF stack has a fairly small maximum size; the verifier could potentially convert the entire stack into virtual registers. Or the BPF developers could change the ISA to tell compilers that there are now 74 registers and no stack. Theoretically, doing things that way would both dramatically simplify the verifier code and improve the just-in-time compilation of BPF to machine code by making use of more registers, when available on an architecture, he explained.

Zingerman immediately pointed out the problem with that idea, though: access to the contents of the stack based on a variable, instead of a fixed stack offset. Starovoitov agreed that was a problem. There are also some things, like iterators, that have to live on the stack that can't really be converted into registers, he said.

The overarching point of all of this is that the verifier needs to be simplified, Starovoitov concluded. verifier.c is the biggest file in the kernel and "should be shot". That simplification is the best path forward for enabling BPF to continue to grow and develop.

Comments (11 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: CVE funding; Yelp vulnerability; Fedora 42; Manjaro 25.0; GCC 15; Pinta 3.0; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds