
Leading items

Welcome to the LWN.net Weekly Edition for May 30, 2024

This edition contains the following feature content:

  • Readying DNF5 for Fedora 41: the switch to the rewritten package manager finally looks to be on track.
  • Fedora approves shipping pre-built macOS binaries: a narrow, contentious exception for the Asahi Linux installer.
  • The rest of the 6.10 merge window: what landed in the mainline for the next release.
  • Atomic writes without tears: untorn buffered writes for databases.
  • Filesystems and iomap: the challenges of adopting the block-mapping abstraction.
  • What's scheduled for sched_ext: the state of the extensible BPF scheduler.
  • Recent improvements to BPF's struct_ops mechanism: flexibility and compatibility work.
  • LLVM improvements for BPF verification: helping the compiler emit verifiable code.
  • Supporting BPF in GCC: progress toward parity with LLVM.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Readying DNF5 for Fedora 41

By Joe Brockmeier
May 24, 2024

With the release of Fedora 40, it's time to start looking ahead to what Fedora 41 has in store. One of the largest changes planned for the next release is a switch to DNF5, a C++ rewrite of the DNF package manager. A previous attempt to make the switch, during the Fedora 39 cycle, was called off and deferred to Fedora 41. The developers have had nearly a year to address compatibility problems and bring DNF5 to a state suitable to replace DNF4. Signs point to a successful switch in the upcoming release, though there may be a few surprises lurking for Fedora users.

It has been a long road to get to this point. Daniel Mach announced the start of DNF5 development in early 2020. The initial, tentative roadmap called for replacing DNF4 in Fedora 34—which was released in April 2021. It is no surprise that it has taken significantly longer than planned: DNF is a fundamental piece of the distribution.

From a user perspective, it is not enough that DNF5 is able to successfully install, update, remove, and generally manage software on a Fedora system—it needs to do so in a way that does not require users to re-learn how to use the tool. Even minor deviations from the prior version, like a missing option or a change in the output of the dnf command, can break build pipelines, force developers to rewrite scripts that expect the old behavior, and generally frustrate users who have learned how to use the existing tool. For some of us, the pain of the switch from Yum to DNF nine years ago is still fresh in our minds. In short, the bar for success is high, and the appetite for change is low.

DNF's developers made the first attempt at a switch ahead of Fedora 39, but the rewrite was found wanting. Instead of trying again for Fedora 40, the team regrouped and planned to try again in 2024. A new change proposal was put forward in March for Fedora 41. The Fedora Engineering Steering Committee (FESCo) gave the green light on April 8 to take another run at the switch. On April 25, DNF developer Jan Kolarik announced a side tag, a build done "on the side" that does not affect the rest of the packages, for testing DNF5 and said he planned to push it into Rawhide soon in the absence of negative feedback on critical functionality.

Michael Gruber asked whether distribution upgrades would work by the time Fedora 41 was released. If DNF5 moved into Rawhide now, he said, it was important to be sure that users could use it to upgrade to Fedora 42 when the time came. Kolarik indicated that it should be available before the release:

While our goal is to deliver the final system-upgrade functionality before the stable release, some adjustments may be made during the Fedora 41 lifecycle to ensure smoother upgrades from F41 to F42.

Kevin Fenzi said he thought the system-upgrade functionality was a beta requirement. Having it available to test then, he said, was important. Kolarik noted that DNF5 had the system-upgrade command already, though it was in need of more testing. Even so, users would be upgrading Fedora 40 to 41 using DNF4: users wouldn't need to rely on DNF5 to upgrade until Fedora 42 beta.

"Maxwell G" pointed out on April 25 that the side tag includes an update to libdnf version 5.2.0, which contains breaking API changes not communicated in the change proposal. The API bump caused breakage to the fedrq tool for querying package repositories, and likely other applications. He asked whether other applications have had a chance to port to the new API and if a porting guide could be made available "so API users can fix their software before this is pushed to rawhide?"

Kolarik apologized for failing to document the major version bump in the change proposal, but said that "implementing the new functionality without breaking ABI and API would have required a lot of extra work". He promised to provide a guide to API differences between 5.1 and 5.2 (found here) and to reach out to teams that would be affected by the version bump.

On May 9, Kolarik reported that the feedback on the testing side tag was positive, and that DNF5 had moved into Rawhide. DNF5 also got a workout during the Fedora 40 release cycle as the default package manager used to build the distribution. That deployment was largely transparent to users, since it only required telling the Mock build-environment manager to install DNF5 and use it when building packages in Fedora's Koji build system. Even so, it hit a small snag immediately with the groupinstall alias. Version 4.x of dnf accepts groupinstall or group install as a valid command to install a package group, but DNF5 has dropped groupinstall and other single-word group command aliases like grouplist, groupremove, and so on. Fenzi added an alias to support groupinstall during builds in Koji to fix the problem.

It was also discovered that dnf5 group install did not support the --allowerasing option. (That option, as the name implies, allows erasing of installed packages if needed to resolve dependencies.) Support for that option was added, and the switch to DNF5 to build the distribution was successful. Fedora 40 was built using DNF5. It is now also the default for building packages locally with Mock, for Fedora 40 and later.

More small snags are likely to be in store for users once DNF5 becomes the default. While feature parity with DNF4 is a goal, the change proposal acknowledges that some features may not be implemented in time. The full list of changes to the command-line interface between DNF4 and DNF5 is quite lengthy, with fewer changes to the API and configuration options.

Many of the differences are minor, such as dropping command-line options that merely select DNF5's default behavior. For example, the --all option for dnf info and dnf list has been dropped because that behavior is the default. Many options and commands have been renamed, such as updateinfo, which is now advisory. There are also a number of plugins for DNF4 that have yet to be implemented for DNF5. (Plugins to be implemented are being tracked on GitHub.)

Barring any major bugs, though, it seems likely that the time has come for Fedora users to get ready for DNF5. The deadline to decide whether to go ahead or revert to DNF4 is August 8, when Fedora 41 branches from Rawhide. That leaves just a few months for testing and bug fixing to ensure that it's up to the task.

Comments (4 posted)

Fedora approves shipping pre-built macOS binaries

By Joe Brockmeier
May 29, 2024

The Asahi Linux project works to support Linux on Apple Silicon hardware. The project's flagship distribution is the Fedora Asahi Remix, which has its own installer (rather than Anaconda) to accommodate the unique requirements of installing on Apple's hardware. Previously the installer was built by the Asahi project, but it has asked for (and received) an exception from the Fedora Engineering Steering Committee (FESCo) to include two binaries from upstream open-source projects so that the installer can be built on Fedora infrastructure.

Apple Silicon does not support something as simple as plugging in a USB stick and rebooting into a Linux installer. Users who want to install Linux on an M1 or later Mac have to start the installation in macOS, resize the disk so there is room for Asahi, and then reboot into macOS Recovery (recoveryOS) to finish the installation. Asahi Linux is typically installed alongside macOS, so users can choose to boot into either operating system, though they can get rid of the macOS partition entirely. As part of the process, Asahi replaces the macOS kernel used for system recovery with Asahi's m1n1 bootloader for Apple hardware. The entire boot process for Apple Silicon is well-described in the Asahi January/February 2021 progress report.

This means that the installer (which is written in Python) requires two macOS binaries to perform the installation: a Python interpreter for macOS and libffi, which is used by Python in recoveryOS to extract firmware from the macOS kernel for Linux to use. Unfortunately, building these for macOS requires Xcode, so it is not possible to build the binaries on Linux; that means shipping prebuilt binaries.

According to Fedora packaging guidelines all "program binaries and program libraries" should be built from source for security and to ensure that they use the standard Fedora compiler flags. (This does not extend to content binaries such as images or PDFs, which may be included without corresponding sources.) Since this isn't possible, Asahi contributor Davide Cavalca requested an exception on May 15 for a macOS build of Python and a build of libffi from the Homebrew project:

We specifically want to do this because it will allow us to ship to users an m1n1 stage1 that is also built in Fedora (the Asahi Linux installer ships its own prebuilt one).

Neal Gompa replied "this is probably fine, since from our perspective, macOS is 'firmware-ish'". Tom Stellard wondered whether it would be possible to cross-compile the binaries rather than pulling in binaries produced on macOS. Cavalca responded that he did not believe it was practically possible to do so, short of running a macOS virtual machine with Xcode on top of Linux. At some point it might be possible to use Darling, a project aimed at running macOS software on Linux, "but I don't believe it's in a usable state yet (which is also why we haven't packaged it for now)".

Former FESCo member Miro Hrončok said that he would probably be against allowing the exception. He made the argument that allowing prebuilt binaries in for macOS opened the door to dropping the requirement to build everything from source altogether. He also asked "how do we know the macOS binaries don't contain some proprietary macOS/Xcode bits?" and suggested that the request should be discussed on a mailing list or in Fedora's Discourse forum, but the conversation never moved to either venue.

Cavalca said he had not audited the binaries, but that they come from official upstream sources (Python and Homebrew, respectively) and are redistributable. He responded in a roundabout way to Hrončok's question about proprietary bits by saying that using Xcode does not preclude redistribution "as otherwise you wouldn't be able to use the compiler for much of anything".

The matter was taken up by FESCo as new business in the May 20 meeting. (The meeting log format for Fedora meetings, unfortunately, does not allow linking directly to individual comments or timestamps. The discussion begins at 19:11:58.) During the meeting it was noted that the Fedora Project has another program built outside Fedora's Koji build system that targets macOS: Fedora Media Writer. Josh Stone asked how that was handled, and Stephen Gallagher replied "poorly". Gompa followed up to explain that the macOS binary is built elsewhere and then submitted to Fedora release engineering to be notarized (digitally signed by Apple) so macOS users don't receive warnings when running the program.

After some back-and-forth discussion about the oddities and problematic licensing of the macOS toolchain, Gallagher said he did not understand the advantage of packaging the binaries if Fedora did not control the build system. Cavalca said that having the installer package built by Fedora means "we go from the installer being a random untrusted blob to it being a trusted package that relies on two smaller blobs".

Eventually, Zbigniew Jędrzejewski-Szmek said he had started out against the proposal but had come around to a "more positive view". He noted that the code would not run on Fedora, but on macOS, and that accepting upstream binaries was the least-bad solution:

We're not experts at building stuff for MacOS, so replicating the builds that are already done doesn't gain us much. It's likely that it could introduce additional problems and bugs. And since that code is never going to be executed on a Linux system, it's like firmware, i.e. something that we accept for pragmatic reasons.

Gallagher pressed for a vote after about 50 minutes of discussion on the topic. (Timestamp 19:50:20 in the meeting log.) David Cantrell, Kevin Fenzi, Josh Stone, and Stellard all voted against the exception. Major Hayden, Tomáš Hrčka, Gallagher, Gompa, and Jędrzejewski-Szmek voted in favor, approving the exception by one vote, five to four. After the vote was tallied, Gallagher said: "that's the most contentious vote I've seen in a while".

After the meeting minutes were posted to the Fedora development mailing list, Hrončok wrote: "I am a tad sad that this was approved by FESCo without being first discussed with the wider community." Fenzi agreed.

For Asahi Linux users, little will change. The installer will continue to work the same way as it had previously, but it will be built with Fedora infrastructure. It will be interesting to see whether this sets a precedent for prebuilt binaries, or ends up being a one-time concession to helping users migrate away from a proprietary operating system. We have a chance to find out before long: FESCo is also being asked to approve an exception to allow signed SGX enclave binaries for running confidential virtual machines, and should be taking that up soon.

Comments (12 posted)

The rest of the 6.10 merge window

By Jonathan Corbet
May 27, 2024

Linus Torvalds released 6.10-rc1 and closed the 6.10 merge window on May 26. By that time, 11,534 non-merge changesets had been pulled into the mainline for the next release; nearly 5,000 of those came in after "The first half of the 6.10 merge window" was written. While the latter half of the merge window tends to focus more on fixes, there was also a lot of new functionality that landed during this time.

Significant changes merged since the first-half summary include:

Architecture-specific

  • 32-bit Arm systems can now be built with Clang-based control-flow integrity.
  • The PowerPC BPF JIT compiler now supports kfuncs.
  • The RISC-V architecture has gained support for the Rust language.

Core kernel

  • It is now possible to map tracing ring buffers directly into user space. See this merge message and this documentation commit for more information.
  • An initial set of patches toward the eventual consolidation of hugetlbfs into the core memory-management subsystem has been merged; there should be no user-visible changes.
  • The ntsync subsystem, which provides a set of Windows NT synchronization primitives for Linux, has been merged. It is, however, marked as "broken" for this release and cannot yet be used for its intended purpose.
  • After a significant amount of discussion and change, the mseal() system call was merged as one of the final features for this development cycle. mseal() allows a process to forbid future changes to portions of its address space; the initial application is in the Chrome browser, which will use it to strengthen its internal sandboxing. More information can be found in this documentation commit.
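
As a minimal sketch of how a process might use the new system call (glibc had no wrapper at the time of the merge, so the raw system-call number, 462 in 6.10, is used directly):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_mseal
    #define SYS_mseal 462    /* 6.10 syscall table */
    #endif

    int main(void)
    {
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Forbid any future changes to this mapping. */
        if (syscall(SYS_mseal, p, len, 0) != 0)
            perror("mseal");

        /* The region is sealed; this mprotect() now fails with EPERM. */
        if (mprotect(p, len, PROT_READ) != 0)
            perror("mprotect on sealed region");
        return 0;
    }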

Filesystems and block I/O

  • There is a new netlink-based protocol for the control of the NFS server in the kernel; a new nfsdctl tool is said to be on its way into the nfs-utils package.
  • The XFS filesystem continues to gain more online repair functionality.
  • The filesystems in user space (FUSE) subsystem now supports integrity protection with fs-verity.
  • The overlayfs filesystem is now able to create temporary files using the O_TMPFILE option.

Hardware support

  • Clock: Sophgo CV1800 series SoCs clock controllers, STMicroelectronics stm32mp25x clocks, NXP i.MX95 BLK CTL clocks, and Epson RX8111 realtime clocks.
  • Media: Broadcom BCM283x/BCM271x CSI-2 receivers and sixth-generation Intel image processing units.
  • Miscellaneous: Acer Aspire 1 embedded controllers, Lenovo WMI camera buttons, ACPI Quickstart buttons, Lenovo Yoga Tablet 2 1380 fast chargers, Dell AIO UART backlight interfaces, MeeGoPad ANX7428 Type-C switches, Zhaoxin I2C interfaces, Lenovo SE10 watchdog timers, ARM MHUv3 mailbox controllers, Samsung HDMI PHYs, MediaTek 10GE SerDes XFI T-PHYs, and Rockchip USBDP COMBO PHY.

Miscellaneous

  • The perf tool has, as usual, seen a lot of changes; details can be found in this merge message.

Networking

  • The IORING_CQE_F_SOCK_NONEMPTY completion flag can now be used with io_uring accept operations to determine whether there are more connection requests waiting on a socket; a sketch follows this list.
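
A rough sketch of how a server might use the flag with liburing (error handling is omitted, and listen_fd is assumed to be a listening socket set up elsewhere):

    #include <liburing.h>
    #include <stddef.h>

    void drain_accepts(struct io_uring *ring, int listen_fd)
    {
        struct io_uring_cqe *cqe;
        int more;

        do {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
            io_uring_submit(ring);
            io_uring_wait_cqe(ring, &cqe);

            /* cqe->res is the accepted connection's file descriptor;
             * the flag says whether more connections are queued. */
            more = (cqe->flags & IORING_CQE_F_SOCK_NONEMPTY) != 0;
            io_uring_cqe_seen(ring, cqe);
        } while (more);
    }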

Security-related

  • The Landlock security module is now able to apply policies to ioctl() calls; see this documentation commit for a bit more information.
  • The new init_mlocked_on_free boot option will cause any memory that is locked into RAM with mlock() to be zeroed if (and only if) it is freed without having been first unlocked with munlock(). The purpose is to protect memory that may be holding cryptographic keys from being exposed after an application crash.

Internal kernel changes

  • Developers may be unaware of the no_printk() macro. Its job is to do nothing, but to preserve printk() statements in the code should somebody need to restore them for future debugging purposes. In prior kernels, no_printk() still contributed indexing data to the kernel image, even though it printed nothing; that has been fixed for 6.10. (A brief usage sketch appears after this list.)
  • Some changes to how memory for executable code in the kernel is allocated have made it possible to enable ftrace and kprobes without the need to enable loadable-module support.
  • Work items in BH workqueues can now be enabled and disabled; with this change, it should be possible to convert all tasklet users over to the new mechanism.
  • The (sometimes controversial) memory-allocation profiling subsystem has been merged; this should help developers optimize memory use and track down memory leaks. See this documentation commit for some more information.
  • There are 371 more symbols exported to modules in 6.10, and 18 new kfuncs; see this page for the full list of changes.
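
As a brief illustration of the no_printk() pattern mentioned above (MY_DRIVER_DEBUG and my_dbg() are hypothetical names):

    /*
     * Debug output compiles away when disabled, but no_printk() still
     * type-checks the format string and arguments, so the statements do
     * not bit-rot; as of 6.10, the disabled form no longer adds
     * printk-index data to the kernel image either.
     */
    #include <linux/printk.h>

    #ifdef MY_DRIVER_DEBUG
    #define my_dbg(fmt, ...) printk(KERN_DEBUG fmt, ##__VA_ARGS__)
    #else
    #define my_dbg(fmt, ...) no_printk(KERN_DEBUG fmt, ##__VA_ARGS__)
    #endif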

If this development cycle follows the usual timeline (and, these days, they all do), then the final 6.10 release will happen on July 14 or 21. Between now and then, though, there will be a need for a lot of testing and bug fixing.

[Note that LWN subscribers can find more information about the contributions to 6.10-rc1 in the LWN kernel-source database.]

Comments (23 posted)

Atomic writes without tears

By Jake Edge
May 24, 2024

LSFMM+BPF

John Garry and Ted Ts'o led a discussion about supporting atomic writes for buffered I/O, without any torn (or partial) writes to the device, at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. It is something of a continuation of a discussion at last year's summit. The goal is to help PostgreSQL, which writes its data using 16KB buffered I/O; it currently has to do a lot of extra work to ensure that its data is safe on disk. A promise of non-torn, 16KB buffered writes would allow the database to avoid doing double writes.

[John Garry]

Garry began by going over the problem being solved; databases generally write their data in chunks larger than the block size of the block device (which is generally the same as the page size of the system, 4KB). MySQL and PostgreSQL both use larger chunks, up to 16KB. They need to ensure the persistence of these chunks, in full, in order to maintain an uncorrupted database. MySQL uses direct I/O, which is generally able to ensure that 16KB is either fully written, or not written at all, on today's storage hardware.

The kernel does not guarantee atomic 16KB writes, even for direct I/O, however. So Garry has come up with a patch set for supporting atomic block writes (as well as one to add the feature to XFS). Later in the session he said that there is ongoing work to support the feature in ext4; he has also posted an RFC patch set for buffered atomic writes.

In his patches, there is a new RWF_ATOMIC flag for pwritev2() that requests torn-write protection for the write, he said; there is a corresponding flag for io_uring as well. The RWF_SYNC flag is still needed to guarantee persistence, though. The statx() system call can be used to query some new fields to determine the minimum and maximum atomic sizes supported, as well as the maximum number of atomic segments allowed for a given write operation. All of those values are dependent on the underlying filesystem, block-layer, and storage-device limitations.

In order to make a call using RWF_ATOMIC, the total length must be a power of two between the minimum and maximum sizes, the write must be "naturally aligned", and it must have an iov_count no larger than the maximum number of segments. Damien Le Moal asked whether this feature required hardware support in order to ensure persistence with RWF_SYNC; Garry said that it did. Hannes Reinecke asked what was meant by "naturally aligned offset"; Garry said that it means aligning an 8KB write on an 8KB boundary, 16KB on 16KB, and so on.
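
A minimal sketch of the proposed interface; the STATX_WRITE_ATOMIC mask and the stx_atomic_write_* field names are taken from the patch set and could still change before merging:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/uio.h>

    /* Write one naturally aligned 16KB chunk with torn-write protection. */
    ssize_t atomic_write_16k(int fd, const void *buf, off_t offset)
    {
        struct statx stx;
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = 16384 };

        /* Query the atomic-write limits for this file. */
        if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) != 0 ||
            stx.stx_atomic_write_unit_max < 16384)
            return -1;        /* no 16KB atomic writes available here */

        /* The offset must be a multiple of the 16KB write size. */
        return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC | RWF_SYNC);
    }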

Hardware

Both SCSI and NVMe support torn-write protection, Garry continued, but they do it differently. NVMe does atomic writes implicitly; there is no dedicated command to request them. Devices advertise a limit; if a write is no larger than that limit, and does not cross the device-specific atomic-write boundary (if any), it will be written atomically. SCSI has a separate command and more constraints; unlike NVMe, though, it rejects atomic writes that do not meet the requirements. Reinecke asked whether it made sense to merge BIOs in an atomic-write request; Garry said that they will not be merged if the combination cannot be done as an atomic write.

The XFS support adds a new FORCEALIGN inode flag that can be set via an ioctl() command, which forces the file to be aligned on certain boundaries. That can be used to ensure that file extents are 16KB-aligned, for example, on a per-inode basis.

[Ted Ts'o]

Ts'o said that the cloud vendors are already advertising torn-write protection for MySQL; they are using ext4 with particular settings on devices that can provide that protection. But there are "lots of sharp edges"; the vendors have to audit the code paths and hope that a kernel update does not break them in some way, since the kernel makes no guarantees. The feature can provide a 60-100% improvement in database performance, he said, because MySQL can avoid doing a double write, which makes it attractive.

He said that the database developers want write requests that are only ever torn at 16KB boundaries. With the atomic-write support, kernel developers are trying to do better than that, but it is important to ensure that the huge performance gain does not get diminished or lost in the process. For example, the database might send a contiguous 256KB write, but the only thing the developers are looking for is to know if the I/O fails for some reason and that the failure can only happen at 16KB boundaries. It is important to remember that the database developers want "untorn writes"; guaranteeing more with atomic writes is fine, Ts'o said, as long as it comes for free.

Dave Chinner said that XFS added atomic writes ten years ago, which doubled MySQL performance, so the effects of this kind of change have been known for a long time. Garry said that the term "atomic" is used for the feature, because that is what the hardware vendors call it, but that it is providing the "untorn writes" that database developers want. Matthew Wilcox noted that NVMe is specified to have 16KB tear boundaries; he wondered if the SCSI vendors could be convinced to do something similar. But Martin Petersen wanted to know what problems there were with the current SCSI semantics; there are differences from NVMe, but he was unsure why they are a problem. It turned out that the currently proposed implementation for atomic writes does not need to use everything that SCSI provides, so it is not clear whether there are any actual deficiencies.

Buffered I/O

Ts'o said that the buffered I/O piece is where this all gets interesting. The proposed API works great for direct I/O, because right after the call to pwritev2(), the I/O is actually done to the device. For buffered I/O, that is not the case, since everything is going through the page cache, which means that the write may not actually happen for "30 seconds or until user space calls fdatasync()". The reason for caring about buffered I/O is because PostgreSQL uses it; depending on who you talk to, it will be three to six years before the database can switch to using direct I/O.

Part of the problem is that using the proposed API means keeping track of which writes were done using the atomic flag. If a 64KB write is done with the flag, followed by a 16KB write, the two need to be tracked separately. There has also been talk of a hybrid mode, he said, where a non-atomic-aware application can also write to the file in a non-aligned way such that "things don't blow up". The problem has become over-constrained; pwritev2() is fine for direct I/O, but does not really fit with what the database developers are asking for in the buffered I/O case.

There are multiple ways to create an interface for buffered I/O, Ts'o said. It could be an inode-based write granularity, set with something like the XFS FORCEALIGN flag, an ioctl() command, or fcntl() operation. Then there is a question of whether to stick with the pwritev2() interface, which is more powerful than what is needed, or, perhaps, to require that the application using buffered I/O only do atomic writes at the granularity specified. That would mean that the kernel does not have to track various in-flight atomic-write sizes. Another way to do that might be to require that the size of the folio used for the write specifies the granularity.

An attendee said that with buffered I/O, there is no way for the application to get an error if the write fails. Ts'o said that any error would come when fdatasync() is called, which the attendee called "buffered in name only". But Ts'o said that it is how PostgreSQL works today; it does multiple buffered writes, then calls fdatasync() to make that data persistent and to detect if there are any problems. The developers understand that model and it is the documented POSIX interface.

Jan Kara suggested that instead of tracking different sizes of atomic writes, a single size could be tracked; if another write comes in with a different size, the earlier writes could be flushed out. In his RFC, Garry said, the granularity was effectively set by the FORCEALIGN size.

There was some discussion of the SCSI semantics with respect to whether reads were synchronized with writes, and whether that means an atomic read operation is also needed. The answer seemed to be that no atomic read was needed. But, because SCSI has separate write commands for atomic versus non-atomic, there does need to be some kind of indication from user space about the kind of I/O it expects. That could be done with a flag on the inode or folio.

Chinner suggested that the page-cache code could interpret these flags and implement writethrough for writes of that sort. It could be implemented using the direct I/O code, so that those kinds of writes are not truly buffered. But Garry said that the tricky piece is handling a mixture of atomic and non-atomic writes on the same folio.

The only reason an application would be using atomic writes, though, is for performance, Chinner said. Trying to support both types of writes, including non-aligned writes, does not make any sense. It comes down to a question of whether it is an error to mix the two types of writing for the same file, Ts'o said, or if there is a call to pwritev2() with the wrong alignment; there is a need to clearly define what the semantics are.

Kara asked about the impact of these changes on the database code. Ts'o said that he believes the PostgreSQL developers are looking forward to a solution, so they are willing to make the required changes when the time comes. They are likely to backport those changes to earlier versions of PostgreSQL as well. Wilcox said that probably did not matter, because the older versions of PostgreSQL are running on neolithic kernels. Ts'o said that is not necessarily the case since some customers are willing to upgrade their kernel, but require sticking with the older database system.

The discussion trailed off at that point, so any further progress will presumably come on the mailing list.

Comments (8 posted)

Filesystems and iomap

By Jake Edge
May 28, 2024

LSFMM+BPF

The iomap block-mapping abstraction is being used by more filesystems, in part because of its support for large folios. But there are some challenges in adopting iomap, which was the topic of a discussion led by Ritesh Harjani in a combined storage and filesystem session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. One of the main trouble spots is how to handle metadata, which is not an area that iomap has been aimed at.

Iomap has become a VFS abstraction for mapping logical offset ranges to the physical extents for files, Harjani began. It provides an iterator model that is filesystem-centric, rather than page-cache-centric, with regard to mapping the byte offset of a file to its blocks on the storage device. It also abstracts common page-cache operations and supports mapping from folios (and large folios).
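
The heart of that model is the iomap_ops structure, shown here as in include/linux/iomap.h (comments paraphrased); the filesystem maps a byte range to an extent, and the iomap core handles the iteration and the page-cache details:

    struct iomap_ops {
        /*
         * Map the byte range starting at pos; the filesystem fills in
         * *iomap with the extent covering (a prefix of) the range.
         */
        int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                           unsigned flags, struct iomap *iomap,
                           struct iomap *srcmap);

        /* Optional: clean up once the operation on the range completes. */
        int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                         ssize_t written, unsigned flags,
                         struct iomap *iomap);
    };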

[Ritesh Harjani]

Managing the "dirty" flags for individual blocks was not possible with large folios, because there was only a single bit to track that state for the entire folio. Per-block tracking of that state has been added, so that only those blocks that are actually dirty need to be handled during writeback. That provides a significant savings by avoiding the write amplification that happened during writeback without the tracking, he said, which means that iomap scales much better than before.

Harjani then talked about the upstream status of various pieces. Ext4 and Btrfs both switched to using iomap for direct I/O, in Linux 5.5 and 5.8, respectively. The 6.6 kernel added large-folio support and per-block dirty tracking to iomap, as well as iomap for ext2 direct I/O. In 6.9, the multi-block mapping optimization for iomap was added; it allows specifying a range of dirty blocks for writeback.

There are various things in progress as well. Ext2 buffered-I/O iomap-conversion patches are in progress at this point, while the ext4 buffered-I/O conversion, for filesystems mounted with default options, is being worked on. There is work going on to optimize access to filesystems that have indirect block mappings. Getting iomap documentation into the kernel tree was discussed at last year's summit, but has not yet happened; he has a documentation patch that is out for review, so that problem should be solved relatively soon.

There are a number of things that are motivating filesystem developers to make the switch to iomap. Support for buffered atomic writes in iomap is in the works, as well as support for block sizes larger than the system's page size. Beyond that, Matthew Wilcox noted that if developers switch their filesystem, "I will stop bugging you about large-folio support"; XFS uses iomap and no longer deals with pages at all. "Folios, pages, you don't care anymore if you use iomap."

There is a long list of filesystems that have at least some support for iomap, Harjani said, based on a search for "FS_IOMAP" in the fs tree. He did the same search for "LEGACY_DIRECT_IO" to show filesystems that have no support for iomap and wondered what the plan in the filesystem community was for those. Al Viro said that the LEGACY_DIRECT_IO search was not really the right way to look at the problem, because it artificially splits filesystems that are not that different from each other. There are several filesystems on the list that could directly benefit from the work that has been done on ext2, for example, but only if someone actually cares enough about them to do it. He may look into adapting the ext2 work for minixfs.

Amir Goldstein wondered who was going to test any changes to the legacy filesystems, many of which do not have any tests—or even a way to create a filesystem (mkfs). Harjani said that he thought the person doing the conversion would work with the maintainer, but Goldstein said that some of the maintainers do not really have the time to work on things of that sort. It goes back to the problem of unmaintained filesystems in the kernel, which has been a recurring topic at the summit over the years.

In his conversion of ext2, Harjani found that the directory-handling code uses the page cache directly. Iomap does not export an API that is similar to the byte-oriented API that ext2 currently uses. Perhaps iomap can export an API that can be used for that.

There is no support for metadata I/O at all in iomap. One possible solution is to lift the buffer-cache code from XFS, as was discussed in the large-block-size session earlier in the day. Another would be to do some "surgery" on the buffer-head API: adding ways to read metadata blocks, to track metadata buffers that are not attached to an inode, and to track buffers for journaling before doing I/O to the filesystem. He wondered which approach made more sense.

Iomap was never intended for metadata use, Dave Chinner said, as a bit of background. It was only ever used for the data path in XFS, which is where iomap came from. He is not sure that trying to use iomap for metadata is the right approach; metadata handling is typically filesystem-specific, such as for journaling. He thinks that looking at a replacement for buffer heads would be the right mechanism for metadata handling. Harjani thought that iomap had features that made it attractive to use for metadata; Chinner was somewhat skeptical but thought it probably could be done.

For ext2, the metadata is the directory contents, Viro said; the indirect blocks are another kind of metadata for the filesystem, but Harjani said he was focused on the directory contents. Viro suggested treating a directory as much like a regular file as possible; it would be strange to use iomap for files, but not for directories, because they are close to the same thing.

Wilcox said that unifying the page cache and buffer cache, and using the page cache for directories, was a mistake. He thinks there should be a separate buffer cache for the directory information; the page cache keeps a "lot of metadata about metadata" that is unneeded. For example, you do not need a map_count for a directory, because it cannot be mapped using mmap().

What is really needed is a buffer cache that is not just an alias into the page cache, which is something that he thinks XFS developer Darrick Wong has already done. Viro said that historically using the same layer for file data and metadata, such as the page cache, has been the norm, possibly going all the way back to Multics. Wilcox said that the page cache exists, in part, to ensure that writes and mmap() do not interfere with each other, which is not a problem for directories.

In conclusion, Harjani said that he would pursue the iomap approach for metadata to see where that goes for ext2. He would also like to see the buffer-head interface get stripped down to what is essential and be renamed to fs_buf or something along those lines.

Comments (none posted)

What's scheduled for sched_ext

By Daroc Alden
May 23, 2024

LSFMM+BPF

David Vernet's second talk at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit was a summary of the state of sched_ext, the extensible BPF scheduler that LWN covered in early May. In short, sched_ext is intended as a platform for rapid experimentation with schedulers, and a tool to let performance-minded administrators customize the scheduler to their workload. The patch set has seen several revisions, becoming more generic and powerful over time. Vernet spoke about what has been done in the past year, and what is still missing before sched_ext can be considered pretty much complete.

A year of improvements

Vernet opened the talk with a bit of background on sched_ext, and then went through a quick list of improvements made in the last year. First on his list was better debugging support, including hooks for dumping information about the running sched_ext scheduler to user space. He gave a demo, showing how to dump the list of tasks being managed by the scheduler and their states. He then spoke about the work done to integrate sched_ext more tightly with the CPU-frequency and scaling code. That work involved adding kfuncs to let BPF control the CPU frequency, but also improving the tracking code to make use of the current frequency when accounting how long a task ran for. The pieces are all in place now for sched_ext schedulers to implement per-entity load tracking (PELT), Vernet said.

Other improvements include some changes to the dispatch API, hotplug support, and better backward compatibility. The sched_ext code has also motivated expanding BPF's support for kptrs (pointers to kernel objects).

Vernet spoke next about the improvements to the schedulers themselves, highlighting two in particular. scx_rusty is a hybrid BPF/user-space scheduler; it handles load balancing and statistics in user space, with the hot-path decisions made in BPF. scx_rusty improves on the new EEVDF scheduler for latency-sensitive tasks, such as gaming, by using slightly different heuristics. EEVDF uses a task's eligibility and time-slice length to determine when to schedule it. Unfortunately, slice length is not responsive to the actual workload. It can be configured by an administrator, but often isn't, particularly because it can be hard to tell what slice length is appropriate for a workload. scx_rusty uses the length of time for which a task actually ran, instead of its static slice length, which lets it schedule tasks that run only briefly before blocking more often than tasks that consume their whole time quantum. The scheduler also considers whether a task often blocks waiting for other tasks (indicating a consumer), wakes other tasks (indicating a producer), or both (indicating the middle of a pipeline). It factors this information into the virtual deadline calculated for the task, slightly prioritizing tasks that block frequently, and greatly prioritizing tasks that wake other threads.

These improvements combine to make scx_rusty noticeably more performant for games. Vernet showed another demo of a game experiencing lag under EEVDF, which immediately disappears upon enabling scx_rusty. Normally, interactivity is a tradeoff against throughput. Vernet claimed that scx_rusty actually also has slightly better throughput than EEVDF, but didn't cite specific numbers.

The other scheduler Vernet highlighted was scx_layered, a statically-configured scheduler. Users can sort processes into "layers" by factors like the name of the executable, the process's control group, or other metadata. Each layer can then be statically configured with different properties. He said that Meta uses the scheduler for a lot of web workloads, where the ability to have fine-grained control over scheduling was valuable.

There are other sched_ext schedulers in active development, including scx_lavd, a scheduler designed specifically for the Steam Deck, and scx_rustland, which delegates scheduling decisions to user space. Since last year, many of the sched_ext schedulers were moved into a separate GitHub repository from the main sched_ext code. Vernet said the scheduler repository was intended to have a low barrier to entry, calling for people to "submit your scheduler, do whatever", hopefully leading to an exchange of ideas. The repository also has some helpful libraries and Rust crates for writing schedulers. Vernet was careful to point out that the schedulers moved to the separate repository remain GPLv2 licensed, even if they are no longer intended to become part of the kernel sources — and, in fact, the BPF verifier refuses to load code that does not claim to be GPLv2-licensed.

What's coming

Vernet then turned to what remains to be done. "It's incredible what you can do in schedulers already," but there are still areas to work on, he said. Some examples of current rough edges: BPF programs can't hold a spinlock around a kernel function call; nested structures that contain pointers to kernel objects confuse the verifier; and BPF programs still have an incredibly small stack (512 bytes), which is sometimes hard to work around. He also commented that it would be nice if the build system could be made to produce bindings that are more amenable to safe backward compatibility.

Despite that, Vernet called the situation on the BPF side "really really encouraging," and said that the most helpful thing people could do at this point would be to improve existing schedulers, particularly adding tests. He also highlighted a few different areas of upcoming work, including restructuring things so that a scheduler can be attached to a specific control group. That is one use case for larger BPF stacks, because it means that schedulers could be called recursively (since control groups can be nested). Another upcoming item is managing idle policies — currently, these are managed separately from the scheduler, but Vernet is "not convinced that they should be". All of the recent improvements have had some spillover benefits, as well. The recent cpufreq tuning has uncovered bottlenecks elsewhere in the code, and the sched_ext work has motivated other improvements to BPF that are likely to be useful elsewhere.

Vernet concluded by asking people to join the conversation on the mailing list if sched_ext was useful to them, noting that it will be the users who ultimately cause the feature to be merged. At that point, discussion opened up more widely, with several people having questions about Vernet's summary.

One audience member asked what Vernet was working on next in terms of core BPF programs. He responded with two main items: stack allocation (alloca()) for BPF programs, and making it possible to attach struct_ops programs to a control group, instead of having them all be global. Vernet also intends to make small usability improvements wherever possible. He called out holding a spinlock around a call to a kfunc as an example, saying that this is what prevents people from implementing EEVDF in sched_ext for rapid experimentation.

Another participant asked what Vernet thought about tail-end latency (how long the longest operations take) versus generic latency (how long operations take on average), and whether that presented any areas for scheduler improvement. Vernet clarified that for gaming, people mostly care about tail latency. scx_rusty, the scheduler he demonstrated, reduces the time slices it gives out when the system is overloaded in order to help with that. But Vernet doesn't think that improving tail-end latency needs to come at the expense of generic latency or vice versa: "I think we could get a win on every front." The audience member was skeptical about that, saying that existing schedulers have a lot of optimization put into them. Vernet remained optimistic, however, saying that EEVDF is relatively new, and therefore there may still be places to make improvements.

A third audience member noted that they had looked at running EEVDF on Chromebooks for the upcoming Power Management and Scheduling in the Linux Kernel (OSPM) conference, and had seen a big reduction in tail latency. Vernet asked what version of the kernel that was, to which the audience member responded that they had actually backported EEVDF to 5.15 and disabled the eligibility checks. Vernet replied that EEVDF's eligibility mechanism hurts the latency of the system, but without it the scheduler doesn't actually bound lag. He continued to say that ideally the perfect scheduler would adapt to what the system is actually doing.

Sched_ext has continued improving, becoming increasingly tempting for workloads that have been less well-served by traditional schedulers, but there remain some bumps in the road before it becomes as useful as it could be. Minor BPF improvements will open up much more flexibility for future schedulers, but the big remaining obstacle is getting the code merged at all — even though the topic was hardly touched on.

Comments (6 posted)

Recent improvements to BPF's struct_ops mechanism

By Daroc Alden
May 24, 2024

LSFMM+BPF

Kui-Feng Lee spoke early in the BPF track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit about some of the recent improvements to BPF. These changes were largely driven by the sched_ext work that David Vernet had covered in the previous talk. Lee focused on changes relevant to struct_ops programs, but several of those changes apply to all BPF programs.

There are several mechanisms to attach BPF programs to the kernel at various points. One such mechanism is struct_ops, which lets a subsystem define a structure full of function pointers that can then have functions defined in BPF (only from the same compiled program) attached to them. When a user writes a BPF program, they declare an instance of that structure in a special section of the compiled program. When the BPF program is loaded, the kernel uses the contents of that section to populate the structure on the kernel side. BPF uses a different calling convention than the kernel, so the struct_ops structure is actually filled with function pointers to a set of newly allocated trampolines that perform the conversion. This is a flexible mechanism, but sometimes not quite flexible enough — occasionally, the user wants to override the value of some member of the structure at run time, based on the current state of the system. The first new feature Lee spoke about addresses that problem by allowing user space to "shadow" members of the structure. The user-space loading code now has functions available to override struct_ops members before loading the BPF program.
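
A sketch of what the BPF side looks like, for a hypothetical subsystem that defines a struct my_ops with a single operator (the section conventions are libbpf's):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* An implementation of the hypothetical operator. */
    SEC("struct_ops/select_target")
    int BPF_PROG(select_target, int cpu)
    {
        return cpu % 2;    /* a trivial policy, for illustration */
    }

    /* The instance that the kernel registers when the program loads. */
    SEC(".struct_ops")
    struct my_ops ops = {
        .select_target = (void *)select_target,
    };

With the shadowing support described above, the user-space loader can overwrite the non-function members of ops (a flag or a tuning knob, say) after opening the program but before loading it.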

[Kui-Feng Lee]

The size of the struct_ops structure has been fairly limited for a while, because BPF function pointers can't just be put in the structure directly. The BPF subsystem uses trampolines to convert between the kernel's calling convention and BPF's calling convention. Until recently, the BPF code has only allocated one page for trampolines. On x86, this limits struct_ops structures to 20 entries. Now, Lee said, the code supports up to eight pages for trampolines, greatly increasing the usable size.

Another small feature is support for verifier-tracked null pointers as arguments. Previously, the verifier assumed that arguments passed to BPF functions by the kernel were valid pointers — so it would let those values be dereferenced without a check, potentially causing problems if the kernel passed a program a null pointer instead. Now, developers can annotate arguments to BPF functions as being nullable, and the verifier enforces that they must be checked before they can be dereferenced.
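
On the kernel side, the annotation is a suffix on the argument name; this sketch follows the pattern used in the bpf_testmod selftest (struct task_ops is a hypothetical example):

    /*
     * The __nullable suffix tells the verifier that the kernel may pass
     * NULL for this argument, so a BPF implementation of handle_task()
     * must check the pointer before dereferencing it.
     */
    struct task_ops {
        int (*handle_task)(int cpu, struct task_struct *task__nullable);
    };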

The BPF code has also been changed to allow more flexibility in where struct_ops structures can be defined. Initially, only non-modular kernel code could define the structures, Lee said. Recently, that restriction has been relaxed, and now kernel modules can define their own struct_ops types. He called out one of the kernel selftests — bpf_testmod.c — as a good example of how that works.

Lee wrapped up by talking about mechanisms to support compatibility. APIs and types evolve over time, and BPF programs need to be able to cope with that. In the case of struct_ops, two backward-compatible ways to make changes are to add new operators, or to add arguments to existing operators. Lee made the point that the verifier checks a program's behavior, but it does not actually check the program's signature. So in the case of adding new arguments to a function, old programs won't touch the new arguments, which is valid behavior. In the case of adding new operators, things are slightly more tricky. But as long as they are added to the end of the structure, everything will still work out — the type in the kernel will have more fields than the corresponding type in the BPF program, but libbpf zeroes out the entire structure before loading. It also ignores trailing fields that are zeroed out in the BPF program but absent in the kernel. So subsystems and modules are free to add to struct_ops interfaces without requiring existing BPF programs to be rewritten.

One member of the audience asked whether there was any existing tooling to check function signatures as opposed to behaviors. Lee replied that there was not.

That isn't the only way BPF supports backward-compatible interfaces, however; another somewhat magical feature is names with suffixes. Specifically, if libbpf sees a suffix attached to a type with three underscores, Lee explained, it ignores everything after the underscores. This means that a BPF header could define two structures player___v1 and player___v2, and they would both be mapped to player in the kernel. This lets a BPF program implement multiple versions of the same interface, should that turn out to be necessary.
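
A sketch of how a program might combine that convention with libbpf's CO-RE machinery to pick a version at run time:

    #include <bpf/bpf_core_read.h>

    /* libbpf treats both of these as the kernel type "player". */
    struct player___v1 { int score; };
    struct player___v2 { int score; int level; };

    int pick_version(void)
    {
        /* CO-RE check: does the running kernel define the v2 layout? */
        if (bpf_core_type_exists(struct player___v2))
            return 2;
        return 1;
    }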

A remote participant noted that all of this supported decoupling kernel versions from BPF programs, but asked Lee whether there were any mechanisms to support decoupling in the other direction, i.e. to let a BPF program not need to know what module is loading it in order to call generic functions from that module. Another member of the audience replied that there was no special functionality to support uses like that, but that it may be achievable in practice. Different kernel modules can define kfuncs with the same name and signature, so long as only one is loaded at any given time. A BPF program that communicated only through such kfuncs could potentially be used by multiple different kernel modules.

While these features are individually fairly small, they still represent an increasing amount of attention being paid in the BPF space to forward and backward compatibility. We will have to see whether this represents a change in the position that BPF remains an unstable kernel-to-kernel interface.

Comments (none posted)

LLVM improvements for BPF verification

By Daroc Alden
May 27, 2024

LSFMM+BPF

Alan Jowett gave a remote presentation at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit about what features could be added to LLVM to make writing BPF programs easier. While there is nothing specific to LLVM about BPF code (and the next session in the track was led by GCC developer José Marchesi about better support for that compiler), LLVM is currently the most common way to turn C code into BPF bytecode. That translation, however, runs into problems when the BPF verifier cannot understand the code LLVM's optimizations produce.

Jowett began by talking about how LLVM processes code internally. First, the C code is translated to LLVM intermediate representation (IR). Then, several passes of optimizations gradually turn the IR into a more efficient version. Finally, the code generator creates BPF bytecode corresponding to the IR. The problem with this process is that LLVM has had many years to develop sophisticated optimizations. It is not uncommon, Jowett said, for LLVM to produce code that is correct, but that the BPF verifier cannot understand — a problem LWN has covered before. For that reason, developers sometimes have to use inline BPF assembly to circumvent the optimizer in order to have their programs accepted.

Before opening up discussion of possible solutions to the problem, however, Jowett first covered some other things that would be nice to see from LLVM. One example is likely/unlikely branch hints that may be present at the source level, but which are lost by the time a program is translated to bytecode. Another possibility Jowett raised was support for code-coverage information, possibly as a prelude to supporting profile-guided optimization of BPF programs.

Jowett then presented a few rough ideas for how it might be possible to prevent LLVM optimization passes from breaking the verifiability of BPF code. One solution might be to move the verifier into the compiler, not as the authoritative source, but as a check to prevent optimization passes from making changes that the verifier cannot understand. Jowett did not propose using the kernel's BPF verifier, however — perhaps because of the licensing problems that would pose — but rather the PREVAIL verifier, an MIT-licensed verifier produced as an academic project that runs in polynomial time (as opposed to the kernel's exponential time).

Using a different verifier is not a perfect solution, however. Jowett pointed out that PREVAIL's design means that it will not verify programs with "correlated branches", where taking one branch always implies that another branch should be taken as well. This is a somewhat common pattern in BPF programs that conditionally acquire and release locks, for example. This pattern can also be introduced artificially by the LLVM optimizer when it tries to avoid repeating tests.

I asked Jowett how he planned to use the PREVAIL verifier inside LLVM's optimization passes, since the former operates on BPF bytecode and the latter operates on LLVM IR. Jowett acknowledged that it would be a problem. Marchesi noted that something like that might be possible in GCC, which supports undoing optimization passes — the compiler could run code generation repeatedly during optimization, and back out the results of any passes that made the generated code fail the verifier. Another audience member noted that they had code that had to run on many possible kernels with different verifiers, and that therefore including any one verifier was insufficient. Dave Thaler indicated that cross-platform BPF compatibility was something that he had a session about later in the day.

Jowett suggested some less intrusive alternatives, such as permitting more fine-grained control over which passes the optimizer runs, allowing developers to assert that some code compiles in a certain way, or just making the optimizer smarter, before opening the floor for suggestions. Yonghong Song, an LLVM developer, said that he had discussed allowing fine-grained control over the optimizer passes with the upstream project, and with the GCC developers at the 2023 Linux Plumbers Conference. In short, it would be hard. The compilers could add a flag to let the optimizer know it needs to do special verifier-friendly things, but that has not yet been implemented.

Jowett asked whether Song had any thoughts about code-coverage instrumentation, or whether perhaps BPF programs were too small to benefit. Song thought code coverage was unlikely to be extremely useful, though it might still have some value, and invited people interested in the idea to talk with him about it. An audience member suggested that perhaps BPF programs could get code-coverage information already by analyzing the verifier log — which records every instruction it analyzes. Jowett indicated that this was not sufficient, because it does not actually provide any information about whether a particular branch is covered at runtime. Another audience member indicated that they had written a tool that increments counters in a BPF map for each instruction executed, but that the tool had various limitations.

During the general discussion toward the end of the session, Marchesi asked whether Jowett's idea of preserving likely/unlikely branch hints would require changes to BPF bytecode. Jowett indicated that it would. Another participant noted that they were unsure whether a smarter JIT would be worth it, saying that a more complicated JIT was "really scary" from a memory-model perspective. The BPF JIT runs after the compiler's and verifier's safety checks, so a bug in the JIT is much more likely to break things than an optimization done in the compiler.

Since the group didn't appear to come to much of a consensus, it seems likely that this will remain a topic of discussion. Modern C compilers are, in some ways, a bad fit for BPF; the verifier cares about many properties of programs that have not been a concern for historical targets. Whether and how the BPF developers will be able to overcome this wrinkle remains to be seen.

Comments (6 posted)

Supporting BPF in GCC

By Daroc Alden
May 28, 2024

LSFMM+BPF

The GCC project has been working to support compiling to BPF for some time. José Marchesi and David Faust spoke in an extended session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit about how that work has been going, and what is left for GCC to be on-par with LLVM with regard to BPF support. They also related tentative plans for how GCC BPF support would be maintained in the future.

Marchesi started with a brief overview of some of the recent work in GCC. In December 2023, the project rewrote the BPF-generation code to not use GCC's venerable CGEN library, which generates code generators from a description of the CPU. Marchesi said that the hand-written implementation of BPF code generation is much better; CGEN is abstract and concise, but BPF is "so weird [...] that torturing CGEN into supporting it was challenging".

[José Marchesi]

GCC has also added support for BPF's pseudo-C syntax (a representation of BPF assembly that looks more like C than the traditional assembly syntax does), BPF v4 instructions, conversion of short jumps into long jumps where appropriate, and platform-specific flags. GCC now puts the version of the BPF CPU into the platform-specific flags of the ELF object it produces. The disassembler uses that information to show the correct version of instructions, and readelf displays that information when inspecting an ELF file. Marchesi asked whether LLVM recorded that information anywhere. LLVM developer Yonghong Song replied that it didn't. Marchesi asked whether Song objected to recording version information in this way; Song did not.

Marchesi continued the list of recent changes to GCC. He noted that bpf-helpers.h (a header file providing some macros for writing portable BPF programs) had been removed, since GCC now supports BPF's special three-underscore type suffixes. GCC also supports "compile once — run everywhere" (CO-RE), where the user-space BPF loader performs relocations on the program before loading it. CO-RE is now enabled by default.
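As a hedged illustration of the suffix convention (a practice from the libbpf CO-RE world; the type flavor and field below are invented): everything from the three underscores onward is stripped when the type is matched against the running kernel's BTF, so a program can carry several local "flavors" of one kernel structure.

    /* Matched against the kernel's "struct task_struct" despite the
     * suffix; the flavor describes a layout only some kernels have. */
    struct task_struct___with_foo {
        int hypothetical_foo;   /* invented field, for illustration */
    };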

That isn't the only change that brings GCC's output closer to LLVM's; GCC now produces BPF Type Format (BTF) debugging information by default (when debugging information is enabled). It also emits pseudo-C BPF code by default, which Marchesi said "I despise with my whole soul". A lot of inline assembly in BPF programs uses the pseudo-C syntax, however, so GCC has to support it.

Another place where Marchesi had questions for the assembled developers was around support for memmove(), memcpy(), and memset(). On most platforms, GCC generates a call to the library implementing the C language runtime. This isn't possible in BPF (which lacks run-time libraries), so currently GCC inlines the functions instead. Unfortunately, BPF also doesn't have unrestricted loops, so this only works when the loops inside the functions can be unrolled. That can produce quite a lot of code when operating on large structures, possibly much more than programmers expect.

GCC has a new option to emit an error if inlining these functions will produce code larger than a user-specified threshold, but Marchesi wanted to know how Clang handles this. Song indicated that Clang does the same inlining, but currently has a hard-coded limit for the size of the generated code. Marchesi suggested that if the Clang developers do ever wish to make it user-configurable, that they adopt the same name as GCC.
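The size concern is easy to visualize; the following sketch (an illustration of the behavior described, not a test case from either compiler) forces the BPF backend to expand a copy inline:

    struct big {
        char data[4096];
    };

    /* With no C runtime library to call into, the BPF backend must
     * expand this copy inline; the unrolled result can run to hundreds
     * of instructions for a structure this large. */
    void copy_big(struct big *dst, const struct big *src)
    {
        __builtin_memcpy(dst, src, sizeof(*dst));
    }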

David Vernet suggested that GCC could perhaps emit BPF loops using bounded iterators (which the Linux verifier understands, but which some other BPF implementations do not) on platforms that support it. Marchesi agreed that this was possible, saying that the compiler doesn't really care what the generated code looks like as long as it can be verified.

Marchesi then went on to say that GCC now defines the same BPF feature macros — based on whether a particular class of instructions is available on a given BPF CPU version — that LLVM does. He asked the room whether those feature macros were covered by the work being done to standardize BPF, saying that now that GCC implements them, they need to be documented in the GCC manual, but he was unsure if there was an authoritative source to refer to. Dave Thaler indicated that he would talk about that in his session, right after the current one. That session was about the recent efforts by the IETF working group to standardize the BPF ISA, including standardizing ways for BPF implementations to advertise different optional functionality to compliant compilers.
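As an illustration of how such macros are used (the macro name below follows LLVM's naming convention for the BPF v4 signed-division instructions, but it is an assumption here and worth checking against the compiler manuals):

    /* Report whether the compiler is targeting a BPF CPU version with
     * the v4 signed-division instructions; purely illustrative. */
    int have_sdiv(void)
    {
    #ifdef __BPF_FEATURE_SDIV_SMOD   /* name per LLVM convention; verify */
        return 1;
    #else
        return 0;
    #endif
    }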

Having covered many small compatibility features, Marchesi now arrived at "the exciting part" of the session. With all of this work combined, GCC now compiles 100% of the kernel's BPF self-tests, as well as the BPF components of several other projects such as systemd and DTrace. There are still 108 run-time failures in the kernel self-tests, but "it looks like GCC is actually generating code that can be verified".

This news was well received, and a member of the audience suggested it might now be appropriate to add GCC to the BPF continuous integration (CI) system to prevent regressions; Marchesi agreed, asking whether it was worth having test runners for different BPF CPU versions. The consensus seemed to be that it was not. Marchesi indicated that the next milestone for GCC would be to actually eliminate those run-time failures.

Marchesi indicated that the work he had been discussing was currently not in a GCC release, but that it would be incorporated into the next binutils release in June or July, and the GCC 14.2 release in August.

The future

At this point, Faust took over and the session turned to the future of BPF development in GCC. First on the agenda was improvements to inline assembly.

Inline assembly isn't as simple as just dropping some assembly code directly into the compiler's output. When information needs to pass between C and assembly, the programmer needs to indicate which registers correspond to which variables. Currently, GCC warns about using a variable that is shorter than the given register — even if the assembly code never actually touches the upper part of the register. To fix this, Faust proposed adding "w" and "R" register suffixes to indicate the lower 32 bits or full 64 bits of a register, respectively. These could also be used with immediate values, to indicate how large an immediate value GCC should use when assembling the output. Faust asked the audience what they thought of that design, and there were no particular objections.
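A minimal sketch of the friction being fixed, using BPF's pseudo-C inline-assembly syntax (illustrative only; the proposed "w" and "R" suffixes are not shown, since they remain a proposal):

    static int bump(int flag)
    {
        /* "flag" is 32 bits wide, but the "r" constraint binds it to a
         * 64-bit BPF register; today GCC warns about the size mismatch
         * even though this code never touches the register's upper half. */
        asm volatile ("%0 += 1" : "+r" (flag));
        return flag;
    }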

Other future work includes ensuring that, once the IETF standardization process actually produces a formalized memory model, GCC follows it, adding support for BPF's may-goto instruction, and pruning excess BTF debug information. BTF itself needs a few improvements; it currently doesn't work alongside link-time optimization. Binutils is also missing support for BTF, meaning that it doesn't show up in objdump output, nor is it deduplicated by the linker the way other debugging-information formats are.

Improving GCC's BTF support is no easy task, however. Internally, GCC treats DWARF as the canonical debugging-information format, and generates BTF from that, Faust explained. One audience member asked whether that was really the case — does GCC not have an internal representation for debugging information? Marchesi clarified that GCC actually uses a slightly tweaked version of DWARF that he called "internal DWARF", but that otherwise GCC really is limited to what can be represented in DWARF. Unfortunately, the upstream DWARF developers are pretty resistant to accepting new features found in BTF, such as type and declaration information. They believe that doing so would bloat the DWARF format, for no real gain.

Vernet noted that DWARF is already a fairly heavy format, so it's funny that size would be the basis of the objection. Marchesi elaborated that the way DWARF is designed makes it nearly impossible to extend without breaking backward compatibility, which means adding new features requires hacky workarounds that introduce extra bloat. Faust noted that implementing BTF type tags in the natural way would cause any DWARF reader that didn't know how to deal with them to be unable to parse the file.

Marchesi then turned the topic away from the new planned features, and toward the ongoing maintenance of BPF support. He began by stating that the GCC developers take producing verifiable programs seriously — a challenging prospect since both the BPF verifier and GCC are moving targets. Ideally, Marchesi said, the GCC developers would like to avoid users needing to maintain private forks, or to package one toolchain per kernel. In order to do that, GCC needs to adopt a maintenance process that works for BPF.

The GCC developers are considering introducing a special maintenance branch for BPF where bug fixes are applied, but only those that seem unlikely to interfere with producing verifiable programs. He emphasized that this is just an idea, not yet set in stone, and asked everyone else what they thought about the issue.

Alexei Starovoitov noted that the BPF CI already catches similar regressions in LLVM, usually with plenty of time to fix them before a release. He also said: "I don't think we've ever had a case" of a bug fix breaking existing BPF code's verifiability. Marchesi asked how often Clang is released, noting that GCC is released once a year. Song said that Clang has two releases per year. Starovoitov said that once GCC is covered by the CI, the GCC developers will have plenty of time to fix any issues. He thought dedicated maintenance branches sounded nice, but that they were probably not worth the effort.

Thaler said that there was a problem with compilers breaking BPF programs — with eBPF for Windows. Vernet replied that people who need to worry about that should be running Clang in their CI and catching it early. Thaler said that this doesn't fix the problem — if they upgrade the compiler and the tests fail, then they remain stuck on an older version.

Another audience member brought up a different use case for an out-of-tree compiler version: Android. They noted that Android user-space software (including BPF loaders and BPF programs) can remain on a device for years. This would not be as much of an issue if Android were not also working to update kernels more often. They said that this had required Android to stick with a specific Clang version, one requiring out-of-tree patches. Marchesi asked whether, having had that experience, they had suggestions for how to make the situation better. The audience member replied that they hadn't thought about it. Marchesi asked everyone to please let him know if they did come up with any clever solutions.

Marchesi then went through some last notes before time for the session ran out. He announced that the Godbolt Compiler Explorer now supports GCC BPF, and then briefly covered GCC's support for non-C languages compiling to the BPF backend. In short, it may work, but isn't supported. "In practice, every BPF program needs BTF", which isn't available for other languages.

Thaler noted that this wasn't really true: eBPF for Windows doesn't use "call by BTF-id", the core instruction that makes BTF mandatory and that the Linux kernel uses to call kfuncs. Vernet asked whether BTF could be extended to support other languages. Marchesi said that he had spoken to some Rust developers last year, and that they had said you would need to be careful to only use the subset of the type system that BTF can express. Vernet suggested that the IETF standardization process should probably consider non-C languages when it gets around to standardizing BTF.

At that point the session was threatening to put the BPF track, which had previously been running on time, behind schedule. After a few more quick questions from the audience, the session wrapped up.

Comments (1 posted)

A new swap abstraction layer for the kernel

By Jonathan Corbet
May 23, 2024

LSFMM+BPF
Swapping may be a memory-management technique at its core, but its implementation also involves the kernel's filesystem and storage layers. So it is not surprising that a session on the kernel's swap abstraction layer, led by Chris Li at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, was held jointly by all three of those tracks. Li has some ambitious ideas for an improved subsystem, but getting to a workable implementation may not be easy.

Li started by looking at the current swap state maintained by the kernel to get a sense for what needs to be kept in a new implementation. The key datum is the swap offset — the location in the swap file where any further information about a swapped-out page can be found. Any other information is optional within the kernel and is scattered across a number of places; that scattering is flexible, but can also be a source of pain, he said.

[Chris Li] The current swap design is memory-efficient, but complex. It could be improved at the cost of using more memory — getting worse in order to get better. David Hildenbrand said that all of the resources needed by the swap layer are preallocated, since trying to allocate memory when the system needs to swap is failure-prone. That preallocation is why minimizing the overhead is so important; if a way could be found to do less preallocation, overhead would be less of a concern. It would be nice to consume less memory when swap is not being used, but it is not good to have to allocate memory when swapping is necessary.

Li agreed that systems often do not swap; any preallocated memory is simply wasted in that case. On the other hand, high memory consumption by the swap layer also hurts when a lot of swapping is happening.

He proposed — initially — to add one byte to each swap entry; that would be used to hold some flags. The full swap map (used to track the usage of space in the swap device) would not be preallocated, but would be grown as needed. The problem with adding a single byte, though, is that it would turn a four-byte entry into five bytes, which would create alignment problems. So, instead, the entry should grow by four bytes, which would allow the addition of pointers. But, then, if eight bytes are added, more things become possible, including dynamic allocation of the swap-entry structure. Its size could vary, as has been proposed for memory descriptors. Compound swap entries could share this descriptor, which would, in the end, more than pay back the cost of those extra eight bytes.
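The alignment arithmetic can be checked with a standalone C snippet (ordinary user-space code, not the kernel's actual swap structures):

    #include <stdio.h>
    #include <stdint.h>

    struct entry4 { uint32_t offset; };                 /* today    */
    struct entry5 { uint32_t offset; uint8_t flags; };  /* +1 byte  */
    struct entry8 { uint32_t offset; uint32_t extra; }; /* +4 bytes */

    int main(void)
    {
        /* Prints "4 8 8": in an array, the one-byte addition is padded
         * out to eight bytes anyway, so growing by a useful four bytes
         * costs nothing extra. */
        printf("%zu %zu %zu\n", sizeof(struct entry4),
               sizeof(struct entry5), sizeof(struct entry8));
        return 0;
    }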

Support for directly swapping multi-size transparent huge pages (mTHPs) has been added to the mm-unstable tree, he said. Swapping 64KB mTHPs to zram devices significantly improves the compression ratio and saves nearly two-thirds of the CPU time needed when swapping single pages. But, as usual, there is a cost, in the form of increasing fragmentation in the back end. As time passes, the ability to allocate mTHP-sized chunks degrades, to the point that it becomes unusable after five hours, even with less than half of swap space in use.

The problem lies in how swap clusters are handled, he said. The cluster size is set equal to the full THP size (typically 2MB). Any single-page allocation will be taken from the first cluster on the per-CPU list, leaving a partially empty cluster that can no longer be used to swap mTHP-sized chunks, even though those chunks are smaller than the full THP size; he is not sure why that is. Clearly there is a need for a better allocator. In the short term, his plan is to make note of the half-empty swap clusters and allocate mTHP-sized chunks from them. The longer-term plan is to create a buddy allocator for swap entries.

But, he said, a better allocator is not enough. Since the swap layer does not control the lifecycle of swap entries, fragmentation can still happen. A malicious user could selectively free memory, leading to a situation where a lot of swap space is available, but none of it can be allocated. The solution to this problem is non-contiguous swap entries, managed by way of a compound swap structure. The head entry would contain the order of the structure, which would suffice for the simple case. The more complex case would be handled by dropping the alignment requirement for swap space, and allowing it to not be contiguous.
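No patches exist yet, but the idea might be sketched like this, with every name invented for illustration:

    /* Hypothetical: one descriptor shared by all entries in a compound
     * swap allocation. */
    struct compound_swap {
        unsigned char order;     /* simple case: 2^order slots         */
        unsigned char flags;
        /* complex case: the slots need not be contiguous or aligned,
         * so each one is recorded individually. */
        unsigned long slots[];
    };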

Li noted that this would be an invasive change. Matthew Wilcox agreed, warning Li that he was setting himself up for "a world of pain". This plan is, Wilcox said, a reinvention of the filesystem, and the tragic results of a memory-management developer trying to design filesystems are well known. He suggested that Li find a filesystem developer to work with if it is truly necessary to follow this path.

Jan Kara said that existing filesystem designs are not suited to this task, since they are not written with the goal of minimizing memory overhead. But, he said, managing that kind of complexity will have its cost. He suggested that an easier solution might be to set a minimum size for swapped-out data as a way of reducing fragmentation. Large ranges of anonymous memory tend not to be used, he said, so it should be possible to swap it out in bigger chunks, reducing both overhead and fragmentation.

At the end of the session, Hildenbrand said that this plan was introducing too much complexity. Instead, he said, the swap-in and swap-out granularity should be decoupled from each other. If swap-space fragmentation is an issue, folios should just be split prior to swapping out. Folios could be reassembled at swap-in time. Li answered that his current design allows for partial swap-in; it is not necessary to bring in an entire folio.

The next step, as always, will be to wait for patches to show up implementing some of these ideas.

Comments (none posted)

Large-folio support for shmem and tmpfs

By Jonathan Corbet
May 24, 2024

LSFMM+BPF
The kernel contains a pair of related filesystems that, among other things, can be used for shared-memory applications; shmem is an internal mechanism used within the kernel, while the tmpfs filesystem is mounted and accessible from user space. As is the case elsewhere in the kernel, these subsystems would benefit from the addition of large-folio support. During a joint storage, filesystem, and memory-management session at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Daniel Gomez talked about the work he is doing to add that support.

Gomez started by saying that he had posted a patch series for shmem and tmpfs. It will cause a large folio to be allocated in response to a sufficiently large write() or fallocate() call; variable sizes, up to the PMD size (2MB on x86), are supported. The patch implements block-level up-to-date tracking, which is needed to make the SEEK_DATA and SEEK_HOLE lseek() options work properly. Baolin Wang has also posted a patch set adding multi-size transparent huge page (mTHP) support to shmem.

[Daniel Gomez] David Hildenbrand said that the biggest challenge in this work may be that many systems are configured to run without swap space. The shmem subsystem works in a weird space that is sometimes like anonymous memory, and sometimes like the page cache; that can lead to situations where the system is unable to reclaim memory. Using large folios in shmem, he said, could lead to the kernel wasting its scarce huge pages in mappings where they will not actually be used.

Returning to his presentation, Gomez said that his current work only applies to the write() and fallocate() paths. But there is also a need to update the read() path. That can be managed by allocating huge pages depending on the size of the read request, but it is also worth considering whether readahead should be taken into account here. Then, there is the swap path; large folios are not currently enabled there, so they will be split if targeted by reclaim. With better up-to-date tracking, though, the swap path can perhaps be improved as well. Finally, he is also looking at the splice() path; currently, if a large folio is fed to splice(), it will be split into base pages.

When making significant changes to a heavily used subsystem like this, one needs to be worried about creating regressions. Gomez said that he has a set of machines running kdevops tests, and the 0day robot has been testing his work as well. He is not sure what performance testing is being run; he did say that tmpfs is being outperformed by the XFS filesystem, and large-folio support makes the problem worse. The cause is currently a mystery. Hildenbrand said that, if the use of large folios is causing the memory-management subsystem to perform compaction, that could kill any performance benefit that would otherwise accrue.

Gomez concluded by saying that, in the future, he plans to work on extending the swap code to handle large folios. He needs better ways to stress the swap path, and would appreciate hearing from anybody who can suggest good tests.

Comments (none posted)

The twilight of the version-1 memory controller

By Jonathan Corbet
May 23, 2024

LSFMM+BPF
Almost immediately after the merging of control groups, kernel developers set their sights on reimplementing them properly. The second version of the control-group API started trickling into the kernel around the 3.16 release in 2014 and users have long since been encouraged to migrate, but support for (and users of) the initial API remain. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, memory-management developers discussed whether (and when) it might be possible to remove the version-1 memory controller. The session was led by Shakeel Butt and (participating remotely) Roman Gushchin.

Deprecation process

The first step toward an eventual deprecation is to move the version-1 code into a separate file with its own configuration option. That option would also control the presence of some internal structure fields. Michal Hocko immediately suggested making the old version disabled by default; if it remains enabled, he said, the community will never manage to get rid of it. There are, he said, two classes of users for the old interface: intentional users who have a reason to stick with it, and accidental users who are unaware that a better interface exists. Disabling the old interface will motivate the second group to migrate away from it.

[Shakeel Butt] Most distributions, Butt continued, are using systemd these days, so they could easily handle a deprecation of the old interface. Hocko cautioned, though, that there are a lot of containers out there stuck on version 1 that nobody has ever bothered to fix. David Hildenbrand said that there was no need to worry about distributions; they will enable this option if it is needed. Gushchin said that, if the option is disabled by default, kernel developers will have to pay special attention to avoid breaking it while making changes elsewhere.

The proposed deprecation process, once the code separation is done, involves adding a warning to be emitted when the old interface is used; that code would be backported to the stable kernels as well. The next step is "wait a while", defined as two or three long-term-support cycles (referring to the long-term stable kernels, which have a one-year cycle). After that, the interface would remain, but it would not actually do anything; the code behind it would be removed.

Hocko worried that, no matter how long the warning period is, it would not be enough. He also said that turning the interface into a no-op is a risky approach that could cause systems to fail silently. Rather than do that, he suggested just setting the version 1 code aside and letting it slowly decay. David Rientjes said that this is the real point: how should features be deprecated? It is necessary to make users take some manual action to continue using the deprecated feature, or it will never go away; he suggested adding a sysctl knob to enable the old interface.

A participant pointed out that there are still features provided by the old interface that are unavailable in version 2. At the top of the list was combined accounting of memory and swap usage; without that, he said, applications simply cannot know how much swap space they need. Gushchin said that there are a number of version-1 features that are no longer used, and that not all features are equally important. The best approach might be to deprecate only those features; that would reduce the pressure to get rid of the rest.

Features specific to version 1

That led to a discussion of specific version-1 features, most of which are tersely described in this document. The first on the chopping block was the move_charge_at_immigrate knob, which controls how accounting for memory is done. A deprecation warning was added in the 6.3 kernel and backported to older ones; should it be turned into a no-op this year? Hocko again wondered if that was the right approach, saying that it might be better to simply fail if somebody tries to use that knob. One way or the other, it was agreed that this feature makes maintenance of the memory controller harder and should be removed.

Then, there is TCP memory accounting, which is controlled by four knobs. This is a separate, opt-in accounting feature. Butt made the claim that nobody is using TCP memory accounting, that its performance is terrible, and that the version-2 implementation is far better. Nobody disagreed with that assessment. This set of knobs should be relatively easy to remove; the group agreed to start the deprecation process for them.

Next came soft limits, which are controlled by the soft_limit_in_bytes knob. The old version is broken (and disabled) for systems using realtime preemption. The version-2 API has better-defined semantics, and provides both best-effort and hard protections. Nobody objected to the removal of this feature either.

The failcnt interfaces can be read to see how many times a given control group has run into its limits; they are not exposed in the version-2 interface, and it is not clear that anybody is using them. It would be easy to add failcnt to version 2, but there should be a use case defined first. Hocko said that this feature is not useful, but it is almost free to support and not worth the trouble to remove.

There are a number of notification variables (including usage_in_bytes and oom_control) that notify a registered user when usage goes above a given threshold. They are disabled for realtime, and are not useful for driving the behavior of a process since notification happens before reclaim. But, evidently, Google uses them internally for job control and exposes them to workloads there. This functionality could be had with BPF, but applications would have to explicitly migrate over to that approach.

The oom_control knob also allows disabling the out-of-memory (OOM) killer and reading its status. Its presence enables the creation of user-space OOM killers. The new API provides some of this functionality via the memory.events knob, but does not give a way to disable the OOM killer. The version 2 memory.high knob (documented here) can be used to similar effect, though perhaps less reliably. Johannes Weiner said that Meta is using it that way, and it works; evidently Android also uses memory.high for this purpose.

Hocko said that the oom_control knob has been broken for years. It only controls OOM handling in the page-fault path. It is not a big deal to support, he said, since the overhead is small, but nothing like it should be provided in the version 2 API. There is a need for better control over the OOM killer, he said; perhaps that could be provided as a hook for a BPF program. That approach would allow controlling the OOM killer globally as well.

The next version-1 feature considered was memory-pressure notifications. This feature is not reliable; it assumes that there is reclaimable memory, which might not be the case. Unfortunately, the network-memory-pressure notification has leaked into the version 2 interface. The pressure-stall information API is sufficient for most use cases, but there does need to be an alternative for network-memory pressure in particular.

Toward the end of the session, attention turned to the combined accounting of memory and swap usage. This feature has been an area of concern for some time; it was discussed at the 2018 summit. Google is still using this feature, though, and there does not seem to be a way to create a good replacement. Hocko said that he hoped Google would eventually move to the version 2 interface for the "great features" it provides, and will find a way to move to a newer swap model in the process. There was a suggestion for a "-google" mount option for the cgroupfs filesystem to make this feature appear in version 2, but Hocko said that would cause it to never go away.

The final knob discussed was swappiness, which controls the relative attention paid by the reclaim mechanism to anonymous and file-backed pages. Hocko said that users complain that the knob doesn't work; it can be changed, but the changes do not propagate through the control-group hierarchy, creating confusion. He would rather not see that confusion repeated in the version 2 interface. Weiner disagreed, though, saying that it is possible to define good hierarchical semantics for swappiness. Before proceeding, though, it will be necessary to define the use cases for this knob.

Comments (1 posted)

Allocator optimizations for transparent huge pages

By Jonathan Corbet
May 24, 2024

LSFMM+BPF
The original Linux kernel, posted in 1991, ran on a system with a 4KB page size. Over 30 years later, most of us are still running on systems with 4KB pages, even though the amount of installed memory has grown by a few orders of magnitude. It is generally accepted that using large page sizes results in better performance for most applications, but allocating larger pages is often difficult. During a memory-management session at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Yu Zhao presented his ideas on improving the allocation of huge pages in the kernel.

It is worth noting that this session was focused on a patch set that was examined here in March. Zhao did not go deeply into the details of how his improved allocator works in the session; reading that article now could provide some useful background.

[Yu Zhao] Zhao started by saying that "some CPU vendor" is planning to drop 4KB pages entirely within the next decade. MacOS on Arm systems uses 16KB pages now, and Google is experimenting with 16KB pages on Android. He made the proposition that 4KB pages are suboptimal for modern user space, but the problem remains that some architectures do not support any other size. Additionally, changing the base-page size is an ABI break that can cause problems for some applications.

Thus, he said, "a forward-looking operating system would offer the opportunity to favor larger logical pages". That system would treat 4KB pages as a legacy feature, but would not require a larger base-page size or break existing ABIs. Favoring huge pages over 4KB pages, he said, brings better performance and lower metadata overhead; that will be even more true once the plan to switch to memory descriptors becomes reality.

The problem is that the ability to allocate 4KB pages fragments system memory; defragmentation imposes a cost, and may be impossible. That results in an economy where 4KB pages are cheap, and huge pages are expensive. The cornerstone of his THP allocation optimization (TAO) proposal is turning that situation around, making huge pages cheap, and 4KB pages expensive.

The ability to assemble huge pages depends partly on the ability to move small pages out of the way. The kernel provides allocation-time hints like __GFP_MOVABLE now so that allocations that can (hopefully) be moved are located together. Unmovable allocations are a problem, though; they block assembly of huge pages, and their lifetime is not predictable. There is a research project at Google (called "Tetris") that is aimed at determining that lifetime, using statistical sampling and estimation, with the goal of grouping unmovable allocations by lifetime.

Low-priority tasks, Zhao said, can fragment memory, impacting the performance of higher-priority tasks. It would be nice to be able to isolate those low-priority tasks, but that needs support from the memory controller and, perhaps, cooperation from user space. But another key component (and a key part of the TAO patches) is memory partitioning. Fragmentation can be irreversible, he said, so it is best to avoid it by isolating the smaller allocations in a separate memory partition. A well-chosen partitioning scheme, he said, can readily provide huge pages while applying a higher level of memory pressure to applications that are making a lot of small allocations.

Shakeel Butt asked whether the zone for 4KB allocations would be limited to movable allocations or not. Zhao replied that it depends on the fallback order that is chosen. If, as he suggests, the kernel attempts to allocate compound (huge) pages before falling back to 4KB pages, then there can be unmovable objects in the 4KB zone.

Setting up partitions raises the issue of sizing. Zhao's proposal sets global minimum and maximum limits on the size of the huge-page partition, but that is only part of the problem. Low-priority tasks could still hog the huge pages, so there will have to be a limit, enforced by a control group, on use of the huge-page partition. It will be possible to resize the partitions based on the workload, but that requires memory hotplugging. Shrinking the huge-page partition should be guaranteed, since those allocations are all movable; moving in the other direction would be a best-effort affair.

A participant asked where the line would be drawn between good (large) and bad (small) allocations. Zhao answered that it depends on the system. For many, it would be the CPU's contiguous-PTE size (often 16KB or 64KB); on servers it would be the PMD size, which (at typically 2MB) is rather larger. There was some inconclusive discussion on what the best size to use might be.

Zhao continued, saying that automatic resizing of the partitions will be needed, based on their relative memory pressure. The 4KB partition would be allowed to have a higher pressure as a way of fighting fragmentation. He suggested that memory pressure in the 4KB partition could invoke the out-of-memory (OOM) killer, even if the huge-page partition is not having problems. There are a number of platforms that use OOM kills as part of their ordinary operation; Android, ChromeOS, and cloud providers (to manage batch jobs) are all examples, so bringing in the OOM killer is not necessarily a bad thing. The alternative, he said, would be to watch the huge-page partition fade away due to fragmentation over time.

Zhao presented some plots showing that systems running with the TAO patches benefit from improvements in both huge-page allocation rates and web-browser responsiveness.

David Hildenbrand asked whether the partition resizing could be done using the memory-management subsystem's page-block abstraction rather than hotplugging; Vlastimil Babka replied that page blocks do not have separate free lists, so they cannot be used to direct allocations in the same way. Hildenbrand suggested that perhaps extending page blocks might be the right approach; on big systems, he said, nobody is able to cope with the complexity of hotplugging. He would not be able to convince RHEL users to use the TAO feature. Configuring phones, which run a single workload, is easy; servers are rather harder.

Johannes Weiner pointed out that he had posted a patch set for reliable huge-page allocation last year. Reviewers asked him to split the work apart; some of it is staged to go into the 6.10 release. He was able to get a success rate of 99% for 2MB huge-page allocations; that is good enough, he said. Larger allocations are only of interest to a small group of users.

Zhao concluded the session by speaking briefly about the longer-term goals of his work. They include using TAO to provide huge pages to back hugetlbfs, and the ability to reliably allocate 1GB huge pages.

Comments (9 posted)

Two talks on multi-size transparent huge page performance

By Jonathan Corbet
May 25, 2024

LSFMM+BPF
Using huge pages has been known for years to improve the performance of many workloads. But traditional huge pages, often sized by the CPU at 2MB, can be difficult to allocate and can waste memory due to internal fragmentation. Driven by both the folio transition and hardware improvements, attention to smaller, multi-size transparent huge pages (mTHPs) has been on the rise. In two memory-management-track sessions at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, developers discussed the kernel's ability to reliably allocate mTHPs and the performance gains that result.

Reliable mTHP allocation

The first session was presented remotely by Barry Song, who has been working at Oppo to improve the availability of mTHPs on Android devices. Large-folio support has been deployed on millions of these devices, he said, but the chances of being able to allocate a large folio drop quickly as memory fragments. After one hour of operation, mTHP allocation attempts succeed about 50% of the time, which is acceptable. After two hours, though, the failure rate exceeds 90%; memory is completely fragmented, and mTHPs are simply no longer available.

Song ran some experiments with the TAO patches (which were discussed in the previous session) applied. The mTHP size was set to order 4 (64KB), and 15% of physical memory was configured for mTHP-only allocations. On that system, the success rate for mTHP allocations remained stable at over 50%. Clearly there is potential here, but Song has tried to push the work further.

[Allocator diagram] Specifically, he has implemented a system using two independent least-recently-used (LRU) lists, one for base pages and one for large folios. There is a kernel thread dedicated to balancing the aging between those two lists so that both types of pages remain available. Reclaiming large folios as large folios is important, he said; otherwise the system can reclaim large numbers of smaller allocations and still never get to the point where it can assemble a large folio. A logical diagram of this allocator can be seen on the right.

A key part of this design, he said, is the ability to keep a pool of large folios in a special page block. When they are not needed elsewhere in the system, these folios can be lent out to drivers; the dma-buf and zsmalloc subsystems can benefit from such loans. This system also uses dual zram devices so that large and small folios can be swapped independently.

There was some inconclusive discussion at the end of the session; one gets the sense that most developers are waiting to see the patches implementing this solution.

Benchmarking mTHP performance

Work on increasing the reliability of mTHP allocation is based on the idea that mTHPs improve performance. As always, though, it is best to put such notions to the test rather than simply assuming them. In the following session, Yang Shi discussed some benchmarking work he has done on 64-bit Arm systems.

This work was not done on a mobile device; he used an Ampere Altra server with 80 CPU cores. The tests were run on a 6.9-rc kernel, with contiguous-PTE support (a hardware feature that allows an entire mTHP to be represented by a single translation lookaside buffer (TLB) entry) enabled. The system ran with a range of base-page sizes, and huge pages were otherwise disabled. The benchmarks run used Memcached, Redis, kernel builds, MySQL, and other workloads.

[Yang Shi] With Memcached, using mTHPs resulted in an improvement of about 20% in the number of operations completed per second, along with a 10-30% decrease in latency, but only for larger base-page sizes. That caused Jason Gunthorpe to question the numbers; he wondered why running with 64KB mTHPs on a 4KB base-page size showed no performance benefit. Shi's answer was that the extra overhead of maintaining the page tables at a 4KB page size overwhelmed any benefit otherwise obtained.

The kernel-compilation numbers were similar, but the 64KB/4KB case showed a 5% performance benefit, which Shi attributed to a reduction in page faults. Again, though, there were concerns in the room about the numbers, which did not make sense to everybody.

Shi pressed through to his conclusions: he suggested that memory allocations should start by attempting to get the largest possible mTHP size; if that fails, the allocator should just fall back immediately to the base-page size. The performance benefits from allocating at the intermediate sizes, he said, do not justify the additional work. He also suggested increasing the transparency of huge pages so that more applications can make use of them without any special work. There is no need for special knobs to let applications specify the allocation sizes they need, he concluded.

Gunthorpe disagreed, saying that the hugetlbfs mechanism works because applications are aware and can obtain the sizes that they need. Control over allocation sizes has been exposed to user space for a long time; applications have used it and shown that it is necessary. He mentioned an unnamed "certain application" that needs 2MB huge pages; nothing else works well. There is no reason to take away the ability to request pages of that size.

Shi answered that hugetlbfs is a special feature, while the use of mTHPs is meant to be transparent. But David Hildenbrand said that the kernel is not yet at the point where mTHPs can be used automatically. The existing transparent huge page feature has always been opt-in for a reason: memory waste from internal fragmentation is a real problem. Things work better if applications can give hints for what they need.

Johannes Weiner agreed, saying that his group (at Meta) had enabled 2MB huge pages for servers, but then immediately disabled them again. Huge pages can be good for performance, but they can't be used everywhere. Hildenbrand added that, someday, there will be an option to automatically enable mTHPs, but that will not happen anytime soon. And, at that point, the session came to a close.

Comments (none posted)

The next steps for the maple tree

By Jonathan Corbet
May 27, 2024

LSFMM+BPF
The maple tree data structure was added during the 6.1 development cycle; since then, it has taken its place at the core of the kernel's memory-management subsystem. Unsurprisingly, work on maple trees is not yet done. Maple-tree maintainer Liam Howlett ran a session in the memory-management track of the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit to discuss the current state of the maple tree and which features can be expected next.

Howlett has a backlog of requested features that seems likely to keep him busy for some time. Some of them are internal to the data structure itself:

  • There is a desire for a fast way to get a count of the number of null entries within a node.
  • "Dense nodes", which contain more pointers per node.
  • The removal of "big nodes", which are a special structure used when nodes are rebalanced or split. Among other things, removing them will help to improve support for singleton ranges — process IDs, for example.
  • Finally, he plans to implement index compression.

With regard to externally visible features:

  • The ability to search marks and tags is at the top of the to-do list. That would allow searching a tree for entries with, for example, the "dirty" bit set.
  • The ability to prune trees under memory pressure would help the system overall; it could be used with the cache that holds shadow page-table entries for evicted pages.
  • Filesystem users would benefit from 64-bit indices on 32-bit architectures.
  • A contiguous iterator that would iterate over a range only as long as there are no gaps.
  • "Big dense nodes" were described as a large list that could hold up to 4K singleton items.

Overall, he said, he is trying to get maple trees to the point that they can match the features provided by XArrays. The maple tree should be able to do the same things with better performance; once the features are there, it should be possible to implement the XArray interface and switch users without anybody else having to even be aware of it.

Howlett said that the maple tree is getting more users, and he is seeing some common errors when code is converted over. It is possible to use external locks to serialize access to a maple tree, he said, and some users do it, but it is better avoided if possible. He cautioned that anybody using read-copy-update (RCU) read locks should be aware that the lock protects a maple-tree node from being freed, but not necessarily the data contained within that node.

Users of the generic storage API were encouraged to wrap it with a typed interface so that the compiler can catch mistakes. Developers converting from an XArray are often surprised when mas_next() fails to return the first entry; its job is to get the next entry. To start at the beginning, mas_find() should be used instead.
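A minimal sketch of the recommended pattern (the calls below reflect the maple-tree API as described in the session; the details are worth checking against the in-kernel documentation):

    #include <linux/maple_tree.h>

    /* Walk every entry in a tree from the beginning. */
    static void walk_all(struct maple_tree *mt)
    {
        MA_STATE(mas, mt, 0, 0);
        void *entry;

        rcu_read_lock();
        /* mas_find() returns the entry at the current position or
         * beyond; mas_next() would skip past the first entry. */
        for (entry = mas_find(&mas, ULONG_MAX); entry;
             entry = mas_find(&mas, ULONG_MAX)) {
            /* The RCU lock protects the node, not necessarily the
             * data it points to. */
        }
        rcu_read_unlock();
    }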

In general, he said, he is working toward the addition of a type-safe interface and moving away from void * pointers. Eventually there will be a DEFINE_MAPLE_TREE() macro that creates a tree handling objects of a given type.

As usage has grown, the maple tree structure has encountered a number of challenges. Tracking of virtual memory areas (VMAs) is one of those; he is trying to find ways to remove some of the complexity associated with special VMAs. One example is guard VMAs, which define a short range of no-access address space to catch overruns. If guard VMAs are in use, the total number of VMAs in the tree is doubled, which is expensive, but those guard VMAs are never really used. So Howlett is trying to find a way to mark guard regions directly in the maple tree and avoid allocating so many extra structures.

Maple trees should eventually implement upper and lower limits, he said; that would be useful, for example, to implement restrictions on mapping the page at virtual address zero. Currently a maple tree will show gaps in areas that are not actually available for allocation. There are also some challenges in representing the vDSO area.

There were a few comments once Howlett finished. David Hildenbrand said that the kernel contains a lot of checks for gate VMAs, which are a special VMA used to represent the virtual-system-call page; it would be nice to find a way to represent them in the maple tree and remove those checks. Suren Baghdasaryan said that guard VMAs are one of the biggest allocation slabs on Android systems, so removing them would be a welcome optimization. The session wound down with a bit of discussion on the best way to identify guard VMAs within the kernel.

Comments (6 posted)

Fleshing out memory descriptors

By Jonathan Corbet
May 27, 2024

LSFMM+BPF
One of the long-term goals of the folio conversion in the kernel's memory-management subsystem is the replacement of the page structure, which describes a page of physical memory, with an eight-byte "memory descriptor". This change would reduce the overhead of tracking physical memory, increase type safety, and make memory management more flexible. Thus far, though, details on what the memory-descriptor future will look like have been relatively scarce. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Matthew Wilcox led a discussion to try to fill in the picture somewhat.

Wilcox started by saying that he has been thinking about what will happen once the folio conversion is done. The ultimate goal, he said, looks like this:

    struct page {
        u64 memdesc;
    };

The lowest four bits would be a type field saying what kind of descriptor it is; the rest would (usually) be a pointer to a type-specific structure. David Hildenbrand immediately said that what is really needed is a type hierarchy; some types have subtypes, and the kernel will surely exceed the 16 types that can be represented in those four bits at some point; 11 types have already been defined. Wilcox disagreed, noting that no new types had been added for some time and questioning whether the kernel would ever run out. I remarked that I was documenting that claim for posterity, to general laughter.

Descriptor type zero, he said, would be a special type indicating "miscellaneous memory with no further data". It would, as it turns out, have a number of subtypes. Pages falling under this type could include those in the vmalloc range, guard pages, offline pages, and others. Bit 11 of the descriptor would be set if the page can be mapped to user space, bits 12-17 would contain the page order, and the higher bits could contain information about which node and zone contain the page.
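Expressed as (entirely hypothetical) code, decoding a descriptor laid out that way might look like the following; all of the macro names are invented to illustrate the bit layout Wilcox described:

    #define MEMDESC_TYPE(d)      ((d) & 0xf)               /* bits 0-3   */
    #define MEMDESC_PTR(d)       ((void *)((d) & ~0xfULL)) /* usual case */

    /* For type zero, the descriptor carries data rather than a pointer: */
    #define MEMDESC_MAPPABLE(d)  (((d) >> 11) & 0x1)   /* bit 11         */
    #define MEMDESC_ORDER(d)     (((d) >> 12) & 0x3f)  /* bits 12-17     */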

There was a brief discussion of how memory descriptors would be allocated; Wilcox envisioned an interface like:

    struct page *page = alloc_page_desc(MEMDESC_TYPE_FOLIO);

Jason Gunthorpe remarked that he would like to see more details on what the state transitions for memory descriptors will be.

Wilcox moved on to discussing pages owned by the buddy allocator; they would have a descriptor that looks like:

    struct buddy {
        unsigned long prev;
        unsigned long next;
    };

That design reduces the size of the descriptor to two 64-bit integers, which is "a step in the right direction". That information would be enough to support basic allocator operations like insertion, removal, and merging. The amount of space needed for the descriptor could be reduced by storing page-frame numbers rather than addresses. Given a willingness to limit installed memory to 2TB, the descriptor could be condensed down to eight bytes. The only problem with that idea is that systems with more than 2TB of installed memory are on the market now.

This descriptor could be reduced further by making it contain page-frame numbers relative to the base of the zone containing the pages. At that point, each memory zone could contain 2TB of memory; with enough zones, much larger total memory sizes could be handled. Wilcox thought that this solution might come at the cost of having to add more memory zones (generally seen as undesirable), but Vlastimil Babka pointed out that large-memory machines use a NUMA architecture, so the memory is already divided into multiple zones.
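Put together, the condensed form under discussion might look something like this sketch (the field names are invented):

    /* Eight bytes per page: two 32-bit page-frame numbers, interpreted
     * relative to the base of the zone containing the page. */
    struct buddy_condensed {
        u32 prev_pfn;
        u32 next_pfn;
    };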

A 30-minute slot is clearly not enough to design the descriptor-based future, so it is not surprising that this discussion did not get much further. Wilcox brought it to a conclusion by saying that his goal for this year is to get rid of the mapping and index fields of struct page; that will require some work to fix the existing users in the kernel. Then the work of splitting the various users of page structures into specific descriptor types can proceed. Once approximately half of users have been converted, he will submit a patch to shrink the page structure; it "should all just work", he said. That will lead to the next important phase of this transition: seeing where the performance regressions are; he admitted that he does not know how that will work out.

(Wilcox has also put together a few wiki pages on the memory-descriptor design).

Comments (9 posted)

The state of the memory-management community in 2024

By Jonathan Corbet
May 28, 2024

LSFMM+BPF
A longstanding tradition in the memory-management track of the Linux Storage, Filesystem, Memory-Management and BPF Summit is a session with maintainer Andrew Morton to discuss the overall state of the community and the development process. The 2024 gathering upheld that tradition toward the end of the final day of the event. It seems that Morton and the assembled developers were all happy with how memory-management work is going, but there is always room for improvement.

In 2022, Morton had described his (then) new plans for a Git-based patch-management scheme, involving separate mm-unstable and mm-stable trees for work in different stages of development. He started the 2024 session by saying that he had hoped that patches would move relatively quickly from the mm-unstable tree into mm-stable, but in reality it tends to take rather longer. As a result, his hopes that developers would be able to work against mm-stable have not really worked out. He is not sure how to make that aspect of the process work better.

[Andrew Morton] A recurring problem, he said, comes about when he is holding onto version 3 of a patch set, and the discussion has made it clear that a fourth revision is needed. In such cases, he asked, should he keep version 3 while waiting, or simply drop it? There is benefit in keeping patches in mm-unstable, even if they will eventually be superseded, because the higher visibility means that new problems and potential improvements continue to turn up. He asked for feedback, but the room was uncharacteristically silent in response.

He observed that he does get a bit tired of asking developers to describe the user-visible effects of the bugs they are fixing (here's an example that came by as this article was being written). He said that he always tries to make the request a little different but, in the end, asking that question has become what he does for a career.

There is, he noted, a new CVE process for the kernel. He wondered who was evaluating potential CVE assignments for memory-management patches, and said that he would like for the memory-management community to help where they can. Security implications, in the end, are user-visible effects of bugs as well. Michal Hocko said that, if developers try to think about this aspect a bit more, they can start noting in their changelogs whether patches are security relevant. But, he said, most of the people in the room are not security experts; the obvious security issues can be called out, but developers often do not know when they are fixing a security bug.

Brendan Jackman said that whether a given problem can be triggered from user space is valuable information to include in a changelog. Liam Howlett said that he had once fixed a use-after-free bug that, at the time, seemed benign, but it turned out that it could be exploited in combination with another bug. He had thought it couldn't be triggered from user space, but he was wrong and the result was a severe vulnerability. Jason Gunthorpe said that trying to score the security impact of bugs has legal implications, and he does not want to go anywhere near that. Saying that a bug is not exploitable is a scary claim to make, he added. Vlastimil Babka said that, even if a changelog notes that a bug can only be triggered by privileged users, that bug will still have a CVE number assigned to it. Jackman said that he would still like to know if capabilities are required to trigger a bug, though.

Changing subjects, Hocko asked developers to refrain from sending new versions of their patches before the discussion on the previous version has completed. He also said that, once a patch lands in Morton's tree, the potential for significant changes drops. It is better, he said, to keep work out of that tree until there is some consensus that it is good. David Hildenbrand said that Morton will often pull in work just to see if it will compile; maybe there is a need for a separate "stabilizing" tree that is not fed into linux-next. The linux-next kernel, he said, still blows up too easily. Morton said maybe he should add an "mm-stupid" tree.

Howlett said that he is being copied on patches far less frequently than before, and is missing work that he should have seen; he wondered if something has changed somewhere. Hocko suggested ensuring that the MAINTAINERS file is up-to-date. Morton said that he spends a lot of time ensuring that the right people see patches. Gunthorpe took a moment to suggest that more developers should be using the lei tool. It can be used to subscribe to a file and will collect all of the patches that touch that file; there is no need to change the MAINTAINERS file to see them. It is even possible to subscribe to specific functions.

Matthew Wilcox pointed out that there is a group of Rust developers trying to get their work in; that work includes things like wrappers for the page structure. What was the plan for getting that work into the mainline? Morton protested "but it's all in Rust!" More seriously, he said that this work should go upstream via the Rust tree.

Wilcox also complained about a series of header-splitting patches that had been posted recently. The author, he said, is exclusively focused on reducing compilation time and does not understand the memory-management subsystem. These changes will make maintenance harder, and he would like to reject them. Morton said that he looked at some of those patches when they went by, saw that they were "broken", and has not looked at them since.

The final topic was briefly raised by Wilcox, who noted that a number of kernel subsystems have moved to a group-maintainership model; he wondered if memory management needed to do that too. He was unsure that such a change would actually fix any problems. Morton said that, for all practical purposes, the subsystem is group-maintained now, even if he solely maintains the tree that eventually goes upstream. Wilcox closed the session by saying that he was happy with the process overall and thanking Morton for doing a great job.

Comments (13 posted)

Measuring memory fragmentation

By Jonathan Corbet
May 28, 2024

LSFMM+BPF
In the final session in the memory-management track of the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, the exhausted group of developers looked one more time at the use of huge pages and the associated problem of memory fragmentation. At its worst, this problem can make huge pages harder (and more expensive) to allocate. Luis Chamberlain, who ran the session, felt that people were worried about this problem, but that there was little data on how severe it truly is.

Transparent huge pages, he said, never reached wide adoption, partly as the result of fragmentation fears. But now, the kernel supports large folios, and transparent huge pages are "the stone age". Large folios are being used in a number of places, and multi-size transparent huge pages (mTHPs) are on the rise as well — and "the world hasn't ended". Still, worries abound, so he wondered how the fragmentation problem could actually be measured.
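
The mTHP feature Chamberlain mentioned is controlled at run time through per-size sysfs knobs; on a kernel with that support, enabling (say) 64KB transparent huge pages looks roughly like this, though the set of available sizes varies by architecture and configuration:

    # see which mTHP sizes this kernel offers
    ls /sys/kernel/mm/transparent_hugepage/
    # allow the kernel to use 64KB mTHPs wherever it can
    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled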

[Luis Chamberlain] The discussion immediately wandered. David Hildenbrand said that there are people who have been looking into allocation failures and running into the fragmentation problem. SeongJae Park pointed out that, long ago, Mel Gorman had proposed a fragmentation index that has since been merged as a debugfs feature, and that some of Gorman's team are using it. Michal Hocko said that it is a question of proactive versus reactive responses; at what level should people care about fragmentation? Hildenbrand said that, currently, most allocations will fall back to a base page if larger pages are not available; in the future, if users need the larger allocations, that fallback will no longer be an option. There will be a need to measure the availability of specific allocation sizes to understand the fragmentation problem, he said.
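
The debugfs interface Park referred to exposes the index as one value per zone and allocation order; by the documented convention, a value approaching 1.000 means that allocation failures at that order would be due mostly to fragmentation, a value near zero points to a simple lack of memory, and -1.000 means that allocations at that order are currently expected to succeed:

    # requires a mounted debugfs; one line per zone,
    # one column per allocation order
    cat /sys/kernel/debug/extfrag/extfrag_index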

In response to a question from Hocko on the objective for this measurement, Chamberlain said that he wanted to know whether the introduction of large block sizes was making fragmentation worse. And, if the fragmentation problem is eventually solved, how would we measure that? Hocko suggested relying on the pressure-stall information provided by the kernel; it measures the amount of work needed to successfully allocate memory. But he conceded that it is "a ballpark measure" of the problem.
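
The pressure-stall information Hocko was referring to can be read from /proc/pressure/memory; the "some" line tracks the share of time during which at least one task was stalled on memory, while "full" tracks the time during which no non-idle task could make progress. The numbers below are invented for illustration:

    $ cat /proc/pressure/memory
    some avg10=0.00 avg60=0.12 avg300=0.07 total=1028730
    full avg10=0.00 avg60=0.05 avg300=0.02 total=462398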

Yu Zhao said that kernel developers cannot improve what they cannot measure; Paul McKenney answered that they can always improve things accidentally. That led Zhao to rephrase his point: fragmentation, he said, is a form of entropy, which is typically measured by temperature. But fragmentation is a two-dimensional problem that cannot be described by a single number. Any proper description of fragmentation, he said, will need to be multidimensional. Jan Kara said that a useful measurement would be the amount of effort that is going into memory compaction, but Zhao repeated that a single number will never suffice.

John Hubbard disagreed, saying that it should be possible to come up with a single number quantifying fragmentation; Zhao asked how that number would be interpreted. Hocko said that there is an important detail that would be lost in a single-number measurement: the view of fragmentation depends on a specific allocation request. Movable allocations are different from GFP_KERNEL allocations, for example. He said that, in any case, a precise number is not needed; he repeated that the pressure-stall information shows how much nonproductive time is being put into memory allocations, and thus provides a good measurement of how well things are going.
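
Hocko's observation, that the same system can look very differently fragmented depending on the type of allocation, is directly visible in /proc/pagetypeinfo, which breaks the free-page counts down by both allocation order and migration type; a system can be badly fragmented for unmovable allocations while still having plenty of contiguous movable memory:

    # free-page counts per allocation order, split by migrate
    # type (Unmovable, Movable, Reclaimable, ...); reading this
    # file requires privilege on current kernels
    sudo cat /proc/pagetypeinfo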

As the session wound down, Chamberlain tried to summarize the results, which he described as being "not a strong argument" for any given measure. Zhao raised a specific scenario: an Android system running three apps, one in the foreground and two in the background. There is a single number describing fragmentation, and allocations are failing; what should be done? Possible responses include memory compaction, reclaim, or summoning the out-of-memory (OOM) killer; how will this number help to make that decision? Chamberlain said that, at this point, he is focused on the measurement, not the reactions. Zhao went on for a while about how multidimensional measurements are needed to address this problem before Hocko said that the topic could be discussed forever without making much progress; the session then came to a close.

Comments (6 posted)

A plea for more thoughtful comments

By Jonathan Corbet
May 29, 2024
When redesigning the LWN site in 2002, we thought long and hard about whether the ability to post comments should be part of it; LWN had not offered that feature for the first four years of its existence. By then, there were already plenty of examples of how comments can go bad, but we decided to trust our readers to keep things under control. Much of the time, that trust has proved justified, but there have been times when things have not gone so well. This is quickly becoming one of those times.

When it is at its best, the LWN comment stream is a polite discussion among people who are both passionate and knowledgeable about the free-software community. Increasingly, though, it is dominated by interminable back-and-forth name-calling sessions in which few people participate and which most readers just wish would go away. LWN editors are having to intervene more frequently. The quality of the conversation is degrading the quality of the site overall; we need to do better.

Comment moderation is the least fun part of keeping LWN going; it takes time away from what we would rather be doing: creating more interesting articles to read. But if that is what we have to do, we will do it. We are contemplating the addition of comment quotas, perhaps selectively applied to the accounts that have been filtered by a lot of readers; that would slow the conversation down considerably, and slowing it down would, indeed, be part of the point. Other mechanisms may be considered as well.

But, maybe, we won't have to do that. Maybe, if people posting comments take a moment to think about whether it really matters that somebody might be wrong on the Internet, whether adding another message to the stream will really make the situation better, whether their comment adds something new to the discussion, and whether it is polite, respectful, and informative to all the people who will see it, the comment stream will improve by itself and we won't have to do any of those things.

LWN is more than its writers; it is a community that is shaped and supported by its readers. One of the best ways to support LWN at the moment would be to help ensure that our comment stream is respectful, polite, and actually interesting to read. The LWN community has successfully straightened out comment-related problems before; we can certainly do it again now. An advance "thank you" to all of you who will help to make that happen.

Comments (122 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds