
Leading items

Welcome to the LWN.net Weekly Edition for April 10, 2025

This edition contains the following feature content:

  • Debian Project Leader election 2025 edition: four candidates vie to lead the project.
  • The rest of the 6.15 merge window: the final changes pulled for the next kernel release.
  • The future of ZONE_DEVICE: managing memory attached to peripheral devices.
  • Supporting untorn buffered writes: extending atomic-write support to buffered I/O.
  • An update on torn-write protection: the status of, and plans for, the untorn-write work.
  • Better hugetlb page-table walking: making hugetlb pages look more like ordinary pages.
  • Page allocation for address-space isolation: allocator changes needed to defend against speculative-execution vulnerabilities.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Debian Project Leader election 2025 edition

By Joe Brockmeier
April 9, 2025

Four candidates have stepped up to run in the 2025 Debian Project Leader (DPL) election. Andreas Tille, who is in his first term as DPL, is running again. Sruthi Chandran, Gianfranco Costamagna, and Julian Andres Klode are the other candidates running for a chance to serve a term as DPL. The campaigning phase ended on April 5, and Debian members began voting on April 6. Voting ends on April 19. This year, the campaign period has been lively and sometimes contentious, touching on problems with Debian team delegations and finances.

Debian's Constitution defines the duties, powers, and responsibilities of the project leader. Part of the role includes serving as a public face for Debian, giving talks and attending events on behalf of the project, and managing Debian's relationships with other projects. The DPL role is not merely ceremonial: the project leader appoints delegates to various committees, such as the Debian Technical Committee, Debian System Administration (DSA) team, Treasurer team, and others.

The DPL serves a one-year term, with no limits placed on how many consecutive terms a person may serve. Jonathan Carter, who declined to run again last year, currently holds the record for most terms served: he held the position for four consecutive terms, from April 2020 to April 2024. Two candidates stood for election last year: Chandran and Tille. Out of more than 1,010 Debian developers who were eligible to vote in 2024, only 362 voted; Tille won the election by a healthy margin.

Andreas Tille

Tille has been involved in Debian for more than 25 years. He initiated the Debian Med project to create a Debian-based operating system that fits the requirements of medical practices and biomedical research. His platform for 2025 is significantly shorter than his 2024 platform and is less ambitious in scope. He notes that he has learned a lot during his first term and realizes that "making significant changes within Debian in a single DPL term is not feasible". This is illustrated by one of his platform planks, "supporting infrastructure teams and key package maintainers":

As I looked deeper into our processes, I identified issues that need improvement to better support our volunteers. The reason I am running for a second term is that I initially hoped these issues could be resolved quickly—but I was wrong. Now, with the experience I have gained, particularly in the social aspects of these challenges, I am committed to addressing them more effectively.

He also wants to focus on face-to-face meetings and diversity. Tille said that his "main approach to addressing geographic diversity within the project" has been face-to-face meetings, but it is difficult to measure success there. He thought that his "tiny tasks" initiative, an attempt to highlight easy ways for new contributors to become involved with Debian, would help to increase diversity. He started by creating a bug of the day project to choose five random bugs that might be good entry points to contributing to Debian. In March, he declared that it had not worked as intended.

The original goal was to provide small, time-bound examples for newcomers. To put it bluntly: in terms of attracting new contributors, it has been a failure so far. My offer to explain individual bug-fixing commits in detail, if needed, received no response, and despite my efforts to encourage questions, none were asked.

However, the project has several positive aspects: experienced developers actively exchange ideas, collaborate on fixing bugs, assess whether packages are worth fixing or should be removed, and work together to find technical solutions for non-trivial problems.

Sruthi Chandran

Chandran has been contributing to Debian since 2016 and has worked on about 200 packages, though her platform page notes that she is "not very active nowadays". Some of her other activities in Debian include working with the Outreach Team and the DebConf Committee, and serving as an Application Manager. Chandran was the chief organizer of DebConf 23, held in Kochi, India, in 2023. She also notes that she has been mentoring people to contribute to Debian for many years and is involved in organizing "numerous packaging workshops and other Debian events throughout India" as well as other free-software events like Free Software Camp.

Chandran's primary goal in running for DPL is to bring diversity issues to the fore:

I am aware that Debian is doing things to increase diversity within Debian, but as we can see, it is not sufficient. I am sad that there are only two women Debian Developers from a large country like India. I believe diversity is not something to be discussed only within Debian-women or Debian-diversity. It should come up for discussion in each and every aspect of the project.

DPL elections [are] an important activity in our project and I plan to bring up the topic of diversity for discussion every election till we have a non-(cis)male DPL.

Gianfranco Costamagna

Costamagna began contributing to Debian in 2014, focusing on the Berkeley Open Infrastructure for Network Computing (BOINC) packages. He became a Debian developer in 2015 and has been an Ubuntu core developer since 2016.

Professionally, I work in the embedded Linux domain, allowing me to bridge my expertise between Debian and Ubuntu. This dual involvement has enabled me to contribute significantly to both communities, particularly in package maintenance and architecture support.

His short platform page indicates that his major goals as DPL would be strengthening the collaboration between Debian and Ubuntu, improving quality and maintenance of packages, supporting emerging architectures like riscv64 and loong64, encouraging more community participation, and simplifying Debian's processes. "As an engineer, I aim to simplify complex processes, making it easier for contributors to engage and for users to benefit from Debian's offerings."

Julian Andres Klode

Klode has been a Debian developer since 2008, and is currently the primary maintainer of Debian's Advanced Package Tool (APT). He has been employed by Canonical since 2018, and is involved with APT and other package-management topics for Ubuntu. According to Klode, being "deeply rooted in both Debian and Ubuntu gives me the means to bridge gaps between the communities". He would like to try to make Debian more attractive to users from Ubuntu and other operating systems.

One of Klode's platform proposals is delegating responsibilities away from the DPL. One possibility that he suggests is to establish a DPL advisory council that could make binding decisions for the DPL. He, like Tille and Chandran, emphasizes diversity, equity, and inclusion (DEI) in his platform, but from a slightly different perspective. Klode talks about not just geographic and gender diversity, but also providing accommodations for contributors with disabilities, dietary needs, or those "not having enough money to attend DebConf". He said that it was not a complete list, but all contributors' voices deserve to be heard, "and we need to ensure that we provide safe environments for them to thrive in".

He notes that he cares very deeply about Debian, and sometimes overcommits. "This results in a bit of burnout and in me not picking the best words when trying to respond as quickly as possible".

Questions for the candidates

It is traditional for Debian developers to put questions to the candidates, town-hall style, on the debian-vote mailing list. It is up to the candidates to respond, or not, to questions during the campaign phase. Tille, Costamagna, and Klode all participated in the discussion. Chandran sent a message on April 5 apologizing for "being totally absent from the DPL campaign scene" due to personal issues and said "I do not wish to come at the last moment and ask for votes".

As a former DPL, Carter thanked the candidates for running and threw out several questions and ideas for them to chew on. One of his prompts for candidates was on version-control requirements for Debian. He wanted to know if there should be a requirement to maintain packages in Debian's Git forge, Salsa, and their opinions on dgit and tag2upload.

Klode said that he disagreed with some of the technical choices in dgit, but "that shouldn't be a particular surprise, given I filed bugs about it". He did think that every package in Debian should be in a git repository, and that maintainers should be expected to participate in merge-proposal workflows for their packages and merge patches submitted that way.

I mentioned this in another email recently, Ubuntu has automatic git imports of most of its packages, and treats merge requests in those git repositories as equal or more recommended than debdiffs; and I think this has tremendously improved quality of life, both in submitting changes as well as reviewing changes.

Tille's response was along the same lines: Debian should require packages to be maintained in a version-control system and hosted on a unified platform—in other words, Salsa. But, he said, that won't realistically be accomplished within the next DPL term.

For now, I will focus on migrating packages that remain outside Salsa _for_ _no_ _good_ _reason_ - for instance, cases where the maintainer is inactive, and nobody truly cares.

He said that tag2upload would be a "positive step towards modernization" that might serve as a stepping stone on the path toward unified Git-based workflows within Debian, though he had not felt a need to use it himself. If a Debian member did not agree that all Debian packages should be maintained in a common Git-based platform, "you should not vote for me".

Costamagna answered that he really liked to use Git and did not like to use anything that is not Git. He also liked Salsa, but did not go so far as to suggest that all packages should be hosted in Salsa or with a version-control system.

The ftpmaster situation

Debian's ftpmaster team has a somewhat misleading and archaic name. The team is responsible for the infrastructure required to support Debian's archive of packages, as well as reviewing new packages, setting the installation priority of packages, and more. The Debian wiki page has an in-depth description. The team is often simply referred to as the "FTP team" on Debian mailing lists. The ftpmaster team is appointed by the DPL, or delegated in Debian-speak. The current ftpmaster team delegation was last updated in 2017 by then-DPL Chris Lamb and has not been changed since.

In his 2024 platform, Tille had said that one of his reasons for running for DPL was to improve the team's process of integrating new packages. Sean Whitton sent an email asking Tille why nothing had changed, and why developers should vote for Tille again if he was unable to make progress on one of his core goals for his first term as DPL. Whitton detailed his discussions with Tille at DebConf24 last year about problems with Debian's ftpmaster team and proposed solutions. He and Tille "agreed on almost everything that needed to be done", but nothing had changed in the interim. "The DPL has the responsibility to ensure the core teams are fit for purpose, and it is far from clear that the FTP Team is fit for purpose."

Whitton asked the other candidates whether they agreed about the seriousness of the issues he described, and what they would do differently to achieve more.

Tille responded that he disagreed with the general statement that the ftpmaster team was not fit for purpose, and did not find it to be a helpful characterization. He did agree with Whitton's suggestions for improving the team, but did not agree with the way Whitton proposed to achieve them. Tille pointed out that there was "known friction" within the ftpmaster team between Whitton and others around the tag2upload project. LWN covered some of that saga in July 2024. Tille included a timeline of his activities related to the ftpmaster team from May 2024 through March 2025, and said that if Whitton felt another candidate would be more effective, "you absolutely shouldn't vote for me":

I have great respect for the other candidates and will gladly support them - especially with anything I've learned around the ftpmaster situation - if they are elected. But if you trust that I've learned from the overly high expectations I had going into my last DPL term, and believe I can now put that experience to better use, then I'd appreciate your support.

Tille also said that during his term he "never got the impression that there was any consensus within the FTP team on the need for improvements". Whitton agreed that there isn't consensus, and that was his point. The team should not be allowed to "decide that things should be a certain way" if the rest of the project disagreed. If making suggestions "which will mostly be ignored" does not work, then it is up to the DPL to fix problems with delegations.

Former DPL Ian Jackson said that it seemed Tille was laying the blame on Whitton. "As Project Leader you had all the management tools, and the whole resources of the Project, available to you." Jackson said that the ftpmaster team had become a "powerful institution" within Debian that needed to be kept in check, and it is up to the DPL to hold delegates responsible to the project.

To renew this institution, we need to get rid of the toxicity first. That means getting rid of the toxic people.

Yes, that is disruptive and risky. But the alternative is to allow the current situation to persist, as you have allowed it to persist.

Tille did not respond to those points. Costamagna replied that he thought the ftpmaster team should be split in two: one team should handle the review of packages in the new queue, and the other should handle the infrastructure for the Debian archive. That would allow people to focus on one or the other, and not require people interested in license reviews to be able to code or require people interested in improving the architecture to deal with copyright reviews.

Like Costamagna, Klode thought that the ftpmaster team had too many responsibilities tied together, and that members needed to satisfy all of them. He floated the idea of splitting the team into three new teams: an archive-license-auditing team for copyright review of packages, an archive-infrastructure team that would develop infrastructure such as dak, and an archive-management team that would handle adding and removing packages to the archive.

Debian finances

A member of Debian's treasurer team, Hector Oron, quizzed the candidates on their knowledge of Debian's finances and plans for them. He wanted to know if they knew how Debian funds were spent last year and how much yearly income Debian has. He also asked what areas the candidates would prioritize for spending, and what ideas the prospective DPLs had for fundraising, "if you think this is needed at all".

Costamagna did not weigh in on the budget threads. Klode responded that "the current DPL has an advantage in responding to this" but that he believed the majority of expenses are for conferences and travel. He mentioned ideas for cost savings, such as emphasizing smaller regional events like MiniDebConfs to save on travel, or establishing relationships with hardware vendors to have hardware donated rather than spending Debian funds on it.

He also floated the idea of "a future where we actually have a Debian Foundation that perhaps actually employs the DPL", and pays for other work in other areas where there are not enough volunteers or "where it's critical for the job to be done right because there's legal consequences otherwise". As far as donations and fundraising, it seemed to him that donations were "flowing in steadily over the years" and that Debian had a "healthy buffer" of about $500,000 with Software in the Public Interest (SPI). Klode did raise a concern about dependence on big donors "especially in the volatile political and economical situation we've been thrown into in the last couple of weeks", and had some ideas about making it more obvious how people could donate money and adding more trusted organizations to take in money for Debian around the world.

It may perhaps all be easier to start setting up a Debian Foundation and regional outlets of it; not that I particularly have a strong knowledge in the complexity of international tax law involved.

Actually, the amount with SPI currently is lower than Klode believed, Oron said. SPI only has about $300,000 right now. He added that the project would need to be managing at least $1 million per year for the idea of a foundation to make sense.

Tille said that he did have some advantage answering the question, but rather than simply answering, he wanted to focus on whether Debian needed to improve the transparency and readability of its financial information for Debian members and possibly the general public as well. He noted that there are monthly reports from SPI, "though they are admittedly difficult to interpret". He added that understanding Debian's finances was essential for the DPL, since the project leader makes decisions about funding requests, but the actual financial management is handled by the treasurers.

As for prioritizing, Tille said it did not make sense to imagine hypothetical conflicts between different spending categories; those decisions should be made on a case-by-case basis. "In short: For complex financial decisions, I will consult the treasurers first and also speak with those directly affected by the decision." He did acknowledge that it was no longer as simple as approving every funding request that is received.

In his Bits from the DPL talk, Neil [McGovern] once mentioned at DebConf15 that he approved every single funding request he received, and Debian's financial reserves still grew during his term. Unfortunately, these simpler times seem to be over, and the need for careful financial planning has increased. I'd love to be in Neil's shoes, and I hope that future DPLs will see those times return.

Improving fundraising, he said, "is a necessity if we want to continue running the project as we have in the past". That reply got the attention of former DPL Lucas Nussbaum, who asked if the need for careful spending constituted a problem, and what solutions the candidates envisioned. "Also, why do you think we aren't anymore in the comfortable situation we were in ten years ago?"

Pierre-Elliott Bécue, a trusted organization administrator, responded that he had part of the answer, "but it's not my place to disclose the full situation publicly". Essentially, Debian's funds have roughly halved since 2019/2020, and while the project is not broke, its funds could be depleted quickly "if we are not vigilant". Part of the problem is that Debian did not receive enough sponsorship for DebConf 2023 and DebConf 2024, which made a significant dent in funds. The project also needed to buy hardware, he said, and the time when companies gave away hardware to Debian "seems far away".

Klode replied that it would be optimal if Debian had a buffer of two years' worth of expenses, and for its expenditures to match regular donations. He did not, however, offer any suggestions on cost-cutting or how to achieve the regular donations. Tille did offer a few quick ideas, such as working with Debian's partners team to explore sponsorship opportunities. "I admit that fundraising isn't my strongest skill - but I'm very open to suggestions". He also said that finding sensible places to cut the budget is "something I personally find difficult".

In the hands of the voters

This is merely a summary of some of the important and interesting discussions that surfaced during the campaign period. Debian members have ample information at their fingertips to sift through in deciding who they would like to serve as DPL during the next term. It is clear that there are some thorny problems in need of solving, no matter who wins. The results of the election should be available on April 20.

Comments (11 posted)

The rest of the 6.15 merge window

By Jonathan Corbet
April 7, 2025
Linus Torvalds released 6.15-rc1 and closed the 6.15 merge window on April 6. By that time, 12,633 non-merge changesets had found their way into his repository; that is substantially more than were merged during the entire 6.14 development cycle. Just under 6,000 of those changesets were merged after the first-half merge-window summary was written.

Some of the most interesting changes from the second half of the 6.15 merge window are:

Architecture-specific

  • The RISC-V architecture has gained support for the BFloat16 floating-point extension, the Zaamo and Zalrsc extensions, and the Zbkb extension.

Core kernel

  • The function and function-graph tracers have gained the ability to record the arguments to called functions; those arguments can then be examined in the trace output.
  • The io_uring subsystem now supports zero-copy reception of network data, with eventual plans for allowing reception directly into device memory. See this commit for documentation.
  • It is also now possible to read epoll events via an io_uring operation. From the merge message:

    While this may seem counter-intuitive (and/or counterproductive), the reasoning here is that quite a few existing epoll event loops can easily do a partial conversion to a completion based model, but are still stuck with one (or few) event types that remain readiness based.

    For that case, they then need to add the io_uring fd to the epoll context, and continue to rely on epoll_wait(2) for waiting on events. This misses out on the finer grained waiting that io_uring can do, to reduce context switches and wait for multiple events in one batch reliably.

    With adding support for reaping epoll events via io_uring, the whole legacy readiness based event types can still be reaped via epoll, with the overall waiting in the loop be driven by io_uring.

  • The BPF subsystem has gained improved verification of programs with loops, some new load-acquire and store-release instructions, the ability to change extended attributes on files within BPF programs, and a timed may_goto instruction.
  • Also new in BPF is a "resilient queued spinlock" locking primitive. Its main purpose is to detect deadlock conditions at run time, enabling the loading of programs for which the verifier is unable to prove locking correctness. See this merge message for an overview of these locks, and this white paper for details.
  • The try_alloc_pages() allocation function, which allows for memory allocation in any context (with a relatively high chance of failure), has been merged. It is intended to support BPF programs that may be running in highly restricted contexts; a brief sketch of its use appears after this list. (See also: this article from LSFMM+BPF 2025).
  • It is now possible to check for the existence of guard pages in /proc; this addresses a regression (of sorts) experienced in some systems. It is also now possible to place guard pages in file-backed memory. See this article for information about both changes.
  • The tracking of mapping counts for large folios has been significantly reworked. The end result should be better tracking overall (and hopefully fewer security problems), but possibly slightly fuzzier working-set statistics.
  • The page allocator has seen some significant changes that are intended to greatly increase the reliability of huge-page allocations.
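
As a rough illustration of the try_alloc_pages() item above, here is a minimal, hypothetical sketch of a caller; the function's signature is assumed from the description, and the helper and its context are invented for illustration.

    /*
     * Hypothetical sketch: allocating a page from a restricted context,
     * such as a BPF program, where a normal allocation might sleep or
     * deadlock.  The try_alloc_pages() signature is assumed from the
     * description above; failure is expected to be relatively common.
     */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static void *grab_scratch_page(void)
    {
            struct page *page = try_alloc_pages(NUMA_NO_NODE, 0);

            if (!page)
                    return NULL;    /* callers must cope with failure */
            return page_address(page);
    }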

Filesystems and block I/O

  • The filesystems in user space (FUSE) subsystem can now enforce timeouts on requests, enabling recovery when the user-space server becomes unresponsive. This commit describes the sysctl knobs that control this feature. FUSE can also now handle file names longer than 1,024 characters.

Hardware support

  • Clock: Rockchip RK3528 and RK3562 clock controllers, Allwinner A523/T527 clock-control units, and Qualcomm IPQ9574 NSS clock controllers.
  • GPIO and pin control: Sophgo SG2042 and SG2044 pin controllers, Amlogic SoC pin controllers, AMD isp41-based pin controllers, and Allwinner A523 pin controllers.
  • Graphics: Raydium RM67200-based DSI panels, Visionox RM692E5 display panels, Apple Summit display panels, Apple touch bars, and pre-DCP Apple display controllers. The "nova-core" stub driver has also been merged; this is the framework for what will eventually be the nova driver for NVIDIA GPUs.
  • Industrial I/O: Silicon Labs Si7210 Hall-effect sensors, Broadcom APDS9160 ambient light and proximity sensors, Analog Devices AD4851 data-acquisition systems, Analog Devices AD4030 and AD7191 analog-to-digital converters, Texas Instruments ADS7128 and ADS7138 analog-to-digital converters, Analog Devices ADIS16550 inertial sensors, and Dyna Image AL3000a ambient light sensors.
  • Input: Apple Z2 touchscreens.
  • Miscellaneous: AMD MDB Versal2 PCIe controllers, Broadcom BCM2712 MSI-X interrupt peripherals, Inside Secure SafeXcel EIP-93 crypto engines, Rockchip RK3588 random-number generators, Maxim MAX77705 battery chargers, Maxim MAX77705 power-management ICs, Maxim MAX77705 LED controllers, Samsung S2MPU05 regulators, Apple DWI 2-Wire interface backlight controllers, Intel timed IO PPS generators, CoreSight TMC control units, Qualcomm PCIe UNIPHY 28LP PHYs, Rockchip Samsung MIPI DCPHYs, SpacemiT K1 I2C adapters, Lenovo SE30 watchdog timers, and National Instruments 16500 UARTs.
  • USB: Parade PS883x Type-C retimers.

Miscellaneous

  • The "fwctl" subsystem, designed to pass command data directly through to complex firmware systems, has been merged. This subsystem has been controversial in the past, but a 2024 Maintainers Summit session came to the conclusion that it should be merged. There are three drivers included with this merge, one for CXL devices, one for mlx5 adapters, and one for AMD/Pensando distributed services cards. There is some documentation included with this new subsystem.
  • The perf subsystem has gained the ability to perform latency profiling using scheduler information. See this commit for details, and for information on other perf enhancements added this time around.

Security-related

  • The Landlock security module has a new auditing mechanism that is intended to make it easier to understand access denials. This commit contains documentation.
  • The kernel may now optionally seal a number of memory mappings against changes as a hardening measure. This feature was somewhat controversial because it is likely to break some applications, so it is disabled by default. Whether any distributors will dare to enable it remains to be seen. See this documentation commit for some more information.

Internal kernel changes

  • The runtime verification subsystem has a new feature called "scheduler specification monitors" that allows multiple monitors to run concurrently and interact with each other. This commit contains documentation for this new feature.
  • The new traceoff_after_boot command-line parameter will cause tracing to be disabled once the kernel has booted and started the init process. It is intended to help with the tracing of boot-related problems, ensuring that the trace data is not overwritten by the time a human is able to look at it.
  • There is now support for running unit tests within Rust code with the new #[kunit_tests()] macro.
  • New Rust abstractions cover high-resolution timers and coherent DMA mapping, which was the focus of some disagreement earlier this year.
  • There were 451 exported symbols removed and 447 added, for a net reduction of four. There are also twelve new kfuncs in 6.15. See this page for the full list of additions and removals.

This kernel now goes into the stabilization period, with the final 6.15 release happening on May 25 or June 1.

Comments (none posted)

The future of ZONE_DEVICE

By Jonathan Corbet
April 4, 2025

LSFMM+BPF
Alistair Popple started his session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit by proclaiming that ZONE_DEVICE is "the ugly stepchild" of the kernel's memory-management subsystem. Ugly or not, the ability to manage memory that is attached to a peripheral device rather than a CPU is increasingly important on current hardware. Popple hoped to cover some of the challenges with ZONE_DEVICE and find ways to make the stepchild a bit more attractive, if not bring it into the family entirely.

There are five different types of memory managed under ZONE_DEVICE; for the curious, they are:

  • MEMORY_DEVICE_PRIVATE: device-hosted memory that is not directly accessible by the CPU.
  • MEMORY_DEVICE_COHERENT: memory that is directly accessible and maintains cache coherency on both the CPU and device sides. CXL memory is one example of this type.
  • MEMORY_DEVICE_FS_DAX: memory set aside for use with the DAX (direct file access) subsystem.
  • MEMORY_DEVICE_GENERIC: relatively normal-looking memory, hosted on a device, that is often used for DAX as well.
  • MEMORY_DEVICE_PCI_P2PDMA: memory accessible on the bus used for direct memory transfers between devices.
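
These types correspond to the kernel's enum memory_type; a lightly abridged sketch (from include/linux/memremap.h, quoted from memory, so treat the details as approximate):

    /* Sketch of enum memory_type from include/linux/memremap.h */
    enum memory_type {
            /* 0 is reserved so that uninitialized fields can be caught */
            MEMORY_DEVICE_PRIVATE = 1,
            MEMORY_DEVICE_COHERENT,
            MEMORY_DEVICE_FS_DAX,
            MEMORY_DEVICE_GENERIC,
            MEMORY_DEVICE_PCI_P2PDMA,
    };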

About the only thing these types have in common is that this memory is not allocated directly by the memory-management subsystem. Popple is most interested in the device-private type, associated with devices like GPUs. This memory, being controlled by the relevant driver, has a lifetime that is tied to that of the driver. It may be possible to map it into user space on demand, but, since it is not normal memory, it cannot be tracked on the kernel's least-recently-used lists. As a result, drivers have used the lru field of the page structure, a field that is destined for eventual removal, for other purposes.

[Alistair Popple] Some of Popple's recent work in this area has been to improve the reference counting of ZONE_DEVICE pages; most types are now properly tracked this way. One problem, which took 32 patches to fix, was that FS-DAX pages had reference counts that were off by one, meaning that a reference count of one actually meant that a page was free. That has (as of 6.15) been fixed, freeing a page bit in the process.

Having solved that problem, Popple is wondering what to work on next. He would like to support huge device-private pages; only the DAX types of ZONE_DEVICE pages support anything larger than base pages now. His cleanup work should enable the use of huge pages more widely. There was some discussion on just how folios might be created for larger ZONE_DEVICE pages. Meanwhile, Balbir Singh has posted a patch series adding transparent huge page support for device-private pages.

Currently, Popple said, drivers are using the migrate_vma interface to move data to and from device memory. It is a simple interface, working with arrays of page-frame numbers, but it gets increasingly unwieldy as page sizes get larger. He would like to enable the splitting and merging of huge pages on migration; huge pages are often easily available on the device side.

A bigger problem, he said, is enabling file-backed, device-private pages; currently, those pages only work as private, anonymous memory. It would be nice, he said, to be able to just pass a virtual address to a device and let it access whatever contents are there. This "sort-of works" with file-backed data on the host, but those pages cannot be migrated to the device. So memory has to be accessed remotely over the bus, which is slower. Things would work a lot better if devices could work with file-backed pages locally.

Matthew Wilcox asked what the use case was for file-backed device-private pages. The answer was, inevitably, working with AI training data.

Popple continued that he is looking at letting device-private pages exist in the page cache. There are only a few lookup functions for the page cache, so it should be relatively easy to create special cases for device-private pages. The read side is especially easy: the data can just be reread from storage. For anything requiring writing to shared data, the kernel would handle a page fault, then call into the driver to put the relevant pages back into host memory.

David Hildenbrand said that this scheme did not sound entirely crazy, and asked how many hoops Popple was having to jump through to implement this; Popple said there weren't that many. As more information moves into folios, he said, the task will get even easier in the future.

Wilcox was more dubious about the shared case. Shared, writable mappings are "a terrible programming model", he said, and the error-handling is difficult. "What if we just don't do any of this?". To migrate data to a device, let the device just have the copy, and mark it accordingly on the host. If something has to be done on the host side, just invalidate the device's copy, he said, but do not ever let devices write to this data. If writes absolutely must happen, they can be done over the bus; it will be slow, but it really just has to work. He wondered whether anybody writes to training data on the device side.

Josef Bacik concurred, saying that he has recently spent a lot of time working with AI models; his opinion on allowing shared writable mappings was: "this whole thing will suck". Nobody wants to copy data into user space, then to the device; they would much rather copy the data directly to the device and avoid allocating host memory at all. Managing sharing will be difficult, he said; the implementation should be as simple as possible, and stop at providing read-only access.

Hildenbrand suggested using the kernel's MMU notifiers to catch faults on shared memory as a way of catching host-side write attempts. It might be worth the trouble, he said, but perhaps it is easier to just say that the device will not see host-side modifications. Popple said that in such cases it is better to invalidate pages on the device for correctness, even if the result is slow.

Bacik closed the session with what seemed like a consensus view. Kernel developers want to anticipate all possible use cases, he said, but sometimes it is better to stop with implementing what can be done correctly. Writing to data shared in device-private pages is not a use case to worry about; any use case that might be proposed would have to be "pretty compelling" to be considered. Otherwise, he said, the only reasonable course is to stick with the simpler case that can be implemented correctly.

Comments (10 posted)

Supporting untorn buffered writes

By Jake Edge
April 4, 2025

LSFMM+BPF

At last year's Linux Storage, Filesystem, Memory-Management, and BPF Summit (LSFMM+BPF), there was a discussion about atomic writes that was accompanied by patches to support the feature in the block layer, and for direct I/O on XFS. That work was merged, but another piece of that discussion concerned adding the feature for buffered I/O, in part because the PostgreSQL database currently has to jump through hoops to ensure that its writes are not "torn" (partially written) when there is an error or crash. Luis Chamberlain led a combined storage and filesystem track at this year's summit to revisit the idea of providing atomic (or untorn) writes for buffered I/O.

[Luis Chamberlain]

Chamberlain suggested that there was a belief that it did not make sense to work on buffered atomic I/O simply to work around a missing feature in PostgreSQL; some think that the database should just support direct I/O. It turns out that the default storage engine for MongoDB supports both buffered and direct I/O, he said, but MongoDB recommends using buffered. The reason is that MongoDB compresses data on disk by default and keeps the data uncompressed in its cache. The data can be accessed via mmap(), which is not compatible with direct I/O.

He thinks that the database developers should be able to decide on the architecture that works best for their needs. Providing untorn buffered writes allows the databases to eliminate the double-buffering they are doing now as a workaround. There are configuration options to turn off the double-buffering for MySQL and PostgreSQL, which can be used to test the impacts of the change.

The atomic-write API could eventually be used by databases to provide the torn-write prevention, but a prototype can be run without it to verify that the databases are writing with the correct sizes and alignment needed by that API. In his slides, Chamberlain showed graphs of running MySQL and PostgreSQL with and without the double-buffering options. Both showed higher average transactions per second, with much less variability, without double-buffering. To reproduce these results, Chamberlain recommended using blkalgn, which he called "the bees knees for I/O atomic-alignment verification and visualization". The tests are integrated into his kdevops kernel development and testing tool.

Chamberlain wondered what the next steps might be. John Garry said that he thought the testing needed to validate the idea should also be run with more threads because, at least for MySQL, his testing showed some contention when multiple threads were writing. Chamberlain agreed with that, noting that the tests can easily be run in kdevops, so doing so with various numbers of threads needs to happen; the tests run for a long time (12 hours), so he has not yet gotten to further testing. He said that various members of the community will need to do their own homework to decide whether it makes sense to support the feature; if so, then there is the question of what the API should be.

As Chamberlain had mentioned earlier, Ted Ts'o noted that various large cloud vendors (hyperscalers) already have hosted MySQL solutions that are taking advantage of untorn writes. They are doing so "without any upstream patches, just auditing code paths, and it mostly works as long as you are really, really careful". So he agrees that the community needs to do its homework, but vendors have made it clear that they see advantages, at least for MySQL and direct I/O.

His main concern regarding atomic buffered I/O is the semantics of the RWF_ATOMIC flag. The database people only need untorn writes for 8KB, 16KB, and, maybe someday, 32KB sizes, but there is a contingent in some parts of the filesystem-development community that believes a 1MB write with the atomic flag should be fully supported. That would be painful to do for direct I/O, but it is exceedingly difficult for buffered I/O. The kernel needs to track different atomic-write sizes as they make their way through the page cache and onto the storage medium. There may be "fancy ways to do that [with] large folios and making darn sure you don't break apart a large folio when that happens", but he thinks it makes more sense to restrict the size of untorn writes.

One additional concern that Ts'o has is with writeback when a page is locked because, for example, a page fault is in progress. Currently, the writeback thread simply skips any locked pages, which could result in a torn write. He thinks the XFS implementation for atomic writes takes care of that problem, though he has not looked closely, but a more general solution is probably required.

David Howells asked about the interaction between atomic writes and mmap(). The alignment and length of the mapped part of the file need to be the same as that required by the atomic writes, Ts'o said, which is presumably what was done for XFS. Another problem comes when buffered and direct I/O are both being done to the same file, Howells said. Ts'o said that the filesystem community has always recommended against combining buffered and direct I/O, but, since it is known that MySQL and its backup program already do so, the community "made it safe in some circumstances"; it all just works 99.9% of the time, he said.

Chamberlain said that because the RWF_ATOMIC flag is marking the I/O, filesystems can prevent problematic combinations. Amir Goldstein suggested that files be opened for either atomic direct I/O or atomic buffered I/O, which is similar to the restriction added for FUSE passthrough mode; there is a flag on the inode of the file while it is open that indicates its mode. Jeff Layton pointed out that RWF_ATOMIC is a flag on the I/O operation, not the open, but it could be used to simply return EINVAL for operations that would violate the combination rules.

Ts'o thought it made sense to switch to an open flag, and suggested that O_UNTORN would be the right name; the granularity for the untorn writes could be placed in the inode. One of the problems he sees is that the developers have been using the term "atomic" because that is the term used by SCSI and NVMe, but then people wanted to make 1MB atomic writes work, which is not at all what the database developers care about. Switching to untorn and being clear about the granularity supported will help simplify the API and lead to the feature landing much sooner, he thinks.

Chamberlain asked Jens Axboe about the RWF_UNCACHED flag for uncached, buffered I/O and whether it would be suitable to use for untorn writes. Axboe did not really have an opinion, as it is largely a filesystem, rather than a block layer, concern, but could see that some of the effects of the flag might be useful—immediately kicking off writeback, for example. Ts'o cautioned that the PostgreSQL developers actually want the page cache to manage the caching for the database, as he understands it. One of the reasons that they have not switched to using direct I/O is that they would need to do their own user-space cache management; he suggested talking with the PostgreSQL developers to assess their needs.

Chris Mason said that the need for cached, untorn writes does not mean that uncached, untorn writes should not be supported as well. Chamberlain and others agreed with that. Christian Brauner noted that adding an open flag and marking the inode would mean that other users of the file might be precluded; that implies that privileges of some sort should be required. The "deadly combination" is a file that is open for untorn and then gets opened for direct I/O, Ts'o said; that is a rare enough combination that the second open could just fail.

The session had run out of time at that point, but Chamberlain said that it would seem that developers are interested in supporting untorn buffered writes, but only with some restrictive rules that had not been determined yet. Goldstein suggested starting slowly with some kind of opt-in, perhaps via a mount option or filesystem-creation flag, then possibly growing the feature from there.

Comments (3 posted)

An update on torn-write protection

By Jake Edge
April 9, 2025

LSFMM+BPF

In a combined storage and filesystem track session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, John Garry continued the theme of "untorn" (or atomic) writes that started in the previous session. It was also an update on where things have gone for untorn writes since his session at last year's summit. Beyond that, he looked at some of the plans and challenges for the feature in the future.

[John Garry]

Garry called the feature, which he has been working on for a year or two at this point, "torn-write protection". The idea is to prevent the system from "tearing" a write operation by only writing part of it. It is important for database systems, he said, so that they do not need to double-buffer their data to protect against partial writes when there is a power failure or system crash.

The RWF_ATOMIC flag has been added to the pwritev2() system call to specify that a given write should either be committed in full to the storage device—or that none of it should be. But, in order to guarantee persistence, RWF_SYNC (or similar operations) are still required. The supported minimum and maximum sizes for atomic writes can be queried using the STATX_WRITE_ATOMIC flag for the statx() system call. Those values will be powers of two and any RWF_ATOMIC writes will need to also have a power of two length value between the minimum and maximum; the buffer will need to be aligned, as well.
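
The flow described above can be made concrete with a short sketch. This is not code from the article: it assumes a 6.11-or-newer kernel, a file on a filesystem and device that support untorn writes, and headers new enough to define STATX_WRITE_ATOMIC and RWF_ATOMIC; error handling is abbreviated.

    /*
     * Query a file's untorn-write limits with statx(), then issue a
     * single 16KB untorn write with pwritev2(RWF_ATOMIC).
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct statx stx;
        struct iovec iov;
        size_t len = 16384;
        void *buf;
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDWR | O_DIRECT);
        if (fd < 0)
            return 1;

        /* Ask the filesystem for the supported untorn-write unit sizes. */
        if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0)
            return 1;
        printf("untorn units: min %u, max %u\n",
               stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);

        /* The length must be a power of two between those limits, and
           the buffer (and file offset) must be aligned accordingly. */
        if (posix_memalign(&buf, len, len))
            return 1;
        memset(buf, 0, len);
        iov.iov_base = buf;
        iov.iov_len = len;

        /* Either all 16KB reaches storage or none of it does; RWF_DSYNC
           is still needed to make the result durable. */
        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC | RWF_DSYNC) < 0)
            perror("pwritev2");
        return 0;
    }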

Chuck Lever could "hear the eyeballs rolling in the back of the room" but wondered about databases that use network filesystems for their storage; he does not think that untorn writes are supported there. The Linux NFS client supports direct I/O, at the request of database developers, he said, but it would require support in the NFS protocol to handle untorn writes. Garry said that the iomap layer sets a flag on untorn writes that the block layer can use, but Lever pointed out that the block layer of interest would be on the server, so the two of them agreed that some work will need to be done to support that.

Hardware based

Garry said that SCSI and NVMe have non-complementary feature sets. For NVMe, all writes are implicitly atomic as there is no dedicated command to request an atomic write, unlike SCSI, which has the WRITE_ATOMIC_16 command for that purpose. But NVMe writes are only actually atomic if they follow the rules on write lengths and do not cross an atomic-write boundary if one exists. That's not great, he said, because there is no indication if those rules are broken; the write could end up torn. The Linux NVMe driver detects that condition and returns an error, however. SCSI will reject the atomic command if its conditions are not met.

The virtual filesystem (VFS) and block-layer support for untorn writes on SCSI and NVMe were merged into the 6.12 kernel. There is also support for XFS, but it is limited to writes with a size of a single filesystem block. At roughly the same time, though, support for filesystem blocks larger than the page size was merged, so XFS filesystems with an 8KB or 16KB block size can do untorn writes for those sizes using direct I/O.

Due to the way the iomap layer works, an atomic write cannot currently be done for a mixed range, such as a range containing both data and holes; that could be solved, but it would be painful to do, he said. A bigger problem for XFS is that there is no guarantee that the disk blocks in an atomic write are contiguous or that they are properly aligned. The filesystem blocks could be "backed by disk blocks that are sparsely spread out through the filesystem".

He and others had been pushing the addition of an XFS forcealign attribute as a solution to those problems. It would guarantee that filesystem blocks were allocated and aligned correctly. The XFS realtime device has "large allocation units" that can be used to provide the needed guarantees, so forcealign extended that feature to the regular filesystem, but other XFS developers did not seem to think that was the right thing to do. The forcealign feature is, thus, not being pursued currently.

So he turned to the large-block-size work. Switching the filesystem block size to 16KB would inherently provide the needed alignment guarantees and XFS already supported writing a single filesystem block atomically. But, when testing MySQL performance using a 16KB filesystem block (which is the same as the database block size) and atomic writes, he and his colleagues found "significant performance impact" in some tests, particularly those with a lower number of threads. Using double-buffered writes for the database performed better than atomic writes.

The problem was diagnosed to be from writes of the "redo log", which is a buffered 512-byte write followed by an fsync(). With a 4KB filesystem block, that is inefficient because it updates much less than a full block; raising the block size to 16KB only makes that worse. There have been efforts to improve the performance of the redo log, but this kind of pattern is seen in lots of different kinds of applications; it is not just a MySQL problem.

Filesystem based

So far, only the hardware-based solutions for atomic writes have been pursued, Garry said. A while back, Christoph Hellwig worked on atomic writes for filesystems, which was (originally) based on opening files using an O_ATOMIC flag. Filesystem-based atomic writes would not have the alignment and single-extent requirements that come with the hardware-based variant. In addition, since hardware with atomic-write support is uncommon, a filesystem-based variant would bring the feature to many more users.

So he is currently working on XFS atomic writes. A write of that kind would allocate staging blocks where the data gets written, non-atomically, and then the block mappings will be updated atomically in a single transaction. The XFS copy-on-write (CoW) fork can be used for the staging blocks, but it will mean that a write requires an allocation, block remapping, and a free operation, so it will be slower. Unlike the hardware-based solution, though, it would not require a reformatting of the filesystem to increase the block size as long as the existing filesystem supports XFS reflink.

When the CoW blocks are allocated, aligned blocks are requested so that hardware-based atomic writes can be used if the hardware supports them. It would be a hybrid approach that would first try to do the atomic write via the hardware; if the alignment or write size is not suitable, it would fall back to the filesystem-based atomic writes.

Amir Goldstein asked whether user space can test which mode it will get, or request that only a particular mode be used; is there a way for the application to know, he wondered. The idea is that once the database, for example, has been running for a while, everything will be aligned and all of the atomic writes will be done with the hardware, Garry said.

Mike Snitzer noted that it made sense to do this work for XFS, since it is widely used, but he is concerned that the work is XFS-specific. He wondered if the feature would be useful for any filesystem, including NFS and other network filesystems, returning to Lever's question. Is there a plan for a more generic mechanism to join multiple blocks into an atomic write? Garry said that there are features like "bigalloc" for ext4 that can be used for that filesystem; that work is currently ongoing. But Snitzer said that was just another filesystem-specific scheme and not something generic at the virtual filesystem (VFS) layer that any Linux filesystem might be able to use.

SCSI and NVMe support for atomic writes requires that the blocks involved in the write be contiguous on the storage device, Ted Ts'o said, which is ultimately a filesystem-specific attribute. He has chosen not to use the forcealign approach for ext4 because it would require that the database files be restored onto a new filesystem, with different attributes and less fragmentation, which is not popular for production databases.

He has funding from his company (Google) to support untorn 16KB writes on ext4 for databases, but nothing further than that. He hopes to have that support get merged as part of the Linux 6.16 release. It uses the bigalloc feature with a 16KB cluster size in order to get the required alignment for the underlying hardware. There are vendors "shipping product today, they're just simply relying on the fact that 'we desk-checked the block layer'" and that the vendors' testing has shown that in practice writes will not be torn. "Yes, this is terrifying, this is why I want all of this stuff to land for real."

Chris Mason returned to Goldstein's point, noting that silently going from the fast, hardware-based atomic writes to slow, filesystem-based writes is not what his customers want. They want to use the hardware-based mechanism and to get an error if that cannot be done. Garry said that in the testing that has been done, the software-based writes do not "typically" occur.

But Mason said that means that in production once in a while, writes will start being slow; he would rather get an error. Garry said that is not a good user experience, however. Josef Bacik agreed with Mason, saying that something that randomly slows down once in a while will cause him to "lose my mind and we just won't use it"; he noted that "typically" means that "across 100,000 nodes it happens once a day on a random machine".

Garry asked what the user is supposed to do if they get an error instead. Bacik and Mason said that it will then be clear that something in the environment was misconfigured or otherwise broken so that it can be addressed. Hellwig said that the developers should fix the instrumentation of the system "instead of creating stupid interfaces". The information on whether the fallback has been used is easily available from a tracepoint, but Mason pointed out that his users cannot run all of the systems with tracepoints enabled.

Ts'o said that even with bigalloc, there are ways that users could mess up the atomic-write requirements; for example, by punching holes in every other block in the cluster. Ext4 will do the right thing, he said, by writing zeroes into the holes and fixing up the extent trees so that atomic writes can be submitted to the block layer, but "will log a message saying 'user space did a stupid'". That is so he can close bugs with performance complaints when that happens; he could do it with tracepoints, but those are generally not enabled in production.

Garry closed by mentioning a statx() change he would like to make to report the maximum length for the hardware atomic write. That would allow applications to try to ensure their atomic writes go as fast as they can without falling back to the slower filesystem-based version. His last item was the upstream status of the work. The iomap changes for the filesystem-based atomic writes were submitted for 6.15 and the XFS support for the feature is targeted for 6.16.

Comments (4 posted)

Better hugetlb page-table walking

By Jonathan Corbet
April 3, 2025

LSFMM+BPF
The kernel must often step through the page tables of one or more processes to carry out various operations. This "page-table walking" tends to be performed by ad-hoc (duplicated) code all over the kernel. Oscar Salvador used a memory-management-track session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit to talk about strategies to unify the kernel's page-table walking code just a little bit by making hugetlb pages look more like ordinary pages.

[Oscar Salvador] "Hugetlb" refers to an early huge-page implementation in the kernel that has often been thought of as an independent memory-management subsystem. It works with memory that has been reserved specially; the hugetlbfs filesystem must be used to gain access to that memory. Many applications are better served by transparent huge pages, which require no special code, but hugetlb users remain. It gives more reliable access to huge pages for some applications, and can reduce memory usage by sharing page tables across multiple processes.

The existence of hugetlb as an independent memory-management mechanism has long grated on the development community. The 2024 Summit featured a session focused on hugetlb unification, and some progress has been made in that direction. The 2025 session limited its scope to page-table walking in particular, in the hope of getting rid of some duplicated code and special cases. Salvador posted an RFC patch set unifying the hugetlb page-walking code in July 2024, but the reviews were mixed, and that work has not proceeded further.

Since then, David Hildenbrand has proposed, in general terms, a new page-walking API that could be considered instead. (That initial proposal happened in this email, but most of the discussion about the implementation has evidently happened privately. Salvador has an implementation in his repository.) The core idea is an API that walks through a virtual-memory area (VMA) and manages locking and batching of operations, telling the caller what type of pages were found. This new API would make implementation of /proc/PID/smaps much simpler, he said. If the group agreed, he said he would like to start converting some of the /proc code over, then move on to some of the other page-table walkers in the kernel.

Lorenzo Stoakes asked whether Salvador intended to replace all of the page-table walkers in the kernel with calls to the new API. Salvador said he did not intend to do that right away; there are a lot of special cases in many of those walkers, so the conversion is not always straightforward. Hildenbrand said that, for now, it is best to focus on the lower levels of the page tables.

Ryan Roberts expressed concerns that the performance of many system calls is sensitive to small changes. Adding a page-table walker with indirect calls could introduce an unacceptable slowdown, he said. But, as it turns out, the proposed API is implemented as an iterator with no indirect calls, so that should not be a problem. At the end of the session, Matthew Wilcox asked how this API will handle a copy-on-write operation in the middle of a large allocation. For now, apparently, it does not handle that case at all; in the future, it will be able to return ranges of compatible page-table entries.

Comments (none posted)

Page allocation for address-space isolation

By Jonathan Corbet
April 3, 2025

LSFMM+BPF
Address-space isolation may well be, as Brendan Jackman said at the beginning of his memory-management-track session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, "some security bullshit". But it also holds the potential to protect the kernel from a wide range of vulnerabilities, both known and unknown, while reducing the impact of existing mitigations. Implementing address-space isolation with reasonable performance, though, is going to require some significant changes. Jackman was there to get feedback from the memory-management community on how those changes should be implemented.

The core idea behind address-space isolation (last covered here in March), he began, is to run as much kernel code as possible in an address space where sensitive data is unmapped, and thus invisible to speculative-execution vulnerabilities. It is like the kernel page-table isolation that was introduced in response to the Meltdown hardware vulnerability, but with a higher degree of protection. Kernel page-table isolation created a new address space with most of the kernel removed; the new work adds a restricted address space with holes where only the sensitive data has been removed.

The address-space isolation patches are deployed on a significant subset of Google's fleet, he said. Their current (public) form can be seen in this patch set posted in January. This version adds protection from bare-metal attackers, while previous versions had only protected the kernel from virtual machines. There are still two blockers that need to be addressed, though. One is a better design for page allocation — the intended topic for this session — while the other is a 70% degradation in file-I/O performance.

In order for the kernel to unmap memory containing sensitive data, it needs to know where that memory is, so kernel code must indicate sensitivity at allocation time by way of a new __GFP_SENSITIVE flag. There are some performance considerations here; for example, mapping pages may require first zeroing them, since they may have previously contained sensitive data. He would also like to avoid fragmenting the kernel's direct map if possible. Mike Rapoport, who has analyzed the cost of direct-map fragmentation, said that it is best avoided if possible, but is not that critical.
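
In the posted patches, marking an allocation as sensitive is a matter of adding the new flag at allocation time; a minimal sketch (the variable is invented):

    /* This page will be unmapped from the restricted address space;
     * __GFP_SENSITIVE comes from the posted patch set, not mainline. */
    unsigned long secret = __get_free_page(GFP_KERNEL | __GFP_SENSITIVE);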

Avoiding fragmentation, Jackman continued, requires grouping nonsensitive pages together in physical memory. He also preallocates page tables for restricted data down to the PMD (2MB) level, and maps or unmaps entire 2MB page blocks at one time. That helps to minimize fragmentation and translation lookaside buffer (TLB) invalidations.

The patch set adds two new migration types to distinguish sensitive and non-sensitive data, and a new constraint that disallows the allocation of pages across the two sensitivity types.

There are some challenges that come with unmapping page blocks while allocating memory. The unmapping requires a TLB invalidation, but that cannot be done while the zone lock (needed to allocate the page block) is held. The invalidation must be done, though, before other CPUs are allowed to see the block as being sensitive. So the current code allocates the entire page block, even if only one page is needed, releases the zone lock, performs the invalidation, then reacquires the zone lock. After that, the memory can be marked sensitive and any unneeded pages can be freed.
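
In rough pseudocode, with invented helper names, the sequence described above looks something like this:

    /* Only the ordering matters here; all helpers are invented. */
    spin_lock(&zone->lock);
    block = take_whole_pageblock(zone);   /* 2MB, even if one page is needed */
    spin_unlock(&zone->lock);

    unmap_from_restricted_space(block);   /* performs the TLB invalidation */

    spin_lock(&zone->lock);
    mark_pageblock_sensitive(block);      /* now safe for other CPUs to see */
    free_unused_pages(block, nr_needed);
    spin_unlock(&zone->lock);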

That technique works, but leads to a potentially worrisome scenario. If all CPUs on the system decide to allocate a sensitive page at the same time, they will all end up doing the above dance, and they will all hammer each other with TLB invalidations. Jackman said that he is not sure that this case is worth optimizing for, but Matthew Wilcox said that database workloads could possibly act in just that way.

Mapping a block while allocating is easier, Jackman said; it is just a matter of populating the page tables and changing the migration type of the affected memory. It is essentially a normal case of migration-type fallback. But pages that might have held sensitive data have to be zeroed to prevent the possible exposure of that data; he wondered if the allocator should just zero pages unconditionally. The cost of doing so, he said, is not that bad. Jason Gunthorpe said, though, that zeroing can indeed be painful on systems with slower memory, and Suren Baghdasaryan said that there had once been a maple-tree performance regression caused by page zeroing.

If the zeroing overhead is too much, Jackman continued, then the allocator will have to repeat the unmapping dance described above, or handle zeroing one page at a time, using a page flag. Somebody asked what the performance of the allocator was at Google; Jackman said that it worked well, but the version of address-space isolation running there does things differently than the version that has been posted for upstream consideration.

Wilcox asked if the kernel's Spectre mitigations can be safely disabled once address-space isolation is in use. For now, Jackman said, the isolation only protects user-space pages; the task of marking kernel allocations for sensitivity is far from complete. Once that has been done, it should be possible to turn off the mitigations, and to never need another one again. The mitigations are off at Google, and the patch yields a performance gain overall.

Since he had some time at the end of the session, Jackman launched into the problem of the file-I/O performance regression caused by address-space isolation. The problem is that all file pages are marked sensitive within the kernel, so every read causes a fault and an address-space transition. It is pointless to protect pages that a process is about to read anyway, but the page cache as a whole cannot be marked non-sensitive since it likely contains data that any given process cannot access. Earlier versions of the patch set had a separate "local nonsensitive" marker for data that processes could leak to themselves but, even with that, the kernel does not know at allocation time where file pages should be mapped.

Thus, he said, the kernel needs a process-local mapping for file pages. One solution would be to map entire files when a process opens them; that is relatively easy, but it is harder to know when to unmap file data. A process can lose access to a file in a number of ways; a security module might change its mind, or fanotify permission events can revoke access, for example. There must also be action taken when file pages are removed from the cache, either through reclaim or because a file is truncated.

The alternative, he said, is ephemeral, per-CPU mappings that are in place only as long as the operation is ongoing. Once the operation completes, the page tables would be torn down right away, but the TLB flush could be deferred to minimize the performance impact.

At that point, the session was truly out of time and the discussion ended with no conclusions on the file-I/O problem.

Jackman has posted the slides from this session.

Comments (8 posted)

The state of guest_memfd

By Jonathan Corbet
April 4, 2025

LSFMM+BPF
A typical cloud-computing host will share some of its memory with each guest that it runs. The host retains its access to that memory, though, meaning that it can readily dig through that memory in search of data that the guest would prefer to keep private. The guest_memfd subsystem removes (most of) the host's access to guest memory, making the guest's data more secure. In the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, David Hildenbrand ran a discussion on the state and future of this feature.

Once upon a time, he said, virtual machines (VMs) were easy; they only had one type of memory provided by the host. In confidential-computing circles, this memory is deemed "shared", since it is accessible by both the guest and the host. More recently, the advent of confidential VMs has added the concept of private memory, which is only accessible to the guest. In hardware-backed implementations, an attempt to access guest-private memory by the host might even cause a machine check, stopping the show entirely. Private memory cannot (on the host) be mapped into user space, swapped out, or migrated.

A participant asked whether DMA access by devices attached to the host was supported. In general, Hildenbrand said, that access is not allowed, but there is a "private device" concept that is being developed. Dan Williams said that the entire security model around private devices is that the device sets a "trusted bit" — to general laughter in the room.

Hildenbrand said that private memory differs from the usual variety in numerous ways. It cannot be mapped, read from, or written to by the host, and often can be managed with small folios only. Jason Gunthorpe noted that every architecture implements private memory differently. Moving memory between the private and shared states, Hildenbrand said, can be challenging, and often can double the memory consumption of the guest. Conversion between types is usually done on individual 4KB base pages, splitting up huge pages.

Current upstream work, he said, is aiming to integrate the concepts of both private and shared memory within guest_memfd; that would facilitate conversion between the two types. Fuad Tabba has been doing some work in this area. Getting there requires support for host-side memory mapping in guest_memfd; it would allow the host to easily access shared pages, but the host will still get a bus error if it attempts to access a private page.
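
The user-space-visible pieces of guest_memfd are in the mainline KVM API now; a minimal sketch of creating one and binding it to a memory slot looks something like this (error handling omitted; vm_fd, mem_size, and shared_buf are assumed to exist):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    struct kvm_create_guest_memfd gmem = { .size = mem_size };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    struct kvm_userspace_memory_region2 region = {
        .slot               = 0,
        .flags              = KVM_MEM_GUEST_MEMFD,
        .guest_phys_addr    = 0,
        .memory_size        = mem_size,
        .userspace_addr     = (__u64)shared_buf, /* backs shared pages */
        .guest_memfd        = gmem_fd,           /* backs private pages */
        .guest_memfd_offset = 0,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);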

Converting pages from private to shared is always possible, Hildenbrand said, but the other way is harder. It is important to avoid having any private pages mapped into user space on the host, so the host must take pains to ensure that there are no unexpected references to any pages that are about to be made private. That is easy with small folios, because there is a single reference count to check, but the situation is more complicated with huge pages.

There is some work underway, he said, by Ackerley Tng and Vishal Annapurve to improve huge-page support. It will allocate memory from the hugetlb subsystem, but then convert it into normal folios that can be mapped or split, if need be, to change smaller pieces between shared and private. Once the memory is freed, the huge folio is reconstructed and returned to hugetlb.

Lorenzo Stoakes asked what the use case was for converting private memory back to shared; the answer is that VMs do need to make memory available to the hypervisor, often for device I/O. The discussion wandered for a while after that, including a suggestion that hugetlb should eventually be removed — a task that is being worked on.

In the absence of hardware support, guest_memfd can still support some device privacy by removing the private memory from the host's direct map, leaving the host with no way to address it. The result is not really confidential, but it provides some protection, Hildenbrand said. There is a problem, though: the memory-management developers do not want to expose the APIs that modify the direct map to loadable modules. But guest_memfd is implemented within KVM, which can be built as a loadable module. There were some suggestions of using the restricted namespaces feature to limit access to this API. Restricted namespaces have not yet found their way into the mainline, though.

As the session ran out of time, Hildenbrand said that there would eventually need to be some sort of callback that could intercept folio-freeing operations. If the folio being freed has been shared out of a guest_memfd, the kernel will have to put it back where it came from, rather than making it generally available. This interception is currently done by testing for a specific folio subtype, but there are locking-related problems with that solution.

Comments (2 posted)

Three ways to rework the swap subsystem

By Jonathan Corbet
April 7, 2025

LSFMM+BPF
The kernel's swap subsystem is complex and highly optimized — though not always optimized for today's workloads. In three adjacent sessions during the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Kairui Song, Nhat Pham, and Usama Arif all talked about some of the problems that they are trying to solve in the Linux swap subsystem. In the first two cases, the solutions take the form of an additional layer of indirection in the kernel's swap map; the third, which enables swap-in of large folios, may or may not be worthwhile in the end.

Simplifying the swap subsystem

There are some good things about the kernel's swapping code, Song began. Since swapping is done to make memory available to the kernel, it must use as little memory as possible itself; the swap code manages to get by with a single byte of overhead per page of swap space. Most swapping operations are fast and lightweight; swap entries point directly to physical locations. The per-CPU swap-cluster design manages to avoid contention in almost all cases.

On the other hand, the swap system is complex. To show just how complex, Song put up this slide:

[Swap system diagram]

There are a lot of components with complex interactions, he said. New features are hard to add to this subsystem; the long (and ongoing) effort to add swapping for multi-size transparent huge pages (mTHPs) is one case in point. The whole thing is built around the one-byte-per-page design for the swap map, meaning that there is no space to store anything beyond a reference count and a pair of flags. Various optimizations have been bolted on over the years, increasing the complexity of the system.

As an example, he mentioned the SWP_SYNCHRONOUS_IO optimization, which was added in 4.15 by Minchan Kim. When the kernel is swapping from a fast, memory-based device like zram, any extra latency hurts; in such cases, the kernel can simply bypass most of the swapping machinery and copy the data directly, preserving larger folios as well. Song said that this is a nice optimization, but there are now four different ways to bring in a folio from swap. He has tried to unify them all, but that work failed due to performance regressions.

The distinction between the swap map and the swap cache adds complexity as well. The swap map is the one-byte-per-slot array that tracks the usage of swap slots in a swap device. It is a global resource, requiring locking for access, so it can be a contention point on systems that are doing a lot of swapping. The swap cache, instead, is a per-CPU data structure that holds a set of swap slots allocated in bulk from the swap map; it allows many swap-related actions to be done locklessly, and enables batching of swap-map changes when a lock must be taken. When a swap-map entry is in some CPU's cache, the SWAP_HAS_CACHE bit is set in the swap map to indicate that some CPU owns the entry. But, Song said, that bit has acquired other meanings over time, again making it harder to make changes to the swap machinery.

Any redesign of this system, he said, is destined to use more memory; the one-byte design of the swap map just does not allow for much flexibility. There is, however, memory that could be used for this purpose. The swap cache currently uses eight bytes per slot, and control groups can add another two (duplicating some data in the process), so the actual memory consumption for swapping can be eleven bytes per slot. If that memory were to be repurposed, the swap map could be transformed into a "swap table" with eight bytes per entry, which would be enough for everything that he has in mind. Swap entries would still be managed in clusters, he said, and the total memory use of the swap subsystem should drop as some of the existing complexity is removed.

Song's proposed layout for this swap table (taken from his proposal for the session) looks like this:

    | -----------    0    ------------| - Empty slot
    | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
    | SWAP_COUNT |------ PFN -----|X10| - Cached
    |----------- Pointer ---------|100| - Reserved / Unused yet

There would be a table for each swap cluster, spreading out the accesses and mostly eliminating locking contention; that, in turn, should allow the elimination of the separate swap cache. The eight-bit SWAP_COUNT is the same reference count that is kept in the current swap map, but it no longer needs to dedicate a couple of bits to flags like SWAP_HAS_CACHE. This design, he said, resolves many of the problems with the current swapping subsystem, and performs better as well, yielding a 10-20% improvement in the all-important kernel-build benchmark. There is only one swap-in path, and it never bypasses the table. Memory usage is lower, he said, and this design allows for the removal of a lot of complexity from the swap subsystem.
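
Reading that layout literally, the low bits of each eight-byte entry serve as a type tag, with the remaining bits interpreted accordingly; the helpers below are a purely illustrative sketch (the exact field widths and shifts are this example's invention, not the proposal's):

    /* Illustrative only; bit positions beyond the tag are made up. */
    typedef u64 swp_te_t;

    static inline bool swp_te_is_empty(swp_te_t te)
    {
        return te == 0;
    }

    static inline swp_te_t swp_te_swapped_out(u8 count, u64 shadow)
    {
        return ((u64)count << 56) | (shadow << 1) | 0x1;   /* ...XX1 */
    }

    static inline swp_te_t swp_te_cached(u8 count, u64 pfn)
    {
        return ((u64)count << 56) | (pfn << 2) | 0x2;      /* ...X10 */
    }

    static inline swp_te_t swp_te_pointer(void *p)
    {
        return (unsigned long)p | 0x4;  /* ...100; needs 8-byte alignment */
    }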

A participant asked how the new design could be faster in the absence of the bypassing optimization currently used for zram. The answer was that the unification of the swap map and the swap cache means that there is no need to check or maintain both, making the swap subsystem as a whole faster.

Future steps, Song said, include the addition of a virtual swap device that has no storage associated with it. Instead, it contains only entries pointing to slots in other swap devices. This new layer of indirection is intended to facilitate the intact swapping of larger folios, which currently becomes difficult when the swap devices become fragmented. The virtual device could be much larger than the physical swap space, making it resistant to fragmentation.

Time was running out, so Song concluded with an idea to consider further in the future: swap migration. The list of swap clusters already works as a sort of least-recently-used (LRU) list, he said, so it could be used as a way of detecting folios that have been swapped out for a long time so that they could be moved to cheaper storage. Perhaps compaction could be performed at the same time. There was no time for the discussion of this idea, though.

Virtual swap devices

The concept of a virtual swap device returned in the next session, though, as the topic that Pham wanted to talk about. His original motivation, he said, was to separate the zswap compressed swap cache from the rest of the swap subsystem. There is heavy use of zswap at his employer (Meta), which is good, but the current design of the swap layer requires that there be a slot assigned in an on-disk swap device for every page that is stored in zswap. That disk space will never be used and is thus entirely wasted; he has seen hosts running with an entirely unused, 100GB swap file. In an environment where hosts can have terabytes of installed RAM, it just is not possible to attach (and waste) sufficiently large drives for swap files.

Once a swap area has been added to the system, its size is fixed; the only way to increase swap space is to add another swap area. That is a slow operation, though, that a heavily loaded production system cannot afford, so Meta has to provision a suitably sized swap file ahead of time for each host type. There have been ongoing problems with machines running out of memory just because the unused swap device is "full". Pham appeared to be of the opinion that this was not an optimal way to run things.

The problem, he said, is the tight coupling between swapped pages and the backing store behind them. The page-table entry for a swapped page points to the physical location for its data. It is, he said, a design oriented toward the sort of two-tier swapping system that was common some time ago. Beyond capacity problems, this design leads to other challenges; if, for example, zswap rejects a page, its page-table entry has already been changed and recovery is difficult.

Solving this problem, he said, requires decoupling the various swap backends. A page stored in zswap should not take space in the other backends — unless that has been dictated by policy, as can happen with write-through caching. The system needs to be able to support multi-tier swapping; that would also help with the addition of new features, such as discontiguous swap-space allocation for large folios, or swapping in folios at different sizes than they were at swap-out time.

Thus, he is proposing the implementation of a virtualized swap subsystem, providing swap space that is independent of whichever backend any given page is stored to. Each swapped-out page is assigned a virtual slot; that is what is stored in the page-table entry. Virtual slots can then be resolved to a specific backing store as needed, where that backing store could be zswap, a disk drive, a cache like zram, or something else. Such a system would eliminate the wasted space problem and allow pages to be moved between backends without having to change all of the references to them. That would make it easy for zswap to write pages back to another device, for example; it would also make removing a swap device much easier than it is now.

He has a working prototype now, he said, that adds two new swap maps. There is a forward map that turns a virtual swap slot into a swap descriptor describing the actual placement of a page; it uses the XArray data structure, so lookups are lockless. The reverse map turns a physical swap slot into a virtual slot; that is useful to support cluster readahead or the movement of pages between backends. The metadata for a swapped page is placed in a dynamically allocated swap descriptor that is stored in the forward map.
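
A hedged sketch of what the forward map could look like on top of the kernel's XArray API follows; the descriptor contents and all names are invented for illustration:

    /* Maps a virtual swap slot to a descriptor; names are invented. */
    static DEFINE_XARRAY_ALLOC(vswap_map);

    struct swp_desc {               /* hypothetical descriptor */
        int backend;                /* zswap, zram, disk, ... */
        unsigned long location;     /* backend-specific slot */
    };

    /* Assign a virtual slot to a page being swapped out. */
    static int vswap_alloc(struct swp_desc *desc, u32 *slot)
    {
        return xa_alloc(&vswap_map, slot, desc, xa_limit_32b, GFP_KERNEL);
    }

    /* Resolve a virtual slot; xa_load() is RCU-protected and lockless. */
    static struct swp_desc *vswap_lookup(u32 slot)
    {
        return xa_load(&vswap_map, slot);
    }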

The prototype is getting close to the point where he can post it, he said. It is a big change, though, and he is worried about how he will be able to land it. Johannes Weiner suggested that it could perhaps operate in parallel with the existing swap subsystem until the performance is shown to be at least as good.

At the end, a participant asked whether this system would be able to swap in a single page from a larger folio that has been swapped out; Pham said that he has considered that use in the design. Matthew Wilcox asked whether the virtual swap space would be used sparsely or densely; Pham, like Song, said that a large, sparse virtual space would be better for fragmentation avoidance.

Pham has posted the slides from this session.

Large-folio swap-in

The final episode of the swapping trilogy began with Arif reminding the crowd of the advantages of using large folios. They allow for better translation lookaside buffer (TLB) usage, reduce the number of page faults the system must handle, shorten LRU lists, and allow page-table entries to be manipulated in batches. Large folios often do not survive their encounter with the swap subsystem, though; they end up being split into base pages. Arif was there to talk about how the swap subsystem might be improved to better handle larger folios.

He mentioned work done by Ryan Roberts around a year ago to enable swapping out mTHPs without splitting them. That helped to avoid the cost of splitting these folios and to avoid fragmenting memory. Work has been done to store large folios to zswap, and to be able to bring in large folios from zram. Compression of large folios in zram (which yields better compression) is being worked on, but has not been merged yet. One problem with compressing large folios, though, is that swapping in a single base page from that folio becomes difficult — the entire folio must be decompressed to make that base page accessible.

Arif's large-folio swap-in work builds on these previous efforts. Specifically, at swap-in time, it checks to see whether the swap count is one (meaning there is a single reference to the page) and whether the page is in zswap. If so, the swap cache is skipped entirely, and zswap will be checked to see if it holds a larger folio containing the page in question. If it does, the folio will be swapped in one page at a time and assembled into a proper folio at the end.

This patch speeds up 1MB-folio swap-in by 36%, but it also slows kernel builds. It resulted in an overall increase in zswap activity, with a lot of thrashing and folios being swapped in and out repeatedly. Thus, he concluded, perhaps large folios are not good for all workloads; would the group be happy with a change that yielded such different results for different workloads? Some workloads benefit; Android, for example, swaps out background tasks entirely and does not see this performance regression. Perhaps a control knob could be provided to tell the system whether to swap in large folios from zswap, but most users never change these knobs and would not see the benefit. Perhaps this behavior could be switched off automatically if the refault rate is seen as being too high.

Another alternative would be to combine large-folio swapping with large-folio compression; that might offset the regression with kernel builds, he said. But the inability to swap in base pages out of large folios could get in the way here.

As the session ran down, he wondered if there was a need for large-folio swap-in at all. Perhaps the system should continue to swap in base pages and let the khugepaged thread reassemble larger folios afterward. Wilcox said that there is a need to gather more statistics to understand what is really going on here. At this point, the topic was swapped out for something entirely different.

Comments (51 posted)

Per-CPU memory for user space

By Jonathan Corbet
April 8, 2025

LSFMM+BPF
The kernel makes extensive use of per-CPU data as a way to avoid contention between processors and improve scalability. Using the same technique in user space is harder, though, since there is little control over which CPU a process may be running on at any given time. That hasn't stopped Mathieu Desnoyers from trying, though; in the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, he presented a proposal for how user-space per-CPU memory could work.

Desnoyers started by saying that his objective is to help user-space developers make better use of restartable sequences, which facilitate some types of access to per-CPU data by interrupting a process if it is migrated during a critical section. User-space applications generally use thread-local storage for this kind of code, but that becomes inefficient if there are more threads running than CPUs to run on. Thread-local storage must also be defined statically, making it inflexible to work with, and it can slow down thread creation if the area is large.

So he would like to provide true per-CPU data as an alternative. One way not to do that, he said, is to structure per-CPU data in an array indexed by the CPU number. The implementation would be relatively simple; code could just obtain its current CPU from sched_getcpu() or from the restartable-sequences shared-memory area. But if the array is packed, the result will be cache-line bouncing between CPUs, eliminating much (or all) of the performance benefit. If, instead, array entries are aligned to cache lines, there will likely be a lot of wasted space between them.
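
A minimal sketch of that naive approach shows the dilemma; here the alignment is explicit, and the update is racy without restartable sequences (MAX_CPUS and bump() are invented for the example):

    #define _GNU_SOURCE
    #include <sched.h>          /* sched_getcpu() */
    #include <stdalign.h>

    #define MAX_CPUS 512        /* illustrative */

    /* alignas(64) prevents cache-line bouncing, but wastes most of
     * each 64-byte line; a packed array would bounce instead. */
    struct counter { alignas(64) unsigned long value; };
    static struct counter counters[MAX_CPUS];

    static void bump(void)
    {
        int cpu = sched_getcpu();   /* may be stale by the next line */
        counters[cpu].value++;      /* racy if the thread migrates */
    }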

The kernel's per-CPU allocator, he said, maps a range of address space on each CPU to provide access to that CPU's local memory space; allocations just return a single address that can then be used on all CPUs. He has implemented a similar approach in the librseq library. The allocator creates a set of memory pools, one per CPU; allocations then return an offset that is the same on every CPU. It is essentially the cache-line-aligned array approach, but the allocator packs allocations within each CPU's area, reducing the wasted memory between those areas. It can support multiple pools, thus isolating users from each other.

The kernel's memfd feature is used to create the per-CPU memory pool. It is about the only way, he said, to create the various mappings of the same area that the feature needs.

There are some potential problems with this approach, though. What if a four-thread process is running on a system with 512 CPUs? Allocating and initializing memory for all of those CPUs would be wasteful, since most of it would never be used. So, instead, the library initializes one CPU's worth of memory in a special "initialization area", then creates a copy-on-write mapping of that area for each CPU. Any CPU reading from its area will read from that single copy; if a CPU writes to its area, the page will be copied and will become truly CPU-local.
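
That trick falls naturally out of memfd semantics, since a MAP_PRIVATE file mapping shares pages until the first write; a rough sketch, assuming a per-CPU stride chosen by the allocator (error handling omitted):

    #include <sys/mman.h>       /* memfd_create(), mmap() */
    #include <unistd.h>         /* ftruncate() */

    int fd = memfd_create("percpu-pool", MFD_CLOEXEC);
    ftruncate(fd, stride);      /* one CPU's worth: the initialization area */

    /* Shared view, used once to set up the data. */
    void *init = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* Per-CPU copy-on-write views: reads share the initialized pages;
     * a CPU's first write to a page gives it a private copy. */
    for (int cpu = 0; cpu < ncpus; cpu++)
        area[cpu] = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);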

Another concern relates to what happens when a process forks; the per-CPU area will be shared across the fork, which may not be what is wanted. He is considering adding a memfd flag (MFD_PRIVATE) that would make a memory area per-process; a fork would then result in the child getting a separate copy of that area. For now, he is using an "inconvenient workaround" consisting of a couple of madvise() operations to detect and handle forks. As part of that, the library maps a special "canary page" that is set to be cleared when a fork happens; the contents of that page can then be checked to detect forks.
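
One plausible construction of that workaround uses MADV_WIPEONFORK, which tells the kernel to zero a private anonymous mapping in the child at fork time (whether librseq does exactly this was not specified in the talk):

    /* Map a canary page that the kernel will zero in any forked child. */
    long *canary = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(canary, 4096, MADV_WIPEONFORK);
    *canary = 1;

    /* Later, on the library's fast path: */
    if (*canary == 0) {             /* we are in a forked child */
        reinit_percpu_pool();       /* hypothetical recovery step */
        *canary = 1;
    }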

In the future, he is considering adding the ability to allocate variable-sized elements within the per-CPU memory pool. The placement of guard pages between each CPU's area would prevent some cache-line bouncing caused by one CPU prefetching into another CPU's area. He also has thoughts about improving the control-group CPU controller to allow for maximum-concurrency limits, which would make it possible to put tighter limits on the number of entries needed in the pool.

Desnoyers concluded his presentation by returning to the MFD_PRIVATE idea, which he sees as the best way of solving the fork problem. This feature would be useful in other contexts, he said. The MESH allocator needs this kind of feature, as do Google's dynamic-analysis tools. David Hildenbrand said that MFD_PRIVATE could be a reasonable addition, but thought that its use should also imply behavior like that obtained with MADV_WIPEONFORK, where the memory is zeroed when a fork happens. Desnoyers answered that this behavior might not be wanted; a child process could still make use of the per-CPU data from the parent, but would have its own copy for any changes it made.

Suren Baghdasaryan commented on the possible use of guard pages to prevent cross-CPU prefetching, noting that this behavior is architecture-dependent. He wondered if Desnoyers has considered how this work interacts with cpusets. Desnoyers said cpusets and per-CPU memory do work together, but there are some challenges. Since his library will not get CPU-hotplug notifications, it has to be ready for unexpected changes in the CPU topology.

Hildenbrand asked how processes can be sure that the CPU does not change underneath them while accessing per-CPU data; Desnoyers answered that restartable sequences are the usual way to do that. I followed up, asking whether restartable sequences were the only safe way to work with this memory; he said that there are other options, including atomic operations or rseq locks.

The session concluded at that point. Desnoyers has not posted the slides from this session, but the slides from his February FOSDEM talk cover the same points.

Comments (4 posted)

Using large folios for text areas

By Jonathan Corbet
April 8, 2025

LSFMM+BPF
Quite a bit of work has been done in recent years to allow the kernel to make more use of large folios. That progress has not yet reached the handling of text (executable code) areas, though. During the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Ryan Roberts ran a session on how that situation might be improved. It would be a relatively small and contained change, but one that can give a measurable performance improvement.

Roberts began by saying that his objective is to make a kernel built for a 4KB page size perform as well as one built for 64KB pages, at least when it comes to the handling of text. By mapping larger folios by default, he said, the kernel could take advantage of translation lookaside buffer (TLB) coalescing on newer processors and reduce memory-management overhead. The kernel's ability to use large folios has improved considerably; both anonymous memory and the page cache can work with them now.

Text pages are managed through the page cache too, but the way they are accessed tends to prevent the use of large folios. When a file is being read from user space, the kernel's readahead system will detect sequential access and allocate large folios as data is read into the page cache. Execution tends not to be sequential, though; instead, it bounces randomly around the text section. As a result, sequential access is not detected, and the kernel, seeing random access, sticks with smaller folios. But, Roberts said, large (64KB) folios could be used for text without significantly increasing memory consumption.

David Hildenbrand said that various people have tried using larger folios for text, but have not gotten as much of a performance improvement as had been expected. He wondered what sort of improvement Roberts expected from this kind of change. Roberts answered that he would get to the results, but that the short answer was that it depends on the workload, with some workloads seeing big improvements.

For file readahead, Roberts continued, the kernel will bring in some data synchronously, then speculatively start an asynchronous read further ahead in the file. If that data is eventually used, the readahead size (and the size of folios used for that data) will be increased. For text, though, it is unusual for that asynchronous area to be accessed quickly, so everything ends up in small folios. A better approach, he said, would be to say that the asynchronous readahead just is not useful for text areas. Instead, the kernel could simply read the 64KB folio around the fault, without the speculative read beyond that folio. Then, he said, most text would end up in larger folios, which would be mapped together in the fault handler; that, in turn, makes it easy to set the page-table-entry bit needed for TLB coalescing on Arm systems.
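
In kernel-pseudocode terms, the idea amounts to a special case along these lines (a sketch only; Roberts's actual patches differ in detail):

    /* On a fault in an executable mapping, read one aligned 64KB
     * folio synchronously and skip speculative readahead entirely. */
    if (vma->vm_flags & VM_EXEC) {
        ra->start = round_down(vmf->pgoff, SZ_64K / PAGE_SIZE);
        ra->size = SZ_64K / PAGE_SIZE;
        ra->async_size = 0;         /* nothing speculative for text */
    }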

A participant asked whether it would be better to just use large folios unconditionally throughout the page cache; Roberts answered that the readahead code gets to that point anyway when large folios appear to make sense. Shakeel Butt pointed out that a lot of applications are built with their binaries organized into sections; perhaps there would be some useful hints there? Matthew Wilcox said that the compilers don't leave that information in the resulting binary, so those hints are not really available.

Wilcox went on to say that he had learned a lot about the readahead code from the presentation — leading Roberts to interject that Wilcox had written the readahead code. Wilcox said that the proposal makes sense in general, that the kernel should use 64KB pages for text. The readahead code, he said, is optimized for data, but text does not behave like data.

Roberts moved on to the performance results, making it clear that he was only showing tests that improve with the new behavior. There were no performance regressions, though, just some workloads that did not show any difference. Overall, he said, most workloads saw a 4-8% performance improvement; that is less than the 12% that comes from going to a 64KB page size overall, but still worthwhile.

He offered a few options for how this feature could be controlled, with the first being that each architecture would provide a preferred folio size for text mappings. The readahead code would gain a special case for memory areas that are mapped with execute permission; it would just perform a 64KB synchronous read in that case. If any of the pages in that folio are already in the page cache, though, then smaller reads would be performed. This solution would be entirely contained within the kernel, he said. He posted an implementation of this option in February 2024, but received an objection that architectures should not be setting the folio size, so this work went cold.

The second option would be to add a sysfs knob to allow the administrator to set the preferred folio size for text. He expressed a lack of enthusiasm for more knobs, though. The third option would be for the dynamic linker to make this decision at load time; a process_madvise() call could be made to inform the kernel of its decision. This moves the responsibility to user space and, he said, would create a new ABI, so he thought that this option was best avoided.

Hildenbrand asked whether the khugepaged kernel thread could assemble larger folios after the fact; Wilcox said that it can do that now, but it tends not to run when it would be most useful. It would be far better, he said, to create the larger folios from the beginning. Hildenbrand asked if it would make sense to use a larger size, perhaps even 2MB, but Wilcox said that most executable segments are smaller than that.

The fourth option, Roberts said, was not the right solution, but it had been raised on the list: the filesystem holding an executable could set the folio size. It could use the same infrastructure as the recently added large-block-size support. But, he said, there is no way for a filesystem to know which files should receive this treatment, or what a reasonable value would be. A filesystem-set size would also apply to the entire file, not just the text segments within it. As an extra bonus, he said, this option would have to be implemented in every filesystem separately.

The session closed with a suggestion from Wilcox that the first option should be implemented. The others, he said, can always be added later if they seem to make sense. Roberts has subsequently posted a new version of this work.

Comments (2 posted)

Two approaches to better kernel samepage merging

By Jonathan Corbet
April 9, 2025

LSFMM+BPF
The kernel samepage merging (KSM) subsystem works by finding pages in memory with the same contents, then replacing the duplicated copies with a single, shared copy. KSM can improve memory utilization in a system, but has some problems as well. In two memory-management-track sessions at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Mathieu Desnoyers and Sourav Panda proposed improvements to KSM to make it work better for specific use cases.

Supporting user-space text patching

Desnoyers has come to KSM to help with a seemingly unrelated problem that he is working on: code patching for user-space processes. He works with instrumentation like the LTTng tracing framework, which allows the placement of tracepoints within an application. In the current implementation, each tracepoint has a controlling variable and a branch to determine whether the tracepoint fires. Some of his customers, he said, have applications with 30,000 tracepoints in them; at that scale, the extra overhead for each tracepoint adds up. He would like to improve this situation by using code patching, as is done in the kernel now. There are other use cases, including tuning code for available features or selecting application features to enable, that would benefit from code patching as well.
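
The per-tracepoint cost comes from a pattern along these lines; this is a generic sketch rather than LTTng's actual macros:

    /* One enablement flag per tracepoint, toggled at run time. */
    extern volatile unsigned char myevent_enabled;

    #define TRACE_MYEVENT(args...)                          \
        do {                                                \
            if (__builtin_expect(myevent_enabled, 0))       \
                myevent_fire(args);   /* out-of-line slow path */ \
        } while (0)

Multiplied across 30,000 sites, the load and branch at every call site add up; patching the call sites directly would eliminate that per-site cost.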

Since this is performance-oriented work, he is concerned about the code patching creating new performance problems of its own. Specifically, patching code by writing to a text page will cause that page to be copied; code that was once shared no longer is. That will be the case even if all of the processes sharing that page of text patch it in the same way. KSM can perhaps help to undo this duplication, but its page scanning brings overhead of its own, which Desnoyers would like to avoid.

Another problem is that KSM is focused on deduplication on systems hosting virtual machines. It requires configuration by the system administrator to work effectively, and brings security concerns. He would rather have a simpler solution that just works.

He had proposed such a solution prior to the Summit, but had been told by Linus Torvalds that there is not room in the kernel for two implementations of KSM. He is not looking to replace the current KSM implementation, so that has led him to start thinking about other approaches.

One possibility, he said, would be to add the concept of per-user file overlays. The kernel's uprobes mechanism can patch a running process now, but it changes all processes in the system that are running the targeted code, while he would like to limit the effects to a single process. So he is thinking he could add a new system call for code patching that would create a new overlay, tracking each user's changes to a given binary. It would apply on top of the files an application uses, and changes would apply immediately to all processes (owned by that user) that are running the affected program. The downside of this approach, he said, is that it would make it hard to instrument different parts of a process hierarchy differently.

So a better solution might be to create a text_poke() system call that would be provided a vector of instructions to patch. The kernel would track the altered pages for each address space (mapped file) at several levels — altered pages can be further altered later on. Whenever a process modifies one of its pages in this way, the kernel would attempt to find other altered copies of the same page and, if it finds one containing the same alterations, the two users would share the page. The altered pages would be cached even if all users exit, meaning that the patched pages would persist for short-lived applications that will want them again in the future; they could be reclaimed as needed when memory gets tight.
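
No concrete interface was presented; as an entirely hypothetical sketch, a vectored patching call might look like this:

    /* Hypothetical; nothing like this exists today. */
    struct text_patch {
        void       *addr;       /* instruction address to patch */
        const void *insn;       /* replacement bytes */
        size_t      len;
    };

    /* Apply 'count' patches to the calling process's text, sharing
     * identically patched pages with other processes where possible. */
    int text_poke(const struct text_patch *patches, size_t count,
                  unsigned int flags);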

It is fair to say that this idea did not evoke a great deal of enthusiasm in the room.

Matthew Wilcox asked if Desnoyers was familiar with the "reflink" concept, which he described as a sort of copy-on-write hard link for files. There have been efforts over the years to create a generalized reflink capability for Linux without success, but some filesystems implement that functionality internally. Wilcox suggested that code patching could act like reflink under the covers without exposing the changed files to user space. When code is patched, he said, a new inode could be created for the altered file and stored in the associated virtual memory area.

The hard part, he said, is that the kernel does not have an efficient way to cache reflinked files. Also, the same (unchanged) page in two reflinked files will be stored as separate pages in the page cache; fixing that has been on the wishlist for years. Desnoyers asked how it might be possible for the kernel to map a list of modifications to the correct inode; Wilcox pondered for a moment, then answered: "Oh well, it was a nice idea".

David Hildenbrand, though, expressed interest in pursuing the reflink idea further. What is needed, he said, is a high-level description of the changes to be made. The kernel can then generate a new file from that list on demand, and reclaim it when needed. The idea sounds easy and clean, he said, "except it won't be easy". The session concluded with Wilcox saying that it was an interesting problem.

Selective KSM

In the following session, Panda briefly presented two proposals to address some of the problems with KSM. The feature is useful, he started, but it requires a lot of adjusting of parameters to work well, adds run-time overhead for the page scanning, and has been seen as a security problem as well.

The first idea is "synchronous KSM", where the merging of pages would be directed synchronously by user space. The merging of pages would only happen when requested (and the time taken would be charged to the process requesting it), and only the specific memory areas indicated would be considered for merging. The actual request could be made by way of sysfs, madvise(), or some other system call. Security would be improved, since the caller has control over which pages are considered for merging, and CPU efficiency would be improved over the background scan that KSM currently uses. The biggest limitation would be that, once two pages diverge from each other, they will stay separate, even if they come to have the same contents in the future.

The second proposal is "partitioned KSM", where processes would be divided into sensitive and non-sensitive partitions. This partitioning would be controlled via a sysfs hierarchy; new partitions can be created as needed. Merging would be controlled by writing a process ID and an address range to a partition's control file; the kernel would add the process to the partition, then synchronously scan the given address range for merge candidates. Hildenbrand said that this idea is similar to using madvise(MADV_MERGEABLE) to control merging, except that it acts synchronously. He suggested using process_madvise() rather than sysfs to control this feature.

An alternative, Panda said, would be to create a new ksm_open() system call that would accept the name of a partition to join and return a file descriptor representing that partition. There would be a ksm_merge() to request the merging of duplicate pages within that partition. Other system calls would be added to undo the merging of pages or detach from the partition entirely. Hildenbrand said that dropping the current KSM implementation is not an option, so a mechanism that simply adds partitions is potentially interesting.
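
In outline, the file-descriptor variant would be used something like this (all of the calls are hypothetical, following the description above):

    int fd = ksm_open("background-jobs");  /* join (or create) a partition */

    ksm_merge(fd);      /* synchronously deduplicate within the partition */
    /* ... */
    ksm_unmerge(fd);    /* undo the merging */
    close(fd);          /* detach from the partition */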

As the session (and the day) came to an end, Panda asked whether any such feature should be configurable at compile time; Michal Hocko advised against that, since KSM is already an opt-in feature. He said that he likes the file-descriptor idea, which provides a clear namespace for KSM operations. Hildenbrand said that the global KSM functionality could remain too, it would just have to be carefully disabled for any process that joins a partition.

Comments (2 posted)

Improving hot-page detection and promotion

By Jonathan Corbet
April 9, 2025

LSFMM+BPF
Tiered-memory systems feature multiple types of memory with varying performance characteristics; on such systems, good performance depends on keeping the most frequently used data in the fastest memory. Identifying that data and placing it properly is a challenge that has kept developers busy for years. Bharata Rao, presenting remotely during a memory-management-track session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, led a discussion on a potential solution he has recently posted; Raghavendra K T was also named on the session proposal. It seems likely, based on the discussion, that developers working in this area will not run out of problems anytime soon.

There are two aspects to the memory-promotion problem: detecting which pages should be moved, and actually migrating them. On the detection side, there are a number of sources for data that can be used to detect hot (frequently accessed) pages; Rao is mostly focused on approaches that involve scanning page tables to see which pages have been accessed recently. He thinks that the current scanning implementation needs an overhaul; it is tied to the NUMA-balancing code, and its operation tends to create latency spikes for applications. Since both the page-table-entry (PTE) scanning and page migration are done in process context, they interrupt user-space execution in potentially disruptive ways.

Additionally, he said, there is an increasing number of ways to detect memory access — new data-temperature sources are becoming available. That suggests a need for a way to gather this data together and act on it in one place. His proposal is a new kernel thread, kmmscand, that would take over this task. It would maintain a list of process address spaces, then perform "A-bit" scanning (clearing the "accessed" bit in the PTEs, then scanning later to see which PTEs have had the bit set, indicating an access in the meantime) on each. The results of the scan are then used to create a list of pages to promote to faster memory.
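
The heart of A-bit scanning is a small test-and-clear operation on each page-table entry; in kernel terms, it is something like the following (the accounting helper is invented, and locking and batching are omitted):

    /* An earlier pass cleared the accessed bits; if a bit is set
     * now, the page was touched in the interval since then. */
    if (ptep_test_and_clear_young(vma, addr, ptep))
        kmmscand_note_access(mm, addr);     /* hypothetical accounting */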

In Rao's implementation, the migration task is separated into its own thread that runs independently of the scanning thread. The performance impact of the scanning can be regulated by adjusting the rate at which PTEs are scanned or by exempting some processes from scanning altogether. There is still room for improvement, though, he said. In particular, there is always a need for better hot-page detection, and there is a need for throttling and batching of the migration work as well.

A participant asked how often address spaces are scanned; Rao answered that, initially, a scan is performed every two seconds. Over time, the rate is adjusted according to conditions; it can end up anywhere between 0.5 and five seconds.

Another participant said that effective scanning requires flushing the translation lookaside buffer (TLB); otherwise, cached TLB entries will short-circuit the address-translation process that sets the A bit in the first place. TLB flushes, of course, can hurt performance. Rao answered that the current implementation promotes pages the first time an access is observed, so missing data on subsequent accesses is not a problem. SeongJae Park said that DAMON, which also performs this sort of scanning, is not currently doing TLB flushes. That drew another question about whether this scanning should just be integrated into DAMON rather than being implemented as a separate thread; Rao answered that integrating the systems is something he will be looking into. Davidlohr Bueso pointed out that enterprise distributions do not normally enable DAMON, but Park answered that Red Hat and Debian are indeed enabling it.

Gregory Price said that the scanning thread will have to hold locks while passing over the page tables, and asked whether that has been observed to cause latencies for the processes being scanned. Rao said that the overall result of his implementation is a significant reduction in latencies experienced by applications, but did not address the locking issue specifically. Jonathan Cameron asked how the scanning of huge pages was handled; Rao said that the thread is just scanning PTEs and not doing anything special for huge pages.

Another participant worried about the cost of scanning especially large address spaces and asked if the overhead had been measured. Rao answered that there is still a need to understand the full cost of A-bit scanning. If it turns out to be a problem, there are optimizations, such as scanning at the PMD level first before dropping down to the PTE level in hot areas, that can be considered. There have been some experiments done with 64GB address spaces, he said, that have shown an improvement over current kernels.

A remote participant pointed out that the multi-generational LRU also optimizes its scanning by looking at higher-level page tables first; among other things, that helps it to avoid scanning unmapped memory. Cameron pointed out that scanning the higher-level tables will only be useful if the TLB is regularly flushed. But Yuanchu Xie said that the multi-generational LRU scanning seems to be optimized well, and is integrated with the reclaim system as well. Rao agreed that this avenue needs more exploration.

At the end of the session, Rao raised one last problem: while A-bit scanning can identify hot pages, it provides no information about which NUMA nodes have accessed any given page. As a result, there is no way to know which node a hot page should be promoted to. There needs to be a way to maintain home-node information for pages, he said. Alternatively, the system could scan pages in the fastest tiers to see where the pages in a given address space are clustered now, then promote other pages to the same nodes. The group seemed to think that this could be an interesting heuristic to explore.

Rao has posted the slides from this session.

Comments (5 posted)

A strange BPF error message

By Daroc Alden
April 4, 2025

LSFMM+BPF

Yonghong Song brought a story about tracking down the cause of a strange verifier error message to the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. He then presented some possible ways to improve Clang's user experience for anyone running into the same class of error in the future. Toward the end of his allotted time, he also discussed the problems with optimizations that change the signature of functions — a problem that José Marchesi had also brought up in the previous session.

An unhelpful error

Song started by presenting an example taken from a real program. The example was a bit dense, but the problem basically comes down to this code:

    static __always_inline int ipv6_hdrlen_offset(__u8 *nexthdr);

    bool icmp6_ndisc_validate() {
        __u8 nexthdr;  /* never initialized before the call below */
        // ...
        int offset = ipv6_hdrlen_offset(&nexthdr);
        // ...
    }

    static __always_inline int ipv6_hdrlen_offset(__u8 *nexthdr) {
        __u8 nh = *nexthdr;  /* reads the caller's uninitialized variable */
        // ...
        switch (nh) {
        case NEXTHDR_NONE:
            return DROP_INVALID_EXTHDR;
        // ...
        }
    }

The code features an uninitialized variable (nexthdr) that was passed by reference into another function. This is not invalid in C, because the other function might initialize the variable by writing to it, so Clang doesn't issue a warning. In this case, though, ipv6_hdrlen_offset() does not initialize it, and instead reads from it in order to decide which branch of a switch statement to take. Clang doesn't warn in that function either, because it assumes the function argument points to initialized memory.

At that point, the code is passed to the optimizer, and everything goes wrong. The optimizer inlines one function into the other function, notices that the program is reading from an uninitialized variable (which would be undefined behavior, which it assumes cannot happen), and decides that this code must be unreachable. It turns the entire tail of the function into a single unreachable instruction in the LLVM IR, and then hands that off to the BPF code-generation backend. That backend ignores the unreachable instruction, but since the function's original return has been subsumed into it, the code generator ends the function without emitting a return instruction. That leads in turn to this somewhat confusing error from the BPF verifier:

    last insn is not an exit or jmp

While this error makes sense with an understanding of the sequence of events that led to it, at first Song found it a good deal more puzzling. It's not intuitive that an uninitialized variable would cause this error message, he said. He actually ran into this same problem helping someone with another program — so this isn't an isolated incident. People are seeing this message and being justifiably confused.

Marchesi and David Faust said that GCC does pretty much the same thing, and therefore has pretty much the same problem. One audience member asked why LLVM was generating an unreachable instruction instead of inferring that the value of the variable was undef (LLVM's representation of a value which could be anything). Song answered that LLVM's undef has "a lot of interesting semantics" that made it not always the right fit.

There have been a few attempts to avoid this kind of error, Song said. One option is to use -ftrivial-auto-var-init=zero to make the compiler initialize all variables with zero, where possible. This sort of works, in the sense that the generated program is no longer rejected, but it may hide a real bug. It's also a performance problem for some express data path (XDP) programs that may need to initialize lots of IP headers.

Another approach that he tried was to have the BPF backend recognize the unreachable instruction and emit an error at that point. This is better, but it's not an airtight defense, because there's no guarantee that the optimizer won't do something else in the future that results in different code being generated. For example, it could have just chosen to assume that the value of the variable matched whichever switch statement it found most convenient.

If the presence of unreachable could be relied on, the BPF backend could emit a useful error message when it sees it. So the approach Song is currently pursuing is to ensure that the optimizer does not apply transformations that can eliminate unreachable instructions when compiling for BPF. He also has a pull request for LLVM open that tries to generate unreachable in more cases, although it looks unlikely to be accepted in its current form.

One attendee suggested using LLVM's poison value, which is subtly different from undef (as a presentation from the 2020 LLVM developer's meeting explains). Song agreed that it was possible in theory, but it wasn't likely to be accepted by the LLVM maintainers for various reasons.

Marchesi wondered whether this same kind of behavior could manifest in other verifier errors, or whether it was always the same message. Song answered that he had only observed this specific error in testing, but that in general there was no reason to assume that other verifier errors were impossible. Eduard Zingerman said that he had actually seen some sched_ext code that did not result in the "last insn is not an exit or jmp" message, but had caused a verification failure in a different place in the program. Marchesi suggested that this specific case could be caught by examining the program's control-flow graph at compile time. Song said that was not possible, because LLVM's BPF backend doesn't have access to the control-flow graph. Marchesi asserted that this was a problem with LLVM's design, and that the backend needs access to the program's control-flow graph for several reasons.

As a partial solution that would at least deliver better error messages, Song proposed having the BPF backend generate a call to the non-existent bpf_unreachable() kernel function when it sees an unreachable instruction. This would still result in a verifier failure on existing kernels, but hopefully one that is more specific and therefore easily searched for. Future kernels could recognize calls to bpf_unreachable() and supply a nicer failure message. Specifically, he proposed:

    last insn marked as unreachable, maybe due to uninitialized variable?
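
In C terms, the idea amounts to the following sketch (a sketch of the proposal only; bpf_unreachable() deliberately does not exist, which is what would make the resulting verifier failure easy to recognize):

    /* Sketch of the proposal only; bpf_unreachable() does not exist.
     * The BPF backend would emit a call to it wherever the optimizer
     * left a bare unreachable instruction: */
    extern void bpf_unreachable(void);

    void example_tail(void)
    {
        bpf_unreachable();   /* replaces the missing return */
    }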

Some other alternatives he considered included adding an unreachable instruction to the BPF virtual machine, adding a bpf_unreachable() kernel function, or actually making the Clang frontend detect all uninitialized variable usage across functions. The first two are not really necessary, he said. Someone working at Google actually had a patch that implemented the last of those options, but it never got merged. At the time, the project didn't consider it a priority because people normally use a sanitizer to detect problems with uninitialized variables. Unfortunately, that's not really an option for BPF programs.

Faust commented that this sounded like another use case where it would be helpful to have the rules of the verifier extracted out of the kernel so they could be run elsewhere. If that were done, the compiler could check the binary itself, and then use its context on the program to produce a more helpful error message.

Signature changes

With the time remaining in the session, Song turned to another topic: how optimization can change the signatures of functions, and how to represent that in BPF's debugging information format, BTF. According to an analysis of the DWARF debug information of a recent kernel, there are 64,129 functions in the kernel. Of those, 635 have arguments changed, 306 have the return value removed, and 18 have both.
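
As a hedged illustration of how this happens (an example of mine, not one from the talk): interprocedural optimizations can drop an argument that is never read, or a return value that no caller uses, leaving the compiled function with a different signature than the source declares:

    /* Hedged example, not from the talk. */
    static void do_work(int val) { /* ... */ }

    /* The compiler can see that 'flags' is never read and that no
     * caller uses the return value, so it may compile helper() as
     * if its signature were: static void helper(int val); */
    static int helper(int val, int flags)
    {
        do_work(val);
        return 0;
    }

    void entry(int v)
    {
        helper(v, 0);    /* return value ignored */
    }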

The DWARF debug format does actually have a way to represent that information, in the form of the DW_AT_calling_convention tag, but it's not specific enough — it only tells the user that something changed, not what changed. Song then briefly described two proposed new ways of representing the original signature of an optimized function in BTF. Unfortunately, the group didn't have much time to dig into the topic before it was time for the next session.

Comments (13 posted)

An update on pahole

By Daroc Alden
April 7, 2025

LSFMM+BPF

Pahole (originally "Poke-a-hole") is a Swiss Army knife for exploring and editing debug information. Pahole is also currently involved in the kernel's build process to rearrange the information produced by various compilers into a form useful to the BPF verifier, although there are plans to render it unnecessary. Pahole maintainer Arnaldo Carvalho de Melo shared some status updates about the project at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. Interested readers can find his slides here.

[Arnaldo Carvalho de Melo]

Pahole has several uses in kernel development, including inspecting the layout of kernel structures, finding cache-line misalignment problems, and collecting statistics from the kernel's debugging information. The reason that Melo was presenting in the BPF track, however, is that pahole is also currently responsible for taking the debugging information generated during a build of the kernel and (if the user has enabled BPF) converting it into BTF for the verifier to make use of when loading a BPF program. Because the verifier, GCC, and Clang have all been evolving rapidly in this area, the BPF developers have turned to pahole to paper over the gaps in expectation between the different tools. [Melo later clarified that the compilers had not supported emitting BTF for non-BPF targets for many years, which was the initial motivation for including pahole in the build.]
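
To illustrate the kind of problem pahole was created to find (a hedged example of mine): on a 64-bit machine, a carelessly ordered structure forces the compiler to insert padding, and pahole reports each hole:

    struct example {
        char  a;    /*  offset 0, size 1  */
                    /*  3-byte hole       */
        int   b;    /*  offset 4, size 4  */
        char  c;    /*  offset 8, size 1  */
                    /*  7-byte hole       */
        long  d;    /*  offset 16, size 8 */
    };              /*  total: 24 bytes; reordering shrinks it to 16 */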

Melo began with the announcement that pahole has a new co-maintainer: Alan Maguire, who will be helping Melo process patches. Melo hopes that the resulting free time will let him speed up pahole's release cadence to match the kernel's; this would let a particular kernel be associated with a particular version of pahole, which would simplify toolchain management for many kernel developers. He would also like to set up continuous-integration testing for pahole.

Melo said that a lot of work had been contributed to the project since last year, including about 140 patches. Having testing infrastructure to catch problems with proposed patches automatically would be helpful in dealing with that volume of changes, he said. He intends to start by adapting the design of libbpf's continuous-integration setup, although he was warned by the libbpf maintainers "not to repeat the same mistakes".

The pahole test suite is in reasonably good shape, but Melo wanted to encourage people to add more tests, especially ones comparing the BTF generated by different compilers.

Various updates

At that point Melo quickly went through a long list of in-progress pahole features, with relatively little discussion of each one. The pahole developers have been working on adding support for flexible arrays, based on ideas from Kees Cook [Correction: the work was based on Cook's ideas, but not done by him.]; the program now has all of the information needed to calculate sizes for structures that contain them. Improvements to BTF handling include new metadata, support for bpf_fastcall, and resilient split BTF (which greatly improves the quality of BTF in loadable modules).
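
For reference, a flexible array member is a C construct whose storage is not counted in the structure's size, which is why computing the real size requires extra information; a small sketch:

    #include <stdlib.h>

    struct msg {
        size_t len;
        char   data[];   /* flexible array member: sizeof(struct msg)
                            does not include it */
    };

    struct msg *msg_new(size_t len)
    {
        /* the real size must be computed at allocation time */
        struct msg *m = malloc(sizeof(*m) + len);
        if (m)
            m->len = len;
        return m;
    }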

Rust support has been "fixed" in the sense that pahole now ignores most Rust debug information and no longer crashes upon encountering it. It will attempt to reconstruct as much information as it can, based on the existing C++ support, but Melo called the resulting information "pretty useless". Pahole will also show what is skipped because it could not be understood and issue a warning if the user attempts to encode BTF from a kernel object written in Rust.

Pahole can also now encode information about global variables (although this support is off by default), which makes the kernel's debugging information about 30% larger and covers around 76,000 variables. The most common variables are associated with tracepoints (~3,000 variables), trace events (~2,000 variables), or static keys (~1,000 variables).

There are still large numbers of variables that are "uninteresting" and therefore not included in pahole's output, Melo said. Currently, pahole has a hard-coded set of filters to determine whether a variable is worth including, but he would like to move that to a separate configuration file. Perhaps such a configuration file could even live in the kernel sources, so that kernel developers could tweak it for their needs.

Tweaking BTF

The current role of pahole in the kernel's build system is to read in DWARF debug information and output BTF. But that is far from the only workflow it supports. The tool can also ingest or output Compact C Type Format (CTF) information, a format with the goal of being a BTF superset suitable for use with programs other than the kernel, although the current version is not quite a superset. Recently, pahole also gained the ability to parse BTF. This means that the tool can be used to modify BTF, Melo said: deduplicating types, correcting the output of buggy compilers, and so on.

Melo proposed that, in the future, the kernel should be compiled with GCC's -gbtf option, which causes it to emit debugging information in BTF format. Binutils would then handle deduplicating the BTF while linking kernel objects together, before pahole performs final fixups. In this way, the conversion of DWARF to BTF could eventually be removed, which would speed up the process of building the kernel.

Alexei Starovoitov asked how far away GCC was from being able to build the kernel with -gbtf; David Faust said that compiling was possible right now and that the process would only fail at the linking step because ld doesn't yet know how to link BTF. Starovoitov asked whether anyone was working on that and Elena Zannoni confirmed that someone was. She expected that work to be complete in about a week.

That answer seemed to please Starovoitov, who thought that having compilers generate BTF natively, without using pahole to convert debugging information from DWARF format, would be a substantial speedup. The rest of the attendees agreed, although José Marchesi said that the real benefit was not tying BTF to information that is representable in DWARF. Melo agreed, saying that the conversion step using pahole was always a temporary measure. Starovoitov asked what would happen to the other fixes that pahole applies. Melo answered that pahole would still be part of the kernel build process, it will just have less to do.

[When originally published, this article referred to Arnaldo Carvalho de Melo as "Carvalho" after the first appearance of his name. In response to a reader question, I reached out to Melo, who informed me that his last name should be "Melo". The article has been updated accordingly.]

Comments (2 posted)

A new type of spinlock for the BPF subsystem

By Daroc Alden
April 9, 2025

LSFMM+BPF

The 6.15 merge window saw the inclusion of a new type of lock for BPF programs: a resilient queued spinlock that Kumar Kartikeya Dwivedi has been working on for some time. Eventually, he hopes to convert all of the spinlocks currently used in the BPF subsystem to his new lock. He gave a remote presentation about the design of the lock at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit.

Dwivedi began by providing a bit of background on existing locking in BPF. In 2019, Alexei Starovoitov introduced bpf_spin_lock(), which allowed BPF programs to update map values atomically. But the lock came with the serious limitation that a BPF program could only hold one lock at a time, and could not perform any function calls while the lock was held. This let the verifier ensure that BPF programs could not deadlock, but was awkward to use, Dwivedi said.

In 2022, sched_ext led to the introduction of more kernel data structures to BPF, including linked lists and red-black trees. The verifier was tasked with ensuring that the BPF program could lock and unlock those data structures correctly while manipulating them, but still only supported holding one lock at a time, and only allowed restricted operations while it was held. Some algorithms are much easier to express if the program is allowed to take two locks, Dwivedi explained. So this was a lot of friction to impose on BPF users, all for the sake of avoiding deadlocks.

The thing is, the syzbot kernel fuzzing system regularly finds deadlocks in the BPF runtime. The verifier cannot help with those, either, since they're occurring in kernel code and not in the loaded BPF programs. This is "an endless source of issues" that the BPF developers need to find some way to deal with, he said.

So the problem that Dwivedi set for himself was: how to guarantee forward progress in the kernel, when applying more static analysis to the kernel is infeasible? Furthermore, he wanted to find a solution that was scalable, and that could isolate faults to the specific BPF program that introduced the problem, without causing other programs to suffer. This is clearly a tall order, but Dwivedi believes that he has a preliminary solution.

Resilient queued spinlocks

His solution is to introduce a new kind of lock — a resilient queued spinlock — that will dynamically check for deadlocks (and livelocks) at run time, and encourage its use in areas of the kernel that frequently suffer from locking problems. The basic structure of the lock is (for most kernel configurations) four bytes, as the sketch following the list illustrates:

  • locked — a one-byte value that is either one, when the lock is held, or zero, when it's free.
  • pending — a one-byte value that is used to indicate that a second thread is waiting on the lock.
  • tail — a two-byte index into a table of wait queues that is used when more than two threads want the lock.
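
A rough C sketch of that layout (illustrative only; the real kernel definition differs, and the byte order shown assumes a little-endian system):

    #include <stdint.h>

    /* Illustrative sketch only; not the kernel's definition. The
     * three fields overlay a single 32-bit word so that the whole
     * lock can be updated with one atomic operation. */
    struct rqspinlock_sketch {
        union {
            uint32_t val;         /* the whole lock word */
            struct {              /* little-endian layout shown */
                uint8_t  locked;  /* 1 = held, 0 = free */
                uint8_t  pending; /* set by a second waiter */
                uint16_t tail;    /* index into wait-queue table */
            };
        };
    };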

When a thread is waiting on the lock (and would otherwise be wasting CPU time spinning), it checks a per-CPU table of held locks against information stored in the lock's queue in order to detect deadlocks. If waiting for the lock takes too long without the lock becoming free, an error is also returned indicating a livelock.
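
The simplest case the check can catch is a thread waiting on a lock that it already holds; here is a heavily hedged sketch (names are hypothetical, and the real algorithm also uses the queue information to detect cross-thread ABBA cycles):

    /* Hypothetical sketch of the simplest (AA) deadlock check; the
     * kernel's real algorithm is more involved. */
    #define MAX_HELD 32

    struct held_locks {
        void *locks[MAX_HELD];   /* one table per CPU in the design */
        int   cnt;
    };

    static int check_aa_deadlock(struct held_locks *t, void *lock)
    {
        for (int i = 0; i < t->cnt; i++)
            if (t->locks[i] == lock)
                return -1;   /* waiting on a lock we already hold */
        return 0;
    }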

Dwivedi went through what the process of locking and unlocking the lock looks like at different levels of contention. When the lock is uncontended, it works exactly like the kernel's current spinlocks: the thread that wants to acquire the lock uses a compare-and-exchange instruction to set locked to one. When it is done, it sets locked back to zero. When two threads want the lock, the sequence is similar, except that the second thread will set pending to indicate that it is waiting and start spinning on the lock.
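
In the uncontended case, the fast path amounts to something like the following sketch using C11 atomics (the kernel uses its own primitives, and falls back to a slow path on failure):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Minimal sketch of the uncontended fast path, assuming the
     * whole lock word is zero (free, no pending waiter, empty tail).
     * Setting it to 1 sets the 'locked' byte. */
    static bool try_lock_fast(atomic_uint *val)
    {
        unsigned int expected = 0;
        return atomic_compare_exchange_strong(val, &expected, 1);
    }

    static void unlock_fast(atomic_uint *val)
    {
        /* clears the 'locked' byte; sketch assumes no waiters */
        atomic_store(val, 0);
    }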

When more than two threads want the lock, things get interesting. The third thread to try to acquire the lock sees that the pending bit is set, adds itself to a new queue, and then points the tail field of the lock to the entry for that queue. Then, as the head of the queue, it becomes responsible for running checks for deadlocks. Any new threads that attempt to acquire the lock add themselves to the queue.

Eventually, the thread holding the lock will give it up, prompting the pending thread to grab it and reset pending to zero. Because the thread at the head of the queue was handling deadlock checks, the pending thread was free to spin on the lock, so the latency from the lock being freed until the pending thread can acquire it should be low. If the thread holding the lock doesn't give it up in a reasonable amount of time, the thread at the head of the queue notices that as well, and the waiting threads all remove themselves from the queue and return errors.

The design of the lock is intended to have performance competitive with the kernel's existing spinlocks, while still letting deadlocks be detected. To support this, Dwivedi showed a number of graphs from benchmarks such as locktorture and will-it-scale, which are also available in his commit message. In short, the new lock looks as though it performs only slightly worse than the existing spinlock on an Intel x86_64 CPU, and pretty much identically on an arm64 CPU — although there was the normal amount of noise in his performance data making the difference somewhat hard to detect.

Next steps

Now that the lock is available in the kernel (although it wasn't at the time of Dwivedi's presentation), he has plans for how to make use of it. The place he most wants to see the lock used is in the BPF runtime, both to help eliminate the existing deadlock problems there, and because that will let the BPF developers relax some of the restrictions on BPF programs that hold locks. He also wants to work on reporting detected deadlocks to user space, which is currently not done.

He pointed the audience at his and Starovoitov's white paper for anyone interested in additional details of how the deadlock-detection algorithm works. I asked how long he expected it to take to adapt the whole BPF runtime to use his new lock; Dwivedi said that the work was not complicated, but that he wanted to avoid doing it in the initial patch set. "Once the whole thing lands in mainline, we will want to convert more parts."

Starovoitov asked about his ideas for replacing the existing spinlocks. Specifically, the existing bpf_spin_lock() function returns void, with no possibility of failure, which means that switching to the new lock will require adding error paths. Dwivedi said that there was some additional work needed to allow multiple locks to be held at once, but that existing BPF programs should just get canceled if they cause an error. Part of the point of his work is that misbehaving programs should get kicked out of the kernel. In response to some clarifying questions, he said that the goal was that "the kernel won't break; your program might break", and that any new BPF programs going forward could check the return value of the lock explicitly if they needed to.
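
In BPF-program terms, checking the result might look something like this minimal sketch (the type and function names here are illustrative assumptions, not necessarily the final kernel API):

    /* Hedged sketch; 'struct bpf_res_spin_lock', bpf_res_spin_lock(),
     * and bpf_res_spin_unlock() are illustrative names. */
    struct map_value {
        struct bpf_res_spin_lock lock;
        long counter;
    };

    int update_counter(struct map_value *v)
    {
        if (bpf_res_spin_lock(&v->lock))
            return -1;   /* deadlock or timeout was detected */
        v->counter++;
        bpf_res_spin_unlock(&v->lock);
        return 0;
    }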

Starovoitov expressed hope that the new lock would eventually displace all uses of existing spinlocks in the BPF subsystem. Anton Protopopov was concerned about the performance impacts from the changeover, and was unconvinced by Dwivedi's measurements, although the other BPF developers did not seem overly concerned.

I also asked whether this was the only kind of lock Dwivedi thought the BPF runtime would need. He thought this was the only one needed for now, although he acknowledged that some people have occasionally wanted mutexes.

With the lock now accepted into the mainline kernel, it seems likely that Dwivedi will be pushing forward with his quest to adapt the BPF subsystem. If that conversion does actually successfully eliminate deadlocks without an undue performance impact, perhaps other areas of the kernel will also find the new lock useful.

Comments (none posted)

Taking notes with Joplin

April 8, 2025

This article was contributed by Andrea Ciarrocchi

Joplin is an open-source note-taking application designed to handle many kinds of notes, whether that means managing code snippets, writing documentation, jotting down lecture notes, or drafting a novel. Joplin has Markdown support, a plugin system for extensibility, and accepts multimedia content, allowing users to attach images, videos, and audio files to their notes. It can provide synchronization of content across devices using end-to-end encryption, or users can opt to stick to local storage only. Joplin even offers a command-line version for terminal-based usage. Joplin 3.2, the most recent feature release, brought long-awaited multi-window support, multi-column layouts, enhanced accessibility, and theme detection.

History

Laurent Cozic started work on Joplin in 2016, with an Android version released in July 2017, and the first desktop release (v0.10.19) made available in November 2017. The complete changelog of the project is available on the Joplin web site. The application is named in honor of ragtime composer and pianist Scott Joplin, whom Cozic admires. More than 650 people have contributed code to the project over its lifetime; based on the activity on GitHub and the steady stream of releases, it is an active and thriving project.

In 2022, Joplin's user applications switched from the MIT license to the GNU Affero General Public License v3 (AGPLv3) with the 2.9 release. The Joplin project also makes its Joplin Server available for users who want to run their own synchronization server. However, that application is released under a source-available, non-free Personal Use License that forbids commercial use and seems to disallow modification as well.

Getting started

Joplin is a multi-platform application, with versions for Linux, macOS, and Windows. It is built with Electron for cross-platform compatibility. Mobile applications are also available for Android and iOS. It is available for Linux as an AppImage from the download page of the web site. The page also provides instructions for installing the terminal application and WebClipper, a browser extension for Chrome and Firefox that allows saving web pages directly to Joplin as notes. The command-line version is not included with the Joplin download, and is installed using Node.js's npm package manager.

Joplin's editor supports writing content in Markdown or using a Rich Text editor for a WYSIWYG experience. However, the project warns that the Rich Text editor has some limitations in working with Joplin plugins that use Markdown. Notes are stored in the Markdown format whether the user chooses the standard Joplin editor or Rich Text editor. All notes, notebooks, and attachments are stored in an SQLite database. Editing with external text editors or file management tools is possible by selecting "Edit with external editor" from the application. However, this approach is only feasible when mediated by Joplin. Directly editing Joplin's notes can disrupt synchronization.

It also offers an export feature that supports multiple formats, including its own Joplin Export Format (JEX), as well as JSON, PDF, and plain Markdown, to back up or transfer data. Similarly, it provides import capabilities, allowing for migration from other note-taking applications. Users can import from JEX, Markdown files, and Evernote data via ENEX files while preserving note structure and attachments.

It includes a to-do list function that allows tasks to be created and managed much like regular notes. Each to-do item is a special type of note with an additional "completed" status field. When creating a new to-do note, users can assign a title, add detailed content, and set a due date. To-dos can be marked as completed, of course, which updates their status and changes their sorting order in the interface, typically moving completed items to the bottom. Notes themselves can include Markdown to-do lists.

Joplin supports synchronization via a number of FOSS and proprietary methods. The easiest open-source option is likely Nextcloud, a widely used self-hosted cloud-storage solution, with synchronization via its WebDAV interface. It requires a properly configured Nextcloud instance and authentication credentials. WebDAV support extends beyond Nextcloud, permitting synchronization with any WebDAV-compatible server; this requires manually entering the server URL and credentials.

Joplin Cloud, the project's paid cloud-storage service, provides account synchronization that allows access from multiple devices and enables note sharing with other users. Joplin also supports synchronization methods besides Joplin Cloud, making it possible to choose based on infrastructure and privacy requirements. Dropbox and OneDrive integration relies on their respective APIs to store and retrieve encrypted note data; users must authenticate through OAuth, and Joplin manages synchronization automatically. Each method operates using Joplin's delta-based sync API, which minimizes data transfer by only updating modified items.

Local filesystem synchronization is another option. This method is suitable for those using external synchronization tools like Syncthing, rsync, or a shared network drive. Since this approach does not involve cloud storage, it offers complete data control but requires additional configuration for multi-device synchronization.

Joplin notes can be encrypted by providing a master password. This gives users end-to-end encryption if they are using any of the synchronization providers. Note that this does not encrypt Joplin's data stored in SQLite or other user data.

Interface and workflow

Joplin's interface is structured as a three-pane layout. The left pane serves as the navigation panel, displaying a hierarchical list of notebooks and sub-notebooks, along with user-defined tags. This panel also provides quick access to synchronization status and configuration settings. The application requires no special configuration, so one can start writing immediately.

[Joplin's interface]

The central pane is the note list, where notes within the selected notebook are displayed. Notes are sorted by configurable criteria such as modification date, creation date, or title. Each note entry in this pane includes metadata such as title, timestamp, and an optional preview of the content. Joplin supports both list-based and board-style task management, with drag-and-drop functionality for reordering or changing task states.

The rightmost pane functions as the editor and viewer. This content area updates dynamically based on the selection, showing a list or board-style view of tasks, notes, or documents; sorting and filtering options allow the content to be organized by various criteria. Joplin supports multiple editor modes, including a rich-text WYSIWYG editor and a Markdown editor with a split-view preview. The Markdown editor supports CommonMark syntax with extensions for functionality such as checkboxes, footnotes, and mathematical expressions. Code blocks are rendered with syntax highlighting using highlight.js.

The interface includes a toolbar with essential formatting controls, a search bar supporting full-text search with filtering options, and a command palette for quick access to features via keyboard shortcuts. Additionally, Joplin supports a distraction-free mode, which can be activated by toggling individual panes. The entire UI is configurable, with support for themes and a customizable CSS file for further styling adjustments.

The latest version, v3.2.13, contains some minor bug fixes and was released at the end of February 2025. Joplin version 3.2 introduced several new features. The multi-window support allows notes, but not notebooks, to be opened in separate windows. This means users can finally have multiple notes open at a time, which makes it much easier to compare content, cross-reference materials, and organize projects. It may not be immediately obvious how to do this, however, as the option to open new windows is buried in a context menu. To open notes in a separate window, right-click on a note and select the "Edit in…" option. This launches the note in a new, dedicated window for editing or viewing. The feature required significant architectural changes, which included enabling independent operation of multiple windows with synchronized states to ensure real-time updates.

The note list now supports multi-column layouts, allowing sorting and display based on various metadata fields. By selecting "Detailed" as the "Note list style" from the "View" menu, users can view and edit notes along with their metadata (last modification date, completion status, etc.) directly in the central pane.

System theme detection has been implemented, which automatically aligns the application's appearance with the desktop's dark or light theme. In addition to the generic "Light" and "Dark" themes, it is possible to choose from other color combinations, such as "Dracula" or "Solarized Light". Accessibility enhancements include improved keyboard navigation and optimized user-interface contrast. For example, previously, pressing Shift+Tab from the notebook list would cause the focus to jump unexpectedly to the editor. This problem has been addressed by changing the behavior of the shortcut, which now moves focus to the "Add new notebook" button. Moreover, the contrast of scroll bars has been improved, making them more distinguishable.

Developers can extend Joplin's functionality using its JavaScript-based Extension API, which integrates with the Electron framework. The API allows access to core features like note management, UI modification, and data synchronization. Plugins are defined via a manifest file and can be developed using JavaScript or TypeScript. Those already available offer a wide range of features. For example, the Math Mode plugin allows using a note page as a calculator.

Joplin supports drawing functionality through the default inclusion of the Freehand Drawing plugin. It can be used to create and manipulate free-form sketches or annotations directly within the interface, using vector-based rendering. However, its feature set remains limited compared to dedicated graphic-design software.

Joplin's optical character recognition (OCR) feature scans documents like PDFs that are attached to notes. It offers reliable performance for scanning documents with printed text, but is not yet ready for handwritten notes. The project is looking to expand its OCR functionality to support handwritten text recognition. Joplin's OCR yields acceptable results when applied to high-resolution documents with a linear layout. Unfortunately, in the real-world scenarios where I have tried to use it, its usefulness has been limited for documents with complex formatting, such as tables. Users may have better luck with other OCR systems on Linux, though it is convenient to have Joplin's integrated option.

The Help section of the Joplin web site and the Discourse forum serve as good starting points to find help, if needed.

Drawbacks

Joplin has several limitations that may impact its usability, performance, and extensibility. Since it is based on Electron, Joplin has higher resource consumption than native alternatives, especially when handling large note collections or extensive synchronization operations. For example, Joplin consumes more than 500MB of RAM right after startup on my system, according to Ubuntu's System Monitor, and it may feel sluggish on lower-end or older hardware due to Electron's overhead.

Synchronization, though flexible, depends on external storage providers or self-hosted solutions. Performance varies based on the selected backend, with WebDAV implementations often experiencing latency issues due to inconsistent server performance. Joplin's delta-based sync mechanism reduces data transfer, but initial synchronization of large note sets can be slow. Additionally, its conflict resolution is limited, requiring manual intervention when multiple edits occur on different devices before synchronization.

The WYSIWYG editor can introduce inconsistencies when switching between rich text and Markdown modes, particularly for complex formatting structures. Table editing is limited, requiring manual adjustments in Markdown mode. Joplin's plugin ecosystem faces limitations due to the constraints imposed by the underlying Electron framework, which affect the scope and performance of potential extensions.

Development

As an Electron application, Joplin is mostly written in JavaScript and TypeScript. The project releases three major versions per year and has a roadmap with feature-freeze dates and tentative release dates through May 2027, though the roadmap does not indicate specific features planned for those versions. The development approach is adaptive, with priorities shaped by user feedback, pull requests, and community discussions. This means that while major release dates are scheduled, the exact features and improvements included in each version are determined closer to their release.

When a new stable release of Joplin is published, older versions do not receive further updates. So, for example, the 3.2.x line is still receiving updates, and 3.3.x is the development branch. The 3.1.x versions are no longer being updated. The Contributing to Joplin page provides a starting point for contributing. There, users can choose an area to get involved in.

To enable access to pre-releases, navigate to the Configuration menu, open the "Application" section, and enable the option labeled "Get pre-releases when checking for updates". Bugs encountered in pre-release versions can be reported either on GitHub or the Joplin forum. The latest Joplin desktop pre-release, version 3.3.3, focuses on accessibility, performance, and usability improvements. Key updates include new features such as multiple instance support, an upgrade to Electron 35.0.1, and various bug fixes.

Conclusion

While Joplin provides flexibility, performance can vary depending on the synchronization method, and users may have problems with delayed content updates across devices. The interface is functional, supporting both Markdown and rich-text editing, but it may feel less polished or intuitive compared to proprietary alternatives like Obsidian. The plugin system is still developing, and mobile performance can be limited. Overall, Joplin is a solid choice, but users may encounter a few rough patches depending on how they use the application.

Comments (2 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds