
Leading items

Welcome to the LWN.net Weekly Edition for May 14, 2026

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Friction in Fedora over AI developer desktop initiative

By Joe Brockmeier
May 13, 2026

A push by Red Hat employees to create a Fedora "AI Developer Desktop" with support for out-of-tree kernel drivers and AI toolkits has been met with objections from some long-time members of the Fedora community. After more than a month of sometimes heated discussion, the Fedora Council had voted to approve the initiative; however, a last-minute change to vote against the proposal by council member Justin Wheeler has (at least temporarily) sent it back to the drawing board.

The proposal

On March 31, Gordon Messmer, a senior software engineer at Red Hat, proposed the AI Developer Desktop initiative on the Fedora discussion forum. The aim of the initiative is to "build a thriving community around AI technologies" within Fedora. The initiative would focus on the technical hurdles in the way of shipping AI developer tools and enabling hardware; more than that, though, Messmer's proposal is meant to make AI development a major priority for Fedora as a project.

Community initiatives are projects that do not fit "neatly into the biannual Fedora Linux release cycle" and may span several releases. Initiatives are also meant to be goals for the entire project that align with Fedora's mission statement. One example of an initiative is the work to replace Bugzilla and Pagure with Forgejo as Fedora's "Git forge"; the Fedora wiki has a list of completed initiatives as well. Note that initiatives were previously called "objectives", which is the term that Messmer uses. For consistency, we'll stick with "initiative", as that is the current terminology.

Messmer said that much of the work Fedora does is to package applications so that the software requires minimal post-installation configuration to be usable; however, AI tooling often requires more than minimal setup on Fedora. He wants to make things easier for users who wish to work with AI tools by minimizing the amount of post-installation hassle required to get them up and running on Fedora. The platform deliverables he identifies to provide "an operating system image that would improve Fedora as a platform for AI software" would require Fedora to accommodate, if not actually include, out-of-tree kernel modules (NVIDIA's OpenRM, "until the Nova driver is ready") and support for NVIDIA's proprietary CUDA Toolkit. He was clear that there was no plan for adding "applications that inspect or monitor how users interact with the system or otherwise place user privacy at risk", or applications pre-configured to connect to remote AI services.

Messmer said that Fedora's rolling-release kernel is not well-suited for "the AI space", and called for a Fedora long-term-support (LTS) kernel to avoid problems with out-of-tree kernel software and user-space components that "can be impacted by the changes typical of a kernel minor release". The Fedora project follows the upstream Linux kernel closely and has a policy against maintaining multiple kernels. A Fedora release will usually receive many kernel updates, including major versions, during its release cycle. For example, Fedora 42 shipped with the 6.14.0 kernel in April 2025; the current updated kernel for that release (which is almost at the end of its life) is 6.19.14. Fedora's kernel policies currently discourage, but do not entirely prohibit, out-of-tree modules. Messmer said that the initiative would require asking the Fedora Engineering Steering Committee (FESCo) to "revisit policies that prohibit the option of a stable Fedora kernel".

Additionally, he wanted to publish Fedora Atomic variants to support AI workloads, with the CUDA toolkit, around the same time as the Fedora 45 release scheduled for October 2026. "If Fedora cannot distribute this image due to license or policy issues that we can't resolve, I'd like to ask NVIDIA if they would publish the image we build". The Atomic desktops are image-based, which means that it is more complicated for users to install NVIDIA's CUDA package separately. Including the package when the image is built would be much simpler. He linked to a preview build of the desktop along with the configuration files used to build it, as well as a Copr repository with a 6.12 Linux kernel for Fedora 43 containing the out-of-tree NVIDIA module. He proposed himself as the lead of the initiative.

Discussion

Fedora's initiative process requires a discussion phase, which Messmer had initiated with his post on Fedora's forum; if the proposal is well-received by the community, then its lead can proceed to opening a ticket with the council for consideration. So far, the conversation has generated more than 140 comments from more than 30 participants; whether it has been "well-received" has been called into question.

Steve Milner said he liked the idea and proposed plan overall. He wondered if the LTS kernel would be specific to the AI desktop, or if it would be available to other Fedora variants. Messmer replied that he thought it would be useful for many people, not only AI desktop users. He said he often heard complaints about hardware-support regressions after kernel upgrades, and that a stable kernel could also benefit users who needed other out-of-tree modules for VirtualBox or ZFS. He admitted that an additional kernel would present more work for Fedora's quality team, but argued that "the testing process around Fedora kernels today has serious flaws" because the rolling-release kernel "does not align well with the concept of a stable release". Even if users participate in testing days and report regressions, "there just isn't any realistic alternative to shipping the new release series as an update".

Neal Gompa had a number of thoughts to share. He objected to changing Fedora's policy around supporting out-of-tree kernel modules: "the likelihood our user's systems will be considered tainted and ineligible for support from upstream kernel developers goes up significantly." Kernel developers prefer Fedora, he said, because it does not currently support out-of-tree modules.

He had reservations about equating AI specifically with CUDA; Fedora initiatives should encourage a fully open-source-software stack, not endorse a proprietary one. Building it around CUDA would send "a dark signal that we don't care enough to push for open source driven AI technology stacks". He added that, with his FESCo hat on, he would be strongly against a policy change in favor of a stable Fedora kernel: "your rationale for this is rather weak, since it isn't even needed for OpenRM".

Messmer responded that the OpenRM module works well today, "but there is no guarantee that will be the case at any given point in time". That was the reason he was given for why OpenRM would not be built in Fedora's kernel package. NVIDIA, he said, was mentioned specifically in the proposal because there was work needed to enable NVIDIA hardware, not because the initiative was intended to be CUDA-specific. Other vendors had already provided "more active support or better aligned support". Gompa had also complained that Red Hat was not allocating kernel developers to do significant development of Fedora's kernel; Messmer said that was a reason why the stable kernel was needed.

I'm actually quite surprised to see anyone argue simultaneously that there are not enough developer resources assigned to the rolling kernel release and that the stable kernel isn't useful or desirable. Those seem like contradictory points of view, to me. The latter is the solution to the former.

The Gompa-and-Messmer discussion went on for some time; Gompa continued to emphasize that, unlike the openSUSE and Ubuntu distributions, which had similar corporate sponsorship, Fedora had a single kernel maintainer who is "massively overworked and isn't able to engage on Fedora kernel bugs". It did not matter what kernels Fedora had because Fedora does not have people to fix the problems that users discover: "it doesn't matter if the problem is in 7.0-rc6, 6.19-stable, or 6.18-longterm. They are still not getting fixed". The additional complexity of the packaging, installation, and bootloader infrastructure for multiple kernels did not make sense, he said. "I'm saying this as someone who is maintaining kernel trees and alternative kernel flavors for Fedora Asahi Remix and CentOS Stream Hyperscale: it's a bad place to be, and I would rather not be here if it wasn't absolutely required."

Clement Verna said that there would be a lot of overlap between Messmer's proposal and what the Universal Blue community was already doing. That project develops Fedora-based images custom-tailored for specific use cases; for example, the Bazzite gaming distribution and Bluefin workstation distribution are part of the project. He said that Fedora could learn a lot from the automation tools being used by Universal Blue, and there could be an "opportunity to consolidate the maintenance effort for an LTS build" as well.

FESCo member Kevin Fenzi asked why Messmer would not do the entire project as a Fedora Remix. Projects can use the "Fedora Remix" branding while shipping third-party software, even proprietary software, so long as the remix does not use official Fedora branding packages. He added that the reason Fedora had a "'one kernel only' rule" was to reduce the maintenance burden. Messmer said that a remix was considered, but he wanted Fedora as a project to take part in community building around AI. "I believe that the communities we promote will promote the project in return."

Philosophical objections

Fabio Valentini, who is also a FESCo member, spoke up on April 27. He apologized for "arriving a bit late to the party", and said that he was not sure he wanted Fedora to make an AI desktop initiative. Fedora is "already being perceived as 'tainted by AI'" due to the council's decision to approve an AI-assisted contributions policy, which was "driving users and contributors away to distributions which are perceived as not drinking the AI Kool-Aid". (LWN covered the discussion in October 2025, prior to the council's decision.) He said an LTS kernel might be interesting, but did not agree with making "anything with 'AI' in its title" an official initiative, and worried that it would further alienate users.

Messmer argued that it would call into question whether Fedora was really an open-source project if Fedora decided to nix the project because it had "AI" in the title, rather than due to policies about proprietary software: "that would actually be bad for our reputation". He cited the Open Source Initiative's Open Source Definition (OSD), which requires that a license "must not restrict anyone from making use of the program in a specific field of endeavor". Valentini said that did not make sense: Fedora has to make decisions "of the 'we could do this, but will not / do not want to do it' kind" if for no other reason than to limit the scope of the project. Choosing not to do something for ethical or philosophical reasons must be a valid reason not to do something.

Fernando Mancera replied that the OSD did not require a project to adopt specific technologies to be considered open source. The decision would be whether Fedora wanted to align itself with and promote a specific use, not about restricting others from pursuing the AI use case. Making something a Fedora initiative implied the project, as a whole, would be focused on its success. "The question is whether Fedora, as a project, should associate its identity and priorities with that field."

Reputational damage

Fedora Project Leader (FPL) Jef Spaleta entered the discussion with a lengthy reply. He said that he had "zero evidence in front of me" that people were avoiding Fedora due to AI, and asked to be shown metrics that would support that claim. Fedora has to be "out in front of conversations" even on controversial technologies, and it could not influence conversations it was not part of. He claimed he was "genuinely concerned about the ethical use of AI", but said the best possible future required the Fedora community to be part of the conversation about ethical use of AI.

The people who are going to get us to the better AI future are the people at the start of their journey and see value in the technology and Fedora needs to be influencing those people so they take the technology into an ethical direction that is most congruent with our shared ideals.

He added that, as FPL, he was "absolutely not concerned about the reputational damage to this project that comes with setting up an entirely new output attractive to developers who want to make use of AI tools".

Valentini replied that Spaleta had missed his point; the whole effort could happen without being an official Fedora initiative. There is no mechanism to prevent someone from working on an AI desktop; he just thought it would be better if Fedora did not promote technology that was "deeply problematic", in its current form, as a project initiative.

After taking a week away to let his thoughts settle, Mancera responded again; he agreed that Fedora should let data guide its decisions, but that went both ways. "Otherwise, we risk holding different arguments to different standards." He said he was struggling with Spaleta "expressing no concern about potential reputational impact"; that could be read as dismissing an issue that some Fedora contributors care deeply about. "Even if the change is non-disruptive in a technical sense, it can still influence how the project is perceived and what it signals about Fedora's priorities."

Spaleta responded, in part, that he was not dismissing anyone's concerns; he disagreed about what to do with those feelings. He said he lived in an area that had "the highest density of data centers" and was directly impacted by their power and water usage. But telling people not to use the technology was not going to work. "Offering people a more ethical version, which they can actively contribute to making better, may help." Mancera replied:

I do not think we can move this forward in a community way. I am withdrawing all my activities in Fedora project starting right now. The present situation in Fedora is clearly not for me.

There was no need, Simon de Vlieger said, for the AI desktop to be an initiative right now. Instead, he suggested, it should be a remix with a special-interest group that built up a community before making it an initiative. If it turned out to be popular and sustaining, then it could become an initiative. He felt that it was being treated as "a train that must leave the station" to the disregard of community members and their concerns.

Spaleta continued to argue that an AI developer desktop was strategically important to Fedora. "I believe that the base image that comes out of this work needs to be an edition with its own working group in the 2028 timeframe."

Council

While the discussion raged on, Wheeler announced that the council had discussed the initiative on May 6. The council had voted to approve it (six in favor, none opposed) as a 12-month initiative led by Messmer with Spaleta as its executive sponsor. Gompa was unhappy with the council "essentially ignoring the community discussion", and said that the proposal was not acceptable to the community to approve as-is. "I'm especially disappointed at how we're being told that as our opinions and interests in the project as highly engaged contributors do not matter."

On May 8, Wheeler changed his vote to -1. The council requires full consensus to pass significant decisions, which means that a council member can halt the process and require discussion. Wheeler said that, based on "recent public and private feedback, we do not yet have the necessary consensus to proceed". He said that feedback from Fedora's kernel experts had not been sufficiently integrated into the plan. "I am casting this vote to ensure we build a structurally sustainable initiative that succeeds without alienating or burning out our core experts."

Wheeler changed the date for the council ticket to May 22, and said that he was optimistic that the council could come to a decision "without a deadlock and FPL override" before then. According to the council's charter, the FPL can "'unstick' things if consensus genuinely cannot be reached" and a decision must be made.

Spaleta has firmly planted a stake in the ground that Fedora should be involved in the "AI conversation", as it were. It does not seem to be enough to do the technical work to make Fedora suitable for working with AI technologies; the project has to send a message that it's in favor of such things.

There is clearly a top-down push from Fedora's corporate sponsor to be AI-friendly, which is not surprising since the company is all-in on AI. Last year, Red Hat vice president of core platforms, Mike McGrath, weighed in on the AI-assisted contributions discussion to complain about Fedora's governing bodies being well-known for what they do not want. He wanted to see Fedora "leading and shaping the future and saying 'AI, our doors are open, let's invent the future again'".

Ultimately, it seems likely this initiative will be approved in some form. The pressure for Fedora to accept and embrace AI as part of its identity seems destined to continue until the project moves in the direction its sponsor wants, or until the winds change, the bubble pops, or AI becomes yesterday's news. Since the AI craze is probably not going to go away soon, odds are Fedora will be making room for it sooner rather than later.


Forgejo "carrot disclosure" raises security questions

By Joe Brockmeier
May 8, 2026

An unusual, some might say hostile, approach to disclosing an alleged remote-code-execution (RCE) flaw in the Forgejo software-collaboration platform has sparked a multifaceted conversation. A so-called "carrot disclosure" in April has raised questions about the researcher's methods of unveiling a security problem, Forgejo's security policies, and the project's overall security posture.

Forgejo was forked from the Gitea collaboration and hosting platform in 2022. It is a project supported by the Codeberg e.V. nonprofit and is the software used by the Codeberg hosting service. The Fedora Project is also in the final steps of replacing its homegrown Pagure platform with Forgejo.

Carrot disclosure

In his disclosure post on April 29, security researcher Julien Voisin said that it was Fedora's choice of the project as its collaboration platform that had inspired him to "take a good look at Forgejo's security posture". He claimed he had found a number of security flaws in Forgejo:

All in all, it took me one evening after work to find a good amount of vulnerabilities (adding to the one I got from looking at gitea at some point in the past), and chain some of them to obtain a full-blown RCE [...]

On April 27, Voisin had opened a few pull requests with the Forgejo project: a fix to quote attributes in a comment form, a change to remove a method that passed user-supplied strings to a command, and another to remove the "plain" OAuth authentication method. None of the pull requests included a description that might cause a maintainer to address the fixes as urgent for security reasons.

It also appears that he did not report any specific security flaws following the project's rather detailed security policy, even after being asked by "Gusted" when they noticed that Voisin had opened three pull requests with "some security relation". Voisin replied that he was going through his list of low-impact bugs "that could barely qualify as security issues worth reporting privately let alone choreographing an embargo". He also complained that the policy had "a lot of MUST, MUST NOT, MAY" requirements in reporting a security problem and wondered how the policy was enforced.

Rather than reporting the specifics of the RCE directly to the project, Voisin chose to do what he calls a "carrot disclosure", a term that he had coined in 2024. He defined it as "dangling a metaphorical carrot in front of the vendor to incentivise change", though in actuality the approach sounds more like the stick than the carrot. The idea is to demonstrate that the software is exploitable and to force the vendor to "perform a holistic audit of its software, fixing as many issues as possible in the hope of fixing the showcased vulnerability", or to lose users who are unhappy about running known-vulnerable software. "Users of this disclosure model are of course called Bugs Bunnies."

Fans of Looney Tunes cartoons might observe that, as a rule, Bugs did not go looking for trouble. Often he would be hunted or harassed in some fashion and then declare "of course you realize, this means war"; but he did not instigate trouble. That seems relevant here, because Voisin's approach was the opposite: to go looking for trouble, and then to be provocative about it after a flaw is discovered.

Voisin claimed to have found an RCE, apparently unrelated to the issues he had opened pull requests about, that required a Forgejo instance to have open registration—in other words, to allow anyone to sign up to use the platform—and for it to have "a configuration option set to a non-default value". He did not specify which configuration option but said that it was enabled on some instances that he had looked at. His example showed him running a Python script, without revealing the actual code, that achieved an RCE against a Forgejo instance running on a local machine.

He said, in his disclosure post, that he could try to resolve the issues one by one with more pull requests but decided against it:

I could disclose the bugs to Forgejo, they even have a Security Policy, with a lot of MUST/MUST NOT about what I must or mustn't do should I decide to go this way. But given the sorry state of the codebase (not their fault though, they inherited the gitea/gogs ones), I'm pretty sure I could spend another evening and find another chain, and odds are that others have a bunch as well. I could try to fix the issues one by one myself and send pull-requests, but even if I wanted, this is a systemic issue, there is little point in playing endless wack-a-mole.

He said he was told, after discussing the topic with a friend, to "put my money where my mouth is, and just go with carrot disclosure that I usually advocate for in this kind of situation". After publishing the disclosure, he promoted it on the infosec.exchange Mastodon server on April 29. It quickly gained attention and became a topic of discussion on the Fediverse and was also shared on Hacker News, Lobste.rs, and other discussion forums.

Response

The public response was mixed and generally fell into two categories: those critical of Voisin's approach, and those who took Forgejo to task for its security policy and perceived problems. For example, Hans van Zijst said that it was "an incredibly condescending way to talk about someone else's work, and a horrible attempt to force volunteers to follow your priorities". Henry Catalini Smith said: "what you've done here was below any reasonable standard of professional conduct, and also very strange to me". He said that he had recently begun looking at Forgejo's accessibility bugs and thought it would be fun to work with the project on solving them; it made no sense to him that a specialist in a type of problem would want to "build a brand as someone this publicly hostile" to projects that had security problems.

On the Privacy Guides forum, "HackOrSwim" said that the reporter had acknowledged that many of Forgejo's problems were inherited when it forked from Gitea, but not that it would require a lot of resources to correct those problems. "This is essentially technical debt, which doesn't make it acceptable, but it makes it understandable why these issues are there. It's a non-profit, volunteer-driven project."

On the other side, Tony Arkles said that the policy "comes across as pretty demanding for a team that's getting free support from a community member", and called the process grating. "I think I'm with the author on this one." Elliot Speck, an information-security consultant and researcher, found that the Forgejo security guidelines are "obnoxious and pretentious". He said the project was "too busy taking the wrong things seriously".

Follow-up

On April 30, Voisin published a follow-up with a summary of the events that took place after the disclosure. He said that he had been "called a handful [of] vile names", received complaints that he had "brought unwanted attention to an easy target", and responded to complaints that his conduct was unprofessional: "The terms 'not professional' (as in 'not acceptable in a professional environment') has been thrown around a lot, but nothing here is or was being done in a professional context."

He also said that he'd learned that "various entities" had revised their opinion about what Forgejo is, and isn't, "which was the main goal of the previous blogpost". Despite the drama, though, Voisin said it had led to some productive, good-faith conversations:

It seems that experimenting with odd vulnerability disclosure schemes is frowned upon. So I ended up sending an email to Forgejo security team, containing: an apology, a bit about my reasoning for proceeding with carrot disclosure, recommendations about what to harden/review, and a bunch of commented exploits/proof-of-concepts as attachment. We'll see how it goes.

On April 30, the Forgejo project published a short response. It said that Voisin had been in contact with Forgejo's security team with his findings:

The issues raised concern defense-in-depth improvements and denial-of-service risks. There is no known RCE exploit possible without internal server credentials.

We believe these findings can be addressed publicly. The security team will open issues where approaches to implement new defensive measurements will be discussed, we believe there's no single answer and as such appreciate the opinion of other Forgejo contributors on this matter.

The statement that the RCE is not possible without credentials seems to be at odds with Voisin's claims, but without more information there's no way to ascertain which is the case.

Given the increasing ease of using LLMs to find security flaws, it's entirely likely that Voisin casting a spotlight has increased the number of people spending tokens on finding holes in the project. Some of them may not be as well-intentioned as Voisin. In the coming weeks and months it will be interesting to watch Forgejo to see what kind of security improvements are made, and what kind of security flaws come to light.


The 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit

By Jonathan Corbet
May 7, 2026

LSFMM+BPF
Once a year, a collection of developers from the kernel's storage, filesystem, memory-management, and BPF subsystems gathers to discuss pressing development questions that may not be amenable to solution via email. The 2026 edition of the Linux Storage, Filesystem, Memory Management, and BPF Summit was held during the first week of May in Zagreb, Croatia. LWN, naturally, was there in force.

[Esplanade hotel]
Coverage from this gathering is still in progress; the sessions with articles thus far are:

Plenary sessions

Filesystem track

Memory-management track

BPF track

Joint storage and filesystems sessions

Joint storage and memory-management sessions

Group photo

The traditional group photo, as taken by the Linux Foundation. More photos from the summit can be found on the LF flickr site.

[Group photo]

Acknowledgment

Many thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to Zagreb to cover this event.


A new era for memory-management maintainership

By Jonathan Corbet
May 7, 2026

On April 21, Andrew Morton let it be known that he intends to begin stepping away from the maintainership of the kernel's memory-management subsystem, a responsibility he has carried since before memory management was even seen as its own subsystem. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, one of the first sessions in the memory-management track was devoted to how the maintainership would be managed going forward. There are a lot of questions still to be answered.

Morton started by observing that he had received almost no responses to his announcement. That was, he suggested, a result of the fact that he was asking others to take on more responsibility in the subsystem; developers are going to have to pick up a lot of tasks.

[Andrew Morton]
How that work can be spread out is an open question. There are 164 C files in the kernel's mm directory; a quick search shows that terms like "THP" (transparent huge page), "cgroup" (control group), and "NUMA" each show up in a large number of those files. The core memory-management concepts, in other words, are widely spread throughout the subsystem. Moving code around might help a bit, Morton said, but not much; everything is heavily interlinked, so trying to split up the subsystem will be challenging. But, he noted with a smile, "it's not my problem".
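For readers who want to try that sort of quick search themselves, here is a minimal sketch. It is not the method used to produce the numbers above; the function name, the case-insensitive substring match, and the choice to include subdirectories of mm are all assumptions made for illustration, so the counts it produces may differ from those quoted.

```python
import re
from pathlib import Path


def count_files_mentioning(mm_dir, terms):
    """Count the C files under mm_dir and how many of them mention each term.

    Hypothetical helper: matching is a case-insensitive substring search,
    and subdirectories are included; both choices are assumptions.
    Returns (total_c_files, {term: files_containing_term}).
    """
    patterns = {t: re.compile(re.escape(t), re.IGNORECASE) for t in terms}
    counts = {t: 0 for t in terms}
    c_files = sorted(Path(mm_dir).rglob("*.c"))
    for path in c_files:
        # Kernel sources are mostly ASCII; replace anything undecodable.
        text = path.read_text(encoding="utf-8", errors="replace")
        for term, pattern in patterns.items():
            if pattern.search(text):
                counts[term] += 1
    return len(c_files), counts


# Example use, assuming a kernel source tree in the current directory:
#   total, counts = count_files_mentioning("mm", ["THP", "cgroup", "NUMA"])
#   print(total, counts)
```

A substring match is crude (a term can also appear inside an unrelated identifier or a comment), but it is enough to show how widely a concept is spread across the directory.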

Regardless of how successful a split is, there will always be a need for a catch-all tree serving the subsystem as a whole. The management of that catch-all and integration tree will be picked up by David Hildenbrand; Morton offered his thanks to Hildenbrand for taking on that responsibility.

The memory-management developers (and those for the kernel as a whole), he said, have many layers of defense to prevent the shipping of bad code to users. The community can put out random stuff, but it goes through weeks of testing in the -mm tree, followed by more weeks of testing after landing in the mainline kernel. Fixes can be backported into the stable kernels for years, and distributors provide further layers of assistance. More recently, Sashiko, by providing a new level of patch review, has become yet another layer of defense.

Developers should, he said, recognize that they are not creating production-quality code at the outset; they are creating a technology and letting others turn it into products. All of the layers of defense allow developers to pursue an "aggressive rate of change". There is a downside to that change rate, though: it puts a lot of pressure on reviewers. Within the memory-management subsystem, the review work is quite lopsided, in that a small number of people do the bulk of the review, while many developers do not carry much of the review load at all. He does not understand why things work that way, and wishes that the situation could be improved.

The memory-management team, he said, is a great group of people; they are cooperative and quick to help each other. He worries, though, that the community could, over time, regress to be like other parts of the kernel or other open-source projects, where emails are ignored because maintainers are too busy. Ignoring messages from contributors, he said, is shameful and unacceptable. But, he thinks, the memory-management community does value its culture and will work to maintain it.

Matthew Wilcox said that, when somebody asks him how to get started within the community, he always encourages them to start reviewing patches. That advice is often not taken, though. He added that, while he tries to respond to email, there are days where he is simply unable to reply to all of the messages that have shown up. Dan Williams said that the community has long depended on Morton to apply pressure when responses are needed; Hildenbrand and others will need to apply that pressure in the future.

Hildenbrand responded that there will always need to be somebody who makes sure that people are doing their part; that's part of what makes the subsystem great. The level of developer frustration within memory management is lower than in many other subsystems. He worries a bit about the onset of LLM-based review tools, though. Everybody agrees that there is a need for more human reviewers, so he thinks that reviews from tools like Sashiko should not be posted to the public lists. Early-stage reviewers, who are just learning their way around, will be demotivated by seeing that automated reviews have already found many of the problems. Automated review, Hildenbrand said, should be one of the last lines of defense, not the first.

Liam Howlett said that sending LLM-based reviews to one-off contributors can have the effect of validating bad ideas; a number of participants suggested that these tools are good at finding bugs, but less good at addressing the question of whether a given change should be made at all. Morton said that beginning reviewers often focus on understandability issues that more experienced developers will skip over; that, too, is important feedback.

At the end of the session, Hildenbrand stepped up to thank the community for trusting him with the responsibility for running the integration tree. He warned Morton that the community was not going to let him go easily, though; there will be a lot of questions. Some sort of working group will be formed, Hildenbrand said, to figure out how the memory-management community's development model should work in the future. It may be that more sub-components will move to their own trees, while there will always be the integration tree to pull everything together. He closed by letting it be known that he would not be doing all of the work on his own.


Using dma-bufs for read and write operations

By Jonathan Corbet
May 12, 2026

LSFMM+BPF
The kernel's dma-buf subsystem provides a way for drivers to share memory buffers, usually in order to support efficient device-to-device I/O. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, Pavel Begunkov, assisted by Kanchan Joshi, led a joint session of the storage and memory-management tracks to explore ways to make the use of dma-bufs more efficient yet, and to make them available for read and write operations initiated by user space.

Begunkov began with a mention of this 2022 patch set from Keith Busch, which pointed out that, while a dma-buf can facilitate efficient I/O operations, there is often a fair amount of expensive setup work to do before those operations happen. This work includes the creation of various internal data structures, the establishment of DMA mappings, and possibly some expensive configuration of the I/O memory-management unit (IOMMU). When a new dma-buf must be created for each operation, that work must be repeated and much of the efficiency is lost. Busch's solution was to allow dma-bufs to be registered with the io_uring subsystem, similarly to how io_uring supports registered files and buffers. That would allow the registered dma-buf to be reused (within io_uring), spreading the setup cost across multiple operations.

That series never made it into the mainline, but interest in that concept remains. Begunkov has a patch series of his own extending Busch's work. His objective, he said in the session, is to create a consistent infrastructure to allow for the use of dma-bufs in the networking and storage subsystems. He has chosen io_uring registered buffers as the user-space API, with a special registration operation needed for dma-bufs. User space would obtain a dma-buf from a subsystem that supports them, then register the associated file descriptor with io_uring; thereafter, it would be available for I/O.

There are some requirements for this work. Despite the use of io_uring as the API, the internals of this mechanism should not be io_uring-specific; it should eventually be extendable to filesystems and beyond. It also has to support map invalidation by the dma-buf provider. The internal API is centered around a new io_dmabuf_token structure, which is the interface between the driver implementing the dma-buf and io_uring. Specific I/O requests are tracked with an io_dmabuf_map structure, which is supported by the iomap subsystem to provide a driver-specific way of iterating through I/O requests. The patch series is coming along, but is not yet ready.

One question that comes up occasionally, he said, is whether P2PDMA should be used for this purpose. There are a few reasons why P2PDMA is not sufficient. It is unable to use dma-bufs that user space may already have, but that is a requirement. The new API can support cheaper intermediate transformations of data, better optimize IOMMU use, and provide support for map invalidation; a member of the audience said that P2PDMA supports map invalidation as well. The downside of not using P2PDMA is, of course, the need for a new API, and one that is limited to io_uring for now.

Use cases, Begunkov said, include applications that need to optimize IOMMU use with normal host memory. There are a number of networked storage solutions that could benefit from easy movement of data between network interfaces and filesystems. There is also evidently a company that wants to use this feature for its GPU infrastructure. Joshi added that the NVMe subsystem could benefit from this feature to implement pass-through support, among other things. Future plans include adding support for more block drivers, for the SCSI subsystem, and for filesystems.

An IOMMU pre-mapping benchmark showed performance improvements of up to 8.8x. Notably, pre-mapping completely eliminated the performance penalty that comes from using the IOMMU in either the lazy or strict modes, both of which do a certain amount of TLB invalidation on mapping changes to enforce device isolation. In other words, it is no longer necessary to use the IOMMU pass-through mode, which is seen by some as being less secure, to get full performance.

Jason Gunthorpe, though, wondered why pass-through mode was not enough, and how the additional complexity of pre-mapping was justified; Begunkov answered that security concerns were behind the desire to get away from pass-through mode. Gunthorpe said that a better solution was to just not leave the IOMMU mapped after operations are complete. Christoph Hellwig said that some sites are requiring IOMMU use, and that the memory coalescing that IOMMUs do is helpful for performance, so full IOMMU support with good performance is needed; Gunthorpe acknowledged that those were good points. Matthew Wilcox suggested that the mapping of a buffer is a good time to defragment the underlying memory, removing the need for coalescing in the first place.

David Howells worried that misuse (accidental or deliberate) of dma-bufs could create problems by clogging all of the available IOMMU slots, and wondered whether this feature would require privilege to use. Begunkov agreed that it could be a problem, and said that some sort of capability check would be required.

Christian Brauner took issue with the fact that this feature uses scatterlists, an internal API that the developers would eventually like to get rid of; Hellwig answered that dma-bufs still need scatterlists, so they cannot be avoided for now. There was some unfocused discussion on removing the scatterlist dependency from dma-bufs, but Hellwig said that Begunkov's work should not be held up waiting for that cleanup to be done. As time ran out, there was also some discussion of how filesystem access might be supported; patches for that have not yet been seen.


Scaling transparent huge pages to 1GB

By Jonathan Corbet
May 12, 2026

LSFMM+BPF
As a general rule, when developers talk about huge pages, they are referring to PMD-level pages that are 1MB or 2MB in size, depending on the CPU architecture. Most CPUs can support other huge-page sizes, though. On x86 systems, PUD-level huge pages hold 1GB of data. Providing such large pages transparently to processes has generally not been considered as either feasible or desirable, but Usama Arif is trying to change that assessment. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, he led a session in the memory-management track on how to make transparent huge pages (THPs) truly huge.

On most systems, a 1GB, physically contiguous chunk of memory can be hard to find, especially after the system has run for a while and memory has fragmented. Applications that can make use of such large chunks of memory have also been relatively scarce. It is not surprising that little effort has gone into making a difficult-to-find resource transparently available to processes that are unlikely to benefit from it. But, as Arif began, large-scale installations are now running with terabytes of installed memory. On such systems, a PMD-level huge page is no longer huge. Managing all that memory brings scalability problems; managing it in 1GB chunks can help.

Applications can gain access to 1GB huge pages now by using the hugetlbfs subsystem. But hugetlbfs is a static resource, requiring the establishment of a separate pool at boot time. It provides no fallback if a huge-page allocation request cannot be satisfied. There is a real need, Arif said, for a transparent way to back large applications with 1GB huge pages. He has an RFC patch set to fill this need, posted in February, that turned out to be smaller and less invasive than he had expected.

Arif dove directly into the details of how the management of 1GB transparent huge pages would work. When creating a 2MB PMD-level transparent huge page, current kernels "deposit" an extra (base) page that can later be used to remap the huge page at the PTE level, should the need come to split it. This deposit is made because splitting may be happening in response to memory pressure, so it should be possible to do without having to allocate more memory first. The single page is wasted during the life of the THP, but it serves as a sort of insurance policy for times when memory is scarce.

In the RFC patch for 1GB THPs, Arif scaled up this policy to match the page size; it deposited pages for the PMD-level page table and 512 PTE-level page tables that would be needed to split the THP. That is about 2MB of wasted memory, which makes for an expensive insurance policy. David Hildenbrand had questioned the need for this preallocation — for both PMD-level and PUD-level THPs — so Arif was now considering doing without the page-table deposit. In the session, Hildenbrand said that, if the system is splitting 1GB huge pages, somebody is doing something wrong. Even on the largest systems, those pages are a scarce resource; they should be kept intact if at all possible.
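The "about 2MB" figure follows directly from the page-table geometry. As a rough sketch, assuming x86-64 with 4KB base pages and 512 entries per page table (the macro and function names here are made up for illustration and do not come from the patch set):

```c
/* Assumed x86-64 geometry: 4KB base pages, 512 entries per page table. */
#define BASE_PAGE_SIZE    4096UL
#define ENTRIES_PER_TABLE 512UL

/*
 * Page-table pages deposited for a PMD-level (2MB) THP: a single
 * PTE-level table, enough to remap the huge page as base pages.
 */
static unsigned long pmd_thp_deposit_pages(void)
{
	return 1;
}

/*
 * Page-table pages deposited for a PUD-level (1GB) THP in the scheme
 * described above: one PMD-level table, plus one PTE-level table for
 * each of its 512 entries.
 */
static unsigned long pud_thp_deposit_pages(void)
{
	return 1 + ENTRIES_PER_TABLE;
}

/* Memory held in reserve for a given number of deposited pages. */
static unsigned long deposit_bytes(unsigned long pages)
{
	return pages * BASE_PAGE_SIZE;
}
```

Under those assumptions, the 1GB case reserves 513 pages, or 2,101,248 bytes, for each huge page, against a single 4KB page for the 2MB case.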

The question that should be considered, Hildenbrand said, is how to decide which processes should be given 1GB THPs. Kiryl Shutsemau suggested that processes could request them with madvise(), but Hildenbrand said that would be problematic for a number of reasons. Shutsemau then wondered about processes asking for 1GB THPs that cannot actually make full use of them; Arif answered that this case is why splitting those pages needs to work.

Hildenbrand, again, said that splitting those pages should be avoided and advocated for a smarter way of allocating them. Perhaps they could be limited to shared-memory regions, for example. Arif said that would put the burden on user space to set things up properly; he was hoping for a more transparent solution. Lorenzo Stoakes said that 1GB huge pages are a resource that the kernel must maintain control over; Arif said that, by default, they would not be allocated unless the system administrator enabled them.

Matthew Wilcox said that the right answer was to make 1GB huge pages cheap enough that they are no longer a scarce resource. Johannes Weiner said that there are a lot of applications that could benefit from 1GB THPs, but they (or their users) do not know that. At his employer, they have been rolling out 1GB huge pages extensively, and seeing a lot of performance benefits from them. He suggested handing out 1GB THPs widely, then fixing the cases where they are not fully used.

Arif moved on to the question of whether supplying 1GB THPs would require using the contiguous memory allocator (CMA). His patch series works without it, but allocating huge pages of that size can be hard. Part of the problem, he said, is that the memory-management subsystem's compaction code is currently working at the PMD level, so it does not succeed in defragmenting 1GB huge pages. He mentioned some ongoing work from Rik van Riel that is aimed at making 1GB chunks easier to allocate.

Splitting of 1GB THPs is an open question as well, Arif said. The current patch set, when called upon to split such a page, will disassemble it all the way to the PTE level, yielding 262,144 base pages. He is considering only splitting to 512 PMD-level huge pages, then splitting just one of those down to the PTE level. He asked whether that would be an acceptable strategy; relative silence in the room suggested that, at a minimum, there were no real concerns with that idea.
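The difference between the two splitting strategies can be put in numbers. The following is only a back-of-the-envelope sketch, again assuming 512 entries per page table; the structure and function names are hypothetical:

```c
#define SPLIT_ENTRIES 512UL	/* entries per page table, assumed x86-64 */

struct split_result {
	unsigned long pmd_mappings;	/* 2MB mappings left after the split */
	unsigned long pte_mappings;	/* 4KB mappings created by the split */
};

/* Current RFC behavior: take the 1GB page all the way down to base pages. */
static struct split_result split_pud_to_ptes(void)
{
	struct split_result r = { 0, SPLIT_ENTRIES * SPLIT_ENTRIES };
	return r;
}

/*
 * Alternative under consideration: split to 512 PMD-level huge pages,
 * then split just one of those down to the PTE level.
 */
static struct split_result split_pud_to_pmds_then_one(void)
{
	struct split_result r = { SPLIT_ENTRIES - 1, SPLIT_ENTRIES };
	return r;
}
```

The full split produces 262,144 PTE-level mappings; the partial one leaves 511 huge pages intact and creates only 512 PTE-level mappings.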

When he asked whether the khugepaged kernel thread should try to assemble 1GB THPs from existing process memory, though, the answer was a clear "no". Allocating them at initial mapping time seems to be the desired approach. Creating them after the fact in response to an MADV_COLLAPSE madvise() call might be acceptable, though.

Migration of 1GB THPs is another challenge; finding a 1GB page at the destination node can be difficult. The alternatives are to block migration for these pages, or to split them. Blocking migration is a simple solution, but it loses the transparency aspect, and would break memory hotplug. With splitting, hotplug and NUMA balancing would work, but the 1GB mapping would be lost. Which alternative is best (or least bad) is not clear.

The group might have tried to debate that problem, but the session was far over its allotted time by this point. Hildenbrand closed it by suggesting that the initial implementation should operate only on shared memory; that would simplify a number of aspects of the implementation. It would also be possible to add a mount option for shmfs that would allow administrators to control access to the feature.


Revisiting mshare

By Jonathan Corbet
May 13, 2026

LSFMM+BPF
Linux can share memory between processes, but each process (almost always) has its own set of page tables. In situations where vast numbers of processes are sharing a memory region, the combined size of the page tables can exceed that of the shared memory itself. There has, thus, long been an interest in enabling unrelated processes to share page tables referring to shared memory. Anthony Yznaga is the latest developer to try to push this idea (known as "mshare") forward; he described the status of that work in a memory-management-track discussion at the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF).

This is not the first (or second) time that page-table sharing has made it onto the LSFMM+BPF agenda; it was most recently discussed in 2024, when Khalid Aziz updated the group on the proposal. Aziz has since retired, and Yznaga has picked up the work.

The overall shape of this patch series has not changed; sharing starts when a process creates a shared memory region by creating a file in a special msharefs filesystem. That region is created along with its own mm_struct structure, which is used to manage the page tables for the region. Each sharing process can then attach to this region by opening and mapping the msharefs file, resulting in the creation of a special virtual memory area (a "window VMA") representing that region in the process's address space. Page faults and other memory-management operations, upon encountering the window VMA, follow the pointer to the special mm_struct and operate on the page tables there.

At least, that is how things looked in 2024, and again in 2025 when Yznaga posted an updated version of the patch set. In the session, though, he let it be known that, while the implementation of mshare is substantially the same, the API has switched back to system-call form, as had been the case with earlier versions. Back then, a single mshare() system call had been proposed; now there is a whole set of them. The shared region is now created with mshare_create():

    int mshare_create(unsigned int flags);

This call will return a file descriptor representing the new region; the only supported flags value is O_CLOEXEC. The size of the region must subsequently be set with an ftruncate() call. A call to mshare_attach() will map the shared region into the calling process's address space:

    int mshare_attach(int fd, unsigned int offset, unsigned int size,
                      void *addr, unsigned int flags);

(Note that Yznaga did not show the types of the parameters on his slide, so I have filled in something plausible). There is an mshare_map() to do the equivalent of an mmap() call, setting up backing store for the addresses within the shared region. There are various other calls, including mshare_advise() and mshare_protect(), to control the management of this region. Yznaga did not go into details about how other processes find and attach to this region.

The ownership model of the shared region has changed somewhat, in that the process calling mshare_create() is the owner for the life of the process; when that process exits or closes the file descriptor, the region disappears and mappings in other processes are removed. This change, he said, simplifies control-group accounting, makes the lifetime of the region clear, and provides a target for the out-of-memory killer should things reach that point.

Yznaga concluded with a summary of problems he is working on now. Page-table walking is a big challenge, he said, especially getting the locking between the window VMAs and the mshare region's VMA correct. The resident-set-size statistics for mshare-using processes are wrong; the information for the shared region is stored in the special mm_struct and not exposed anywhere. The current design requires the process that created the region to stick around; there would be value in some sort of ownership-transfer mechanism so that the creator could hand the region off and exit.

He is also looking for more potential use cases for this feature. Jason Gunthorpe mentioned high-performance-computing processes that need to share resources, so it would be good to show that this feature can work for that use case. Another participant mentioned the Android Zygote process, which serves as the parent for all apps in the system. There is quite a bit of sharing between those processes, and thus potential benefit from mshare, but any changes made by a process to the region should not be visible to other attached processes, so it would be necessary to unshare page tables in that case.

Another participant asked how TLB flushing is handled in the shared region. There is, Yznaga answered, a linked list of all processes sharing the region; when a TLB flush happens, that list is traversed and each process is flushed individually. Gunthorpe observed that this list sounded a lot like an MMU notifier; Yznaga said that he had tried using notifiers, but they did not work in that case.

Matthew Wilcox noted that allowing each process to map the shared region at a different virtual address increases the complexity of the feature; perhaps requiring the region to be mapped at the same address everywhere would be a useful simplification? The response in the room made it clear that this was not a popular idea. The session ended with Wilcox suggesting that, once the feature finally lands, it will be possible to remove the page-table sharing implemented by the hugetlbfs subsystem, which is currently the only way to get this kind of sharing on Linux systems.


Providing 64KB base pages with 4KB kernels, two different ways

By Jonathan Corbet
May 11, 2026

LSFMM+BPF
Some CPU architectures are able to run with a number of different base-page sizes; using a larger size can often result in better performance at the cost of increased memory use. Other architectures are more limited. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, two sessions in the memory-management track explored options for letting processes run with 64KB page sizes when the underlying kernel does not. The first was focused on letting each process have its own page size, while the second concerned bringing 64KB pages to x86 systems.

Per-process page sizes

Using 64KB pages improves performance, but doing so can also create internal fragmentation and a significant amount of wasted memory. That memory-use price tends to limit the use of larger base-page sizes. Ryan Roberts and Dev Jain (remotely) presented a plan to enable the running of processes with page sizes that differ from that of the system as a whole, in an attempt to get the best of both worlds.

Roberts started by saying that there is a performance gap between systems with larger and smaller page sizes. With a "random selection of benchmarks", a performance improvement of 2-17% can be had with a larger page size. But the associated memory usage increase pushes people to stay with the standard 4KB page size supported by many architectures. The contiguous-PTE support found in some recent processors (where physically contiguous pages can share a translation lookaside buffer (TLB) entry) helps a bit, but even using that feature, the performance gap remains.

There are a number of reasons for the performance difference. On the software side, a larger page size equates to fewer page faults and shorter least-recently-used (LRU) lists in the kernel. On the hardware side, larger pages lead to better TLB use; a system running with 64KB pages can cover 16 times the memory area with the TLB. Arm CPUs can cache the results of the last page-table walk, speeding the translations of addresses that land within the same page-table entry (PTE) page; larger page sizes increase the coverage of that cache. Using larger pages also just makes the page tables more compact, reducing their TLB and cache impact.
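The factor of 16 is simply the ratio of the two page sizes. A minimal sketch of the arithmetic follows; the function names are illustrative, and the fault count assumes one fault per page with no fault-around or huge-page mappings:

```c
/*
 * Bytes of address space covered by a TLB holding `entries` entries
 * when each entry maps a single page of `page_size` bytes.
 */
static unsigned long long tlb_reach(unsigned long long entries,
				    unsigned long long page_size)
{
	return entries * page_size;
}

/*
 * Page faults needed to touch every page in a region, assuming one
 * fault per page. A partial trailing page still costs a fault.
 */
static unsigned long long faults_to_touch(unsigned long long region_bytes,
					  unsigned long long page_size)
{
	return (region_bytes + page_size - 1) / page_size;
}
```

For a hypothetical 1024-entry TLB, that works out to 4MB of coverage with 4KB pages and 64MB with 64KB pages; touching a 1MB region takes 256 faults in the first case and 16 in the second.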

There is architecture-level work aimed at closing the performance gap, Roberts said, but the results of that work will not be available for some years yet. So there is reason to explore what can be done on the software side instead. One possibility is to give each process its own page size, so that processes that benefit from larger pages can have them without imposing higher memory use on the system as a whole. The Arm architecture, in particular, supports this concept, allowing the kernel to remain with a 4KB page size while letting individual processes run with larger pages.

Jain took over to describe the proposed implementation, which is split into three layers. The first of those, the "ABI adaptor", is designed to hide the difference between the kernel's page size and that of any given process. Each process's page size is stored in the mm_struct structure; it is preserved when a process forks, but may be changed by an execve() call. Various system calls (mmap(), for example) will modify length and alignment parameters to match the kernel's page size. That work is fairly straightforward, Jain said, but ioctl() calls can require more care. The ELF loader is enhanced to understand the alignment needs of processes with different page sizes. There is a fair amount of trickery added to the implementation of various /proc files so that a process running with 64KB pages sees the results that would come from a 64KB kernel.

The second layer is a set of modifications to the kernel's memory-management subsystem. It turns out that a lot of the code paths used to implement transparent huge pages can be reused to provide 64KB pages, on a 4KB kernel, to processes using the larger page size. For such processes, allocation requests specify the page size as the minimum acceptable allocation size; larger pages, up to the PMD-level huge-page size, remain possible.

The page cache presents challenges of its own, since it is shared by all processes in the system. One option would be to just use 64KB folios there all the time, but that would waste quite a bit of memory when caching small files. So the page cache still uses 4KB pages most of the time. Should a 64KB process map a file with mmap(), all 4KB folios from that file will be dropped from the page cache, and any new folios will subsequently be added to the cache at the larger size.

Kiryl Shutsemau asked whether all filesystems support larger folios in the page cache now; Matthew Wilcox answered in the negative, saying that some filesystems are "lazy slackers" that have not yet added that support. The biggest problem, he said, is Btrfs. Wilcox suggested that, as an alternative to dropping page-cache entries, the kernel could go ahead and use 64KB folios as long as they do not extend past the end of the file.

Lorenzo Stoakes said that this work looks rather invasive, and wondered why it was not possible to just make greater use of multi-size transparent huge pages (mTHPs), which can provide a number of the same benefits. Roberts answered that mTHPs do not provide all of the hardware-level benefits that a larger page size does. Stoakes also worried that extensive use of larger page sizes could put a lot of pressure on the memory-management subsystem's compaction code.

Time was running short, so Roberts skipped over some of the intended discussion (including the third layer, which is the architecture-specific code that handles differently sized page tables) and moved directly to a list of open items. The first of those had to do with what happens if the kernel attempts an operation requiring a 4KB page size while running in the context of a 64KB process. One option would be to have the process fall back to 4KB pages; that would provide functional correctness, but would lose performance. The alternative is to fail the operation; this idea seems simpler, but would require sprinkling a lot of page-size checks throughout the kernel, Roberts said.

User-space ABI compatibility is a challenge; the kernel can pretend to be running with 64KB pages when queried by a 64KB process, but it will never be able to emulate everything. Some /proc files, for example, simply cannot hide the fact that the kernel is using 4KB pages. It is also not possible for /proc/PID/pagemap to represent a 4KB process when read by a 64KB process. There are also some system calls and other features (userfaultfd(), for example) that cannot be emulated.

One way to deal with these problems, Roberts said, would be to "defeature" 64KB processes, limiting their functionality. Processes with different page sizes would be invisible to each other, and processes with page sizes larger than the kernel's would be unable to use features like userfaultfd(). Any operation that cannot be properly represented to a 64KB process would simply fail.

Roberts concluded by saying that, while allowing processes to have different page sizes brings benefits, there are some sticky points as well. Adding this feature would also bring a fair amount of churn to the memory-management subsystem. Those benefits may well be worth the trouble, though.

A 64KB base-page size for x86

Using larger base pages can be a nice solution for workloads that benefit from them, but there is one little problem: some minor architectures, including x86, do not support running with larger base-page sizes. In the next session, Shutsemau proposed a way to work around this limitation on x86 systems. The idea was met with a certain amount of skepticism by the assembled developers, though.

Using 64KB base pages, Shutsemau began, can provide a 1.7% performance improvement on "a very important workload" on Arm processors; he would like to bring that speedup to x86 systems as well. Using larger pages would reduce the memory overhead of the system memory map and allow for easy (and performance-improving) TLB coalescing, faster I/O operations, and easier allocations of 1GB huge pages. Doing so, he said, requires splitting the kernel's concept of the system page size in two.

Currently, the PAGE_SIZE macro is used throughout the kernel to represent the hardware's base-page size. Shutsemau would phase that out, in favor of PTE_SIZE, which describes the hardware's view of the base-page size, and PG_SIZE, which is the size of pages as they are managed within the kernel (and seen by user space). The PAGE_SIZE macro would only be defined at all if PTE_SIZE and PG_SIZE are equal. Page-frame numbers, he said, would always refer to PTE_SIZE frames.

Needless to say, there are a lot of places in the kernel that would have to change to reflect this new view of the world. Creating page-table entries would become more complicated, since the offset within the (PG_SIZE) page would have to be taken into account; all of the functions that deal with PTEs would gain a new offset parameter. While the kernel is managing 64KB pages, user space would still see the page size as being 4KB, as always. So there would be no user-space changes required to run successfully on such a system.
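To see where the extra offset comes from, consider how a page-frame number (counted in PTE_SIZE units) relates to the larger page that contains it. This sketch uses the PTE_SIZE and PG_SIZE names from the proposal with assumed values of 4KB and 64KB; the helper functions are hypothetical and are not taken from any posted patch:

```c
#define PTE_SIZE	4096UL		/* hardware base-page size (assumed) */
#define PG_SIZE		65536UL		/* kernel-managed page size (assumed) */
#define PTES_PER_PG	(PG_SIZE / PTE_SIZE)

/* Index of the PG_SIZE page that contains a given PTE_SIZE frame. */
static unsigned long pfn_to_pg_index(unsigned long pfn)
{
	return pfn / PTES_PER_PG;
}

/* Byte offset of that frame within its PG_SIZE page. */
static unsigned long pfn_offset_in_pg(unsigned long pfn)
{
	return (pfn % PTES_PER_PG) * PTE_SIZE;
}
```

With these values, each kernel page is backed by 16 PTEs; frame 35, for example, falls in page 2 at an offset of 12,288 bytes. Any function that builds a PTE from a page would need that offset to pick the right frame.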

The most challenging part, Shutsemau said, is page-fault handling, since multiple PTEs would have to be mapped for each faulting page. User space would only be held to a 4KB alignment requirement, meaning that virtual memory areas (VMAs) could begin or end in the middle of a 64KB page. The page-fault handler might, as a result, end up only mapping part of a page when a fault happens; in such cases, the unmapped part of the page would simply be wasted. Misaligned pages could also lead to memory waste.
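The potential waste from misaligned VMAs can be estimated by rounding the VMA out to 64KB boundaries. The function below is a hypothetical illustration, not code from the proposal; it gives an upper bound that assumes every 64KB page the VMA touches is fully allocated and that nothing else uses the remainder:

```c
#define HW_PAGE		4096UL		/* alignment user space is held to */
#define BIG_PAGE	65536UL		/* page size managed by the kernel */

/*
 * Bytes within the 64KB pages spanned by [start, end) that lie outside
 * the VMA itself. Assumes start and end are 4KB-aligned and start < end.
 */
static unsigned long vma_unmapped_bytes(unsigned long start, unsigned long end)
{
	unsigned long first = start & ~(BIG_PAGE - 1);			/* round down */
	unsigned long last = (end + BIG_PAGE - 1) & ~(BIG_PAGE - 1);	/* round up */

	return (last - first) - (end - start);
}
```

A VMA covering exactly one aligned 64KB page wastes nothing. An 8KB VMA that straddles a 64KB boundary, though, touches two large pages and could leave as much as 120KB unmapped.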

Wilcox said that copy-on-write (COW) faults would become more expensive on these systems, since they would have to fault in surrounding base pages to fill out a 64KB page. David Hildenbrand, instead, worried about how userfaultfd() could be implemented; it might need a new operation to install a single PTE rather than a whole page.

Hildenbrand also suggested it might be better to just go to a 64KB page size throughout the system; that would make life easier for everybody, he said. That, Shutsemau answered, would really just have the effect of shifting the complexity to the architecture code, which would have to implement the fiction of a larger base-page size and hide the details from the rest of the kernel. Going to a larger base-page size would also break some applications. Hildenbrand was unsympathetic to the latter point, saying that such applications should either be fixed or just be run on 4KB systems.

Jason Gunthorpe said that there has been a lot of experience with 64KB page sizes on Arm systems. Users tend to push back, he said, because there is always one special application that can only run with 4KB pages. Another participant asked why this complexity was needed when the kernel has support for mTHPs that is getting better over time. Part of the problem with that idea, Shutsemau said, is that not all filesystems support larger folios. Sticking with a small base-page size also makes it harder for the system to allocate larger chunks of memory.

On the topic of memory waste, Hildenbrand suggested the possibility of creating "negative-order folios" to represent sub-page chunks of memory. The idea of using the slab allocator for sub-page allocations was also suggested, but that would not work in all cases.

As the session ran low on time, Shutsemau acknowledged that he was not seeing a lot of enthusiasm for his proposal. He asked what the fundamental objections were. Hildenbrand answered that, in current kernels, an order-zero folio is a single page; changing that understanding would entail a significant change in how folios are handled. He asked for a cleaner way, one that requires no "weird part-of-page interfaces" to reach the desired objectives.

Gunthorpe said that the fundamental constraint is that there has to be a way to run old applications that require a 4KB page size. It would be better to find a way to solve that problem, with minimal kernel disturbance, on systems with a larger base-page size. The session closed with Hildenbrand saying that other work in the kernel is addressing many of the motivations behind Shutsemau's proposed changes. Given that, he suggested, 64KB base pages may not be the future; the right path may be better optimizing the operation of systems with 4KB pages.


A 2026 DAMON update

By Jonathan Corbet
May 8, 2026

The kernel's DAMON subsystem provides user-space monitoring and management of system memory. DAMON is developing rapidly, so an update on its progress has become a regular feature of the annual Linux Storage, Filesystem, Memory Management, and BPF Summit. This tradition continued at the 2026 gathering with an update from DAMON creator SeongJae Park covering a long list of new capabilities — tiering, data attributes monitoring, transparent huge pages, and more — being added to this subsystem.

DAMON, Park began, is a kernel subsystem that provides efficient monitoring and operations for memory management. At its core, it spawns a kernel thread that samples memory accesses every 5ms. The results are combined into data that is returned to user space every 100ms, though these intervals can, of course, be tuned manually or automatically. The access information returned describes the location, stability, and frequency of memory operations. The system was designed to be both accurate and lightweight, and to be both tunable and auto-tuning. On a typical system, it imposes a performance overhead of less than 0.1%. This subsystem was first merged into the 5.15 kernel release; it is enabled in many distribution kernels at this point.
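
The relationship between the two intervals can be sketched with a small simulation. The 5ms and 100ms values are the defaults quoted above; everything else (the region names, the access probabilities, and the functions themselves) is hypothetical and is not how DAMON is implemented in the kernel.

```python
import random

SAMPLE_INTERVAL_MS = 5     # sampling interval quoted in the talk
AGGR_INTERVAL_MS = 100     # aggregation interval quoted in the talk


def samples_per_aggregation(sample_ms=SAMPLE_INTERVAL_MS, aggr_ms=AGGR_INTERVAL_MS):
    """Number of access checks folded into one aggregated result."""
    return aggr_ms // sample_ms


def aggregate(access_probability, rng):
    """Simulate one aggregation window.

    access_probability maps a (hypothetical) region name to the chance that
    a single sample finds the region accessed. The result maps each region
    to the number of samples that saw an access.
    """
    checks = samples_per_aggregation()
    return {
        region: sum(rng.random() < prob for _ in range(checks))
        for region, prob in access_probability.items()
    }
```

With the default intervals, each result returned to user space summarizes 20 samples, so a region's access frequency is reported as a count between zero and 20.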

The "second face of DAMON", Park said, is the DAMOS machinery, which provides operations to change how memory is managed. Operations can, for example, force out cold memory, or migrate memory between tiers depending on its usage patterns. More information is available on the DAMON web site.

Tiering

At the 2025 Summit, Park said, he had described the damos_migrate operations, which had been merged for the 6.11 release. These operations facilitate the movement of pages between system RAM and CXL-attached memory — memory tiering, in other words. Work on TPP-DAMON (where "TPP" stands for "transparent page placement") was underway, with the ability to automatically tune thresholds to yield high RAM utilization. Work on tiering is continuing, but a single thread has proved to be too slow for the task. So TPP-DAMON has moved to a multiple-thread model. It is able to produce a 94% improvement in a llama.cpp benchmark. TPP-DAMON was merged for 6.16, with control-group awareness being added in 6.19. Development has moved elsewhere, though, so TPP-DAMON has already landed in support mode.

The damos_migrate operation has been extended to support dynamic interleaving, where some hot memory is placed in (slower) CXL memory to maximize the overall utilization of memory bandwidth. It can support multiple destination nodes, each with its own weight, but works in virtual address spaces only. This feature can produce a 25% speedup in an unnamed benchmark; it was merged in 6.17.
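
One way to picture weighted placement across several destination nodes is a simple weighted round-robin. This is only a sketch: the node numbers and weights are made up, and the kernel's actual placement logic may differ.

```python
from bisect import bisect_right
from itertools import accumulate


def interleave(num_pages, weights):
    """Assign page indexes to nodes in proportion to positive integer weights.

    weights maps a node ID to its weight; the result maps each node ID to
    the list of page indexes placed there.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    nodes = list(weights)
    bounds = list(accumulate(weights[node] for node in nodes))
    total = bounds[-1]
    placement = {node: [] for node in nodes}
    for page in range(num_pages):
        slot = page % total                      # position within one cycle
        node = nodes[bisect_right(bounds, slot)]
        placement[node].append(page)
    return placement
```

With weights of 3 for a DRAM node and 1 for a CXL node, three out of every four pages land in DRAM and the fourth goes to CXL memory.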

Automatic tuning of interleaving is still a work in progress; it works in the physical address space. It can request the migration of pages with the goal that a given level of memory pressure should be maintained, or that a specific percentage of hot pages be placed in CXL memory. This feature was merged for the 7.1-rc1 release.
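
Goal-driven tuning of this kind can be thought of as a feedback loop. The sketch below uses a simple proportional adjustment; the function name, the minimum quota, and the formula itself are assumptions made for illustration, not a description of the merged code.

```python
def next_quota(last_quota, current, goal, min_quota=1):
    """Proportional feedback toward a goal.

    current and goal are measurements of the same metric, for example the
    percentage of hot pages currently in CXL memory and the percentage
    wanted. The migration quota grows while the metric is below the goal
    and shrinks once it overshoots.
    """
    if goal <= 0:
        raise ValueError("goal must be positive")
    adjusted = last_quota + last_quota * (goal - current) / goal
    return max(min_quota, adjusted)
```

When the metric sits at half of the goal, the quota grows by 50%; when the goal is met, the quota stays where it is.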

Meanwhile, the effort that was going toward TPP-DAMON is now focused on NUMA-TPP-DAMON, based on the observation that tiering is, in the end, just a special case of NUMA placement. In the new model, a system has a set of memory accessors (CPUs, GPUs, or other devices that access memory), and a set of promotion paths that can be used for memory. The concepts are there, he said, but this work is still in a brainstorming phase.

Davidlohr Bueso asked whether use of NUMA-TPP-DAMON would require disabling NUMA balancing; Park said it would not. Bueso expressed concern about the different layers fighting with each other over memory-placement decisions, but Park thought that it could be avoided with careful goal setting.

Data attributes monitoring

Last year, he said, developers had started work on page-level attribute monitoring, designed to answer questions related to, for example, how many bytes in a given region are backed by huge pages or are charged to a given control group. This monitoring has been implemented, but the overhead is high. The feature has been improved, with a number of important fixes arriving in 6.15, but the overhead problem remains.

There is a new data attributes monitoring project being started, with the goal of supporting use cases like fleet-wide monitoring. It implements a sampling-based, page-level monitor with the ability for users to register probes to narrow the set of interesting pages. Each probe filters pages based on attributes like type (anonymous or file-backed), control-group membership, idleness, and so on. These probes can act as a DAMOS filter.
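
The probe idea can be sketched as a set of predicates applied to a sample of pages. The attribute names below follow the examples in the text, but the `Page` class, the probe functions, and the fixed-stride sampling are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    anonymous: bool
    cgroup: str
    idle: bool


def is_anon(page):
    """Probe: match anonymous (not file-backed) pages."""
    return page.anonymous


def in_cgroup(name):
    """Build a probe matching pages charged to the named control group."""
    return lambda page: page.cgroup == name


def estimate_matching(pages, probes, step):
    """Check every step-th page against all probes and scale the count up."""
    sampled = pages[::step]
    hits = sum(all(probe(page) for probe in probes) for page in sampled)
    return hits * step
```

Only one page in every `step` is inspected, which keeps the cost low; the price is that the result is an estimate whose quality depends on how the matching pages are distributed.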

This system turns out to be lightweight and scalable, using the existing access-sampling logic. Its accuracy, though, is "arguable", depending on multiple workload-related factors. Page-level monitoring can be used to get more accurate information, he said, if the associated overhead is acceptable.

The first version of the data attributes monitoring patch set is on the mailing list, he said, and may be declared ready soon. At this point, the main feature is monitoring of anonymous status, but the future plans are somewhat more ambitious. The intent is to turn data access into another attribute that can be monitored, and to add a pg_idle DAMON filter that can act on that attribute. DAMON will support attribute-based splitting and merging of regions. There will be a richer set of access-check primitives, with filters for data from page faults or the system's performance monitoring unit (PMU). This feature could end up being the base for the NUMA-TPP-DAMON work.

Monitoring data from other sources is an active area of consideration. Park would like to be able to classify data accesses in a number of ways, including the source NUMA node, control group, or thread. This data would help in the writing of cache-aware sched_ext CPU schedulers; it could also be helpful for NUMA-TPP-DAMON. It could be used, for example, to find the virtual machine doing the fewest writes, which would normally be the easiest live-migration target.

Currently, DAMON is using the page-idle bit for access checking; the resulting data lacks any information about who accessed the memory or what type of access was done. To get better data, he would like to pull in events from other sources, including the page-fault handler and the PMU. The NUMA subsystem, for example, collects data on which nodes are accessing pages now; a "prototype hack" exists to feed that data into DAMON. But that use of the data interferes with the original NUMA-balancing intent, which is not the desired result.

So, Park said, an important next step is the cleaning up of the NUMA-hinting code; that should happen before any extensions are attempted. But there will still be concerns about interference between DAMON and NUMA hinting. Since both will be using the page-idle bits, each will "measure" faults caused by the other. One way to address that problem would be to make NUMA hinting and DAMON mutually exclusive, so that only one could be built into the kernel; it is a simple but inflexible approach, and would put distributors in the difficult position of having to decide which feature to enable.

An alternative is run-time isolation, where only one of the two features could be active at any given time. This is a clean and flexible solution, but relatively hard to implement. Partial isolation is yet another approach, where page marks would be left in place during transitions from one subsystem to the other. That would make transitions quicker, at the expense of muddying the data somewhat. Or, Park said, the two subsystems could just be allowed to interfere with each other; whether that would result in real problems is not yet clear. DAMON should be able to handle that interference. Concurrent use of the two subsystems would be rare, so maybe the whole problem can just be ignored. Kiryl Shutsemau pointed out that, since NUMA balancing uses sampling, losing some information is not necessarily a big problem.

Park's proposal was that, once the needed cleanup work is done, the first implementation would use either build-time isolation, or just ignore the problem altogether.

Another source of useful data could be the PMU, via the perf events subsystem. RFC implementations integrating the PMU into DAMON are circulating, and the perf maintainers seem to have no problems with the idea. But data from the PMU is hardware-specific, and it is harder to get useful data inside virtual machines. So the utility of this data in the general case is not entirely clear.

DAMON-X

Park briefly discussed a concept that he called "DAMON-X", otherwise known as "DAMON that just works". DAMON offers manual tuning knobs for users who want them, and automatic tuning for everybody else, but each DAMON module runs to the exclusion of the others. Park is working on a solution where all modules share the same basic monitoring parameters, and differ only in the DAMOS schemes that they offer. A single context can run multiple schemes, which users can install and uninstall at will. To the extent possible, all of this will auto-tune itself and simply work. A proof-of-concept implementation is to be expected later this year.

Access-aware transparent huge pages

Transparent huge pages (THPs) are good in that they can make programs run faster. But they can also cause internal fragmentation and memory waste. Users have some control over the use of THPs via madvise(), but it is hard to provide the right advice; perhaps DAMON can help. The damos_hugepage module will track access patterns and, depending on how memory is being used, collapse base pages into huge pages or split huge pages apart again. It was able to remove 80% of the THP-caused memory bloat from one benchmark while preserving 46% of the performance gain. The work is an early-stage prototype, though, and the benchmark results are not stable.

Park would like to solidify this work, and was wondering whether the damos_hugepage module should do both the assembly and splitting of huge pages, or just one of the two. Developers at Huawei have come up with a collapse-only approach that works by finding three CPU-intensive processes on the system; the hot memory areas of those processes are then collapsed into huge pages. After a defined period, a new set of three is chosen and the process starts over again. This work has yielded good results with MySQL-based workloads.
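
The selection step of the collapse-only approach might look something like the following. The process names, CPU figures, and hotness threshold are invented for illustration; the text does not describe how the Huawei implementation measures CPU intensity or hotness.

```python
import heapq


def pick_collapse_targets(cpu_usage, count=3):
    """Return the names of the busiest processes, busiest first."""
    return heapq.nlargest(count, cpu_usage, key=cpu_usage.get)


def hot_regions(regions, threshold):
    """Keep the regions whose access count meets the hotness threshold."""
    return [name for name, accesses in regions.items() if accesses >= threshold]
```

Calling `pick_collapse_targets()` again with fresh CPU figures after each period yields the next set of three.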

In general, though, the question of whether DAMON should collapse base pages into huge pages, split huge pages back apart, or both, is an open one. A system that is running in the thp=always mode may not need DAMON to create huge pages. The THP shrinkers, which can split huge pages at need, already exist, so DAMON may not need to do that either. The best set of operations is unclear at this time.

There is also the question of whether the THP primitives should operate in the virtual or physical address spaces. Collapsing a process's pages necessarily involves working within that process's virtual address space. There is, of course, the question of choosing which process to operate on; perhaps that choice could be left up to users. Working in virtual address spaces raises the possibility of interference with DAMON-X. The splitting operation, instead, only needs access to physical addresses, and would have no such interference concerns.

The final question Park raised had to do with the setting of thresholds for the degrees of hotness (for collapsing) or coldness (for splitting). These could perhaps be tuned automatically, but doing that properly depends on what the goal is. Possible goals could be a given ratio of huge to base pages, or a specific TLB-miss rate, though the latter would be hardware-specific. Other possible goals could be expressed in terms of memory bloat or pressure.

As time ran out, David Hildenbrand suggested that the splitting of huge pages by DAMON might not ever be a good idea. When the system has put together a huge page, it makes sense to keep it whole if possible; if that page is not being utilized fully, then perhaps the better solution is to migrate its contents to base pages elsewhere. He also wondered how the hotness of huge pages could be measured; since there is only a single access bit, there is no immediate indication of how much of any given huge page is being accessed.


Managing pages outside of the direct map

By Jonathan Corbet
May 13, 2026

When Brendan Jackman proposed a session for the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, his topic was "a pagetable library for the kernel". During the actual memory-management-track session, though, he stated that the idea had "fizzled" and he was going to cover related topics instead. What resulted was a session on ways to efficiently manage pages that are not present in the kernel's direct map.

The direct map makes the system's entire physical address space available within the kernel's virtual address space (on 64-bit systems, anyway). That allows the kernel to access any memory location in the system without having to set up any mappings first. The direct map is fast and convenient, but it also makes it easy for the kernel to access memory in unwanted ways, either as the result of a bug, a speculative-execution vulnerability, or some sort of compromise. There can, thus, be significant security benefits to be had by removing memory containing sensitive data from the direct map.

Jackman started by saying that he has been working on address-space isolation, which involves a lot of direct-map removal, for some time. Progress has been slow, but the feedback he has received has been positive. He is currently stuck on a number of technical details, but is also being held back by a lack of review of his patch sets, which he admitted were dauntingly large. So he is trying to break the problem down into smaller pieces that are more easily reviewed.

One of those pieces is allowing the allocation of unmapped (meaning, not in the direct map) memory. The developers of the Firecracker virtualization manager, he said, have been trying various ways of unmapping guest memory from the host's direct map, but the results have not performed well. He had proposed a set of memory-allocator changes to provide a new allocation flag, __GFP_UNMAPPED, to request memory that is not present in the direct map. Implementing that flag requires adding some new infrastructure to make this allocation more efficient than it is with current kernels. The changes are significant and possibly controversial; he warned the group (with a smile) that David Hildenbrand would merge those changes if developers didn't review them.

Specifically, the series changes the existing "migration type" concept, which is used now to separate allocations that can be moved from those that cannot. Migration types would be replaced with a "freetype", which includes additional attributes about a block of memory — including whether that block is currently present in the direct map. That would allow the removal of blocks of memory in bulk from the direct map for use in quickly satisfying __GFP_UNMAPPED allocation requests.
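
The freetype idea can be modeled as a key that combines the migration type with the direct-map state, with one free list per key. This is a toy model in Python; the class and method names are hypothetical and do not correspond to the data structures in the patch series.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class MigrateType(Enum):
    UNMOVABLE = "unmovable"
    MOVABLE = "movable"


@dataclass(frozen=True)
class FreeType:
    migrate: MigrateType
    in_direct_map: bool


class FreeLists:
    """One list of free blocks per freetype."""

    def __init__(self):
        self._lists = defaultdict(list)

    def free(self, block, freetype):
        self._lists[freetype].append(block)

    def alloc(self, migrate, unmapped=False):
        """Return a block whose direct-map state already matches, or None."""
        blocks = self._lists[FreeType(migrate, not unmapped)]
        return blocks.pop() if blocks else None
```

Because blocks already removed from the direct map sit on their own list, an unmapped allocation in this model can be satisfied without changing any mappings at allocation time.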

The problem with removing memory from the direct map, though, is that the kernel can no longer access that memory (that being the point of the removal, after all), and sometimes the kernel needs to do exactly that. Zeroing pages at allocation time, implementing system calls like read(), handling copy-on-write faults, and populating guest_memfd memory are all examples of times when the kernel has a legitimate need to access memory. Jackman's answer to that problem is an in-kernel construct that he calls the "mermap"; it would allow pages to be mapped briefly into the kernel's address space so that an operation could be performed.

Mermap mappings are CPU-local; only the CPU that requests the mapping can make use of it. It is a lot like kmap_local_page(), but that function still makes mappings visible to all CPUs, which the mermap does not. Another difference is that the mermap is able to map multiple pages at a time. It also, crucially, is allowed to fail.
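
The three properties described here (CPU-local visibility, multi-page mappings, and the possibility of failure) can be captured in a toy model. The class below is purely illustrative; its name, its methods, and the fixed per-CPU slot count are assumptions and say nothing about the real interface.

```python
class Mermap:
    """Toy model: each CPU has a small private window for temporary mappings."""

    def __init__(self, slots_per_cpu):
        self.slots_per_cpu = slots_per_cpu
        self._mapped = {}  # CPU number -> set of mapped page numbers

    def map(self, cpu, pages):
        """Map several pages at once for one CPU; return False if they do not fit."""
        current = self._mapped.setdefault(cpu, set())
        if len(current) + len(pages) > self.slots_per_cpu:
            return False
        current.update(pages)
        return True

    def unmap(self, cpu, pages):
        """Remove the given pages from the CPU's private window."""
        self._mapped.get(cpu, set()).difference_update(pages)

    def can_access(self, cpu, page):
        """Only the CPU that created a mapping can use it."""
        return page in self._mapped.get(cpu, set())
```

Callers in this model must check the return value of `map()` and have a fallback, which is the practical consequence of a mapping operation that is allowed to fail.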

There are some other hazards associated with using __GFP_UNMAPPED. To improve performance, the mermap does not perform a TLB flush after an ephemeral mapping is removed; that can leave stale TLB entries around. Those entries could, conceivably, be used to access the memory after it is unmapped; they will be flushed before the address is mapped again, though, so there is no risk of getting the wrong memory contents. He is considering requiring allocator users to perform a TLB flush before freeing the pages; otherwise those pages could be reused elsewhere while the stale TLB entries remain in place. Overall, he thinks that this is not the best API, and is interested in suggestions on how to improve it.

Liam Howlett suggested hooking the mermap into the lockdep checker, which is normally concerned with detecting locking bugs, as a way of detecting code that frees ephemerally mapped pages without a corresponding TLB flush. Matthew Wilcox wondered whether the scoped resource-management primitives could be used to ensure, at compile time, that TLB flushes happen when pages are freed. The problem with that approach is that pages are allocated and freed in different scopes, so the problem does not fit that model. David Hildenbrand asked whether having the "TLB flush needed on free" status tracked with pages themselves would help; Jackman said that it would, but that would require a page flag, and those are in perennially short supply.

Jackman's final question for the group was whether the use of a GFP flag was appropriate. There is a push to move memory allocation away from GFP flags in general, so adding another one might not be welcome. In this case, all that is really needed is a way to get the "unmapped" bit into the page allocator. Hildenbrand suggested adding a new allocation context, but Jackman said that the need for unmapped memory is a property of the data to be stored therein, rather than of the context in which the kernel is running at the moment.

At that point, time ran out and the session came to a close. Jackman has posted his own summary of the session, along with a pointer to his slides.


Page editor: Joe Brockmeier


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds