
Controlling memory management with BPF

By Jonathan Corbet
May 15, 2026

Roman Gushchin began his session in the memory-management track of the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit by saying that the community has seen a lot of proposals adding BPF-based interfaces for memory management. None of them have made their way into the mainline, though. He wanted to explore the ways in which BPF might be helpful and the obstacles that have kept BPF-based solutions out so far. This session was followed by a discussion led by Shakeel Butt on what the requirements for a new, BPF-based interface for memory control groups might look like.

Obstacles to BPF integration

Existing efforts have tried to capture a number of different memory-management heuristics, Gushchin began. There have been proposals to use BPF to control out-of-memory handling, NUMA balancing, memory control groups, page-cache eviction, and more. Other interesting ideas have not yet been pursued, including readahead control, madvise(), kernel samepage merging, and guest-memory control. Readahead, in particular, is a messy set of heuristics, but it is important for performance.

There are a number of obstacles to the addition of BPF interfaces for the memory-management subsystem, he said; he would cover them from the least important to the most. The first was concerns about out-of-tree BPF programs. Kernel developers want to see production-quality code land in the mainline, but that is not how BPF is working now. There are production-quality sched_ext schedulers, for example, but they are all stubbornly out of tree. BPF maintainer Alexei Starovoitov said that "sched_ext was a mistake", in that it did not bring any production schedulers with it into the mainline. That is a hard situation to fix now, he said. It would be helpful to have a good in-tree out-of-memory handler; if nothing else, it would help developers to judge the proposed interfaces.

Including BPF programs in the kernel tree does not seem to be controversial, Gushchin said, so the real question is how far developers should go. A first step would be to just include the source for people to examine and play with. Automatic loading of included BPF programs could be a good second step, Starovoitov said; it would let people use the included BPF programs easily. Gushchin suggested that a BPF implementation of systemd-oomd would provide a good example of how that subsystem works.

Another obstacle is the current inability to attach struct ops programs to control groups. Ordinary BPF programs can be attached to control groups, but those using the struct ops interface cannot. Gushchin has an implementation for the out-of-memory handler, but sched_ext uses a different solution.

Then, there is the issue of safety and fallback; a broken BPF memory-management program could easily make the system unusable. This is the hardest issue to solve, at least from Gushchin's perspective; it is hard even to define what "safety" means in this context. Time-based fallbacks are hard to implement and ugly, he said. Memory-management actions can be wrapped into monitored kfuncs, but that leads to non-generic solutions that can hurt performance. The acceptable level of service needs to be defined; a traffic-control program that drops all packets is OK, but a sched_ext scheduler that starves half of the tasks in the system is less so. What should happen if a faulty BPF program is loaded and the system can no longer reclaim memory?
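
To make the fallback question concrete, here is a toy user-space model in Python; it is not kernel code and not something proposed in the session. It assumes a progress-based (rather than time-based) check: a wrapper monitors how much memory a policy actually frees, covers any shortfall with default behavior, and detaches the policy after repeated failures. The names MonitoredPolicy, default_reclaim(), and max_failures are invented for illustration.

```python
def default_reclaim(needed):
    """Stand-in for built-in reclaim: in this toy model it always succeeds."""
    return needed


class MonitoredPolicy:
    """Wrap a reclaim policy and fall back when it stops making progress."""

    def __init__(self, policy, max_failures=3):
        self.policy = policy              # callable: pages needed -> pages freed
        self.max_failures = max_failures  # hypothetical tolerance before detaching
        self.failures = 0
        self.detached = False

    def reclaim(self, needed):
        if self.detached:
            return default_reclaim(needed)
        try:
            freed = self.policy(needed)
        except Exception:
            freed = 0                     # a crashing policy counts as no progress
        if freed >= needed:
            self.failures = 0
            return freed
        self.failures += 1
        if self.failures >= self.max_failures:
            self.detached = True          # stop consulting the faulty policy
        # Cover the shortfall with the default behavior.
        return freed + default_reclaim(needed - freed)
```

Even this small sketch shows why the problem is hard: somebody has to decide what counts as "enough progress" and how many failures are tolerable, which is exactly the definition of acceptable service that the discussion found missing.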

There will always be concerns about performance in hot paths, which will make it hard to justify adding BPF programs in the hottest of them. The memory-management subsystem depends heavily on batching for performance, raising the question of whether BPF programs should run before or after batching is done. He suggested that batching should happen first, but that makes it impossible to control the batching itself with a BPF program.
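
The trade-off can be seen in a toy user-space model (Python, not kernel code; the function and parameter names are invented for illustration). Batches are formed before the policy callback runs, so the callback is invoked once per batch rather than once per page, but it has no say over the batch size.

```python
def process_in_batches(pages, batch_size, policy):
    """Split pages into fixed-size batches, then ask the policy about each batch.

    policy takes a list of pages and returns the pages it selects.
    Returns (selected_pages, number_of_policy_calls).
    """
    selected = []
    calls = 0
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]  # batching is already decided here
        calls += 1
        selected.extend(policy(batch))
    return selected, calls
```

With ten pages and a batch size of four, the policy runs three times instead of ten; the cost is that batch_size is fixed by the caller and is invisible to the policy.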

Finally, the most important obstacle, he said, is ABI stability; this concern had been most recently raised by David Hildenbrand on the mailing list. In person, Hildenbrand said that there was some confusion about what providing hooks for BPF programs means; are they a permanent memory-management feature? The community may not want to commit to keeping those hooks around indefinitely. That concern has led to a decision not to provide hooks for the management of transparent huge pages; nobody knows what the picture will look like in five years, he said, so it will not be possible to get the interface right.

What will happen, he said, is that memory-management developers will wake up someday and realize that some aspect of the interface should be done differently. If they act on that realization, programs will break and people will get angry. Perhaps the solution is to only commit to supporting BPF programs that are maintained in the kernel tree itself. Hildenbrand concluded by saying that he sees the value of using BPF, but worries that adding interfaces may commit the subsystem to maintaining features that it regrets in the future.

The session was out of time at this point. At the conclusion, Gushchin said that it would be important to add only the most generic sorts of BPF hooks. So, for example, a hook to assign out-of-memory scores would be a bad idea, since a future out-of-memory killer might not use them. But a hook to free some memory under pressure, perhaps by killing some processes, could be useful.
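
As a rough illustration of the "generic hook" idea, the following Python toy models a hook whose only contract is "free at least this much memory", leaving the method up to the handler. This is a sketch under assumptions, not a proposed interface: the function name, its arguments, and the largest-first kill policy are all invented here.

```python
def free_memory_under_pressure(processes, pages_needed):
    """Kill the largest processes until at least pages_needed pages are freed.

    processes maps a process name to its resident pages; it is modified in place.
    Returns (pages_freed, names_killed).
    """
    freed = 0
    killed = []
    # sorted() builds a new list, so removing entries while looping is safe.
    for name in sorted(processes, key=processes.get, reverse=True):
        if freed >= pages_needed:
            break
        freed += processes.pop(name)
        killed.append(name)
    return freed, killed
```

Nothing in the caller depends on scores or on how victims are chosen, so a different handler (or a future out-of-memory killer) could satisfy the same contract in a completely different way.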

Reimagining memory control groups

Butt followed immediately after with a discussion of how he might like to see the kernel's memory controller evolve, and how BPF might fit into that. He started by saying that the memory controller distributes memory resources hierarchically, implementing both hard and soft limits. Any given group will be allowed to use up to its hard limit when memory is plentiful, but will be squeezed back to the soft limit when memory is tight.
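
The two-limit behavior can be modeled in a few lines of Python. This is a deliberately simplified sketch: it ignores the hierarchy entirely, treats "memory is tight" as a single boolean, and uses invented names (Group, effective_limit(), pages_to_reclaim()) that do not correspond to kernel interfaces.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    hard_limit: int  # pages the group may use when memory is plentiful
    soft_limit: int  # pages the group is squeezed back to under pressure
    usage: int = 0


def effective_limit(group, memory_is_tight):
    """Return the limit that applies to the group right now."""
    return group.soft_limit if memory_is_tight else group.hard_limit


def pages_to_reclaim(group, memory_is_tight):
    """Return how many pages the group is over its current effective limit."""
    return max(0, group.usage - effective_limit(group, memory_is_tight))
```

A group with a hard limit of 1000 pages, a soft limit of 600, and 800 pages in use owes nothing while memory is plentiful, but is 200 pages over once memory becomes tight.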

The memory controller has a number of challenges, he said. The enforcement of limits is inflexible and disruptive; since it happens synchronously, it can cause unexpected stalls in latency-sensitive threads. Its interfaces have proved hard to evolve, since significant changes will break the existing ABI, which kernel developers are not allowed to do. It would be nice, he said, to have a mechanism that would make it possible to experiment with alternatives.

The goal of a new interface would be to provide capabilities to support a wide variety of use cases. One example use case was provided in his session proposal and repeated during the session itself:

Policy: "keep system-level memory utilization below 95 percent; avoid priority inversions by not throttling allocators holding locks; trim each workload's usage to its working set without regressing its relevant performance metrics; collaborate with workloads on load shedding and memory trimming decisions; and under extreme memory pressure, collaborate with the OOM killer and the central job scheduler to kill and clean up a workload."

A new memory controller, he said, would need to provide memory-use notifications that applications can act on. It needs to support background reclaim, so that memory limits can be enforced without stalling running threads. Memory-use throttling should be aware of threads that hold locks to avoid priority-inversion problems. User space needs to be able to influence throttling in other ways, with the ability to, for example, identify specific threads that should be exempt, at least to some extent. The controller should also support memory tiering, providing control over how pages are moved between tiers.

There was not much time to go into how this new interface would work; this effort, in general, appears to be in an early stage. Butt said that a new BPF callback, bpf_memcg_charge_succeed(), could be added to inform a BPF program about an increase in memory usage; that program might then respond by initiating background reclaim. Other callbacks could inform the program when a control group has reached a usage watermark (or hit a usage limit), relying on the program to provide hints on how to respond. That program might initiate some sort of reclaim, but it could also inform the application of the situation with the expectation that said application would respond by reducing its memory usage.
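
The flow described above might look something like the following toy model in Python. It is not BPF and not kernel code; only the name bpf_memcg_charge_succeed() comes from the session, and its arguments were not described, so the on_charge_succeed() signature, the ToyGroup class, and the watermark fields are all assumptions made for illustration.

```python
class ToyGroup:
    """Minimal stand-in for a memory control group."""

    def __init__(self, limit, high_watermark):
        self.limit = limit                    # hard cap on usage, in pages
        self.high_watermark = high_watermark  # usage level that triggers reclaim
        self.usage = 0
        self.reclaim_pending = False


def on_charge_succeed(group, nr_pages):
    """Hypothetical policy callback, run after a charge has been accepted."""
    if group.usage >= group.high_watermark:
        group.reclaim_pending = True          # ask for background reclaim


def charge(group, nr_pages, policy=on_charge_succeed):
    """Charge nr_pages to the group; return False if the limit would be exceeded."""
    if group.usage + nr_pages > group.limit:
        return False
    group.usage += nr_pages
    policy(group, nr_pages)                   # inform the policy of the increase
    return True


def background_reclaim(group, target):
    """Run asynchronously in the real design; here it is just called by hand."""
    if not group.reclaim_pending:
        return 0
    freed = max(0, group.usage - target)
    group.usage -= freed
    group.reclaim_pending = False
    return freed
```

The point of the structure is that charge() never blocks on reclaim: the allocating thread only sets a flag through the policy, and the work of freeing memory happens elsewhere.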

At the end, a member of the audience said that a useful feature would be the ability to introspect what types of memory an application is using; Butt answered that this feature was already being worked on.




BPF mm stability

Posted May 16, 2026 2:33 UTC (Sat) by RazeLighter777 (subscriber, #130021) [Link]

It's interesting to examine the problem of supporting stability for BPF; the docs technically do make some ABI claims, for instance that kfuncs are potentially subject to change while BPF helpers are not and are considered stable.

It's difficult to reason about verifying a program for memory management in BPF. It seems unlikely that on the first try we'd get something flexible enough for all use cases, yet that doesn't fundamentally break the "safety" provided by the verifier. So I see why this has been a thorny topic.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds