
Custom page-cache policies with BPF

By Jonathan Corbet
May 22, 2026

LSFMM+BPF
The kernel's page cache is charged with maintaining pages (or, more correctly, folios) containing copies of data from files in the filesystem; its performance has a big effect on the performance of the system as a whole. One of the key decisions the kernel must make is when to evict folios from the page cache. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, Tal Zussman ran a memory-management-track session on how the page cache could be better customized for specific workloads. It will not be much of a spoiler to say that it involves BPF.

Eviction from the page cache can be managed by either the traditional least-recently-used (LRU) algorithm or the multi-generational LRU; these subsystems were discussed in detail earlier in the Summit. Zussman started by saying that there are workloads that are not served well by either of those options. As an example, he mentioned an unnamed financial database that performs a large number of small, performance-sensitive queries; there is also a set of lower-priority analytical tasks. The database will fill the page cache while handling the high-priority work but, when an analytical scan starts up, most of the data needed to satisfy those queries is pushed out. The next such query then ends up thrashing all of that data back in. The current page cache, he said, is not flexible enough to address this problem; it is leaving performance on the table.

Vlastimil Babka asked why the kernel's access-twice heuristic does not prevent this scenario. That heuristic marks data as inactive when it is first added to the LRU lists; the second access causes it to be marked as active. Inactive data is evicted first, so this heuristic keeps single-use data from pushing out more useful pages. Zussman answered that there are often multiple scans happening at the same time, fooling that heuristic, but the real problem is that the page cache lacks awareness of how the application accesses data.
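The access-twice heuristic can be illustrated with a small user-space model. The sketch below is a toy, not the kernel's implementation: the class name, the single shared capacity, and the rule of evicting from the active list only when the inactive list is empty are all simplifying assumptions made for illustration.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy model of an inactive/active list pair (not the kernel's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()  # oldest entry first
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)   # already active: refresh recency
        elif page in self.inactive:
            del self.inactive[page]         # second access: promote
            self.active[page] = True
        else:
            self.inactive[page] = True      # first access: start out inactive
            self._reclaim()

    def _reclaim(self):
        # Evict the oldest inactive pages first; touch the active list
        # only when nothing inactive is left.
        while len(self.inactive) + len(self.active) > self.capacity:
            victims = self.inactive if self.inactive else self.active
            victims.popitem(last=False)

    def cached(self, page):
        return page in self.inactive or page in self.active


cache = TwoListLRU(capacity=4)
for page in ("hot1", "hot2", "hot1", "hot2"):  # hot pages are touched twice
    cache.access(page)
for n in range(100):                           # a single one-pass scan
    cache.access(f"scan{n}")
```

In this model, both hot pages are still cached after the scan, since the single-use scan pages only ever displace each other on the inactive list. Scan pages that are touched more than once are promoted just like the hot pages, which is one way the heuristic can lose its protective effect.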

There are three ways that this problem might be fixed, he said. One would be to change the LRU policy implemented by the kernel, which is hard to do and would probably regress other workloads. The existing hint interfaces (such as posix_fadvise()) are not able to affect eviction policy in useful ways for many workloads. The second way to solve the problem is application-level caching, which has downsides of its own, including duplicating the caching done by the page cache. Using direct I/O can address the duplication problem, but it requires reimplementing a lot of functionality that the kernel is already providing.
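For reference, this is roughly what the existing hint interface looks like from an application. The sketch uses Python's os.posix_fadvise() wrapper, which is available on Linux; the helper name read_once_and_drop() and the temporary demo file are made up for this example. The hints are advisory and apply to a range of one open file, so the application must know in advance which data it will not need again.

```python
import os
import tempfile


def read_once_and_drop(path, chunk_size=1 << 20):
    """Read a file sequentially, then hint that its cached pages are unneeded."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory only: the kernel is free to ignore these hints.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # A length of 0 means "from the offset to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Demo on a throwaway file (hypothetical data, for illustration only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3000)
try:
    demo_bytes = read_once_and_drop(tmp.name)
finally:
    os.unlink(tmp.name)
```

Nothing here lets the application say which of its cached data matters more than the rest.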

So, he said, there needs to be a way to fundamentally change the underlying policy. The sched_ext subsystem makes that possible for CPU scheduling; he is proposing a similar feature, called cache_ext, for the page cache. It would allow page-cache policies to be loaded (as a BPF program) from user space, with no kernel changes, and attached to control groups. It is implemented as a struct_ops program with callbacks to inform the program when folios are added to or removed from the page cache, and when they are accessed. Another callback requests the program to evict a number of folios from the page cache.
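The shape of such a policy can be sketched in user space. The following Python model is only an illustration of the four kinds of callbacks described above. The method names (folio_added(), folio_accessed(), folio_removed(), evict_folios()) and the ToyPageCache class are hypothetical stand-ins, not the cache_ext interface, and a real policy would be a BPF struct_ops program rather than a Python class.

```python
from collections import deque


class FifoPolicy:
    """Toy policy: evict folios in the order they were added."""

    def __init__(self):
        self.queue = deque()          # oldest folio on the left

    def folio_added(self, folio):
        self.queue.append(folio)

    def folio_accessed(self, folio):
        pass                          # FIFO ignores accesses

    def folio_removed(self, folio):
        # The folio left the cache for some other reason (truncation, say).
        if folio in self.queue:
            self.queue.remove(folio)

    def evict_folios(self, count):
        victims = []
        while self.queue and len(victims) < count:
            victims.append(self.queue.popleft())
        return victims


class ToyPageCache:
    """Stand-in for the kernel side, which invokes the policy's hooks."""

    def __init__(self, policy, capacity):
        self.policy = policy
        self.capacity = capacity
        self.folios = set()

    def read(self, folio):
        if folio in self.folios:
            self.policy.folio_accessed(folio)
            return
        if len(self.folios) >= self.capacity:
            # Ask the policy which folio to give up.
            for victim in self.policy.evict_folios(1):
                self.folios.discard(victim)
        self.folios.add(folio)
        self.policy.folio_added(folio)

    def drop(self, folio):
        if folio in self.folios:
            self.folios.remove(folio)
            self.policy.folio_removed(folio)


cache = ToyPageCache(FifoPolicy(), capacity=2)
for name in ("a", "b", "a", "c"):
    cache.read(name)
```

With a capacity of two, reading "c" evicts "a" even though "a" was just accessed, because this particular policy ignores the access callback. Swapping in a different policy object changes the eviction behavior without touching ToyPageCache.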

Shakeel Butt asked whether this interface would be general enough to manage all of memory, not just the page cache; Zussman answered that he is focused on file-backed memory for now, but the interface could probably be extended. The set of hooks needed for anonymous pages would likely be different, though.

Policies, he continued, operate on eviction lists maintained by the BPF program. When a folio is added to the page cache, the program picks a list to add that folio to; at eviction time, the program chooses from whichever list it thinks is best. It is a simplistic structure, he said, but most policies can be implemented using these lists. An audience member asked if the lists could be managed in a BPF arena; Zussman said that might be possible, but that limitations within BPF make it hard.

For the financial workload described above, user space would inform the cache_ext program which applications are performing scans by storing their process IDs in a BPF map. When a folio is added to the page cache, the program checks to see if it is being added on behalf of one of those scan applications; if so, the folio is put onto a special list that is targeted first at eviction time. For this application, he said, this policy produced a 70% increase in query throughput and a big reduction in tail latency.
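The logic of that policy is simple enough to model in a few lines. The Python sketch below is a user-space illustration only: a set stands in for the BPF map of scanner process IDs, and the class, method, and list names are invented for this example.

```python
from collections import deque


class ScanAwarePolicy:
    """Toy model: folios brought in by scan tasks are evicted first."""

    def __init__(self):
        self.scan_pids = set()    # stands in for the BPF map filled by user space
        self.scan_list = deque()  # folios added on behalf of scanners
        self.main_list = deque()  # everything else

    def folio_added(self, folio, pid):
        target = self.scan_list if pid in self.scan_pids else self.main_list
        target.append(folio)

    def evict_folios(self, count):
        victims = []
        # Drain the scan list before touching the main list.
        for folios in (self.scan_list, self.main_list):
            while folios and len(victims) < count:
                victims.append(folios.popleft())
        return victims


policy = ScanAwarePolicy()
policy.scan_pids.add(4242)        # user space marks PID 4242 as a scanner
policy.folio_added("q1", pid=100)
policy.folio_added("s1", pid=4242)
policy.folio_added("q2", pid=100)
policy.folio_added("s2", pid=4242)
victims = policy.evict_folios(3)
```

Here the two scan folios are chosen first, and only then the oldest query folio, so victims ends up as ["s1", "s2", "q1"].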

David Hildenbrand asked whether there is an access callback that can recognize a high-priority task and move the relevant folios to a new list; Zussman said that could be implemented, but is not being done now. Kiryl Shutsemau said that turning to BPF is an overreaction to the problem, and suggested that the focus should be on creating better kernel interfaces instead. Improvements to posix_fadvise(), perhaps, could set the "don't cache" flag on folios that should be evicted quickly. Matthew Wilcox, though, expressed regret that this flag, which consumes a scarce page flag, had ever been added. Had cache_ext existed before, he said, that flag would never have been necessary.

Liam Howlett expressed some dismay that cache_ext uses policies attached to control groups; Zussman said that the LRU lists are already maintained for each control group, so that is a natural place to apply the policy. In this way, different groups can have different policies. Howlett said that, if the workload changes, the application is still locked into the old policy unless it is moved to a new control group; Zussman said that the policy can be changed at any time.

Babka asked whether the eviction hook runs when reclaim is done globally, or when it happens at the control-group level; the answer was the latter. Brendan Jackman asked why the policy was being applied at the page-cache level rather than, say, to the inode cache. Zussman said that there may be a place for policies at the inode level, but that would be a much higher-level policy.

Hildenbrand asked how a control group would be transitioned to a new policy; the answer is that switching is a destructive operation — the lists that had been constructed by the old policy would be lost. It is possible to export some knobs to tune policies, which could avoid the need to replace them entirely much of the time. Hildenbrand said that, at times, the memory-management subsystem will remove specific folios from the LRU lists for a while, and asked if the policies could handle that. Zussman answered that any metadata stored in BPF maps would persist in that situation, but the list information would be lost.

The last question came from Shutsemau, who asked whether the workload will be expected to tag page-cache requests somehow for the benefit of the policy. The answer, Zussman said, depends on what the goal is. If the intent is to create a generic policy that does not understand the behavior of specific applications, there will be no use for tagging. A more application-aware policy, though, may want information from the application about what it is doing.

Zussman has posted the slides from this session.

Index entries for this article
Kernel: BPF/Memory management
Kernel: Memory management/Page cache
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2026



Data structures and overhead

Posted May 25, 2026 1:12 UTC (Mon) by RazeLighter777 (subscriber, #130021)

BPF does have the LRU map types, which seem well suited for this purpose, as well as the new arena types. It will be interesting to see if there is any performance overhead for BPF page-cache programs versus the built-in LRU implementation.

Data structures and overhead

Posted May 28, 2026 8:02 UTC (Thu) by firstyear (subscriber, #89081)

> The second way to solve the problem is application-level caching, which has downsides of its own, including duplicating the caching done by the page cache.

It's like anything - it depends. But I don't think application caching inherently has as many downsides as presented. Application caching has the benefit of being application and context aware about behaviour, but also likely keeping the cache in a more friendly format in memory for the application.

It's one thing to have DB pages in the FS cache, but it's another to have a deserialised struct in your application cache ready to go. If an application relied purely on the FS cache, you would have to either deserialise those cache pages on each operation when you read them, or map your application memory structures in a way that requires binary-level compatibility with what's in the FS cache to avoid a deserialise step (think mmap writing raw C structs). This has a lot of trade-offs that application-level caching simply avoids.

But the bigger reason to avoid the FS cache is memory pressure. The moment there is memory pressure, the FS cache is going to be squeezed and evicted before anything else. This leaves you far more vulnerable to fluctuations in performance if the application is on a resource-contended host. This is commonly seen with k8s/docker, where you may have a single host with many applications. If one container relies on the FS cache, and another on an application cache, then we know which will be placed under cache-eviction pressure first.

As an interesting upside, though, if you use an application cache and there is memory pressure, then your own application cache may end up swapped out itself. As a result, you get the benefits of application caching, and the kernel is still able to swap pages in and out as needed under demand, without the risk of premature eviction that comes with relying on the FS cache. There are also helpful cgroup limits that can define how much swap you can use in this situation.

So while the FS cache is extremely valuable and useful, it has a time and place to shine, but application level caching will likely always outperform the FS cache due to awareness of the use case.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds