
Removing read-only transparent huge pages for the page cache

By Jonathan Corbet
April 10, 2026
Things do not always go the way kernel developers think they will. When the kernel gained support for the creation of read-only transparent huge pages for the page cache in 2019, the developer of that feature, Song Liu, added a Kconfig file entry promising that support for writable huge pages would arrive "in the next few release cycles". Over six years later, that promise is still present, but it will never be fulfilled. Instead, the read-only option will soon be removed, reflecting how the core of the memory-management subsystem has changed underneath this particular feature.

The transparent huge pages (THP) feature automatically collects base pages into 2MB (on Intel processors) huge pages. Use of huge pages can be beneficial as a way of reducing memory-management overhead and (especially) the load on the CPU's translation lookaside buffer (TLB), but only if most of the memory contained within the huge pages is actually used. Initially, the THP feature only worked with anonymous memory (program data and such), leaving file-backed memory untouched.
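
To make the TLB benefit concrete, the short sketch below works through the arithmetic. It assumes a 4KB base page and a hypothetical TLB with 1,536 entries; neither number comes from the discussion above, and real hardware varies.

```python
BASE_PAGE = 4 * 1024          # assumed base-page size (4KB)
HUGE_PAGE = 2 * 1024 * 1024   # the 2MB huge-page size mentioned above


def base_pages_per_huge_page(base=BASE_PAGE, huge=HUGE_PAGE):
    """Number of base pages that a single huge page replaces."""
    return huge // base


def tlb_reach(entries, page_size):
    """Bytes of address space covered when each TLB entry maps one page."""
    return entries * page_size


entries = 1536  # hypothetical TLB size, for illustration only
print(base_pages_per_huge_page())            # 512
print(tlb_reach(entries, BASE_PAGE) >> 20)   # reach in MB with base pages
print(tlb_reach(entries, HUGE_PAGE) >> 20)   # reach in MB with huge pages
```

Under those assumptions, the same number of TLB entries covers 512 times as much memory, which is only a win if the program actually touches most of each huge page.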

Huge pages offer the same advantages for file-backed memory, but implementing that support was a harder task. The page cache at that time was true to its name, in that it was focused on the caching of individual base pages; there was no huge-page awareness at that level. So, for many years, THP was limited to anonymous memory.

Liu's 2019 patch series sought to change that situation, partly at least. This series modified the khugepaged kernel thread, which is tasked with coalescing base pages into huge pages in the background, giving it the ability to do the same with file-backed pages. The page cache remained almost entirely unaware of this work happening behind its back. Even in this case, though, support was limited; since writing to a THP introduced a number of additional complications, that case was simply disallowed. Indeed, only virtual memory areas marked with VM_DENYWRITE were considered for THP merging. The only way to set that flag is to create an executable text section with execve(); simply creating a read-only mapping is not enough. This feature was thus limited to memory containing executable text, which is one place where it was expected to do some good. Even for text, THP merging does not happen by default; an madvise() call is needed to enable it.
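
As a rough illustration of which parts of a process might be candidates for that madvise() call, the sketch below scans text in the /proc/<pid>/maps format for executable, file-backed mappings and trims each one to its 2MB-aligned interior. The helper name aligned_text_ranges() is hypothetical, and the trimming rests on the assumption that only aligned, huge-page-sized ranges can be collapsed; it does not reproduce the kernel's actual eligibility checks.

```python
import re

PMD_SIZE = 2 * 1024 * 1024  # assumed huge-page size (2MB)

# One line of /proc/<pid>/maps: "start-end perms offset dev inode [path]"
_MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+\S+\s+\S+\s+\d+\s*(.*)$"
)


def aligned_text_ranges(maps_text, pmd_size=PMD_SIZE):
    """Return (start, end, path) for the huge-page-aligned part of each
    executable, file-backed mapping found in maps_text."""
    ranges = []
    for line in maps_text.splitlines():
        m = _MAPS_LINE.match(line.strip())
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        perms, path = m.group(3), m.group(4)
        if "x" not in perms or not path.startswith("/"):
            continue  # skip non-executable and anonymous/special mappings
        lo = (start + pmd_size - 1) // pmd_size * pmd_size  # round start up
        hi = end // pmd_size * pmd_size                      # round end down
        if hi > lo:
            ranges.append((lo, hi, path))
    return ranges
```

A process could feed the contents of /proc/self/maps to this function and then pass each resulting range to madvise() with MADV_HUGEPAGE; whether the kernel then creates any huge pages depends on its configuration and on the checks described above.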

An interesting problem arises if some process opens a file for write access while read-only THPs have been created for that file. In that case, the kernel simply kicks all of the file's pages out of the page cache, then starts fresh using only base pages. The feature was marked "experimental" at the time, awaiting the write support that, we were promised, was just on the horizon. But that support never materialized, and the configuration variable controlling this feature, CONFIG_READ_ONLY_THP_FOR_FS, is still marked experimental. Even so, a number of distributions enable it.
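
One way to see whether a running system has created any file-backed huge pages is to look at the FileHugePages and FilePmdMapped lines in /proc/meminfo. The sketch below parses text in that format; the function name file_thp_counters() is hypothetical, and kernels that do not report these counters will simply yield an empty result.

```python
def file_thp_counters(meminfo_text):
    """Extract the file-backed THP counters (in kB) from /proc/meminfo text."""
    wanted = ("FileHugePages", "FilePmdMapped")
    counters = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if name in wanted and fields:
            counters[name] = int(fields[0])  # first field is the value in kB
    return counters
```

Calling file_thp_counters(open("/proc/meminfo").read()) on a live system returns whatever counters that kernel exposes; nonzero values suggest that file-backed huge pages are in use.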

It is not surprising for a kernel developer to take a bit longer than expected to finish a project, but six years still seems like a fairly long time. One can speculate as to why Liu, who remains active in kernel development, never quite got around to tackling the trickiest parts of this problem, but the fact is that it never happened, though Collin Fijalkovich did manage to merge a tweak that allowed the creation of THPs for shared-library code as well. A global pandemic and changes of priorities may well have played into this course of events, but there was another significant change in its nascent stage at that time.

In December 2020, Matthew Wilcox introduced the folio concept; initially, a folio was just a more efficient way of handling compound pages in the memory-management subsystem, but it quickly became evident that folios were rather more widely applicable than that. Specifically, they have evolved into the kernel's way of managing compound pages of just about any size, from a single base page to truly huge pages. They have become the solution to the longstanding problem of managing memory in larger units when it is more efficient to do so, without the significant memory waste due to internal fragmentation that would come from using larger pages everywhere.

In recent years, quite a bit of effort has been put into transforming the kernel's page cache into a folio cache (even though the name remains unchanged). It is now capable of handling folios of many sizes. Among the many improvements this change has enabled is making it easier to perform large transfers to and from block devices. For years, the kernel was unable to handle filesystems with a block size larger than the system's base-page size; now that capability exists, for some filesystems at least. On some systems, the TLB can efficiently handle translations for blocks of eight or 16 pages; the page cache can now work with those blocks (often called multi-size THPs, or mTHPs).
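
The relationship between these sizes is easy to lose track of, so here is a small sketch of the arithmetic. It assumes a 4KB base page and relies on folios covering a power-of-two number of base pages (their "order"); the function names are illustrative only.

```python
BASE_PAGE = 4 * 1024  # assumed base-page size (4KB)


def folio_bytes(order, base_page=BASE_PAGE):
    """Size in bytes of a folio of the given order (2**order base pages)."""
    return base_page << order


def order_for_pages(pages):
    """Order of a folio holding exactly `pages` base pages (a power of two)."""
    if pages < 1 or pages & (pages - 1):
        raise ValueError("page count must be a power of two")
    return pages.bit_length() - 1
```

With these assumptions, the eight- and 16-page blocks mentioned above are folios of order 3 and 4 (32KB and 64KB), while a traditional 2MB huge page corresponds to order 9.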

Evolving the page cache to naturally manage large folios seems like a better solution than cobbling together THPs behind the page cache's back, so it is not surprising that, in recent years, there has not been a lot of interest in extending the older THP work. Instead, development energy has gone into improving support for folios. So it was, in retrospect, only a matter of time before somebody came along with a plan to remove the CONFIG_READ_ONLY_THP_FOR_FS code; that task fell to Zi Yan in late March. Yan's series removes the configuration option and, instead, enables the creation of read-only THPs for pages backed by a filesystem that can handle folios up to the traditional huge-page size.

This idea is popular with the memory-management developers, who see the current implementation as a hack that has served its time. There is a small problem, though, as pointed out by Rui Wang: not all filesystems support folios of that size. In fact, few filesystems do; this support is limited to XFS and, in some configurations, ext4. For any other situation, Wang said, this change could create significant performance regressions; it should perhaps be delayed until filesystem-level support has improved further.

Wilcox, though, seems willing to pay that price:

If we leave this fallback in place, we'll never get filesystems to move forward. It's time to rip off this bandaid; they've got eight months before the next stable kernel. I've talked to them about it for years.

Memory-management developer David Hildenbrand agreed, and filesystem developer Darrick Wong seemed to agree as well. Only Wang has supported the idea of keeping this feature in place for longer.

It is unusual for developers of one subsystem to attempt to force a change elsewhere in the kernel in this way, though it is not entirely unprecedented. If this change goes through, however, it will indeed cause performance regressions for some users, most of whom are in no position to add the needed support to their filesystem and may turn out to be a bit disgruntled about having been caught in the crossfire. It seems that this outcome would be best avoided if possible. As it happens, the Linux Storage, Filesystem, Memory Management, and BPF Summit is the ideal place for all of the relevant developers to discuss a change like this; the next summit happens in early May. With luck, the outcome will be a plan that everybody involved can live with.




MADV_COLLAPSE support?

Posted Apr 10, 2026 14:33 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (2 responses)

Right now support for MADV_COLLAPSE seems to, at least for some file systems, be tied to CONFIG_READ_ONLY_THP_FOR_FS. In my experiments the gains for using huge pages for code can be quite large (e.g. for postgres it can be a 15% increase in read only OLTP) and can't easily be achieved in other ways (you can remap to manually allocated huge pages, but it's fragile as hell and breaks some tooling, binary layout optimization can help some).

Last time I checked, a few months ago, MADV_COLLAPSE seemed to require CONFIG_READ_ONLY_THP_FOR_FS, even for xfs, despite its support for large folios. Possible that I did something wrong or hit momentary breakage (I did find a thread reporting that one sometimes needs to retry, but that didn't help).

The other reason to care about MADV_COLLAPSE support, rather than just performance improvement, is that sometimes ending up with huge pages, sometimes not, makes for hellish benchmarking. Particularly because the older binary often ends up with more large pages than a freshly compiled one.

So: Whither MADV_COLLAPSE?

MADV_COLLAPSE support?

Posted Apr 10, 2026 14:37 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

I don't have time to dig in to verify it now but ... the patch series itself essentially replaces CONFIG_READ_ONLY_THP_FOR_FS checks with "can the filesystem do PMD-size folios?" checks. So I am guessing that MADV_COLLAPSE should work as expected.

MADV_COLLAPSE support?

Posted Apr 10, 2026 15:07 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

Ah. That'd mean it'd be more widely usable than before. That'd be great. I asked a clarification question on the thread...

Thanks for the article and response!

Writable THPs

Posted Apr 10, 2026 15:12 UTC (Fri) by willy (subscriber, #9762) [Link]

I don't think Song Liu ever intended to make filesystem-oblivious writable THPs work. The intent (at least from my side!) was always to make filesystems support arbitrary-sized "pages" ... which turned into folios when it became apparent that we needed a way to distinguish PAGE_SIZE pages from allocated chunks of memory. So I think it's fair to see "folios now exist" as being the fulfillment of "writable huge pages".

Song's patches were an important part of the process of getting folios going. It's not the order I would have done it in, but since Song had already done it, it got some of the infrastructure in place that we needed. We cooperated on this work; after the session at the 2019 LSFMM (https://lwn.net/Articles/789159/) we started a fortnightly phone call which continues to this day (although Song no longer participates).

By the way, the page cache was already aware of THPs before Song's patches. shmem supported THPs. It was really filesystems that were (and remain!) the limiting factor.

Block size

Posted Apr 13, 2026 9:28 UTC (Mon) by claudex (subscriber, #92510) [Link] (2 responses)

Filesystem block sizes are much smaller than a huge page; for example, the xfs max block size is 64KiB. So this will generate many more TLB entries than a huge page of 2MiB. Maybe this won't have any impact in practice, but it seems odd to target the block size.

Block size

Posted Apr 13, 2026 18:00 UTC (Mon) by willy (subscriber, #9762) [Link] (1 responses)

I think you're confused (the article did talk about two things in the same paragraph, rather than splitting them into separate paragraphs).

Folios can contain multiple pages. That has enabled both:

1. Support for block sizes larger than page size

2. Support for using PTEs (and thus TLB entries) which describe chunks of memory intermediate in size between a single page and a PMD size.

Block size

Posted Apr 13, 2026 18:08 UTC (Mon) by claudex (subscriber, #92510) [Link]

Thanks, I didn't understand the two use cases.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds