Removing read-only transparent huge pages for the page cache
in the next few release cycles". Over six years later, that promise is still present, but it will never be fulfilled. Instead, the read-only option will soon be removed, reflecting how the core of the memory-management subsystem has changed underneath this particular feature.
The transparent huge pages (THP) feature automatically collects base pages into 2MB (on Intel processors) huge pages. Use of huge pages can be beneficial as a way of reducing memory-management overhead and (especially) the load on the CPU's translation lookaside buffer (TLB), but only if most of the memory contained within the huge pages is actually used. Initially, the THP feature only worked with anonymous memory (program data and such), leaving file-backed memory untouched.
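The arithmetic behind that trade-off can be sketched in a few lines of C. This is only an illustration: the 4KB base-page size is an assumption (the text gives only the 2MB huge-page size), and the helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: 4KB base pages (not stated in the text), 2MB huge pages. */
#define BASE_PAGE_SIZE ((size_t)4096)
#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024)

/*
 * Number of pages of the given size needed to cover `bytes`, rounding up.
 * Each page needs its own translation, so this is also the number of TLB
 * entries required to map the whole range at once.
 */
static size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Fraction of one huge page that is actually used when `used` bytes are touched. */
static double huge_page_utilization(size_t used)
{
    assert(used <= HUGE_PAGE_SIZE);
    return (double)used / (double)HUGE_PAGE_SIZE;
}
```

Under those assumptions, covering 2MB takes 512 base-page translations but only one huge-page translation. A program that touches only a quarter of the huge page still ties up the full 2MB, which is the waste the "only if most of the memory is actually used" caveat refers to.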
Huge pages offer the same advantages for file-backed memory, but implementing that support was a harder task. The page cache at that time was true to its name, in that it was focused on the caching of individual base pages; there was no huge-page awareness at that level. So, for many years, THP was limited to anonymous memory.
Liu's 2019 patch series sought to change that situation, partly at least. This series modified the khugepaged kernel thread, which is tasked with coalescing base pages into huge pages in the background, giving it the ability to do the same with file-backed pages. The page cache remained almost entirely unaware of this work happening behind its back. Even in this case, though, support was limited; since writing to a THP introduced a number of additional complications, that case was simply disallowed. Indeed, only virtual memory areas marked with VM_DENYWRITE were considered for THP merging. The only way to set that flag is to create an executable text section with execve(); simply creating a read-only mapping is not enough. As a result, this feature was limited to memory containing executable text, which is one place where it was expected to do some good. Even for text, THP merging does not happen by default; an madvise() call is needed to enable it.
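The madvise() request in question uses the Linux-specific MADV_HUGEPAGE advice value. A minimal sketch of such a call follows; the wrapper name `request_huge_pages()` is hypothetical, and the range passed in is assumed to be page-aligned and already mapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Mark [addr, addr + len) as a candidate for THP merging.
 * Returns 0 on success, or the errno value reported by madvise()
 * (for example EINVAL on a kernel built without THP support).
 * ENOSYS is returned if the headers do not define MADV_HUGEPAGE.
 */
static int request_huge_pages(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef MADV_HUGEPAGE
    if (madvise(addr, len, MADV_HUGEPAGE) != 0)
        return errno;
    return 0;
#else
    return ENOSYS;
#endif
}
```

A successful call only makes the range eligible; any merging is done later, in the background, by khugepaged. How a program finds the address range of its own text is outside the scope of this sketch.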
An interesting problem arises if some process opens a file for write access while read-only THPs have been created for that file. In that case, the kernel simply kicks all of the file's pages out of the page cache, then starts fresh using only base pages. The feature was marked "experimental" at the time, awaiting the write support that, we were promised, was just on the horizon. But that support never materialized, and the configuration variable controlling this feature, CONFIG_READ_ONLY_THP_FOR_FS, is still marked experimental. Even so, a number of distributions enable it.
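On a running system, one way to see whether any file-backed huge pages exist is the FileHugePages line in /proc/meminfo. That field name comes from the kernel's procfs documentation rather than from the discussion above. The sketch below, with a hypothetical helper name, parses meminfo-style text that the caller has already read into a buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line of the form "<field>: <n> kB" in meminfo-style text
 * and return n. Returns -1 if the field is absent or malformed.
 */
static long meminfo_field_kb(const char *text, const char *field)
{
    size_t flen;
    const char *line = text;

    assert(text != NULL && field != NULL);
    flen = strlen(field);

    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, followed by a colon. */
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            long kb;
            if (sscanf(line + flen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

A nonzero value for FileHugePages would suggest that huge pages are in use for file-backed memory, though it does not by itself say which mechanism created them.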
It is not surprising for a kernel developer to take a bit longer than expected to finish a project, but six years still seems like a fairly long time. One can speculate as to why Liu, who remains active in kernel development, never quite got around to tackling the trickiest parts of this problem, but the fact is that it never happened. Collin Fijalkovich did, though, manage to merge a tweak that allowed the creation of THPs for shared-library code as well. A global pandemic and changes of priorities may well have played into this course of events, but there was another significant change in its nascent stage at that time.
In December 2020, Matthew Wilcox introduced the folio concept; initially, a folio was just a more efficient way of handling compound pages in the memory-management subsystem, but it quickly became evident that folios were rather more widely applicable than that. Specifically, they have evolved into the kernel's way of managing compound pages of just about any size, from a single base page to truly huge pages. They have become the solution to the longstanding problem of managing memory in larger units when it is more efficient to do so, without the significant memory waste due to internal fragmentation that would come from using larger pages everywhere.
In recent years, quite a bit of effort has been put into transforming the kernel's page cache into a folio cache (even though the name remains unchanged). It is now capable of handling folios of many sizes. Among the many improvements this change has enabled is making it easier to perform large transfers to and from block devices. For years, the kernel was unable to handle filesystems with a block size larger than the system's base-page size; now that capability exists, for some filesystems at least. On some systems, the TLB can efficiently handle translations for blocks of eight or 16 pages; the page cache can now work with those blocks (often called multi-size THPs, or mTHPs).
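Folio sizes are powers of two, described in the kernel by an "order": a folio of order n spans 2^n base pages. The sketch below relates the sizes mentioned above to orders. It assumes 4KB base pages, and the helper names are invented for illustration rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_PAGE_SHIFT 12   /* assumed 4KB base pages */

/* Number of base pages in a folio of the given order. */
static size_t folio_pages(unsigned int order)
{
    assert(order < 20);      /* keep the shifts well within size_t */
    return (size_t)1 << order;
}

/* Size in bytes of a folio of the given order. */
static size_t folio_bytes(unsigned int order)
{
    return folio_pages(order) << BASE_PAGE_SHIFT;
}
```

With these assumptions, the eight- and 16-page blocks correspond to orders 3 and 4 (32KB and 64KB), while the traditional 2MB huge page is an order-9 folio.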
Evolving the page cache to naturally manage large folios seems like a better solution than cobbling together THPs behind the page cache's back, so it is not surprising that, in recent years, there has not been a lot of interest in extending the older THP work. Instead, development energy has gone into improving support for folios. So it was, in retrospect, only a matter of time before somebody came along with a plan to remove the CONFIG_READ_ONLY_THP_FOR_FS code; that task fell to Zi Yan in late March. Yan's series removes the configuration option and, instead, enables the creation of read-only THPs for pages backed by a filesystem that can handle folios up to the traditional huge-page size.
This idea is popular with the memory-management developers, who see the current implementation as a hack that has served its time. There is a small problem, though, as pointed out by Rui Wang: not all filesystems support folios of that size. In fact, few filesystems do; this support is limited to XFS and, in some configurations, ext4. For any other situation, Wang said, this change could create significant performance regressions; it should perhaps be delayed until filesystem-level support has improved further.
Wilcox, though, seems willing to pay that price:
If we leave this fallback in place, we'll never get filesystems to move forward. It's time to rip off this bandaid; they've got eight months before the next stable kernel. I've talked to them about it for years.
Memory-management developer David Hildenbrand agreed, and filesystem developer Darrick Wong seemed to agree as well. Only Wang has supported the idea of keeping this feature in place for longer.
It is unusual for developers of one subsystem to attempt to force a change elsewhere in the kernel in this way, but it is not entirely unprecedented. But, if this change goes through, it will indeed cause performance regressions for some users, most of whom are in no position to add the needed support to their filesystem and may turn out to be a bit disgruntled about having been caught in the crossfire. It seems that this outcome would be best avoided if possible. As it happens, the Linux Storage, Filesystem, Memory Management, and BPF Summit is the ideal place for all of the relevant developers to discuss a change like this; the next summit happens in early May. With luck, the outcome will be a plan that everybody involved can live with.
