Large folios for anonymous memory
The initial Linux kernel release used 4KB pages on systems whose total memory size was measured in megabytes — and a rather small number of megabytes at that. Since then, installed-memory sizes have grown by several orders of magnitude, but the 4KB page size remains largely unchanged. So the kernel has to manage far more pages than it once did; that leads to more memory used for tracking, longer lists to scan, and more page faults to handle. In many ways, a 4KB page size is far too small for contemporary systems.
Some architectures support running with larger page sizes, and any system could emulate larger pages by clustering the existing base pages. But a larger page size has its own problem: internal fragmentation that can waste a significant amount of memory. In practice, this problem has been severe enough to keep 4KB pages around, despite their drawbacks.
One of the key advantages that folios bring is that they make the system's base page size less important; a folio can have any size (as long as it is a power of two) and kernel code will do the right thing with it. That allows different sizes to be used in different settings, as appropriate.
Anonymous folios
Roberts's large folios for anonymous memory patch set takes advantage of this flexibility to improve the management of anonymous pages — pages associated with program data and not backed by a file on disk. At its core, the change is simple; whenever the kernel is called upon to map a page of anonymous memory for a process, it tries to allocate and map a larger folio instead. By default, the code will try to allocate and map a 64KB folio, dropping down to smaller sizes if that cannot be done; there is a hook that allows architecture-specific code to specify a different default.
Since anonymous memory starts out filled with zeroes, mapping it in larger chunks is not particularly hard; there is no extra I/O that must be done. The biggest advantage for the kernel is that mapping large folios can significantly reduce the number of page faults that must be handled. If a single fault results in the mapping of a 64KB folio, that memory range can be accessed with just that one fault, rather than the 16 that would otherwise be required when mapping 4KB base pages.
Of course, it is not always possible to map a larger folio in that way. If a physically contiguous chunk of memory with a suitable size and alignment is not available, then the attempt will fail. It is also not possible to map a folio that extends beyond the virtual memory area (VMA) that contains it. If there are pages already mapped in a part of the address range that a folio would cover then, once again, that folio cannot be used. The ability to transparently drop down to smaller sizes means that the kernel can use an allocation that is suited to the conditions it finds at the time. Among other things, that helps to avoid internal fragmentation with small mappings.
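As a rough sketch of that policy — not the actual patch code — the fault path can be imagined as a loop over folio orders. Here, arch_wants_pte_order() and anon_range_is_mapped() are invented stand-ins for the architecture hook and mapping check described above; vma_alloc_folio() is the real kernel allocation interface:

    /*
     * Illustrative sketch only; arch_wants_pte_order() and
     * anon_range_is_mapped() are hypothetical names standing in for
     * the patch's actual hooks.
     */
    static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                                          unsigned long addr)
    {
        int order;

        /* Start from the preferred order: 64KB is order 4 with 4KB pages. */
        for (order = arch_wants_pte_order(); order > 0; order--) {
            unsigned long base = ALIGN_DOWN(addr, PAGE_SIZE << order);
            struct folio *folio;

            /* The folio must lie entirely within the VMA... */
            if (base < vma->vm_start ||
                base + (PAGE_SIZE << order) > vma->vm_end)
                continue;

            /* ...and no page in the range may already be mapped. */
            if (anon_range_is_mapped(vma, base, order))
                continue;

            /* The allocation itself can fail if memory is fragmented. */
            folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order,
                                    vma, base, true);
            if (folio)
                return folio;
        }

        /* Nothing larger worked; fall back to a single base page. */
        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
    }

The loop structure is what matters here: each failure (bad alignment, VMA overflow, existing mappings, or allocation failure) simply drops to the next-smaller order rather than aborting the fault.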
Benchmarks running the most important workload of all — compiling the kernel — show an approximately 5% reduction in the time needed, with a reduction in kernel time of about 40%. That, alone, suggests that this work may be a good idea, but there are more gains to be had on current hardware.
Reducing TLB pressure
Virtual-address translation is a complicated process; it involves stepping through three to five levels of page tables, perhaps incurring cache misses at each step. The CPU tries to avoid this expense whenever possible by maintaining a cache of recent translations in the translation lookaside buffer (TLB). To a surprising extent, an application's performance will be determined by how well it fits into the TLB; a lot of TLB misses will slow things considerably. Unfortunately, TLB memory is expensive, so the cache is never as big as one might like it to be.
One important technique for stretching the TLB is the use of huge pages, which can allow 2MB (or even 1GB) of memory to be covered by a single TLB entry. Huge pages are, however, huge; they can be difficult to allocate on a busy system and can create huge internal-fragmentation problems of their own. The smaller folios used by Roberts's patch are much easier to manage, but they don't provide the same TLB benefits that huge pages do.
Or, at least, that was once the case. More recent CPUs have started adding a bit to their page-table entries to indicate that a small range of pages has been placed in physically contiguous memory. The processor can use that information to collapse the TLB entries referring to those pages into a single entry; the benefit is not as large as with a full huge page, but it is also much easier to obtain. This benefit will only be enjoyed, though, if the kernel sets the "contiguous PTE" bit where the mapping is truly contiguous.
The second patch set from Roberts does exactly that, for the arm64 architecture at least. In an amazing coincidence, arm64 systems can map a contiguous range of up to 64KB — which just happens to be the default folio size set for arm64 in the first patch series — into a single TLB entry. With this series applied, contiguous ranges of pages are detected automatically, and the appropriate page-table bits will be set. That results in another 2% gain for the kernel-compilation benchmark.
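In rough terms — again, this is an illustrative sketch rather than the patch itself, and set_cont_range() is an invented name, though PTE_CONT is the real arm64 page-table bit — the operation amounts to:

    /*
     * Sketch: mark a naturally aligned run of 16 PTEs (16 * 4KB = 64KB
     * with a 4KB granule) as contiguous.  This hint may only be set
     * when all 16 entries really do map physically contiguous memory
     * with identical attributes; otherwise the TLB could cache bad
     * translations.
     */
    #define CONT_PTES   16

    static void set_cont_range(pte_t *ptep)
    {
        int i;

        for (i = 0; i < CONT_PTES; i++, ptep++)
            WRITE_ONCE(*ptep, __pte(pte_val(*ptep) | PTE_CONT));
    }

The real work in the series lies in detecting when a run of PTEs qualifies and in clearing the hint again when any entry in the run changes.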
Discussion
These gains will only happen if this code is merged into the mainline kernel, though. That seems likely to happen, but there will be some changes needed first. For example, Yu Zhao has complained about the architecture-specific function to set the default folio size. That function takes the faulting VMA as a parameter; Zhao feels that the result is a mixture of architecture-specific decision making with policy that should be managed by the core memory-management code. Roberts has indicated that he is willing to change that interface.
Zhao also dislikes the practice of trying intermediate sizes if the desired folio size cannot be used. The work, he said, would "be a lot easier to sell" if it fell back immediately to the base-page size. As was explained in the anonymous-folio cover letter, Zhao has recommended this change in the past, and Roberts tried it; the result was worse performance on some benchmarks, so Roberts seems less willing to give ground on this point. When asked, Zhao gave three reasons for his dislike of the intermediate fallback, with the most significant being that it may cause system-wide fragmentation:
The third one is why I really don't like this best-fit policy. By falling back to smaller orders, we can waste a limited number of physically contiguous pages on wrong vmas (small vmas only), leading to failures to serve large vmas which otherwise would have a higher overall ROI.
A possible compromise would be to attempt a single fallback to PAGE_ALLOC_COSTLY_ORDER — order 3, or 32KB with 4KB base pages — before giving up and mapping base pages. In other words, this policy would avoid allocating relatively small (but still larger than single-page) folios that might break up larger, physically contiguous ranges of memory.
Another concern is that this work — and the benchmarking that comes with it — is all specific to the arm64 architecture. Support for physically contiguous page-table entries is coming to x86 processors as well, so this feature will eventually need to work beyond arm64. That suggests that a favorable review from the x86 community will be a necessary precondition to getting this work merged. Intel developer Yin Fengwei has been participating in the discussion and has indicated that some, but not all, of the patches seem ready.
The biggest stumbling block, though, may be that large anonymous folios are not yet fully integrated into the rest of the memory-management subsystem. As mentioned in one changelog in the series:
The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which defaults to disabled for now; there is a long list of todos to make FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some madvise ops, etc). These items will be tackled in subsequent patches.
Roberts has posted a more detailed list of things that need to be fixed and indicated that he would prefer to merge the feature, disabled by default, and deal with the remaining problems afterward. But, as Matthew Wilcox pointed out, there will be little desire to merge a patch set that still has that kind of outstanding issue, so these problems will almost certainly need to be worked out before this feature can be considered ready for the mainline.
This work suggests that the debate over whether the kernel's page size should be increased is over; the answer is to use the size that works best in each situation rather than using a single page size everywhere. The folio work has given the kernel some of the flexibility needed to adopt a policy like that. There is a gap, though, between the ability to implement such a feature and creating a feature that can be deployed in production kernels. Future kernels will almost certainly be capable of mapping variably sized anonymous folios, but getting to that point may take a while yet.
Posted Jul 6, 2023 17:57 UTC (Thu)
by dankamongmen (subscriber, #35141)
[Link] (4 responses)
in that case, do you even need a distinct set of huge TLB entries anymore? they used to be very limited (they're not so much anymore), and i worried about capacity evictions as a result. overall, it's been a hard road for the userspace programmer who has hoped to make use of huge pages.
Posted Jul 6, 2023 17:58 UTC (Thu)
by dankamongmen (subscriber, #35141)
[Link]
Posted Jul 6, 2023 19:06 UTC (Thu)
by joib (subscriber, #8541)
[Link] (1 responses)
Posted Jul 6, 2023 21:17 UTC (Thu)
by pm215 (subscriber, #98099)
[Link]
Incidentally, the contiguous bit is only a hint -- a CPU implementation is allowed to ignore it entirely and just do 4K TLB entries, so those other 15 page table entries all still have to be there in memory and all consistently say the same thing. (This is unlike a real huge page, which doesn't need as much memory for the page table itself, and which the hardware mandatorily has to support.)
Posted Jul 20, 2023 0:10 UTC (Thu)
by kijiki (subscriber, #34691)
[Link]
The bit is copied from the PTE when the TLB is loaded.
Posted Jul 7, 2023 10:23 UTC (Fri)
by james (subscriber, #1325)
[Link] (4 responses)
I understand the hardware support has been there on AMD chips since the original Zen core: it's just done transparently by the hardware noticing the memory is allocated that way. See, for example, Wikichip:

Like the Zen/Zen+ microarchitecture, Zen 2 supports page table entry (PTE) coalescing. When the table walker loads a PTE, which occupies 8 bytes in the x86-64 architecture, from memory it also examines the other PTEs in the same 64-byte cache line. If a 16-Kbyte aligned block of four consecutive 4-Kbyte pages are also consecutive and 16-Kbyte aligned in physical address space and have identical page attributes, they are stored into a single TLB entry greatly improving the efficiency of this cache. This is only done when the processor is operating in long mode.

and Dr Ian Cutress at Anandtech, quoting an AMD slide:

PTE Coalescing: Combines 4K page tables into 32K page size

Unfortunately (as is obvious from those quotes) it seems to be difficult to authoritatively pin down precisely what does happen.
Posted Jul 8, 2023 8:13 UTC (Sat)
by yuzhao@google.com (guest, #132005)
[Link] (2 responses)
The contiguous block entries (16*4KB) on (older) Armv8-A CPUs require the OS to set the contiguous bit in PTEs, hence not transparent. The (newer) ARMv8.2-A supports Hardware Page Aggregation (HPA), which like the AMD PTE coalescing is also transparent but at a smaller granularity (4*4KB).
Intel at the moment doesn't provide either.
Posted Jul 13, 2023 7:37 UTC (Thu)
by ikitayama (subscriber, #51589)
[Link] (1 responses)
Posted Jul 13, 2023 20:21 UTC (Thu)
by yuzhao@google.com (guest, #132005)
[Link]

Then 4*64KB.
Posted Oct 3, 2023 5:43 UTC (Tue)
by nim (subscriber, #102653)
[Link]
I think this should read "If a 32-Kbyte aligned block of eight consecutive 4-Kbyte pages are also consecutive and 32-Kbyte aligned in physical address space...", since a single 64-byte cache line contains 8 PTEs.
For best results one should probably allocate from the OS with at least 32KB granularity. I got here googling for whether Linux will make some effort (possibly for some other reason) to allocate consecutive and aligned pages when asked for larger blocks. I'm not sure of the answer yet.