Prerequisites for large anonymous folios

By Jonathan Corbet
September 8, 2023

The work to add support for large anonymous folios to the kernel has been underway for some time, but this feature has not yet landed in the mainline. The author of this work, Ryan Roberts, has been trying to get a handle on what the remaining obstacles are so he can address them. On September 6, an online meeting of memory-management developers discussed that topic and made some progress; there is still some work to do, though, before large anonymous folios can go upstream.

Folios are, at their core, a simple mechanism to group physically contiguous pages into larger units that can be managed more efficiently. Over the last couple of years, many parts of the memory-management subsystem have been converted to use folios, but the management of anonymous pages (pages representing data that is not backed up by a file) has proved to be somewhat tricky.

Large anonymous folios

Roberts started the discussion with a review of the work. Whenever it handles a page fault on an anonymous virtual-memory area (VMA) that requires allocating a page of new memory, the kernel will attempt to allocate a group of pages as a folio rather than a single page. This folio will normally be between 16KB and 64KB in size. It will be mapped into the process's address space at the page-table-entry (PTE) level — there is no alternative to doing that, since these folios are smaller than the huge-page size.

The addition of large anonymous folios is driven by performance concerns, he said. Allocating and mapping an entire folio will reduce page faults, since the process will not need to fault in the other pages in that folio. A lot of per-page operations, such as the management of reference counts, become per-folio operations instead, so there are correspondingly fewer of them. Using folios reduces the length of the memory-management subsystem's least-recently-used (LRU) lists, making the scanning of that list more efficient. On some architectures (specifically 64-bit Arm), zeroing a 64KB folio is faster than zeroing individual pages.

Turning to specific use cases, Roberts said that this feature provides better performance on Android systems running with 4KB base pages. It allows the kernel to use the CPU's "contiguous-PTE" bits, which indicate that a contiguous set of PTEs maps to contiguous pages of memory; that, in turn, allows the hardware to optimize the translation lookaside buffer (TLB) entry for those pages. On a 64-bit Arm system with 4KB base pages, using the contiguous-PTE requires 16-page (64KB) contiguous mappings, which large anonymous folios can provide. Support for the contiguous-PTE feature is implemented in a separate patch set.

There is a separate use case on Arm systems running with 64KB base pages. On such systems, the PMD size (the size of the smallest huge pages) is 512MB; that much physically contiguous memory is difficult to allocate, so huge pages tend not to be used when running with 64KB pages. The contiguous-PTE size, though, is 2MB, which is much easier to manage. Large anonymous folios can be used to provide a sort of "2MB transparent huge page" feature that improves performance.

Using a larger base-page size has a similar effect to using large folios everywhere, so it is not surprising that Roberts was asked whether switching Android systems to 64KB base pages was out of the question. The answer was "yes", but that using 16KB pages might be an option; even then, though, there are problems with app compatibility and wasted memory due to internal fragmentation. That question led naturally to Roberts's next point, though, which was illustrated with benchmark results.

Roberts did a number of tests with the industry-standard kernel-compilation benchmark, and with Speedometer 2.0 as well. In both cases, enabling large anonymous folios improved performance significantly, on a scale similar to what can be achieved using 16KB base pages. The increase in memory usage, though, was tiny compared to the situation with 16KB base pages (and 64KB base pages was far worse yet). Large anonymous folios, thus, look like a way to gain many of the performance advantages provided by a larger base-page size without most of the associated cost.

The patch series is currently in its fifth revision, posted on August 10. It has had a lot of reviews, and a number of prerequisite changes have been identified. There are a few open issues that he wanted to discuss: naming, runtime controls, and which statistics to measure and report. He did not manage to get through even that short list in the time allotted, though.

Starting with naming, Roberts said that the feature should have a name that tells users what it does. "Large anonymous folio" might work, but he is not sure that it is the best option. Others he suggested were "transparent large page", "flexible transparent huge page", and "small transparent huge page". It quickly became clear, though, that Roberts would not succeed in getting the assembled developers to disagree over the name; they had other concerns instead.

How to turn it off

David Hildenbrand said that the real question to answer is the extent to which large anonymous folios should be used automatically. The biggest problem with the existing transparent huge page feature is memory waste, and the same situation will arise with this feature. If large anonymous folios are added to the kernel without a way to disable them, he said, the result will certainly be problems for some applications. The feature is reasonable, he said, but it needs a way to avoid and/or recover from over-allocation of large folios.

The problem of memory waste comes about when applications create sparse mappings — regions of memory where the actively used pages are scattered and surrounded by unused pages. If a large folio is allocated in a sparse mapping, most of it will likely consist of unused pages, which are wasted. Roberts questioned the use cases for such mappings, but it seems that they are reasonably common. Memory allocators often create them; highly threaded applications with a lot of thread stacks are another example.

Roberts said that he saw a couple of approaches for dealing with this problem, preferably in a way that lets him get a minimal patch into the kernel to better characterize how large anonymous folios affect performance. It seems that there needs to be a way to turn the feature on or off at a minimum. Even better, though, would be some sort of dynamic mechanism that would check on page usage and break apart folios that are not fully utilized. He is not sure how to do that without scanning the pages in question, though.

Hildenbrand said that, if Roberts wants large anonymous folios to be enabled by distributions, there has to be a way to toggle it on or off. The best way, of course, would be to have the kernel tune this behavior automatically and not add any new knobs at all. But some way must be found, he said, to deal with the memory-waste problem, whether it's reclaiming incorrectly allocated folios, giving user space a way to tell the kernel about sparse mappings, or some other sort of mechanism.

Roberts concluded that the kernel will have to get better at discovering and reclaiming unused pages in folios. That is not necessarily a small job, though, and he would really like to start with a minimum patch that people can play with. Hildenbrand said that Roberts should not worry too much about introducing a new toggle to just disable the feature; if the kernel gets to a point where it can tune its own behavior effectively, it can just ignore the setting of the knob.

As it turns out, Roberts already had a proposal for a control knob, called laf_order, that would be found in either debugfs or sysfs (there is some disagreement over which place is better). Setting this knob to zero disables large anonymous folios entirely, while setting it to "auto" (the proposed default) leaves the management of anonymous folios to the kernel. It can also be set to an integer indicating the allocation order to use for anonymous folios — setting laf_order=4, for example, would set the size to 16 pages, or 64KB. Yu Zhao argued that the knob needs to accept a list of orders rather than a single one, with the kernel falling back through the list if the larger orders cannot be allocated.

Matthew Wilcox, instead, pointed out that different applications will want different behavior. A system-wide laf_order knob is good for expressing what the system administrator wants, but the administrator likely does not know what the software wants. Hildenbrand, though, said that there should not be too much effort put into optimizing this feature; it should be possible to disable it when it is not useful, and to play with it otherwise. He reiterated that automatic tuning would be best; user space should not be expected to know better than the kernel when it comes to memory management.

John Hubbard said that there is a need for per-application tuning of the feature, or it will be disabled by users; David Rientjes said that it was looking like there would be a need for applications to tell the kernel which folio size to use for each VMA. Wilcox protested that, "when applications tell the kernel that they know what they are doing, they don't". Instead, the kernel should watch what an application does, ramp up the use of large anonymous folios conservatively, and track how those folios are used.

Roberts said that he had thought about a "per-VMA slow start" feature where, if page faults within a VMA form in contiguous blocks, the folio size for that VMA would be increased. Wilcox agreed that this idea was worth looking into. Roberts said that the universal toggle still felt like the first step to take, though.

Out of time

At that point, the hour was done; Roberts lamented that he had come with three topics to discuss and had never gotten past the first. He is still looking for a roadmap to get him to a bare-minimum patch that could get into the kernel. Other improvements can be made after that. There was not much agreement on what that minimum should be, though. Hildenbrand suggested a simple toggle to disable the feature; Zhao, instead, argued for a separate toggle attached to each memory control group — and that a per-VMA toggle would be better yet. Hubbard suggested a set of controls that mirrors the knobs that have been developed to control transparent huge pages.

In a sense, the meeting came to an end without having resolved much of anything. But it did shine a light on some of the problems that will have to be solved before large anonymous folios can be a part of a mainstream kernel release. The experience with transparent huge pages shows that introducing a feature before it is truly ready can cause long-term difficulties of its own, with users disabling the feature long after its problems have been solved. With luck and some patience, large anonymous folios will eventually enter the kernel without such troubles.

Index entries for this article
Kernel	Memory management/Folios

Prerequisites for large anonymous folios

Posted Sep 8, 2023 16:29 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

> Roberts said that he had thought about a "per-VMA slow start" feature where, if page faults within a VMA form in contiguous blocks, the folio size for that VMA would be increased. Wilcox agreed that this idea was worth looking into. Roberts said that the universal toggle still felt like the first step to take, though.

This could also be a task for madvise(MADV_SEQUENTIAL) and madvise(MADV_RANDOM), even though those are currently no-ops (and therefore not used a lot) on anonymous pages. But perhaps more interesting, if userspace uses MADV_POPULATE_READ and MADV_POPULATE_WRITE then the kernel can also use large pages.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 6:47 UTC (Sat) by ibukanov (subscriber, #3942) [Link] (9 responses)

What is the reason for bigger CPU pages to cause bigger memory memory usage?

In user space modern memory allocators like jemalloc get memory from the kernel in big chunks at least 1mb in size and using 16K pages, not 4K, does not affect the memory usage.

It can be a problem if the application uses a lot of native threads with small stack usage. As the stack needs at least one page with one million threads the memory usage can be problematic. But these days the trend is to use one thread per core and use event-based archive or green threads to implement multitasking.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 7:07 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (7 responses)

Fragmentation when files are accessed very sparsely. You fill them in 16k at a time instead of 4k, and waste 3/4 of your memory in the worst case.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 9:38 UTC (Sat) by ibukanov (subscriber, #3942) [Link] (6 responses)

Hm, is it because applications mmap a lot of small size? Or is the issue the increased Linux file cache usage which has page size granularity? If the latter the Linux file hash has to be fixed not to depend on that.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 11:52 UTC (Sat) by kleptog (subscriber, #1183) [Link] (3 responses)

I imagine a chunk of this is due to the fact that ELF binaries simply mmap() all the libraries they use and let the kernel fault in the pages they actually need. You also have data files like libicudata and locale files. If you're only using small parts of a library this is fairly efficient using a 4k page size, but you can imagine that with a 64k page size the overhead balloons. The alignment requirement doesn't help either.

This also suggests you could become more memory efficient by actually adjusting the position of functions in libraries. You'd need to collect a large sample of usages at a function level and use that to control that layout of functions in the resulting binaries/executables. For example, figure out which parts of libssl are actually required for TLS1.3 and arrange them all together. This doesn't buy much at 4k page size, but it helps a lot with bigger page sizes. And you'd have to do it at a distro-level to be effective.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 14:27 UTC (Sat) by walters (subscriber, #7396) [Link] (2 responses)

I think https://github.com/llvm/llvm-project/tree/main/bolt will do this right?

Prerequisites for large anonymous folios

Posted Sep 10, 2023 21:08 UTC (Sun) by kleptog (subscriber, #1183) [Link] (1 responses)

Yes, that would do it. As I suspected the actual technical part of optimising the binaries is done, the problem would be to get decent samples of all the popular applications. For example, it appear it requires specially compiled binaries to work (--emit-relocs), so you can't just ask a random group of people to run a sampling profiler in the background for a day. It also seems aimed at optimising individual binaries, whereas I think shared libraries are where a lot of the gains could be made.

But hey, all it takes is one person with enough will & skills and it might happen.

Prerequisites for large anonymous folios

Posted Sep 11, 2023 14:45 UTC (Mon) by aaupov (guest, #166901) [Link]

> Yes, that would do it. As I suspected the actual technical part of optimising the binaries is done, the problem would be to get decent samples of all the popular applications. For example, it appear it requires specially compiled binaries to work (--emit-relocs), so you can't just ask a random group of people to run a sampling profiler in the background for a day.
`--emit-relocs` is required for function reordering and only at optimization time. It's possible to collect samples from a regular distro binary (stripped, no relocs) and then use that profile to optimize a separately-built binary (not stripped, with relocs preserved).

> It also seems aimed at optimising individual binaries, whereas I think shared libraries are where a lot of the gains could be made.
BOLT can optimize shared libraries.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 14:52 UTC (Sat) by Paf (subscriber, #91811) [Link]

I don’t understand what you’re thinking the page cache does or should do? It’s mostly just physical pages today. The only way using folios can increase memory usage is via increased read in of sparse files - you just want a few bytes and instead of, say, 4K we read in 16K or 64K. That’s it. So we could get better at recognizing sparseness and switching page size down, I guess.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 18:33 UTC (Sat) by willy (subscriber, #9762) [Link]

Several of the responses in this thread have misunderstood the question.

If you build a kernel with CONFIG_PAGE_SIZE_64K, you literally cannot mmap at a smaller granularity than 64k. The hardware is configured such that each page table entry controls access to a 64kB chunk of memory. This talk of "the page cache needs to ..." is foolish. The page cache must allocate in 64k size chunks or it cannot support mmap [*].

This is why folios are superior. You can keep your 4kB mmap granularity. The kernel decides when to use 64kB (or smaller, or larger) chunks of memory to cache files. You get opportunistic use of features like CONTPTE if the conditions allow.

[*] Since the vast majority of files are never mmaped, we *could* cache files in smaller sizes, then transition to 64kB allocations if somebody calls mmap on this file. This would be a huge increase in complexity and I am far from convinced it would be worthwhile.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 13:18 UTC (Sat) by matthias (subscriber, #94967) [Link]

> In user space modern memory allocators like jemalloc get memory from the kernel in big chunks at least 1mb in size and using 16K pages, not 4K, does not affect the memory usage.

No, that is address space that is allocated in big chunks, i.e., the kernel creates a virtual memory area (or changes its size). However, the actual memory is only allocated when it is used. For every page, on first access, there will be a page fault and then the kernel sets up the mapping for that page. If user space only touches one page of the 1mb region, then only one page will ever be mapped.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 7:48 UTC (Sat) by atnot (subscriber, #124910) [Link] (3 responses)

> Even better, though, would be some sort of dynamic mechanism that would check on page usage and break apart folios that are not fully utilized. He is not sure how to do that without scanning the pages in question, though.

That does sound an awful lot like DAMON. But of course, that's Amazon's Thing, and this is Google's Thing, so that can't work...

The degree to which Amazon, Google and Facebook appear to be constantly implementing solutions to near identical memory management problems in different parts of the memory management subsystem is honestly kind of frustrating to watch...

Prerequisites for large anonymous folios

Posted Sep 9, 2023 17:16 UTC (Sat) by Wol (subscriber, #4433) [Link]

> That does sound an awful lot like DAMON. But of course, that's Amazon's Thing, and this is Google's Thing, so that can't work...

And this is different from the normal state of affairs how? I think it's been called NIH from time immemorial ...

Cheers,
Wol

Prerequisites for large anonymous folios

Posted Sep 9, 2023 17:26 UTC (Sat) by farnz (subscriber, #17727) [Link]

This is one area where maintainers have a tough job; they need to be on top of all of these things and able to tell someone at company Z that they need to explain why their thing is needed given the existence of the other mainlined thing from company Y. They also need to be able to keep track of all the things so that when someone from company X has a good thing that could be even better if integrated with other things in the kernel, they can tell the person from company X what to integrate with.

This is not an easy task, and Linux already has issues with maintainers burning out, so I'm not surprised that it doesn't happen as much as it would in a perfect world. The one advantage Linux has over past ways of doing this is that the code from all the big companies is being combined in one kernel, so in theory at least, someone can come in, take the past good ideas from the tech giants and combine them into one (smaller and better) way to fix the problems.

Prerequisites for large anonymous folios

Posted Sep 9, 2023 18:43 UTC (Sat) by willy (subscriber, #9762) [Link]

Ryan works for ARM. I work for Oracle. People from Intel, Google, SUSE, IBM, Red Hat and Facebook have been involved in this project. Hell, I was working on this when I was at Microsoft. Stop trying to make this into some Great Powers game. We're all collaborating.

I was actually looking at DAMON on Thursday, wondering if it could help us in this. I think it might, but I haven't been able to formulate the question in a coherent way yet.