
The state of the page in 2024

By Jonathan Corbet
May 15, 2024

LSFMM+BPF
The advent of the folio structure to describe groups of pages has been one of the most fundamental transformations within the kernel in recent years. Since the folio transition affects many subsystems, it is fitting that the subject was covered at the beginning of the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit in a joint session of the storage, filesystem, and memory-management tracks. Matthew Wilcox used the session to review the work that has been done in this area and to discuss what comes next.

The first step of this transition, he began, was moving much of the information traditionally stored in the kernel's page structure into folios instead, then converting users of struct page to use the new structure. The initial goal was to provide a type-safe representation for compound pages, but the scope has expanded greatly since then. That has led to a bit of ambiguity: what, exactly, is a folio in current kernels? For now, a folio is still defined as "the first page of a compound page".
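To make the type-safety point concrete, here is a minimal sketch; my_page_folio() and my_hold_folio() are hypothetical names, but compound_head() and folio_get() are real helpers, and the cast mirrors the way the kernel lays struct folio over the head page:

    #include <linux/mm.h>

    /*
     * The real page_folio() is a macro; this hypothetical wrapper only
     * illustrates the relationship: a folio is "the first page of a
     * compound page", so every constituent page resolves to the same folio.
     */
    static inline struct folio *my_page_folio(struct page *page)
    {
            return (struct folio *)compound_head(page);
    }

    /*
     * A function that takes a struct folio, like this hypothetical one,
     * cannot accidentally be handed a tail page; the type system enforces it.
     */
    static inline void my_hold_folio(struct folio *folio)
    {
            folio_get(folio);   /* real helper: take a reference on the folio */
    }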

[Matthew Wilcox]
By the end of the next phase, the plan is for struct page to shrink down to a single, eight-byte memory descriptor, the bottom few bits of which encode the type of the page being described. The descriptor itself will be specific to the page type; slab pages will have different descriptors than anonymous folios or pages full of page-table entries, for example.
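None of the descriptor code exists upstream yet; the names below are hypothetical, but they sketch the planned encoding, in which the low bits of the eight-byte word select the page type and the remaining bits locate the type-specific descriptor:

    /* Hypothetical sketch of the planned per-page memory descriptor. */
    typedef unsigned long memdesc_t;        /* eight bytes per physical page */

    enum memdesc_type {                     /* assumed encoding, for illustration */
            MEMDESC_FOLIO   = 0,
            MEMDESC_SLAB    = 1,
            MEMDESC_PTDESC  = 2,
            /* ... */
    };

    static inline enum memdesc_type memdesc_get_type(memdesc_t desc)
    {
            return (enum memdesc_type)(desc & 0xf); /* bottom bits give the page type */
    }

    static inline void *memdesc_payload(memdesc_t desc)
    {
            return (void *)(desc & ~0xfUL);         /* the type-specific descriptor */
    }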

Among other motivations, a key objective behind the move to descriptors is reducing the size of the memory map — the large array of page structures describing every physical page in the system. Currently, the memory-map overhead, at 1.6% of the memory it describes, is too high. On systems where virtualization is used, the memory map is duplicated in each guest, doubling the memory it consumes. Moving to descriptors reduces that overhead to 0.2% of memory, which can save multiple gigabytes on larger systems.
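Those percentages fall directly out of the structure sizes: with 4KB pages and today's 64-byte struct page, the memory map costs 64/4096, or roughly 1.6%, of the memory it describes; an eight-byte descriptor costs 8/4096, or about 0.2%.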

Getting there, though, requires moving more information into the folio structure. Along the way, concepts like the pin count for a page can be clarified, cleaning up some longstanding problems in the memory-management subsystem. This move will, naturally, increase the size of the folio structure, to the point where it will be larger than struct page. The advantage, though, is that only one folio structure is needed for all of the base pages that make up the folio. For two-page folios, the total memory use is about the same; for folios of four pages or more, the usage is reduced. If the kernel is caching the contents of a 1GB file, it currently needs 16MB of page structures. If that caching is done entirely with base pages, that overhead will increase to 23MB in the future. But, if four-page folios are used instead, it drops to 9MB total.
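(The arithmetic, which Wilcox spells out further in the comments below: 1GB of 4KB pages is 262,144 pages, and 262,144 × 64 bytes gives today's 16MB of page structures. With an eight-byte descriptor plus an estimated 80-byte order-0 folio, that becomes 262,144 × 88 bytes, or about 23MB. With four-page folios, an estimated 112-byte folio plus four eight-byte descriptors is 144 bytes per 16KB, or roughly 9MB.)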

Some types of descriptors, including those for slab pages and page-table entries, have already been introduced. The page-table descriptors are quite a bit smaller than folios, since there are a number of fields that are not needed. For example, these pages cannot be mapped into user space, so there is no need for a mapping count.
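As a rough sketch of why such a descriptor can stay small (the field names below are illustrative, not the exact layout of the kernel's struct ptdesc):

    /*
     * Simplified, illustrative page-table descriptor: a lock for the table,
     * a reference count, and a pointer back to the owning mm, but no mapping
     * count, since page-table pages are never mapped into user space.
     */
    struct ptdesc_sketch {
            unsigned long    flags;         /* page-type and state bits */
            spinlock_t       ptl;           /* split page-table lock */
            struct mm_struct *mm;           /* address space this table serves */
            atomic_t         refcount;      /* references to this table page */
    };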

Wilcox put up a plot of how many times struct page and struct folio have been mentioned in the kernel source since 2021. On the order of 30% of the page mentions have gone away over that time. He emphasized that the end goal is not to get rid of struct page entirely; it will always have its uses. Pages are, for example, the granularity with which memory is mapped into user space.

Since last year's update, quite a lot of work has happened within the memory-management subsystem. Many kernel subsystems have been converted to folios. There is also now a reliable way to determine whether a folio is part of hugetlbfs, the absence of which turned out to be a bit of a surprising problem. The adoption of large anonymous folios has been a welcome improvement.

The virtual filesystem layer has also seen a lot of folio-related work. The sendpage() callback has been removed in favor of a better API. The fs-verity subsystem now supports large folios. The conversion of the buffer cache is proceeding, but has run into a surprise: Wilcox had proceeded with the assumption that buffer heads are always attached to folios, but it turns out that the ext4 filesystem allocates slab memory and attaches that instead. That usage isn't wrong, Wilcox said, but he is "about to make it wrong" and does not want to introduce bugs in the process.

Avoiding problems will require leaving some information in struct page that might have otherwise come out. In general, he said, he would not have taken this direction with buffer heads had he known where it would lead, but he does not want to back it out now. All is well for now, he said; the ext4 code is careful not to call any functions on non-folio-backed buffer heads that might bring the system down. But there is nothing preventing that from happening in the future, and that is a bit frightening.

The virtual filesystem layer is now allocating and using large folios through the entire write path; this has led to a large performance improvement. Wilcox has also added an internal function, folio_end_read(), that he seemed rather proud of. It sets the up-to-date bit, clears the lock bit, checks for others waiting on the folio, and serves as a memory barrier — all with a single instruction on x86 systems. Various other helpers have been added and callbacks updated. There is also a new writeback iterator that replaces the old callback-based interface; among other things, this helps to recover some of the performance that was taken away by Spectre mitigations.
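As a rough illustration of where folio_end_read() fits, here is a sketch of a read-completion handler; my_fs_read_done() and its one-folio-per-bio assumption are hypothetical, but the helpers it calls are real:

    #include <linux/bio.h>
    #include <linux/pagemap.h>

    /* Hypothetical read-completion handler showing folio_end_read(). */
    static void my_fs_read_done(struct bio *bio)
    {
            /* Assumes this filesystem submits exactly one folio per bio. */
            struct folio *folio = page_folio(bio_first_page_all(bio));
            bool success = !bio->bi_status;

            /*
             * Replaces the old folio_mark_uptodate() + folio_unlock() pair:
             * sets the up-to-date bit (on success), clears the lock bit, and
             * wakes any waiters, with the required memory ordering.
             */
            folio_end_read(folio, success);
            bio_put(bio);
    }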

[Filesystem folio conversion table]
With regard to individual filesystems, many have been converted to folios over the last year. Filesystems as a whole are being moved away from the writepage() API; it was seen as harmful, so no folio version was created. The bcachefs filesystem can now handle large folios — something that almost no other filesystem can do. The old NTFS filesystem was removed rather than being converted. The "netfs" layer has been created to support network filesystems. Wilcox put up a chart showing the status of many filesystems; for most of them, a lot of work remains to be done. "XFS is green", he told the assembled developers, "your filesystem could be green too".

The next step for folios is to move the mapping and index fields out of struct page. These fields could create trouble in the filesystems that do not yet support large folios, which is almost all of them. Rather than risk introducing bugs when those filesystems are converted, it is better to get those fields out of the way now. A number of page flags are also being moved; flags like PageDirty and PageReferenced refer to the folio as a whole rather than to individual pages within it, and thus should be kept there. There are plans to replace the write_begin() and write_end() address-space operations, which still use bare pages.
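In code, that shift shows up as filesystems testing and setting these flags through the folio helpers rather than the old per-page macros; a small sketch (my_fs_touch_folio() is hypothetical, the folio helpers are real):

    #include <linux/mm.h>

    /* Hypothetical helper showing folio-level flag usage. */
    static void my_fs_touch_folio(struct folio *folio)
    {
            /* Dirtiness and reference state describe the whole folio... */
            if (!folio_test_dirty(folio))
                    folio_mark_dirty(folio);

            /* ...so they are tested and set once per folio, not per page. */
            folio_set_referenced(folio);
    }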

Beyond that, there is still the task of converting a lot of filesystems, many of which are "pseudo-maintained" at best. The hugetlbfs subsystem needs to be modernized. The shmem and tmpfs in-memory filesystems should be enhanced to use intermediate-size large folios. There is also a desire to eliminate all higher-order memory allocations that do not use compound pages, and thus cannot be immediately changed over to folios; the crypto layer has a lot of those allocations.

Then, there is the "phyr" concept. A phyr is meant to refer to a physical range of pages, and is "what needs to happen to the block layer". That will allow block I/O operations to work directly on physical pages, eliminating the need for the memory map to cover all of physical memory.
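No phyr code has been merged yet; as a purely hypothetical sketch, a phyr would be little more than a physical address and a length, letting block I/O be described without reference to struct page:

    #include <linux/types.h>

    /* Hypothetical sketch only; no phyr structure exists upstream today. */
    struct phyr {
            phys_addr_t     addr;   /* start of the physical range */
            unsigned int    len;    /* length of the range, in bytes */
    };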

It seems that there will be a need for a khugepaged kernel thread that will collapse mid-size folios into larger ones. Other types of memory need to have special-purpose memory descriptors created for them. Then there is the KMSAN kernel-memory sanitizer, which hasn't really even been thought about. KMSAN adds its own special bits to struct page, a usage that will need to be rethought for the folio-based future.

An important task is adding large-folio support to more filesystems. In the conversions that Wilcox has done, he has avoided adding that support except in the case of XFS. It is not an easy job and needs expertise in the specific filesystem type. But, as the overhead for single-page folios grows, the need to use larger folios will grow with it. Large folios also help to reduce the size of the memory-management subsystem's LRU list, making reclaim more efficient.
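The opt-in itself is a single call when an inode's mapping is set up; the sketch below uses a hypothetical my_fs_setup_inode(), but mapping_set_large_folios() is the real helper that XFS and the other converted filesystems call. The hard part is everything the call implies: every path in the filesystem that touches the page cache must cope with folios larger than a single page.

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    static const struct address_space_operations my_fs_aops;   /* hypothetical */

    /* Hypothetical inode-setup path opting in to large folios. */
    static void my_fs_setup_inode(struct inode *inode)
    {
            inode->i_mapping->a_ops = &my_fs_aops;

            /*
             * Tell the page cache that this filesystem's read, write, and
             * writeback paths can handle folios larger than a single page.
             */
            mapping_set_large_folios(inode->i_mapping);
    }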

Ted Ts'o asked how important this conversion is for little-used filesystems; does VFAT need to be converted? Wilcox answered that it should be done for any filesystem where somebody cares about performance. Dave Chinner added that any filesystem that works on an NVMe solid-state device will need large folios to perform well. Wilcox closed by saying that switching to large folios makes compiling the kernel 5% faster, and is also needed to support other desired features, so the developers in the room should want to do the conversion sooner rather than later.

Index entries for this article
Kernel: Memory management/Folios
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024



The state of the page in 2024

Posted May 15, 2024 15:05 UTC (Wed) by Paf (subscriber, #91811) [Link]

I’d be curious to hear about the size of the single threaded read/write performance improvements folks have observed on full conversion to large folios - both for buffered and direct IO. I’d expect it to be significant in both cases but I’m not sure how much.

The state of the page in 2024

Posted May 15, 2024 17:39 UTC (Wed) by WolfWings (subscriber, #56790) [Link] (7 responses)

I'm somehow amused they'd ask about VFAT for "do we need to convert this?" considering it's just about the most-used filesystem on any remotely modern system running Linux since it underpins UEFI support.

I get it, it's not performance critical so the 'benefits' are a wash, but calling it "little used" seems like they forgot for a moment how it got a second lease on life in recent years.

The state of the page in 2024

Posted May 15, 2024 22:34 UTC (Wed) by Kamilion (guest, #42576) [Link] (2 responses)

Eh? I thought those were supposed to be of type exfat in practice now? It's been a while, but so far as I recall, vfat (the driver) handles fat12 and fat16, and I'm not sure which of the two is used for fat32...

The state of the page in 2024

Posted May 15, 2024 22:45 UTC (Wed) by dezgeg (subscriber, #92243) [Link] (1 responses)

No, fat32 is handled by the vfat driver as well. exfat is a different, non-compatible beast altogether that was, for a long time, not well supported in open-source Linux (nor in UEFI implementations) due to patents.

The state of the page in 2024

Posted May 16, 2024 0:23 UTC (Thu) by WolfWings (subscriber, #56790) [Link]

Yup, it's a common oversight.

I'm unaware of any UEFI firmware on x86 at least that supports exFAT, broadly speaking it's only ever been supported for 32GB and larger volumes because it started as literally the "64GB and larger flash card" option, and the UEFI "ESP" partition is rarely more than 500MB.

In theory it can be used for that, but it'd be using unusually small cluster sizes which are almost never seen in the wild so vendors never test against them because they're not required to be supported.

The state of the page in 2024

Posted May 16, 2024 4:08 UTC (Thu) by willy (subscriber, #9762) [Link]

Funnily, VFAT isn't on my list. That's because VFAT and MSDOS both use the same address_space_operations, so I just counted it as one (FAT).

I think it would make a lovely project for someone to convert the FAT FS from buffer_heads to iomap. That way you'd get large folio support basically for free.

It's probably even a useful thing to do. The storage vendors are threatening us with 16KiB sector sizes and while the block layer should support this nicely, I will have a bit more confidence in the iomap layer. I hope to finish reviewing the bs>PS patches tomorrow.

The state of the page in 2024

Posted May 16, 2024 7:45 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

I think the other thing about vFAT is the entire UEFI filesystem would probably fit into a single large page (I might be exaggerating slightly here, but only just :-) so even if it were performance critical there might be pretty much no benefits to be had ...

Cheers,
Wol

The state of the page in 2024

Posted May 16, 2024 18:30 UTC (Thu) by jem (subscriber, #24231) [Link] (1 responses)

My UEFI System Partition contains 160 megs of stuff, and no cruft. But of course, if the single page is large enough...

The state of the page in 2024

Posted May 16, 2024 18:44 UTC (Thu) by farnz (subscriber, #17727) [Link]

My laptop uses 1 GiB pages for most of the direct map of physical memory. Your UEFI System Partition would fit in one page :-)

The state of the page in 2024

Posted May 19, 2024 17:24 UTC (Sun) by rharjani (subscriber, #87278) [Link] (1 responses)

>> If the kernel is caching the contents of a 1GB file, it currently needs 60MB of page structures. If that caching is done entirely with base pages, that overhead will increase by 23MB in the future. But, if four-page folios are used instead, it drops to 9MB total.

I didn't get the calculation here. If a 1GB file uses struct page to cache the file contents today, isn't it 16MB in total (1G / 64 bytes), since the size of a page structure is 64 bytes? So where does the 60MB come from?

Continuing to the next line... I think the "+23MB" assumes that the size of struct folio will grow (though it isn't clear to me exactly how). So "+23MB" will depend on how much struct folio grows; it looks like that piece of information is missing?
And that information about the increase in the size of struct folio may also clarify how order-2 (four-page) folios drop the usage to 9MB total.

It would be helpful to get answers to the above questions, please. Or maybe I will wait for the YouTube video to come out and try to figure it out.

The state of the page in 2024

Posted May 19, 2024 18:50 UTC (Sun) by willy (subscriber, #9762) [Link]

My slides say 16MB. I should put those on a webpage; our humble editor has a copy in his inbox, but I guess I spoke unclearly, and the notes weren't checked against the slides for that number.

To re-run the calculation, 1GiB/4KiB * 64 is 16MiB. 1GiB/4KiB * 88 is 23MiB. 1GiB/16KiB * 144 is 9.2MiB.

To explain those numbers: we need 8 bytes per page for the memdesc. An order-0 folio is estimated to be 80 bytes (it needs to include the pfn and to be a multiple of 16 bytes). A higher-order folio will probably be 112 bytes, for the extra mapcount information and the deferred-split list. Plus four 8-byte memdescs, that gets us to 144 bytes.

Hope that's helpful. Let me know if I skipped a necessary step.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds