The state of the page in 2025
The initial idea behind folios, Wilcox began, was to manage pages in larger blocks; the experience of the last few years shows that it works. Later on, the goal of shrinking the page structure, which represents a single page in memory, was added. Even later came objectives like enabling filesystems with block sizes larger than the page size and improving the debuggability and clarity of the memory-management subsystem. A lot of cruft has accumulated in that subsystem over the years, he said; the folio transition is an opportunity to clean some of it out.
There are two understandings of what a folio is. The first, which he
called the "Ottawa interpretation", is what he intended initially; it was,
in essence, just the head page of a compound page. Over time, though, the
conception of folios
has shifted toward the "New York interpretation", much of which is the work
of Johannes Weiner. In that view, folios are an opportunity to shrink
struct page to a single u64 memory descriptor. Progress
is being made toward that goal, but it will not be achieved this year.
Since a folio is an independent structure, it can grow as needed. The size of struct page, instead, is strictly constrained; since there is one per page, it must be as small as possible. Even though struct folio is getting larger, there will be a lot fewer of them, so the overall memory-management overhead will decrease. Getting to the point where struct page can be replaced will require quite a bit of work, still.
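The overhead argument can be made concrete with a little arithmetic. The following is a minimal sketch, not kernel code; it assumes a 64-byte struct page (typical on 64-bit systems), an 8-byte descriptor per page as in the "New York interpretation", and a hypothetical folio size and folio length passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; neither number comes from the talk. */
#define STRUCT_PAGE_SIZE 64u /* typical sizeof(struct page) on 64-bit systems */
#define MEMDESC_SIZE      8u /* one u64 memory descriptor per base page */

/* Metadata bytes today: one struct page for every base page. */
static uint64_t overhead_today(uint64_t nr_pages)
{
    return nr_pages * STRUCT_PAGE_SIZE;
}

/*
 * Metadata bytes with 8-byte descriptors plus separately allocated folios.
 * folio_size and pages_per_folio are hypothetical inputs.
 */
static uint64_t overhead_with_memdesc(uint64_t nr_pages, uint64_t folio_size,
                                      uint64_t pages_per_folio)
{
    uint64_t nr_folios = nr_pages / pages_per_folio;

    return nr_pages * MEMDESC_SIZE + nr_folios * folio_size;
}
```

With these assumed numbers, 1 GiB of memory (262,144 base pages of 4 KiB) needs 16 MiB of page structures today, against 4 MiB when every folio covers 16 pages and occupies 128 bytes. The folio structure can grow considerably before the total catches up, which is the point being made above.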
In 2025, the objective is to get to the point where struct folio is indeed a separate structure from struct page and can be allocated independently. Then, data can be removed from struct page, shrinking it, but not yet all the way.
Wilcox noted that he is getting tired of converting filesystems to folios, which is a necessary step on the path to a smaller page structure. So he is considering adding a new kernel configuration option, CONFIG_SEPARATE_FOLIO, that would compile out any code that is not yet prepared for a separate folio structure. That would allow the creation of a kernel where the separate-folio changes can be tested, even if it isn't yet a kernel that supports all of the important features (networking, say) that users might actually want.
The upcoming work involves removing all references to a number of
struct page fields, including mapping, index,
and lru. The networking subsystem's pagepool mechanism will need
to be separated out. There is also the page pointer in struct
folio, which points to the underlying page structure, and a
lot of "more interesting" casts scattered through the kernel code
that perform a similar function; those will have to be fixed. For example,
the buffer_head structure has page and folio pointers in a single
union and will need to be fixed. Another near-future change will be
adapting the slab allocator to use the separate slab structure.
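To see why a union of page and folio pointers becomes a problem, consider the simplified sketch below. The structure and member names are hypothetical stand-ins rather than the kernel's definitions; the only assumption carried over from the current kernel is that a folio begins with its head page.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct page  { unsigned long flags; };
struct folio { struct page page; }; /* today: a folio starts with its head page */

/* Hypothetical buffer-head-like structure holding both views in one slot. */
struct bh_sketch {
    union {
        struct page  *page_ptr;  /* legacy view */
        struct folio *folio_ptr; /* folio view of the same stored pointer */
    };
};

/*
 * Store a folio pointer, then read the slot back as a page pointer.
 * Returns nonzero only while the folio and its head page share an address.
 */
static int views_alias(struct folio *f)
{
    struct bh_sketch bh;

    bh.folio_ptr = f;
    return (void *)bh.page_ptr == (void *)&f->page;
}
```

The check succeeds here only because the page is the first member of the folio. Once the folio is allocated separately from its pages, the two addresses differ and reading the slot through the page view yields a pointer to the wrong kind of object.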
Wilcox then reviewed some of the goals he had covered in the 2024 update. Many of them remained incomplete; in his defense, he said, it had only been ten months since last year's conference. Given the two missing months, essentially one whole development cycle, he would have checked off more of them.
Some things were definitely accomplished, though. The zpdesc memory descriptor was added
to replace struct page use in the zsmalloc allocator. It is
currently an overlay on struct page, but Wilcox thinks it could
be made more space-efficient. The ability to
use a filesystem block size larger than the page size has been
"quite the journey", but that ability now exists for the XFS and
bcachefs filesystems. It should be easy to add to other filesystems as
well — at least, once those filesystems are able to support large folios.
Another achievement is the ability to allocate and free frozen pages, which have no reference count.
The adoption of this feature, he said, is "borrowing pain from the
future". Recently, the network stack found this pain; happily, this problem turned
up and was fixed before the 6.14 release. Wilcox would like to see more
use of frozen pages in general, but he pointed out that there will always
be some places where reference counts will be needed.
Another important step forward is imprecise mapping-count tracking for large folios, which changes the kernel to track the number of processes mapping a folio, rather than the number of mappings, once the number of processes exceeds two. This work enables precise tracking for the common cases while maintaining correctness in the more exotic cases.
It is now possible to create large folios in generic_perform_write(), which, he said, is a big deal; it can double write performance in some tests. That result wasn't surprising, he said, since using large folios frees the kernel from having to manage large numbers of base pages. Meanwhile, the bh_page pointer in buffer heads is now unused; all that remains is to actually delete it. There has also been a lot of work removing the wrapper functions around the various page flags, further reducing the role of struct page.
There is currently some use of page types, which will eventually be stored in the memory descriptor. The type is stored in the page structure, but it is overlaid by the mapcount field, so it cannot be used in mapped pages. There is some trickery being used to distinguish mapping counts (positive numbers) from the page types, which are indicated by a negative mapping count. Various types of pages, including hugetlb, slab, zsmalloc, unaccepted, and large-kmalloc pages, are identified by page types now.
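The overlay trick can be illustrated with a small sketch. It is a simplification under stated assumptions: the type codes are hypothetical, the type is assumed to live in the top byte of a 32-bit slot, and the details of how the kernel stores real mapping counts are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type codes kept in the top byte; real kernel values may differ. */
enum sketch_page_type {
    SKETCH_TYPE_SLAB    = 0xf1,
    SKETCH_TYPE_HUGETLB = 0xf2,
};

/* One 32-bit slot, read either as a signed mapping count or as a type word. */
union sketch_slot {
    int32_t  mapcount;
    uint32_t page_type;
};

static void sketch_set_type(union sketch_slot *s, enum sketch_page_type t)
{
    s->page_type = (uint32_t)t << 24;
}

/* A typed slot has its top bit set, so it reads back as a negative count. */
static int sketch_has_type(const union sketch_slot *s)
{
    return s->mapcount < 0;
}

static unsigned int sketch_get_type(const union sketch_slot *s)
{
    return s->page_type >> 24;
}
```

Because every type code sets the top bit, a typed slot can never be mistaken for a positive mapping count. The same sharing is why a page that is mapped into user space cannot carry a type: the slot is already holding its count.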
Page flags have long been in short supply; there is exactly one of them available at the moment. The PG_slab flag is gone now, as is PG_error, which turned out to just be the inverse of PG_uptodate. Those changes freed two flags, but then PG_dropbehind (which might be renamed PG_reclaim) was added. The PG_uncached flag is now an alias for PG_arch_2, while PG_mappedtodisk overlays PG_owner_2. The PG_private_2 flag is almost unused, but the Ceph and NFS filesystems still need it. PG_private needs more work before it can be removed, since it is used for a lot of different things in different places. Often it is used to indicate that there is something stored in the private field of struct page, but a test for NULL should be usable instead (some members of the group disagreed with that claim). Most of the existing page flags will eventually become folio flags, he said, while PG_hwpoison will become a page type.
Wilcox concluded with some suggestions for anybody who wants to help with the folio transition. At the top of his list was working to make more filesystems support large folios; that, he said, is good for the system as a whole. The "bitmap" MD target needs to stop using buffer heads, but he hasn't looked at MD in many years and is afraid of it. Removing any uses of the page member of struct folio and the lru page field would also be useful.
There have been many developers who have helped with this work so far, and
he thanked them all. It has been a fun project; he was looking forward, he
said in jest, to next year's Summit where he will be able to say that it is
complete.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Folios |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |
