Better hugetlb page-table walking
"Hugetlb" refers to an early huge-page implementation in the kernel that
has often been thought of as an independent memory-management subsystem.
It works with memory that has been reserved specially; the hugetlbfs
filesystem must be used to gain access to that memory. Many applications
are better served by transparent huge pages, which require no special code,
but hugetlb users remain: hugetlb gives more reliable access to huge pages
for some applications, and it can reduce memory usage by sharing page tables
across multiple processes.
The existence of hugetlb as an independent memory-management mechanism has long grated on the development community. The 2024 Summit featured a session focused on hugetlb unification, and some progress has been made in that direction. The 2025 session limited its scope to page-table walking in particular, in the hope of getting rid of some duplicated code and special cases. Oscar Salvador posted an RFC patch set unifying the hugetlb page-walking code in July 2024, but the reviews were mixed, and that work has not proceeded further.
Since then, David Hildenbrand has proposed, in general terms, a new page-walking API that could be considered instead. (The initial proposal was made in an email, but most of the discussion about the implementation has evidently happened privately; Salvador has an implementation in his repository.) The core idea is an API that walks through a virtual-memory area (VMA) and manages locking and batching of operations, telling the caller what type of page was found at each step. This new API would make the implementation of /proc/PID/smaps much simpler, Salvador said. If the group agreed, he would like to start converting some of the /proc code over, then move on to some of the other page-table walkers in the kernel.
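To make the shape of such an interface concrete, here is a minimal user-space sketch of a walker that steps through a toy "VMA" and tells the caller what kind of entry it found, with the caller doing smaps-style accounting. Everything in it is hypothetical: the `toy_*` names, the entry kinds, and the `locked` flag that stands in for real page-table locking are assumptions made for illustration, and none of it is taken from the proposed kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry kinds; illustrative only, not kernel definitions. */
enum toy_entry_kind {
    TOY_NONE,      /* nothing mapped here */
    TOY_BASE_PAGE, /* an ordinary base page */
    TOY_HUGE_PAGE, /* a huge mapping of some sort */
};

struct toy_entry {
    enum toy_entry_kind kind;
    size_t size; /* bytes covered by this entry */
};

/* A toy stand-in for a VMA: just a flat array of entries. */
struct toy_vma {
    const struct toy_entry *entries;
    size_t nr_entries;
};

/* Walker state; "locked" only models the idea that the walker owns locking. */
struct toy_walk {
    const struct toy_vma *vma;
    size_t next;
    bool locked;
};

void toy_walk_start(struct toy_walk *w, const struct toy_vma *vma)
{
    w->vma = vma;
    w->next = 0;
    w->locked = true; /* assumed: the walker, not the caller, takes the lock */
}

/* Return the next entry, or false (dropping the "lock") when the walk ends. */
bool toy_walk_next(struct toy_walk *w, struct toy_entry *out)
{
    if (w->next >= w->vma->nr_entries) {
        w->locked = false;
        return false;
    }
    *out = w->vma->entries[w->next++];
    return true;
}

struct toy_smaps {
    size_t base_bytes;
    size_t huge_bytes;
};

/* smaps-like accounting: the caller only switches on the reported kind. */
struct toy_smaps toy_account(const struct toy_vma *vma)
{
    struct toy_smaps s = { 0, 0 };
    struct toy_walk w;
    struct toy_entry e;

    toy_walk_start(&w, vma);
    while (toy_walk_next(&w, &e)) {
        switch (e.kind) {
        case TOY_BASE_PAGE:
            s.base_bytes += e.size;
            break;
        case TOY_HUGE_PAGE:
            s.huge_bytes += e.size;
            break;
        case TOY_NONE:
            break;
        }
    }
    return s;
}
```

In this sketch the accounting code never looks at page-table levels or hugetlb special cases; it sees only a kind and a size, which is the sort of simplification the /proc code would presumably gain.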
Lorenzo Stoakes asked whether Salvador intended to replace all of the page-table walkers in the kernel with calls to the new API. Salvador said he did not intend to do that right away; there are a lot of special cases in many of those walkers, so the conversion is not always straightforward. Hildenbrand said that, for now, it is best to focus on the lower levels of the page tables.
Ryan Roberts expressed concerns that the performance of many system calls
is sensitive to small changes. Adding a page-table walker with indirect
calls could introduce an unacceptable slowdown, he said. But, as it turns
out, the proposed API is implemented as an iterator with no indirect calls,
so that should not be a problem. At the end of the session, Matthew Wilcox
asked how this API will handle a copy-on-write operation in the middle of a
large allocation. For now, apparently, it does not handle that case at
all; in the future, it will be able to return ranges of compatible
page-table entries.
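The two points raised here, an iterator with no indirect calls and ranges of compatible page-table entries, can be illustrated with a small user-space sketch. The `toy_*` names are hypothetical, and the definition of "compatible" used below (consecutive page-frame numbers with the same write permission) is an assumption made for the example; the session did not spell out what the real API will treat as compatible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A toy page-table entry: a page-frame number plus one permission bit. */
struct toy_pte {
    unsigned long pfn;
    bool writable;
};

/* A batch of entries handed back to the caller in one step. */
struct toy_range {
    size_t start; /* index of the first entry in the batch */
    size_t nr;    /* number of entries in the batch */
};

struct toy_batch_iter {
    const struct toy_pte *ptes;
    size_t nr_ptes;
    size_t pos;
};

/* Assumed rule: entries batch together if physically contiguous with
 * identical permissions. */
bool toy_compatible(const struct toy_pte *prev, const struct toy_pte *cur)
{
    return cur->pfn == prev->pfn + 1 && cur->writable == prev->writable;
}

/* Plain function driven by the caller's loop: no callbacks, so no
 * indirect calls are needed to visit each batch. */
bool toy_batch_next(struct toy_batch_iter *it, struct toy_range *out)
{
    size_t start = it->pos;
    size_t end;

    if (start >= it->nr_ptes)
        return false;

    end = start + 1;
    while (end < it->nr_ptes &&
           toy_compatible(&it->ptes[end - 1], &it->ptes[end]))
        end++;

    out->start = start;
    out->nr = end - start;
    it->pos = end;
    return true;
}
```

Under these assumptions, a copy-on-write fault in the middle of a large allocation replaces one entry with a writable page at an unrelated frame number, so an eight-entry run comes back as three batches: the entries before the copied page, the copied page by itself, and the entries after it.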
Index entries for this article | |
---|---|
Kernel | Memory management/Internal API |
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |