Optimizing single-owner memory

By Jonathan Corbet
May 26, 2023

The kernel's memory-management subsystem is optimized for the sharing of resources to the greatest extent possible. But, as Pasha Tatashin pointed out during a memory-management session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, a lot of memory has a single owner and will never be shared. He presented some ideas for optimizing the management of that memory to a somewhat skeptical crowd.

The problem he is trying to solve, Tatashin began, is specific to Google, but he thought that others might be experiencing it too. The memory he is talking about is anonymous memory that is never shared. A process may allocate a substantial amount of memory, then never fork, or it might have used madvise() to tell the kernel not to share a range of memory with any children. At Google, 90% of memory is never shared, since Google heavily favors the use of threads rather than independent processes.
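As a rough illustration of the madvise() case, the sketch below maps anonymous memory and then asks the kernel not to carry it into child processes. The article does not name the specific advice flag; MADV_DONTFORK is assumed here, and the helper name alloc_unshared() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of private anonymous memory and
 * mark the range so that it is not inherited by children across fork().
 * Returns NULL on failure.
 */
void *alloc_unshared(size_t len)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Assumption: MADV_DONTFORK is the advice the talk had in mind. */
    if (madvise(p, len, MADV_DONTFORK) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

A caller would release the region with munmap() using the same length when it is no longer needed.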

The kernel's memory management is not as efficient as it could be for this kind of workload, he said. About 1.6% of a system's physical memory is dedicated to page structures to manage that memory. Google's server fleet contains petabytes of memory, which is its most expensive component, so the expense of that 1.6% overhead is considerable. The page structure is there to manage all types of memory, he said; it's not needed for single-owner anonymous memory.
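The 1.6% figure can be checked with some quick arithmetic. The sketch below assumes a 64-byte struct page and 4KB base pages, which is the usual case on x86-64 but is not stated in the talk; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 64-byte struct page, 4KB base pages (typical on x86-64). */
#define STRUCT_PAGE_BYTES 64ULL
#define BASE_PAGE_BYTES   4096ULL

/*
 * Hypothetical helper: bytes of page structures needed to describe
 * `mem_bytes` of physical memory, one struct page per base page.
 */
uint64_t page_struct_overhead(uint64_t mem_bytes)
{
    assert(mem_bytes % BASE_PAGE_BYTES == 0);
    return (mem_bytes / BASE_PAGE_BYTES) * STRUCT_PAGE_BYTES;
}
```

Under those assumptions, the ratio is 64/4096, or about 1.56%: a single 1GB chunk carries 16MB of page structures, and each petabyte of memory carries 16TB.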

Eliminating the possibility of sharing for that memory would bring some advantages. He has had to debug problems where memory is falsely shared, often as the result of driver bugs. Without sharing, those bugs won't happen. Since single-owner memory is always migratable, it can always be assembled into 1GB huge pages, which helps performance.

Tatashin described a single-owner-memory driver that would have two components. The memory pool would manage 1GB chunks of memory that can come from a number of sources, including hugetlbfs, device DAX, or a CXL memory pool; this memory need not have page structures associated with it. The memory pool would take pains to separate movable allocations from those that cannot be moved. The other piece is the driver itself, which manages memory in smaller chunks and makes it available to processes. It implements a new type of virtual memory area (VMA) that is marked as being page-frame-number (PFN) mapped, so the kernel will not expect to find page structures behind it. This driver can support most memory-oriented calls like madvise().

User-space processes can then open /dev/som to allocate single-owner memory for their use. Tatashin has run into some problems with the implementation, though, including the fact that PFN-mapped VMAs are not supported everywhere in the kernel. In particular, the get_user_pages() family of functions will not work with them, making it impossible to use single-owner memory in a number of contexts.

Page aging, he said, is hard to manage since, without page structures, there is no least-recently-used (LRU) list to consult. One solution here would be to create a new, smaller variant of struct page for this purpose; he has been resisting that approach so far, but does not have a better solution. Swapping is not supported with this memory, and neither are NUMA placement or hardware poisoning.

Tatashin said he realizes that the work implementing folios and page descriptors will eventually solve many of the same problems. But, he said, this work is still expected to take some years to complete, and he would like to have a solution sooner than that. Matthew Wilcox interjected that the folio work would happen more quickly if Tatashin helped, and said that the single-owner-memory work is trying to eliminate the memory-management subsystem, but will end up reimplementing it all. This effort, he said, is likely to end up like hugetlbfs, which duplicates much memory-management functionality; he questioned whether anything would be gained in the end.

As the session wound down, others in the group expressed similar feelings. There will never be an end to the addition of features to this special-purpose allocator, so it would always be growing. John Hubbard expressed the consensus in the room when he suggested that Tatashin just work to make the core memory-management code better suit his needs.
