
Ways to reclaim unused page-table pages

By Jonathan Corbet
May 9, 2022

One of the memory-management subsystem's most important jobs is reclaiming unused (or little-used) memory so that it can be put to better use. When it comes to one of the core memory-management data structures — page tables — though, this subsystem often falls down on the job. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), David Hildenbrand led a session on the problems posed by the lack of page-table reclaim and explored options for improving the situation.

Page tables, of course, contain the mapping between virtual and physical addresses; every virtual address used by a process must be translated by the hardware using a page-table entry. These tables are arranged in a hierarchy, three to five levels deep, and can take up a fair amount of memory in their own right. On an x86 system, a 2TB mapping will require 4GB of bottom-level page tables, 8MB of the next-level (PMD) table, and smaller amounts for the higher-level tables. These tables are not swappable or movable, and sometimes get in the way of operations like hot-unplugging memory.
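The arithmetic behind those figures can be checked with a short sketch. It assumes the usual x86-64 layout of 4KB pages, 8-byte entries, and therefore 512 entries per table; the function name is invented for illustration.

```python
PAGE_SIZE = 4096                               # assumed x86-64 base page size
ENTRY_SIZE = 8                                 # assumed bytes per page-table entry
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE    # 512 entries in each table page


def page_table_bytes(mapping_bytes, levels=("PTE", "PMD", "PUD")):
    """Return {level: bytes of table pages} for a fully populated mapping."""
    sizes = {}
    units = -(-mapping_bytes // PAGE_SIZE)       # data pages to map, rounded up
    for level in levels:
        tables = -(-units // ENTRIES_PER_TABLE)  # table pages needed at this level
        sizes[level] = tables * PAGE_SIZE
        units = tables                           # the next level maps these tables
    return sizes


two_tb = 2 * 2**40
overhead = page_table_bytes(two_tb)
print({level: f"{size // 1024} KiB" for level, size in overhead.items()})
```

Under these assumptions the bottom level dominates: each level up is 512 times smaller than the one below it.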

Page tables are reclaimed by the memory-management subsystem in some cases now. For example, removing a range of address space with munmap() will clean up the associated page tables as well. But there are situations that create empty page tables that are not reclaimed. Memory allocators tend to map large ranges, for example, and to only use parts of the range at any given time. When a given sub-range is no longer needed, the allocator will use madvise() to release the pages, but the associated page tables will not be freed. Large mapped files can also lead to empty page tables. These are all legitimate use cases, and they can be problematic enough; it is also easy to write a malicious program that fills memory with empty page tables and brings the system to a halt, which is rather less legitimate.
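The allocator pattern can be observed from user space with a rough sketch like the following. It assumes a Linux system with Python 3.8 or later, and it uses the VmPTE field of /proc/self/status as the measure of page-table memory; the helper names and the 256MB mapping size are arbitrary choices for illustration.

```python
import mmap

PAGE_SIZE = 4096  # assumed base page size


def parse_vmpte_kb(status_text):
    """Extract the VmPTE value (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmPTE:"):
            return int(line.split()[1])
    raise ValueError("no VmPTE line found")


def current_vmpte_kb():
    """Read this process's page-table usage (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_vmpte_kb(f.read())


def demo(length=256 * 2**20):
    """Populate an anonymous mapping, release its pages, and sample VmPTE."""
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    before = current_vmpte_kb()
    for offset in range(0, length, PAGE_SIZE):
        region[offset] = 1                  # fault in one data page and its entry
    populated = current_vmpte_kb()
    region.madvise(mmap.MADV_DONTNEED)      # hand the data pages back to the kernel
    released = current_vmpte_kb()
    region.close()
    return before, populated, released
```

Calling demo() returns three VmPTE samples in kB. If the behavior described above holds on the running kernel, the third value should stay close to the second even though the data pages are gone.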

In all of these cases, many of the resulting page tables are empty and of no real use; they could be reclaimed without harm. One step in that direction, Hildenbrand said, is a patch set from Qi Zheng. It adds a reference count (pte_ref) to each page-table page; when a given page's reference count drops to zero, that page can be reclaimed.
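The reference-counting idea can be illustrated with a toy model. This is not the kernel patch: the class below is hypothetical, and it simply counts populated entries, whereas the real counting rules are not described in the session report. Only the name pte_ref comes from the article.

```python
class PageTablePage:
    """Toy model of one page-table page with a count of populated entries."""

    def __init__(self):
        self.entries = {}   # entry index -> mapped page frame number
        self.pte_ref = 0    # name from the article; these semantics are assumed

    def map(self, index, pfn):
        """Install a mapping, counting it only if the slot was empty."""
        if index not in self.entries:
            self.pte_ref += 1
        self.entries[index] = pfn

    def unmap(self, index):
        """Remove one entry; return True once the table page is reclaimable."""
        if index in self.entries:
            del self.entries[index]
            self.pte_ref -= 1
        return self.pte_ref == 0


table = PageTablePage()
table.map(0, pfn=1000)
table.map(7, pfn=1001)
first = table.unmap(0)    # one entry is still present
second = table.unmap(7)   # the last entry is gone
print(first, second)
```

The point of the count is that the "is this page empty?" question is answered at unmap time, without scanning all 512 entries.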

Another source of empty page-table pages is mappings to the shared zero page. This is a page maintained by the kernel, containing only zeroes, which is used to initialize anonymous-memory mappings. The mapping is copy-on-write, so user space will get a new page should it ever write to a page mapped in this way. It is easy for user space to create large numbers of page-table entries pointing to the zero page, once again filling memory with unreclaimable allocations. This can be done, for example, by simply reading one byte out of every 2MB of address space mapped as anonymous memory. These page-table pages could also be easily reclaimed, though, if the kernel were able to detect them; there is no difference (as far as user space can tell) between a page-table page filled with zero-page mappings and a missing page-table page. At worst, the page-table page would have to be recreated should an address within it be referenced again from user space.
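To get a feel for the scale of the zero-page case, here is a small calculation. It assumes x86-64 with 4KB pages, where one bottom-level table page covers 2MB of address space; the function name and the 1TB figure are illustrative only.

```python
PAGE_SIZE = 4096              # assumed size of one page-table page
PTE_TABLE_SPAN = 2 * 2**20    # address space covered by one bottom-level table


def zero_page_footprint(address_space_bytes):
    """Page-table bytes pinned by reading one byte per 2MB of the range.

    Each read lands in a different bottom-level table, so every read
    instantiates a whole table page holding a single zero-page entry.
    """
    tables = address_space_bytes // PTE_TABLE_SPAN
    return tables * PAGE_SIZE


one_tb = 2**40
pinned = zero_page_footprint(one_tb)
print(pinned // 2**20, "MiB of page tables")
```

All of those entries point at the same shared page, so the process's own data footprint stays tiny while the page-table footprint grows with the size of the mapping.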

There are other tricks user space can use to create large numbers of page-table pages. Some of them, like so many good exploits, involve the userfaultfd() system call.

So what is to be done? Reclaiming empty page-table pages is "low-hanging fruit", and the patches to do so already exist. The kernel could also delete mappings to the zero page while scanning memory; that would eventually empty out page-table pages containing only zero-page mappings, and allow those to be reclaimed as well. Ordinary reclaim unmaps file-backed pages, which can create empty page-table pages, but that is not an explicit objective now. It might be possible to get reclaim to focus on pages that are close to each other in the hopes of emptying out page-table pages, which could then be reclaimed as well.
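The scanning idea can be sketched as a toy pass over a set of page-table pages. The data layout, the ZERO_PFN value, and the function name are all hypothetical; real page tables are walked quite differently.

```python
ZERO_PFN = 0  # stand-in for the shared zero page's frame number (hypothetical)


def scan_and_reclaim(tables):
    """Drop zero-page entries from each table; return the tables still needed.

    `tables` maps a table id to a dict of {entry index: pfn}.
    """
    kept = {}
    for table_id, entries in tables.items():
        live = {i: pfn for i, pfn in entries.items() if pfn != ZERO_PFN}
        if live:
            kept[table_id] = live   # still maps real data, so keep the table page
        # otherwise the table page is now empty and could be freed
    return kept


tables = {
    "a": {0: ZERO_PFN, 5: ZERO_PFN},   # only zero-page mappings
    "b": {1: ZERO_PFN, 2: 4242},       # mixed contents
    "c": {},                           # already empty
}
remaining = scan_and_reclaim(tables)
print(remaining)
```

Dropping a zero-page entry is safe in this model for the reason given earlier: a later read would simply fault the zero page back in.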

There are some other questions to be answered as well, though. Some page-table pages may be actively used, even if they just map the zero page; reclaiming them could hurt performance if they simply have to be recreated right away. There are also questions about whether it makes sense to reclaim higher-level page tables. Reclaiming empty PMD pages might be worthwhile, Hildenbrand said, but there is unlikely to be value in trying to go higher.

When should this reclaim happen? With regard to empty page-table pages, the answer is easy — as soon as they become empty. For zero-page mappings, the problem is a little harder. They can be found by scanning, but that scanning has its own cost. So perhaps this scanning should only happen when the kernel determines that there is suspicious zero-page activity happening. But defining "suspicious" could be tricky. Virtual machines tend to create a lot of zero-page mappings naturally, for example, so that is not suspicious on its own; it may be necessary to rely on more subjective heuristics. One such might be to detect when a process has a high ratio of page-table pages to its overall resident-set size.
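A user-space approximation of that last heuristic might look like the sketch below. It assumes the VmPTE and VmRSS fields of /proc/<pid>/status as inputs; the 0.5 threshold is an arbitrary placeholder, not a value proposed in the session.

```python
def status_field_kb(status_text, field):
    """Return one kB-valued field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)


def page_table_ratio(status_text):
    """Ratio of page-table memory to resident-set size for one process."""
    vmpte = status_field_kb(status_text, "VmPTE")
    vmrss = status_field_kb(status_text, "VmRSS")
    return vmpte / max(vmrss, 1)    # avoid dividing by zero for tiny processes


def looks_suspicious(status_text, threshold=0.5):
    """Flag a process whose ratio exceeds an (assumed) threshold."""
    return page_table_ratio(status_text) > threshold


normal = "VmRSS:\t  204800 kB\nVmPTE:\t     512 kB\n"
abusive = "VmRSS:\t    4096 kB\nVmPTE:\t 2097152 kB\n"
print(looks_suspicious(normal), looks_suspicious(abusive))
```

For comparison, a fully populated mapping needs about 8 bytes of bottom-level table per 4KB data page, a ratio near 0.002, so a process sitting far above that is mostly holding sparse or zero-page tables.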

Even when a removable page-table page has been identified, he said, removing it may not be trivial. There is a strong desire to avoid acquiring the mmap_lock, which could create contention issues; that limits when these removals can be done. Removing higher-level page-table pages is harder, since ordinary scanning can hold a reference to them for a long time. Waiting for the reference to go away could stall the reclaim process indefinitely.

With regard to what can be done to defeat a malicious user-space process, Hildenbrand said that the allocation of page-table pages must be throttled somehow. The best way might be to enforce reclaim of page-table pages against processes that are behaving in a suspicious way. One way to find such processes could be to just set a threshold on the amount of memory used for page-table pages, but that could perhaps be circumvented simply by spawning a lot of processes.
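The weakness of a per-process threshold is easy to show with made-up numbers. The limit, the process IDs, and the function below are all hypothetical.

```python
def over_limit(per_process_kb, limit_kb):
    """Return the processes that a per-process page-table limit would flag."""
    return [pid for pid, kb in per_process_kb.items() if kb > limit_kb]


LIMIT_KB = 1024 * 1024    # hypothetical 1GB per-process limit

# 64 processes, each staying just under the limit
usage = {pid: LIMIT_KB - 1 for pid in range(1000, 1064)}

flagged = over_limit(usage, LIMIT_KB)
total_kb = sum(usage.values())
print(len(flagged), "flagged;", total_kb // 2**20, "GiB in total")
```

No single process trips the check, yet together they hold nearly 64GB of page tables.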

At this point, the session wound down. Yang Shi suggested at the end that there might be some help to be found in the multi-generational LRU work, if and when that is merged. It has to scan page tables anyway, so adding the detection of empty page-table pages might fit in relatively easily.

Index entries for this article
Kernel: Memory management
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2022



Ways to reclaim unused page-table pages

Posted May 9, 2022 14:47 UTC (Mon) by nickodell (subscriber, #125165)

There must be something I'm missing here with regard to malicious processes.

>These are all legitimate use cases, and they can be problematic enough; it is also easy to write a malicious program that fills memory with empty page tables and brings the system to a halt, which is rather less legitimate.

Are the page tables used by a process not charged to it in memory accounting? If so, it seems like a malicious process which makes lots of page tables can be dealt with in the same way as a malicious process which allocates lots of memory. If not, why not?

Ways to reclaim unused page-table pages

Posted May 11, 2022 11:00 UTC (Wed) by david.hildenbrand (subscriber, #108299)

Hi,

considering cgroups, page tables are charged that way already, towards the total memory consumption of a cgroup. In cgroup v1 there is an additional "kmem" accounting to track+limit such kernel allocations separately, which seems to no longer exist in v2. So in v2 you can only limit the total memory consumption of a cgroup.

In general, the issue is that page tables are unmovable and unswappable. In environments where neither matters (e.g., no swap, no memory hotunplug, no THP, no memory compaction), there is no real difference from movable and swappable user pages.

However, usually we care about the difference, for example, when using swap for memory overcommit or having memory in the system managed by ZONE_MOVABLE for memory hotunplug or memory defragmentation. As page tables cannot go to swap and cannot be placed in ZONE_MOVABLE, malicious processes can trigger OOM or negatively affect the performance of other workloads in such setups, simply by consuming primarily unmovable and unswappable memory via page tables.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds