
Memory-management short topics: page-table sharing and working sets

By Jonathan Corbet
January 9, 2023
The kernel's memory-management developers have been busy before and during the holidays; the result is a number of patch sets making significant changes to that subsystem. It is time for a quick look at three of those projects. Two of them aim to increase the sharing of page tables between processes, while the third takes advantage of the multi-generational LRU to create a better picture of what a process's working set actually is.

Revisiting msharefs

Some applications are structured as a large set of independent processes, all sharing a (potentially large) region of memory. Each of those processes will have its own set of page tables for that shared region. Duplicating page tables imposes a relatively small cost when the number of processes is low, but when that number gets large, the memory occupied by page tables may exceed the size of the memory region they refer to. In many cases, this duplication of page tables brings no extra value.

For some time, Khalid Aziz has been working on a mechanism to allow cooperating processes to share page tables referring to a shared memory area; this work has, at times, taken the form of the mshare() system call and the msharefs filesystem. There have been concerns raised with both solutions, so now Aziz is back with yet another attempt. This implementation does away with new system calls and filesystems and, instead, just adds a new flag (MAP_SHARED_PT) to the mmap() system call. If a process maps a shared segment (implying that MAP_SHARED must also be provided) with this new flag, then the page tables mapping this segment will also be shared with the other users, saving the overhead of making an independent copy of those tables.

As with the other versions, there are some interesting semantics and limitations associated with shared page tables. Any address-space changes (such as an mprotect() call) made to the shared region by one process will apply to every process sharing the page tables; that is seen as an advantage for some use cases. The memory segment must be aligned at the PMD level (2MB on many architectures), and it must be mapped at the same virtual address in all processes. The same-address requirement could perhaps be removed, Aziz said, if there is a reason to do so.
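As a rough illustration, use of the proposed interface might look something like the sketch below. MAP_SHARED_PT exists only in the patch set, so its value here is a placeholder, and the fixed address and region size are arbitrary examples chosen to satisfy the PMD-alignment and same-address requirements.

    /* Hypothetical sketch of the proposed interface; MAP_SHARED_PT comes from
     * the patch set and is not in mainline headers, so it is defined here as a
     * placeholder.  The address and size are arbitrary examples. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    #ifndef MAP_SHARED_PT
    #define MAP_SHARED_PT 0x8000000      /* placeholder value for illustration */
    #endif

    #define REGION_SIZE (512UL << 21)         /* 1GB, a multiple of the 2MB PMD size */
    #define REGION_ADDR ((void *)0x400000000) /* PMD-aligned; must match in every process */

    int main(void)
    {
        int fd = shm_open("/shared_region", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0)
            return 1;

        /* MAP_SHARED is required; MAP_SHARED_PT additionally asks the kernel to
         * share the page tables mapping this region with the other users. */
        void *p = mmap(REGION_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* ... use the region ... */
        return 0;
    }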

Underneath the API, the implementation of page-table sharing follows the same lines as before. A separate mm_struct structure is created to manage the shared region as if it were a separate address space.

There have been no comments on the new version so far. One might expect that using mmap() would address most of the concerns about the user-space API for this feature. But this kind of page-table sharing, with its unique semantics, represents a significant memory-management change to serve a relatively rare use case. It is not yet clear that the case has been made that this functionality is worth the cost.

Copy-on-write page tables

A different, and somewhat more transparent, approach to page-table sharing can be found in this patch set from Chih-En Lin. When a process calls fork(), the new child process will share its memory with the parent. Any writable pages are marked copy-on-write (COW); should either process write to a COW page, that page will first be copied (breaking the sharing) so that the other process does not see the change. Sharing memory in this way saves a lot of copying, especially if the child process will not actually use much of the parent's memory.

While the parent's memory is not copied into the child on fork(), the parent's page tables are copied. If the parent process has a large address space, that copying can still create a significant cost, and it may be entirely useless if the child does not access that memory. Lin seeks to reduce that cost by, instead, extending the COW mechanism to the bottom (PTE) level of the page-table hierarchy.

A process must opt into the COW behavior with a new prctl() command (PR_SET_COW_PTE). Once that has been done, any new child processes will be created with shared page tables. The usual COW behavior applies here; should either process make a change to a PTE page, that page will be copied and the sharing will be broken. An mprotect() call, for example, would end up copying the affected page-table pages. Thus, COW page tables should not result in any behavioral changes visible to either side, other than fork() calls running a bit more quickly and requiring less memory.
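A minimal sketch of how a process might opt in appears below; it assumes the prctl() command from the patch set, and the PR_SET_COW_PTE value is a placeholder since the constant is not defined in mainline headers.

    /* Minimal sketch of opting into COW page tables before forking.
     * PR_SET_COW_PTE is defined by the patch set, not by mainline kernels; the
     * value and calling convention below are assumptions for illustration. */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SET_COW_PTE
    #define PR_SET_COW_PTE 65   /* placeholder; the real value comes from the patch */
    #endif

    int main(void)
    {
        if (prctl(PR_SET_COW_PTE, 1, 0, 0, 0) != 0)
            perror("prctl(PR_SET_COW_PTE)");  /* an unpatched kernel will reject this */

        /* Children created after this point share PTE pages with the parent;
         * the sharing is broken transparently when either side changes a PTE. */
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child */
        return 0;
    }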

Of course, that is not quite true. While a fork() may be a bit faster, other operations, including page-fault handling, may be slower due to the need to break the sharing of the page-table pages. Whether this sharing is beneficial overall may thus vary depending on the workload; benchmark results included in the cover letter show a 3-5% performance increase for some workloads, and a slight decrease for others. This variability of results explains the need to opt into the COW behavior; for most workloads it probably will not make enough of a difference to be worth the trouble.

Here, too, the implementation adds a certain amount of complexity to the core memory-management code. The sharing of page-table pages requires the addition of a reference count to each of those pages so the kernel knows when they are no longer in use. There are numerous operations that can require the sharing to be broken, including transparent huge-page collapse, kernel same-page merging, madvise() calls, and more. The code also has to properly handle page-table pages that cannot be shared, including those referring to pages that are pinned or mapped to device memory. The first posting of this work drew some questions about whether the added complexity was worth it (and a side discussion on better alternatives to fork()). There have been few responses to the current version, but it seems likely that this discussion has not yet reached its conclusion.

Working-set estimation

A process's "working set" is the subset of its pages that it is actually using at any given time. Identifying the working set is a key part of effective memory management; if a process's working set can be kept in RAM, that process will perform much better than if it must continually fault pages in from secondary storage. Giving a process more memory than is needed to hold the working set, though, is wasteful. So a lot of effort goes into trying to give each process just enough memory — but not too much.

In this patch set, Yuanchu Xie notes that the multi-generational LRU (MGLRU) work that was merged for the 6.1 kernel provides much of the infrastructure needed to create better working-set-size estimates. The MGLRU organizes a process's pages into "generations", with recently-used pages being placed into the youngest generation. Over time, unused pages age into the older generations, until they are eventually reclaimed.

The working set should thus be found in the youngest generations. The only problem is that the generational aging does not happen on any sort of set schedule; instead, it is done when memory pressure increases and the kernel needs to find pages to reclaim. As a result, the younger generations can accumulate pages that have not been used in some time, while pages that are part of the working set may remain stuck in the older generations; this situation can persist for some time if memory pressure is not high.

As a way of getting better working-set-size estimates out of the MGLRU, Xie adds a new mechanism to force aging to happen regularly. It takes the form of a new knob, memory.periodic_aging, that is implemented in the memory control-group controller, but for the root group only. It holds the aging interval in seconds; setting it to a non-zero value will enable periodic MGLRU aging system-wide. There is a new kernel thread, called kold, that does this aging work.

If memory.periodic_aging is set to, for example, 60 seconds, then the youngest generation for any process should contain the pages that are known to have been used within the last minute, while the second-youngest generation will hold pages that have been idle for more than one minute, but less than two. The kernel could use this information to adjust the amount of memory available to each process, but it could also be of use to user-space memory-management mechanisms. Processes could use their own working-set information to optimize their behavior and avoid using more memory than is available to them.

Before user space can use this information, though, it needs to be made available, which is not currently the case. So the patch set adds another memory-controller file called memory.page_idle_age to export generational data to user space. Reading this file will produce a table with counts of the number of pages in each of a set of fixed age ranges (ranging from one second to just over one hour), with separate lines for file-backed and anonymous pages. This information seems like it could be useful in a number of situations, including simply better understanding how the generational-aging algorithm is working.
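A user-space consumer of these knobs might look something like the sketch below; the control-group paths assume cgroup v2 mounted at /sys/fs/cgroup, both file names exist only in the patch set, and the output layout is whatever the patch produces.

    /* Sketch of how user space might drive the proposed interface: enable
     * periodic aging on the root control group, then dump the per-age page
     * counts.  The file names come from the patch set, not mainline kernels. */
    #include <stdio.h>

    int main(void)
    {
        /* Age the MGLRU generations every 60 seconds, system-wide. */
        FILE *f = fopen("/sys/fs/cgroup/memory.periodic_aging", "w");
        if (!f || fprintf(f, "60\n") < 0) {
            perror("memory.periodic_aging");
            return 1;
        }
        fclose(f);

        /* Read back the table of page counts per age range; file-backed and
         * anonymous pages are reported on separate lines. */
        f = fopen("/sys/fs/cgroup/memory.page_idle_age", "r");
        if (!f) {
            perror("memory.page_idle_age");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }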

This patch series is on its first posting, and has not yet drawn any review comments. It is far less invasive than the other patches examined here and seems like it should be less controversial. If nothing else, though, this work could benefit from some documentation so that potential users of the new functionality do not need to reverse-engineer its interface from the source.




Memory-management short topics: page-table sharing and working sets

Posted Jan 9, 2023 19:31 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

> A process must opt into the COW behavior with a new prctl() command (PR_SET_COW_PTE)

If there's no user-visible behavior change, why make processes opt in?

Memory-management short topics: page-table sharing and working sets

Posted Jan 9, 2023 19:42 UTC (Mon) by mb (subscriber, #50428) [Link] (1 responses)

>why make processes opt in?

There's a whole paragraph in the article about the reasons for why this is opt-in.

Memory-management short topics: page-table sharing and working sets

Posted Jan 9, 2023 19:54 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Yes, I noticed that. :-) The question stands though: in most other cases, performance optimizations don't have to be individually enabled. (Is there a prctl for folios or EAS?) I understand that the patch currently regresses performance in some use cases, but that's just a signal to me that it's not yet ready. Ideally, the system would tune itself. If it can't --- because the kernel doesn't have enough information to know what a process is going to do --- then the prctl or personality flag or whatever should provide the missing information to the kernel at a high level, not act as a low level toggle for an implementation detail of the mm subsystem.

Memory-management short topics: page-table sharing and working sets

Posted Jan 9, 2023 22:41 UTC (Mon) by walters (subscriber, #7396) [Link]

There's also https://lwn.net/Articles/908268/ which I didn't see mentioned.

Memory-management short topics: page-table sharing and working sets

Posted Jan 11, 2023 18:47 UTC (Wed) by mmechri (subscriber, #95694) [Link] (1 responses)

Does anyone know how the work to augment MGLRU for working-set estimation compares to the following approaches?

- Brendan Gregg's WSS tools [1]
- DAMON [2] [3]

[1] https://www.brendangregg.com/wss.html
[2] https://damonitor.github.io/test/result/visual/v24/index....
[3] https://sjp38.github.io/post/damon_profile_callstack_exam...

Memory-management short topics: page-table sharing and working sets

Posted Jan 12, 2023 23:02 UTC (Thu) by Yuanchu (subscriber, #153443) [Link]

The working set work focuses on understanding the usage pattern of a workload, per page type and per NUMA node, with granularity in minutes. It should also have low enough overhead that it doesn't impact application performance by much (to be evaluated), and could be turned on for most latency tolerant workloads to aid proactive reclaim.

The parallel in Brendan Gregg's WSS would be idle page tracking, which tracks accesses by setting and checking the PG_idle bit. The benefit here is that MGLRU already does almost all of this during aging, and is less clunky than userspace writing to /sys/kernel/mm/page_idle/bitmap.

With DAMON, you can get a lot more, e.g. a heatmap, but there's additional work and tuning required to make use of it. The working set extensions are really about exposing information MGLRU already has.

