Looking forward to mapcount madness 2025
In theory, tracking mappings should be relatively straightforward: when a mapping to a page is added, increment that page's mapping count to match. The removal of a mapping should lead to an associated decrement of the mapping count. But huge pages and large folios complicate this scenario; they have their own mapping counts that are, essentially, the sum of the mapping counts for the pages they contain. It is often important to know whether a folio as a whole has mappings, so the separate count is useful, but it brings some complexities.
For example, one question that the kernel often asks is: how many processes have mapped a given page or folio? There are a number of operations that can be optimized if it is known that a mapping is exclusive — that the page or folio is mapped by a single process. The handling of copy-on-write pages is also hard to execute correctly if the exclusivity of a given mapping is not known; failures on this front have led to some unpleasant bugs in the past. For a single page, the exclusivity question is easily enough answered: if the mapping count is one, the mapping is exclusive, otherwise it is shared. That rule no longer applies if mapping counts are maintained at the folio level, though, since the folio-level count will almost never be just one.
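As a rough illustration, the single-page test can be written as a one-line check. This is a minimal sketch, not the kernel's actual interface: the kernel stores _mapcount biased by -1 (so -1 means "no mappings"), and the real helpers also cope with atomicity and special page types.

	/* Minimal sketch: is this page mapped by exactly one user? */
	static inline bool page_mapped_exclusively_sketch(struct page *page)
	{
		/* _mapcount is biased by -1, so 0 means exactly one mapping */
		return atomic_read(&page->_mapcount) == 0;
	}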
The current scheme also has performance-related problems that folios could help to address. Mapping a traditional PMD-size huge page is equivalent to mapping 512 base pages; currently, if the entire huge page is mapped, the mapping count of each of its base pages must be incremented accordingly. Incrementing the mapping count on each of those base pages takes time that the kernel developers would rather not spend; it would be far faster to keep track of a single mapping count at the folio level. That optimization can only make the exclusivity question even harder to answer, though, especially in the presence of partially mapped folios, where only some of a folio's pages are mapped into a given address space.
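To see where the time goes, consider this hedged sketch contrasting the two schemes; HPAGE_PMD_NR is the kernel's constant for the number of base pages in a PMD-size page (512 on x86), but the function names here are hypothetical:

	/* Current scheme: one atomic increment per base page. */
	static void map_pmd_page_per_page(struct page *head)
	{
		for (int i = 0; i < HPAGE_PMD_NR; i++)
			atomic_inc(&head[i]._mapcount);
	}

	/* Folio-level scheme: a single increment covers the whole mapping. */
	static void map_pmd_page_per_folio(struct folio *folio)
	{
		atomic_inc(&folio->_large_mapcount);
	}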
Thus, it is not surprising that the kernel developers have spent years trying to figure out how to properly manage mapping counts as memory-management complexity increases.
Make room!
To better track mapping counts at the folio level, Hildenbrand first needed a bit more space in the folio structure for some additional information. struct folio is a bit of a complicated and confusing beast. As a way of facilitating the transition to folio use throughout the kernel, this structure is overlaid on top of struct page, which describes a single page. But folios often need to track more information than can be fit into the tightly packed page structure; this is especially true for large folios that contain many pages.
But, since a large folio does contain many pages — and physically contiguous pages at that — there are some tricks that can be employed. There is no real need to maintain a full page structure for every page within a folio, since they are managed as a unit; indeed, eliminating the management of all of those page structures is one of the objectives of the folio transition. But those page structures exist, laid out contiguously in the system's memory map. So a large folio does not have just one page structure's worth of memory at its disposal; it has the page structures for all of the component pages. The page structures for the "tail pages" — those after the first one — can thus be carefully put to use holding this additional information.
If one looks at the definition of struct folio, it quickly becomes clear that it is larger than a single page structure. After the initial fields that overlay the page structure for the head page, one will find this:
	union {
		struct {
			unsigned long _flags_1;
			unsigned long _head_1;
			atomic_t _large_mapcount;
			atomic_t _entire_mapcount;
			atomic_t _nr_pages_mapped;
			atomic_t _pincount;
	#ifdef CONFIG_64BIT
			unsigned int _folio_nr_pages;
	#endif
			/* private: the union with struct page is transitional */
		};
		struct page __page_1;
	};
This piece of the folio structure precisely overlays the page structure of the first tail page, assuming such a page exists. It contains information intended to help maintain the mapping count in current kernels, and other relevant fields. There is also a __page_2 component (not shown) that mainly holds information used by the hugetlbfs subsystem. As a result, the folio structure is actually the length of three page structures, though most of it is only valid for large (at least four pages) folios.
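The overlay only works if the fields above land at exactly the offsets of the corresponding tail-page structures; the kernel verifies the layout at build time. The following is a simplified sketch of the kind of assertion found in <linux/mm_types.h>:

	/* Sketch: each page-1 field must sit at its counterpart's offset
	 * within the second struct page of the folio. */
	#define FOLIO_MATCH(pg, fl) \
		static_assert(offsetof(struct folio, fl) == \
			      offsetof(struct page, pg) + sizeof(struct page))

	FOLIO_MATCH(flags, _flags_1);
	FOLIO_MATCH(compound_head, _head_1);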
As sprawling as this seems, it still lacks the space Hildenbrand needed to better track mapping counts. To be able to handle order-1 (two-page) folios, he needed that space to fit within the page-1 union shown above. So the first six patches of the series are dedicated to shuffling fields around in the folio structure, adding a __page_3 union in the process. The __page_1 union gains some complexity, but the core of the work is in these new fields:
	mm_id_mapcount_t _mm_id_mapcount[2];
	union {
		mm_id_t _mm_id[2];
		unsigned long _mm_ids;
	};
They will be used to keep better track of the mappings for the folio to which they belong. Describing how that is done requires a bit more background, though.
One, two, or many
So how does all of this work help to improve the tracking of the mapping counts for large folios that may be shared between multiple processes and which can be partially mapped in any one of them? The starting point is the mm_struct structure that represents a process's address space. Any time a folio is mapped, that mapping will belong to a specific process, and thus a specific mm_struct structure. So the question of whether a folio is exclusively mapped comes down to whether all of its mappings belong to the same mm_struct. It is a simple matter of tracking which mm_struct structures hold mappings to the folio.
Of course, there could be thousands of those structures containing such mappings; consider that almost every process in the system will have the C library mapped, for example. Tracking all of those mappings without consuming a lot of time and memory would not be an easy task. But it is not really important to track every mapping to something like the C library; the purpose here is to stay on top of the folios that are exclusively mapped, and thus don't have all those mappings.
The _mm_id array that was added to page 1 of the folio structure is intended to serve this purpose; it can track up to two mm_struct structures that have mappings to the folio. The most straightforward way to do that would be to just store pointers to those mm_struct structures, but space in the folio structure is still at a premium. So, instead, a shorter "mm ID" is assigned to each mm_struct, using the kernel's ID allocator subsystem.
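A minimal sketch of how such an ID might be assigned when an mm_struct is created, using the kernel's IDA interface; the mm_ida allocator and the mm_id field are illustrative assumptions, not the patch's actual names:

	static DEFINE_IDA(mm_ida);	/* hypothetical global ID allocator */

	static int mm_assign_id(struct mm_struct *mm)
	{
		/* reserve 0 for MM_ID_DUMMY; the upper bound is illustrative */
		int id = ida_alloc_range(&mm_ida, 1, (1U << 20) - 1, GFP_KERNEL);

		if (id < 0)
			return id;
		mm->mm_id = id;	/* hypothetical field */
		return 0;
	}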
When a folio is first created, both _mm_id entries are set to MM_ID_DUMMY, indicating that they are unused. When the time comes to add a mapping, the kernel will search _mm_id for the appropriate mm ID, then increment the associated _mm_id_mapcount entry to record the new mapping(s). So, for example, if eight pages within a folio are mapped into the address space, the count will be incremented by eight to match. If the mm ID does not have an entry in _mm_id, the kernel will look for an MM_ID_DUMMY entry to use for this mm_struct, then start tracking the mappings there.
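Putting that together, here is a hedged sketch of the slot logic just described, with the types simplified to plain integers and all locking omitted; folio_set_maybe_shared() is a stand-in for the fallback bit discussed below:

	static void folio_add_mm_mappings(struct folio *folio, int id, int nr_pages)
	{
		/* Does this mm already own a slot? */
		for (int i = 0; i < 2; i++) {
			if (folio->_mm_id[i] == id) {
				folio->_mm_id_mapcount[i] += nr_pages;
				return;
			}
		}
		/* No: try to claim an unused slot. */
		for (int i = 0; i < 2; i++) {
			if (folio->_mm_id[i] == MM_ID_DUMMY) {
				folio->_mm_id[i] = id;
				folio->_mm_id_mapcount[i] = nr_pages;
				return;
			}
		}
		/* Both slots belong to other mms: precise tracking is lost. */
		folio_set_maybe_shared(folio);	/* hypothetical helper */
	}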
The kernel is now maintaining multiple mapping counts for this folio. The _large_mapcount field of the folio structure continues to count all of the mappings to the folio from any address space, as it does in current kernels. But there is also the _mm_id_mapcount count for each mm_struct tracking the number of mappings associated with that specific structure. The question of whether the folio is mapped exclusively is now easy to answer: if one of the _mm_id_mapcount counters is equal to _large_mapcount, then all of the mappings belong to the associated mm_struct and the kernel knows that the mapping is exclusive. Otherwise, the mapping is shared.
The ability to track two mm_struct structures handles the most common case of short-term shared mappings — when a process calls clone() to create a new child process. That new process will use the second _mm_id slot for the mapping that is now shared between the parent and the child. If, as often happens, the child calls execve() to run a new program, the shared mapping will be torn down, the child's _mm_id slot will be released, and the kernel will know that the folio is, again, mapped exclusively.
There is just one tiny gap in this mechanism, though: what happens when a third process comes along and maps the folio? There will be no _mm_id slot available for it, so its mapping(s) cannot be tracked. Should this happen, the kernel will set a special bit in the folio structure indicating that it no longer has a handle on where all the mappings to the folio come from, and will treat it as being shared. This could result in the kernel mistakenly concluding that a folio is mapped shared when it is mapped exclusively; the consequence will be worse performance, but no lack of correctness. If enough processes unmap the folio, there could come a time when _large_mapcount again aligns with one of the _mm_id_mapcount counts, and the kernel will once again know that the folio is mapped exclusively.
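With those pieces in place, the exclusivity test reduces to a comparison, as in this sketch (again with the types simplified, the counter bias ignored, and a hypothetical folio_test_maybe_shared() consulted first):

	static bool folio_mapped_exclusively(struct folio *folio)
	{
		int total = atomic_read(&folio->_large_mapcount);

		/* if precise tracking was lost, conservatively assume shared */
		if (folio_test_maybe_shared(folio))	/* hypothetical */
			return false;

		for (int i = 0; i < 2; i++) {
			if (folio->_mm_id[i] != MM_ID_DUMMY &&
			    folio->_mm_id_mapcount[i] == total)
				return true;	/* all mappings from one mm */
		}
		return false;
	}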
Per-page mapcounts and more
The result of all this work is that the kernel has a better handle on whether any given folio is mapped exclusively or shared, though it may still occasionally conclude that a folio is shared when it is not. But that was not the only objective of this work; Hildenbrand also would like to do away with the overhead of maintaining the per-page mapping counts in large folios. The final part of the patch series is an implementation of that goal; at the end, the per-page counts are no longer used or maintained.
The most significant consequence of dropping the per-page mapping counts appears to be making some of the memory-management statistics provided by the kernel (the various resident-set sizes, for example) a bit fuzzier. Hildenbrand suggests that this imprecision should not be a problem, but he also acknowledges that it will take time to see what the implications really are. To avoid surprises during that time, there is a new configuration parameter, CONFIG_NO_PAGE_MAPCOUNT, that controls whether these changes are effective. This work is considered experimental enough that, at this point, Hildenbrand does not want to have it enabled by default in production kernels.
There will be a desire to do that at some point, though; dropping the per-page map counts can make a clone() call up to 20% faster for some workloads, according to performance results included in the patch cover letter.
Meanwhile, this work enables another optimization with regard to how some transparent huge pages are used after a process forks. In current kernels, if the huge page (folio) is mapped at the base-page level ("PTE mapped"), it will not be reused after the fork. As the use of transparent huge pages grows, and especially of multi-size huge pages that must be PTE mapped, reusing those huge pages will become increasingly important. Now, with the per-mm_struct mapping counts, the kernel can tell when a process has exclusive access to the huge page and can continue to use it as such. This reuse yields significant improvements in some benchmark results.
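In copy-on-write terms, the write-fault decision then comes down to something like the following sketch, where folio_mapped_exclusively() is the illustrative helper from above rather than the kernel's actual interface:

	/* On a write fault: reuse the folio in place if this mm is its
	 * exclusive mapper, otherwise fall back to copying. */
	static bool can_reuse_after_fork(struct folio *folio)
	{
		return folio_mapped_exclusively(folio);
	}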
The use of large folios is expected to grow in the future; they are a more
efficient way to manage much of the memory that any given process uses. So
it is important to optimize that case as much as possible. Hildenbrand's
patch set makes some steps in that direction while addressing a thorny
problem that has resisted solution for years. These changes are currently
in the linux-next repository, so there is a reasonable possibility that
they could land in the mainline during the 6.15 merge window. If so, the
2025 Linux Storage,
Filesystem, Memory-Management, and BPF Summit, which will be concurrent
with that merge window, may be the last to feature a "mapcount madness"
session.
Index entries for this article
Kernel: Memory management/Folios
Trying to be too clever for their own good?
Posted Mar 18, 2025 8:43 UTC (Tue) by taladar (subscriber, #68407) [Link] (7 responses)
Trying to be too clever for their own good?
Posted Mar 18, 2025 9:36 UTC (Tue) by dottedmag (subscriber, #18590) [Link] (5 responses)
It's a well-known fact in the theoretical CS community that type systems need to strike a balance between complexity and expressivity, or they will either be too hard to apply or will reject too many sound programs.
Lots of code is straightforward, and type systems are helpful, as they cover the common scenarios.
The desired behaviour in this tiny piece of code though is very complex, as it is shaped by many competing constraints, so expecting a type system that will be designed just right to allow checking these complicated invariants is futile. So testing has to take other forms: manually written tests, code reviews, load testing, fuzzing, custom static checks etc.
If the correspondence "type system = static check" is not obvious, consider that "variable X is private in class Y" is the same as the static check "variable X inside a struct created by function Y::New can be accessed only by a function named Y::*".
Trying to be too clever for their own good?
Posted Mar 18, 2025 9:39 UTC (Tue) by dottedmag (subscriber, #18590) [Link]
I have never written any kind of static check for C in 30 years; it is too hard to handle, overlaid with the preprocessor as it is. Compared to that, I have written dozens of application-specific static checks for Go in 5 years. The type systems of both languages are similarly expressive.
Trying to be too clever for their own good?
Posted Mar 18, 2025 9:49 UTC (Tue) by Wol (subscriber, #4433) [Link] (3 responses)
I'm also getting the impression that if you need a complex type system, you have usually broken Einstein's dictum "make it as simple as possible, but no simpler". Witness the comments about how Rust couldn't type check a common lock pattern, only for someone to ask "why are you doing it that way? It doesn't actually make sense".
There's the old saw that computerising a broken system merely gives you even more broken system, just faster. Imposing a type system (and modifying the type system to suit) on an excessively complex broken system just gives you an even more complex broken system. You need to refactor the complexity away (and yes, I know that's a damn hard job), and then you'll end up with a much simpler system, and a simple type system.
And that is clearly the case here. They are trying to refactor away a lot of obviously unnecessary complexity. The big problem is you need to get your head round the complex now before you can safely change it to a simpler future.
Cheers,
Wol
Trying to be too clever for their own good?
Posted Mar 18, 2025 10:08 UTC (Tue) by dottedmag (subscriber, #18590) [Link] (1 responses)
However, it is not proven that, for any given problem, an arbitrary set of constraints delivers a solution that is globally optimal compared to solutions bound by other complexity limits, or to ones that are complexity-unbounded.
Note that I'm not describing specific kinds of requirements: both system behaviour and properties important to the people producing the system may and do shape the solution.
Trying to be too clever for their own good?
Posted Mar 18, 2025 11:20 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link]
For example, why not use a single MM slot instead of two? Well, because there are situations (esp. pageout, page migration) where our owner could suddenly change even when only two processes are involved, such as with a short-lived fork. So I decided on a *slightly* more complicated approach (two slots, which didn't add that much complexity or runtime overhead after all), which makes the whole thing at least predictable in the common case.
I suggest taking a look at the introduction of v2 to see "how good" this work was for me ;)
Trying to be too clever for their own good?
Posted Mar 18, 2025 12:02 UTC (Tue) by pizza (subscriber, #46) [Link]
Then there's the problem that "obviously unnecessary" .... is rarely obvious, and usually a sign that the problem is not properly understood.
Trying to be too clever for their own good?
Posted Mar 18, 2025 10:05 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link]
The whole "making space in folios" nasty magic is just a temporary thing (primarily to make 32bit support fly!) until Willy manages to allocate "struct folio" separately, then this will just clean up very nicely.
Note that this work is part of a series of optimizations and cleanups. The munmap() speedup is much more impressive/important than the fork() speedup mentioned in this article. These "a little bit faster" things accumulate. For example, munmap() in some cases was speed up by up to a factor of 4 with larger folios in some of my measurements ever since we started this whole mapcount rework.
The hardest part is dealing with the legacy stuff and the legacy interfaces that are affected by the per-page mapcounts.
The mapcount madness is not over yet: some future work (see the cover letter) will make it finally possible to cleanly map folios that are larger than a single PMD (e.g., 4MiB, 8 MiB ...). Once that is in and the per-page mapcounts are gone for good, I think we'll be in pretty good shape.
Consequences of fuzzier accounting
Posted Mar 18, 2025 10:45 UTC (Tue) by ferringb (subscriber, #20752) [Link] (1 responses)
I'm mostly idly wondering about potential impacts to OOMK selection in containerized workloads.
Consequences of fuzzier accounting
Posted Mar 18, 2025 11:10 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link]
Most importantly, the RSS of a process won't be affected. Further, for most workloads (not having many partially-mapped folios), nothing will really change. In the common case (mostly fully mapped large folios), nothing will change at all.
The only things that can be affected in /proc/$PID/smaps *with large folios / THPs* will be:
1) PSS: Proportional Set Size (Pss / Pss_dirty)
It's confusing; in contrast to the Rss, it's supposed to represent the "Real memory footprint" of a process in a better way, but there are already scenarios where it will be incorrect.
As we no longer have the per-page mapcount information in large folios, we calculate it based on the per-folio average. So in extreme cases it could be higher or lower (again, only with partially mapped large folios).
2) USS: Unique Set Size
"If I kill this process now, how much memory will get freed up". The fuzzy accounting means that you might end up freeing more memory up (something indicates as shared/non-unique although it is exclusive/unique).
For example, more memory might be indicated as Shared_Clean instead of Private_Clean; same with Shared_Dirty vs. Private_Dirty.
Again, for partially-mapped folios the accounting of e.g., "AnonPages", "Mapped" in /proc/meminfo (and similar cgroup/per-node interfaces) can be higher. I am not 100% sure if that will really matter: because these folios *do consume that memory* even when partially mapped. The very last patch in the series documents that.
Coming back to your question: these are exactly the things we have to learn, and why I am being careful. Changes to PSS/USS etc. should not matter, because it's not going to be the common case. Regarding changes to /proc/meminfo, I really don't know yet.