Fleshing out memory descriptors
Wilcox started by saying that he has been thinking about what will happen once the folio conversion is done. The ultimate goal, he said, looks like this:
struct page { u64 memdesc; };
The lowest four bits would be a type field saying what kind of descriptor it is; the rest would (usually) be a pointer to a type-specific structure. David Hildenbrand immediately said that what is really needed is a type hierarchy; some types have subtypes, and the kernel will surely exceed the 16 types that can be represented in those four bits at some point; 11 types have already been defined. Wilcox disagreed, noting that no new types had been added for some time and questioning whether the kernel would ever run out. I remarked that I was documenting that claim for posterity, to general laughter.
Descriptor type zero, he said, would be a special type indicating "miscellaneous memory with no further data". It would, as it turns out, have a number of subtypes. Pages falling under this type could include those in the vmalloc range, guard pages, offline pages, and others. Bit 11 of the descriptor would be set if the page can be mapped to user space, bits 12-17 would contain the page order, and the higher bits could contain information about which node and zone contain the page.
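The bit layout described above can be sketched in a few lines of C. This is only an illustration: the macro and function names are hypothetical, and masking off the low bits to recover a pointer assumes that descriptors are aligned to at least 16 bytes. Only the bit positions (type in the low four bits, bit 11 for user-mappable, bits 12-17 for the order) come from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the discussion. */
#define MEMDESC_TYPE_BITS  4
#define MEMDESC_TYPE_MASK  ((UINT64_C(1) << MEMDESC_TYPE_BITS) - 1)

#define MISC_MAPPABLE_BIT  11               /* set if mappable to user space */
#define MISC_ORDER_SHIFT   12               /* bits 12-17 hold the page order */
#define MISC_ORDER_MASK    UINT64_C(0x3f)

static_assert(MEMDESC_TYPE_MASK == 0xf, "type field is the low four bits");

/* Extract the descriptor type from the low four bits. */
static inline unsigned memdesc_type(uint64_t memdesc)
{
    return (unsigned)(memdesc & MEMDESC_TYPE_MASK);
}

/* For pointer-carrying types: clear the type bits to recover the pointer.
 * Assumes the pointed-to structure is at least 16-byte aligned. */
static inline void *memdesc_ptr(uint64_t memdesc)
{
    return (void *)(uintptr_t)(memdesc & ~MEMDESC_TYPE_MASK);
}

/* Type-zero ("miscellaneous") descriptors carry flags instead of a pointer. */
static inline bool misc_mappable(uint64_t memdesc)
{
    return (memdesc >> MISC_MAPPABLE_BIT) & 1;
}

static inline unsigned misc_order(uint64_t memdesc)
{
    return (unsigned)((memdesc >> MISC_ORDER_SHIFT) & MISC_ORDER_MASK);
}

/* Build a type-zero descriptor; the type bits are left as zero. */
static inline uint64_t misc_make(bool mappable, unsigned order)
{
    return ((uint64_t)mappable << MISC_MAPPABLE_BIT) |
           (((uint64_t)order & MISC_ORDER_MASK) << MISC_ORDER_SHIFT);
}
```

The node and zone information said to live in the higher bits is left out, since the discussion did not say where those fields would go.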
There was a brief discussion of how memory descriptors would be allocated; Wilcox envisioned an interface like:
struct page *page = alloc_page_desc(MEMDESC_TYPE_FOLIO);
Jason Gunthorpe remarked that he would like to see more details on what the state transitions for memory descriptors will be.
Wilcox moved on to discussing pages owned by the buddy allocator; they would have a descriptor that looks like:
struct buddy { unsigned long prev; unsigned long next; };
That design reduces the size of the descriptor to two 64-bit integers, which is "a step in the right direction". That information would be enough to support basic allocator operations like insertion, removal, and merging. The amount of space needed for the descriptor could be reduced by storing page-frame numbers rather than addresses. Given a willingness to limit installed memory to 2TB, the descriptor could be condensed down to eight bytes. The only problem with that idea is that systems with more than 2TB of installed memory are on the market now.
This descriptor could be reduced further by making it contain page-frame numbers relative to the base of the zone containing the pages. At that point, each memory zone could contain 2TB of memory; with enough zones, much larger total memory sizes could be handled. Wilcox thought that this solution might come at the cost of having to add more memory zones (generally seen as undesirable), but Vlastimil Babka pointed out that large-memory machines use a NUMA architecture, so the memory is already divided into multiple zones.
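As a rough illustration of the arithmetic: assuming 4KiB pages, 2TB of memory is 2^29 page frames, so each link fits in 29 bits and two links fit in 58 bits of a 64-bit word. The sketch below packs zone-relative page-frame numbers that way. The layout, the names, and the choice to leave the low four bits clear for a type field are all assumptions made for this example; none of them is a layout proposed in the session.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                      /* assumption: 4KiB pages */
#define PFN_BITS    29                      /* 2TB / 4KiB = 2^29 page frames */
#define PFN_MASK    ((UINT64_C(1) << PFN_BITS) - 1)
#define PREV_SHIFT  4                       /* low four bits left clear (assumed type field) */
#define NEXT_SHIFT  (PREV_SHIFT + PFN_BITS)

static_assert((UINT64_C(1) << (PFN_BITS + PAGE_SHIFT)) == (UINT64_C(2) << 40),
              "29-bit PFNs of 4KiB pages cover exactly 2TB");

/* Pack two links as page-frame numbers relative to the zone's first PFN.
 * Offsets beyond 2^29 pages are silently truncated in this sketch. */
static inline uint64_t buddy_pack(uint64_t zone_base_pfn,
                                  uint64_t prev_pfn, uint64_t next_pfn)
{
    uint64_t prev = (prev_pfn - zone_base_pfn) & PFN_MASK;
    uint64_t next = (next_pfn - zone_base_pfn) & PFN_MASK;

    return (prev << PREV_SHIFT) | (next << NEXT_SHIFT);
}

/* Recover the absolute PFN of the previous entry. */
static inline uint64_t buddy_prev(uint64_t zone_base_pfn, uint64_t packed)
{
    return zone_base_pfn + ((packed >> PREV_SHIFT) & PFN_MASK);
}

/* Recover the absolute PFN of the next entry. */
static inline uint64_t buddy_next(uint64_t zone_base_pfn, uint64_t packed)
{
    return zone_base_pfn + ((packed >> NEXT_SHIFT) & PFN_MASK);
}
```

Because the stored values are offsets from the zone base, the absolute page-frame numbers can be arbitrarily large; only the span of a single zone is limited.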
A 30-minute slot is clearly not enough to design the descriptor-based future, so it is not surprising that this discussion did not get much further. Wilcox brought it to a conclusion by saying that his goal for this year is to get rid of the mapping and index fields of struct page; that will require some work to fix the existing users in the kernel. Then the work of splitting the various users of page structures into specific descriptor types can proceed. Once approximately half of users have been converted, he will submit a patch to shrink the page structure; it "should all just work", he said. That will lead to the next important phase of this transition: seeing where the performance regressions are; he admitted that he does not know how that will work out.
(Wilcox has also put together a few wiki pages on the memory-descriptor design).
Index entries for this article | |
---|---|
Kernel | Memory management/Memory descriptors |
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2024 |
Posted May 27, 2024 14:05 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) (6 responses)
The tricky part is that freeing a page then inserts onto a radix tree, which might allocate - recursion.
You can satisfy the allocation from the page you're freeing, but that might have to be split, so it gets tricky.
Posted May 27, 2024 14:23 UTC (Mon) by willy (subscriber, #9762) (5 responses)
I already got it down to a u64, not sure why you're trying to complicate it further?
Posted May 27, 2024 15:07 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) (4 responses)
We talked about this before; I think the ultimate conclusion was that the linked lists in the buddy allocator really aren't that bad, so it's probably not worth the hassle. Just throwing it out there in case it gets revisited or someone ambitious comes along.
Posted May 27, 2024 16:07 UTC (Mon) by willy (subscriber, #9762) (3 responses)
Posted May 27, 2024 16:09 UTC (Mon) by willy (subscriber, #9762) (2 responses)
There is not. I do not propose ever allocating a struct buddy.
That's clear in the wiki pages, I hope. And if it isn't, then I'll remedy that.
Posted May 27, 2024 18:29 UTC (Mon) by pbonzini (subscriber, #60935) (1 response)
Would it make sense to tackle the "private" field _last_ (not first) by sticking it in the second word of struct page, that is:
Posted May 27, 2024 18:56 UTC (Mon) by willy (subscriber, #9762)
Posted May 27, 2024 17:30 UTC (Mon) by pbonzini (subscriber, #60935) (1 response)
Posted May 27, 2024 17:57 UTC (Mon) by willy (subscriber, #9762)
The mechanism is to mark the slab cache for allocating a memdesc as TYPESAFE_BY_RCU. Then something like GUP-fast can be implemented by rcu_read_lock(); READ_ONCE(page->memdesc); check the memdesc type; subtract off the type and cast the result to a folio; try_get() the refcount and then rcu_read_unlock(). Now you have a refcount so the folio can't be freed under you. You then need to check that the page table still refers to the same page and the page is part of the same memdesc. Now you're safe.
For the page cache, it's simpler because the page cache contains folio pointers, so it's the same check we do now.
I haven't spent much time thinking about the physical memory walkers (eg compaction and memory-failure). They seem like simpler cases than GUP-fast since there's no PTE to examine; we're going straight from a PFN to a memdesc. Although they're more complex because they have to handle memdescs which are not folios.
You're probably right that it needs an rcu_store or something to make sure the memdesc initialisation is ordered before the store to page->memdesc. I'll listen to the experts once we get to that point in the conversion!
Thanks for bringing it up.
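The lookup sequence described in this comment can be approximated in user space with C11 atomics. This is a sketch under stated assumptions, not kernel code: the structures are stand-ins, the RCU calls are no-op stubs, the folio type value is made up, and the page-table recheck that GUP-fast would also need is left out. A release store when publishing the descriptor, paired with an acquire load on the read side, stands in for the ordering concern raised at the end of the comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEMDESC_TYPE_MASK   UINT64_C(0xf)
#define MEMDESC_TYPE_FOLIO  UINT64_C(1)     /* hypothetical type value */

struct folio { atomic_int refcount; };      /* stand-in, not the kernel's struct folio */
struct page  { _Atomic uint64_t memdesc; };

/* No-op stand-ins for the kernel's rcu_read_lock()/rcu_read_unlock(). */
static inline void rcu_read_lock_stub(void) {}
static inline void rcu_read_unlock_stub(void) {}

/* Take a reference unless the count has already dropped to zero. */
static bool folio_try_get(struct folio *folio)
{
    int old = atomic_load_explicit(&folio->refcount, memory_order_relaxed);

    while (old > 0) {
        /* On failure, 'old' is reloaded and the loop retries. */
        if (atomic_compare_exchange_weak(&folio->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Publish a folio: the release store orders the folio's initialisation
 * before the descriptor becomes visible. Assumes 16-byte alignment. */
static void page_set_folio(struct page *page, struct folio *folio)
{
    atomic_store_explicit(&page->memdesc,
                          (uint64_t)(uintptr_t)folio | MEMDESC_TYPE_FOLIO,
                          memory_order_release);
}

/* Lockless lookup: returns a referenced folio, or NULL if the page is
 * not (or is no longer) part of a live folio. */
static struct folio *page_try_get_folio(struct page *page)
{
    struct folio *folio = NULL;

    rcu_read_lock_stub();
    uint64_t md = atomic_load_explicit(&page->memdesc, memory_order_acquire);
    if ((md & MEMDESC_TYPE_MASK) == MEMDESC_TYPE_FOLIO) {
        /* Subtract off the type to recover the pointer. */
        struct folio *candidate =
            (struct folio *)(uintptr_t)(md - MEMDESC_TYPE_FOLIO);

        if (folio_try_get(candidate)) {
            /* Recheck that the page still belongs to the same memdesc. */
            if (atomic_load_explicit(&page->memdesc,
                                     memory_order_acquire) == md)
                folio = candidate;
            else
                atomic_fetch_sub(&candidate->refcount, 1);
        }
    }
    rcu_read_unlock_stub();
    return folio;
}
```

Dereferencing the candidate before holding a reference is only safe here because of the type-stable allocation the comment describes: the memory may be reused for another descriptor of the same type, but it never stops being one while a reader is inside the read-side section.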
It is clearer in the wiki, yes:
struct page {
    union {
        unsigned long memdesc;
        struct buddy buddy;
    };
};
struct page {
    unsigned long memdesc_or_next;
    unsigned long private_or_prev;
};
?