
Fleshing out memory descriptors

By Jonathan Corbet
May 27, 2024

One of the long-term goals of the folio conversion in the kernel's memory-management subsystem is the replacement of the page structure, which describes a page of physical memory, with an eight-byte "memory descriptor". This change would reduce the overhead of tracking physical memory, increase type safety, and make memory management more flexible. Thus far, though, details on what the memory-descriptor future will look like have been relatively scarce. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Matthew Wilcox led a discussion to try to fill in the picture somewhat.

Wilcox started by saying that he has been thinking about what will happen once the folio conversion is done. The ultimate goal, he said, looks like this:

    struct page {
        u64 memdesc;
    };

The lowest four bits would be a type field saying what kind of descriptor it is; the rest would (usually) be a pointer to a type-specific structure. David Hildenbrand immediately said that what is really needed is a type hierarchy; some types have subtypes, and the kernel will surely exceed the 16 types that can be represented in those four bits at some point; 11 types have already been defined. Wilcox disagreed, noting that no new types had been added for some time and questioning whether the kernel would ever run out. I remarked that I was documenting that claim for posterity, to general laughter.
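As a rough illustration of that encoding, the following userspace C sketch packs a type into the low four bits of a 64-bit word and recovers both halves. The helper names and the numeric type values are invented for this example and do not come from the talk; the sketch also assumes that descriptors are at least 16-byte aligned, so that the low four bits of the pointer are free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the talk only specified "low four bits = type". */
#define MEMDESC_TYPE_BITS 4
#define MEMDESC_TYPE_MASK ((uint64_t)((1u << MEMDESC_TYPE_BITS) - 1))

enum memdesc_type {             /* illustrative values, not the kernel's */
    MEMDESC_TYPE_MISC  = 0,
    MEMDESC_TYPE_FOLIO = 1,
};

struct page {
    uint64_t memdesc;
};

/* Combine a 16-byte-aligned descriptor pointer with its type. */
static inline uint64_t memdesc_pack(const void *desc, enum memdesc_type type)
{
    uint64_t addr = (uint64_t)(uintptr_t)desc;

    assert((addr & MEMDESC_TYPE_MASK) == 0);    /* low bits must be free */
    assert((uint64_t)type <= MEMDESC_TYPE_MASK);
    return addr | (uint64_t)type;
}

/* Extract the type stored in the low four bits. */
static inline enum memdesc_type memdesc_get_type(const struct page *page)
{
    return (enum memdesc_type)(page->memdesc & MEMDESC_TYPE_MASK);
}

/* Mask off the type to recover the pointer to the type-specific structure. */
static inline void *memdesc_ptr(const struct page *page)
{
    return (void *)(uintptr_t)(page->memdesc & ~MEMDESC_TYPE_MASK);
}
```

A caller would check the result of memdesc_get_type() before casting the pointer returned by memdesc_ptr() to the matching structure type.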

Descriptor type zero, he said, would be a special type indicating "miscellaneous memory with no further data". It would, as it turns out, have a number of subtypes. Pages falling under this type could include those in the vmalloc range, guard pages, offline pages, and others. Bit 11 of the descriptor would be set if the page can be mapped to user space, bits 12-17 would contain the page order, and the higher bits could contain information about which node and zone contain the page.
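The bit positions given for type zero can be sketched in userspace C as follows. Only the user-mappable bit (11) and the order field (bits 12-17) are taken from the description above; the macro and function names are invented for the example, and the subtype, node, and zone fields are left out because their exact positions were not specified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the fields described in the talk. */
#define MISC_USER_MAPPABLE_BIT  11
#define MISC_ORDER_SHIFT        12
#define MISC_ORDER_BITS         6       /* bits 12-17 */
#define MISC_ORDER_MASK         ((UINT64_C(1) << MISC_ORDER_BITS) - 1)

/* Build a type-zero descriptor; the type bits (0-3) stay zero. */
static inline uint64_t misc_memdesc_make(bool user_mappable, unsigned int order)
{
    assert(order <= MISC_ORDER_MASK);
    return ((uint64_t)user_mappable << MISC_USER_MAPPABLE_BIT) |
           ((uint64_t)order << MISC_ORDER_SHIFT);
}

/* True if bit 11 says the page may be mapped to user space. */
static inline bool misc_memdesc_user_mappable(uint64_t memdesc)
{
    return (memdesc >> MISC_USER_MAPPABLE_BIT) & 1;
}

/* Page order stored in bits 12-17. */
static inline unsigned int misc_memdesc_order(uint64_t memdesc)
{
    return (unsigned int)((memdesc >> MISC_ORDER_SHIFT) & MISC_ORDER_MASK);
}
```

Six bits of order allow values up to 63, which is far more than any realistic allocation order.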

There was a brief discussion of how memory descriptors would be allocated; Wilcox envisioned an interface like:

    struct page *page = alloc_page_desc(MEMDESC_TYPE_FOLIO);

Jason Gunthorpe remarked that he would like to see more details on what the state transitions for memory descriptors will be.

Wilcox moved on to discussing pages owned by the buddy allocator; they would have a descriptor that looks like:

    struct buddy {
        unsigned long prev;
        unsigned long next;
    };

That design reduces the size of the descriptor to two 64-bit integers, which is "a step in the right direction". That information would be enough to support basic allocator operations like insertion, removal, and merging. The amount of space needed for the descriptor could be reduced by storing page-frame numbers rather than addresses. Given a willingness to limit installed memory to 2TB, the descriptor could be condensed down to eight bytes. The only problem with that idea is that systems with more than 2TB of installed memory are on the market now.
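One way the arithmetic behind the 2TB figure could work out is sketched below. The sketch assumes 4KB pages, so 2TB of memory is 2^29 page frames and each page-frame number fits in 29 bits; two of them plus a four-bit type field use 62 of the 64 available bits. The field positions and all names here are assumptions made for illustration, not a layout presented in the talk.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                      /* assumption: 4KB pages */
#define PFN_BITS    29                      /* 2TB / 4KB = 2^29 page frames */
#define PFN_MASK    ((UINT64_C(1) << PFN_BITS) - 1)
#define TYPE_BITS   4
#define PREV_SHIFT  TYPE_BITS               /* hypothetical field position */
#define NEXT_SHIFT  (TYPE_BITS + PFN_BITS)  /* hypothetical field position */

/* Pack a type and two list-neighbor PFNs into a single 64-bit word. */
static inline uint64_t buddy_pack(unsigned int type, uint64_t prev_pfn,
                                  uint64_t next_pfn)
{
    assert(type < (1u << TYPE_BITS));
    assert(prev_pfn <= PFN_MASK && next_pfn <= PFN_MASK);
    return (uint64_t)type | (prev_pfn << PREV_SHIFT) | (next_pfn << NEXT_SHIFT);
}

/* PFN of the previous free block on the list. */
static inline uint64_t buddy_prev_pfn(uint64_t desc)
{
    return (desc >> PREV_SHIFT) & PFN_MASK;
}

/* PFN of the next free block on the list. */
static inline uint64_t buddy_next_pfn(uint64_t desc)
{
    return (desc >> NEXT_SHIFT) & PFN_MASK;
}
```

With a larger page size the same 29 bits would cover proportionally more memory, so the 2TB limit in this sketch is specific to the 4KB assumption.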

This descriptor could be reduced further by making it contain page-frame numbers relative to the base of the zone containing the pages. At that point, each memory zone could contain 2TB of memory; with enough zones, much larger total memory sizes could be handled. Wilcox thought that this solution might come at the cost of having to add more memory zones (generally seen as undesirable), but Vlastimil Babka pointed out that large-memory machines use a NUMA architecture, so the memory is already divided into multiple zones.
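The zone-relative idea can be sketched as a pair of conversion helpers. The structure below is a stand-in written for this example (its single field borrows the name of the kernel's zone_start_pfn), and the 29-bit offset width is an assumption carried over from a 4KB page size and a 2TB-per-zone limit.

```c
#include <assert.h>
#include <stdint.h>

#define ZONE_PFN_BITS 29    /* assumption: 2TB per zone with 4KB pages */

/* Stand-in for the kernel's struct zone; only the base PFN matters here. */
struct zone_sketch {
    uint64_t zone_start_pfn;
};

/* Convert an absolute PFN into an offset from the start of its zone. */
static inline uint32_t pfn_to_zone_offset(const struct zone_sketch *zone,
                                          uint64_t pfn)
{
    uint64_t offset = pfn - zone->zone_start_pfn;

    assert(pfn >= zone->zone_start_pfn);
    assert(offset < (UINT64_C(1) << ZONE_PFN_BITS));
    return (uint32_t)offset;
}

/* Recover the absolute PFN from a zone-relative offset. */
static inline uint64_t zone_offset_to_pfn(const struct zone_sketch *zone,
                                          uint32_t offset)
{
    return zone->zone_start_pfn + offset;
}
```

The stored offsets stay small no matter where the zone sits in the physical address space, which is what lets the total memory size grow with the number of zones.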

A 30-minute slot is clearly not enough to design the descriptor-based future, so it is not surprising that this discussion did not get much further. Wilcox brought it to a conclusion by saying that his goal for this year is to get rid of the mapping and index fields of struct page; that will require some work to fix the existing users in the kernel. Then the work of splitting the various users of page structures into specific descriptor types can proceed. Once approximately half of users have been converted, he will submit a patch to shrink the page structure; it "should all just work", he said. That will lead to the next important phase of this transition: seeing where the performance regressions are; he admitted that he does not know how that will work out.

(Wilcox has also put together a few wiki pages on the memory-descriptor design.)




Fleshing out memory descriptors

Posted May 27, 2024 14:05 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link] (6 responses)

struct buddy could be reduced to a single ulong by switching the linked lists for indices into radix trees.

The tricky part is that freeing a page then inserts onto a radix tree, which might allocate - recursion.

You can satisfy the allocation from the page you're freeing, but that might have to be split, so it gets tricky.

Fleshing out memory descriptors

Posted May 27, 2024 14:23 UTC (Mon) by willy (subscriber, #9762) [Link] (5 responses)

... and the page might be HIGHMEM.

I already got it down to a u64, not sure why you're trying to complicate it further?

Fleshing out memory descriptors

Posted May 27, 2024 15:07 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link] (4 responses)

to get rid of the separately allocated struct buddy

we talked about this before, I think the ultimate conclusion was that the linked lists in the buddy allocator really aren't that bad so probably not worth the hassle - just throwing it out there in case it gets revisited or someone ambitious comes along

Fleshing out memory descriptors

Posted May 27, 2024 16:07 UTC (Mon) by willy (subscriber, #9762) [Link] (3 responses)

There is no separately allocated struct buddy. Did you not read the article? Or the wiki pages that are linked?

Fleshing out memory descriptors

Posted May 27, 2024 16:09 UTC (Mon) by willy (subscriber, #9762) [Link] (2 responses)

Hm, now I re-read the article, it's a little confusing and I see how someone could come away with the impression that there is.

There is not. I do not propose ever allocating a struct buddy.

That's clear in the wiki pages, I hope. And if it isn't, then I'll remedy that.

Fleshing out memory descriptors

Posted May 27, 2024 18:29 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (1 responses)

It is clearer in the wiki, yes:

    struct page {
        union {
            unsigned long memdesc;
            struct buddy buddy;
        };
    };

Would it make sense to tackle the "private" field _last_ (not first) by sticking it in the second word of struct page, that is:

    struct page {
        unsigned long memdesc_or_next;
        unsigned long private_or_prev;
    };

?

Fleshing out memory descriptors

Posted May 27, 2024 18:56 UTC (Mon) by willy (subscriber, #9762) [Link]

Oh, interesting idea. That's a path to getting us to a 16 byte struct page sooner. I like it. There's various other work to be done to remove the use of page->flags (mostly things like looking up the zone/node/etc), but that needs to happen anyway.

Fleshing out memory descriptors

Posted May 27, 2024 17:30 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (1 responses)

Matthew, looking at https://kernelnewbies.org/MatthewWilcox/FolioAlloc what is going to be the policy for lockless or RCU-protected access to descriptors? I imagine that the descriptors are fine because no one should even try to read the descriptor for that pfn until after folio_alloc returns; but I think that there would be some kind of atomic_store_release or smp_wmb() at the end, and a matching rcu_dereference() or READ_ONCE() when reading the memdesc field.

Fleshing out memory descriptors

Posted May 27, 2024 17:57 UTC (Mon) by willy (subscriber, #9762) [Link]

Indeed, I did not mention it in the talk.

The mechanism is to mark the slab cache for allocating a memdesc as TYPESAFE_BY_RCU. Then something like GUP-fast can be implemented by rcu_read_lock(); READ_ONCE(page->memdesc); check the memdesc type; subtract off the type and cast the result to a folio; try_get() the refcount and then rcu_read_unlock(). Now you have a refcount so the folio can't be freed under you. You then need to check that the page table still refers to the same page and the page is part of the same memdesc. Now you're safe.

For the page cache, it's simpler because the page cache contains folio pointers, so it's the same check we do now.

I haven't spent much time thinking about the physical memory walkers (eg compaction and memory-failure). They seem like simpler cases than GUP-fast since there's no PTE to examine; we're going straight from a PFN to a memdesc. Although they're more complex because they have to handle memdescs which are not folios.

You're probably right that it needs an rcu_store or something to make sure the memdesc initialisation is ordered before the store to page->memdesc. I'll listen to the experts once we get to that point in the conversion!

Thanks for bringing it up.
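
The lookup sequence described in this comment, together with the ordered store raised in the question, can be modeled in userspace with C11 atomics. This is only a sketch under several assumptions: the RCU calls are empty stand-ins, the "page table entry" is reduced to a single atomic pointer, a release store and acquire load stand in for whatever kernel primitives are eventually chosen, and every name and type value is invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TYPE_MASK   UINT64_C(0xf)
#define TYPE_FOLIO  UINT64_C(1)         /* illustrative value */

struct folio_sketch { atomic_int refcount; };
struct page_sketch  { _Atomic uint64_t memdesc; };

/* Empty stand-ins; real code would use rcu_read_lock()/rcu_read_unlock(). */
static inline void rcu_read_lock_stub(void) {}
static inline void rcu_read_unlock_stub(void) {}

/* Writer: initialise the folio, then publish it with a release store. */
static void memdesc_publish_sketch(struct page_sketch *page,
                                   struct folio_sketch *folio)
{
    atomic_init(&folio->refcount, 1);
    atomic_store_explicit(&page->memdesc,
                          (uint64_t)(uintptr_t)folio | TYPE_FOLIO,
                          memory_order_release);
}

/* Take a reference unless the count has already dropped to zero. */
static bool folio_try_get_sketch(struct folio_sketch *folio)
{
    int old = atomic_load(&folio->refcount);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&folio->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Reader: returns a referenced folio, or NULL if the caller must fall back. */
static struct folio_sketch *gup_fast_sketch(_Atomic(struct page_sketch *) *pte)
{
    struct page_sketch *page;
    struct folio_sketch *folio = NULL;
    uint64_t md = 0;
    bool got = false;

    rcu_read_lock_stub();
    page = atomic_load(pte);
    if (page) {
        /* Acquire is the conservative C11 stand-in for READ_ONCE() here. */
        md = atomic_load_explicit(&page->memdesc, memory_order_acquire);
        if ((md & TYPE_MASK) == TYPE_FOLIO) {
            folio = (struct folio_sketch *)(uintptr_t)(md & ~TYPE_MASK);
            /* Safe only because the folio memory stays type-stable. */
            got = folio_try_get_sketch(folio);
        }
    }
    rcu_read_unlock_stub();
    if (!got)
        return NULL;

    /* Recheck that nothing changed while the reference was being taken. */
    if (atomic_load(pte) != page || atomic_load(&page->memdesc) != md) {
        atomic_fetch_sub(&folio->refcount, 1);
        return NULL;
    }
    return folio;
}
```

In this model the folio is never actually freed, so the stubs are harmless; in the scheme described above, it is the TYPESAFE_BY_RCU slab cache that keeps the refcount read valid even if the folio has been freed and reallocated in the meantime.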


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds