Willy's memory-management to-do list
He started the plenary session by noting that the "vm_fault_t controversy" turned out to be rather more involved than he had expected; he seemed to be referring to a disconnected series of patches (example) creating a new vm_fault_t typedef for page-fault handlers. He has been busy trying to run the resulting changes through the relevant maintainers, but it has been some work; he didn't realize, he said, that the filesystem developers would be so "belligerent" about wanting to see the full series — which doesn't exist yet. In any case, he said, this is a boring topic; the room seemed to agree, so he moved on.
He then put up an example of code performing a memory allocation, and pointed out that it contained several bugs, including a missing overflow check and a lack of type checking. Bugs like this are fairly common in the kernel. He proposes to handle that use case with a new helper called kvmalloc_struct(), and is looking for feedback. The room didn't seem to find this topic to be worth arguing about either; Ted Ts'o finally suggested that Wilcox should "paint it blue".
He then called for the addition of malloc() and free() to the kernel API. A call to malloc() would turn into a kvmalloc() call with the GFP_KERNEL flags. His purpose is to make it easier for new developers to write drivers by providing something that is more similar to the normal C API. There did not seem to be a lot of support for this idea from the group, though.
If an application uses mmap() to map the same page four billion times, the map count in the kernel will overflow, with all of the undesirable effects that come from a reference-count overflow. Getting to this point is not easy; one needs a machine with 30GB of RAM to be able to do it. He has posted a fix for the problem; it simply kills any process that has tried to map the same page more than 5,000 times. Andrew Morton suggested that the alternative is to just leak the page.
There are two ways to get huge pages in user space (hugetlbfs and transparent huge pages), and they use the page cache differently; Wilcox would like to unify them. Hugetlbfs indexes pages in multiples of 2MB, while transparent huge pages use a normal 4KB offset. He would like to make hugetlbfs less special by using 4KB offsets there too. The only problem is a big performance hit, since there are many more entries in the radix tree; that makes this approach unworkable. So a solution he intends to pursue instead is to change the transparent huge pages implementation to use the multi-size features of his XArray mechanism, making it more closely match hugetlbfs.
Then, he would like to enhance the page cache to allow the use of block sizes that are bigger than the system page size. He thinks it can be done without requiring higher-order allocations, which has been a sticking point in the past. In short, the memory-management subsystem would inform the filesystem when a page fault has occurred and ask the filesystem to take care of populating the page cache with the needed pages. The filesystem can do that with normal 4KB pages; better performance will be had if it attempts a larger allocation first.
Dave Chinner pointed out that there were working patches for larger block sizes in 2007; they used compound pages, and were not accepted due to the high-order allocation issues. We have been here before, he said, and know how it works. Have high-order allocations been fixed in the meantime? Wilcox answered that the difference this time around is the fallback path that is implemented within the filesystems. Chinner worried that this idea didn't sound reliable; in particular, there could be problems (as usual) with truncate(). Wilcox answered that much of the work could be done once in the virtual filesystem layer and, hopefully, made to work reliably.
He also briefly mentioned the idea of teaching the page cache about holes (ranges with no blocks allocated) in files. Currently those are represented by zero-filled pages in the cache if need be. Replacing those with "zero entries" could save a significant amount of memory; an actual page would only need to be allocated in the event of a write operation.
There was also a brief discussion of "PFN entries" in the page cache. Currently, page-cache entries include a pointer to the page structure representing the page in memory. That structure will point to a specific mapping (such as the file containing the page). If you want to share pages in memory that are shared on disk (in a couple reflinked files, for example), the same page will have different mappings depending on where it came from. In that case, putting that pointer in the cache is going to lead to trouble, so he proposes using the page-frame number instead. There would still be page structures backing the whole thing up, but there would be an extra level of indirection to access them.
Finally, he said that he would like to get rid of the GFP_NOFS allocation flag, which tells the system that it cannot call into filesystem code to free memory. Instead, "scoped allocation", which simply tracks when filesystem code is holding a lock (and thus cannot be called back into) should be used. XFS is the closest to having implemented scoped allocation, he said, but there are still places where GFP_NOFS is used. This work is not currently making good progress.
Ts'o said that some good documentation would help; he has been trying to push this work forward, but has run into some questions. Chinner warned that this work has to be done carefully; GFP_NOFS is often used to silence the lockdep checker rather than out of a real need to avoid filesystem calls. He suggested adding a GFP_NOLOCKDEP flag for that purpose. Meanwhile, these call sites are hard to identify, since they are almost never documented as such.
The plenary session came to an end at this point, but Wilcox had not yet run out of ideas to run by the development community.
Cleaning up struct page
The page structure is one of the most complicated in the kernel; the curious are encouraged to have a look at its definition (https://elixir.bootlin.com/linux/v4.15.18/source/include/...). Each page structure tracks one page of physical memory; it is used differently depending on how the page itself is used. As a result of varying needs and of the need to keep the structure small (current systems have millions of them), struct page has become a difficult-to-follow mixture of structures, unions, and #ifdefs. Few developers dare to try to make changes there.
Wilcox posted a set of diagrams showing how the various fields of struct page are used now. When a kernel subsystem allocates a page, he said, it also gets access to the page structure to keep track of it. But that access is not exclusive. The refcount field can go up and down at any time, even if the allocating subsystem thinks it has exclusive access to the page. If the page is mapped into user space, the mapping field will be used for reverse-mapping. Various flags have special meanings, and so on.
In the end, a lot of users simply don't bother trying to store information in struct page even though there is space available there; it simply looks too complicated, and it is not at all clear which fields are safe to modify. That is, he said, "a shame".
To make kernel development less shameful, he is proposing a reorganization of the page structure to make it more comprehensible. The fields that are safe to touch have been moved together, resulting in five contiguous words of available memory. The complex arrangement of structs and unions has been replaced with a single set of unions, each containing a set of structs or simple types.
There was some discussion about the details of specific fields, and it was established that drivers could safely use the mem_cgroup field. In general, though, everybody seemed to feel that the proposal was a major improvement that made struct page much easier to understand. Wilcox promised that a patch set making these changes would be forthcoming soon.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/struct page |
| Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
Posted May 1, 2018 0:03 UTC (Tue)
by willy (subscriber, #9762)
[Link] (6 responses)
I didn't know where I was going at the time; it wasn't until I cleaned up some bits that I could see how to clean up other bits.
Anyway, the patches to rearrange struct page into the later version are now posted: https://marc.info/?l=linux-mm&m=152511980529195&w=2
Posted May 1, 2018 7:58 UTC (Tue)
by Sesse (subscriber, #53779)
[Link] (5 responses)
Posted May 1, 2018 11:14 UTC (Tue)
by willy (subscriber, #9762)
[Link] (4 responses)
For the first three tabs, each row represents four bytes. The first row (or two rows for the 64-bit struct page) are 'flags', and each cell represents one bit with a two-letter abbreviation of the name of the flag.

Then there's a row of headers which enumerate some of the users of struct page (page cache, anon, slab, slub, page tables, etc). At this point the meaning of each cell changes; one cell represents one byte. The dark lines represent boundaries within the struct page, eg we currently have:

struct page {
	union {
		struct address_space *mapping;
		/* deferred_list.head */
	};
	union {
		unsigned long index;
		/* deferred_list.tail */
	};
	...
};

so there's a dark line between mapping and index to show that we can't just put a list_head into the union.

Fields which have the same meaning between all page types are shown without vertical lines (eg refcount) while fields which happen to be used for the same meaning between different page types are shown with vertical lines.

Hope that's helpful. Yes, it's a confusing diagram. I spent hours trying to come up with a good graphical representation of what's going on in struct page, and this was the best I could do.
Posted May 1, 2018 11:16 UTC (Tue)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Also, I wonder, why are people all “WE ARE OUT OF PAGE FLAGS” all the time, when there are clearly tons of them left for 64-bit and a bunch even for 32-bit?
Posted May 1, 2018 11:49 UTC (Tue)
by willy (subscriber, #9762)
[Link] (2 responses)
My slot wasn't focused on the flags, so I didn't fill in all the rest of the details; I just handwaved at it because everybody else in the room knew what other things were in those bits. I've filled in the fields that are commonly stored in those bits; the problem is that they're all dependent on various CONFIG options and I haven't taken the trouble to figure out likely values in order to size the fields to a realistic number of bits. If you can make sense of page-flags-layout.h and the rest of the generating machinery, please let me know. The first problem you'll face is that whoever did the ASCII art in those files was a fan of big-endian bit numbering (probably from IBM).
Posted May 2, 2018 16:07 UTC (Wed)
by Gchbg (guest, #91567)
[Link] (2 responses)
What does it mean to paint it blue?
Posted May 2, 2018 16:20 UTC (Wed)
by admalledd (subscriber, #95347)
[Link]
Normally, phrases that reference bikeshedding are used to point out that a discussion has gotten off track, and it is likely time to move on. Another common use is to point out that, even if a topic is important, it may not be the best use of the current time.
(Example here IMO, is that physical face-to-face meetings are few and far between for many of the kernel developers, and if anything can be done later on the mailing list, do so and use the remaining time for other things that are best done in person.)
Posted May 3, 2018 14:29 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]
Posted May 4, 2018 13:33 UTC (Fri)
by excors (subscriber, #95769)
[Link] (1 responses)
Does that potentially interact with KSM, or VMs that merge identical pages in other ways? It seems plausible for a process to have 5000 identical pages (e.g. 20MB of zeroes), and something might want to merge them into a single physical page to save RAM.
Posted May 6, 2018 12:54 UTC (Sun)
by willy (subscriber, #9762)
[Link]
What the patch does is keep a list of processes with more than 5000 mappings. That's no process on my laptop, but some workloads will have a process or two end up on this list. I understand ElectricFence creates a lot of mappings and so does UML.
Once any individual page goes over 2 billion mappings, we check the list. Anyone on the list with this page mapped more than 1000 times is deemed to be part of the attack and is killed.
https://lwn.net/Articles/748524/
Thanks for mentioning the KSM possibility. I hadn't thought of that. I'll take a look at the KSM code to see if it tries to avoid mapcount overflow. I think the zero page is treated specially, but it's not unreasonable to check for other special patterns.
