Pulling slabs out of struct page

By Jonathan Corbet
October 8, 2021

For the time being, the effort to add the folio concept to the memory-management subsystem appears to be stalled, but appearances can be deceiving. The numerous folio discussions have produced a number of points of consensus, though; one of those is that far too much of the kernel has to work with page structures to get its job done. As an example of how a subsystem might be weaned off of struct page usage, Matthew Wilcox has split out the slab allocators in a 62-part patch set. The result may be a foreshadowing of changes to come in the memory-management subsystem.

The kernel maintains one page structure for every physical page of memory that it manages. On a typical system with a 4KB page size, that means managing millions of those structures. A page structure tells the kernel about the state of the page it refers to: how it is being used, how many references to it exist, its position in various queues, and more. The required information varies depending on how any given page is being used at the moment; to accommodate this, struct page is a complicated mess of structures and unions. The current definition of struct page makes for good pre-Halloween reading, but those who truly want a good scare may want to see what it looked like before Wilcox cleaned things up for 4.18.
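
To put a rough number on "millions", here is a back-of-the-envelope sketch in C. The function names are made up for illustration, and the 64-byte structure size used in the note below is an assumption (a common figure for 64-bit configurations, not one stated in this article).

```c
#include <assert.h>
#include <stdint.h>

/* Number of page structures needed to describe ram_bytes of memory. */
static uint64_t nr_page_structs(uint64_t ram_bytes, uint64_t page_size)
{
    return ram_bytes / page_size;
}

/*
 * Bytes consumed by the memory map. struct_size is supplied by the
 * caller because the real size depends on the kernel configuration.
 */
static uint64_t memmap_bytes(uint64_t ram_bytes, uint64_t page_size,
                             uint64_t struct_size)
{
    return nr_page_structs(ram_bytes, page_size) * struct_size;
}
```

With 16GB of RAM and 4KB pages, nr_page_structs() yields 4,194,304 structures; at an assumed 64 bytes each, memmap_bytes() comes to 256MB.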

One of the users of struct page is the set of slab allocators, which obtain blocks of pages ("slabs") from the kernel and hand them out in smaller, fixed-size chunks. They are used heavily, and their performance will affect the performance of the system as a whole, so it is not surprising that they reach into the memory-management subsystem at the lowest levels. To support this usage, many of the fields in struct page are there specifically for the slab allocators. Just to complicate things, the kernel has three slab allocators: SLAB (the original allocator, often used by Android), SLUB (often used for desktop and data-center systems), and SLOB (a tiny allocator intended for embedded systems). Each has its own needs for fields in struct page.
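
The basic idea of carving a block into fixed-size chunks can be sketched in user space. This is a toy, not how SLAB, SLUB, or SLOB are implemented: every name here (toy_slab and friends) is hypothetical, malloc() stands in for the page allocator, and there is no locking, per-CPU caching, or multi-slab management.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLAB_BYTES 4096     /* assumption: one 4KB page per slab */

struct toy_slab {
    void *mem;          /* backing block for this slab */
    void *freelist;     /* first free object, or NULL when the slab is full */
    size_t obj_size;    /* size of each chunk after rounding */
    unsigned inuse;     /* objects currently handed out */
    unsigned objects;   /* total objects in the slab */
};

/* Carve one block into equal chunks and thread a free list through them. */
static int toy_slab_init(struct toy_slab *s, size_t obj_size)
{
    /* Each free chunk stores a "next" pointer, so round up to pointer size. */
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    obj_size = (obj_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    s->mem = malloc(TOY_SLAB_BYTES);
    if (!s->mem)
        return -1;
    s->obj_size = obj_size;
    s->objects = (unsigned)(TOY_SLAB_BYTES / obj_size);
    s->inuse = 0;
    s->freelist = NULL;

    /* Push chunks in reverse so the list starts at the lowest address. */
    for (unsigned i = s->objects; i-- > 0; ) {
        void *obj = (char *)s->mem + (size_t)i * obj_size;
        *(void **)obj = s->freelist;
        s->freelist = obj;
    }
    return 0;
}

/* Pop the first free chunk; returns NULL when the slab is exhausted. */
static void *toy_slab_alloc(struct toy_slab *s)
{
    void *obj = s->freelist;

    if (!obj)
        return NULL;
    s->freelist = *(void **)obj;
    s->inuse++;
    return obj;
}

/* Push a chunk back onto the front of the free list. */
static void toy_slab_free(struct toy_slab *s, void *obj)
{
    *(void **)obj = s->freelist;
    s->freelist = obj;
    s->inuse--;
}

static void toy_slab_destroy(struct toy_slab *s)
{
    free(s->mem);
    s->mem = NULL;
    s->freelist = NULL;
}
```

The freelist, inuse, and objects members mirror, loosely, the kind of per-slab bookkeeping that the real allocators keep in struct page.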

Wilcox's patch set creates a new struct slab by removing the relevant fields from struct page. The new structure is, in its entirety:

    struct slab {
        unsigned long flags;
        union {
            struct list_head slab_list;
            struct {    /* Partial pages */
                struct slab *next;
    #ifdef CONFIG_64BIT
                int slabs;      /* Nr of slabs left */
                int pobjects;   /* Approximate count */
    #else
                short int slabs;
                short int pobjects;
    #endif
            };
            struct rcu_head rcu_head;
        };
        struct kmem_cache *slab_cache; /* not slob */
        /* Double-word boundary */
        void *freelist;         /* first free object */
        union {
            void *s_mem;        /* slab: first object */
            unsigned long counters;     /* SLUB */
            struct {                    /* SLUB */
                unsigned inuse:16;
                unsigned objects:15;
                unsigned frozen:1;
            };
        };

        union {
            unsigned int active;        /* SLAB */
            int units;                  /* SLOB */
        };
        atomic_t _refcount;
    #ifdef CONFIG_MEMCG
        unsigned long memcg_data;
    #endif
    };

As can be seen, this structure still relies heavily on unions to overlay the information that each allocator needs to store with the page. Those could be eliminated by splitting the structure into three allocator-specific variants, but that would add complication to a patch set that is already large (and set to grow).

It is worth noting that struct slab is really struct page in disguise; instances of struct slab overlay the corresponding page structure in the kernel's memory map. It is, essentially, the kernel's view of struct page for pages that are owned by a slab allocator, extricated from its coexistence with all of the other views of that structure. This means that struct slab must be laid out with care; some fields (_refcount, for example) are shared with struct page, and the results of a disagreement over its location would be unfortunate. To ensure that no such calamity occurs, Wilcox has included a set of compile-time tests verifying the offsets of the shared fields.
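
A compile-time layout check of this kind might look roughly like the following. This is a sketch under stated assumptions, not the code from the patch set: mini_page, mini_slab, and MINI_MATCH are hypothetical names, and both structures are drastically simplified stand-ins for the real ones.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Hypothetical, simplified stand-in for struct page. */
struct mini_page {
    unsigned long flags;
    void *generic[5];       /* stands in for the usage-dependent union */
    int _mapcount;
    int _refcount;
};

/* Hypothetical slab view that must overlay struct mini_page exactly. */
struct mini_slab {
    unsigned long flags;
    void *slab_list[2];
    void *slab_cache;
    void *freelist;
    unsigned long counters;
    unsigned int active;    /* occupies the slot of _mapcount */
    int _refcount;
};

/* Fail the build if a shared field sits at different offsets in the two views. */
#define MINI_MATCH(pg, sl)                                          \
    static_assert(offsetof(struct mini_page, pg) ==                 \
                  offsetof(struct mini_slab, sl),                   \
                  "shared field offset mismatch: " #pg " vs " #sl)

MINI_MATCH(flags, flags);
MINI_MATCH(_refcount, _refcount);

/* The overlay must never extend past the structure it is laid over. */
static_assert(sizeof(struct mini_slab) <= sizeof(struct mini_page),
              "slab view is larger than the page structure");
```

If somebody later reordered the members of either structure, the build would stop with a message naming the offending field rather than producing a kernel that silently corrupts reference counts.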

After that, the remaining patches in the series convert various code in the slab allocators (and beyond) to use the new type. The SLUB conversion is done meticulously, in over 40 small steps. Wilcox described the conversion of the other allocators as "slapdash", done in a single patch each. Presumably a later version of the patch set will turn these proof-of-concept patches into a proper series of their own, but it's not entirely clear who will do that; Wilcox wrote in the cover letter:

I don't know the slab allocators terribly well, so I would be very grateful if one of the slab maintainers took over this effort. This is kind of a distraction from what I'm really trying to accomplish with folios, although it has found one minor bug.

As of this writing, no slab maintainers (who tend to be thin on the ground in the best of times) have responded to this request.

This might seem like a lot of work to put an old structure into a new form, but there are a number of reasons to want something like this. Just pulling the slab-specific fields out of struct page simplifies that structure significantly. Using a separate type makes it clear which variant of the page structure the code expects to deal with, and it adds a degree of type safety; it is no longer possible to accidentally access the wrong union fields.
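
The type-safety argument can be illustrated with a small sketch. All names here (demo_page, demo_slab, DEMO_PG_SLAB, and the two helpers) are hypothetical and exist only for this example; the kernel's actual conversion helpers and flag handling differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PG_SLAB 0x1UL  /* hypothetical "owned by a slab allocator" bit */

/* Generic view: "mapping" is meaningful only for some kinds of page. */
struct demo_page {
    unsigned long flags;
    void *mapping;
};

/* Slab view of the same memory: the second word is a free list instead. */
struct demo_slab {
    unsigned long flags;
    void *freelist;
};

/*
 * The one place that crosses between views. It refuses to hand out a
 * slab pointer for a page that is not marked as belonging to a slab.
 * (The cast relies on the two layouts matching; kernel code is built
 * with -fno-strict-aliasing, which this sketch does not depend on as
 * long as callers only use the pointer after a successful check.)
 */
static struct demo_slab *demo_page_to_slab(struct demo_page *page)
{
    if (!(page->flags & DEMO_PG_SLAB))
        return NULL;
    return (struct demo_slab *)page;
}

/*
 * Slab-only helper. Its signature documents which view it expects, and
 * passing a struct demo_page * draws a compiler diagnostic for the
 * incompatible pointer type.
 */
static int demo_slab_is_full(const struct demo_slab *slab)
{
    return slab->freelist == NULL;
}
```

Code holding only a struct demo_slab pointer has no way to name the mapping member at all, which is the point: the wrong union arm is simply not visible through the new type.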

But the biggest benefit comes simply from beginning to separate the slab allocators from struct page. Eventually it may become possible to disassociate struct slab from struct page entirely and allocate it dynamically. That would be one small step toward encapsulating struct page within the core memory-management code and hiding it from the rest of the kernel, a change that would ease the much-needed task of improving page-level memory management.

First, though, some variant of this work must make it into the mainline kernel. It should be an easier process than getting folios merged, but getting big changes into the memory-management code is never easy. The relative silence that has greeted this work so far might be a bit worrisome, especially since Wilcox has requested help, but it is still early days for this series. Regardless of how struct slab fares, though, it provides an indication of the direction that the memory-management developers are trying to go.

Index entries for this article
Kernel	Memory management/Memory descriptors

Pulling slabs out of struct page

Posted Oct 8, 2021 16:34 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link] (11 responses)

> instances of struct slab overlay the corresponding page structure in the kernel's memory map

In that case it is good that the kernel is compiled with -fno-strict-aliasing.

https://en.cppreference.com/w/c/language/object#Strict_al...

Pulling slabs out of struct page

Posted Oct 8, 2021 21:12 UTC (Fri) by willy (subscriber, #9762) [Link] (10 responses)

TBAA is unsuited to kernel development.

https://www.yodaiken.com/2021/10/06/plos-2021-paper-how-i...

Pulling slabs out of struct page

Posted Oct 10, 2021 7:03 UTC (Sun) by dvdeug (guest, #10998) [Link] (9 responses)

It's certainly _an_ article. Starting with lines like "The C programming language [33] is the first, and, so far,only widely successful programming language that provides operating system developers with a high-level language alternative to assembler (compare to [42])"? BLISS, PL/I and Algol 68 all were used as implementation languages for OSes released before Unix, and afterwards, any number of languages have been used, with "only widely successful" coming off as a weaselly way of saying "most widely successful".

It concludes with "A small performance improvement will generally not justify a decrease in code stability for operating systems...", but as I see it, that's not supported by reality; stuff like https://lwn.net/Articles/871726/ makes it clear Linux kernel programmers will write to conventions no C compiler ever has promised they won't break (multiprocessing largely postdates ISO C) for a speedup. My guess is that if kernel developers got a -kernel switch that gave them everything they claimed to want at the cost of an average of 15% loss in performance, they'd be up in arms.

In any case, all the article really says about "TBAA is unsuited to kernel development." is "it may be possible in ISO C to push all these different types into a union, but that would harm modularity,..." Given that C is about the only language in modern use not to have a module or package system (showing a lack of commitment to modularity) and we're talking about a direct hit to code stability, I'm not overwhelmed by that argument.

Pulling slabs out of struct page

Posted Oct 10, 2021 18:20 UTC (Sun) by Wol (subscriber, #4433) [Link] (6 responses)

> with "only widely successful" coming off as a weaselly way of saying "most widely successful".

How many successful operating systems have been written in C? Two? Unix of course, I don't know what language Windows was written in, but I suspect most of it is C nowadays if it wasn't to start with?

To my knowledge Multics predates C - Fortran? Pr1mos was a multics-derivative - that was Fortran, then they ADDED PL/? (is that PL/1, PL/P, SPL, ... a whole bundle of variants/dialects/whatever). Then they TRIED to port the whole shebang to C and the result was a disaster.

How many other OS ports to C have been a disaster? I strongly suspect C owes its success to Unix, NOT the other way round.

Cheers,
Wol

Pulling slabs out of struct page

Posted Oct 10, 2021 19:58 UTC (Sun) by mpr22 (subscriber, #60784) [Link]

C and Unix's success is mutualistic.

The 1973 rewrite of Unix from PDP-11 assembly to C would eventually allow it to proliferate off of DEC hardware (the first such port being to the Interdata 8/32 in 1978); Unix succeeding, in turn, allowed C to proliferate off of Unix.

Pulling slabs out of struct page

Posted Oct 11, 2021 6:47 UTC (Mon) by dvdeug (guest, #10998) [Link] (2 responses)

It depends on what we call successful and operating system, doesn't it. Windows and most forms of Un*x have kernels written primarily in C. (Windows 3.0 apparently rewrote a bunch of stuff from C into assembly.) MacOS X and Windows NT kernels have some C++, especially at the edges. It's hard to tell for many, but C and C++ seem to dominate, with other programming languages now for one-offs or research.

MacOS seems to have drifted through assembly, Pascal, and then C. Multics was PL/I. PRIMOS was originally Fortran IV, then a PL/1 dialect and Modula-2.

I tend to agree with mpr22 that Unix and C's success were mutual. If nothing else, the Lions book offered a good example of what could be done, and how to do it in C, and Ada and ALGOL-68 were complex, Modula-2 and LISP too academic and not specifically designed for it, PASCAL way too academic and not designed for it, except in a horde of dialects, and BLISS, JOVIAL and PL/S too proprietary and ill-documented. Unix could have won and eventually been rewritten in Modula-2 or something else, had C not been at least good enough.

Pulling slabs out of struct page

Posted Oct 11, 2021 12:36 UTC (Mon) by pizza (subscriber, #46) [Link]

>It depends on what we call successful and operating system, doesn't it.

Don't forget various RTOSes and other embedded stuff -- C (and to a lesser extent, C++) overwhelmingly dominate.

Rust has some promise to supplant things there, but the amount of unsafe boilerplate needed to drive a modern MCU is staggering. Some of that can be automated away but it results in a much steeper curve to being productive.

Pulling slabs out of struct page

Posted Oct 13, 2021 19:36 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> Windows and most forms of Un*x have kernels written primarily in C. (Windows 3.0 apparently rewrote a bunch of stuff from C into assembly.)

This is true, but Microsoft historically has not made a very strong effort to distinguish between "the kernel" and "the rest of Windows." To some extent, this is an arbitrary line-drawing exercise. For example, you could take the position that every process which runs as SYSTEM is the Windows equivalent of a kernel thread, and therefore large chunks of the Windows kernel are actually written in managed languages like C#, but I imagine some people would violently disagree with that characterization.

Pulling slabs out of struct page

Posted Jan 20, 2022 22:17 UTC (Thu) by yodaiken (guest, #156253) [Link]

Multics scheduler http://web.mit.edu/multics-history/source/ldd_listings/mc...

Pulling slabs out of struct page

Posted Jan 20, 2022 23:05 UTC (Thu) by anselm (subscriber, #2796) [Link]

> To my knowledge Multics predates C - Fortran?

According to multicians.org, PL/1, or specifically a subset thereof called EPL.

Pulling slabs out of struct page

Posted Jan 15, 2022 22:31 UTC (Sat) by yodaiken (guest, #156253) [Link] (1 responses)

"Starting with lines like "The C programming language [33] is the first, and, so far,only widely successful programming language that provides operating system developers with a high-level language alternative to assembler (compare to [42])"? BLISS, PL/I and Algol 68 all were used as implementation languages for OSes released before Unix, ..."

See that "(compare to [42])" ? That is a citation. If you look on the bottom of the article you can find

[42] William A Wulf. 1972. Systems for systems implementors: some experiences from Bliss. In Proceedings of the December 5-7, 1972, fall joint computer conference, part II. 943–948.

As for Pl/1 and Algol68, I'd love to see references to actual operating systems implemented in either. The Bliss article is interesting and perhaps you should read it.

Thanks for calling my paper "weaselly" though. Impressive criticism.

Pulling slabs out of struct page

Posted Jan 16, 2022 1:29 UTC (Sun) by mgb (guest, #3226) [Link]

I never used it and had no part in its implementation but I did stand next to the CAP computer once. It had an ALGOL 68C OS.

https://en.wikipedia.org/wiki/CAP_computer

Pulling slabs out of struct page

Posted Oct 8, 2021 19:45 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

What is the reason for placing _refcount at the end of the structure rather than at the beginning?

_refcount

Posted Oct 8, 2021 20:02 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

For struct slab, it has to be in the same place as with struct page. As for its location in struct page, that's probably an outcome of years of history. But is there some special reason why _refcount should be at the beginning?

_refcount

Posted Oct 8, 2021 23:53 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

I mean, typically you would want to have all the shared fields to come first in the structure. That's how inheritance is usually done in plain C.

Pulling slabs out of struct page

Posted Oct 8, 2021 21:10 UTC (Fri) by willy (subscriber, #9762) [Link] (2 responses)

Ooh, ooh, I know this one! Hope you like horror films ...

Slub needs to cmpxchg_double() freelist & counters. That means that freelist has to be dword aligned on both 32 and 64 bit. _refcount is word sized on 32-bit and half-word sized on 64-bit. For compactness, we pair it with _mapcount. Now the pair are either one or two words, depending on 32/64 bit. So _refcount has to be after freelist+counters in order for them to be dword aligned on both 32 and 64 bit. It's advantageous to make the main union as large as possible, so _refcount has to go towards the end of the struct.

There are other considerations in the layout of struct page, but these are the ones that pertain to the location of _refcount.

Pulling slabs out of struct page

Posted Oct 8, 2021 23:55 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Ah. I suspected as much. Thanks for the night-time alignment horror story!

Pulling slabs out of struct page

Posted Oct 10, 2021 0:54 UTC (Sun) by vivo (subscriber, #48315) [Link]

This one deserves a place in the weekly "Brief items" page.

Pulling slabs out of struct page

Posted Oct 8, 2021 21:36 UTC (Fri) by willy (subscriber, #9762) [Link]

> Those could be eliminated by splitting the structure into three allocator-specific variants, but that would add complication to a patch set that is already large (and set to grow).

I'm glad someone sees where I'm going! Yes, I did think about doing that, but decided not to do it as part of this patch. partial_pages will still need a union, but there will be far fewer unions once struct slab is defined based on CONFIG options.

Pulling slabs out of struct page

Posted Oct 8, 2021 21:55 UTC (Fri) by willy (subscriber, #9762) [Link] (11 responses)

I have some further thoughts indicating where I'm going at https://kernelnewbies.org/MemoryTypes

The dynamically allocated struct folio/slab/pgtable is where Kent Overstreet and Johannes Weiner want to go. It's more work, with a bigger payoff. We can collaborate on the steps along the way, since so much of the way is shared.

Pulling slabs out of struct page

Posted Oct 9, 2021 15:40 UTC (Sat) by luto (guest, #39314) [Link] (10 responses)

At the risk of asking a horrible question: do we really need the ability to start with a _page_ (PFN mapped to userspace, for example) and find type information?

I don’t think we really need this. We already support, in a very limited way, non-struct-page user mappings. For lightweight operations on user memory, we can use the uaccess functions, and they inherently lock correctly against unmapping. For heavyweight operations, we can look up the VMA. This leaves things that don’t want to pay the full price of finding a VMA. Whether those really exist isn’t quite clear to me on a conceptual level, but there is certainly a lot of code that calls get_user_pages [0] and expects the result to be live until release. (And some FS code may want to do useful IO.)

I wonder if performance could be acceptable if GUP walked the VMA tree to find a refcountable object. Some interesting locking would be needed to compete with get_user_pages_fast.

[0] This is all kinds of messy. KVM does unspeakable and blatantly incorrect things to host user memory. Even the normal pattern of GUPping a page interacts in unfortunate ways with COW.

Pulling slabs out of struct page

Posted Oct 9, 2021 15:46 UTC (Sat) by willy (subscriber, #9762) [Link] (7 responses)

> At the risk of asking a horrible question: do we really need the ability to start with a _page_ (PFN mapped to userspace, for example) and find type information?

Yes. Some of the places we need this:

- GUP gets back a page and then calls set_page_dirty(). That needs to figure out whether this is a file/anon/ksm/netpool/DEVICE/... page and call the filesystem if required.

- compaction walks the memmap and needs to figure out what this memory is and whether it can be relocated.

- memory failure gets a physical address and needs to understand how to handle it

There are more, but these should illustrate some of the problems we have to solve.

Pulling slabs out of struct page

Posted Oct 9, 2021 16:08 UTC (Sat) by luto (guest, #39314) [Link] (6 responses)

> - GUP gets back a page and then calls set_page_dirty(). That needs to figure out whether this is a file/anon/ksm/netpool/DEVICE/... page and call the filesystem if required.

Is this done directly in GUP? If so, surely it could work like the fault code and look up the VMA.

> - compaction walks the memmap and needs to figure out what this memory is and whether it can be relocated.

Hmm, this one is legit.

> - memory failure gets a physical address and needs to understand how to handle it

In my dream world, the low-level memory failure / machine check code gets a virtual address and can look up a VMA or vmap area. Making this work with kmap might be interesting.

> There are more, but these should illustrate some of the problems we have to solve.

I wonder if it's possible to reduce the dependency on struct page or equivalent to the point that everything works without it except for some nice-to-have features like compaction. (I'm not saying that the colossal amount of effort involved is worthwhile.)

Pulling slabs out of struct page

Posted Oct 9, 2021 16:59 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

I'm really just trying to avoid the bugs we have where people look at page->mapping and the compiler can't say "this is a tail page, that doesn't do what you think it does". Everybody keeps trying to get me to solve their problems as well.

Please, just let me solve a problem, not rewrite the entire kernel.

Pulling slabs out of struct page

Posted Oct 9, 2021 17:06 UTC (Sat) by luto (guest, #39314) [Link]

I don't want you to rewrite the whole kernel! I'm just contemplating how it _could_ be rewritten if someone were inclined to do so.

(Also, I do care about the KVM mess, and I don't think KVM could have dug itself into quite the hole it's in if there hadn't been a struct page to begin with for most user mappings, but fixing that needs a rewrite and a time machine.)

Pulling slabs out of struct page

Posted Oct 10, 2021 14:39 UTC (Sun) by willy (subscriber, #9762) [Link] (3 responses)

> In my dream world, the low-level memory failure / machine check code gets a virtual address and can look up a VMA or vmap area. Making this work with kmap might be interesting.

I don't think your dream world is possible. It's the same problem the page cache has with errors on writeback -- the producer might not be around any more. We might have unmapped the vmap/kmap; the user process that dirtied the cache line might have exited, or just been switched away from.

But more importantly, unless the cache is writethrough, the CPU no longer knows which virtual address(es) were used to dirty the cache line.

Pulling slabs out of struct page

Posted Oct 10, 2021 14:53 UTC (Sun) by luto (guest, #39314) [Link] (2 responses)

As I understand it, on Intel chips that support memory failure recovery, failed writes may not be notified at all. (I’ve at least been told this is true for the TDX style machine checks.)

And Linux’s entry code makes quite weak guarantees about recoverability of machine checks: we make a best (and pretty good) effort to recover from a fault in user code, and we try to recover from kernel code with exception table entries. If normal kernel code without an exception table entry hits a memory failure entry, forget about struct page: we may be 100% dead regardless because we have no idea how to resume execution.

If we hit a machine check with an exception handler, then we know the program counter, and we have a full register file. Figuring out the failed virtual address isn’t much of a problem even if the hardware doesn’t help.

Pulling slabs out of struct page

Posted Oct 10, 2021 14:57 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

Having the full register file doesn't matter if the store that dirtied the cache line was 10ms ago. I can't imagine how any CPU vendor would keep the register state around until the cache line moves from L3 to DRAM

Pulling slabs out of struct page

Posted Oct 10, 2021 15:32 UTC (Sun) by luto (guest, #39314) [Link]

You’re assuming that the CPU will notify the OS at all when a store from L3 to DRAM fails and that the OS actually needs to do anything about it. I don’t know all the nasty details, but it may be possible (and even mandatory?) to mark the memory bad when writeback fails and deliver a fault on a subsequent read.

Pulling slabs out of struct page

Posted Oct 9, 2021 15:55 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

Oh, since you mentioned unspeakable things, the graphics stack does horrendous hacks, so a VMA no longer tells you anything about the page you found. It might be anon, or it might be a page that belongs to a graphics device. And now they want to do that to file mappings too.

Pulling slabs out of struct page

Posted Oct 10, 2021 11:39 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

I won't deny that KVM does the unspeakable, but I think the idioms are more or less common to all users of MMU notifiers.

Pulling slabs out of struct page

Posted Oct 9, 2021 0:18 UTC (Sat) by jhoblitt (subscriber, #77733) [Link]

Do the various slab allocators see much, if any, usage? I vaguely recall that after the switch to slub as the default in the 2.6(?) era the feeling was that slab support would eventually be removed.