Pulling slabs out of struct page
The kernel maintains one page structure for every physical page of memory that it manages. On a typical system with a 4KB page size, that means managing millions of those structures. A page structure tells the kernel about the state of the page it refers to: how it is being used, how many references to it exist, its position in various queues, and more. The required information varies depending on how any given page is being used at the moment; to accommodate this, struct page is a complicated mess of structures and unions. The current definition of struct page makes for good pre-Halloween reading, but those who truly want a good scare may want to see what it looked like before Matthew Wilcox cleaned things up for 4.18.
One of the users of struct page is the set of slab allocators, which obtain blocks of pages ("slabs") from the kernel and hand them out in smaller, fixed-size chunks. They are used heavily, and their performance will affect the performance of the system as a whole, so it is not surprising that they reach into the memory-management subsystem at the lowest levels. To support this usage, many of the fields in struct page are there specifically for the slab allocators. Just to complicate things, the kernel has three slab allocators: SLAB (the original allocator, often used by Android), SLUB (often used for desktop and data-center systems), and SLOB (a tiny allocator intended for embedded systems). Each has its own needs for fields in struct page.
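To make the "smaller, fixed-size chunks" idea concrete, here is a minimal user-space sketch of a single slab that threads its free objects onto a freelist. Everything in it (toy_slab, toy_slab_init(), toy_slab_alloc(), toy_slab_free(), the 4KB slab size) is hypothetical and chosen for illustration; none of the three kernel allocators is implemented this simply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy, not kernel code: one "slab" is a single 4KB block. */
#define SLAB_BYTES 4096
#define SLAB_WORDS (SLAB_BYTES / sizeof(void *))

struct toy_slab {
    void *freelist;         /* first free object, or NULL when the slab is full */
    unsigned int inuse;     /* objects currently handed out */
    unsigned int objects;   /* total objects that fit in the slab */
    void *mem[SLAB_WORDS];  /* backing storage, kept word-aligned */
};

/* Carve the slab into obj_size-byte chunks and link them into the freelist. */
static void toy_slab_init(struct toy_slab *s, size_t obj_size)
{
    /* Each free chunk stores a pointer, so it must hold whole words. */
    assert(obj_size >= sizeof(void *) && obj_size % sizeof(void *) == 0);

    size_t words = obj_size / sizeof(void *);

    s->objects = (unsigned int)(SLAB_WORDS / words);
    s->inuse = 0;
    s->freelist = NULL;

    /* Push chunks in reverse so the first allocation gets the lowest address. */
    for (size_t i = s->objects; i-- > 0; ) {
        void **chunk = &s->mem[i * words];

        *chunk = s->freelist;   /* a free chunk points at the next free chunk */
        s->freelist = chunk;
    }
}

/* Pop one object off the freelist; returns NULL if none are left. */
static void *toy_slab_alloc(struct toy_slab *s)
{
    void **chunk = s->freelist;

    if (!chunk)
        return NULL;
    s->freelist = *chunk;
    s->inuse++;
    return chunk;
}

/* Return an object obtained from toy_slab_alloc() to the freelist. */
static void toy_slab_free(struct toy_slab *s, void *obj)
{
    *(void **)obj = s->freelist;
    s->freelist = obj;
    s->inuse--;
}
```

The freelist, inuse, and objects fields here deliberately echo names that appear in the real structure, but the bookkeeping the kernel does around them (per-CPU lists, partial slabs, RCU freeing) is left out entirely.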
Wilcox's patch set creates a new struct slab by removing the relevant fields from struct page. The new structure is, in its entirety:
    struct slab {
        unsigned long flags;
        union {
            struct list_head slab_list;
            struct {    /* Partial pages */
                struct slab *next;
    #ifdef CONFIG_64BIT
                int slabs;      /* Nr of slabs left */
                int pobjects;   /* Approximate count */
    #else
                short int slabs;
                short int pobjects;
    #endif
            };
            struct rcu_head rcu_head;
        };
        struct kmem_cache *slab_cache; /* not slob */
        /* Double-word boundary */
        void *freelist;     /* first free object */
        union {
            void *s_mem;    /* slab: first object */
            unsigned long counters;     /* SLUB */
            struct {            /* SLUB */
                unsigned inuse:16;
                unsigned objects:15;
                unsigned frozen:1;
            };
        };

        union {
            unsigned int active;    /* SLAB */
            int units;              /* SLOB */
        };
        atomic_t _refcount;
    #ifdef CONFIG_MEMCG
        unsigned long memcg_data;
    #endif
    };
As can be seen, this structure still relies heavily on unions to overlay the information that each allocator needs to store with the page. Those could be eliminated by splitting the structure into three allocator-specific variants, but that would add complication to a patch set that is already large (and set to grow).
It is worth noting that struct slab is really struct page in disguise; instances of struct slab overlay the corresponding page structure in the kernel's memory map. It is, essentially, the kernel's view of struct page for pages that are owned by a slab allocator, extricated from its coexistence with all of the other views of that structure. This means that struct slab must be laid out with care; some fields (_refcount, for example) are shared with struct page, and the results of a disagreement over their location would be unfortunate. To ensure that no such calamity occurs, Wilcox has included a set of compile-time tests verifying the offsets of the shared fields.
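The overlay and its compile-time checks can be sketched in plain C11 along the following lines. The structures mock_page and mock_slab, the MOCK_SLAB_MATCH() macro, and the mock_page_slab() helper are simplified, hypothetical stand-ins; the real structures have many more fields and the actual checks in the patch set may be written differently.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Simplified stand-in for struct page; the payload replaces its big union. */
struct mock_page {
    unsigned long flags;
    void *payload[4];
    int _mapcount;
    int _refcount;
};

/* Simplified stand-in for struct slab, meant to overlay struct mock_page. */
struct mock_slab {
    unsigned long flags;
    void *slab_list[2];
    void *slab_cache;
    void *freelist;
    int active;
    int _refcount;
};

/* Break the build if a shared field sits at different offsets in the two. */
#define MOCK_SLAB_MATCH(pg, sl)                                       \
    static_assert(offsetof(struct mock_page, pg) ==                   \
                  offsetof(struct mock_slab, sl),                     \
                  "mock_slab field " #sl " does not line up with mock_page")

MOCK_SLAB_MATCH(flags, flags);
MOCK_SLAB_MATCH(_refcount, _refcount);
static_assert(sizeof(struct mock_slab) <= sizeof(struct mock_page),
              "mock_slab must fit within mock_page");

/*
 * View a page known to belong to a slab allocator as a slab. Accessing the
 * result relies on the layout checks above and, as in the kernel, on the
 * compiler not applying strict-aliasing rules (-fno-strict-aliasing).
 */
static struct mock_slab *mock_page_slab(struct mock_page *page)
{
    return (struct mock_slab *)page;
}
```

If someone later moved _refcount in one structure but not the other, the build would fail at the static_assert rather than corrupting reference counts at run time.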
After that, the remaining patches in the series convert various code in the slab allocators (and beyond) to use the new type. The SLUB conversion is done meticulously, in over 40 small steps. Wilcox described the conversion of the other allocators as "slapdash", done in a single patch each. Presumably a later version of the patch set will turn these proof-of-concept patches into a proper series of their own, but it's not entirely clear who will do that; Wilcox wrote in the cover letter:
I don't know the slab allocators terribly well, so I would be very grateful if one of the slab maintainers took over this effort. This is kind of a distraction from what I'm really trying to accomplish with folios, although it has found one minor bug.
As of this writing, no slab maintainers (who tend to be thin on the ground in the best of times) have responded to this request.
This might seem like a lot of work to put an old structure into a new form, but there are a number of reasons to want something like this. Just pulling the slab-specific fields out of struct page simplifies that structure significantly. Using a separate type makes it clear which variant of the page structure the code expects to deal with, and it adds a degree of type safety; it is no longer possible to accidentally access the wrong union fields.
But the biggest benefit comes simply from beginning to separate the slab allocators from struct page. Eventually it may become possible to disassociate struct slab from struct page entirely and allocate it dynamically. That would be one small step toward encapsulating struct page within the core memory-management code and hiding it from the rest of the kernel, a change that would ease the much-needed task of improving page-level memory management.
First, though, some variant of this work must make it into the mainline kernel. It should be an easier process than getting folios merged, but getting big changes into the memory-management code is never easy. The relative silence that has greeted this work so far might be a bit worrisome, especially since Wilcox has requested help, but it is still early days for this series. Regardless of how struct slab fares, though, it provides an indication of the direction that the memory-management developers are trying to go.
Posted Oct 8, 2021 16:34 UTC (Fri)
by abatters (✭ supporter ✭, #6932)
[Link] (11 responses)
In that case it is good that the kernel is compiled with -fno-strict-aliasing.
https://en.cppreference.com/w/c/language/object#Strict_al...
Posted Oct 8, 2021 21:12 UTC (Fri)
by willy (subscriber, #9762)
[Link] (10 responses)
https://www.yodaiken.com/2021/10/06/plos-2021-paper-how-i...
Posted Oct 10, 2021 7:03 UTC (Sun)
by dvdeug (guest, #10998)
[Link] (9 responses)
It concludes with "A small performance improvement will generally not justify a decrease in code stability for operating systems...", but as I see it, that's not supported by reality; stuff like https://lwn.net/Articles/871726/ makes it clear Linux kernel programmers will write to conventions no C compiler ever has promised they won't break (multiprocessing largely postdates ISO C) for a speedup. My guess is that if kernel developers got a -kernel switch that gave them everything they claimed to want at the cost of an average of 15% loss in performance, they'd be up in arms.
In any case, all the article really says about "TBAA is unsuited to kernel development." is "it may be possible in ISO C to push all these different types into a union, but that would harm modularity,..." Given that C is about the only language in modern use not to have a module or package system (showing a lack of commitment to modularity) and we're talking about a direct hit to code stability, I'm not overwhelmed by that argument.
Posted Oct 10, 2021 18:20 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (6 responses)
How many successful operating systems have been written in C? Two? Unix of course, I don't know what language Windows was written in, but I suspect most of it is C nowadays if it wasn't to start with?
To my knowledge Multics predates C - Fortran? Pr1mos was a multics-derivative - that was Fortran, then they ADDED PL/? (is that PL/1, PL/P, SPL, ... a whole bundle of variants/dialects/whatever). Then they TRIED to port the whole shebang to C and the result was a disaster.
How many other OS ports to C have been a disaster? I strongly suspect C owes its success to Unix, NOT the other way round.
Cheers,
Wol
Posted Oct 10, 2021 19:58 UTC (Sun)
by mpr22 (subscriber, #60784)
[Link]
The 1973 rewrite of Unix from PDP-11 assembly to C would eventually allow it to proliferate off of DEC hardware (the first such port being to the Interdata 8/32 in 1978); Unix succeeding, in turn, allowed C to proliferate off of Unix.
Posted Oct 11, 2021 6:47 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (2 responses)
MacOS seems to have drifted through assembly, Pascal, and then C. Multics was PL/I. PRIMOS was originally Fortran IV, then a PL/1 dialect and Modula-2.
I tend to agree with mpr22 that Unix and C's success were mutual. If nothing else, the Lions book offered a good example of what could be done, and how to do it in C, and Ada and ALGOL-68 were complex, Modula-2 and LISP too academic and not specifically designed for it, PASCAL way too academic and not designed for it, except in a horde of dialects, and BLISS, JOVIAL and PL/S too proprietary and ill-documented. Unix could have won and eventually been rewritten in Modula-2 or something else, had C not been at least good enough.
Posted Oct 11, 2021 12:36 UTC (Mon)
by pizza (subscriber, #46)
[Link]
Don't forget various RTOSes and other embedded stuff -- C (and to a lesser extent, C++) overwhelmingly dominate.
Rust has some promise to supplant things there, but the amount of unsafe boilerplate needed to drive a modern MCU is staggering. Some of that can be automated away but it results in a much steeper curve to being productive.
Posted Oct 13, 2021 19:36 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
This is true, but Microsoft historically has not made a very strong effort to distinguish between "the kernel" and "the rest of Windows." To some extent, this is an arbitrary line-drawing exercise. For example, you could take the position that every process which runs as SYSTEM is the Windows equivalent of a kernel thread, and therefore large chunks of the Windows kernel are actually written in managed languages like C#, but I imagine some people would violently disagree with that characterization.
Posted Jan 20, 2022 22:17 UTC (Thu)
by yodaiken (guest, #156253)
[Link]
Posted Jan 20, 2022 23:05 UTC (Thu)
by anselm (subscriber, #2796)
[Link]
> To my knowledge Multics predates C - Fortran?
According to multicians.org, PL/1, or specifically a subset thereof called EPL.
Posted Jan 15, 2022 22:31 UTC (Sat)
by yodaiken (guest, #156253)
[Link] (1 responses)
"Starting with lines like "The C programming language [33] is the first, and, so far, only widely successful programming language that provides operating system developers with a high-level language alternative to assembler (compare to [42])"? BLISS, PL/I and Algol 68 all were used as implementation languages for OSes released before Unix, ..."
See that "(compare to [42])" ? That is a citation. If you look on the bottom of the article you can find
[42] William A Wulf. 1972. Systems for systems implementors: some experiences from Bliss. In Proceedings of the December 5-7, 1972, fall joint computer conference, part II. 943–948.
As for Pl/1 and Algol68, I'd love to see references to actual operating systems implemented in either. The Bliss article is interesting and perhaps you should read it.
Thanks for calling my paper "weaselly" though. Impressive criticism.
Posted Jan 16, 2022 1:29 UTC (Sun)
by mgb (guest, #3226)
[Link]
Posted Oct 8, 2021 19:45 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Posted Oct 8, 2021 20:02 UTC (Fri)
by corbet (editor, #1)
[Link] (1 responses)
Posted Oct 8, 2021 23:53 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Oct 8, 2021 21:10 UTC (Fri)
by willy (subscriber, #9762)
[Link] (2 responses)
Slub needs to cmpxchg_double() freelist & counters. That means that freelist has to be dword aligned on both 32 and 64 bit. _refcount is word sized on 32-bit and half-word sized on 64-bit. For compactness, we pair it with _mapcount. Now the pair are either one or two words, depending on 32/64 bit. So _refcount has to be after freelist+counters in order for them to be dword aligned on both 32 and 64 bit. It's advantageous to make the main union as large as possible, so _refcount has to go towards the end of the struct.
There are other considerations in the layout of struct page, but these are the ones that pertain to the location of _refcount.
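A minimal sketch of the alignment constraint described above, using a hypothetical stand-in structure (layout_demo) rather than the real struct page: the freelist/counters pair must start on a double-word boundary so that a double-word compare-and-exchange can cover both, and _refcount comes after the pair. The sketch assumes that unsigned long is pointer-sized, as it is on Linux, and that the structure itself starts on a double-word boundary.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uintptr_t */

/* A "double word" is two pointer-sized words: 8 bytes on 32-bit, 16 on 64-bit. */
#define DWORD_BYTES (2 * sizeof(void *))

/* Hypothetical stand-in; field names echo the article, layout is simplified. */
struct layout_demo {
    unsigned long flags;        /* word 0 */
    void *list[2];              /* words 1-2: the main union would live here */
    void *slab_cache;           /* word 3 */
    /* Double-word boundary */
    void *freelist;             /* word 4: first half of the pair */
    unsigned long counters;     /* word 5: second half of the pair */
    int _mapcount;              /* paired with _refcount for compactness */
    int _refcount;
};

/* The pair must begin on a double-word boundary and be contiguous. */
static_assert(offsetof(struct layout_demo, freelist) % DWORD_BYTES == 0,
              "freelist must start on a double-word boundary");
static_assert(offsetof(struct layout_demo, counters) ==
              offsetof(struct layout_demo, freelist) + sizeof(void *),
              "counters must immediately follow freelist");
static_assert(offsetof(struct layout_demo, _refcount) >
              offsetof(struct layout_demo, counters),
              "_refcount must come after the freelist/counters pair");

/* Run-time check that a given instance's pair is really double-word aligned. */
static int pair_is_dword_aligned(const struct layout_demo *d)
{
    return ((uintptr_t)&d->freelist % DWORD_BYTES) == 0;
}
```

Placing an odd number of words before freelist would make the first assertion fail on both 32-bit and 64-bit builds, which is the kind of mistake the layout rules are meant to prevent.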
Posted Oct 8, 2021 23:55 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Oct 10, 2021 0:54 UTC (Sun)
by vivo (subscriber, #48315)
[Link]
Posted Oct 8, 2021 21:36 UTC (Fri)
by willy (subscriber, #9762)
[Link]
I'm glad someone sees where I'm going! Yes, I did think about doing that, but decided not to do it as part of this patch. partial_pages will still need a union, but there will be far fewer unions once struct slab is defined based on CONFIG options.
Posted Oct 8, 2021 21:55 UTC (Fri)
by willy (subscriber, #9762)
[Link] (11 responses)
The dynamically allocated struct folio/slab/pgtable is where Kent Overstreet and Johannes Weiner want to go. It's more work, with a bigger payoff. We can collaborate on the steps along the way, since so much of the way is shared.
Posted Oct 9, 2021 15:40 UTC (Sat)
by luto (guest, #39314)
[Link] (10 responses)
I don’t think we really need this. We already support, in a very limited way, non-struct-page user mappings. For lightweight operations on user memory, we can use the uaccess functions, and they inherently lock correctly against unmapping. For heavyweight operations, we can look up the VMA. This leaves things that don’t want to pay the full price of finding a VMA. Whether those really exist isn’t quite clear to me on a conceptual level, but there is certainly a lot of code that calls get_user_pages [0] and expects the result to be live until release. (And some FS code may want to do useful IO.)
I wonder if performance could be acceptable if GUP walked the VMA tree to find a refcountable object. Some interesting locking would be needed to compete with get_user_pages_fast.
[0] This is all kinds of messy. KVM does unspeakable and blatantly incorrect things to host user memory. Even the normal pattern of GUPping a page interacts in unfortunate ways with COW.
Posted Oct 9, 2021 15:46 UTC (Sat)
by willy (subscriber, #9762)
[Link] (7 responses)
Yes. Some of the places we need this:
- GUP gets back a page and then calls set_page_dirty(). That needs to figure out whether this is a file/anon/ksm/netpool/DEVICE/... page and call the filesystem if required.
- compaction walks the memmap and needs to figure out what this memory is and whether it can be relocated.
- memory failure gets a physical address and needs to understand how to handle it
There are more, but these should illustrate some of the problems we have to solve.
Posted Oct 9, 2021 16:08 UTC (Sat)
by luto (guest, #39314)
[Link] (6 responses)
Is this done directly in GUP? If so, surely it could work like the fault code and look up the VMA.
> - compaction walks the memmap and needs to figure out what this memory is and whether it can be relocated.
Hmm, this one is legit.
> - memory failure gets a physical address and needs to understand how to handle it
In my dream world, the low-level memory failure / machine check code gets a virtual address and can look up a VMA or vmap area. Making this work with kmap might be interesting.
> There are more, but these should illustrate some of the problems we have to solve.
I wonder if it's possible to reduce the dependency on struct page or equivalent to the point that everything works without it except for some nice-to-have features like compaction. (I'm not saying that the colossal amount of effort involved is worthwhile.)
Posted Oct 9, 2021 16:59 UTC (Sat)
by willy (subscriber, #9762)
[Link] (1 responses)
Please, just let me solve a problem, not rewrite the entire kernel.
Posted Oct 9, 2021 17:06 UTC (Sat)
by luto (guest, #39314)
[Link]
(Also, I do care about the KVM mess, and I don't think KVM could have dug itself into quite the hole it's in if there hadn't been a struct page to begin with for most user mappings, but fixing that needs a rewrite and a time machine.)
Posted Oct 10, 2021 14:39 UTC (Sun)
by willy (subscriber, #9762)
[Link] (3 responses)
I don't think your dream world is possible. It's the same problem the page cache has with errors on writeback -- the producer might not be around any more. We might have unmapped the vmap/kmap; the user process that dirtied the cache line might have exited, or just been switched away from.
But more importantly, unless the cache is writethrough, the CPU no longer knows which virtual address(es) were used to dirty the cache line.
Posted Oct 10, 2021 14:53 UTC (Sun)
by luto (guest, #39314)
[Link] (2 responses)
And Linux’s entry code makes quite weak guarantees about recoverability of machine checks: we make a best (and pretty good) effort to recover from a fault in user code, and we try to recover from kernel code with exception table entries. If normal kernel code without an exception table entry hits a memory failure entry, forget about struct page: we may be 100% dead regardless because we have no idea how to resume execution.
If we hit a machine check with an exception handler, then we know the program counter, and we have a full register file. Figuring out the failed virtual address isn’t much of a problem even if the hardware doesn’t help.
Posted Oct 10, 2021 14:57 UTC (Sun)
by willy (subscriber, #9762)
[Link] (1 responses)
Posted Oct 10, 2021 15:32 UTC (Sun)
by luto (guest, #39314)
[Link]
Posted Oct 9, 2021 15:55 UTC (Sat)
by willy (subscriber, #9762)
[Link] (1 responses)
Posted Oct 10, 2021 11:39 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link]
Posted Oct 9, 2021 0:18 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link]
For struct slab, it has to be in the same place as in struct page. As for why it is located where it is in struct page, that's probably the outcome of years of history. But is there some special reason why _refcount should be at the beginning?