Slabs, sheaves, and barns
Many kernel developers interact with the slab allocator using functions like kmalloc(), which can allocate objects of any (reasonable) size. There is a lower level to the slab allocator, though, that deals with fixed-size objects; it is used heavily by subsystems that frequently allocate and free objects of the same size. The curious can see all of the special-purpose slabs in their system by looking at /proc/slabinfo. There are many core-kernel operations that involve allocating objects from these slabs and returning them, so the slab allocator has gained a number of features, including NUMA awareness and bulk operations, to accelerate allocation and freeing.
But, naturally, it's never fast enough.
One of the keys to improved performance on today's multiprocessor systems is to avoid interaction between the CPUs whenever possible. Interactions lead to contention, locking delays, and cache-line bouncing, all of which hurt performance, but a CPU that can act like the rest of the system isn't there can go forward at full speed. This understanding has driven the adoption of per-CPU data structures across the kernel. The slab allocator already makes use of per-CPU data, but it still has enough cross-CPU interactions to slow things down.
Sheaves are a concept introduced by Vlastimil Babka's patch set; in essence, they are a per-CPU cache of objects that can be handed out in response to allocation requests without the need to interact with any other CPUs in the system. By default, sheaves are disabled, but they can be enabled for a specific slab by setting a non-zero value in the new field sheaf_capacity in the kmem_cache_args structure passed to kmem_cache_create(). The value is the number of objects that should be cached in a single sheaf; the patch adding sheaf usage to the maple-tree data structure sets it to 32.
When sheaves are enabled, the allocator will maintain a sheaf with the given number of objects for each CPU. An allocation request will be satisfied from this sheaf whenever possible, and freed objects will be placed back into the sheaf if there is room. That turns allocation and free operations into purely local assignments that can be executed quickly; no locking (or even atomic operations) required. There is a second (backup) sheaf maintained for each CPU as well; when the main sheaf is found to be empty, an object will be allocated from the backup sheaf instead. If the main sheaf is full when an object is freed, that object will be placed into the backup sheaf if possible.
When both sheaves are full, there will no longer be a place to stash a freed object with a simple assignment; that is where the "barn" comes in. The barn is simply a place to keep sheaves that are not currently being used for caching by any CPU; there is one barn for each NUMA node in the system. Once a CPU has filled its sheaves, it will try to place one in the barn; if a CPU's sheaves are empty, it will try to obtain a full one from the barn. In either case, this operation is slower, since locking is required to safely access the shared barn, but it is still faster than going all the way into the slab allocator.
The barn holds two sets of sheaves — one for full sheaves, and one for empty sheaves. If a CPU is freeing a lot of objects, it can place its full sheaves into the barn and obtain empty ones to replace them. There is a limit to the number of sheaves the barn can hold; it is wired to ten each for full and empty sheaves in the current patch set. If a CPU tries to place a sheaf into a full barn, that sheaf will be freed, along with any objects it contains, back into the slabs they came from.
As described so far, this series has the potential to greatly accelerate memory-allocation operations for workloads that allocate and free a lot of slab objects. But there are a couple of other features that are added later in the series to make sheaves more useful.
One of those is an enhancement to kfree_rcu(), which will delay the freeing of an object until after a read-copy-update (RCU) grace period has passed, ensuring that the object no longer has any active references. A third per-CPU sheaf is maintained to hold objects freed with kfree_rcu(); once the sheaf fills, it is passed to the RCU subsystem for the grace-period wait. Once that has happened, an attempt will be made to put the sheaf back into the barn.
The other addition is preallocation. There are many places in the kernel where memory must be allocated without delay, and certainly without blocking the allocating thread. There are also numerous code paths that cannot deal with an allocation failure in any reasonable way. In most of these cases, there is an opportunity to preallocate the needed memory before going into the more constrained code. The kernel has long maintained subsystems like mempools to meet this need.
But, if the kernel can go into critical code in the knowledge that it has a per-CPU sheaf full of objects available to it, a lot of these problems (and the need for mempools) go away. To that end, the series provides a set of functions for working with special-purpose sheaves. A call to kmem_cache_prefill_sheaf() will return a sheaf containing at least the requested number of objects, grabbing it out of the barn if possible. Then, kmem_cache_alloc_from_sheaf() can be used to allocate objects from the sheaf, with guaranteed success for at least the requested number of objects. Other functions can be used to return a sheaf to the barn or to place more objects into a sheaf. These special-purpose sheaves act similarly to mempools, but they are intended to be short-lived, unlike mempools, which usually exist for the life of the system.
This series appears to be relatively uncontroversial, though perhaps developers are reserving their comments for the upcoming Linux Storage, Filesystem, Memory-Management, and BPF Summit, to be held in late March. Given the appetite for faster memory allocation and freeing, sheaves and barns seem likely to be added to the mix at some point.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Slab allocators |
Comments

What exactly is an object?
Posted Feb 24, 2025 17:06 UTC (Mon) by dmv (subscriber, #168800)

Posted Feb 24, 2025 17:10 UTC (Mon) by corbet (editor, #1)

An "object" is just a piece of memory, yes. No methods or other class-like stuff; this is C we're talking about :) Sorry if that wasn't clear.

Posted Feb 24, 2025 22:18 UTC (Mon) by tialaramex (subscriber, #21167)

What's weird is that C++ and Java for some reason give methods to some things but not others. There's no particular reason 'x'.is_lowercase() shouldn't work, which is why it does in Rust; there's no real "extra" work to deliver this compared with my_goose.honk(). In both cases, the compiler just has to figure out what type the object is, find the function to call (a "method") based on that type, and then fill in that object as a parameter to the function. In 1972, in a few hundred kilobytes of RAM, I can well believe that was an unaffordable luxury for Dennis, so its absence from C is not a great surprise; but why implement this and then switch it off for the built-in types in a newer language? Search me.

While writing this comment and looking at the Rust docs for Kmalloc, I found a doc bug. For a conventional Rust crate, I'm pretty clear about how I can report or fix this, but it's not as obvious to me for the kernel code; please tell me I don't need to brave the LKML, Jonathan?

The text "if `old_layout` is zero-sized `p` does not need to be a pointer returned by this [`Allocator`]." is unnecessary for free(), as we do not provide an old_layout parameter. I can't tell (but hopefully the authors know) whether it is appropriate to simply remove this text, or whether, in addition, some other text should be provided about zero-size things being freed, or not freed as the case may be.

Posted Feb 25, 2025 10:11 UTC (Tue) by Sesse (subscriber, #53779)

Of course, adding _data members_ to another class from the outside would be nearly impossible in C++'s compilation model (which it, of course, inherited from C).

Posted Feb 25, 2025 20:09 UTC (Tue) by tialaramex (subscriber, #21167)

C++ ADL, which maybe you were also gesturing at, is just a mess. Your whatever function is indeed treated by C++ as if it's "part of Foo's interface", but only when it was defined in the same namespace, so that ADL will find whatever when it is looking up Foo. This causes weird, hard-to-understand issues in real-world C++ software, because what foo(bar) means will depend on the context in which you wrote it in non-obvious ways; as a result, "turning off ADL" via arcane tricks is often needed, and there have been attempts to reform ADL or to add explicit "no ADL please" features to the language.

I was serious about that doc bug; who do I raise this with? I'd rather raise it as a bug than send a patch, because I'm certain those answering the bug ticket will know more than I do about kmalloc.

LSF/MM/BPF
Posted Feb 24, 2025 17:23 UTC (Mon) by vbabka (subscriber, #91706)

Thanks for the write-up :)

mixed metaphor
Posted Feb 25, 2025 0:41 UTC (Tue) by robert.cohen@anu.edu.au (subscriber, #6281)

Sheaves can refer either to collections of pages in bookbinding or to collections of wheat stalks. However, as far as I know, the term "barn" isn't used in bookbinding; in bookbinding, a collection of sheaves would be a book. If that term is considered too confusing, I guess you could keep your sheaves on a shelf and still be in metaphor. Obviously, other bookish terms like stack or library would be a bad idea :-)

Récoltes et semailles
Posted Feb 25, 2025 4:18 UTC (Tue) by tux3 (subscriber, #101245)

We might be running out of viable book-related words, but other fields show you can do fine even with much more tortured metaphors than we have. So one might start to wonder whether the problem measures up to the... reams of discussions about it :)

Sheaves
Posted Feb 25, 2025 16:12 UTC (Tue) by paulj (subscriber, #341)

The allocator is available in userspace in libumem (which should port relatively easily to other Unixes, and indeed some have put such ports on GitHub).

Not much like mempools
Posted Feb 26, 2025 0:43 UTC (Wed) by neilbrown (subscriber, #359)

I guess it depends on what you mean by "similarly", but I think the above is misleading.

A vital property of mempools is that, if memory cannot be allocated (because it is all in use), then the mempool allocation will wait for something to be returned to the mempool; and it can be certain that something *will* be returned to the mempool in a reasonable time.

If a mempool provides a "struct bio" that can be used to write a page to a device, then you can be sure that waiting on the mempool will only wait until a currently outstanding write to the device completes; then that bio will become available.

So a mempool is not at all about being able to allocate memory without blocking. It is precisely about blocking at most a predictable amount of time for memory to become available.

In contrast, sheaves seem to be about maximising the chance of allocating without blocking. Preallocating before taking a spinlock, while pinned to a given CPU, so that an allocation under the spinlock must succeed, is great; but there is no guarantee that the preallocation will succeed, so code would need a fallback or an indefinite wait. Mempools provide a definite wait.

Sheaves improve throughput and simplify code; mempools avoid deadlocks.
