
A slab allocator (removal) update

By Jonathan Corbet
May 22, 2023

LSFMM+BPF
The kernel developers try hard to avoid duplicating functionality in the kernel, which is enough of a challenge to maintain as it is. So it has often seemed out of character for the kernel to support three different slab allocators (called SLAB, SLOB, and SLUB), all of which handle the management of small memory allocations in similar ways. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, slab maintainer Vlastimil Babka updated the group on progress toward the goal of reducing the number of slab allocators in the kernel and gave an overview of what to expect in that area.

Babka started by saying that his original proposal for the session mentioned the SLOB allocator in the title. This allocator, which was optimized for memory-limited systems, has been on the chopping block for a while now. That removal, he announced to applause, happened during the 6.4 merge window. There is a set of configuration options that can be selected to make the SLUB allocator more suitable for small-memory systems. It is now possible to call kfree() on all slab-allocated objects — something that SLOB never supported.
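The change is easiest to see in a short kernel-code sketch (illustrative only; the structure and cache names here are hypothetical, and this fragment is not a standalone program). Previously, an object obtained from a kmem_cache had to be returned with kmem_cache_free(); kfree() now works for cache-allocated objects as well:

```
#include <linux/slab.h>

/* Hypothetical object type and cache, for illustration only. */
struct example { int data; };
static struct kmem_cache *example_cache;

static void example_use(void)
{
	struct example *obj;

	obj = kmem_cache_alloc(example_cache, GFP_KERNEL);
	if (!obj)
		return;
	/* ... use obj ... */

	/*
	 * kfree() is now valid for any slab-allocated object,
	 * including those from kmem_cache_alloc(); under SLOB,
	 * kmem_cache_free(example_cache, obj) was required here.
	 */
	kfree(obj);
}
```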

The next step, he said, might be to remove SLAB. That would solve one of his biggest problems: he never figured out how to pronounce SLAB and SLUB so that others could hear the difference. SLAB contains 4,000 lines of code, he said, not all of which is regularly or well tested. He has found parts of the SLAB allocator that have been broken for years. Keeping SLAB around means maintaining a common-code layer used also by SLUB, which complicates maintenance. It also requires reimplementing features; both allocators have implementations of memory control groups, for example, while realtime preemption is only supported by SLUB.

This is not the first time somebody has suggested removing SLAB; he found at least three other times that the idea has come up. Each time the idea is raised, somebody complains about performance regressions when using SLUB. He wanted to know if the same objections would be raised this time.

David Rientjes is one of the developers who has objected to the removal of SLAB in the past. Speaking from a Google perspective, he said that things have come a long way since then. Per-cpu partial slabs help a lot. He has been looking at the benchmark results, and has concluded that, at this point, they can go either way depending on the workload. He did complain that SLUB can have a higher memory overhead; partial slabs make it better, and further progress can be made on that front. At this point, he concluded, he would not object to removing SLAB.

Michal Hocko said that SUSE has been using SLUB for some time; it works better in some cases, and worse in others. The biggest reason to make the change, he said, was that SLUB makes debugging problems easier; he suggested just removing SLAB and fixing any remaining problems afterward. Matthew Wilcox said that, in the past, SLUB performed worse with certain database benchmarks, but that problem has since gone away.

Another attendee asked about SLUB's extra memory overhead: is it something structural, or something that can be chipped away at? Babka answered that he was surprised to hear objections about memory overhead. SLUB, it seems, uses about 30% more memory than SLAB to keep track of the memory it manages; he asked whether that translated to a significant amount of memory when viewed as an absolute number.

Much of SLUB's additional overhead, he said, could be seen as a structural problem; SLUB gets its performance by using a lot of per-CPU caches. When Christoph Lameter introduced SLUB in 2007, one of his justifications for the addition of another allocator was that SLAB used too much memory for caches. But, Babka said, things have shifted over time. Addressing this memory use would require coming up with another way to get similar performance.

Pasha Tatashin asked whether per-CPU caching still makes sense in systems with hundreds of cores. Babka answered that some per-CPU caching is needed for scalability, but that there might be ways to make it more effective.

Concerns about memory usage notwithstanding, the conclusion from the session was that nobody objects to the removal of the SLAB allocator at this point; Babka plans to post a proposal to the mailing lists and see what kind of reaction it gets. Anybody who objects, he said, should be prepared to show a use case or benchmark that regresses with SLUB so that any remaining problems can be addressed. But this removal should not be held back for the sake of a microbenchmark; if there are concrete problems, the community can discuss how to fix them.

Once that task is complete, he said, it's time to think about what is next. API improvements will become easier once there is only one allocator to change. One idea he had was opt-in, per-CPU caching of arrays of objects which, he said, could improve performance while simultaneously reducing overhead. The ability to allocate in non-maskable interrupt (NMI) context using a per-CPU cache was another idea; there would still be no guarantees that an allocation would succeed, though. That would allow the removal of a BPF-specific allocator.

Perhaps, he said, the allocator could offer guaranteed allocations with some sort of upper bound, much like mempools do now. That could be useful for tasks like the allocation of maple-tree nodes. More generally, he concluded, he would like to find ways to end the reinvention of memory-management functionality outside of the memory-management layer. There are a lot of things being done now that would be better handled in the core memory-management code.
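The mempool API already provides this kind of bounded guarantee today; a sketch of the existing interface that a slab-level equivalent might subsume (the cache name and sizes here are hypothetical, and this is not a standalone program):

```
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical node cache and pool; names are illustrative only. */
static struct kmem_cache *node_cache;
static mempool_t *node_pool;

static int node_pool_init(void)
{
	node_cache = kmem_cache_create("node", 256, 0, 0, NULL);
	if (!node_cache)
		return -ENOMEM;

	/*
	 * Pre-allocate 16 objects; mempool_alloc() from this pool can
	 * then be guaranteed up to that bound even under memory
	 * pressure.
	 */
	node_pool = mempool_create_slab_pool(16, node_cache);
	if (!node_pool)
		return -ENOMEM;
	return 0;
}
```

Babka's suggestion, as described, would move this sort of upper-bounded guarantee into the allocator itself rather than layering it on top.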

Wilcox had one problem he would like to see a solution for that he called "dcache poisoning". On a system with a lot of memory and little memory pressure, the directory entry (dentry) cache can grow without bound. This can be an especially big problem with workloads creating a lot of negative dentries. The kernel will only run the shrinker when there is memory pressure; by the time that happens, cleaning out the dentry cache can take a long time. Andrew Morton described this as a "dentry cache policy decision", but Babka said that the allocator might be a useful part of a solution to this problem.

Babka closed the session by thanking the attendees and asking them to wish him luck as he proceeds with the SLAB removal.

Index entries for this article
Kernel: Memory management/Slab allocators
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



A slab allocator (removal) update

Posted May 22, 2023 18:08 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link] (4 responses)

What exactly is the dcache problem?

As long as the running time of the dcache shrinker is linear in the amount of memory being freed, I don't see how runtime would strictly be a problem - if you're allocating a lot of memory, you're going to have to pay to reclaim a lot of memory; this is typical steady-state behavior.

Or is this more of a fragmentation problem, since with a large dcache the shrinker isn't likely to be reclaiming objects from the same slabs? That's a trickier problem...

A slab allocator (removal) update

Posted May 22, 2023 18:47 UTC (Mon) by Paf (subscriber, #91811) [Link]

Well, one reason might be that the shrinker can often be run synchronously as part of a memory allocation rather than separately through a thread. Though in that case I believe they just request what’s needed, so I don’t know why it would take a long time to get it just because the cache is large.

A slab allocator (removal) update

Posted May 23, 2023 0:46 UTC (Tue) by brenns10 (subscriber, #112114) [Link] (2 responses)

One issue with dentries is that even if the shrinker runtime is linear, the coefficient can still be pretty large. Any dentry that is in-use (typically open files as well as any dentry with a child, and several other cases) can't be freed, so you need to find a page full of unreferenced dentries. But the shrinker is (a very fuzzy approximation of) LRU, so it runs dentry by dentry until by chance, it has freed enough dentries to free a page. Since dentries are 192 bytes, it takes a while to clear out 21 dentries from the same page by chance.

Thankfully, in the "silly" examples of bad workloads, all 21 dentries were created by the same application looking up non-existent files, and they all got added to the LRU at the same time. But if the workload looks at all more complicated with mixed-lifetime objects, it can get ugly.

That said, it may not be that different from other caches, I haven't looked at a lot of others.

A slab allocator (removal) update

Posted May 23, 2023 5:34 UTC (Tue) by mokki (subscriber, #33200) [Link] (1 responses)

I've maintained CI machines that have 50% of 384 GiB memory in negative dentries.
Often, when the kernel tried to free some more memory, the result was soft lockups of over 1 minute in dmesg.
Manually flushing the dentry cache regularly helped. This was ~6 years ago and hopefully there are now better limits on the maximum number of negative dentries.

A slab allocator (removal) update

Posted May 23, 2023 18:03 UTC (Tue) by WolfWings (subscriber, #56790) [Link]

Still no maximum limits for negative dentries at all in any LTS kernels I'm aware of, and they get intermixed entirely with 'positive' dentries.

There's been numerous attempts to improve the situation over the years, LWN articles about it even, but it never makes progress since it's a relatively niche situation without enough visibility to most kernel devs to maintain momentum to improve the situation. I'm not even sure there's any mechanism to stop the dentry cache purge when memory-pressure triggers it... it may literally scan ALL of cache and purge what it can instead of freeing say a few MB and stopping, I haven't checked the code in depth but that's the general behavior I've seen in the past.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds