A slab allocator (removal) update

Posted May 22, 2023 18:08 UTC (Mon) by koverstreet (✭ supporter ✭, #4296)
Parent article: A slab allocator (removal) update

What exactly is the dcache problem?

As long as the running time of the dcache shrinker is linear in the amount of memory being freed, I don't see how runtime would strictly be a problem. If you're allocating a lot of memory, you're going to have to pay to reclaim a lot of memory; this is typical steady-state behavior.

Or is this more of a fragmentation problem, since with a large dcache the shrinker isn't likely to be reclaiming objects from the same slabs? That's a trickier problem...

A slab allocator (removal) update

Posted May 22, 2023 18:47 UTC (Mon) by Paf (subscriber, #91811)

Well, one reason might be that the shrinker can often be run synchronously as part of a memory allocation rather than separately through a thread. Though in that case I believe they just request what’s needed, so I don’t know why it would take a long time to get it just because the cache is large.

A slab allocator (removal) update

Posted May 23, 2023 0:46 UTC (Tue) by brenns10 (subscriber, #112114)

One issue with dentries is that even if the shrinker runtime is linear, the coefficient can still be pretty large. Any dentry that is in use (typically open files as well as any dentry with a child, and several other cases) can't be freed, so you need to find a page full of unreferenced dentries. But the shrinker is (a very fuzzy approximation of) LRU, so it runs dentry by dentry until, by chance, it has freed enough dentries to free a page. Since dentries are 192 bytes, a 4 KiB page holds 21 of them, and it takes a while to clear out all 21 dentries from the same page by chance.

Thankfully, in the "silly" examples of bad workloads, all 21 dentries were created by the same application looking up non-existent files, and they all got added to the LRU at the same time. But if the workload looks at all more complicated with mixed-lifetime objects, it can get ugly.
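To make the "by chance" part concrete, here is a toy simulation, not a model of the real shrinker. It assumes 4 KiB pages, the 192-byte dentry size quoted above, that every dentry is freeable (nothing is pinned), and that eviction order is either plain allocation order or a uniformly random shuffle. The function name `evictions_until_page_free` and its parameters are made up for illustration.

```python
import random

PAGE_SIZE = 4096                      # bytes; assumes 4 KiB pages
DENTRY_SIZE = 192                     # bytes; figure quoted in the comment above
PER_PAGE = PAGE_SIZE // DENTRY_SIZE   # 21 objects fit in one page


def evictions_until_page_free(num_pages, shuffled, seed=0):
    """Count objects evicted, in order, before any one page is fully empty.

    Toy assumptions: object i lives on page i // PER_PAGE, every object
    is freeable, and the eviction order is either allocation order
    (shuffled=False) or a uniformly random permutation (shuffled=True).
    """
    order = list(range(num_pages * PER_PAGE))
    if shuffled:
        random.Random(seed).shuffle(order)

    remaining = [PER_PAGE] * num_pages  # live objects left on each page
    for count, obj in enumerate(order, start=1):
        page = obj // PER_PAGE
        remaining[page] -= 1
        if remaining[page] == 0:
            return count
    return len(order)


if __name__ == "__main__":
    print("same-lifetime order:", evictions_until_page_free(1000, shuffled=False))
    print("mixed (random) order:", evictions_until_page_free(1000, shuffled=True))
```

With `shuffled=False` the first page empties after exactly 21 evictions, which corresponds to the "silly" workload where neighbours on a page sit next to each other in the LRU. With `shuffled=True` it generally takes far more evictions before any single page is completely empty, and pinned dentries (ignored here) would only make that worse.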

That said, it may not be that different from other caches; I haven't looked at a lot of others.

A slab allocator (removal) update

Posted May 23, 2023 5:34 UTC (Tue) by mokki (subscriber, #33200)

I've maintained CI machines that had 50% of their 384 GiB of memory in negative dentries.
Often, when the kernel tried to free some more memory, the result was soft lockups of over 1 minute in dmesg.
Manually flushing the dentry cache at regular intervals helped. This was ~6 years ago and hopefully there are now better limits on the maximum number of negative dentries.
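For anyone wanting to watch for this, a minimal sketch of reading the dentry counters and doing the manual flush follows. It assumes the six-field layout of `/proc/sys/fs/dentry-state` described in the kernel's sysctl documentation, where the fifth field is `nr_negative` on newer kernels (on older ones it is, as far as I know, an unused zero). It also assumes that writing `2` to `/proc/sys/vm/drop_caches` is the flush meant here. The helper names are hypothetical, and the flush needs root.

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    # Field names follow the kernel's sysctl documentation for dentry-state.
    nr_dentry: int     # total allocated dentries
    nr_unused: int     # dentries not actively in use (reclaimable candidates)
    age_limit: int
    want_pages: int
    nr_negative: int   # unused negative dentries (newer kernels only)
    dummy: int


def parse_dentry_state(text):
    """Parse the whitespace-separated contents of dentry-state."""
    fields = [int(x) for x in text.split()]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return DentryState(*fields)


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())


def drop_dentries_and_inodes(path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to free reclaimable slab objects (dentries, inodes).

    Requires root. Only clean, unused objects are dropped.
    """
    with open(path, "w") as f:
        f.write("2\n")


if __name__ == "__main__":
    state = read_dentry_state()
    print(f"{state.nr_negative} negative of {state.nr_dentry} dentries")
```

Running something like `drop_dentries_and_inodes()` from a periodic job is a blunt workaround, not a fix: it throws away useful positive dentries along with the negative ones.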

A slab allocator (removal) update

Posted May 23, 2023 18:03 UTC (Tue) by WolfWings (subscriber, #56790)

Still no maximum limits for negative dentries at all in any LTS kernels I'm aware of, and they get intermixed entirely with 'positive' dentries.

There have been numerous attempts to improve the situation over the years, and even LWN articles about it, but it never makes progress: it's a relatively niche situation without enough visibility among kernel devs to keep up the momentum. I'm not even sure there's any mechanism to stop the dentry cache purge once memory pressure triggers it... it may literally scan ALL of the cache and purge what it can instead of freeing, say, a few MB and stopping. I haven't checked the code in depth, but that's the general behavior I've seen in the past.