LWN: Comments on "SLQB - and then there were four"
https://lwn.net/Articles/311502/
This is a special feed containing comments posted to the individual LWN article titled "SLQB - and then there were four".

SLQB - and then there were four
https://lwn.net/Articles/316551/ (tej.parkash)
Can someone explain to me how the "remote_free_check" flag addresses the scalability issue?
Sun, 25 Jan 2009 12:25:04 +0000

Who merges?
https://lwn.net/Articles/312561/ (bgamsa)
The design also has some resemblance to work I did at the University of Toronto for NUMA multiprocessors, which in turn was also based on Paul McKenney's work, although enough time has passed that I no longer remember all of the details (and I'm not surprised at the resemblance, since the driving factors haven't really changed).
Mon, 22 Dec 2008 20:23:32 +0000

condone?
https://lwn.net/Articles/312331/ (jengelh)
Indeed so.
Sat, 20 Dec 2008 12:10:02 +0000

SLQB - and then there were four
https://lwn.net/Articles/312330/ (jengelh)
This is a written forum, not a spoken one.
Sat, 20 Dec 2008 12:09:12 +0000

condone?
https://lwn.net/Articles/312324/ (BradReed)
Sounds more like you condemn the use, rather than condone it.
Sat, 20 Dec 2008 11:40:59 +0000

Free your inner dyslexic:
https://lwn.net/Articles/312150/ (ncm)
OK, "Quibbles" wins by acclamation.
Fri, 19 Dec 2008 03:47:45 +0000

Who merges?
https://lwn.net/Articles/312102/ (ncm)
This is the danger of armchair coding; you're probably right.

An optimized refcount scheme may take only a few bits per object, for most objects, so I was thinking one cache line might hold refcounts for a hundred objects. Also, a dirty cache line is way more expensive than a clean one (because it must be written back, and isn't a candidate for replacement until that's done), so I meant to concentrate the refcount churn and segregate it from the (commonly) mostly-unchanging objects. This helps most where you have some semblance of locality. Unfortunately, things like inode caches don't.

As always, there's no substitute for measuring, but you have to build it first.
Thu, 18 Dec 2008 21:37:35 +0000

Free your inner dyslexic:
https://lwn.net/Articles/312072/ (smitty_one_each)
"Quibbles"
Thu, 18 Dec 2008 17:49:37 +0000

Who merges?
https://lwn.net/Articles/311956/ (Nick)
I don't know how common that really is. For things allocated with significant frequency, I think they should be fairly cache-hot at free time (because if they are allocated frequently, they shouldn't have long lifespans).

The exception is caches, which are reclaimed when memory gets low or a watermark is reached (e.g. the inode cache, dentry cache, etc.). However, with these things, they still need to be found in order to be reclaimed, usually via an LRU list, so the object gets hot when it's found and taken off the list.

OK, you could move the refcount and the LRU list into another area... but the other problem with that is that with cache objects you expect to have multiple lookups over the lifetime of the object. And if your lookups have to increment a disjoint refcount object, then you increase your cacheline footprint per lookup by effectively an entire line per object. So you trade slower lookups for faster reclaim, which could easily be a bad tradeoff if the cache is effective (which the dcache is, 99% of the time).

Did you have any particular situations in mind?
Thu, 18 Dec 2008 08:41:54 +0000

Who merges?
https://lwn.net/Articles/311919/ (ncm)
Maybe it's a mistake to store the refcounts with the objects. Very often an object is allocated and initialized, and then never modified again, except for the refcount, and perhaps not even looked at again until freed when the owning process dies. If the allocator provided centralized storage for refcounts, they could be yanked out of the objects, reducing cache churn. Centralizing refcounting could have other benefits, such as allowing optimizations for small counts, and for small fluctuations in counts, to be abstracted.

Integrating refcounting would change the SLAB interface, but generic refcounting services could be added to all the allocators, providing another way for them to differentiate themselves.
Thu, 18 Dec 2008 01:00:46 +0000
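To make ncm's side-table idea concrete, here is a minimal standalone C sketch. Everything in it is an assumption made for illustration -- the per-slab array, the one-byte counts, and all of the names are invented, not anything from the kernel or the SLQB patch:

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_OBJECTS 128	/* objects per slab (assumed) */

/*
 * Dense side-table of counts: 64 one-byte counts share a cache line,
 * so refcount churn dirties a couple of lines per slab instead of one
 * line per object. A count that outgrew a byte would spill to a
 * slower wide-count table (ncm's "optimizations for small counts").
 */
struct slab_refs {
	_Atomic uint8_t ref[SLAB_OBJECTS];
};

/* Locate an object's count by address arithmetic alone. */
static size_t obj_index(void *slab_base, void *obj, size_t obj_size)
{
	return (size_t)((uintptr_t)obj - (uintptr_t)slab_base) / obj_size;
}

static void ref_get(struct slab_refs *r, size_t i)
{
	atomic_fetch_add_explicit(&r->ref[i], 1, memory_order_relaxed);
}

/* Returns nonzero when the final reference is dropped. */
static int ref_put(struct slab_refs *r, size_t i)
{
	return atomic_fetch_sub_explicit(&r->ref[i], 1,
					 memory_order_acq_rel) == 1;
}

Nick's objection above applies directly to this sketch: each lookup's ref_get() now touches a cache line disjoint from the object itself, which is exactly the extra per-lookup footprint he describes.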
SLQB - and then there were four
https://lwn.net/Articles/311774/ (Los__D)
Since this is an English-speaking forum, English approximations are fine.
Wed, 17 Dec 2008 14:01:25 +0000

SLQB - and then there were four
https://lwn.net/Articles/311771/ (jengelh)
That reminds me of the “Being Bilingual” sketch (http://www.youtube.com/watch?v=9IzDbNFDdP4&fmt=18). And as I see it, sdalley cheated on all four /^th/ words. :-)
Wed, 17 Dec 2008 13:05:20 +0000

SLQB - and then there were four
https://lwn.net/Articles/311769/ (jengelh)
Simple as that: [ɛs][ɛl][kjuː][biː] or, localized for Germany, [ɛs][ɛl][kuː][beː].

BTW, I condone the use of English approximations such as “bee” for, eh well, English sounds. It does not help non-English speakers at all and only makes for loads of confusion (yes, I am referring to that sibling post above this one). Because that's [beː] for some (most?) people outside the realm of the English language. Please, just ♯ use IPA.
Wed, 17 Dec 2008 12:58:00 +0000

SLQB - and then there were four
https://lwn.net/Articles/311761/ (sdalley)
> Just some thoughts I had reading this text, nothing really thoroughly thought through though.

.. and it's even harder to plough through those details with a tough cough and hiccough ...
Wed, 17 Dec 2008 10:59:14 +0000

SLQB - and then there were four
https://lwn.net/Articles/311758/ (Nick)
See my comment above -- effectively we do allocate from the rlist if the object is from the same node. Actually, what really happens is that objects from the same node but a different CPU are freed straight onto our freelist rather than our rlist -- they only get sent to the rlist when our freelist is trimmed. So it's exactly as you suggest.

The issue of cleaning up the rlist is interesting. There are so many ways this can be done, and it is about the most difficult part of a slab allocator... No, any CPU can be cleaning its rlist at any time, and yes, they might all point to a single remote CPU. That's quite unlikely, and the critical section is very short, so hopefully it won't be a problem. But I don't claim to know what the best way to do it is.

Very large numbers of CPUs I am definitely interested in... so I'm hoping to be as good as or better than the other allocators here, but we'll see.
Wed, 17 Dec 2008 10:11:12 +0000
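A hedged sketch of the free path Nick describes may help here. virt_to_page(), page_to_nid(), and numa_node_id() are real kernel helpers; the structure, the list code, the owner_cpu() helper, and the batch watermark are all invented for the example and do not come from the SLQB patch:

#include <linux/mm.h>		/* virt_to_page(), page_to_nid() */
#include <linux/spinlock.h>
#include <linux/topology.h>	/* numa_node_id() */

#define RLIST_BATCH 64		/* assumed flush watermark */

/* Invented, simplified per-CPU lists -- not SLQB's actual layout. */
struct cpu_lists {
	void *freelist;		/* lock-free; only this CPU touches it */
	void *rlist;		/* our frees that belong to other nodes */
	int rlist_len;
	spinlock_t remote_lock;	/* guards the two fields below */
	void *remote_free;	/* objects other CPUs hand back to us */
	int remote_free_check;	/* cheap "something is pending" flag */
};

/* As Nick notes below, the link lives in the freed object itself. */
static void push(void **list, void *obj)
{
	*(void **)obj = *list;
	*list = obj;
}

static void flush_rlist(struct cpu_lists *c)
{
	while (c->rlist) {
		void *obj = c->rlist;
		struct cpu_lists *owner = owner_cpu(obj); /* hypothetical:
							   * found via the
							   * object's page */
		c->rlist = *(void **)obj;
		spin_lock(&owner->remote_lock);
		push(&owner->remote_free, obj);
		owner->remote_free_check = 1;
		spin_unlock(&owner->remote_lock);
	}
	c->rlist_len = 0;
}

static void sketch_free(struct cpu_lists *c, void *obj)
{
	if (page_to_nid(virt_to_page(obj)) == numa_node_id()) {
		/* Same node, whichever CPU allocated it: stay local. */
		push(&c->freelist, obj);
	} else {
		/* Remote node: batch the frees, to amortize the only
		 * lock on this path. */
		push(&c->rlist, obj);
		if (++c->rlist_len >= RLIST_BATCH)
			flush_rlist(c);
	}
}

This is also, as the article describes it, what the remote_free_check flag asked about at the top of this feed is for: setting it under the lock lets the owning CPU test a single word to see whether its remote_free list needs claiming, instead of taking the lock on every fast-path operation.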
Who merges?
https://lwn.net/Articles/311757/ (Nick)
I haven't talked to Paul. I didn't know that... but I should, now you mention it!

Actually, within the same node, CPUs can allocate objects they've freed which have come from another CPU (on that node). When freeing an object to our freelist, the test that is performed is whether the object's memory is on the same node as the one this CPU is on. This is similar to what SLAB does, and indeed is good for cache (and object-management overhead).

Except in very special cases of slabs (RCU-ish ones), the linked list is part of the object itself (when an object is freed, by definition nobody else is using the memory). _Often_ in the kernel, I'd say that when an object is freed it has probably been touched recently by the freeing CPU. It would be an unusual situation, e.g., to have a refcount to the object residing somewhere other than the object itself.

And yes, we have the struct page structure for every page in the system, which can be found with calculations on the object address. And yes, all the slab allocators use this struct page to manage their slabs of objects :)

I agree with your last paragraph too...
Wed, 17 Dec 2008 09:49:15 +0000

SLQB - and then there were four
https://lwn.net/Articles/311756/ (Nick)
The SLAB allocator effectively has similar tunables and watermarks (number of objects in an array-cache, number of objects in l3 lists, alien caches, etc.), and it performs very well in very many real-world situations.

SLQB uses lists rather than arrays, so it is also more flexible to tune at runtime than SLAB. But in practice it is very difficult to adjust the sizing of these kinds of things at runtime. I've taken the dumb, easy (SLAB) way out of it and periodically trim down some of the lists. That can cut back memory usage for unused slabs, and they can grow back if needed (and trimming is relatively infrequent, so it doesn't harm performance).

Thanks for the article, BTW. I feel like I understand the thing better myself now too :)
Wed, 17 Dec 2008 09:39:32 +0000
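The periodic trim Nick mentions might look roughly like the following, continuing the invented cpu_lists sketch above (now with a freelist_len count). The give_back_to_page() helper and the watermark are assumptions; the real patch's bookkeeping differs:

/* Illustrative only; runs from an infrequent per-CPU timer, so the
 * fast paths never pay for it. */
static void sketch_trim(struct cpu_lists *c, int hiwater)
{
	void *batch = NULL;

	/* Claim what other CPUs have handed back since the last pass. */
	if (c->remote_free_check) {
		spin_lock(&c->remote_lock);
		batch = c->remote_free;
		c->remote_free = NULL;
		c->remote_free_check = 0;
		spin_unlock(&c->remote_lock);
	}
	while (batch) {
		void *obj = batch;

		batch = *(void **)obj;
		push(&c->freelist, obj);
		c->freelist_len++;
	}

	/* Shed everything beyond the watermark; the list grows back on
	 * demand, so idle caches shrink at no cost to busy ones. */
	while (c->freelist_len > hiwater) {
		void *obj = c->freelist;

		c->freelist = *(void **)obj;
		c->freelist_len--;
		give_back_to_page(obj);	/* hypothetical: return the object
					 * to its slab page, and a fully
					 * free page to the page allocator */
	}
}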
SLQB - and then there were four
https://lwn.net/Articles/311752/ (iq-0)
I think it's a pretty neat/clean design, though it probably has more overhead than SLUB on systems with a very large number of CPUs. But it's been a long time since I read about SLUB's internals, so I can't be really sure.

The big question is: why not allocate from the 'rlist' when you're done with your own freelist? We don't have to update the metadata, so it's a very cheap operation.

Perhaps it would even be better for items put on the 'rlist' to also be put on the 'freelist', so we simply allocate the least recently used item first (probably cache-hot). Sure, the remote item might be bounced back to the other CPU, but clearly the code using it doesn't seem to mind which CPU last used it, and with the LRU logic it's just as likely still in our CPU cache. And if it did bounce back in the meanwhile, it means we are probably dealing with a slab cache that isn't used that heavily (or its usage should be optimized, and this would be true for all slab allocators).

And without having looked at the code, I'd assume that only one CPU at a time is cleaning up its 'rlist' and that usage of the 'remote_free' lock is only a "best effort" locking scheme (try_lock()-ish), because maintenance should really be a background task with minimal overhead. Though this might have problems I overlook (or it may already be done that way).

Just some thoughts I had reading this text, nothing really thoroughly thought through though.
Wed, 17 Dec 2008 09:09:42 +0000

Who merges?
https://lwn.net/Articles/311687/ (ncm)
I am happy to see SLQB emerge. I hope Nick is in close contact with Paul McKenney on this subject; Paul wrote the allocator for Sequent's Dynix, and has had decades to think about his design choices.

I wonder, though, about the choice to hand all objects back to the CPU that originally allocated them. In particular, suppose we have a large collection of objects freed by CPU A long after they have passed out of B's cache -- e.g., allocated on behalf of a process that has migrated to A. It might be objectively better for CPU B to work on their metadata while it remains in B's cache. Maybe it's better to re-map a fully freed page to an address range that B manages; if we know that CPU A won't be touching that page anyway, it may be unmapped from A's page map at leisure.

I wonder, too, about all these lists. Are the list links adjacent to the actual storage of the objects? Often when an object is freed it has long since passed out of the cache, and touching a list link would bring it back in. A way to avoid touching the actual object is for the metadata to be obtainable by address arithmetic, e.g. masking off the low bits to obtain the address of a page header. Some extra structure helps -- a segment of adjacent pages on (say) a megabyte or gigabyte boundary can share a common header and an array of page headers, so that the containing pages need not be touched either, and the metadata stays concentrated in a few busy cache lines. Segment headers ping-ponging between CPUs would be a bad thing, but each segment might reserve a few cache lines for annotations by other CPUs.

A design appropriate for hundreds of CPUs and hundreds of GB is very different from one for a single CPU and under a GB. It's not clear that any existing SL*B is right for either case, or that any one can be right for all. Plenty of room remains for improvement in all regimes, so consolidation seems way premature.
Tue, 16 Dec 2008 21:38:10 +0000
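The address arithmetic ncm describes takes only a few lines of standalone C to sketch; the segment size, the layout, and every name below are assumptions for illustration:

#include <stddef.h>
#include <stdint.h>

#define SEG_SHIFT 20			/* 1 MiB segments (assumed) */
#define SEG_SIZE  (1UL << SEG_SHIFT)
#define PG_SHIFT  12			/* 4 KiB pages */

struct page_hdr {			/* per-page allocator metadata */
	uint32_t inuse;
};

/* Lives in the segment's first page; objects occupy later pages. */
struct seg_hdr {
	struct page_hdr pages[SEG_SIZE >> PG_SHIFT];
};

/* Mask the low bits: the segment-aligned header, without ever
 * touching the object or its page. */
static struct seg_hdr *obj_to_seg(const void *obj)
{
	return (struct seg_hdr *)((uintptr_t)obj & ~(SEG_SIZE - 1));
}

static struct page_hdr *obj_to_page_hdr(const void *obj)
{
	size_t idx = ((uintptr_t)obj & (SEG_SIZE - 1)) >> PG_SHIFT;

	return &obj_to_seg(obj)->pages[idx];
}

In spirit this is what the kernel's virt_to_page() calculation already provides, which is why Nick notes above that all of the slab allocators manage their slabs through struct page.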
SLQB - and then there were four
https://lwn.net/Articles/311699/ (tao)
SlCube. It's O(n²) in complexity, and was created specifically to make the other MMs look better in comparison :)
Tue, 16 Dec 2008 21:37:15 +0000

SLQB - and then there were four
https://lwn.net/Articles/311697/ (alonz)
I wonder how well this algorithm will do in real life, considering the number of tunable watermarks it needs (size of freelist, size of rlist, size of remote_free, …).

Will it be self-tuning? Even if yes, will whatever heuristics are used for this tuning stand up to dynamic situations?
Tue, 16 Dec 2008 21:26:09 +0000

Pronouncing SLQB
https://lwn.net/Articles/311688/ (pr1268)
How about "SLICK-bee"? Or "sluh-CUE-bee"?
Tue, 16 Dec 2008 20:41:36 +0000

SLQB - and then there were four
https://lwn.net/Articles/311690/ (flewellyn)
> Any suggestions on how to pronounce SLQB? :)

"Guppie."

No real reason; it's just that, since SLQB is unpronounceable, and all the others are variations of "slab", this would serve to differentiate it.
Tue, 16 Dec 2008 20:41:31 +0000

SLQB - and then there were four
https://lwn.net/Articles/311686/ (BlueLightning)
How about "slickwib"?
Tue, 16 Dec 2008 20:35:40 +0000

SLQB - and then there were four
https://lwn.net/Articles/311684/ (proski)
Perhaps we need a "Grumpy Editor's guide to kernel memory allocators" :-)
Tue, 16 Dec 2008 20:33:43 +0000

SLQB - and then there were four
https://lwn.net/Articles/311683/ (riddochc)
Interesting. This calls for an explanation of the differences between the various allocators -- it sounds, from this article, like they have a fair amount in common with each other.

Any suggestions on how to pronounce SLQB? :)
Tue, 16 Dec 2008 20:29:33 +0000

SQLB: and then there were four
https://lwn.net/Articles/311682/ (Los__D)
Ah, corrected now.

And damn, I need a new keyboard (OK, new fingers). "at big too"????
Tue, 16 Dec 2008 20:16:44 +0000

SQLB: and then there were four
https://lwn.net/Articles/311672/ (Los__D)
Hehe, it would seem we have an editor at big too fond of SQL ;)
Tue, 16 Dec 2008 19:58:29 +0000