SLQB - and then there were four
Given this situation, one would not necessarily think that the kernel needs yet another allocator. But Nick Piggin thinks that, despite the surfeit of low-level memory managers, there is always room for one more. To that end, he has developed the SLQB allocator which he hopes to eventually see merged into the mainline. According to Nick:
Like the other slab-like allocators, SLQB sits on top of the page allocator and provides for allocation of fixed-sized objects. It has been designed with an eye toward scalability on high-end systems; it also makes a real effort to avoid the allocation of compound pages whenever possible. Avoidance of higher-order (compound page) allocations can improve reliability significantly when memory gets tight.
While there is a fair amount of tricky code in SLQB, the core algorithms are not that hard to understand. Like the other slab-like allocators, it implements the abstraction of a "slab cache" - a lookaside cache from which memory objects of a fixed size can be allocated. Slab caches are used directly when memory is allocated with kmem_cache_alloc(), or indirectly through functions like kmalloc(). In SLQB, a slab cache is represented by a data structure which looks very approximately like the following:
(Note that, to simplify the diagram, a number of things have been glossed over).
The main kmem_cache structure contains the expected global parameters - the size of the objects being allocated, the order of page allocations, the name of the cache, etc. But scalability means separating processors from each other, so the bulk of the kmem_cache data structure is stored in per-CPU form. In particular, there is one kmem_cache_cpu structure for each processor on the system.
Within that per-CPU structure one will find a number of lists of objects. One of those (freelist) contains a list of available objects; when a request is made to allocate an object, the free list will be consulted first. When objects are freed, they are returned to this list. Since this list is part of a per-CPU data structure, objects normally remain on the same processor, minimizing cache line bouncing. More importantly, the allocation decisions are all done per-CPU, with no bad cache behavior and no locking required beyond the disabling of interrupts. The free list is managed as a stack, so allocation requests will return the most recently freed objects; again, this approach is taken in an attempt to optimize memory cache behavior.
SLQB gets its memory in the form of full pages from the page allocator. When an allocation request is made and the free list is empty, SLQB will allocate a new page and return an object from that page. The remaining space on the page is organized into a per-page free list (assuming the objects are small enough to pack more than one onto a page, of course), and the page is added to the partial list. The other objects on the page will be handed out in response to allocation requests, but only when the free list is empty. When the final object on a page is allocated, SLQB will forget about the page - temporarily, at least.
Freed objects are added to freelist. It is easy to foresee that this list could grow quite large after a burst of system activity. Allowing freelist to grow without bound would risk tying up a lot of system memory doing nothing while it might be needed elsewhere. So, once the size of the free list passes a watermark (or when the page allocator starts asking for help freeing memory), objects in the free list will be flushed back to their containing pages. Any partial pages which end up completely filled with freed objects will then be returned to the page allocator for use elsewhere.
There is an interesting situation which arises here, though: remember that SLQB is fundamentally a per-CPU allocator. But there is nothing that requires objects to be freed on the same CPU which allocated them. Indeed, for suitably long-lived objects on a system with many processors, it becomes probable that objects will be freed on a different CPU. That processor does not know anything about the partial pages those objects were allocated from, and, thus, cannot free them. So a different approach has to be taken.
That approach involves the maintenance of two more object lists, called rlist and remote_free. When the allocator tries to flush a "remote" object (one allocated on a different CPU) from its local freelist, it will simply move that object over to rlist. Occasionally, the allocator will reach across CPUs to take the objects from its local rlist and put them on the remote_free list of the CPU which initially allocated those objects. That CPU can then choose to reuse the objects or free them back to their containing pages.
The cross-CPU list operation clearly requires locking, so a spinlock protects remote_free. Working with the remote_free lists too often would thus risk cache line bouncing and lock contention, neither of which is helpful when scalability is a goal. That is why processors accumulate a group of objects in their local rlist before adding the entire list, in a single operation, to the appropriate remote_free list. On top of that, the allocator does not often check for objects in its local remote_free list. Instead, objects are allowed to accumulate there until a watermark is exceeded, at which point whichever processor added the final objects will set the remote_free_check flag. The processor owning the remote_free list will only check that list when this flag is set, with the result that the management of the remote_free list can be done with little in the way of lock or cache line contention.
The SLQB code is relatively new, and is likely to need a considerable amount of work before it may find its way into the mainline. Nick claims benchmark results which are roughly comparable with those obtained using the other allocators. But "roughly comparable" will not, by itself, be enough to motivate the addition of yet another memory allocator. So pushing SLQB beyond comparable and toward "clearly better" is likely to be Nick's next task.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Slab allocators |
| Kernel | Slab allocator |
Posted Dec 16, 2008 19:58 UTC (Tue)
by Los__D (guest, #15263)
[Link] (1 responses)
Posted Dec 16, 2008 20:16 UTC (Tue)
by Los__D (guest, #15263)
[Link]
And damn I need a new keyboard (Ok, new fingers), "at big too"????
Posted Dec 16, 2008 20:29 UTC (Tue)
by riddochc (guest, #43)
[Link] (12 responses)
Any suggestions on how to pronounce SLQB? :)
Posted Dec 16, 2008 20:33 UTC (Tue)
by proski (subscriber, #104)
[Link]
Posted Dec 16, 2008 20:35 UTC (Tue)
by BlueLightning (subscriber, #38978)
[Link]
Posted Dec 16, 2008 20:41 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link] (1 responses)
> Any suggestions on how to pronounce SLQB? :)

"Guppie." No real reason; it's just that, since SLQB is unpronounceable, and all the others are variations of "slab", this would serve to differentiate it.
Posted Dec 16, 2008 21:37 UTC (Tue)
by tao (subscriber, #17563)
[Link]
Posted Dec 16, 2008 20:41 UTC (Tue)
by pr1268 (guest, #24648)
[Link]
How about "SLICK-bee"? Or "sluh-CUE-bee"?
Posted Dec 17, 2008 12:58 UTC (Wed)
by jengelh (guest, #33263)
[Link] (4 responses)
Simple as that - [ɛs][ɛl][kjuː][biː] or, localized for Germany, [ɛs][ɛl][kuː][beː]. BTW, I do not condone the use of English approximations such as "bee" for, eh well, English sounds. It does not help non-English speakers at all and only makes for loads of confusion (yes, I am referring to that sibling post above this one), because that's [beː] for some (most?) people outside the realm of the English language. Please, just use IPA.
Posted Dec 17, 2008 14:01 UTC (Wed)
by Los__D (guest, #15263)
[Link] (1 responses)
Posted Dec 20, 2008 12:09 UTC (Sat)
by jengelh (guest, #33263)
[Link]
Posted Dec 16, 2008 21:26 UTC (Tue)
by alonz (subscriber, #815)
[Link] (1 responses)
Will it be self-tuning? Even if yes, will whatever heuristics are used for this tuning stand up to dynamic situations?
Posted Dec 17, 2008 9:39 UTC (Wed)
by Nick (guest, #15060)
[Link]
SLQB uses lists rather than arrays, so it is also more flexible to tune at runtime than SLAB. But in practice it is very difficult to adjust the sizing of these kinds of things at runtime. I've taken the dumb easy (SLAB) way out of it and periodically trim down some of the lists. That can cut back memory usage for unused slabs, and they can grow back if needed (and trim is relatively infrequent so it doesn't harm performance).
Thanks for the article, BTW. I feel like I understand the thing better myself now too :)
Posted Dec 16, 2008 21:38 UTC (Tue)
by ncm (guest, #165)
[Link] (5 responses)
I wonder, though, about the choice to hand all objects back to the CPU that originally allocated them. In particular, suppose we have a large collection of objects freed by CPU A long after they have passed out of B's cache -- e.g., allocated on behalf of a process that has migrated to A. It might be objectively better for CPU B to work on their metadata while it remains in B's cache. Maybe it's better to re-map a fully freed page to an address range that B manages; if we know that CPU A won't be touching that page anyway, it may be unmapped from A's page map at leisure.
I wonder, too, about all these lists. Are the list links adjacent to the actual storage of the objects? Often when an object is freed it has long since passed out of the cache, and touching a list link would bring it back in. A way to avoid touching the actual object is for the metadata to be obtainable by address arithmetic, e.g. masking off the low bits to obtain the address of a page header. Some extra structure helps -- a segment of adjacent pages on (say) a megabyte or gigabyte boundary can share a common header and an array of page headers, so that the containing pages need not be touched, either, and the metadata concentrated in a few busy cache lines. Segment headers pingponging between CPUs would be a bad thing, but each segment might reserve a few cache lines for annotations by other CPUs.
A design appropriate for hundreds of CPUs and hundreds of GB is very different from one for a single CPU and under a GB. It's not clear that any existing SL*B is right for either case, or that any one can be for all. Plenty of room remains for improvement in all regimes, so consolidation seems way premature.
Posted Dec 17, 2008 9:49 UTC (Wed)
by Nick (guest, #15060)
[Link] (3 responses)
Actually, within the same node, CPUs can allocate objects they've freed which have come from another CPU (on that node). When freeing an object to our freelist, the test that is performed is whether the object's memory is on the same node as the node this CPU is on. This is similar to what SLAB does, and indeed is good for cache (and object management overhead).
Except in very special cases of slabs (RCU-ish ones), the linked list is part of the object itself (when the object is freed, nobody else, by definition, is using the memory). _Often_ in the kernel, I'd say that when an object is freed it has probably been touched recently by the freeing CPU. It would be an unusual situation, eg. to have a refcount to the object residing somewhere other than the object itself.
And yes, we have the struct page structure for every page in the system, which can be found with calculations on the object address. And yes, all the slab allocators use this struct page to manage their slabs of objects :)
I agree with your last paragraph too...
Posted Dec 18, 2008 1:00 UTC (Thu)
by ncm (guest, #165)
[Link] (2 responses)
Integrating refcounting would change the SLAB interface, but generic refcounting services could be added to all the allocators, providing another way for them to differentiate themselves.
Posted Dec 18, 2008 8:41 UTC (Thu)
by Nick (guest, #15060)
[Link] (1 responses)
The exceptions are caches that are reclaimed when memory gets low or a watermark is reached (eg. inode cache, dentry cache, etc). However, with these things, they still need to be found in order to be reclaimed, usually via an LRU list, so the object gets hot when it's found and taken off the list.
OK, you could move the refcount and the lru list into another area... but the other problem with that is that in cache objects you expect to have multiple lookups over the lifetime of the object. And if your lookups have to increment a disjoint refcount object, then you increase your cacheline footprint per lookup by effectively an entire line per object. So you trade slower lookups for faster reclaim, which could easily be a bad tradeoff if the cache is effective (which the dcache is, 99% of the time).
Did you have any particular situations in mind?
Posted Dec 18, 2008 21:37 UTC (Thu)
by ncm (guest, #165)
[Link]
An optimized refcount scheme may take only a few bits per object, for most objects, so I was thinking one cache line might hold refcounts for a hundred objects. Also, a dirty cache line is way more expensive than a clean one (because it must be written back, and isn't a candidate for replacement until that's done) so I meant to concentrate the refcount churn, and segregate it from the (commonly) mostly-unchanging objects. This helps most where you have some semblance of locality. Unfortunately things like inode caches don't.
As always, there's no substitute for measuring, but you have to build it first.
Posted Dec 22, 2008 20:23 UTC (Mon)
by bgamsa (guest, #33057)
[Link]
Posted Dec 17, 2008 9:09 UTC (Wed)
by iq-0 (subscriber, #36655)
[Link] (3 responses)
The big question is:
Perhaps it would even be better for items on 'rlist' to also be put on the 'freelist', so we simply allocate the least recently used item first (probably cache-hot).
And without having looked at the code, I'd assume that only one CPU at a time is cleaning up its 'rlist' and that usage of the 'remote_free' lock is only a "best effort" locking scheme (try_lock()-ish), because maintenance should really be a background task with minimal overhead. Though this might have problems I overlook (or is already done that way).
Just some thoughts I had reading this text, nothing really thoroughly thought through though.
Posted Dec 17, 2008 10:11 UTC (Wed)
by Nick (guest, #15060)
[Link]
The issue of cleaning up rlist is interesting. There are so many ways this can be done and it is about the most difficult part of a slab allocator... No, any CPU can be cleaning its rlist at any time, and yes they might all point to a single remote CPU. That's quite unlikely and the critical section is very short, so hopefully it won't be a problem. But I don't claim to know what the best way to do it is.
Very large number of CPUs I am definitely interested in... so I'm hoping to be as good or better than the other allocators here, but we'll see.
Posted Dec 17, 2008 10:59 UTC (Wed)
by sdalley (subscriber, #18550)
[Link] (1 responses)
.. and it's even harder to plough through those details with a tough cough and hiccough ...
Posted Dec 17, 2008 13:05 UTC (Wed)
by jengelh (guest, #33263)
[Link]
That reminds me of the Being Bilingual sketch. And as I see it, sdalley cheated on all four
Posted Jan 25, 2009 12:25 UTC (Sun)
by tej.parkash (guest, #51084)
[Link]
Perhaps we need a "Grumpy Editor's guide to kernel memory allocators" :-)
I wonder how well this algorithm will do in real life, considering the number of tunable watermarks it needs (size of freelist, size of rlist, size of remote_free, …).
Why not allocate from 'rlist' when you're done with your own freelist? We don't have to update the metadata, so it's a very cheap operation.
Sure, the remote item might be bounced back to the other CPU, but clearly the code using it doesn't seem to mind which CPU last used it, and with the LRU logic it's just as likely still in our CPU cache. And if it did bounce back in the meanwhile, it means we are probably dealing with a slab cache that's not that heavily used (or its usage should be optimized, and this would be true for all slab allocators).
SLQB - and then there were four
SLQB - and then there were four
SLQB - and then there were four
/^th/ words. :-)