Who merges?
Posted Dec 16, 2008 21:38 UTC (Tue) by ncm (guest, #165)
Parent article: SLQB - and then there were four
I wonder, though, about the choice to hand all objects back to the CPU that originally allocated them. In particular, suppose we have a large collection of objects freed by CPU A long after they have passed out of B's cache -- e.g., allocated on behalf of a process that has since migrated to A. It might be objectively better for CPU A to work on their metadata while it remains in A's cache. Maybe it's better to re-map a fully freed page to an address range that A manages; if we know that CPU B won't be touching that page anyway, it may be unmapped from B's page map at leisure.
I wonder, too, about all these lists. Are the list links adjacent to the actual storage of the objects? Often when an object is freed it has long since passed out of the cache, and touching a list link would bring it back in. A way to avoid touching the actual object is for the metadata to be obtainable by address arithmetic, e.g. masking off the low bits to obtain the address of a page header. Some extra structure helps -- a segment of adjacent pages on (say) a megabyte or gigabyte boundary can share a common header and an array of page headers, so that the containing pages need not be touched, either, and the metadata concentrated in a few busy cache lines. Segment headers pingponging between CPUs would be a bad thing, but each segment might reserve a few cache lines for annotations by other CPUs.
A design appropriate for hundreds of CPUs and hundreds of GB is very different from one for a single CPU and under a GB. It's not clear that any existing SL*B is right for either case, or that any one can be for all. Plenty of room remains for improvement in all regimes, so consolidation seems way premature.
Posted Dec 17, 2008 9:49 UTC (Wed) by Nick (guest, #15060)
Actually, within the same node, a CPU can allocate objects it has freed even if they originally came from another CPU (on that node). When freeing an object to our freelist, the test performed is whether the object's memory is on the same NUMA node as the freeing CPU. This is similar to what SLAB does, and indeed is good for cache behaviour (and for object-management overhead).
Except in very special cases of slabs (RCU-ish ones), the linked list is part of the object itself (when the object is freed, nobody else, by definition, is using the memory). _Often_ in the kernel, I'd say that when an object is freed it has probably been touched recently by the freeing CPU. It would be unusual, e.g., for a refcount to the object to reside somewhere other than the object itself.
And yes, we have the struct page structure for every page in the system, which can be found with calculations on the object address. And yes, all the slab allocators use this struct page to manage their slabs of objects :)
I agree with your last paragraph too...
Posted Dec 18, 2008 1:00 UTC (Thu) by ncm (guest, #165)
Integrating refcounting would change the SLAB interface, but generic refcounting services could be added to all the allocators, providing another way for them to differentiate themselves.
Posted Dec 18, 2008 8:41 UTC (Thu) by Nick (guest, #15060)
The exception is caches that are reclaimed when memory gets low or a watermark is reached (e.g. the inode cache, dentry cache, etc.). Even with these, though, objects still need to be found in order to be reclaimed, usually via an LRU list, so an object gets hot when it is found and taken off the list.
OK, you could move the refcount and the LRU list into another area... but the other problem is that, for cache objects, you expect multiple lookups over the lifetime of the object. If your lookups have to increment a disjoint refcount object, then you increase your cacheline footprint per lookup by effectively an entire line per object. So you trade slower lookups for faster reclaim, which could easily be a bad tradeoff if the cache is effective (which the dcache is, 99% of the time).
Did you have any particular situations in mind?
Posted Dec 18, 2008 21:37 UTC (Thu) by ncm (guest, #165)
An optimized refcount scheme may take only a few bits per object, for most objects, so I was thinking one cache line might hold refcounts for a hundred objects. Also, a dirty cache line is way more expensive than a clean one (because it must be written back, and isn't a candidate for replacement until that's done) so I meant to concentrate the refcount churn, and segregate it from the (commonly) mostly-unchanging objects. This helps most where you have some semblance of locality. Unfortunately things like inode caches don't.
As always, there's no substitute for measuring, but you have to build it first.
Posted Dec 22, 2008 20:23 UTC (Mon) by bgamsa (guest, #33057)
Who merges?