
Slab allocator of the week: SLUB+Queuing

By Jonathan Corbet
June 29, 2010
The SLUB allocator first made its appearance in April 2007. It went into the mainline shortly thereafter. This allocator was intended to provide better performance while being much more memory efficient than the existing slab allocator. One of the key mechanisms for improving memory use was to get rid of the extensive object queues maintained by slab; with enough processors, those queues can grow to the point that they occupy a significant percentage of total memory even when there is nothing in them. SLUB works well in many workloads, but it has been plagued by regressions on certain benchmarks. So SLUB has never achieved its goal of displacing slab altogether, and developers have talked occasionally about getting rid of it.

But SLUB does better than slab on other benchmarks, and its code is widely held to be more readable than slab's - though that is widely held to be faint praise. So, over the years, attempts have been made to improve the SLUB allocator's performance. The latest such attempt is SLUB+Queuing, which, according to its developer Christoph Lameter, beats slab on the all-important "hackbench" benchmark.

There are a couple of significant changes in the SLUB+Q patch set which are intended to improve the performance of SLUB. At the top of the list is the restoration of queues to the allocator. SLUB+Q does not use the elaborate queues found in slab, though; there is, instead, a single per-CPU queue containing pointers to free objects belonging to the cache. Allocation operations are now simple, at least when the queue is not empty: the last object in the queue is handed out, and the length of the queue is decreased by one. Freeing an object into a queue which is not yet full is similarly simple. So the fast path, in both cases, should be fast indeed.
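In rough outline, the fast paths might look something like the sketch below. The structure and function names are invented for illustration; the real patch must also deal with preemption, NUMA placement, and debugging options.

    struct cpu_queue {
        int  objects;    /* number of object pointers currently queued */
        int  max;        /* capacity of the queue */
        void *object[];  /* pointers to free objects in this cache */
    };

    /* Allocation fast path: hand out the last queued object. */
    static void *queue_alloc_fast(struct cpu_queue *q)
    {
        if (q->objects == 0)
            return NULL;                   /* empty - take the slow path */
        return q->object[--q->objects];
    }

    /* Free fast path: stash the object if there is room in the queue. */
    static int queue_free_fast(struct cpu_queue *q, void *object)
    {
        if (q->objects == q->max)
            return 0;                      /* full - take the slow path */
        q->object[q->objects++] = object;
        return 1;
    }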

If a given CPU's queue goes empty, the SLUB+Q allocator must fall back to allocating objects out of pages, perhaps allocating more pages in the process. That, of course, is quite a bit slower. In an attempt to minimize the cost of this slow path, SLUB+Q pre-fills the queue with free objects, up to the "batch size" (half of the queue's total length, by default). So, in a situation where many more objects are being allocated than freed, the fast allocation path will continue to be used most of the time.
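Continuing the sketch above, the refill logic might look something like this; alloc_object_from_partial_pages() is a hypothetical helper standing in for the real page-handling code.

    /* Slow path: refill the queue up to the batch size, then allocate. */
    static void *queue_alloc_slow(struct cpu_queue *q, int batch)
    {
        while (q->objects < batch) {
            void *object = alloc_object_from_partial_pages();

            if (!object)
                break;    /* no partial pages left; new pages are needed */
            q->object[q->objects++] = object;
        }
        return queue_alloc_fast(q);
    }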

If, instead, the queue overflows, the allocator must push objects back into the pages they came from. Once again, the behavior chosen is to prune the queue back to a half-full state; the allocator will not push back all objects in the queue unless the kernel has indicated that it is under serious memory pressure. The default size of the queue depends on the object size, but it (along with the batch size) can be changed via a sysfs parameter.
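The overflow side of the slow path might, in simplified form, look like this; free_object_to_page() is, once again, a hypothetical stand-in.

    /* Slow path: prune the queue back to the batch size, then free. */
    static void queue_free_slow(struct cpu_queue *q, void *object, int batch)
    {
        while (q->objects > batch)
            free_object_to_page(q->object[--q->objects]);
        q->object[q->objects++] = object;
    }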

[Diagram: SLUB free list]

The other significant change has to do with how free objects are handled when they are not stored in one of the per-CPU queues. In current mainline kernels, SLUB maintains a list of pages which contain some free objects. Note that it does not keep pages which are fully allocated (those can simply be forgotten about until at least one object contained therein is freed); it also does not keep pages which are fully free (those are handed back to the page allocator). The partial pages contain one or more free objects which are organized into a linked list, as is vaguely shown in the diagram. There is a certain aesthetic value to doing things this way; it uses the free memory itself to keep track of free objects, minimizing the amount of overhead needed for object management.
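The core of that scheme is to store a "next free object" pointer inside each free object. Something like the following simplified version of SLUB's free-pointer helpers captures the idea (the real functions also take the cache and an offset into the object):

    /* The first word of a free object points to the next free object. */
    static inline void *get_freepointer(void *object)
    {
        return *(void **)object;
    }

    static inline void set_freepointer(void *object, void *next)
    {
        *(void **)object = next;
    }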

Unfortunately, there is also a cost to storing list pointers in the freed objects. Chances are good that, by the time the kernel gets around to freeing an object, it will not have been used for a bit; it may well be cache-cold on the freeing CPU. Objects which are on the free list are even more likely to be cache-cold. Putting list pointers into that object will bring it into the CPU cache, incurring a cache miss and, possibly, displacing something which was more useful. The result is a measurable performance hit.

Thus, over time, it has become clear that memory management is more efficient if it can avoid touching the objects which are being managed. The SLUB+Q patches achieve this goal by using a bitmap to track which objects in a given page are free. If the number of objects which can fit into a page is small enough, this bitmap can be stored in the page structure in the system memory map; otherwise it is placed at the end of the page itself. Now the management of free objects just requires tweaking bits in this (small) bitmap; the objects themselves are not changed by the allocator.
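In schematic form, allocation then becomes a matter of finding and clearing a set bit, and freeing a matter of setting it again. The layout below is illustrative rather than the patch's actual data structures, though the bit helpers (find_first_bit() and friends) are the kernel's own.

    #include <linux/bitops.h>

    struct page_objects {
        unsigned long *freemap;     /* one bit per object slot */
        unsigned int  nr_objects;   /* number of slots in the page */
        void          *base;        /* address of the first object */
        size_t        size;         /* object size */
    };

    /* Allocate: claim the first free slot without touching the object. */
    static void *page_alloc_object(struct page_objects *p)
    {
        unsigned int slot = find_first_bit(p->freemap, p->nr_objects);

        if (slot >= p->nr_objects)
            return NULL;                        /* page fully allocated */
        __clear_bit(slot, p->freemap);
        return (char *)p->base + slot * p->size;
    }

    /* Free: just set the object's bit; the object itself is untouched. */
    static void page_free_object(struct page_objects *p, void *object)
    {
        unsigned int slot = ((char *)object - (char *)p->base) / p->size;

        __set_bit(slot, p->freemap);
    }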

The hackbench benchmark works by creating groups of processes, then quickly passing messages between them using sockets. SLUB has tended to perform worse on this benchmark than slab, sometimes significantly so. With the new patches, Christoph has posted benchmark results showing performance which is significantly better than what slab achieves. If these results hold, SLUB+Q will have overcome one of the primary problems seen by SLUB.

It should be noted, though, that performance on a single benchmark is not especially indicative of the performance of a memory allocator in general; SLUB already beat slab on a number of other tests. Memory management performance is a subtle and tricky area to work in. So a lot more testing will be required before it will be possible to say that SLUB+Q has truly addressed SLUB's difficulties without introducing regressions of its own. But the initial indications look good.



Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds