The zsmalloc allocator
The "zsmalloc" allocator, proposed by Seth Jennings, is aimed at a specific use case. The slab allocators work by packing multiple objects into each page of memory; that works well when the objects are small, but can be problematic as they get larger. In the worst case, if a kernel subsystem routinely needs allocations that are just larger than PAGE_SIZE/2, only one object will fit within a page. Slab allocators can attempt to allocate multiple physically-contiguous pages in order to pack those large objects more efficiently, but, on memory-constrained systems, those allocations can become difficult - or impossible. So, on systems that are already tight on memory, large objects will need to be allocated one-per-page, wasting significant amounts of memory through internal fragmentation.
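The arithmetic behind that worst case can be sketched in a few lines of userspace C. This is only an illustration: the 4096-byte page size and the sketch_ names are assumptions made for the example, and real slab allocators carry per-slab overhead that is ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for illustration; the real PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096u

/* How many objects of obj_size fit in a single page (hypothetical helper). */
size_t sketch_objects_per_page(size_t obj_size)
{
	assert(obj_size > 0 && obj_size <= SKETCH_PAGE_SIZE);
	return SKETCH_PAGE_SIZE / obj_size;
}

/* Bytes left unused in a page filled with objects of obj_size (hypothetical helper). */
size_t sketch_wasted_bytes(size_t obj_size)
{
	return SKETCH_PAGE_SIZE - sketch_objects_per_page(obj_size) * obj_size;
}
```

Under these assumptions, an object of 2049 bytes (one byte more than half a page) fits only once per page and leaves 2047 bytes unused, while a 2048-byte object wastes nothing.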
The zsmalloc allocator attempts to address this problem by packing objects into a new type of compound page where the component pages are not physically contiguous. The result can be much more efficient memory usage, but with some conditions:
- Code using this allocator must not require physically-contiguous
memory,
- Objects must be explicitly mapped before use, and
- Objects can only be accessed in atomic context.
Code using zsmalloc must start by creating an allocation pool to work from:
struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
Where name is the name of the pool, and flags will be used to allocate memory for the pool. It is not entirely clear (to your editor, at least) why multiple pools exist; the zs_pool structure is relatively large, and a pool is really only efficient if the number of objects allocated from it is also large. But that's how the API is designed.
A pool can be released with:
void zs_destroy_pool(struct zs_pool *pool);
A warning (or several warnings) will be generated if there are objects allocated from the pool that have not been freed; those objects will become entirely inaccessible after the pool is gone.
Allocating and freeing memory is done with:
void *zs_malloc(struct zs_pool *pool, size_t size);
void zs_free(struct zs_pool *pool, void *obj);
The return value from zs_malloc() will be a pointer value, or NULL if the object cannot be allocated. It would be a fatal mistake, though, to treat that pointer as if it were actually a pointer; it is actually a magic cookie that represents the allocated memory indirectly. It might have been better to use a non-pointer type, but, again, that is how the API is designed. Getting a pointer that can actually be used is done with:
void *zs_map_object(struct zs_pool *pool, void *handle);
void zs_unmap_object(struct zs_pool *pool, void *handle);
The return value from zs_map_object() will be a kernel virtual address that can be used to access the actual object. The return address is essentially a per-CPU object, so the calling code will be in atomic context until the object is unmapped with zs_unmap_object(). Note that the handle passed to zs_unmap_object() is the original cookie obtained from zs_malloc(), not the pointer from zs_map_object(). Note also that only one object can be safely mapped at a time on any given CPU.
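The calling sequence described above can be sketched as follows. Only the six prototypes come from the text; everything else is an assumption made so that the example compiles outside the kernel. The function bodies are hypothetical userspace stand-ins built on malloc(), gfp_t and GFP_KERNEL are placeholder definitions, and store_and_check() is an invented caller. None of it reflects how zsmalloc is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int gfp_t;	/* placeholder for the kernel type */
#define GFP_KERNEL 0u		/* placeholder value */
#define MOCK_SLOTS 16		/* arbitrary capacity of the mock pool */

/* Mock pool: a small table, so that a handle is an index and not an address. */
struct zs_pool {
	const char *name;
	void *slot[MOCK_SLOTS];
};

struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
{
	struct zs_pool *pool = calloc(1, sizeof(*pool));

	(void)flags;
	if (pool)
		pool->name = name;
	return pool;
}

void zs_destroy_pool(struct zs_pool *pool)
{
	free(pool);
}

/* Returns an opaque cookie (slot index + 1), never a usable pointer. */
void *zs_malloc(struct zs_pool *pool, size_t size)
{
	for (uintptr_t i = 0; i < MOCK_SLOTS; i++) {
		if (!pool->slot[i]) {
			pool->slot[i] = malloc(size);
			return pool->slot[i] ? (void *)(i + 1) : NULL;
		}
	}
	return NULL;
}

void zs_free(struct zs_pool *pool, void *obj)
{
	uintptr_t i = (uintptr_t)obj - 1;

	free(pool->slot[i]);
	pool->slot[i] = NULL;
}

/* Translates the cookie into an address that may be dereferenced. */
void *zs_map_object(struct zs_pool *pool, void *handle)
{
	assert(handle != NULL);
	return pool->slot[(uintptr_t)handle - 1];
}

void zs_unmap_object(struct zs_pool *pool, void *handle)
{
	(void)pool;
	(void)handle;	/* nothing to undo in the mock */
}

/* Hypothetical caller: store len bytes, read them back; returns 0 on success. */
int store_and_check(const char *data, size_t len)
{
	struct zs_pool *pool = zs_create_pool("sketch", GFP_KERNEL);
	void *handle, *va;
	int same;

	if (!pool)
		return -1;
	handle = zs_malloc(pool, len);
	if (!handle) {
		zs_destroy_pool(pool);
		return -1;
	}

	va = zs_map_object(pool, handle);	/* atomic context in the kernel */
	memcpy(va, data, len);
	zs_unmap_object(pool, handle);		/* takes the cookie, not va */

	va = zs_map_object(pool, handle);
	same = memcmp(va, data, len) == 0;
	zs_unmap_object(pool, handle);

	zs_free(pool, handle);
	zs_destroy_pool(pool);
	return same ? 0 : 1;
}
```

The point of the table-based mock is that the value returned by zs_malloc() is visibly not an address: dereferencing it directly would fault, and every access has to be bracketed by a map/unmap pair keyed on the original handle.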
Internally, zsmalloc divides allocations by object size much like the slab allocators do, but with a much finer granularity - there are 254 possible allocation sizes, all less than PAGE_SIZE. For each size, the code calculates an optimum number of pages (up to 16) that will hold an array of objects of that size with minimal loss to fragmentation. When an allocation is made, a "zspage" is created by allocating the calculated number of individual pages and tying them all together. That tying is done by overloading some fields of struct page in a scary way (that is not a criticism of zsmalloc: any additional meanings overlaid onto the already heavily overloaded page structure are scary):
- The first page of a zspage has the PG_private flag set. The
private field points to the second page (if any), while the
lru list structure is used to make a list of zspages of the
same size.
- Subsequent pages are linked to each other with the lru
structure, and are linked back to the first page with the
first_page field (which is another name for private,
if one looks at the structure definition).
- The last page has the PG_private_2 flag set.
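The choice of how many pages to tie together for a given object size, mentioned before the list above, might be approximated as in the sketch below. The text only says that the count is "optimum" and capped at 16; the selection rule used here (smallest wasted fraction, ties going to the smaller page count), the 4096-byte page size, and the sketch_ names are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_MAX_PAGES 16u	/* upper bound taken from the text */

/*
 * Pick the number of pages (1..16) whose combined space, filled with an
 * array of obj_size objects, leaves the smallest fraction unused.
 */
unsigned int sketch_pages_per_zspage(size_t obj_size)
{
	unsigned int best_pages = 1;
	size_t best_waste;

	assert(obj_size > 0 && obj_size <= SKETCH_PAGE_SIZE);
	best_waste = SKETCH_PAGE_SIZE % obj_size;

	for (unsigned int pages = 2; pages <= SKETCH_MAX_PAGES; pages++) {
		size_t waste = ((size_t)pages * SKETCH_PAGE_SIZE) % obj_size;

		/* Compare waste/pages against best_waste/best_pages without division. */
		if (waste * best_pages < best_waste * pages) {
			best_pages = pages;
			best_waste = waste;
		}
	}
	return best_pages;
}
```

With these assumptions, 3072-byte objects come out at three pages (four objects, no waste), while 2049-byte objects keep improving all the way to the 16-page limit.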
Within a zspage, objects are packed from the beginning, and may cross the boundary between pages. The cookie returned from zs_malloc() is a combination of a pointer to the page structure for the first physical page and the offset of the object within the zspage. Making that object accessible to the rest of the kernel at mapping time is a matter of calculating its location, then either (1) mapping it with kmap_atomic() if the object fits entirely within one physical page, or (2) assigning a pair of virtual addresses if the object crosses a physical page boundary.
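The location calculation performed at mapping time can be sketched as below, assuming objects are packed back to back from the start of the zspage. The structure, the function name, and the 4096-byte page size are hypothetical; the sketch leaves out how the cookie is encoded and how the mapping itself is set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct sketch_location {
	size_t page_index;	/* component page holding the first byte */
	size_t page_offset;	/* byte offset within that page */
	bool spans_pages;	/* true if the object runs into the next page */
};

/* Locate the obj_index'th object of obj_size bytes within a zspage. */
struct sketch_location sketch_locate(size_t obj_index, size_t obj_size)
{
	struct sketch_location loc;
	size_t start;

	assert(obj_size > 0 && obj_size <= SKETCH_PAGE_SIZE);
	start = obj_index * obj_size;

	loc.page_index = start / SKETCH_PAGE_SIZE;
	loc.page_offset = start % SKETCH_PAGE_SIZE;
	loc.spans_pages = loc.page_offset + obj_size > SKETCH_PAGE_SIZE;
	return loc;
}
```

For 3072-byte objects, the first object sits entirely in page 0 (the single-page case), while the second starts at offset 3072 of page 0 and spills into page 1, which is the case that needs a pair of virtual addresses.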
The primary users of zsmalloc are the zcache and zram mechanisms, both of which are currently in staging. These subsystems use the transcendent memory abstraction to store compressed copies of pages in memory. Those compressed pages can still be a substantial fraction of the (uncompressed) page size, so the fragmentation issue addressed by zsmalloc can be a real problem. Given the specialized use case and the limitations imposed by zsmalloc, it is not clear that it will find users elsewhere in the kernel, but one never knows.