The zsmalloc allocator

By Jonathan Corbet
January 25, 2012
The kernel cannot be said to lack for memory allocation mechanisms. At the lowest level, "memblock" hands out chunks of memory during early boot, before the page allocator is up and running. The page allocator then provides memory to the rest of the kernel in units of whole pages. Much of the kernel uses one of the three slab allocators to get memory blocks in arbitrary sizes, but there is also vmalloc() for situations where large, virtually-contiguous regions are needed. Add in the various other specialized allocation functions and allocators (CMA, for example) and it starts to seem like a true embarrassment of riches. So what's to be done in this situation? Add another one, of course.

The "zsmalloc" allocator, proposed by Seth Jennings, is aimed at a specific use case. The slab allocators work by packing multiple objects into each page of memory; that works well when the objects are small, but can be problematic as they get larger. In the worst case, if a kernel subsystem routinely needs allocations that are just larger than PAGE_SIZE/2, only one object will fit within a page. Slab allocators can attempt to allocate multiple physically-contiguous pages in order to pack those large objects more efficiently, but, on memory-constrained systems, those allocations can become difficult - or impossible. So, on systems that are already tight of memory, large objects will need to be allocated one-per-page, wasting significant amounts of memory through internal fragmentation.

The zsmalloc allocator attempts to address this problem by packing objects into a new type of compound page where the component pages are not physically contiguous. The result can be much more efficient memory usage, but with some conditions:

  • Code using this allocator must not require physically-contiguous memory,

  • Objects must be explicitly mapped before use, and

  • Objects can only be accessed in atomic context.

Code using zsmalloc must start by creating an allocation pool to work from:

    struct zs_pool *zs_create_pool(const char *name, gfp_t flags);

Where name is the name of the pool, and flags will be used to allocate memory for the pool. It is not entirely clear (to your editor, at least) why the ability to create multiple pools is useful; the zs_pool structure is relatively large, and a pool is really only efficient if the number of objects allocated from it is also large. But that's how the API is designed.

A pool can be released with:

    void zs_destroy_pool(struct zs_pool *pool);

A warning (or several warnings) will be generated if there are objects allocated from the pool that have not been freed; those objects will become entirely inaccessible after the pool is gone.
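A minimal sketch of the pool lifecycle might look like this; the module wrapper, the pool name, and the header location are assumptions made for illustration (the zsmalloc code currently lives in the staging tree):

    #include <linux/module.h>
    #include <linux/zsmalloc.h>    /* assumed header location */

    static struct zs_pool *example_pool;    /* hypothetical pool for this sketch */

    static int __init example_init(void)
    {
        /* GFP_KERNEL will be used for the pool's internal allocations. */
        example_pool = zs_create_pool("example", GFP_KERNEL);
        if (!example_pool)
            return -ENOMEM;
        return 0;
    }

    static void __exit example_exit(void)
    {
        /* All objects should have been freed back to the pool by now. */
        zs_destroy_pool(example_pool);
    }

    module_init(example_init);
    module_exit(example_exit);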

Allocating and freeing memory is done with:

    void *zs_malloc(struct zs_pool *pool, size_t size);
    void zs_free(struct zs_pool *pool, void *obj);

The return value from zs_malloc() will be a pointer value, or NULL if the object cannot be allocated. It would be a fatal mistake, though, to treat that return value as if it were a real pointer; it is actually a magic cookie that represents the allocated memory indirectly. It might have been better to use a non-pointer type, but, again, that is how the API is designed. Getting a pointer that can actually be used is done with:

    void *zs_map_object(struct zs_pool *pool, void *handle);
    void zs_unmap_object(struct zs_pool *pool, void *handle);

The return value from zs_map_object() will be a kernel virtual address that can be used to access the actual object. The return address is essentially a per-CPU object, so the calling code will be in atomic context until the object is unmapped with zs_unmap_object(). Note that the handle passed to zs_unmap_object() is the original cookie obtained from zs_malloc(), not the pointer from zs_map_object(). Note also that only one object can be safely mapped at a time on any given CPU.
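Putting the pieces together, a hypothetical user might store a buffer this way; the function names are invented, and only the zs_*() calls come from the interface described above:

    #include <linux/string.h>
    #include <linux/zsmalloc.h>    /* assumed header location */

    /* Copy len bytes from src into the pool; returns the cookie, or NULL. */
    static void *example_store(struct zs_pool *pool, const void *src, size_t len)
    {
        void *handle, *dst;

        handle = zs_malloc(pool, len);
        if (!handle)
            return NULL;

        /* The mapping puts this CPU into atomic context... */
        dst = zs_map_object(pool, handle);
        memcpy(dst, src, len);
        /* ...until the object is unmapped; note that the cookie is passed, not dst. */
        zs_unmap_object(pool, handle);

        return handle;
    }

    /* Release the object, again using the original cookie. */
    static void example_drop(struct zs_pool *pool, void *handle)
    {
        zs_free(pool, handle);
    }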

Internally, zsmalloc divides allocations by object size much like the slab allocators do, but with a much higher granularity - there are 254 possible allocation sizes, all less than PAGE_SIZE. For each size, the code calculates an optimum number of pages (up to 16) that will hold an array of objects of that size with minimal loss to fragmentation; a sketch of that calculation appears after the list below. When an allocation is made, a "zspage" is created by allocating the calculated number of individual pages and tying them all together. That tying is done by overloading some fields of struct page in a scary way (that is not a criticism of zsmalloc: any additional meanings overlaid onto the already heavily overloaded page structure are scary):

  • The first page of a zspage has the PG_private flag set. The private field points to the second page (if any), while the lru list structure is used to make a list of zspages of the same size.

  • Subsequent pages are linked to each other with the lru structure, and are linked back to the first page with the first_page field (which is another name for private, if one looks at the structure definition).

  • The last page has the PG_private_2 flag set.
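The page-count calculation mentioned above can be pictured along these lines; this is an illustration of the idea rather than the actual zsmalloc code, and the names are invented:

    #define EXAMPLE_MAX_PAGES_PER_ZSPAGE    16

    /*
     * For a given object size, try every zspage size from one to sixteen
     * pages and pick the one that leaves the smallest fraction of the
     * zspage unused.
     */
    static int example_pages_per_zspage(int class_size)
    {
        int pages, best_pages = 1, best_used_pct = 0;

        for (pages = 1; pages <= EXAMPLE_MAX_PAGES_PER_ZSPAGE; pages++) {
            int zspage_size = pages * PAGE_SIZE;
            int waste = zspage_size % class_size;    /* leftover bytes */
            int used_pct = (zspage_size - waste) * 100 / zspage_size;

            if (used_pct > best_used_pct) {
                best_used_pct = used_pct;
                best_pages = pages;
            }
        }
        return best_pages;
    }

For the hypothetical 2112-byte objects used earlier, a single page would waste 1984 bytes, while a 16-page zspage wastes only 64.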

Within a zspage, objects are packed from the beginning, and may cross the boundary between pages. The cookie returned from zs_malloc() is a combination of a pointer to the page structure for the first physical page and the offset of the object within the zspage. Making that object accessible to the rest of the kernel at mapping time is a matter of calculating its location, then either (1) mapping it with kmap_atomic() if the object fits entirely within one physical page, or (2) assigning a pair of virtual addresses if the object crosses a physical page boundary.
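Those two cases can be sketched as follows; this, too, is illustrative code with invented names, with map_two_pages() standing in for the per-CPU, two-page mapping that zsmalloc actually performs:

    #include <linux/highmem.h>

    /* Hypothetical stand-in for zsmalloc's per-CPU, two-page mapping. */
    void *map_two_pages(struct page *page, unsigned int offset, size_t size);

    /*
     * "page" is the physical page holding the start of the object (found by
     * walking the zspage's page list); "offset" is the object's offset
     * within that page.
     */
    static void *example_map_object(struct page *page, unsigned int offset,
                                    size_t size)
    {
        if (offset + size <= PAGE_SIZE)
            /* Case 1: the object fits entirely within one physical page. */
            return kmap_atomic(page) + offset;

        /*
         * Case 2: the object crosses into the next page of the zspage; both
         * pages are mapped at adjacent virtual addresses so that the caller
         * sees a single contiguous object.
         */
        return map_two_pages(page, offset, size);
    }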

The primary users of zsmalloc are the zcache and zram mechanisms, both of which are currently in staging. These subsystems use the transcendent memory abstraction to store compressed copies of pages in memory. Those compressed pages can still be a substantial fraction of the (uncompressed) page size, so the fragmentation issue addressed by zsmalloc can be a real problem. Given the specialized use case and the limitations imposed by zsmalloc, it is not clear that it will find users elsewhere in the kernel, but one never knows.

Index entries for this article
Kernel: Memory management/Internal API
Kernel: Transcendent memory



Missing the primary author

Posted Jan 26, 2012 16:25 UTC (Thu) by sjennings (guest, #74813)

Just wanted to add that the primary designer and author of the zsmalloc allocator is Nitin Gupta. I (Seth) helped with design/bug fixes and submitted the patches since his changes and mine had dependencies.

Missing the primary author

Posted Jun 8, 2015 6:48 UTC (Mon) by vivs0766 (guest, #99130)

Why can't we maintain a single free list for all free objects in a pool of zspages, instead of maintaining a separate freelist for each zspage? I mean to say that if we maintained a single free list, then it would be easy for us to allocate objects from that freelist instead of looking for a freelist among the zspages.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds