
The guaranteed contiguous memory allocator

By Jonathan Corbet
March 21, 2025
As a system runs and its memory becomes fragmented, allocating large, physically contiguous regions of memory becomes increasingly difficult. Much effort over the years has gone into avoiding the need for such allocations whenever possible, but there are times when they simply cannot be avoided. The kernel's contiguous memory allocator (CMA) subsystem attempts to make such allocations possible, but it has never been a perfect solution. Suren Baghdasaryan is trying to improve that situation with the guaranteed contiguous memory allocator patch set, which includes work from Minchan Kim as well.

In the distant past, Dan Magenheimer introduced the concept of transcendent memory — memory that is not directly addressable, but which can be used opportunistically by the kernel for caching or other purposes. Most of the transcendent-memory work has since gone unused and been removed from the kernel, but the idea persists, and this patch series makes use of it to provide guaranteed CMA.

Specifically, the patch set includes a subsystem called "cleancache", a concept that was proposed by Magenheimer in 2012. If the kernel must drop a page of clean (unmodified) data, but would like to keep that data around if possible, it can put a copy into the cleancache, which will stash it aside somewhere. Should the need for that data arise, the kernel can copy it back out of the cleancache — if it is still there. Meanwhile, the page that initially contained the data can be reclaimed for other uses.

Guaranteed CMA then builds on cleancache by allocating a region of physically contiguous memory at boot, when such allocations are relatively easy. That memory is then turned into a cleancache and made available to the kernel. Whenever the memory-management system reclaims pages of file-backed memory, it can choose to place the data from those pages into the cleancache. Should that data be needed, an attempt will be made to retrieve it from the cleancache before rereading it from disk. The memory reserved for CMA is thus available to the kernel when not allocated to a CMA user, but in a restricted way.
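
As a very rough illustration of how these pieces fit together (this is a made-up, user-space toy model, not the interface in the patch set; all names are invented), the reserved region can be thought of as a contiguous array of page slots that doubles as a cache of clean page copies. A cache miss is always acceptable, because the data can simply be reread from disk:

    /* Toy model of guaranteed CMA's cleancache side. All names are invented. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define NR_PAGES  64

    /* Stands in for the physically contiguous region reserved at boot. */
    static uint8_t reserve[NR_PAGES * PAGE_SIZE];

    /* Per-slot state: is the slot handed out to a CMA user, and which file
     * page (if any) does it currently cache? */
    static struct tag {
        bool     allocated;   /* owned by a CMA user right now */
        bool     used;        /* holds a cached clean page copy */
        uint64_t inode;
        uint64_t index;
    } tags[NR_PAGES];

    /* Reclaim path: stash a copy of a clean page; doing nothing is always OK. */
    static void cc_put(uint64_t inode, uint64_t index, const void *page)
    {
        for (int i = 0; i < NR_PAGES; i++) {
            if (!tags[i].allocated && !tags[i].used) {
                tags[i] = (struct tag){ .used = true, .inode = inode, .index = index };
                memcpy(&reserve[i * PAGE_SIZE], page, PAGE_SIZE);
                return;
            }
        }
    }

    /* Read path: try the cache before rereading the data from disk. */
    static bool cc_get(uint64_t inode, uint64_t index, void *page)
    {
        for (int i = 0; i < NR_PAGES; i++) {
            if (tags[i].used && tags[i].inode == inode && tags[i].index == index) {
                memcpy(page, &reserve[i * PAGE_SIZE], PAGE_SIZE);
                return true;   /* hit: the disk read is avoided */
            }
        }
        return false;          /* miss: the caller falls back to the disk */
    }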

At some point, some kernel subsystem will need a large, physically contiguous buffer. Requesting that buffer from the guaranteed CMA subsystem will result in an allocation from the reserved memory, after dropping any cached data that happens to be in the allocated region. This allocation can happen quickly, since that data has been cached with the explicit stipulation that it can be dropped at any time. This approach was proposed by SeongJae Park and Kim in 2014.
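
Continuing the toy model above, the interesting part of a "guaranteed" allocation is that it only has to forget whatever clean copies happen to occupy the requested range before handing it out; nothing needs to be migrated or written back (the slot-based interface is, again, purely illustrative):

    /* Allocation path: drop any cached clean copies in the requested slots,
     * mark them as owned by the caller, and return the contiguous range. */
    static void *gcma_alloc(unsigned int first, unsigned int nr)
    {
        for (unsigned int i = first; i < first + nr; i++) {
            tags[i].used = false;        /* dropping clean copies is always safe */
            tags[i].allocated = true;    /* keep cc_put() away while in use */
        }
        return &reserve[first * PAGE_SIZE];
    }

    /* Release path: the slots simply become available for caching again. */
    static void gcma_release(unsigned int first, unsigned int nr)
    {
        for (unsigned int i = first; i < first + nr; i++)
            tags[i].allocated = false;
    }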

This new subsystem is integrated with the existing CMA API, so CMA users need not change to make use of it. The reserved region is set up by way of a devicetree property explicitly requesting the "guaranteed" behavior.
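
For reference, the existing CMA interface that guaranteed CMA plugs into looks roughly like this from a driver's point of view. How a driver gets hold of its struct cma pointer depends on how the reserved region was declared; dev_get_cma_area() is shown here only as one common route, signatures vary a little between kernel versions, and error handling is trimmed:

    #include <linux/cma.h>
    #include <linux/dma-map-ops.h>
    #include <linux/mm.h>

    /* Allocate and free a physically contiguous range of pages from the
     * device's CMA region; with guaranteed CMA this should succeed quickly
     * as long as the reserved area is not already exhausted. */
    static struct page *grab_contiguous(struct device *dev, size_t bytes)
    {
        struct cma *cma = dev_get_cma_area(dev);
        unsigned long count = PAGE_ALIGN(bytes) >> PAGE_SHIFT;

        return cma_alloc(cma, count, 0, false);
    }

    static void drop_contiguous(struct device *dev, struct page *pages, size_t bytes)
    {
        cma_release(dev_get_cma_area(dev), pages,
                    PAGE_ALIGN(bytes) >> PAGE_SHIFT);
    }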

The end result is a version of CMA that is guaranteed to succeed as long as the total allocations do not exceed the size of the reserved area; existing CMA has a higher likelihood of failure. Since CMA usage is often restricted to a problematic device or two with known needs, sizing the reserved area for a specific system should be straightforward.

The other advantage of guaranteed CMA is latency; if the memory is available, it can be allocated quickly. CMA in current kernels may have to migrate data out of the allocated region first, which takes time. The downside is that the memory reserved for guaranteed CMA can only be used for data that can be dropped at will; that will increase the pressure on the rest of the memory in the system.

This patch series was posted just ahead of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, where it is currently scheduled for a discussion in the memory-management track. There will probably not be a lot of comments on it ahead of that discussion. The patches are relatively small, though, and do not intrude into the memory-management subsystem on systems where CMA is not in use, so we might just see a transcendent-memory application actually go forward, some 15 years after the idea was first proposed.

Index entries for this article
Kernel: Contiguous memory allocator
Kernel: Memory management/Large allocations



Very interesting solution

Posted Mar 21, 2025 19:38 UTC (Fri) by aviallon (subscriber, #157205) [Link]

I really like the idea of this patch. Being able to loan out your reserved memory allocation when you don't need it makes it much less expensive, much like loaning out property you aren't using reduces its cost.

Please explain this a bit more...

Posted Mar 22, 2025 6:39 UTC (Sat) by gwolf (subscriber, #14632) [Link] (5 responses)

I feel I should understand the reason for this allocator (I _am_ a teacher of the operating-systems subject at BSc level!), but I am not getting there. Besides having larger contiguous areas that could be used, e.g., for DMA transfers or some IPC... or for virtual machines, where the host can “reason” more efficiently about their memory space if it is contiguous... or maybe on large NUMA systems where non-local memory is clearly more expensive... In an age where all userspace memory is paged and goes through the virtual-memory subsystem, why should the kernel or userspace care whether an allocation is contiguous or not?

Please explain this a bit more...

Posted Mar 22, 2025 9:34 UTC (Sat) by Wol (subscriber, #4433) [Link]

> Why should the kernel or the userspace care whether its allocation is contiguous or not?

In one word - performance.

Also

> Besides having larger contiguous areas that could be used, e.g., for DMA transfers or some IPC...

I believe with certain bits of hardware you need to substitute the word "could" with "must".

At the end of the day, spending ten otherwise idle cycles to gain 100 cycles when under pressure is money well spent, and I get the impression the gains are much bigger than that.

Quite possibly not B.Sc. level though.

Cheers,
Wol

Please explain this a bit more...

Posted Mar 22, 2025 11:32 UTC (Sat) by farnz (subscriber, #17727) [Link]

I see two use cases for this, one embedded, one laptop/desktop:
  1. Embedded hardware often has devices that need a large contiguous memory area to function; things like scanout framebuffers and image sensor processors are often designed to access an area of memory that's been dedicated to them. In the pre-CMA era, you'd have to allocate all of these areas of memory at boot time, and could not reuse the memory for any other purpose (so if you'd allocated space for a 3840x2160 framebuffer, you'd have wasted memory if the user only ever connected to a 1920x1080 screen).
  2. Desktops and laptops with iGPUs can have "stolen memory" that belongs to the iGPU as primary device, not the CPU, and has different access rules as a result (generally resulting in higher latency, but better throughput for GPU access patterns to DRAM). Right now, this memory can only be allocated at boot time, from a small number of choices (e.g. my laptop allows me to choose 512 MiB or 4 GiB for my iGPU's dedicated area).

Both of these are cases where it'd be nice to be able to allocate the large contiguous memory needed at runtime; for the iGPU, you could then reallocate (resetting the iGPU in the process) from whatever the BIOS set to something larger (if you're doing things that benefit from the GPU having more VRAM, rather than accessing "system memory"), or smaller (if your manufacturer sets it high for gaming, LLMs etc, but you're just writing code).

Please explain this a bit more...

Posted Mar 22, 2025 17:36 UTC (Sat) by excors (subscriber, #95769) [Link]

One practical example is the Raspberry Pi. RAM is shared between the CPU and GPU, but in older models the GPU has no IOMMU (or at least not one with 4KB granularity). It's simply incapable of doing virtual memory. Some hardware components (like DMA?) might have explicit scatter/gather support, so the kernel can pass them a list of physical page addresses. But most components don't; buffers are identified by a single physical address and a size, so they must be stored contiguously.
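
For a driver on hardware like that, the usual way to get such a buffer is the DMA API; on IOMMU-less systems, large requests end up being satisfied from the CMA area. A minimal, hypothetical sketch (device setup and error handling omitted):

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    /* Ask for a physically contiguous buffer that the device can reach
     * through a single bus address; on IOMMU-less hardware a request
     * this large is typically carved out of the CMA region. */
    static void *grab_frame_buffer(struct device *dev, dma_addr_t *bus_addr)
    {
        /* *bus_addr is what gets programmed into the device; it refers
         * to one contiguous 8MB chunk of physical memory. */
        return dma_alloc_coherent(dev, SZ_8M, bus_addr, GFP_KERNEL);
    }

    static void drop_frame_buffer(struct device *dev, void *cpu_addr,
                                  dma_addr_t bus_addr)
    {
        dma_free_coherent(dev, SZ_8M, cpu_addr, bus_addr);
    }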

Apparently the Pi4 added an IOMMU for its 3D block, so you can now use regular virtual memory for that, but other parts of the GPU (camera, video, VPU, etc) still need contiguous physical memory.

One extra complication on RPi is that parts of the GPU can only access the first 1GB of physical RAM. (It was designed with 32-bit addressing, but the top 2 bits were used to control cache behaviour). So buffers must be contiguous _and_ allocated from a specific region of RAM. And there are hardware bugs that make some allocations even more constrained.

(I'm not sure if there are very compelling reasons to design hardware without an IOMMU. I'd speculate it's largely because the hardware people see that IOMMUs come with non-zero performance cost, area cost, IP licensing cost, complexity, etc, and they don't really care if they're making life harder for the software people, so their default is to not put one in. And once the software people have got a hack like CMA that sort of works, there's little incentive to fix it in the next generation of the hardware.)

Please explain this a bit more...

Posted Mar 22, 2025 18:46 UTC (Sat) by iabervon (subscriber, #722) [Link] (1 responses)

It's not relevant to this particular feature, but there's also the fact that MMU hardware now supports PTEs that map different numbers of bits of the address, which means that you can save memory on the PTEs and access more memory through the available TLB slots if you use more offset bits per page, but this requires finding physical address ranges that don't have any smaller pages allocated in them.

This doesn't require being able to get contiguous memory, since the kernel can just not use the larger pages, but it is another reason that the kernel would care and feeds into other memory management features the kernel has.
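
To put rough numbers on the TLB point (the entry count below is made up but typical; real CPUs have different, and usually smaller, limits for huge-page entries):

    #include <stdio.h>

    /* Back-of-the-envelope only: how much memory a fixed number of TLB
     * entries can map ("TLB reach") at different page sizes. */
    int main(void)
    {
        const unsigned long long entries = 1536;   /* hypothetical dTLB size */
        const unsigned long long page_sizes[] = {
            4ULL << 10,    /* 4KiB base pages */
            2ULL << 20,    /* 2MiB huge pages */
            1ULL << 30,    /* 1GiB huge pages */
        };

        for (int i = 0; i < 3; i++)
            printf("page size %10llu bytes -> reach %8llu MiB\n",
                   page_sizes[i], entries * page_sizes[i] >> 20);
        return 0;
    }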

Please explain this a bit more...

Posted Mar 22, 2025 23:45 UTC (Sat) by willy (subscriber, #9762) [Link]

This is being addressed in other ways; see the recent article about Johannes Weiner's patches.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds