The guaranteed contiguous memory allocator
In the distant past, Dan Magenheimer introduced the concept of transcendent memory — memory that is not directly addressable, but which can be used opportunistically by the kernel for caching or other purposes. Most of the transcendent-memory work has since gone unused and been removed from the kernel, but the idea persists, and this patch series makes use of it to provide guaranteed CMA.
Specifically, the patch set includes a subsystem called "cleancache", a concept proposed by Magenheimer in 2012. If the kernel must drop a page of data, but would like to keep that data around if possible, it can put it into the cleancache, which will stash it aside somewhere. Should the need for that data arise, the kernel can copy it back out of the cleancache — if it is still there. Meanwhile, the page that initially contained that data can be reclaimed for other uses.
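Conceptually, a cleancache backend exposes a small set of put/get/invalidate operations to the memory-management code. The sketch below is loosely modeled on the cleancache_ops structure that existed in older kernels; the names and signatures here are simplified assumptions for illustration, not the interface used by this patch set.

```c
#include <linux/types.h>

struct page;

/*
 * Illustrative only: a cleancache-style backend interface, loosely based
 * on the cleancache_ops structure found in older kernels.  These names
 * and signatures are simplifications, not the API from the patch set.
 */
struct cleancache_backend_ops {
	/* Copy a clean page into the cache; the backend may silently refuse. */
	void (*put_page)(int pool_id, pgoff_t index, struct page *page);

	/* Copy a page back out; returns 0 on a hit, nonzero if it is gone. */
	int (*get_page)(int pool_id, pgoff_t index, struct page *page);

	/* The backing file changed; any cached copy of this page is stale. */
	void (*invalidate_page)(int pool_id, pgoff_t index);
};
```

In a scheme like this, the reclaim path would call put_page() just before freeing a clean, file-backed page, and the read path would try get_page() before falling back to disk I/O; a miss is always acceptable, since the data can simply be reread.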
Guaranteed CMA then builds on cleancache by allocating a region of physically contiguous memory at boot, when such allocations are relatively easy. That memory is then turned into a cleancache and made available to the kernel. Whenever the memory-management system reclaims pages of file-backed memory, it can choose to place the data from those pages into the cleancache. Should that data be needed, an attempt will be made to retrieve it from the cleancache before rereading it from disk. The memory reserved for CMA is thus available to the kernel when not allocated to a CMA user, but in a restricted way.
At some point, some kernel subsystem will need a large, physically contiguous buffer. Requesting that buffer from the guaranteed CMA subsystem will result in an allocation from the reserved memory, after dropping any cached data that happens to be in the allocated region. This allocation can happen quickly, since that data has been cached with the explicit stipulation that it can be dropped at any time. This approach was proposed by SeongJae Park and Kim in 2014.
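The allocation path can thus be pictured as: find a free range in the reserved area, discard whatever cached data sits there, and hand the range to the caller. The following is a minimal sketch of that logic; the gcma_* helpers and the gcma_area structure are hypothetical names invented for this example, not symbols from the actual patches.

```c
/*
 * Hypothetical sketch of a guaranteed-CMA allocation path.  All of the
 * gcma_* helpers and struct gcma_area are invented for illustration.
 */
struct page *gcma_alloc(struct gcma_area *area, unsigned long nr_pages)
{
	unsigned long start, i;

	/* Find nr_pages contiguous frames not currently held by a CMA user. */
	start = gcma_find_free_range(area, nr_pages);
	if (start == GCMA_NO_SPACE)
		return NULL;	/* total allocations exceed the reserved area */

	/*
	 * Cached data in this range was stored on the understanding that it
	 * can vanish at any time, so it is simply discarded; no migration or
	 * writeback is needed, which keeps allocation latency low.
	 */
	for (i = start; i < start + nr_pages; i++)
		gcma_drop_cached_page(area, i);

	gcma_mark_allocated(area, start, nr_pages);
	return gcma_offset_to_page(area, start);
}
```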
This new subsystem is integrated with the existing CMA API, so CMA users need not change to make use of it. The reserved region is set up by way of a devicetree property explicitly requesting the "guaranteed" behavior.
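For a driver, nothing changes: a consumer that already obtains its contiguous buffers through the usual DMA API keeps working, with the reserved CMA region (guaranteed or not) backing the allocation behind the scenes. A minimal sketch of such an unchanged call site follows; the device pointer and the 16MB size are placeholders.

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

/*
 * An ordinary driver-side allocation.  When the device is associated with
 * a CMA region, dma_alloc_coherent() can draw the buffer from it; whether
 * that region is "guaranteed" CMA makes no difference to this code.  The
 * device pointer and SZ_16M size are placeholders for illustration.
 */
static int example_alloc_buffer(struct device *dev)
{
	dma_addr_t dma_handle;
	void *buf;

	buf = dma_alloc_coherent(dev, SZ_16M, &dma_handle, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ... program dma_handle into the device, access buf from the CPU ... */

	dma_free_coherent(dev, SZ_16M, buf, dma_handle);
	return 0;
}
```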
The end result is a version of CMA that is guaranteed to succeed as long as the total allocations do not exceed the size of the reserved area; existing CMA has a higher likelihood of failure. Since CMA usage is often restricted to a problematic device or two with known needs, sizing the reserved area for a specific system should be straightforward.
The other advantage of guaranteed CMA is latency; if the memory is available, it can be allocated quickly. CMA in current kernels may have to migrate data out of the allocated region first, which takes time. The downside is that the memory reserved for guaranteed CMA can only be used for data that can be dropped at will; that will increase the pressure on the rest of the memory in the system.
This patch series was posted just ahead of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, where it is currently scheduled for a discussion in the memory-management track. There will probably not be a lot of comments on it ahead of that discussion. The patches are relatively small, though, and do not intrude into the memory-management subsystem on systems where CMA is not in use, so we might just see a transcendent-memory application actually go forward, some 15 years after the idea was first proposed.
Very interesting solution

Posted Mar 21, 2025 19:38 UTC (Fri) by aviallon (subscriber, #157205) [Link]
Please explain this a bit more...

Posted Mar 22, 2025 6:39 UTC (Sat) by gwolf (subscriber, #14632) [Link] (5 responses)
Posted Mar 22, 2025 9:34 UTC (Sat) by Wol (subscriber, #4433) [Link]
In one word - performance.
Also
> Besides having larger contiguous areas that could be used i.e. for DMA transfers or some IPC..
I believe with certain bits of hardware you need to substitute the word "could" with "must".
At the end of the day, spending ten otherwise idle cycles to gain 100 cycles when under pressure is money well spent, and I get the impression the gains are much bigger than that.
Quite possibly not B.Sc. level though.
Cheers,
Wol
Posted Mar 22, 2025 11:32 UTC (Sat) by farnz (subscriber, #17727) [Link]
I see two use cases for this, one embedded, one laptop/desktop:

Both of these are cases where it'd be nice to be able to allocate the large contiguous memory needed at runtime; for the iGPU, you could then reallocate (resetting the iGPU in the process) from whatever the BIOS set to something larger (if you're doing things that benefit from the GPU having more VRAM, rather than accessing "system memory"), or smaller (if your manufacturer sets it high for gaming, LLMs, etc., but you're just writing code).
Posted Mar 22, 2025 17:36 UTC (Sat) by excors (subscriber, #95769) [Link]
Apparently the Pi4 added an IOMMU for its 3D block, so you can now use regular virtual memory for that, but other parts of the GPU (camera, video, VPU, etc) still need contiguous physical memory.
One extra complication on RPi is that parts of the GPU can only access the first 1GB of physical RAM (it was designed with 32-bit addressing, but the top 2 bits were used to control cache behaviour). So buffers must be contiguous _and_ allocated from a specific region of RAM. And there are hardware bugs that make some allocations even more constrained.
(I'm not sure if there are very compelling reasons to design hardware without an IOMMU. I'd speculate it's largely because the hardware people see that IOMMUs come with non-zero performance cost, area cost, IP licensing cost, complexity, etc, and they don't really care if they're making life harder for the software people, so their default is to not put one in. And once the software people have got a hack like CMA that sort of works, there's little incentive to fix it in the next generation of the hardware.)
Posted Mar 22, 2025 18:46 UTC (Sat) by iabervon (subscriber, #722) [Link] (1 responses)
This doesn't require being able to get contiguous memory, since the kernel can just not use the larger pages, but it is another reason that the kernel would care and feeds into other memory management features the kernel has.
Posted Mar 22, 2025 23:45 UTC (Sat) by willy (subscriber, #9762) [Link]