The problem of allocating large chunks of physically-contiguous memory has
often been discussed in these pages. Virtual memory, by its nature, tends
to scatter pages throughout the system; the kernel does not have to be
running for long before free pages which happen to be next to each other
become scarce. For many years, the way kernel developers have dealt with
this problem has been to avoid dependencies on large contiguous allocations
whenever possible. Kernel code which tries to allocate more than two
physically-contiguous pages is rare.
Recently the need for large contiguous allocations has been growing. One
source of demand is huge pages, and the transparent huge pages feature in particular.
Another is an old story with a new twist: hardware which is unable to
perform scatter/gather DMA. Any device which can only do DMA to a
physically contiguous area requires (in the absence of an I/O memory
management unit) a physically contiguous buffer to work with. This
requirement is often a sign of relatively low-end (stupid) hardware; one
could hope that such hardware would become scarce over time. What we are
seeing, though, are devices which have managed to gain capabilities while
retaining the contiguous DMA requirement. For example, there are video
capture engines which can grab full high-definition data, perform a number
of transformations on it, but still need a contiguous buffer for the
result. The advent of high definition video has aggravated the problem -
those physically-contiguous buffers are now quite a bit bigger and harder
to allocate than they were before.
Almost one year ago, LWN looked at the
contiguous memory allocator (CMA) patches which were meant to be an
answer to this problem. This patch set followed the venerable tradition of
reserving a chunk of memory at boot time for the sole purpose of satisfying
large allocation requests. Over the years, this technique has been used by
the "bigphysarea" patch, or simply by booting the kernel with a
mem= parameter that left a range of physical memory unused. The
pmem driver also allocates memory chunks from a reserved range. This
approach certainly works; nearly 20 years of experience verifies that. The
down side is that the reserved memory is not available for any other use;
if the device is not using the memory, it simply sits idle. That kind of
waste tends to be unpopular with kernel developers - and users.
For that reason and more, the CMA patches were never merged. The problem
has not gone away, though, and neither have the developers who are working
on it. The latest version of the CMA patch
set looks quite a bit different; while some issues still need to be
resolved, this patch set looks like it may have a much better chance of
getting into the mainline.
The CMA allocator can still work with a reserved region of memory, but that
is clearly not the intended mode of operation. Instead, the new CMA tries
to maintain regions of memory where contiguous chunks can be created when
the need arises.
To that end, CMA relies on the "migration type" mechanism built deeply into
the memory management code.
Within each zone, blocks of pages are marked
as holding pages which are (or are not) movable or reclaimable.
Movable pages are, primarily, page cache or anonymous memory pages; they
are accessed via page tables and the page cache radix tree. The contents
of such pages can be moved somewhere else as long as the tables and tree
are updated accordingly. Reclaimable pages, instead, cannot be moved, but
they can often be given back to the kernel on demand; they hold data
structures like the
inode cache. Unmovable pages are usually those for which the kernel has
direct pointers; memory obtained from kmalloc() cannot normally be
moved without breaking things, for example.
The memory management subsystem tries to keep movable pages together. If
the goal is to free a larger chunk by moving pages out of the way, it only
takes a single nailed-down page to ruin the whole effort. By grouping
movable pages, the kernel hopes to be able to free larger ranges on demand
without running into such snags. The memory
compaction code relies on these ranges of movable pages to be able to
do its job.
CMA extends this mechanism by adding a new "CMA" migration type; it works
much like the "movable" type, but with a couple of differences. The "CMA"
type is sticky; pages which are marked as being for CMA should never have
their migration type changed by the kernel. The memory allocator will
never allocate unmovable pages from a CMA area, and, for any other use, it
only allocates CMA pages when alternatives are not available. So, with
luck, the areas of memory which are marked for use by CMA should contain
only movable pages, and it should have a relatively high number of free
pages as well.
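For reference, this grouping shows up in the page allocator as a small enum; the CMA patches add one more value to it. The sketch below is from memory; exact names and ordering vary between kernel versions:

```c
/* Sketch of the migration types in include/linux/mmzone.h; the CMA
 * patch set adds MIGRATE_CMA alongside the existing types. */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_CMA,		/* sticky: never converted to another type */
	MIGRATE_ISOLATE,
	MIGRATE_TYPES
};
```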
In other words, memory which is marked for use by CMA remains available to
the rest of the system with the one restriction that it can only contain
movable pages. When a driver comes along with a need for a contiguous
range of memory, the CMA allocator can go to one of its special ranges and
try to shove enough pages out of the way to create a contiguous buffer of
the appropriate size. If the pages contained in that area are truly
movable (the real world can get in the way sometimes), it should be
possible to give that driver the buffer it needs. When that buffer is not
needed, though, the memory can be used for other purposes.
One might wonder why this mechanism is needed, given that memory compaction
can already create large physically-contiguous chunks for transparent
hugepages. That code works: your editor's system, as of this writing, has
about 25% of its memory allocated as huge pages. The answer is that DMA
buffers present some different requirements than huge pages. They may be
larger, for example; transparent huge pages are 2MB on most architectures,
while DMA buffers can be 10MB or more. There may be a need to place DMA buffers
in specific ranges of memory if the underlying hardware is weird enough -
and CMA developer Marek Szyprowski seems to have some weird hardware
indeed. Finally, a 2MB huge page must also have 2MB alignment, while the
alignment requirements for DMA buffers are normally much more relaxed. The
CMA allocator can grab just the required amount of memory (without rounding
the size up to the next power of two as is done in the buddy allocator)
without worrying about overly stringent alignment demands.
The CMA patch set provides a set of functions for setting up regions of
memory and creating "contexts" for specific ranges. Then there are simple
cm_alloc() and cm_free() functions for obtaining and
releasing buffers. It is expected, though, that device drivers will never
invoke CMA directly; instead, awareness of CMA will be built into the DMA
support functions. When a driver calls a function like
dma_alloc_coherent(), CMA will be used automatically to satisfy
the request. In most situations, it should "just work."
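From a driver's point of view, that might look something like the following fragment. The surrounding names (capture_alloc(), CAPTURE_BUF_SIZE, and so on) are hypothetical; only the dma_alloc_coherent() and dma_free_coherent() calls are the real DMA API:

```c
#include <linux/dma-mapping.h>

/* Hypothetical driver fragment: with CMA wired into the DMA layer,
 * an ordinary dma_alloc_coherent() call can be satisfied from a CMA
 * region with no CMA-specific code in the driver at all. */

#define CAPTURE_BUF_SIZE	(10 * 1024 * 1024)	/* 10MB, contiguous */

static void *capture_buf;
static dma_addr_t capture_dma;

static int capture_alloc(struct device *dev)
{
	capture_buf = dma_alloc_coherent(dev, CAPTURE_BUF_SIZE,
					 &capture_dma, GFP_KERNEL);
	return capture_buf ? 0 : -ENOMEM;
}

static void capture_free(struct device *dev)
{
	dma_free_coherent(dev, CAPTURE_BUF_SIZE, capture_buf, capture_dma);
}
```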
One of the remaining concerns about CMA has to do with how the special
regions are set up in the first place. The current scheme expects that
some special calls will be made in the system's board file; it is a very
ARM-like approach. The intent is to get rid of board files, so something
else will have to be found. Moving this information to the device tree is
not really an option either since, as Arnd Bergmann pointed out, it is really a policy decision.
Arnd is pushing for some sort of reasonable default setup that works on
most systems; quirks for systems with special challenges can be added
later.
The end result is that there's likely to be at least one more iteration of
this patch set before it gets into the mainline. But CMA addresses a real
need which has been met, thus far, with out-of-tree hacks of varying
kludginess. This code has the potential to make physically-contiguous
allocations far more reliable while minimizing the impact on the rest of
the system. It seems worth having.