
CMA and ARM

By Jonathan Corbet
July 5, 2011
LWN recently looked (again) at the contiguous memory allocator (CMA) patch set; CMA is intended to provide large, contiguous DMA buffers to drivers without requiring that memory be set aside for that exclusive purpose. CMA was recently reposted with the idea that it is nearly ready for merging. There is a clear desire to see this code get at least into the -mm tree, even if it is not yet quite ready for the mainline. Most reviewers are pleased with CMA; it would seem that there are very few roadblocks remaining. Except that, as it turns out, one big obstacle remains.

Over the years, LWN has also looked at ARM's special memory management challenges. Recent ARM CPUs are, like those implementing other architectures, becoming more complex in order to improve performance. So ARM processors can now do speculative prefetching of memory contents in surprising ways. This prefetching works well on cached memory, but should not be used on memory that has been marked as uncached. An additional complication comes from the fact that virtual memory systems can have more than one mapping for a given range of memory, and caching is a feature of the mapping, not the memory itself. So one might well wonder what happens if different mappings have different caching attributes. On recent ARM processor designs, what happens is officially undefined; in practice, it can mean problems like corrupted memory, machine checks, or simple hangs. As it happens, kernel developers normally go out of their way to avoid that kind of behavior.
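The point that caching belongs to the mapping rather than to the memory can be made concrete with a small userspace model. This is only a sketch under stated assumptions: the mapping structure, the cache_attr enum, and find_attr_conflict() are hypothetical names invented for illustration and do not correspond to any kernel data structure or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the caching attribute lives in each mapping. */
enum cache_attr { ATTR_CACHED, ATTR_UNCACHED };

struct mapping {
    uintptr_t virt;        /* virtual start address of the mapping */
    uint64_t phys;         /* physical start address it points at */
    size_t len;            /* length in bytes */
    enum cache_attr attr;  /* caching attribute of this mapping */
};

/* Return nonzero if two mappings share at least one physical byte. */
static int phys_overlap(const struct mapping *a, const struct mapping *b)
{
    return a->phys < b->phys + b->len && b->phys < a->phys + a->len;
}

/*
 * Scan every pair of mappings. Return 1 and store the indices of the
 * first pair that aliases the same physical memory with different
 * caching attributes; return 0 if no such pair exists.
 */
int find_attr_conflict(const struct mapping *maps, size_t n,
                       size_t *first, size_t *second)
{
    assert(maps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (phys_overlap(&maps[i], &maps[j]) &&
                maps[i].attr != maps[j].attr) {
                *first = i;
                *second = j;
                return 1;
            }
        }
    }
    return 0;
}
```

In this model, a cached mapping covering a large physical range and an uncached mapping of a buffer inside that range are reported as a conflicting pair, which is the situation the architecture leaves undefined.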

The current CMA mechanism is used as an allocator behind dma_alloc_coherent(), which creates a cache-coherent DMA buffer. In the absence of bus-snooping hardware that is able to notice when a DMA transfer changes memory, "cache-coherent" is likely to mean simply "uncached." So CMA must, on such systems, create an uncached range of memory to hand back to the requesting driver. That is easily done, and all should be well...at least, unless there happens to be another mapping to the same memory with different caching attributes.

Unfortunately, conflicting mappings can come about easily on a Linux system. One of the first things the kernel does as it boots is to create a "linear mapping" which provides kernel-space virtual addresses for most or all of the memory present in the system. The kernel cannot manipulate memory directly without such a mapping; putting as much memory as possible into a persistent mapping thus makes sense. On a 32-bit system, just under 1GB of memory can be mapped this way (64-bit systems can always map all of memory and will be able to do so for quite some time yet). This kernel-mapped memory is called "low memory"; almost all allocations of memory for the kernel's use come from the low memory area. Naturally, low memory is mapped with caching enabled; to do otherwise would destroy the performance of the system. If a region of low memory is turned into a DMA buffer with an uncached mapping, the system will have two conflicting mappings for the same memory and will have moved into "undefined behavior" territory.
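The "just under 1GB" figure follows from simple address-space arithmetic. The sketch below assumes a 3GB/1GB user/kernel split (kernel space starting at 0xC0000000) and 128MB of kernel address space held back for other uses; both numbers are illustrative assumptions that vary with architecture and configuration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ull * 1024ull)

/*
 * Size in bytes of the linear ("low memory") mapping on a 32-bit system,
 * given where kernel space starts and how much of it is reserved for
 * other purposes. Both inputs are assumptions supplied by the caller,
 * not values read from a real kernel.
 */
uint64_t lowmem_limit(uint64_t page_offset, uint64_t reserved)
{
    const uint64_t address_space = 1ull << 32;  /* 32-bit virtual space */

    assert(page_offset < address_space);
    uint64_t kernel_space = address_space - page_offset;
    assert(reserved <= kernel_space);
    return kernel_space - reserved;
}

/* Amount of installed RAM that falls outside the linear mapping. */
uint64_t highmem_size(uint64_t ram, uint64_t lowmem)
{
    return ram > lowmem ? ram - lowmem : 0;
}
```

With those assumed inputs, lowmem_limit(0xC0000000, 128 * MiB) yields 896MB of low memory; a machine with 2GB of RAM would then have 1152MB of high memory, while one with 512MB would have none.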

These conflicting mappings are the reason behind ARM maintainer Russell King's strong opposition to the merging of CMA in its current form. He believes that the code is unsafe on ARM systems; it should not, he says, be merged until the mapping problem has been solved. The interesting thing is that the existing DMA API has the same problem on ARM; dma_alloc_coherent() uses vanilla alloc_pages() to obtain a buffer, then changes the caching attributes before giving the buffer back to the caller. The addition of CMA does not make ARM's DMA API any more or less safe than it was before; it just perpetuates an existing problem.

Russell has a patch pending for 3.1 which addresses this problem by setting aside a chunk of memory which is never mapped into the kernel's address space. With this memory pool available, coherent DMA mappings can be set up without endangering the operation of the system. The whole reason CMA exists, though, is to provide large, contiguous buffers without the need to set aside memory; Russell's approach thus defeats the entire purpose. The pressures which have led to the creation of CMA will not go away anytime soon, so it seems that another solution is needed. Arnd Bergmann has outlined two possibilities, neither of which is entirely pleasant:
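To see why a set-aside pool is safe but costly, consider a toy userspace model of such a pool. This is not Russell's patch; it is a minimal first-fit sketch, and the dma_pool_model structure, POOL_PAGES, and the pool_*() functions are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_PAGES 64  /* assumed pool size in pages, fixed at "boot" */

struct dma_pool_model {
    unsigned char used[POOL_PAGES];  /* 1 = page handed out, 0 = free */
};

void pool_init(struct dma_pool_model *p)
{
    memset(p->used, 0, sizeof(p->used));
}

/*
 * First-fit search for `count` contiguous free pages.
 * Returns the index of the first page, or -1 if no free run is long enough.
 */
int pool_alloc(struct dma_pool_model *p, size_t count)
{
    size_t run = 0;

    if (count == 0 || count > POOL_PAGES)
        return -1;
    for (size_t i = 0; i < POOL_PAGES; i++) {
        run = p->used[i] ? 0 : run + 1;
        if (run == count) {
            size_t start = i + 1 - count;
            memset(&p->used[start], 1, count);
            return (int)start;
        }
    }
    return -1;
}

/* Return `count` pages starting at index `start` to the pool. */
void pool_free(struct dma_pool_model *p, int start, size_t count)
{
    assert(start >= 0 && (size_t)start + count <= POOL_PAGES);
    memset(&p->used[start], 0, count);
}
```

Because the pool is never part of the linear mapping, nothing handed out from it can alias a cached mapping. The cost is visible in the model as well: all POOL_PAGES pages are withheld from the rest of the system whether or not any driver is using them, and the pool can still fragment.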

  • CMA could be changed to only allocate from the high memory zone. High memory is (by definition) not in the kernel's linear mapping, so no other mappings should exist. The problem with this approach is that it forces the use of high memory on all systems; ARM-based systems are reaching the point where some of them need high memory anyway, but that need is not yet universal. Getting enough memory into the high memory zone to be useful could require moving the boundary and shrinking low memory; that is not desirable because low memory is often a limiting resource already. Even if that obstacle can be overcome, the ARM architecture poses unique challenges which would make a high memory implementation hard.

  • Memory that has been turned into a coherent DMA buffer could simply be removed from the kernel's linear mapping until the buffer is no longer needed. This approach seems simple until one remembers that the kernel uses huge pages for the linear mapping. Splitting those huge pages into smaller pages would increase translation lookaside buffer (TLB) contention, reducing the performance of the system as a whole.
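The cost of the second option can be estimated with a little arithmetic. The sketch below assumes that the linear mapping is built from 1MB entries and that a split entry is replaced by 4KB pages, sizes typical of 32-bit ARM but treated here as assumptions; split_cost and unmap_cost() are hypothetical names, and the model counts page-table entries only, not actual TLB behavior.

```c
#include <assert.h>
#include <stdint.h>

#define SMALL_PAGE   (4u * 1024u)     /* assumed small-page size */
#define SECTION_SIZE (1024u * 1024u)  /* assumed huge-mapping size */

struct split_cost {
    uint32_t sections_touched;  /* huge mappings overlapping the buffer */
    uint32_t sections_split;    /* huge mappings only partly covered */
    uint32_t small_entries;     /* small-page entries still mapped afterwards */
};

/*
 * Estimate what removing [start, start + len) from the linear mapping
 * costs. A huge mapping fully covered by the buffer can be dropped whole;
 * one that is only partly covered must be split so that the pages outside
 * the buffer stay mapped.
 */
struct split_cost unmap_cost(uint64_t start, uint64_t len)
{
    struct split_cost c = {0, 0, 0};

    assert(len > 0 && start % SMALL_PAGE == 0 && len % SMALL_PAGE == 0);
    uint64_t end = start + len;
    uint64_t first = start / SECTION_SIZE;
    uint64_t last = (end - 1) / SECTION_SIZE;

    for (uint64_t s = first; s <= last; s++) {
        uint64_t sec_start = s * SECTION_SIZE;
        uint64_t sec_end = sec_start + SECTION_SIZE;
        uint64_t lo = start > sec_start ? start : sec_start;
        uint64_t hi = end < sec_end ? end : sec_end;
        uint64_t covered = hi - lo;

        c.sections_touched++;
        if (covered < SECTION_SIZE) {
            c.sections_split++;
            c.small_entries +=
                (uint32_t)((SECTION_SIZE - covered) / SMALL_PAGE);
        }
    }
    return c;
}
```

In this model, a 16MB buffer aligned to a 1MB boundary splits nothing, while the same buffer offset by 512KB splits only its first and last huge mappings, leaving 256 small-page entries in their place. A single 4KB buffer is the worst case per byte: one huge mapping becomes 255 small ones.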

Compared to these alternatives, simply setting aside a chunk of memory at boot time might not look like such a bad idea after all. CMA developer Marek Szyprowski's plan appears to be to go with the second of those two alternatives; he thinks that it can be done without significantly hurting performance.

In truth, the best tradeoff will almost certainly differ from one platform to the next. In some situations, memory will be tight enough that a significant runtime penalty to avoid making static DMA buffers seems worthwhile; on others, setting aside a bit of memory may not be a real problem. So what may come of all this is a set of choices to be made when configuring a kernel. There does not appear to be a single solution on the horizon which just works for everybody.



CMA and ARM

Posted Oct 18, 2011 2:32 UTC (Tue) by bgat (guest, #20709) [Link]

I don't think the performance degradation due to memory splits will be significant, because CMA will not be used as a general allocator: it will be used for very large allocations, of which there will be few. Thus we are talking about a small number of splits in a typical system.

But Russell's objection is a significant one. If we go with CMA as-is, we risk an API whose behavior changes across platforms. We can't have that, or the whole point of CMA (in my view) is lost. We may as well stick with the magic numbers in our code that we have now.

Unfortunately, CMA is at least two tangentially-related things at once: a command-line syntax for expressing large memory regions that the kernel shall avoid, and an implementation for allocations from those regions. Embedded developers are in critical need of both, but I for one would be pretty happy if we could just get the API in for now so that we could standardize what the solution would look like.

At least in the near term, I wouldn't care if the mainlined "CMA" API was just dma_alloc_coherent() under the hood. I am willing to accept severe caveats in its use so long as the drivers I am writing now stay reasonably source-compatible with the implementation of CMA as it improves over time.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds