
DMA-BUF cache handling: Off the DMA API map (part 1)

June 4, 2020

This article was contributed by John Stultz

Recently, the DMA-BUF heaps interface was added to the 5.6 kernel. This interface is similar to ION, which has been used for years by Android vendors. However, in trying to move vendors to DMA-BUF heaps, we have begun to see how the DMA API model doesn't fit well with modern mobile devices. Additionally, the lack of clear guidance on how to handle cache operations efficiently results in vendors using custom, device-specific optimizations that aren't generic enough for an upstream solution. This article will describe the nature of the problem; the upcoming second installment will look at the path toward a solution.

The kernel's DMA APIs are all provided for the sharing of memory between the CPU and devices. The traditional DMA API has, in recent years, been joined by additional interfaces such as ION, DMA-BUF, and DMA-BUF heaps. But, as we will see, the problem of efficiently supporting memory sharing is not yet fully solved.

As an interface, ION was poorly specified, allowing applications to pass custom, opaque flags and arguments to vendor-specific, out-of-tree heap implementations. Additionally, since the users of these interfaces only ran on the vendors' devices with their custom kernel implementations, little attention was paid to trying to create useful generic interfaces. So multiple vendors might use the same heap ID for different purposes, or they might implement the same heap functionality but using different heap IDs and flag options. Even worse, many vendors drastically changed the ION interface and implementation itself, so that there was little in common between vendor ION implementations other than their name and basic functionality. ION essentially became a playground for out-of-tree and device-specific vendor hacks.

Meanwhile, the general dislike of the interface upstream meant that objections to the API often obfuscated the deeper problems that vendors were using ION to solve. But now that the DMA-BUF heaps interface is upstream, some vendors are trying to migrate from their ION heap implementations (and, hopefully, submit the result upstream). In doing so, they are starting to wonder how they will implement some of the functionality and optimizations they were able to obtain with ION while using the more constrained DMA-BUF heaps interface.

A side effect of trying to cajole vendors into pushing their heap functionality upstream is learning more about the details and complexities of how vendors use DMA-BUFs. Since performance is important to mobile vendors, they spend lots of time and effort optimizing how data moves through the device. Specifically, they use buffer sharing not just for moving data between the CPU and a device, but for sharing data between different devices in a pipeline. Often, data is generated by one device, then processed by multiple other devices without the CPU ever accessing it.

For example, a camera sensor may capture raw data to a buffer; that buffer is then passed to an image signal processor (ISP), which applies a set of corrections and adjustments. The ISP will generate one buffer that is passed directly to the display compositor and rendered directly to the screen. The ISP also produces a second buffer that is converted by an encoder to produce yet another buffer that can then be passed to a neural-network engine for face detection (which is then used for focus correction on future frames).

This model of multi-device buffer sharing is common in mobile systems, but isn't as common upstream, and it exposes some limitations of the existing DMA API — particularly when it comes to cache handling. Note that while both the CPU and devices can have their own caches, in this article I'm specifically focusing on the CPU cache; device caches are left to be handled by their respective device drivers.

The DMA API

When we look at the existing DMA API, we see that it implements a clear model that handles memory sharing between the CPU and a single device. The DMA API is particularly careful about how "ownership" — with respect to the CPU cache — of a buffer is handled, in order to avoid data corruption. By default, memory is considered part of the CPU's virtual memory space and the CPU is the de facto owner of it. It is assumed that the CPU may read and write the memory freely; it is only when allowing a device to do a DMA transaction on the memory that the ownership of the memory is passed to the device.

The DMA API describes two types of memory architecture, called "consistent" and "non-consistent" (or sometimes "coherent" and "non-coherent"). With consistent-memory architectures, changes to memory contents (even when done by a device) will cause any cached data to be updated or invalidated. As a result, a device or CPU can read memory immediately after a device or CPU writes to it without having to worry about caching effects (though the DMA API notes that the CPU cache may need to be flushed before devices can read). Much of the x86 world deals with consistent memory (with some exceptions, usually involving GPUs); in the Arm world, however, we see many devices that are not coherent with the CPU and thus have non-consistent-memory architectures. That said, as Arm64 devices gain functionality like PCIe, there can often be a mix of coherent and non-coherent devices in a system.

With non-consistent memory, additional care has to be taken to properly handle the cache state of the CPU to avoid corrupting data. If the DMA API's ownership rules are not followed, the device could write to memory without the CPU's knowledge; that could cause the CPU to use stale data in its cache. Similarly, the CPU could flush stale data from its cache to overwrite the newly device-written memory. Data corruption is likely to result either way.

If you're interested in learning more, Laurent Pinchart's ELC 2014 presentation on the DMA API is great; the slides [PDF] are also available.

Thus, the DMA API rules help establish proper cache handling in a generic fashion, ensuring that the CPU cache is invalidated if the device is writing to the memory and flushed before the device reads the memory. Normally, these cache operations are done when the buffer ownership is transferred between the CPU and the device, such as when the memory is mapped and then unmapped from the DMA device (via functions like dma_map_single()).
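This ownership hand-off can be sketched as a small state machine. The following is a user-space toy model, not kernel code; the toy_* names and the single cache_ops counter are invented for illustration of the rule that a cache operation accompanies each transfer of ownership:

```c
#include <assert.h>

/* Toy model of DMA API buffer ownership; the names here are
 * illustrative, not real kernel interfaces. */
enum owner { OWNER_CPU, OWNER_DEVICE };

struct toy_buffer {
    enum owner owner;
    int cache_ops;   /* count of CPU cache flush/invalidate operations */
};

/* dma_map_single() analogue: ownership passes CPU -> device, and the
 * CPU cache is flushed so the device sees current data. */
static void toy_map(struct toy_buffer *buf)
{
    assert(buf->owner == OWNER_CPU);  /* ownership rules must be followed */
    buf->cache_ops++;                 /* flush CPU cache for the device */
    buf->owner = OWNER_DEVICE;
}

/* dma_unmap_single() analogue: ownership returns device -> CPU, and the
 * CPU cache is invalidated so stale lines are not read later. */
static void toy_unmap(struct toy_buffer *buf)
{
    assert(buf->owner == OWNER_DEVICE);
    buf->cache_ops++;                 /* invalidate CPU cache */
    buf->owner = OWNER_CPU;
}
```

The asserts encode the API contract: mapping a buffer that the CPU does not own (or unmapping one the device does not own) is a rule violation of the kind that leads to the corruption described above.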

From the DMA API perspective, sharing buffers with multiple devices is the same as sharing with a single device, except that the sharing is done in a series of discrete operations. The CPU allocates a buffer, then passes ownership of that buffer to the first device (potentially flushing the CPU cache). The CPU then allows the device to do the DMA and unmaps the buffer (potentially invalidating the CPU cache) when the operation is complete, bringing the ownership back to the CPU. The process is then repeated for each subsequent device in the pipeline.

The problem here is that those cache operations add up, especially when the CPU isn't actually touching the buffer in between. Ideally, if we were sharing the buffer with a series of cache-incoherent devices, the CPU cache would be initially flushed, then the buffer could be used by devices in series without additional cache operations. The DMA API does allow for some flexibility here, so there are ways to have mapping operations skip CPU syncing; there are also the dma_sync_*_for_cpu/device() calls which allow explicit cache operations to be done while there is an existing mapping. But these are "expert-only" tools provided without much guidance, and they trust that drivers take special care when using these optimizations.
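To see how those operations add up, here is a toy count (illustrative arithmetic only; strict_cache_ops() and pipelined_cache_ops() are made-up names) comparing strict per-device map/unmap against an ideal pipeline that flushes once before the first device and invalidates once after the last:

```c
#include <assert.h>

/* CPU cache-maintenance operations for one buffer passed through a
 * series of non-coherent devices, with the CPU never touching the data
 * in between.  Toy arithmetic, not a kernel interface. */

/* Strict DMA API usage: each device costs a flush at map time and an
 * invalidate at unmap time. */
static int strict_cache_ops(int n_devices)
{
    int ops = 0, i;

    for (i = 0; i < n_devices; i++)
        ops += 2;   /* flush on map + invalidate on unmap */
    return ops;
}

/* Pipeline-aware sharing (dma_sync_*-style explicit control): one
 * flush up front, one invalidate at the end, regardless of how many
 * devices touch the buffer in between. */
static int pipelined_cache_ops(int n_devices)
{
    return n_devices ? 2 : 0;
}
```

For the four-device camera pipeline described earlier, that is eight cache operations on every frame versus two; on large image buffers at 30 or 60 frames per second, the difference is substantial.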

DMA-BUFs

DMA-BUFs were introduced to provide a generic way for applications and drivers to share a handle to a memory buffer. The DMA-BUFs themselves are created by a DMA-BUF exporter, which is a driver that can allocate a specific type of memory but that also provides hooks to handle mapping and unmapping the buffer in various ways for the kernel, user space, or devices.

The general usage flow of DMA-BUFs for a device is as follows (see the dma_buf_ops structure for more details):

dma_buf_attach()
Attaches the buffer to a device (that will use the buffer in the future). The exporter can try to move the buffer if needed to make it accessible to the new device or return an error. The buffer can be attached to multiple devices.

dma_buf_map_attachment()
Maps the buffer into an attached device's address space. The buffer can be mapped by multiple attachments.

dma_buf_unmap_attachment()
Unmaps the buffer from the attached device's address space.

dma_buf_detach()
Signals that the device is finished with the buffer; the exporter can do whatever cleanup it needs.
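The four calls pair up: attach/detach bracket the attachment's lifetime, while map/unmap bracket each use of the device's address space. A toy state machine (invented toy_* names, not the kernel's real dma_buf structures) captures the expected ordering:

```c
#include <assert.h>

/* Toy model of the DMA-BUF attachment life cycle; illustrative only,
 * not the kernel's dma_buf_ops. */
enum attach_state { DETACHED, ATTACHED, MAPPED };

struct toy_attachment {
    enum attach_state state;
};

static void toy_attach(struct toy_attachment *a)          /* dma_buf_attach() */
{
    assert(a->state == DETACHED);
    a->state = ATTACHED;
}

static void toy_map_attachment(struct toy_attachment *a)  /* dma_buf_map_attachment() */
{
    assert(a->state == ATTACHED);   /* must attach before mapping */
    a->state = MAPPED;
}

static void toy_unmap_attachment(struct toy_attachment *a)
{
    assert(a->state == MAPPED);
    a->state = ATTACHED;            /* still attached, can be mapped again */
}

static void toy_detach(struct toy_attachment *a)          /* dma_buf_detach() */
{
    assert(a->state == ATTACHED);   /* must unmap before detaching */
    a->state = DETACHED;
}
```

A real buffer can have several such attachments live at once, one per device sharing it; each walks this cycle independently.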

If we were looking at this from the classic DMA API perspective, we would consider a DMA-BUF to be normally owned by the CPU. Only when dma_buf_map_attachment() was called would the buffer ownership transfer to the device (with the associated cache flushing). Then on dma_buf_unmap_attachment(), the buffer would be unmapped and ownership would return to the CPU (again with the proper cache invalidation required). This in effect would make the DMA-BUF exporter the entity responsible for complying with the DMA API rules of ownership.

The trouble with this scheme arises with a buffer pipeline consisting of a number of devices, where the CPU doesn't actually touch the buffer. Following the DMA API and calling dma_map_sg() and dma_unmap_sg() on each dma_buf_map_attachment() and dma_buf_unmap_attachment() call results in lots of cache-maintenance operations, which dramatically impacts performance. This was viscerally felt by ION users after a cleanup series landed in 4.12 that caused ION to use the DMA API properly. Previously, it had lots of hacks and was not compliant with the DMA API, resulting in buffer corruption in some cases; see the slides from Laura Abbott’s presentation for more details. This compliance cleanup caused a dramatic performance drop for ION users, which resulted in some vendors reverting back to the 4.9 ION code in their 4.14-based products, and others creating their own hacks to improve performance.

So how can we have DMA-BUF exporters that better align with the DMA API, but do so with the performance needed for modern devices when using buffer pipelines with multiple devices? In the second part of this article, we will continue discussing some of the unique semantics and flexibility in DMA-BUF that allows drivers to potentially avoid this performance impact (by going somewhat "off-road" from the DMA API usage guidelines), as well as the downsides of what that flexibility allows. Finally, we'll share some thoughts as to how these downsides might be avoided.

Index entries for this article
Kernel: Device drivers/Support APIs
Kernel: Direct memory access
Kernel: ION
GuestArticles: Stultz, John




Posted Jun 5, 2020 18:56 UTC (Fri) by estansvik (guest, #127963) [Link]

A true cliffhanger this is. Looking forward to part 2.


Posted Jun 8, 2020 12:09 UTC (Mon) by ncultra (guest, #121511) [Link]

Well-written and useful, thank you.


Posted Jun 11, 2020 18:46 UTC (Thu) by harisphnx (subscriber, #139363) [Link]

Would an additional function like "dma_buf_transfer_mapping()" do the trick?

This can simply map the buffer to a new device's address space without actually handing it back to the CPU.


Posted Jun 16, 2020 13:38 UTC (Tue) by punit (subscriber, #87729) [Link] (4 responses)

Nice writeup!

Though there is one thing that doesn't seem obvious -

"the buffer would be unmapped and ownership would return to the CPU (again with the proper cache invalidation required)"

Why do the CPU caches need to be invalidated when transferring ownership of the buffer back from the device? One reason I could think of was speculative fetches, but that shouldn't be the case if the buffer is not mapped. What am I missing?


Posted Jun 16, 2020 19:34 UTC (Tue) by excors (subscriber, #95769) [Link] (3 responses)

If I understand correctly, the buffer typically never gets unmapped from the CPU's page table, because that would be an unnecessary performance cost. On dma_map_*/dma_unmap_*, it just gets added to / removed from the device's IOMMU page table or equivalent (if there is one).

Software on the CPU has to promise not to touch the buffer while the device is using it, else there will be unpredictable behaviour due to non-coherent caches. But the CPU hardware might read that memory anyway (prefetching, speculative execution, etc), so the CPU caches have to be invalidated after the device has stopped writing and before any software starts reading. That happens automatically when unmapped from the device, or can be done manually with the dma_sync_* APIs.


Posted Jun 17, 2020 14:41 UTC (Wed) by punit (subscriber, #87729) [Link]

Thanks for the explanation! Indeed skipping the memory unmap on CPU will save the cost of page table updates and tlb maintenance.


Posted Jun 18, 2020 6:00 UTC (Thu) by dxin (guest, #136611) [Link] (1 responses)

My question here is, why does releasing ownership from a device imply that the CPU must acquire the ownership? Why can't we let ownership "float" briefly? (If we are not sure the CPU needs it, then don't let the CPU acquire it and don't do cache operations on it.)


Posted Jun 18, 2020 20:53 UTC (Thu) by excors (subscriber, #95769) [Link]

It shouldn't imply that (but presumably there are historical reasons why it does). The CPU shouldn't be so self-important - it's just one of many devices that might share access to some memory, many of which might have their own page tables and caches.

Part 2 (https://lwn.net/Articles/822521/) discusses some moves in that direction.

I think ION already started from the extreme of that direction: its typical use case is e.g. to have the camera write an uncompressed video frame to memory, then pass it simultaneously to the video encoder and to the 3D graphics engine (which renders onto a framebuffer that then gets passed to the display hardware), then recycle the buffer once they have all finished with it. The CPU never touches the pixels. Often the pixels are stored in a proprietary cache-efficient tiled layout that makes it infeasible for the CPU to directly access the pixels at all. (When necessary for compatibility, the image can be copied and format-converted into a CPU-accessible format, but performance may be very bad). To replace ION, DMA-BUF has to support that kind of thing.


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds