DMA-BUF cache handling: Off the DMA API map (part 1)
Recently, the DMA-BUF heaps interface was added to the 5.6 kernel. This interface is similar to ION, which has been used for years by Android vendors. However, in trying to move vendors to use DMA-BUF heaps, we have begun to see how the DMA API model doesn't fit well for modern mobile devices. Additionally, the lack of clear guidance on how to handle cache operations efficiently results in vendors using custom device-specific optimizations that aren't generic enough for an upstream solution. This article will describe the nature of the problem; the upcoming second installment will look at the path toward a solution.
The kernel's DMA APIs are all provided for the sharing of memory between the CPU and devices. The traditional DMA API has, in recent years, been joined by additional interfaces such as ION, DMA-BUF, and DMA-BUF heaps. But, as we will see, the problem of efficiently supporting memory sharing is not yet fully solved.
As an interface, ION was poorly specified, allowing applications to pass custom, opaque flags and arguments to vendor-specific, out-of-tree heap implementations. Additionally, since the users of these interfaces only ran on the vendors' devices with their custom kernel implementations, little attention was paid to trying to create useful generic interfaces. So multiple vendors might use the same heap ID for different purposes, or they might implement the same heap functionality but using different heap IDs and flag options. Even worse, many vendors drastically changed the ION interface and implementation itself, so that there was little in common between vendor ION implementations other than their name and basic functionality. ION essentially became a playground for out-of-tree and device-specific vendor hacks.
Meanwhile, the general dislike of the interface upstream meant that objections to the API often obfuscated the deeper problems that vendors were using ION to solve. But now that the DMA-BUF heaps interface is upstream, some vendors are trying to migrate from their ION heap implementations (and, hopefully, submit the result upstream). In doing so, they are starting to wonder how they will implement some of the functionality and optimizations they were able to obtain with ION while using the more constrained DMA-BUF heaps interface.
A side effect of trying to cajole vendors into pushing their heap functionality upstream is learning more about the details and complexities of how vendors use DMA-BUFs. Since performance is important to mobile vendors, they spend lots of time and effort optimizing how data moves through the device. Specifically, they use buffer sharing not just for moving data between the CPU and a device, but for sharing data between different devices in a pipeline. Often, data is generated by one device, then processed by multiple other devices without the CPU ever accessing it.
For example, a camera sensor may capture raw data to a buffer; that buffer is then passed to an image signal processor (ISP), which applies a set of corrections and adjustments. The ISP will generate one buffer that is passed directly to the display compositor and rendered directly to the screen. The ISP also produces a second buffer that is converted by an encoder to produce yet another buffer that can then be passed to a neural-network engine for face detection (which is then used for focus correction on future frames).
This model of multi-device buffer sharing is common in mobile systems, but isn't as common upstream, and it exposes some limitations of the existing DMA API — particularly when it comes to cache handling. Note that while both the CPU and devices can have their own caches, in this article I'm specifically focusing on the CPU cache; device caches are left to be handled by their respective device drivers.
The DMA API
When we look at the existing DMA API, we see that it implements a clear model that handles memory sharing between the CPU and a single device. The DMA API is particularly careful about how "ownership" — with respect to the CPU cache — of a buffer is handled, in order to avoid data corruption. By default, memory is considered part of the CPU's virtual memory space and the CPU is the de-facto owner of it. It is assumed that the CPU may read and write the memory freely; it is only when allowing a device to do a DMA transaction on the memory that the ownership of the memory is passed to the device.
The DMA API describes two types of memory architecture, called "consistent" and "non-consistent" (or sometimes "coherent" and "non-coherent"). With consistent-memory architectures, changes to memory contents (even when done by a device) will cause any cached data to be updated or invalidated. As a result, a device or CPU can read memory immediately after a device or CPU writes to it without having to worry about caching effects (though the DMA API notes that the CPU cache may need to be flushed before devices can read). Much of the x86 world deals with consistent memory (with some exceptions, usually dealing with GPUs). In the Arm world, however, we see many devices that are not coherent with the CPU and are thus non-consistent-memory architectures. That said, as Arm64 devices gain functionality like PCIe, there can often be a mix of coherent and non-coherent devices on a system.
With non-consistent memory, additional care has to be taken to properly handle the cache state of the CPU to avoid corrupting data. If the DMA API's ownership rules are not followed, the device could write to memory without the CPU's knowledge; that could cause the CPU to use stale data in its cache. Similarly, the CPU could flush stale data from its cache to overwrite the newly device-written memory. Data corruption is likely to result either way.
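The two failure modes described above can be made concrete with a toy userspace model. This is only a sketch, not kernel code: it assumes a single cache line in front of a single word of memory, and every name in it (`toy_line`, `toy_cpu_read()`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one CPU cache line in front of one word of DRAM. */
struct toy_line {
    int mem;      /* value in DRAM; all a non-coherent device can see */
    int cached;   /* value held in the CPU cache */
    bool valid;   /* the CPU cache holds a copy */
    bool dirty;   /* the cached copy is newer than DRAM */
};

static int toy_cpu_read(struct toy_line *l)
{
    if (!l->valid) {            /* cache miss: fill from DRAM */
        l->cached = l->mem;
        l->valid = true;
        l->dirty = false;
    }
    return l->cached;           /* cache hit: DRAM is not consulted */
}

static void toy_cpu_write(struct toy_line *l, int v)
{
    l->cached = v;
    l->valid = true;
    l->dirty = true;
}

/* Write dirty data back to DRAM (what a flush before device access does). */
static void toy_cache_clean(struct toy_line *l)
{
    if (l->valid && l->dirty) {
        l->mem = l->cached;
        l->dirty = false;
    }
}

/* Drop the cached copy (what an invalidate after device writes does). */
static void toy_cache_invalidate(struct toy_line *l)
{
    l->valid = false;
    l->dirty = false;
}

/* An eviction that the CPU may perform on its own at any time. */
static void toy_cache_evict(struct toy_line *l)
{
    toy_cache_clean(l);
    l->valid = false;
}

/* Non-coherent device accesses go straight to DRAM. */
static void toy_dev_write(struct toy_line *l, int v) { l->mem = v; }
static int toy_dev_read(const struct toy_line *l) { return l->mem; }

/* Walks through both hazards; every assert holds in this model. */
static void toy_hazard_demo(void)
{
    struct toy_line l = { .mem = 1 };

    (void)toy_cpu_read(&l);             /* CPU caches the value 1 */
    toy_dev_write(&l, 2);               /* device updates DRAM behind the cache */
    assert(toy_cpu_read(&l) == 1);      /* hazard 1: CPU uses stale data */
    toy_cache_invalidate(&l);
    assert(toy_cpu_read(&l) == 2);      /* after an invalidate, data is fresh */

    toy_cpu_write(&l, 3);               /* dirty line, never handed over */
    toy_dev_write(&l, 4);
    toy_cache_evict(&l);                /* write-back overwrites the device's 4 */
    assert(toy_dev_read(&l) == 3);      /* hazard 2: device data is lost */
}
```

In this model, the ownership rules amount to cleaning the line before the device takes over and invalidating it before the CPU reads again; skipping either step produces one of the two hazards.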
From the DMA API perspective, sharing buffers with multiple devices is the same as sharing with a single device, except that the sharing is done in a series of discrete operations. The CPU allocates a buffer, then passes ownership of that buffer to the first device (potentially flushing the CPU cache). The CPU then allows the device to do the DMA and unmaps the buffer (potentially invalidating the CPU cache) when the operation is complete, bringing the ownership back to the CPU. Then the process is repeated for the next device and the device after.
The problem here is that those cache operations add up, especially when the CPU isn't actually touching the buffer in between. Ideally, if we were sharing the buffer with a series of cache-incoherent devices, the CPU cache would be initially flushed, then the buffer could be used by devices in series without additional cache operations. The DMA API does allow for some flexibility here, so there are ways to have mapping operations skip CPU syncing; there are also the dma_sync_*_for_cpu/device() calls which allow explicit cache operations to be done while there is an existing mapping. But these are "expert-only" tools provided without much guidance, and they trust that drivers take special care when using these optimizations.
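To get a feel for how the cache operations add up, here is a rough userspace model that counts them for a buffer passed through a series of non-coherent devices. It is a sketch under stated assumptions rather than kernel code: all names are hypothetical, the `skip_cpu_sync` flag merely stands in for the kind of behavior that the kernel's `DMA_ATTR_SKIP_CPU_SYNC` mapping attribute requests, and the "optimized" case assumes that one flush on the first map and one invalidate on the last unmap are sufficient because the CPU never touches the buffer in between.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_owner { TOY_OWNER_CPU, TOY_OWNER_DEVICE };

struct toy_buf {
    enum toy_owner owner;
    unsigned int cache_ops;     /* flushes plus invalidates so far */
};

/* Models a dma_map_*() call: ownership passes from the CPU to a device. */
static void toy_dma_map(struct toy_buf *b, bool skip_cpu_sync)
{
    assert(b->owner == TOY_OWNER_CPU);
    if (!skip_cpu_sync)
        b->cache_ops++;         /* flush the CPU cache */
    b->owner = TOY_OWNER_DEVICE;
}

/* Models a dma_unmap_*() call: ownership returns to the CPU. */
static void toy_dma_unmap(struct toy_buf *b, bool skip_cpu_sync)
{
    assert(b->owner == TOY_OWNER_DEVICE);
    if (!skip_cpu_sync)
        b->cache_ops++;         /* invalidate the CPU cache */
    b->owner = TOY_OWNER_CPU;
}

/*
 * Pass one buffer through ndev devices in series, with the CPU never
 * touching it in between, and return the number of cache operations.
 * With optimized set, only the first map and the last unmap sync.
 */
static unsigned int toy_pipeline_cache_ops(unsigned int ndev, bool optimized)
{
    struct toy_buf b = { .owner = TOY_OWNER_CPU, .cache_ops = 0 };

    for (unsigned int i = 0; i < ndev; i++) {
        bool first = (i == 0);
        bool last = (i == ndev - 1);

        toy_dma_map(&b, optimized && !first);
        /* ... the device would perform its DMA here ... */
        toy_dma_unmap(&b, optimized && !last);
    }
    return b.cache_ops;
}
```

For a four-device pipeline, this model counts eight cache operations when every map and unmap syncs, and two when the intermediate syncs are skipped. The model says nothing about whether skipping them is safe on a given system; that judgment is what the "expert-only" tools leave to the driver.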
DMA-BUFs
DMA-BUFs were introduced to provide a generic way for applications and drivers to share a handle to a memory buffer. The DMA-BUFs themselves are created by a DMA-BUF exporter, which is a driver that can allocate a specific type of memory but that also provides hooks to handle mapping and unmapping the buffer in various ways for the kernel, user space, or devices.
The general usage flow of DMA-BUFs for a device is as follows (see the dma_buf_ops structure for more details):
- dma_buf_attach()
  - Attaches the buffer to a device (that will use the buffer in the future). The exporter can try to move the buffer if needed to make it accessible to the new device or return an error. The buffer can be attached to multiple devices.
- dma_buf_map_attachment()
  - Maps the buffer into an attached device's address space. The buffer can be mapped by multiple attachments.
- dma_buf_unmap_attachment()
  - Unmaps the buffer from the attached device's address space.
- dma_buf_detach()
  - Signals that the device is finished with the buffer; the exporter can do whatever cleanup it needs.
If we were looking at this from the classic DMA API perspective, we would consider a DMA-BUF to be normally owned by the CPU. Only when dma_buf_map_attachment() was called would the buffer ownership transfer to the device (with the associated cache flushing). Then on dma_buf_unmap_attachment(), the buffer would be unmapped and ownership would return to the CPU (again with the proper cache invalidation required). This in effect would make the DMA-BUF exporter the entity responsible for complying with the DMA API rules of ownership.
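The call order and the ownership transfers just described can be sketched as a small userspace state machine. This is a toy model, not the kernel implementation: the structures and `toy_buf_*()` helpers are hypothetical, each one only mirrors the DMA-BUF call named in its comment, and the model assumes an exporter that flushes on every map and invalidates on every unmap, as the classic DMA API view would require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ATTACH 4

struct toy_attachment {
    bool in_use;    /* set by attach, cleared by detach */
    bool mapped;    /* set by map, cleared by unmap */
};

struct toy_dmabuf {
    struct toy_attachment att[TOY_MAX_ATTACH];
    unsigned int flushes;       /* CPU cache flushes done on map */
    unsigned int invalidates;   /* CPU cache invalidates done on unmap */
};

/* Mirrors dma_buf_attach(): returns an attachment, or NULL if none is free. */
static struct toy_attachment *toy_buf_attach(struct toy_dmabuf *buf)
{
    assert(buf != NULL);
    for (size_t i = 0; i < TOY_MAX_ATTACH; i++) {
        if (!buf->att[i].in_use) {
            buf->att[i].in_use = true;
            buf->att[i].mapped = false;
            return &buf->att[i];
        }
    }
    return NULL;
}

/* Mirrors dma_buf_map_attachment(): ownership moves to the device. */
static int toy_buf_map(struct toy_dmabuf *buf, struct toy_attachment *a)
{
    assert(buf != NULL && a != NULL);
    if (!a->in_use || a->mapped)
        return -1;
    buf->flushes++;             /* flush the CPU cache for the device */
    a->mapped = true;
    return 0;
}

/* Mirrors dma_buf_unmap_attachment(): ownership returns to the CPU. */
static int toy_buf_unmap(struct toy_dmabuf *buf, struct toy_attachment *a)
{
    assert(buf != NULL && a != NULL);
    if (!a->mapped)
        return -1;
    buf->invalidates++;         /* invalidate the CPU cache for the CPU */
    a->mapped = false;
    return 0;
}

/* Mirrors dma_buf_detach(): refused here while the attachment is mapped. */
static int toy_buf_detach(struct toy_attachment *a)
{
    assert(a != NULL);
    if (!a->in_use || a->mapped)
        return -1;
    a->in_use = false;
    return 0;
}
```

Running two attached devices through map and unmap in turn leaves this model with two flushes and two invalidates, even though the CPU never looked at the buffer.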
The trouble with this scheme arises with a buffer pipeline consisting of a number of devices, where the CPU doesn't actually touch the buffer. Following the DMA API and calling dma_map_sg() and dma_unmap_sg() on each dma_buf_map_attachment() and dma_buf_unmap_attachment() call results in lots of cache-maintenance operations, which dramatically impacts performance. This was viscerally felt by ION users after a cleanup series landed in 4.12 that caused ION to use the DMA API properly. Previously, it had lots of hacks and was not compliant with the DMA API, resulting in buffer corruption in some cases; see the slides from Laura Abbott’s presentation for more details. This compliance cleanup caused a dramatic performance drop for ION users, which resulted in some vendors reverting to the 4.9 ION code in their 4.14-based products, and others creating their own hacks to improve performance.
So how can we have DMA-BUF exporters that better align with the DMA API, but do so with the performance needed for modern devices when using buffer pipelines with multiple devices?
In the second part of this article, we will continue discussing some of the unique semantics and flexibility in DMA-BUF that allows drivers to potentially avoid this performance impact (by going somewhat "off-road" from the DMA API usage guidelines), as well as the downsides of what that flexibility allows. Finally, we'll share some thoughts as to how these downsides might be avoided.
| Index entries for this article | |
|---|---|
| Kernel | Device drivers/Support APIs |
| Kernel | Direct memory access |
| Kernel | ION |
| GuestArticles | Stultz, John |
