
A security fix briefly breaks DMA

By Jonathan Corbet
April 1, 2022
In theory, direct memory access (DMA) operations are simple to understand; a device transfers data directly to or from a memory buffer managed by the CPU. Almost all contemporary devices perform DMA, since it would not be possible to obtain the needed performance without it. Like so many things, DMA turns out to be a bit more complicated in practice. That complexity led to an erroneous patch, intended to improve security, breaking DMA for some devices in 5.17 and some stable kernels.

The simple model of a DMA transfer looks something like this:

[DMA model]

There is a buffer in memory that can be accessed — for either reading or writing — by both the CPU and the peripheral device, and either side can access the buffer at will. Some systems actually work this way but, in most others, various complications come into play. For example, there is almost certainly a set of caches between the CPU and the memory buffer. If the CPU writes data to the buffer with the intent of transferring it to the device, that data may be resident in the cache for some time before being flushed to main memory. If the device is instructed to read the data from the buffer, it may not see the data in cache, resulting in incorrect (corrupted) data being written to the device.

Similarly, on many architectures, the CPU cache may be entirely unaware of data written to the DMA buffer by the device. If the CPU tries to read that data, it could instead get stale data from the cache, once again resulting in corrupted data. Long experience in the kernel-development community has shown that users have a certain tendency to become irate when that happens.

Ownership, direction, and bounce buffers

The kernel's DMA-support layer has grown a set of mechanisms designed to prevent data corruption and the unhappiness that follows from it. In particular, the DMA API can avoid the pitfalls that come with systems that are not cache-coherent — where copies of data in different locations may not be synchronized with each other. When dealing with non-cache-coherent (or "streaming") mappings, driver code must keep track of two important attributes: the ownership of the buffer and the direction that the data is moving. Ownership describes which side (the CPU or the device) is able to access the buffer at any given time, while the data direction describes what the owner will be doing with the buffer. There are two functions (among others) that manage these attributes for a given DMA buffer:

    void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
                                    size_t size, enum dma_data_direction dir);
    void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
                                 enum dma_data_direction dir);

When a DMA buffer is first mapped, the ownership belongs to the CPU. A call to dma_sync_single_for_device() sets the ownership to the device, while dma_sync_single_for_cpu() brings the ownership back to the CPU. The dir argument to both functions describes the direction that data is moving; it is one of DMA_NONE, DMA_TO_DEVICE, DMA_FROM_DEVICE, or DMA_BIDIRECTIONAL. Note that terms like "read" and "write" are not used here, since in almost any DMA transaction one side is reading from the buffer while the other is writing to it.

The combination of the ownership and the direction tells the DMA layer what needs to be done at any given time. If the CPU is passing ownership to the device, and the direction is DMA_TO_DEVICE, then the DMA layer must take care to flush any data from the CPU caches to the buffer so that the device sees the correct data. When taking ownership back from the device, with a direction of DMA_FROM_DEVICE, the DMA layer should instead invalidate any cached data, since it may well be incorrect. As long as the driver is careful to set the ownership and direction of the buffer properly at all times, there should be no problems with data corruption.

There is one other complication to cover in order to understand what went wrong. There are times when the device can perform DMA, but it cannot directly access the buffer that the CPU is using. The buffer may be in a region of memory that the device cannot reach, it may be split into too many discontiguous pieces, or there may be an I/O memory-management unit (IOMMU) complicating the picture. When this happens, a separate DMA-layer module (called the "software I/O translation lookaside buffer" or SWIOTLB) allocates a "bounce buffer" that is accessible to the device:

[DMA model]

In this situation, data must often be copied between the two buffers. If the CPU is sending data to the device, for example, data must be copied ("bounced") from the CPU's buffer to the bounce buffer, and the address of the bounce buffer handed to the device. Data will be copied in the opposite direction when data is being received from the device. The good news is that the DMA API is able to handle this case transparently as long as the ownership calls described above are properly used. Device drivers need not know whether a bounce buffer is in use or not.

A problem with bounce buffers

There is a potential security problem with bounce buffers, though. If data is being transferred from the device to the CPU, the DMA code will, when ownership of the DMA buffer returns to the CPU, copy the contents of the bounce buffer back to the CPU's buffer. If, however, the device did not write a full buffer's worth of data, then some of the data copied out of the bounce buffer will have originated from somewhere else. It may come from a previous I/O operation, or from an entirely unrelated kernel subsystem. The device driver is then likely to copy this data back to user space; indeed, the buffer might be directly mapped into user space to begin with. The end result is that data is leaked from the kernel to user space.

It makes sense to zero the bounce buffer before setting it up, but DMA buffers are often used many times, and it is not normal to zero them between operations. So even a bounce buffer that was zeroed at the beginning may accumulate unrelated data and expose it to user space. A real-world example of this problem was designated CVE-2018-1000204 and fixed in 2018, but that fix is only partial, since it only zeroes the buffer at allocation time.

Halil Pasic wrote a more complete fix that was merged in February; a followup fix was added in early March. It works by changing the behavior of dma_sync_single_for_device(). If the data direction is DMA_TO_DEVICE, the contents of the CPU's buffer must be copied to the bounce buffer, and the DMA API has always done that. Pasic's patch caused that same copy to happen for all operations, even when the direction is DMA_FROM_DEVICE.

Normally, it would not make sense to copy data into the bounce buffer for DMA_FROM_DEVICE; the device is just going to overwrite it anyway. But this copy ensures that, if the device only writes part of the bounce buffer, the remainder of the CPU's buffer will not be overwritten by random data when the bounce buffer is copied back; instead, it will get back a copy of the data that was already there. In a sense, it makes the end result reflect what the device has actually done, in that only data written by the device will change in the CPU's buffer.

The latter fix was merged for the 5.17-rc8 development kernel — less than two weeks before the final 5.17 release — and quickly found its way into the 5.16.15 and 5.15.29 stable updates.

Regression

On March 23, Oleksandr Natalenko reported that this change broke the ath9k wireless driver. There followed an extended discussion where it took a while to figure out what was really going on. Robin Murphy was initially incredulous:

I'm extremely puzzled how any driver could somehow be dependent on non-device-written data getting replaced with random crap, given that it wouldn't happen with a real IOMMU, or if SWIOTLB just didn't need to bounce, and the data would hardly be deterministic either.

The source of the problem was eventually narrowed down to some code in ath_edma_get_buffers(). This code executes a sequence that looks like this:

    dma_sync_single_for_cpu(..., DMA_FROM_DEVICE);
    if (!a_packet_is_ready()) {
        dma_sync_single_for_device(..., DMA_FROM_DEVICE);
        return false;
    }

The code is taking ownership of the buffer, checking to see if the device has put a packet there, then returning ownership to the device if there is no packet available. The key point here is that, while this is happening, the device is still writing to the buffer; that packet could be arriving while the driver is checking for it. This procedure worked just fine until the change went in; with the new behavior, the dma_sync_single_for_device() call copies the CPU buffer into the bounce buffer, potentially overwriting data that was just placed there by the device. This happens often enough, it seems, to reliably break the device.

Developers like Christoph Hellwig initially saw the ath9k behavior as a bug, and felt that the problem should be fixed there. Murphy described this behavior as "a violation of the DMA API". The problem was that the device was allowed to keep writing to the DMA buffer even though the ownership had shifted to the CPU. It seemed, for a moment, that the ath9k driver could be fixed and the bounce-buffering change could remain.

Linus Torvalds disagreed strongly, though, for a few different reasons. He noted that ath9k might not be the only driver that shows this kind of problem; it just showed up first there because those adapters are widely used. If other drivers do similar things, users and developers could end up chasing bugs for a long time. Some of those bugs might well make it into production kernel releases before being noticed.

More to the point, though, he stated that the ath9k driver's behavior was correct, and the bounce-buffer change was not. Specifically, he pointed out that the dma_sync_single_for_device() call specified DMA_FROM_DEVICE. In that situation, he claimed, the bounce-buffer implementation should do nothing at all; the data is coming from the device, so the CPU has no business copying data into the bounce buffer. So the only right thing to do is to revert the patch.

In the end, reverting the commit is exactly what happened, though it was later partly reinstated to preserve the parts of the patch that were not problematic. So the ath9k driver works again, and potential bugs in an unknown number of other drivers have been avoided. The problem of leaking random data out of the bounce buffer remains unsolved; a different approach will need to be found to resolve that one.

Index entries for this article
Kernel: Direct memory access
Kernel: swiotlb



A security fix briefly breaks DMA

Posted Apr 1, 2022 14:46 UTC (Fri) by post-factum (subscriber, #53836) [Link] (3 responses)

I sincerely hoped LWN would make a nice summary of that LKML thread, and here we are :). Thank you a lot!

A security fix briefly breaks DMA

Posted Apr 1, 2022 15:15 UTC (Fri) by johill (subscriber, #25196) [Link] (2 responses)

Couple of notes, having participated in the thread:
  • According to the documentation, the call(s) shouldn't even specify DMA_FROM_DEVICE, it/they should specify DMA_BIDIRECTIONAL as the buffer is documented that way. That would probably break ath9k too, haven't checked the exact code though.
  • According to Linus's thinking on it, however, the DMA_FROM_DEVICE actually makes sense ...
  • The ownership model doesn't work as documented, at least not for devices such as ath9k, since here the device is writing while the driver is reading (which you touch upon), but that implies a sort of "dual ownership", or perhaps weakening the ownership to "write ownership"?
  • There was a whole sub-discussion about cache invalidate etc., e.g. with a reference to the ARC (I think?) documentation, which was also quite worthwhile IMHO.
In any case, for me one of the takeaways from the thread was that the documented DMA model is actually not applicable in practice (i.e. it's wrong), which might have been worth going into even if nobody has stepped up yet to try to fix it.

A security fix briefly breaks DMA

Posted Apr 1, 2022 18:32 UTC (Fri) by mb (subscriber, #50428) [Link] (1 responses)

Thanks for the additional insight.

But could you please elaborate, why DMA_BIDIRECTIONAL should be used? I don't quite get why.

A security fix briefly breaks DMA

Posted Apr 2, 2022 12:49 UTC (Sat) by johill (subscriber, #25196) [Link]

Oh, it probably *shouldn't* be used in this case, practically speaking. But (and credit goes to Halil for quoting this to me in https://lore.kernel.org/r/20220327051502.63fde20a.pasic@linux.ibm.com) the documentation states:
        void
        dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
                                size_t size,
                                enum dma_data_direction direction)

        void
        dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
                                   size_t size,
                                   enum dma_data_direction direction)

        void
        dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
                            int nents,
                            enum dma_data_direction direction)

        void
        dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
                               int nents,
                               enum dma_data_direction direction)

Synchronise a single contiguous or scatter/gather mapping for the CPU
and device. With the sync_sg API, all the parameters must be the same
as those passed into the single mapping API. With the sync_single API,
you can use dma_handle and size parameters that aren't identical to
those passed into the single mapping API to do a partial sync.
Which, at least in our reading, seems to imply that for SG all must be the same, for single all but handle/size must be the same. Practically, that doesn't seem to be desired, and it's not even entirely clear what these should mean for DMA_BIDIRECTIONAL.

A security fix briefly breaks DMA

Posted Apr 1, 2022 16:56 UTC (Fri) by larkey (guest, #104463) [Link] (1 responses)

There's a small typo in the index link of the article: "DIrect memory access" (note the capital "I").

Really great writeup!

Typo reports

Posted Apr 1, 2022 16:59 UTC (Fri) by corbet (editor, #1) [Link]

Fixed. For future reference, please send typo reports to lwn@lwn.net so as to not clutter the comment stream.

A security fix briefly breaks DMA

Posted Apr 1, 2022 17:43 UTC (Fri) by mario-campos (subscriber, #152845) [Link] (6 responses)

That actually made a lot of sense! THIS is why I read LWN :)

On the matter at hand, though... Why not just create an additional `size_t` parameter in the DMA API and only bounce/copy the necessary data (instead of the full buffer)?

A security fix briefly breaks DMA

Posted Apr 1, 2022 18:30 UTC (Fri) by mb (subscriber, #50428) [Link] (4 responses)

In most cases it's probably unknown how much the device actually wrote, unless the driver actually looks at the data written.

A security fix briefly breaks DMA

Posted Apr 2, 2022 3:20 UTC (Sat) by developer122 (guest, #152928) [Link] (2 responses)

This sounds a lot like terrible hardware design.

A security fix briefly breaks DMA

Posted Apr 4, 2022 8:17 UTC (Mon) by taladar (subscriber, #68407) [Link]

That doesn't really matter, if you are the kernel and essentially want a design that works with all hardware you have to expect some terribly designed hardware too.

A security fix briefly breaks DMA

Posted Apr 7, 2022 15:16 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

I'm not sure that's fair.

You tell a NIC it can receive a packet directly into a buffer. It does so. You don't know how large the packet was until you look into the buffer to find out.

The NIC could conceivably communicate the 'bytes written' amount in some other location outside the buffer, but that significantly complicates the data handling.

A security fix briefly breaks DMA

Posted Apr 2, 2022 6:24 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

I would tend to expect that the vast majority of hardware falls into one of two cases:

* It's something like a NIC, and the data is eventually going to end up in some sort of userspace buffer (e.g. via read() or recv()). This implies that some part of the kernel is going to have to figure out the size at some point, because you can't leak extraneous data to userspace, so you might as well do it in the driver.
* It's something like a keyboard or capture card, and the data has a fixed or configurable size and/or format, which the driver should already know because it presumably configured the size/format in the first place.

Is there really a lot of variable-sized input hardware where the input *doesn't* eventually end up in userspace?

A security fix briefly breaks DMA

Posted Apr 2, 2022 16:46 UTC (Sat) by khim (subscriber, #9252) [Link]

Because it wouldn't solve anything: DMA mechanism itself has no way to check how many bytes were actually written, only the driver can know that.

And using value passed along from the driver to make information unavailable to said driver… what would that achieve? The driver already can ignore the part that it doesn't need.

A security fix briefly breaks DMA

Posted Apr 2, 2022 19:55 UTC (Sat) by iabervon (subscriber, #722) [Link] (3 responses)

I don't see how the bounce buffer copy and the ongoing DMA avoid interacting in such a way that a_packet_is_ready() can return true when not all of the packet was in the bounce buffer in time to get copied out. It seems like you'd need a sync operation that let you specify the order that the bounce buffer is read in a way that wouldn't be optimal for a sync operation that was only reliable if the device was no longer writing when you called it.

A security fix briefly breaks DMA

Posted Apr 3, 2022 8:36 UTC (Sun) by farnz (subscriber, #17727) [Link]

In the bounce buffer case, it can't. But on a lot of systems, especially those in WiFi APs, the DMA is happening into the CPU buffer directly (no bounce buffer), and there it's an optimization to say "hey, packet copied in, resync and continue".

A security fix briefly breaks DMA

Posted Apr 3, 2022 8:54 UTC (Sun) by dvrabel (subscriber, #9500) [Link] (1 responses)

The device writes the status word (with its RxDone bit) _after_ writing the packet and other metadata. Thus, if the CPU sees RxDone it knows all the other data has been written.

A security fix briefly breaks DMA

Posted Apr 3, 2022 17:40 UTC (Sun) by iabervon (subscriber, #722) [Link]

That's what I thought, but how does the CPU know that none of the sync happened before the status word was written, with the copy of the status word happening after it was written? You'd need to make sure that the first thing that gets copied during the sync is the status word, and I'm not finding any documentation that the sync happens in any particular order.

I expect that the copy actually happens from low to high, and the status word is at the beginning, so it always works, but relies on a property of the core API that was never promised.

A security fix briefly breaks DMA

Posted Apr 3, 2022 9:38 UTC (Sun) by Flow (subscriber, #82408) [Link] (2 responses)

Great summary, as always. However

> It makes sense to zero the bounce buffer before setting it up, but DMA buffers are often used many times, and it is not normal to zero them between operations. So even a bounce buffer that was zeroed at the beginning may accumulate unrelated data and expose it to user space.

I would be thankful if someone could elaborate on the process that causes unrelated data to accumulate. This is not immediately obvious to me. I suspect it is the device or its driver re-using the buffer for further purposes?

A security fix briefly breaks DMA

Posted Apr 4, 2022 14:51 UTC (Mon) by imMute (guest, #96323) [Link]

>I suspect that is potentially the device or its driver that is re-using the buffer for further purposes?

This. It's fairly common for drivers to setup a ring of buffers that get reused. As for how the data can be "unrelated" imagine a NIC driver pulling packets off the device - each packet may very well be destined for entirely different processes, but they all use the same set of buffers over and over again.

A security fix briefly breaks DMA

Posted Apr 4, 2022 15:33 UTC (Mon) by johill (subscriber, #25196) [Link]

Well in this case we're talking about swiotlb, so imagine driver 1 maps a buffer A and gets buffer A' in swiotlb, fills some data into it (perhaps it's a crypto driver and there's some key data). Then it unmaps because it's no longer in use. Some time later driver 2 maps a buffer B and also gets A' in the swiotlb shadow data, so now all of the stale data from driver 1 is exposed to driver 2's device for DMA.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds