LWN: Comments on "Error handling for I/O memory management units"

Error handling for I/O memory management units

Cyberax — Mon, 25 Aug 2014 19:21:49 +0000

As I understand it, GPUs currently have access only to some regions of RAM, not all of it, though that's changing with the new heterogeneous architectures.

Command buffers are also scheduled to run exclusively, which gives _some_ protection. That has lots of downsides (a buffer can't run for too long, or it starves other users), but it's also changing.

Error handling for I/O memory management units

jzbiciak — Mon, 25 Aug 2014 14:52:15 +0000

My understanding of the support at least some GPUs provide (whether or not Linux natively leverages it) is that you can provide an MMU context with a particular command stream. There isn't a global mapping table so that the GPU can see the union of mappings across all requestors. Rather, command streams coming from X get checked against an MMU context associated with X, and command streams coming from Y get checked against an MMU context associated with Y.
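A rough sketch of the per-context checking described above, written as a toy Python model rather than anything a real GPU or driver exposes. The names `MmuContext`, `run_command_stream`, and the page numbers are all hypothetical; the only point is that each command stream is translated against its own table, so there is no union of mappings to leak through.

```python
PAGE_SIZE = 4096


class MmuContext:
    """Hypothetical per-requestor mapping table: virtual page -> physical page."""

    def __init__(self, owner):
        self.owner = owner
        self.pages = {}

    def map(self, vpage, ppage):
        self.pages[vpage] = ppage

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.pages:
            # In hardware this would be a fault delivered to the OS.
            raise LookupError(f"{self.owner}: no mapping for {vaddr:#x}")
        return self.pages[vpage] * PAGE_SIZE + offset


def run_command_stream(ctx, addresses):
    """Check every address in a command stream against its own context only."""
    return [ctx.translate(a) for a in addresses]


ctx_x = MmuContext("X")
ctx_y = MmuContext("Y")
ctx_x.map(0, 100)  # X's virtual page 0 -> physical page 100
ctx_y.map(0, 200)  # Y's virtual page 0 -> physical page 200
ctx_y.map(1, 201)  # only Y has a mapping for virtual page 1
```

With this setup the same virtual address resolves to different physical pages for X and Y, and X touching virtual page 1 faults even though Y has it mapped.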

And within that framework, my understanding is that GPUs can trigger page faults, and that that is not an error. At least, that's what AMD's Kaveri was promising some time ago, and what I've seen in some other vendors' GPU+MMU pitches.

So I repeat my question: Does 'error' in the article refer to page faults in general, or an actual application error?

Error handling for I/O memory management units

intgr — Mon, 25 Aug 2014 11:49:43 +0000

Does this mean that, if 2 users are both running code on the GPU, they can access and corrupt each other's data?

And without an IOMMU they can access all physical memory?

And Linux 3.15 merged patches to allow GPGPU (OpenCL) access to any unprivileged user by default (via DRM render nodes)?

And no checking could possibly be done of the code being executed?

PLEASE tell me I am misunderstanding something.

Error handling for I/O memory management units

Cyberax — Mon, 25 Aug 2014 10:22:37 +0000

OpenCL supports arbitrary pointer arithmetic. It's impossible to statically check the command stream for correctness.
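A small illustration of that point, as a Python stand-in rather than real OpenCL. The `ALLOWED` range and the helper names are made up. The "kernel" computes `base + i` from indices supplied at run time, so the code is identical in both runs and only the input data decides whether an access lands out of bounds.

```python
ALLOWED = range(16, 32)  # addresses this user may touch (made-up numbers)


def touched(base, indices):
    """Addresses a kernel computing base + i would touch for runtime data `indices`."""
    return [base + i for i in indices]


def in_bounds(addresses):
    """True only if every touched address lies inside the allowed range."""
    return all(a in ALLOWED for a in addresses)


good = touched(16, [0, 1, 2])
bad = touched(16, [0, 1, 4000])  # same code, different input data
```

Inspecting the code alone cannot tell these two runs apart; only a check made at access time, which is what an MMU or IOMMU does, can.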

Error handling for I/O memory management units

cladisch — Mon, 25 Aug 2014 10:19:39 +0000

Allowing userspace to use the GPU to read/write any memory would be a security hole.
The GPU driver checks the command stream for correctness.

Error handling for I/O memory management units

Cyberax — Mon, 25 Aug 2014 09:22:16 +0000

We now have GPUs that can access the system RAM, and it's certainly possible for them to get IOMMU errors while running user space-supplied code.

Error handling for I/O memory management units

cladisch — Mon, 25 Aug 2014 08:43:36 +0000

DMA is always done from/to buffers that have been allocated or locked by the device driver.

With DMA, there is no such thing as a minor fault; any IOMMU fault is the result of a bug in the OS/driver or in the hardware.

Error handling for I/O memory management units

jzbiciak — Mon, 25 Aug 2014 05:47:39 +0000

I have a related question. What is meant by "error"?

If a device (CPU or anything else) asks an MMU for a virtual to physical address translation for a given read or write, the MMU can report back a fault. However, the request itself may not have been an actual error. Rather, the fault could have arisen for many other reasons, including:

The page may have been unmapped by the OS but still resident. The fault informs the OS that it's still active. (A so-called minor fault.)
The page may have been marked read-only (because it's shared or clean) but the device requested to write, so the OS needs to mark it dirty and possibly perform a COW. (Another sort of minor fault, I believe.)
The page may have been paged out. (A major fault, but still not an error.)

Are these situations considered errors or not? This comment in particular made me wonder:

David pointed out that with some devices, graphics adapters in particular, users do not want the device to stop even in the presence of an error. One command stream may fault and be stopped, but others running in parallel should be able to continue. So a more subtle response is often necessary.

Is that so that the kernel can service a page fault for a given task, or is it the more general situation that you don't want the entire display system brought down by an errant task? (It seems like the latter would be pretty key baseline functionality. The former, though, is something I believe AMD's Kaveri was promising.)

So does "error" in this article mean "faults that would result in SIGSEGV or similar if a userland CPU task did it", or "faults that correspond to minor/major page faults in an otherwise well-behaved userland CPU task, in addition to those which would cause SIGSEGVs"?
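The distinction being asked about can be sketched as a classification function. This is a toy Python model, not how any particular kernel or IOMMU decides. The `Page` fields and the `Fault` names are hypothetical and chosen only to mirror the three cases listed above plus the SIGSEGV-like case.

```python
from dataclasses import dataclass
from enum import Enum


class Fault(Enum):
    NONE = "no fault"
    MINOR = "minor fault (resident, just remap)"
    WRITE_PROTECT = "write to read-only page (mark dirty or COW)"
    MAJOR = "major fault (page in)"
    ERROR = "error (SIGSEGV-like)"


@dataclass
class Page:
    mapped: bool     # present in the translation table
    resident: bool   # contents are currently in RAM
    writable: bool   # write permission bit as currently set
    may_write: bool  # whether the owner is actually allowed to write


def classify(page, is_write):
    """Classify one access; `page` is None when the address belongs to nothing."""
    if page is None:
        return Fault.ERROR
    if not page.resident:
        return Fault.MAJOR
    if not page.mapped:
        return Fault.MINOR
    if is_write and not page.writable:
        return Fault.WRITE_PROTECT if page.may_write else Fault.ERROR
    return Fault.NONE
```

Under this model only the `ERROR` results are errors in the application sense; the others are faults the OS could service and then let the access retry.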

Error handling for I/O memory management units

dlang — Sun, 24 Aug 2014 03:51:29 +0000

I could be wrong, but I think the answer is that without an IOMMU the DMA will either succeed or fail, and the only thing that knows this is the thing trying to do the DMA.

However, with an IOMMU, the IOMMU can now report that the device attempted to access memory it's not allowed to.

The question is what should be done when a device misbehaves, and how should it be reported?
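To make the "who finds out" difference concrete, here is a toy model in Python. The `Iommu` class, its fault-log format, and the device name `"nic"` are invented for illustration and do not correspond to any real hardware interface. A blocked access is refused and also leaves a record that the OS can read later, independent of whatever the device itself reports.

```python
from collections import deque


class Iommu:
    """Toy IOMMU: per-device set of allowed pages plus a fault queue for the OS."""

    def __init__(self):
        self.allowed = {}         # device id -> set of permitted page numbers
        self.fault_log = deque()  # records the OS can inspect afterwards

    def allow(self, device, page):
        self.allowed.setdefault(device, set()).add(page)

    def access(self, device, page, is_write):
        """Return True if the access goes through; otherwise log a fault record."""
        if page in self.allowed.get(device, set()):
            return True
        self.fault_log.append((device, page, "write" if is_write else "read"))
        return False


iommu = Iommu()
iommu.allow("nic", 5)
ok = iommu.access("nic", 5, is_write=True)       # permitted, nothing logged
blocked = iommu.access("nic", 6, is_write=True)  # refused and recorded
```

What the OS then does with the entries in `fault_log` (stop the device, kill one context, just log it) is exactly the policy question raised above.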

Error handling for I/O memory management units

neilbrown — Sat, 23 Aug 2014 22:34:12 +0000

These are perfectly good answers for why an IOMMU is a valuable thing to have, but they don't seem to answer the question: what sort of error can you get from an IOMMU?

If you have a system without an IOMMU, then it is quite possible to program a DMA engine in some device to access an illegal address, maybe some address where there isn't any memory. Presumably an error gets reported ... or maybe it doesn't. Maybe it just silently fails.

If you add an IOMMU, then that greatly increases the range of addresses that are illegal for any given device, but surely the device will just fail in exactly the same way that it did before. I don't see any new sorts of errors. I must be missing something.

So I'm still hoping someone can explain to me what sort of errors one can get from an IOMMU.

Error handling for I/O memory management units

corsac — Sat, 23 Aug 2014 13:25:22 +0000

Or someone could remotely take control of your network card and try to do DMA writes from there, compromising the host.

Error handling for I/O memory management units

mcpherrinm — Thu, 21 Aug 2014 05:43:19 +0000

An IOMMU gives devices virtual memory, basically. This is similar to what we do with process address spaces. We're talking about a device doing something equivalent to a process segfaulting.

For example, a GPU may load a GL shader program with a bug that causes out-of-bounds reads to happen on the GPU. If executed by a malicious user, that could leak information the system doesn't intend them to have access to.

Error handling for I/O memory management units

neilbrown — Wed, 20 Aug 2014 23:03:47 +0000

Pardon my ignorance, but what sort of errors are we talking about here? What can "go wrong with an IOMMU"?

The article mentions "bad addresses" but presumably you can program DMA to bad addresses even without an IOMMU ... and presumably we try not to even when we have one.

Thanks.