Error handling for I/O memory management units

Posted Aug 25, 2014 5:47 UTC (Mon) by jzbiciak (guest, #5246)
In reply to: Error handling for I/O memory management units by neilbrown
Parent article: Error handling for I/O memory management units

I have a related question. What is meant by "error"?

If a device (CPU or anything else) asks an MMU for a virtual-to-physical address translation for a given read or write, the MMU can report back a fault. However, the request itself may not have been an actual error. Rather, the fault could have arisen for many other reasons, including:

The page may have been unmapped by the OS while still resident in memory. The fault tells the OS that the page is still in active use. (A so-called minor fault.)
The page may have been marked read-only (because it's shared or clean) but the device requested to write, so the OS needs to mark it dirty and possibly perform a COW. (Another sort of minor fault, I believe.)
The page may have been paged out. (A major fault, but still not an error.)
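To make the distinction concrete, here is a minimal sketch of how an OS might sort a faulting access into those cases. The `Page` fields and the `classify` function are hypothetical names invented for illustration; real kernels track this state quite differently, and the ordering of the checks is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Fault(Enum):
    NONE = "no fault"
    MINOR = "minor fault"
    MAJOR = "major fault"
    ERROR = "error"


@dataclass
class Page:
    # Hypothetical per-page state; field names are illustrative only.
    mapped: bool = True      # a translation is present in the MMU tables
    resident: bool = True    # the contents are in physical memory
    writable: bool = True    # the MMU currently permits writes
    may_write: bool = True   # the OS would permit writes (e.g. after COW)


def classify(page, is_write):
    """Classify an access; `page` is None if the address was never valid."""
    if page is None:
        return Fault.ERROR   # no valid mapping at all: SIGSEGV territory
    if is_write and not page.may_write:
        return Fault.ERROR   # a write the OS will never allow
    if not page.resident:
        return Fault.MAJOR   # contents must be read back from backing store
    if not page.mapped:
        return Fault.MINOR   # resident, only the translation is missing
    if is_write and not page.writable:
        return Fault.MINOR   # mark dirty or perform COW, then retry
    return Fault.NONE
```

Only the `Fault.ERROR` results correspond to what would be a SIGSEGV for a userland CPU task; the other faults are serviced and the access is retried.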

Are these situations considered errors or not? This comment in particular made me wonder:

"David pointed out that with some devices, graphics adapters in particular, users do not want the device to stop even in the presence of an error. One command stream may fault and be stopped, but others running in parallel should be able to continue. So a more subtle response is often necessary."

Is that so that the kernel can service a page fault for a given task, or is it the more general situation that you don't want the entire display system brought down by an errant task? (It seems like the latter would be pretty key baseline functionality. The former, though, is something I believe AMD's Kaveri was promising.)

So does "error" in this article mean "faults that would result in SIGSEGV or similar if a userland CPU task did it", or "faults that correspond to minor/major page faults in an otherwise well-behaved userland CPU task, in addition to those which would cause SIGSEGVs"?

Error handling for I/O memory management units

Posted Aug 25, 2014 8:43 UTC (Mon) by cladisch (✭ supporter ✭, #50193)

DMA is always done from/to buffers that have been allocated or locked by the device driver.

With DMA, there is no such thing as a minor fault; any IOMMU fault is the result of a bug in the OS/driver or in the hardware.

Error handling for I/O memory management units

Posted Aug 25, 2014 9:22 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

We now have GPUs that can access the system RAM, and it's certainly possible for them to get IOMMU errors while running user space-supplied code.

Error handling for I/O memory management units

Posted Aug 25, 2014 10:19 UTC (Mon) by cladisch (✭ supporter ✭, #50193)

Allowing userspace to use the GPU to read/write any memory would be a security hole.
The GPU driver checks the command stream for correctness.

Error handling for I/O memory management units

Posted Aug 25, 2014 10:22 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

OpenCL supports arbitrary pointer arithmetic. It's impossible to statically check the command stream for correctness.
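A toy illustration of the point, written in Python rather than OpenCL: the addresses a kernel touches can depend on the contents of its input buffers, so inspecting the code alone cannot tell you whether it stays in bounds. `CheckedMemory`, `AccessFault`, and `kernel` are made-up names, and the per-access range check is only a stand-in for what an MMU or IOMMU would do in hardware.

```python
class AccessFault(Exception):
    """Raised by the toy MMU when an access falls outside the permitted range."""


class CheckedMemory:
    def __init__(self, size, lo, hi):
        self.cells = [0] * size
        self.lo, self.hi = lo, hi   # permitted address range [lo, hi)

    def store(self, addr, value):
        # The check happens per access, at run time.
        if not (self.lo <= addr < self.hi):
            raise AccessFault(addr)
        self.cells[addr] = value


def kernel(mem, base, indices):
    # The target address is base + i, where i comes from input data.
    for i in indices:
        mem.store(base + i, 1)


mem = CheckedMemory(size=16, lo=4, hi=8)
kernel(mem, 4, [0, 1, 3])        # same code, in-bounds data: no fault
try:
    kernel(mem, 4, [2, 9])       # same code, out-of-bounds data
    faulted_at = None
except AccessFault as exc:
    faulted_at = exc.args[0]     # address that triggered the fault
```

The same `kernel` is harmless with one input and faults with another, which is why the check has to happen at access time rather than when the command stream is submitted.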

Error handling for I/O memory management units

Posted Aug 25, 2014 11:49 UTC (Mon) by intgr (subscriber, #39733)

Does this mean that, if two users are both running code on the GPU, they can access and corrupt each other's data?

And without an IOMMU they can access all physical memory?

And Linux 3.15 merged patches to allow GPGPU (OpenCL) access to any unprivileged user by default (via DRM render nodes)?

And no checking could possibly be done of the code being executed?

PLEASE tell me I am misunderstanding something.

Error handling for I/O memory management units

Posted Aug 25, 2014 14:52 UTC (Mon) by jzbiciak (guest, #5246)

My understanding of the support at least some GPUs provide (whether or not Linux natively leverages it) is that you can provide an MMU context with a particular command stream. There isn't a global mapping table so that the GPU can see the union of mappings across all requestors. Rather, command streams coming from X get checked against an MMU context associated with X, and command streams coming from Y get checked against an MMU context associated with Y.
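Roughly, the per-requestor arrangement described above could be modelled like this. It is a sketch under the assumption that each command stream carries a context ID; `Gpu`, `create_context`, and `translate` are hypothetical names, not any real driver or hardware interface.

```python
class TranslationFault(Exception):
    """Raised when a context has no mapping for the requested virtual page."""


class Gpu:
    def __init__(self):
        # One private table per context; there is no global union of mappings.
        self.contexts = {}   # context id -> {virtual page: physical page}

    def create_context(self, ctx_id, mappings):
        self.contexts[ctx_id] = dict(mappings)

    def translate(self, ctx_id, vpage):
        # Only the requesting context's own table is consulted.
        table = self.contexts[ctx_id]
        if vpage not in table:
            raise TranslationFault((ctx_id, vpage))
        return table[vpage]


gpu = Gpu()
gpu.create_context("X", {0: 100})
gpu.create_context("Y", {0: 200, 1: 201})
```

Here virtual page 0 resolves to a different physical page for X than for Y, and X asking for virtual page 1 faults even though Y has it mapped.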

And within that framework, my understanding is that GPUs can trigger page faults, and that that is not an error. At least, that's what AMD's Kaveri was promising some time ago, and what I've seen in some other vendors' GPU+MMU pitches.

So I repeat my question: Does 'error' in the article refer to page faults in general, or an actual application error?

Error handling for I/O memory management units

Posted Aug 25, 2014 19:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

As I understand it, GPUs currently have access only to certain regions of RAM, not all of it, though that is changing with the newer heterogeneous architectures.

Command buffers are also scheduled to run exclusively, which gives _some_ protection. That has lots of downsides (a job can't run for too long, or it starves other users), but this is changing too.