
Wait/wound mutexes

Posted May 2, 2013 7:54 UTC (Thu) by airlied (subscriber, #9104)
Parent article: Wait/wound mutexes

Oh, there is code to use it: we have code in the drm TTM layer doing ordered buffer reservations with backoff for deadlocks, and we need something at a higher level for dma-buf to use if we are sharing buffers between multiple GPU drivers or other misc drivers.

So Maarten already has code in a branch to port TTM over to this infrastructure, and has posted it to dri-devel previously, I think.



Wait/wound mutexes

Posted May 2, 2013 8:07 UTC (Thu) by mlankhorst (subscriber, #52260) [Link] (1 responses)

Yeah, the full branch is at http://cgit.freedesktop.org/~mlankhorst/linux/log/, with TTM converted and a WIP to do the same on intel and sync shared dma-bufs between radeon/nouveau and intel. The actual sharing part is still a bit hacky and less reviewed, because it involves synchronization between multiple GPUs, which is a step further.

Wait/wound mutexes

Posted May 2, 2013 9:03 UTC (Thu) by blackwood (guest, #44174) [Link]

Yeah, the big reason for pushing these ww mutexes into core code from ttm (where they are called reservations) is to enable cross-device synchronization of dma access to shared buffer objects (aka dma_bufs). Currently access is completely unsynchronized in the kernel, so if userspace doesn't block (which it really shouldn't), displaying a frame rendered on a discrete gpu on the integrated one will horribly tear.

Rob Clark started a proposal for dma_fences (now in Maarten's branch), which are attached to each dma_buf taking part in any given gpu render batch (or any other dma op affecting a dma_buf).

Now if you naively walk all the dma_bufs, lock each of them one-by-one and attach a new set of fences, races with other threads have a peculiar effect: If you're unlucky you can create a synchronization loop between a bunch of buffers and fences, and since these fences can be fully hw-based (i.e. never wake up the cpu to do the syncing) you'll end up with deadlocked gpus, each waiting on the other.

Hw sync/wait deadlocks between different drivers are the kind of fun I don't want to deal with, so we need a multi-object locking scheme that works across devices.
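
Roughly, the locking loop this enables looks like the sketch below. This is not Maarten's code: the buffer structure, the ww class and the function are made up for illustration, and the ww_mutex calls use the names from today's <linux/ww_mutex.h>, which may differ in detail from the patch set under discussion.

/*
 * Sketch: reserve a whole list of buffers, backing off and retrying on
 * -EDEADLK so that concurrent reservers cannot deadlock each other.
 * Assumes the list contains no duplicate buffers, so ww_mutex_lock()
 * returns only 0 or -EDEADLK here.
 */
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(buffer_ww_class);

struct buffer {
	struct ww_mutex lock;
	struct list_head entry;
};

static int reserve_buffers(struct list_head *buffers)
{
	struct ww_acquire_ctx ctx;
	struct buffer *buf, *b, *contended = NULL;
	int ret;

	ww_acquire_init(&ctx, &buffer_ww_class);
retry:
	list_for_each_entry(buf, buffers, entry) {
		if (buf == contended) {
			/* Already locked via ww_mutex_lock_slow() below. */
			contended = NULL;
			continue;
		}

		ret = ww_mutex_lock(&buf->lock, &ctx);
		if (ret == -EDEADLK) {
			/* Back off: drop every lock held so far ... */
			list_for_each_entry(b, buffers, entry) {
				if (b == buf)
					break;
				ww_mutex_unlock(&b->lock);
			}
			if (contended)
				ww_mutex_unlock(&contended->lock);

			/* ... then sleep on the contended lock and restart. */
			ww_mutex_lock_slow(&buf->lock, &ctx);
			contended = buf;
			goto retry;
		}
	}
	ww_acquire_done(&ctx);

	/* ... attach fences / queue dma while everything is reserved ... */

	list_for_each_entry(buf, buffers, entry)
		ww_mutex_unlock(&buf->lock);
	ww_acquire_fini(&ctx);
	return 0;
}

The acquire context is what ties the whole set together: on contention the core code decides which thread has to back off (that one gets -EDEADLK) and which one keeps its locks, so the scheme works no matter how many drivers or devices are involved.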

Note that i915 isn't currently based on ttm (and personally I'm not known as a big fan of ttm), but the proposed ww mutex stuff is massively improved over ttm's reservations:
- not intermingled with all the gpu memory management craziness ttm also does
- sane, clear api (we grade on a curve in gpu-land ...) with decent documentation
- _really_ good debug support - while writing the kerneldoc for Maarten's code I've checked whether any kind of interface abuse would be caught. Furthermore we have a slowpath injection debug option to exercise the contended case (-EDEADLK) with single-threaded tests.

Now one critique I hear often is "why can't you guys use something simpler?". Reasons against that:
- current ttm-based drivers (radeon, nouveau, ...) already deal with this complexity. Furthermore, gpus tend to die randomly, so all the command submission ioctls for the big drivers (i915, radeon, nouveau) are already fully restartable. Ditching the tiny bit of code that handles the ww mutex slowpath won't shed much complexity.
- Current drivers rely on the -EALREADY semantics for correctness (a small sketch follows after this list). Crazy, I know, but like I've said: we grade on a curve ... Any simple scheme would also need to support this. Together with the first point, you won't really be able to achieve any reduction in interface complexity for drivers.
- Thanks to a big discussion with Peter Zijlstra we have a rather solid plan forward for PI-boosting and bounded lock acquisition in linux-rt.
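
To make the -EALREADY point above concrete: execbuf-style ioctls get buffer lists from userspace that can contain the same object more than once, and drivers count on the locking code reporting that instead of deadlocking on itself. A hypothetical helper (only the ww_mutex calls are real, from today's <linux/ww_mutex.h>):

#include <linux/errno.h>
#include <linux/ww_mutex.h>

/*
 * Lock one buffer while walking a list that may contain duplicates.
 * ww_mutex_lock() returns -EALREADY when this acquire context already
 * holds the lock; for the caller that is the "duplicate entry" case and
 * not an error. The caller must remember that a duplicate was locked
 * only once, so it is also unlocked only once.
 */
static int lock_buffer_once(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
	int ret = ww_mutex_lock(lock, ctx);

	if (ret == -EALREADY)
		return 0;	/* duplicate entry, already held in this ctx */

	return ret;		/* 0 on success, -EDEADLK means back off and retry */
}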

Thus far all the proposed "simple" schemes fall short in one place or another.

Also, cross-device sync with dma_buf/fence isn't the only use-case I see rolling in:
- i915 is in desperate need of a finer-grained locking scheme. We run into ugly OOM issues all over the place due to our current "one lock to rule them all" design. We duct-tape over the worst with horrible hacks, but spurious OOMs are still too easy to hit. Moving to a per-object lock scheme using ww mutexes will fix that. Plan B would be to use ttm, but that'd be _really_ disruptive, and I don't like ttm that much.
- We're in the process of switching to per-object locking for drm modeset objects, but the complex (and dynamic) graph nature of all the connections poses some interesting issues. Ww mutexes would naturally solve this problem, too.
-Daniel

Wait/wound mutexes

Posted May 2, 2013 9:14 UTC (Thu) by blackwood (guest, #44174) [Link]

I forgot to stress this in my big reply to Maarten's comment, so let's reiterate: current kernels already ship with these mad deadlock-avoiding mutexes; they're simply called reservations instead of w/w mutexes.

So we both have code using w/w mutexes, and it's not really a new concept for drivers/gpu/drm. The last paragraph needs to be updated a bit ;-)

