LWN: Comments on "Wait/wound mutexes" https://lwn.net/Articles/548909/ This is a special feed containing comments posted to the individual LWN article titled "Wait/wound mutexes".
Nesting wait/wound mutexes? https://lwn.net/Articles/558742/ https://lwn.net/Articles/558742/ vomlehn <div class="FormattedComment"> I can certainly see that notification that you have to unlock currently held mutexes to avoid a deadlock is a Good Thing and could simplify life. Still, given that you may have to undo, and then redo, work when you unlock a w/w mutex and then reacquire it, I am left with the question of knowing how far you *have to* back off in order to avoid a deadlock. It would be nice if you didn't have to release every single w/w mutex that you hold.<br> <p> It may be the case that, as a practical matter, this is not a performance issue. I can see, however, a maintenance issue. One subsystem may acquire a w/w mutex, then call another subsystem that acquires a different w/w mutex. This w/w mutex implementation implies that the called subsystem will have to know that the caller holds a w/w mutex and return to that subsystem (or use a callback to that subsystem) to release the first mutex. Ick.<br> <p> I can conceive of the w/w mutex code keeping track of the w/w mutexes held by a given task and providing feedback on whether any more mutexes need to be released, which simplifies the problem a little from the performance perspective.<br> <p> All in all, I'm not sure this is ready for prime time.<br> </div> Sat, 13 Jul 2013 03:22:40 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550772/ https://lwn.net/Articles/550772/ dlang <div class="FormattedComment"> It may be "only an implementation detail" in that the user of the API never deals with the sequence number.<br> <p> But this detail is critical to avoiding the risk of permanent starvation of some thread.<br> </div> Wed, 15 May 2013 20:49:58 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550677/ https://lwn.net/Articles/550677/ mlankhorst <div class="FormattedComment"> Correct. The current implementation preserves the sequence number. But it's an implementation detail; the user of the API will never have to work with the sequence numbers directly.<br> </div> Wed, 15 May 2013 13:45:07 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550449/ https://lwn.net/Articles/550449/ ksandstr <div class="FormattedComment"> The distinct ordering of locks is required to prevent the client from turning a set of N locks, out of which M are required, into a very expensive spinlock; the low-level idea being that the lock that violates ordering (and generates the EDEADLK status) is a different one each time, and that the fact that it is locked may turn out not to be due to the current client's other locks. WW mutexes prevent this by sleeping the backing-off client until the conflicting mutex has been released at least once, which is strictly required for its progress.<br> <p> But as you say, for algorithms where the needed set of locks cannot change between backout and retry, your solution is adequate.
I've found that those algorithms generally aren't candidates for fine-grained locking, though that's neither here nor there.<br> <p> Personally, if wait/wound mutexes remove these halfway esoteric concerns from mainstream kernel code (along with the entire idea of lock ordering), I'll be happy as a clam.<br> </div> Mon, 13 May 2013 17:40:05 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550147/ https://lwn.net/Articles/550147/ dlang <div class="FormattedComment"> From the article:<br> <p> <font class="QuotedText">&gt; But, since the sequence number increases monotonically, a once-wounded thread must eventually reach a point where it has the highest priority and will win out.</font><br> <p> They don't say explicitly that the sequence number is maintained, but I don't see what this would mean otherwise.<br> </div> Sat, 11 May 2013 02:21:12 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550137/ https://lwn.net/Articles/550137/ brong <div class="FormattedComment"> Hang on - are you saying that the sequence number you were allocated persists even after you back out and try again?<br> <p> My understanding was that once you're wounded, you have to restart from scratch. If you're restarting with the same (low) sequence number rather than being allocated a brand new one, then I see your point. Otherwise, I see starvation possibilities, the horror.<br> </div> Fri, 10 May 2013 23:44:36 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550092/ https://lwn.net/Articles/550092/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt; but it seems the other threads may burn unlimited CPU time attempting to take locks and retrying in pathological cases.</font><br> <p> Undefined CPU time, but not unlimited.<br> <p> Each thread gets a sequence number when it starts trying to get a lock, and when two threads conflict, the one with the lower sequence number wins.<br> <p> As a result, every thread is guaranteed to be able to get the lock in preference to any threads that first tried to get the lock after it did.<br> <p> This puts an upper bound on the amount of CPU it will waste in the meantime (admittedly, not a bound that you can readily calculate, since you don't know how long the locks will be held by processes that have priority over you).<br> </div> Fri, 10 May 2013 19:17:40 +0000
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/550052/ https://lwn.net/Articles/550052/ heijo <div class="FormattedComment"> The procedure can simply be to remember the locks you wanted to take in all previous attempts and start the next attempt by sorting them by lock order and taking them.<br> <p> Eventually you'll succeed, although it may take time quadratic in the total number of locks that may be taken across attempts.<br> <p> Wait/wound, on the other hand, guarantees that the first-coming thread will progress in linear time in the number of locks, but it seems the other threads may burn unlimited CPU time attempting to take locks and retrying in pathological cases.<br> <p> This could be fixable by having the "later" thread wait directly for all earlier threads to be done, to save power at the expense of latency, although this is probably not an issue in practice.<br> </div> Fri, 10 May 2013 17:08:36 +0000
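For readers who haven't seen the interface under discussion, here is a minimal sketch of the acquire/backoff cycle, using the ww_mutex names that eventually landed in mainline (linux/ww_mutex.h); struct obj and lock_pair() are invented for illustration and are not taken from any driver. The detail relevant to this thread is that the acquire context, not the individual lock call, carries the sequence number, and that it survives every backoff:

#include <linux/ww_mutex.h>

/* One w/w class shared by every lock that may be taken together. */
static DEFINE_WW_CLASS(demo_ww_class);

/* Hypothetical object; each lock is set up elsewhere with
 * ww_mutex_init(&obj->lock, &demo_ww_class). */
struct obj {
    struct ww_mutex lock;
    /* ... data protected by the lock ... */
};

/* Take both objects' locks, in whatever order, backing off on -EDEADLK. */
static int lock_pair(struct obj *a, struct obj *b)
{
    struct ww_acquire_ctx ctx;
    struct obj *first = a, *second = b, *tmp;
    int ret;

    /*
     * The acquire context carries the sequence number ("ticket") and
     * keeps it across any number of backoffs, so a wounded thread only
     * ever gets older and must eventually win.
     */
    ww_acquire_init(&ctx, &demo_ww_class);

    ret = ww_mutex_lock(&first->lock, &ctx);
    if (ret == -EDEADLK) {
        /* Nothing held yet, so nothing to drop: take it the slow way. */
        ww_mutex_lock_slow(&first->lock, &ctx);
        ret = 0;
    }

    while (ret == 0) {
        ret = ww_mutex_lock(&second->lock, &ctx);
        if (ret != -EDEADLK)
            break;    /* the plain lock call returns 0 or -EDEADLK, so this means both locks are held */

        /*
         * Wounded: drop everything held in this context, sleep until
         * the contended mutex has been released at least once, take it
         * via the slow path, then go back for the lock that was
         * dropped.  Undoing work done under the dropped lock is the
         * caller's problem.
         */
        ww_mutex_unlock(&first->lock);
        ww_mutex_lock_slow(&second->lock, &ctx);
        tmp = first;
        first = second;
        second = tmp;
        ret = 0;
    }

    if (ret == 0) {
        ww_acquire_done(&ctx);    /* no further locks in this context */

        /* ... operate on both objects ... */

        ww_mutex_unlock(&a->lock);
        ww_mutex_unlock(&b->lock);
    }

    ww_acquire_fini(&ctx);
    return ret;
}

The ww_mutex_lock_slow() call is what keeps this from degenerating into the expensive spin that ksandstr describes above: the wounded thread sleeps until the contended mutex has actually been released at least once before it retries.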
They're mutexes, Jim, but not as we know them https://lwn.net/Articles/549898/ https://lwn.net/Articles/549898/ ksandstr <div class="FormattedComment"> There are at least two concrete reasons for using wait/wound mutexes over the more common "define a lock ordering, and (very carefully) violate it with try-lock primitives" scheme of fine-grained multiple locking.<br> <p> The first is that a failure to try-lock requires some sort of fall-back procedure that either doesn't violate locking order, or does so with a different try-lock subject than in previous iterations. Coming up with that procedure is an enormous hassle, and cramps many a concurrent design. Wait/wound mutexes would seem to avoid this hazard by permitting the wounded thread to resume only after the contended mutex has been released at least once.<br> <p> The second is that (depending on the interface) wait/wound mutexes could reduce the "slumbering herd" effect that occurs when a thread holds a number of mutexes but then has to wait on one more, repeating down the tree. (This effect also tends to magnify non-mutex waits through the mutex scheme, making it especially pernicious.) This reduction would happen by having the wounded thread wait for the contended mutex only after releasing its own, thereby allowing sufficiently unrelated threads to proceed unimpeded. The net gain is lower aggregate latency under contention.<br> </div> Thu, 09 May 2013 21:48:20 +0000
Wait/Wake https://lwn.net/Articles/549141/ https://lwn.net/Articles/549141/ jake <div class="FormattedComment"> <font class="QuotedText">&gt; I see two usages intersprinkled, "wake/wound" and "wait/wound".</font><br> <p> Indeed. Typo alert. "wait/wound" is correct, fixed now, thanks.<br> <p> jake<br> </div> Thu, 02 May 2013 21:16:01 +0000
Wait/Wake https://lwn.net/Articles/549136/ https://lwn.net/Articles/549136/ ncm <div class="FormattedComment"> I see two usages intersprinkled, "wake/wound" and "wait/wound". Is this what amounts to a benign disagreement over spelling (I note pronunciation is about the same), or did I miss something?<br> </div> Thu, 02 May 2013 21:07:37 +0000
Wait/wound mutexes https://lwn.net/Articles/549050/ https://lwn.net/Articles/549050/ blackwood <div class="FormattedComment"> I've forgotten to stress this a bit in my big reply to Maarten's comment, so let's reiterate: Current kernels already ship with these mad deadlock-avoiding mutexes; they're simply called reservations instead of w/w mutexes.<br> <p> So we both have code using w/w mutexes and it's not really a new concept for drivers/gpu/drm. The last paragraph needs to be updated a bit ;-)<br> </div> Thu, 02 May 2013 09:14:15 +0000
Wait/wound mutexes https://lwn.net/Articles/549047/ https://lwn.net/Articles/549047/ blackwood <div class="FormattedComment"> Yeah, the big reason for pushing these ww mutexes into core code from ttm (where they are called reservations) is to enable cross-device synchronization of dma access to shared buffer objects (aka dma_bufs). Currently access is completely unsynchronized in the kernel, so if userspace doesn't block (which it really shouldn't), displaying a frame rendered on a discrete gpu on the integrated one will horribly tear.<br> <p> Rob Clark started a proposal for dma_fences (now in Maarten's branch), which are attached to each dma_buf taking part in any given gpu render batch (or any other dma op affecting a dma_buf).<br> <p> Now if you naively walk all the dma_bufs, lock each of them one-by-one and attach a new set of fences, races with other threads have a peculiar effect: If you're unlucky you can create a synchronization loop between a bunch of buffers and fences, and since these fences can be fully hw-based (i.e.
never wake up the cpu to do the syncing) you'll end up with deadlocked gpus, each waiting on the other.<br> <p> Hw sync/wait deadlocks between different drivers are the kind of fun I don't want to deal with, so we need a multi-object locking scheme which works across devices.<br> <p> Note that i915 isn't currently based on ttm (and personally I'm not known as a big fan of ttm), but the proposed ww mutex stuff is massively improved:<br> - not intermingled with all the gpu memory management craziness ttm also does<br> - sane, clear API (we grade on a curve in gpu-land ...) with decent documentation<br> - _really_ good debug support - while writing the kerneldoc for Maarten's code I've checked whether any kind of interface abuse would be caught. Furthermore we have a slowpath injection debug option to exercise the contended case (-EDEADLK) with single-threaded tests.<br> <p> Now one critique I hear often is "why can't you guys use something simpler?". Reasons against that:<br> - Current ttm-based drivers (radeon, nouveau, ...) already deal with this complexity. Furthermore, gpus tend to die randomly, so all the command submission ioctls for the big drivers (i915, radeon, nouveau) are already fully restartable. Ditching a tiny bit of code to handle the ww mutex slowpath won't shed complexity.<br> - Current drivers rely on the -EALREADY semantics for correctness. Crazy, I know, but like I've said: we grade on a curve ... Any simple scheme would need to support this, too. Together with the first point, you won't really be able to achieve any reduction in interface complexity for drivers.<br> - Thanks to a big discussion with Peter Zijlstra we have a rather solid plan forward for PI-boosting and bounded lock acquisition in linux-rt.<br> <p> Thus far all the proposed "simple" schemes fall short in one place or another.<br> <p> Also, cross-device sync with dma_buf/fence isn't the only use case I see rolling in:<br> - i915 is in desperate need of a finer-grained locking scheme. We run into ugly OOM issues all over the place due to our current "one lock to rule them all" design. We duct-tape over the worst with horrible hacks, but spurious OOMs are still too easy to hit. Moving to a per-object lock scheme using ww mutexes will fix that. Plan B would be to use ttm, but that'd be _really_ disruptive, and I don't like ttm that much.<br> - We're in the process of switching to per-object locking for drm modeset objects, but the complex (and dynamic) graph nature of all the connections poses some interesting issues. Ww mutexes would naturally solve this problem, too.<br> -Daniel<br> </div> Thu, 02 May 2013 09:03:48 +0000
Wait/wound mutexes https://lwn.net/Articles/549042/ https://lwn.net/Articles/549042/ mlankhorst <div class="FormattedComment"> Yeah, the full branch is at <a href="http://cgit.freedesktop.org/~mlankhorst/linux/log/">http://cgit.freedesktop.org/~mlankhorst/linux/log/</a> , with TTM converted and a WIP to do the same on intel and sync shared dma-bufs between radeon/nouveau and intel. The actual sharing part is still a bit hacky, and less reviewed. This is because it involves synchronization between multiple GPUs, which is a step further.<br> </div> Thu, 02 May 2013 08:07:55 +0000
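As a rough, driver-agnostic illustration of the multi-buffer step Daniel and Maarten describe (locking every shared buffer under one acquire context before attaching fences), the sketch below again uses the mainline ww_mutex interface. Here struct buffer, lock_buffers() and unlock_buffers() are hypothetical stand-ins; the real TTM reservation and dma_buf/fence structures are not shown:

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(buf_ww_class);

/* Hypothetical stand-in for a reservation object / dma_buf; only the
 * w/w mutex matters for this sketch.  Each lock is set up elsewhere
 * with ww_mutex_init(&buf->lock, &buf_ww_class). */
struct buffer {
    struct ww_mutex lock;
    /* ... fences, backing storage, ... */
};

/*
 * Lock every shared buffer under one acquire context before any fences
 * are attached, so two tasks sharing buffers can never set up a
 * circular hardware wait between them.
 */
static int lock_buffers(struct buffer **bufs, int count,
                        struct ww_acquire_ctx *ctx)
{
    int contended = -1;    /* index already taken via the slow path */
    int i, ret;

    ww_acquire_init(ctx, &buf_ww_class);
retry:
    for (i = 0; i < count; i++) {
        if (i == contended)    /* still held from the slow path */
            continue;

        ret = ww_mutex_lock(&bufs[i]->lock, ctx);
        if (ret == -EDEADLK) {
            int busy = i;

            /* Back off: drop every lock this context holds. */
            while (--i >= 0)
                ww_mutex_unlock(&bufs[i]->lock);
            if (contended > busy)
                ww_mutex_unlock(&bufs[contended]->lock);

            /* Sleep until the contended lock is free, take it
             * unconditionally, then restart the whole pass. */
            ww_mutex_lock_slow(&bufs[busy]->lock, ctx);
            contended = busy;
            goto retry;
        }
    }

    ww_acquire_done(ctx);    /* all buffers locked; fences may be attached */
    return 0;
}

static void unlock_buffers(struct buffer **bufs, int count,
                           struct ww_acquire_ctx *ctx)
{
    int i;

    for (i = 0; i < count; i++)
        ww_mutex_unlock(&bufs[i]->lock);
    ww_acquire_fini(ctx);
}

The order in which different tasks walk their buffer lists no longer matters; the age of the acquire context decides who backs off, and the backed-off task restarts its pass only after the contended buffer has been released.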
Wait/wound mutexes https://lwn.net/Articles/549040/ https://lwn.net/Articles/549040/ airlied <div class="FormattedComment"> Oh, there is code to use it; we have code in the drm TTM layer doing ordered buffer reservations with backoff for deadlocks, and we need something at a higher level for dma-buf to use if we are sharing buffers between multiple GPU drivers or other misc drivers.<br> <p> So Maarten has code to port TTM over to this infrastructure already in a branch, and has posted it to dri-devel previously, I think.<br> </div> Thu, 02 May 2013 07:54:34 +0000