
A surprise with mutexes and reference counts


Posted Dec 6, 2013 12:19 UTC (Fri) by sorokin (guest, #88478)
Parent article: A surprise with mutexes and reference counts

This looks like a bug in the mutex implementation, not a bug in the reference-counting code.

CPU2 from Linus's example was able to:

1. lock the mutex
2. see the other thread's side effects
3. make its own side effects
4. unlock the mutex
5. free the mutex's memory

while CPU1 was still inside mutex_unlock().
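The pattern under discussion can be sketched in userspace C (the names and layout here are illustrative, not the actual kernel code): the lock and the reference count are embedded in the object they protect, so the thread that drops the last reference frees memory that a slower unlocker may still be touching.

```c
/* Sketch of the racy pattern (illustrative names, not the actual
 * kernel code): the lock and the refcount are embedded in the
 * object they protect. */
#include <pthread.h>
#include <stdlib.h>

struct obj {
    pthread_mutex_t lock;
    int refcount;
};

/* Each thread calls this to drop its reference. */
void put_obj(struct obj *o)
{
    pthread_mutex_lock(&o->lock);
    int remaining = --o->refcount;
    pthread_mutex_unlock(&o->lock);  /* CPU1 may still be inside here...  */
    if (remaining == 0)
        free(o);                     /* ...when CPU2 reaches this free(). */
}
```

With two threads each holding a reference, the thread that sees `remaining == 0` frees the object even though the other thread may not yet have returned from pthread_mutex_unlock().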

The requirement for a mutex must be: if I have managed to lock the mutex, other threads must not access the mutex's memory.

A relaxed requirement could be: if I have managed to lock the mutex, make my side effects, and unlock it, other threads must not access the mutex's memory.



A surprise with mutexes and reference counts

Posted Dec 7, 2013 23:34 UTC (Sat) by giraffedata (guest, #1954) [Link] (2 responses)

Whether it's a mutex bug depends upon what you expect from a mutex, in particular what you consider to be the instant the lock is released.

As we usually define timing in library calls, we would say the lock is released sometime during the execution of mutex_unlock(), with no more precision than that. In the same way, we would say that mutex_unlock() can access its arguments at any time during its execution. With that definition, you obviously cannot put the lock inside the object being protected by the lock, because that means mutex_unlock() could conceivably access the protected object after it has released the lock.

But since that prevents a very useful locking paradigm -- one that allows an object to be somewhat autonomous and just disappear on its own when its last reference goes away -- most of us are used to a more precise guarantee: the lock gets released at some instant after mutex_unlock's last reference to the lock. I presume that's actually written into official specifications for things like POSIX mutexes.

That stronger guarantee is what the mutexes in question fail to provide; I don't know if the designer intended that or not. If he did, then the reference counting code that assumed the stronger guarantee is where the bug is.

A surprise with mutexes and reference counts

Posted Dec 10, 2013 8:13 UTC (Tue) by sorokin (guest, #88478) [Link]

> Whether it's a mutex bug depends upon what you expect
> from a mutex, in particular when you consider to be the
> instant the lock is released.

Completely agree. Whether the bug is in the mutex or in the reference-counting code depends on the contract of the mutex. Still, I believe that a mutex which can access its own memory for an indefinite amount of time after unlocking can be used safely only in very simple programs.

E.g. Linus said that you can keep the mutex outside of the object you are protecting. But this does not fix anything. How can we guarantee that CPU2, after freeing the reference-counted object, will not free the object where the mutex resides? After freeing the reference-counted object, CPU2 can do anything.

As we decided that a mutex which can access its memory for an indefinite amount of time is not safe, the only valid kind of mutex is one with some requirement on when it stops accessing its memory after unlock. So I tried to devise those requirements.

> the lock gets released at some instant after mutex_unlock's last reference to the lock

This is a very strong requirement. If it is feasible to implement efficiently -- good; if not, we can relax the requirement as I did.

A surprise with mutexes and reference counts

Posted Dec 10, 2013 8:20 UTC (Tue) by sorokin (guest, #88478) [Link]

Another solution could be a function that must be called before the mutex's memory is freed, something like a C++ destructor for the mutex. This function must ensure that all other threads have finished unlocking the mutex.
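One way to sketch that idea (purely illustrative; the `dmutex` names and the drain counter are inventions for this sketch, not an existing API) is to count the threads currently inside the unlock path and have the destructor spin until that count drains to zero:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

struct dmutex {
    pthread_mutex_t m;
    atomic_int unlockers;   /* threads currently inside dmutex_unlock() */
};

void dmutex_lock(struct dmutex *d)
{
    pthread_mutex_lock(&d->m);
}

void dmutex_unlock(struct dmutex *d)
{
    /* The counter brackets the entire unlock call, so a thread that
     * has logically released the lock but is still touching the
     * mutex memory remains visible to dmutex_destroy(). */
    atomic_fetch_add(&d->unlockers, 1);
    pthread_mutex_unlock(&d->m);
    atomic_fetch_sub(&d->unlockers, 1);
}

/* Must be called before the memory is freed: waits until every
 * thread has left dmutex_unlock(). */
void dmutex_destroy(struct dmutex *d)
{
    while (atomic_load(&d->unlockers) != 0)
        sched_yield();
    pthread_mutex_destroy(&d->m);
}
```

This is the "destructor" contract made concrete: the caller of `dmutex_destroy()` is the one that takes responsibility for waiting out late unlockers before the memory goes away.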

A surprise with mutexes and reference counts

Posted Dec 22, 2013 23:37 UTC (Sun) by dpotapov (guest, #46495) [Link] (1 responses)

The above code is definitely incorrect. Even if a pthread mutex were used instead, the code would still be incorrect.

The correct use of a pthread mutex requires invoking mutex_destroy() before freeing the memory, and any sane implementation of pthreads takes an internal spinlock (or does something similar), thus fixing the above problem.

So the mutex_lock and mutex_unlock functions are fine as they are; there is no need for additional requirements. It is just that the Linux kernel does not provide mutex_destroy, because there are no resources to free. However, it means that you cannot use such a mutex safely in patterns like the code above.
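Under this reading, the pattern becomes safe with pthreads once the last holder runs the destroy step before freeing. A hedged sketch (illustrative names; it relies on pthread_mutex_destroy serializing against a thread still inside pthread_mutex_unlock, which is exactly the property being asserted above):

```c
#include <pthread.h>
#include <stdlib.h>

struct obj {
    pthread_mutex_t lock;
    int refcount;
};

struct obj *obj_new(void)
{
    struct obj *o = malloc(sizeof *o);
    pthread_mutex_init(&o->lock, NULL);
    o->refcount = 1;   /* caller holds the initial reference */
    return o;
}

void obj_put(struct obj *o)
{
    pthread_mutex_lock(&o->lock);
    int last = (--o->refcount == 0);
    pthread_mutex_unlock(&o->lock);
    if (last) {
        /* The destroy step is what (on this reading) makes the
         * pattern safe: a sane implementation serializes it against
         * any thread still inside pthread_mutex_unlock(). */
        pthread_mutex_destroy(&o->lock);
        free(o);
    }
}
```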

A surprise with mutexes and reference counts

Posted Nov 27, 2014 10:42 UTC (Thu) by tvld (guest, #59052) [Link]

No, the code snippet in the article would be correct under pthreads. The Austin Group has clarified that they want this to work: http://austingroupbugs.net/view.php?id=811
C++11 makes a similar requirement. IIRC, C11 isn't specific, but it usually defaults to semantic equality with C++11 and/or POSIX in the synchronization constructs.

The underlying issue is that there is both a fast path and a slow path for wake-up, and other threads can wake up in both ways: a mutex release first wakes threads through the fast path and then through the slow path, so the slow-path wake-up can end up running concurrently with another thread that has already acquired the lock.
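The fast-path/slow-path structure can be seen in a minimal futex-based lock (a sketch in the style of Drepper's "Futexes Are Tricky"; the `futex_wait`/`futex_wake` wrappers are defined here for illustration, and 0/1/2 is the usual unlocked/locked/contended encoding):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(atomic_int *f, int val)
{
    syscall(SYS_futex, f, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *f, int n)
{
    syscall(SYS_futex, f, FUTEX_WAKE, n, NULL, NULL, 0);
}

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
static void lock(atomic_int *f)
{
    int c = 0;
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;                         /* fast path: uncontended */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {
        futex_wait(f, 2);               /* slow path: sleep in kernel */
        c = atomic_exchange(f, 2);
    }
}

static void unlock(atomic_int *f)
{
    if (atomic_fetch_sub(f, 1) != 1) {  /* state was 2: waiters possible */
        atomic_store(f, 0);             /* the lock is logically free here */
        futex_wake(f, 1);               /* slow path: still touches *f,
                                           possibly after another thread has
                                           re-locked it via the fast path,
                                           or freed the memory entirely */
    }
}
```

The comment on the final `futex_wake` is the window described above: between the release store and the wake call, another thread can take the fast path, finish its critical section, and tear the lock down.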

This affects glibc's mutex implementation too. There's no segfault or such, but there can be pending FUTEX_WAKE calls to memory locations that have since been unmapped or reused for a different futex. That can introduce spurious wake-ups for FUTEX_WAIT calls on other futexes, which violates the promise FUTEX_WAIT makes for a return value of 0.

I see two meaningful solutions for this:

(1) Clarify that spurious wake-ups (i.e., wake-ups that happen but are not due to FUTEX_WAKE calls by the code that created this futex) are allowed when FUTEX_WAIT returns 0. This changes the futex contract, but it's incompletely documented anyway. This doesn't break any code in glibc or other futex-using code that I'm aware of. Generally, unless you have a one-shot synchronization mechanism, typical futex uses have to tolerate stale slow-path wake-ups anyway, so they need to be robust to spurious wake-ups. Also, all programs using glibc have been and are affected by such spurious wake-ups anyway.

(2) Introduce a new futex type or a variant of FUTEX_WAIT whose contract explicitly allows spurious wake-ups, combined with a FUTEX_WAKE that wakes only these calls or futexes. This has the benefit that nothing in the contract of the original futex changes, but it requires a kernel implementation change, and many futex users would need to change (e.g., glibc).
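Option (1) is viable because typical futex users already wrap FUTEX_WAIT in a predicate loop; a waiter written this way tolerates any spurious return (the `futex_wait_` wrapper is defined here for illustration):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait_(atomic_int *f, int expected)
{
    syscall(SYS_futex, f, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Block until *flag becomes nonzero. Re-checking the predicate on
 * every iteration makes the waiter immune to spurious wake-ups:
 * an early return from FUTEX_WAIT just causes one extra loop. */
static void wait_for_flag(atomic_int *flag)
{
    while (atomic_load(flag) == 0)
        futex_wait_(flag, 0);
}
```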

Solving this entirely in userspace is possible, but it would degrade performance: if you allow only slow-path wake-ups, lock-release latency increases and thus scalability decreases. Trying to implement one of the deferred memory-reclamation schemes that Paul mentioned is hard for process-shared mutexes.

For more background, see https://sourceware.org/ml/libc-alpha/2014-04/msg00075.html


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds