A surprise with mutexes and reference counts

By Jonathan Corbet
December 4, 2013

A kernel crash first reported in July for the 3.10 kernel has finally been diagnosed. In the process, some suspect locking in the pipe code has been fixed, but there is a more interesting outcome as well: it appears that an "obviously correct" reference count code pattern is not so correct after all. Understanding why requires looking at the details of how mutexes are implemented in the Linux kernel.

Conceptually, a kernel mutex is a simple sleeping lock. Once a lock is obtained with mutex_lock(), the owner has exclusive access to the protected resources. Any other thread attempting to acquire the same lock will block until the lock becomes free. In practice, the implementation is a little more complex than that. As was first demonstrated in the Btrfs code, performance can be improved if a thread that encounters an unavailable lock actively spins for a brief period (in the hope that the lock will be quickly released) before going to sleep. A running thread that is able to quickly grab a mutex in this manner is likely to be cache-hot, so this type of optimistic spinning tends to improve the overall throughput of the system. Needless to say, the period of spinning should be relatively short; it also will not happen at all if another thread is already spinning or if the owner of the mutex is not actually running.

With that background in place, consider a piece of code like the following which manipulates a structure called "s"; that structure is protected by a mutex embedded within it:

    int free = 0;

    mutex_lock(&s->lock);
    if (--s->refcount == 0)
        free = 1
    mutex_unlock(&s->lock);
    if (free)
	kfree(s);

This code looks like it should work fine; the structure is only freed if it is known that no more references to it exist. But, as Linus figured out while pondering the old bug report, the above code is not correct at all with the current mutex implementation.

The core of the mutex data structure includes an atomic counter (count) and a spinlock (wait_lock). When the lock is free, count has a value of one. A call to mutex_lock() on an available lock will simply decrement count to zero and continue; that is the fast path that, with luck, will be executed most of the time. Should the lock be taken, though, count will be set to a negative number (to indicate that somebody else wants the lock), and the frustrated caller will possibly spin, waiting for count to be set to one once again.

When the call to mutex_unlock() comes, the fast path (executed when count is zero) is, once again, simple: count is incremented back to a value of one. Should count be negative, though, the slow-path code must be run to ensure that waiting threads know that the lock has become available. In a simplified form, this code does the following:

    spin_lock(&lock->wait_lock);
    atomic_set(&lock->count, 1);
    wake_up_process(/* the first waiting thread */);
    spin_unlock(&lock->wait_lock);

The problem can now be seen: the thread which is spinning, waiting for the lock to be free, will break out of its loop once it sees the effect of the atomic_set() call above. So, while the original lock holder is calling wake_up_process() to wake somebody who is waiting for the lock, a thread on another CPU is already proceeding in the knowledge that it owns the same lock. If that thread quickly frees the data structure containing the lock, the final spin_unlock() call will make changes to memory that has already been freed and, possibly, allocated to another user. It is an uncommon race to hit, but, as the original bug report shows, it can occasionally happen.

All this led Linus to conclude:

In other words, it's unsafe to protect reference counts inside objects with anything but spinlocks and/or atomic refcounts. Or you have to have the lock *outside* the object you're protecting (which is often what you want for other reasons anyway, notably lookup).

There is an obvious question that immediately arises from this conclusion: how many other places in the kernel might be affected by this kind of bug? Mutexes and reference counts are both heavily used in the kernel; there must certainly be other places that use the two of them together in an unsafe manner (though Linus is doubtful that they exist). Needless to say, it would be nice to fix that code; the real question is how that might be done. One option is to audit all mutex-using code to find problematic usage, but that would be a long and unpleasant task — one that is unlikely to ever be completed in a comprehensive manner.

The alternative — fixing mutexes to eliminate the failure mode described above — seems easier and more robust in the long term. Ingo Molnar suggested a couple of ways in which that could be done, both based on the understanding that the use of both count and wait_lock to protect parts of the lock is at the root of the problem. The first suggestion was to eliminate count and turn wait_lock into a sort of special, tri-state spinlock; the other, instead, is to use count to control all access to the lock. Either approach, it seems, has the potential to reduce the size of the mutex structure and reduce architecture-specific code along with fixing the problem.

As of this writing, no patches have been posted. It would be surprising, though, if a fix for this particular problem did not surface by the time the 3.14 merge window opens. Locking problems are hard enough to deal with when the locking primitives have simple and easily understood behavior; having subtle traps built into that layer of the kernel is a recipe for a lot of long-term pain.

Index entries for this article
Kernel	Locking mechanisms/Mutexes
Kernel	Race conditions
Kernel	Reference counting

A surprise with mutexes and reference counts

Posted Dec 5, 2013 9:24 UTC (Thu) by runekock (subscriber, #50229) [Link] (9 responses)

The example in the article:

int free = 0; mutex_lock(&s->lock); if (--s->refcount == 0) free = 1 mutex_unlock(&s->lock); if (free) kfree(s);

is unsafe even barring any unexpected implementation details of the lock. There is nothing preventing new users of the data structure incrementing refcount between the mutex_unlock and the kfree (unless that is prevented by some other mechanism, of course).

It seems to me that protecting the allocation of any kind of lock with itself is unsafe in general. Though I bet Paul McKenney could get away with it, it's probably a thing to avoid in code touched by the average hacker.

A surprise with mutexes and reference counts

Posted Dec 5, 2013 9:36 UTC (Thu) by Karellen (subscriber, #67644) [Link] (2 responses)

Sorry, I don't see the problem at all.

Where do these "new users of the data structure" come from? In the case we're talking about, at the point the mutex lock is acquired, the refcount is 1. The current thread is the only thread with a reference to the protected data structure. At that point, there should be no other users at all, running or not, who *could* be able to alter the refcount or any other part of the data structure.

A surprise with mutexes and reference counts

Posted Dec 5, 2013 13:52 UTC (Thu) by zmower (subscriber, #3005) [Link] (1 responses)

There's another thread on a different CPU doing the same thing.

Runekock's right. But for some reason Linus thinks the second CPU does the free. I'll take his word for it. ;)

A surprise with mutexes and reference counts

Posted Dec 5, 2013 16:02 UTC (Thu) by PaXTeam (guest, #24616) [Link]

> There's another thread on a different CPU doing the same thing.

then you have a refcount underflow bug to begin with, nothing to do with the problem discussed here.

A surprise with mutexes and reference counts

Posted Dec 5, 2013 16:44 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Even Paul McKenney is going to have to provide some way of preventing the memory protected by the lock from being freed while acquiring the lock.

One way approach is to use a deferred-reclamation scheme, such as RCU (my personal favorite), SLAB_DESTROY_BY_RCU, hazard pointers (Maged Michael's personal favorite), a general-purpose garbage collector, or even the rough-and-ready approach of simply never freeing any memory. Another approach is to make use of a lock or reference counter that lives outside of the memory being protected, the so-called "hand-over-hand locking" being one (slow!) example, and hashed arrays of locks being another (also slow!) example. Yet another approach is to use a scheme that allows the reference-count increment to proceed only if the pointer to the memory remains unchanged, for example, using MC68020's dcas instruction or one of the transactional-memory approaches. Of course, in the case of transactional memory, make sure to take into account the forward-progress properties!

If the only protection for a piece of memory is a lock inside that piece of memory, life is hard. As far as I know, Gamsa et al. were the first to point this out in their 1999 OSDI paper: http://www.usenix.org/events/osdi99/full_papers/gamsa/gam....

A surprise with mutexes and reference counts

Posted Dec 5, 2013 21:45 UTC (Thu) by bfields (subscriber, #19510) [Link] (3 responses)

There is nothing preventing new users of the data structure incrementing refcount between the mutex_unlock and the kfree

Assume the object is created with a positive refcount and thereafter only ever decremented, never incremented.

Then once the refcount goes to zero it's never going positive again. So the problem you're concerned about doesn't happen. But there's still a problem:

The problem is we assume that it's still safe to call mutex_unlock(&s->lock) after dropping our own reference.

But in fact it's possible that we decrement to 1, then another thread jumps in and decrements to 0 while our mutex_unlock(&s->lock) is still in progress and frees s out from under us.

A surprise with mutexes and reference counts

Posted Dec 5, 2013 22:00 UTC (Thu) by bfields (subscriber, #19510) [Link]

(But yes, agreed on the "unsafe in general" comment, I would've thought the same. While also sympathizing with anyone caught by surprise by the fact that another user could acquire and drop the lock while we still haven't returned from the unlock!)

A surprise with mutexes and reference counts

Posted Dec 6, 2013 17:50 UTC (Fri) by Karellen (subscriber, #67644) [Link] (1 responses)

How can "another thread jump in and decrement to 0" before our unlock is complete? Until our unlock is complete, we still hold the lock. And until we've completed the unlock, no other thread should be able to grab the lock in order to drop the refcount to 0.

A surprise with mutexes and reference counts

Posted Dec 6, 2013 18:41 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

If you define "unlock is complete" as the point where other threads are allowed to run in the critical section, then you're right, that can't happen. If I understand the issue correctly, the problem is that this happens before we return from mutex_unlock(), and mutex_unlock() continues to access the memory for the mutex after it's been unlocked and other threads have been allowed to run. These other threads could free the memory which holds the mutex, leading to an invalid memory access in mutex_unlock().

A surprise with mutexes and reference counts

Posted Dec 17, 2013 14:06 UTC (Tue) by chrisV (guest, #43417) [Link]

> ... is unsafe even barring any unexpected implementation details of the
> lock. There is nothing preventing new users of the data structure
> incrementing refcount between the mutex_unlock and the kfree (unless
> that is prevented by some other mechanism, of course)

That isn't an issue. Reference counted structures like this have to be regarded as defunct once the last thread with access to it has acquired the mutex in order to decrement the count to 0. The deferring of the freeing of memory until after the internal mutex has been unlocked is an implementation detail. You could not access the structure again once the count has reached zero even if the freeing of memory were to take place within the lock (which of course it can't).

You have to marshal access to ensure that once the count has reached 0 the structure is not accessed again, irrespective of the deferred freeing, such as by having all threads acquire their references before any one of them releases one. That gives rise to other issues concerning the locking strategy for the structure, which are not related to the bug with which this article is concerned.

A surprise with mutexes and reference counts

Posted Dec 6, 2013 12:19 UTC (Fri) by sorokin (guest, #88478) [Link] (5 responses)

Looks like this is a bug in mutex implementation, not a bug in reference counting code.

CPU2 from Linus example was able to

1. lock mutex
2. see other thread side effects
3. make its own side effects
4. unlock mutex
5. free memory of mutex

while CPU1 was still unlocking mutex.

The requirement for mutex must be: if I have managed to lock mutex, the other threads must not access mutex memory.

The relaxed requirement could be: if I have managed to lock/make side effects/unlock mutex, other thread must not access mutex memory.

A surprise with mutexes and reference counts

Posted Dec 7, 2013 23:34 UTC (Sat) by giraffedata (guest, #1954) [Link] (2 responses)

Whether it's a mutex bug depends upon what you expect from a mutex, in particular when you consider to be the instant the lock is released.

As we usually define timing in library calls, we would say the lock is released sometime during the execution of mutex_unlock(), with no more precision than that. In the same way, we would say that mutex_unlock() can access its arguments at any time during its execution. With that definition, you obviously cannot put the lock inside the object being protected by the lock, because that means the mutex_unlock() could conceivably access the protected object after it has released the lock.

But since that prevents a very useful locking paradigm -- one that allows an object to be somewhat autonomous and just disappear on its own when its last reference goes away -- most of us are used to a more precise guarantee: the lock gets released at some instant after mutex_unlock's last reference to the lock. I presume that's actually written into official specifications for things like POSIX mutexes.

That stronger guarantee is what the mutexes in question fail to provide; I don't know if the designer intended that or not. If he did, then the reference counting code that assumed the stronger guarantee is where the bug is.

A surprise with mutexes and reference counts

Posted Dec 10, 2013 8:13 UTC (Tue) by sorokin (guest, #88478) [Link]

> Whether it's a mutex bug depends upon what you expect
> from a mutex, in particular when you consider to be the
> instant the lock is released.

Completely argee. Whether the bug is in mutex or in reference counting code depends on the contract of the mutex. Still I believe, that mutex that can access mutex memory for undefinite amount of time after unlocking can be used safely only in very simple programs.

E.g. Linus said that you can keep mutex outside of object you are protecting. But this does not fix anything. How can we guarantee that CPU2 after removing reference counting object will not remove object where mutex reside? After removing reference counting object CPU2 can do anything.

As we decided that mutex that can access mutex memory for undefinite amount is not safe, the only valid kind of mutex is one that has some requirement when it stops accessing its memory after unlock. So I tried to devise those requirements.

> the lock gets released at some instant after mutex_unlock's last reference to the lock

This is very strong requirement. If it is feasible to implement effectively -- good, if not we can relax our requirement as I did.

A surprise with mutexes and reference counts

Posted Dec 10, 2013 8:20 UTC (Tue) by sorokin (guest, #88478) [Link]

Another solution can be having a function that must be called before mutex memory is freed. Something like C++ destructor for mutex. This function must ensure, that all other threads finished unlocking mutex.

A surprise with mutexes and reference counts

Posted Dec 22, 2013 23:37 UTC (Sun) by dpotapov (guest, #46495) [Link] (1 responses)

The above code is definitely incorrect. Even if a pthread mutex was used instead, the code would be still incorrect.

The correct use of a pthread mutex requires to invoke mutex_destroy() before freeing memory, and any sane implementation of pthreads takes an internal spinlock (or do something similar) and thus fixing the above problem.

So mutex_lock and mutex_unlock functions are fine as they are. There is no need for additional requirements. It is just that the Linux kernel does not provide mutex_destroy, because there is no resources to free. However, it means that you can use such a mutex safely in patterns like the code above.

A surprise with mutexes and reference counts

Posted Nov 27, 2014 10:42 UTC (Thu) by tvld (guest, #59052) [Link]

No, the code snippet in the article would be correct under pthreads. The Austin Group has clarified that they want this to work: http://austingroupbugs.net/view.php?id=811
C++11 makes a similar requirement. IIRC, C11 isn't specific, but they usually default to semantic equality with C++11 and/or POSIX in the synchronization constructs.

The underlying issue is that there is both a fast path and a slow path for wake-up, and other threads can wake up in both ways because mutex release first wakes up through the fast-path and then throw the slow path; then, it can happen that the slow path wake-up runs concurrently with another thread having the lock acquired.

This affects glibc's mutex implementation too. There's no segfault or such, but there can be pending FUTEX_WAKE calls to memory locations that might have been unmapped or reused for a different futex. That can introduce spurious wake-ups for FUTEX_WAIT calls on other futexes, which violates the promise FUTEX_WAIT makes for a return value of 0.

I see two meaningful solutions for this:

(1) Clarify that spurious wake-ups (ie, wake-ups that happen but are not due to FUTEX_WAKE calls by the code that created this futex) are allowed when FUTEX_WAIT returns 0. This changes the futex contract, but it's incompletely documented anyway. This doesn't break any code in glibc or
other futex-using code that I'm aware of. Generally, unless you have a one-shot synchronization mechanism, typical futex uses will have to tolerate past slow-path wake-ups anyway, so need to be robust to spurious wake-ups. Also, all programs using glibc will have been and are affected by such spurious wake-ups anyway.

(2) Introduce a new futex type or variant of FUTEX_WAIT whose contract explicitly allows spurious wake-ups; combine with a FUTEX_WAKE that wakes only these calls or futexes. Has the benefit that nothing in contract of the original futex changes, but requires a kernel implementation change and many futex users will need to change (e.g., glibc).

Solving this in userspace entirely is possible, but would degrade performance. If you just allow slow-path wake-up, lock release latency increases and thus scalability decreases. Trying to implement some of the deferred memory reclamation schemes that Paul mentioned is hard for process-shared mutexes.

For more background, see https://sourceware.org/ml/libc-alpha/2014-04/msg00075.html