
Adaptive spinning futexes

By Jonathan Corbet
May 11, 2010
As a general rule, a well-written program should, when it needs a resource currently owned by another program, step aside and allow other work to proceed until that resource becomes available. When it comes to low-level synchronization primitives, though, this rule does not always hold. Better overall system performance can often be achieved if a program busy-waits rather than sleeping. If the wait is short, the performance benefits that come from giving the resource to an already-running, cache-hot process outweigh the cost of the busy wait.

The best-supported (by the kernel) user-space synchronization primitive is the futex. Darren Hart has been working on a patch series intended to bring adaptive spinning to futexes in an attempt to improve the performance of multi-threaded applications. These patches, while still marked as "not ready for inclusion," have evolved considerably over time.

The core idea is simple: if a process attempts to acquire a futex which is already owned by another, it will spin in an acquisition loop until the holding process either releases the futex or is scheduled out. If all goes well, the new process will be able to grab the futex quickly and get on with its work in the most efficient way. In practice, adaptive spinning generally outperforms regular futexes, but only occasionally does better than the highly tweaked, assembly-coded adaptive spinning mutex code used by the pthreads library.

Adaptive spinning requires that the kernel know which process currently owns the futex; that is a problem, because the current futex operations do not provide that information. So a new locking operation is required in situations where adaptive spinning is to be used.

There is an alternative approach which has been recommended by some developers: do the spinning in user space rather than in the kernel. User-space spinning might just be faster, but it's trickier, because it's harder for user space to know whether the current holder of a futex is executing or not. Providing the requisite information will require the design of a special (and fast) API - work which has not yet been done.



Adaptive spinning futexes

Posted May 13, 2010 10:10 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

On a Hyper-Threaded processor, it is probably best to let the other
side of the processor finish its job and release the futex. Maybe the
futex should know on which processor it is locked.

Adaptive spinning futexes

Posted May 13, 2010 11:38 UTC (Thu) by farnz (guest, #17727) [Link]

That gets into implementation-dependent knowledge - on some implementations of hyperthreading, if one thread is just reading a memory location, testing the value, and looping, almost all the execution units will be spare for the other hyperthread. If that thread releases the lock quickly, the thread that has been spinning is instantly ready to work, and you get maximum benefit from hyperthreading.

Adaptive spinning futexes

Posted May 13, 2010 14:13 UTC (Thu) by ejr (subscriber, #51652) [Link]

And in this situation the lock will be in L1 data cache, while the data *and* instructions needed for switching contexts will be in L2 at best, most likely in memory. Spinning within a shared cache is good for performance (not discussing energy, many other factors in play for that). Short locking times where the lock sits in shared cache with access times in the 10s of cycles call for spinning.

A problem with adaptive spinlocks occurs when you spin on something bounced between memory / caches / processors. By the time you swap out, you've eaten a large cost and blown a lot of memory traffic. I suspect the kernel's in a better position to know what's where with less cache overhead than user-space adaptive spinlocks, so this is sounding potentially great.

Adaptive spinning futexes

Posted May 17, 2010 17:06 UTC (Mon) by dvhart (guest, #19636) [Link]

The second step here is to see about exposing to userspace the information that, currently, only the kernel has. This provides userspace with the same advantage, and avoids the overhead of the syscalls. One tricky part is choosing where to put this data in memory. The other potential drawback here is that it may only work for process-private futexes, so only threads of a single process could use it, while the kernel implementation works with threads and processes.

Adaptive spinning futex implementation

Posted May 15, 2010 0:54 UTC (Sat) by vomlehn (subscriber, #45588) [Link]

I actually implemented conditional user-space spinning in 2.4 some years ago and it had really nice performance. It relied on having a "callboard", i.e. a piece of memory that indicated, for each thread in which you were interested, whether it was running or not. The memory is registered with the kernel, which is responsible for updating it when the process state changes.

So, the idea is that you store the thread ID (or an index for the thread) in the spinlock. When you want the lock, you do a conditional load and store with your tid/thread index. If you get back zero, nobody has the lock and you're done. Otherwise, you use the tid/thread index to check the callboard to see whether the thread holding the lock is running. If so, you loop again. If the thread holding the lock isn't running, you make a system call that sleeps until the spinlock value is zero (you pass the address to check in the system call).

The performance of this was good, but the nicest part was that there wasn't a significant performance drop-off as the number of contenders for the lock went up. I no longer recall whether I was using a 4- or 8-processor machine, but I *think* it was 8. From the caching standpoint, the callboard is read often/write rarely, always a good thing. If your conditional load and store only causes a cache conversion to modified if the write actually happens, you also have good cache behavior there. (When I was working on this, the processor I was using would actually record a cache modification even if the condition wasn't met. Ick.)

The work never got pushed upstream and nothing ever came of it that I know of, which was kinda sad. Oracle had requested that we do this.

Adaptive spinning futex implementation

Posted May 17, 2010 17:10 UTC (Mon) by dvhart (guest, #19636) [Link]

In your 2.4 implementation, you allocated the memory in userspace and then told the kernel where it was? Along the same lines as set_tid_address(2)? Did you also pin this memory?

Adaptive spinning futex implementation

Posted May 18, 2010 1:45 UTC (Tue) by vomlehn (subscriber, #45588) [Link]

Yes, you allocated the memory any way you wanted, but using shared memory reduced the amount of work the kernel had to do because it could update the processes' states in only one place. Plus you needed shared memory for the spinlock part anyway. You're right about pinning the memory, too. We updated the state in the scheduler, so we couldn't sleep waiting for the memory to be paged in.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds