Ticket spinlocks
On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by an integer value. A value of one indicates that the lock is available. The spin_lock() code works by decrementing the value (in a system-wide atomic manner), then looking to see whether the result is zero; if so, the lock has been successfully obtained. Should, instead, the result of the decrement operation be negative, the spin_lock() code knows that the lock is owned by somebody else. So it busy-waits ("spins") in a tight loop until the value of the lock becomes positive; then it goes back to the beginning and tries again.
Once the critical section has been executed, the owner of the lock releases it by setting it to 1.
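As a concrete illustration, here is a rough C sketch of that scheme, using GCC atomic builtins in place of the kernel's actual x86 inline assembly; the type and function names are invented for the example.

    /* Simplified sketch of the pre-ticket x86 spinlock described above;
     * not the kernel's real code. */
    typedef struct {
        volatile int slock;     /* 1 means the lock is available */
    } old_spinlock_t;

    static void old_spin_lock(old_spinlock_t *lock)
    {
        for (;;) {
            /* Atomic decrement; a result of zero means we now own the lock. */
            if (__atomic_sub_fetch(&lock->slock, 1, __ATOMIC_ACQUIRE) == 0)
                return;
            /* A negative result means somebody else holds it; spin until
             * the value goes positive again, then retry the decrement. */
            while (lock->slock <= 0)
                ;   /* busy-wait */
        }
    }

    static void old_spin_unlock(old_spinlock_t *lock)
    {
        /* Release by setting the lock back to 1 ("available"). */
        __atomic_store_n(&lock->slock, 1, __ATOMIC_RELEASE);
    }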
This implementation is very fast, especially in the uncontended case (which is how things should be most of the time). It also makes it easy to see how bad the contention for a lock is - the more negative the value of the lock gets, the more processors are trying to acquire it. But there is one shortcoming with this approach: it is unfair. Once the lock is released, the first processor which is able to decrement it will be the new owner. There is no way to ensure that the processor which has been waiting the longest gets the lock first; in fact, the processor which just released the lock may, by virtue of owning that cache line, have an advantage should it decide to reacquire the lock quickly.
One would hope that spinlock unfairness would not be a problem; usually, if there is serious contention for locks, that contention is a performance issue even before fairness is taken into account. Nick Piggin recently revisited this issue, though, after noticing just how visible the unfairness can be: on an eight-core, two-socket x86 system, a user-space test showed per-thread runtimes differing by as much as a factor of two, with some threads being starved of the lock vast numbers of times.
This sort of runtime difference is certainly undesirable. But lock unfairness can also create latency issues; it is hard to give latency guarantees when the wait time for a spinlock can be arbitrarily long.
Nick's response was a new spinlock implementation which he calls "ticket spinlocks"; the patch was merged for 2.6.25. Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes:
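In C terms, the layout can be pictured roughly as follows (illustrative only; the kernel packs both fields into a single 16-bit word rather than declaring a named structure):

    typedef struct {
        unsigned char owner;    /* ticket number currently holding the lock */
        unsigned char next;     /* next ticket number to be handed out */
    } ticket_spinlock_t;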
Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter.
So, in the new scheme, both fields of the lock are initialized to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner."
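In sketch form, again with GCC builtins standing in for the real x86 assembly and using the illustrative structure above:

    static void ticket_spin_lock(ticket_spinlock_t *lock)
    {
        /* Take a ticket: atomically fetch the old "next" value while
         * incrementing it. */
        unsigned char my_ticket =
            __atomic_fetch_add(&lock->next, 1, __ATOMIC_ACQUIRE);

        /* Spin until the "now serving" display reaches our number. */
        while (__atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE) != my_ticket)
            ;   /* busy-wait */
    }

    static void ticket_spin_unlock(ticket_spinlock_t *lock)
    {
        /* Call the next ticket holder to the counter. */
        __atomic_fetch_add(&lock->owner, 1, __ATOMIC_RELEASE);
    }

Note that this sketch reads "owner" separately from the increment of "next"; the actual patch combines the two into a single 16-bit operation on the whole lock word, a point Nick expands on in the comments below.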
The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated. But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that?
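Schematically, the "big ticket" variant simply widens the ticket fields when the configured CPU count requires it; something along these lines (illustrative, not the kernel's actual definitions):

    #if CONFIG_NR_CPUS > 256
    typedef unsigned short ticket_t;    /* 16-bit tickets: up to 65536 CPUs */
    #else
    typedef unsigned char  ticket_t;    /*  8-bit tickets: up to 256 CPUs */
    #endif

    typedef struct {
        ticket_t owner;
        ticket_t next;
    } big_ticket_spinlock_t;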
With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.
Posted Feb 7, 2008 1:57 UTC (Thu)
by Lennie (subscriber, #49641)
[Link]
> The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.

Not only do I not disagree, I'd love to run a kernel that has this cost. Well, actually, I want the advantages, not so much the disadvantages. It sounds like a really nice improvement.
Posted Feb 7, 2008 4:01 UTC (Thu)
by elanthis (guest, #6227)
[Link] (3 responses)
Posted Feb 7, 2008 5:06 UTC (Thu)
by walken (subscriber, #7089)
[Link]
Posted Feb 7, 2008 18:33 UTC (Thu)
by cpeterso (guest, #305)
[Link]
Posted Feb 9, 2008 7:40 UTC (Sat)
by Nick (guest, #15060)
[Link]
Posted Feb 7, 2008 6:08 UTC (Thu)
by eru (subscriber, #2753)
[Link] (1 responses)
Posted Feb 7, 2008 14:28 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Feb 7, 2008 9:25 UTC (Thu)
by and (guest, #2883)
[Link] (2 responses)
Posted Feb 7, 2008 13:13 UTC (Thu)
by jzbiciak (guest, #5246)
[Link]
Posted Feb 7, 2008 14:20 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Feb 7, 2008 21:10 UTC (Thu)
by jengelh (guest, #33263)
[Link] (5 responses)
Posted Feb 7, 2008 23:01 UTC (Thu)
by brianomahoney (guest, #6206)
[Link] (4 responses)
Posted Feb 8, 2008 6:51 UTC (Fri)
by zlynx (guest, #2285)
[Link]
Posted Feb 8, 2008 10:43 UTC (Fri)
by rvfh (guest, #31018)
[Link]
Posted Feb 8, 2008 19:04 UTC (Fri)
by jd (guest, #26381)
[Link] (1 responses)
There's also the problem of what to do if a maximum of N processes can control something. Do you allocate N spinlocks? That ruins any quality of service you're trying to attain. Or if a process has claimed one spinlock, do you bar it from reclaiming it (or claiming one of the other N) until some sort of fair service guarantee has been achieved?
There seem to be many semi-independent problems involved here, with spinlocks ending up used for 1:1, 1:N, M:1 and M:N situations. In practice, some of these may never really apply, but even if one of the other options applies once, you've a candidate for leading the Department of Headaches.
Do we need smarter spinlocks? Something on top of "dumb" spinlocks to add any extra capabilities? Something that divides the problem-space up such that smarter spinlocks aren't needed? Big spinlocks, such that they can stay as they are and survive in more generalized situations?
Posted Feb 9, 2008 14:09 UTC (Sat)
by willy (subscriber, #9762)
[Link]
Posted Feb 9, 2008 14:07 UTC (Sat)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (10 responses)
Posted Feb 11, 2008 9:00 UTC (Mon)
by Nick (guest, #15060)
[Link] (9 responses)
Posted Feb 11, 2008 15:27 UTC (Mon)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (5 responses)
Posted Feb 12, 2008 17:45 UTC (Tue)
by bgamsa (guest, #33057)
[Link] (3 responses)
Posted Feb 12, 2008 18:08 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
Posted Jul 28, 2008 22:35 UTC (Mon)
by okeydoke (guest, #46751)
[Link] (1 responses)
Posted Jul 28, 2008 22:37 UTC (Mon)
by okeydoke (guest, #46751)
[Link]
Posted Dec 12, 2009 9:44 UTC (Sat)
by bankkus (guest, #62476)
[Link]
Is that a correct statement? Not an x86 guru, but whatever instruction is used to atomically increment the 16-bit number: is that guaranteed to complete in hardware? In other words, consider a 16-way processor: if all threads incremented simultaneously, is there a guarantee that all 16 threads increment (in whatever order), or is it possible that on some of them the thread can be context-switched out before execution? If that can happen, then freedom from starvation is not guaranteed.
Posted Feb 12, 2008 19:51 UTC (Tue)
by eSk (guest, #45221)
[Link] (2 responses)
Posted Feb 14, 2008 7:53 UTC (Thu)
by goaty (guest, #17783)
[Link] (1 responses)
Posted Feb 14, 2008 16:38 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Feb 16, 2008 22:38 UTC (Sat)
by xorbe (guest, #3165)
[Link]
Posted Nov 14, 2008 0:17 UTC (Fri)
by orangetide (guest, #47730)
[Link]
Not sure what the advantages of the 8-bit or 16-bit version are over putting everything in a 32-bit value. Most of my systems like to talk to RAM in 32-bit chunks (or larger). And all of my linux systems have 32-bit registers (or larger).
Ticket spinlocks
Beautiful
Seeing algorithms like these come into being is nothing short of beautiful.
Is this an old trick just now brought to the Linux kernel, or something new Nick came up with?
Beautiful
I think the "lamport's bakery" algorithm is similar to this. There may be subtle differences
though, I'm not an expert in locking algorithms...
20 years too late?
I believe DEC implemented a similar lock queue algorithm in the Topaz operating system for
their Firefly multiprocessor system in the 1980s. I think their system avoided polling the
spinlock, thus eliminating memory bus contention. They had some sort of cascading queue of
locks.
If your spinlocks have this much contention, maybe they should sleep (in a wait queue) instead
of spinning?
Beautiful
Hi, this is Nick,
I must say first that Jon's writeup is very good, much better than my own attempts to explain
the problem/solution! :)
Secondly, I cannot take credit for the basic algorithm. I didn't actually look at any ticket
lock implementations before implementing this one, however the basic idea of implementing a
fifo using a continually incrementing head and tail index is not new at all.
If anything, I might take credit for distilling it down to somewhere close to the optimal x86
implementation. This is based just on the fact that I had a famous CPU designer ask if they
could use it as their company's reference ticket lock assembly sequence! (but I don't consider
myself an x86 guru, so I wouldn't be surprised if someone else coded this up well before me!)
The main trick I use, as Jon's article points to, is to increment the "next" part of the lock
and load the "owner" part in a single instruction. I do this by using 16 bit wide memory
operations on the 8+8 bit wide ticket lock.
The basic implementation of the algorithm would first increment the next part, and then load the owner part. Doing it like this is probably fine, but it will increase the load on the L1 cache; on some CPUs it might have to invoke the store-forwarding logic, or even stall until the store exits the store queue.
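To make that concrete, here is a rough C rendering of the single-operation fast path, with GCC builtins standing in for the xadd-based assembly; it assumes the little-endian x86 layout with "owner" in the low byte of the lock word and "next" in the high byte.

    typedef union {
        unsigned short slock;       /* the whole 16-bit lock word */
        struct {
            unsigned char owner;    /* low byte:  now serving */
            unsigned char next;     /* high byte: ticket dispenser */
        };
    } raw_ticket_lock_t;

    static void raw_ticket_spin_lock(raw_ticket_lock_t *lock)
    {
        /* A single atomic 16-bit add of 0x0100 increments "next" and
         * returns the old word, which already contains "owner", so the
         * uncontended fast path needs no separate load at all. */
        unsigned short old =
            __atomic_fetch_add(&lock->slock, 0x0100, __ATOMIC_ACQUIRE);
        unsigned char my_ticket = old >> 8;
        unsigned char owner     = old & 0xff;

        /* Contended case: re-read only the "owner" byte until our turn. */
        while (owner != my_ticket)
            owner = __atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE);
    }

(Unlock, not shown, just increments the "owner" byte.)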
To answer cpeterso's question, yes in most situations, if you have enough contention on your
lock that starvation is a problem then sure you have a larger problem with your algorithm.
However, what can happen is that you have a common-case-uncontended lock but 0.001% of your
users will actually generate a lot of contention on that lock. Even if the critical section
itself is very tiny, the cost of transferring the lock cacheline will end up being a limiting
factor as the frequency of acquisition and the core count increases. It is, IMO, better to
survive the corner cases gracefully than to squeeze another couple of cycles out of the
algorithm in the case that you expect. (Also don't forget that often sleeping locks are
implemented using a spinlock)
I could easily be mistaken, but I believe it will result in a better kernel. We'll see.
Isn't there a naming inconsistency in the algorithm description as it now stands (at 05:59:44 UTC)? The box diagram and the following paragraph refer to fields "next" and "owner", but the next paragraph after that talks about "next" and "current". I assume "owner" and "current" refer to the same thing?
Ticket spinlocks
They do refer to the same thing; I've fixed it. Sorry.
Ticket spinlocks
> That raises the maximum system size to 32768 processors.
hm, why is this a signed 16 bit number when the "small" version of the
patch uses unsigned 8 bit values? Unsigned 16 bit integers allow up to
65536 processors...
Ticket spinlocks
Are there other structures that limit Linux to 32768? Perhaps that's the context of the
comment. I looked at the patch and didn't see anything that really tied it to signedness,
since the comparisons are all equals compares.
That was a slip of the brain as I was running out of time to get the article done. Of course the limit is 65536; the article has been fixed. Sorry for the confusion.
Processor limit
Ticket spinlocks
>That raises the maximum system size to 65536 processors - who could ever want more than that?
SGI.
Ticket spinlocks
While I suspect it will be a little while before tightly clustered systems begin to feel this limitation, history is littered with _this_should_be_enough_ comments. However, the algorithm obviously extends to a u64 (next|current) with little extra cost, and that will surely do.
Ticket spinlocks
Right, no one could ever need more than 4G processor cores. :)
Ticket spinlocks
I think you missed the ironic reference to Bill Gates' alleged comment that "640K of memory should be enough for anybody." :)

Bill Gates claims this is an urban legend, though.
The more sophisticated the spinlock allocation mechanism, the greater the overheads. Simply having a wide enough table (and it can be as wide as the UPID mechanism) would be fast, but costly on memory. Spinlocks are also not going to be under equal demand. Perhaps spinlocks can be created with a given ticket width (to reflect expected demand) or have the ticket width grow once a high water mark is exceeded (to reflect actual demand).
Spinlocks
For N processes, you would use a semaphore initialised to N, not a spinlock. If you wanted N
cpus to be able to do something, then I guess we could introduce a counting spinlock, similar
to a semaphore. But we've not needed one yet, and I doubt we will.
We have many things implemented on top of spinlocks -- the BKL, semaphores, mutexes, rwlocks,
rwsems and the late and unlamented brwlock (obsoleted by RCU). Then there's specialised
beasts such as bitlocks and seqlocks.
Ticket spinlocks
I have to ask... What happens if a task is interrupted/NMIed/whatever-ed just as its turn
arrives? Especially given a long-running handler? Or is this scenario sufficiently rare so
as to have little or no performance impact?
Ticket spinlocks
That's a good point. And yes I guess if that happens, then that will cause all other waiters
to spin longer as well.
Although I'm not sure that it is a problem in practice. The theory is that most of the time we
have no contention, then FIFO locks are no worse than unfair locks. Actually we could have up
to 1 process spinning on the lock, and NMI/IRQ delay behaviour shouldn't be any different than
unfair locks. It is only when you start getting more contention that it could start to be a
problem...
But the heavily contended case often means performance has tanked anyway, and we are actually
willing to sacrifice some performance to ensure fairness and forward progress.
In short -- I'm hoping that the situation you ask about should not be a real problem.
Unfortunately, as with a lot of these tricky interactions and uncommon workloads, it may be a
release or two before we hear anything about it. But ticket locks are definitely something for
kernel devs to keep in mind if investigating some unusual performance regression in future ;)
Ticket spinlocks
Indeed, you do have to have two tasks spinning on the lock before the NMI/IRQ delay could
possibly cause a problem. So I also hope that this effect is not a problem in practice, but
as you say, testing and experience will be required.
And ticket locks do have nice real-time properties, as they get rid of the possibility of
starvation!
Has anyone looked at MCS locks recently in the context of Linux and modern hardware? I believe they have more requirements in terms of hardware primitives, but they do provide queueing and local spinning.
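For readers who have not met them, here is a minimal, textbook-style MCS lock sketch (generic C with GCC builtins, not code from K42 or the kernel) showing the queueing and local spinning in question: each CPU enqueues a node of its own and spins only on that node's flag.

    struct mcs_node {
        struct mcs_node *next;
        int locked;                 /* each waiter spins on its own flag */
    };

    static void mcs_lock(struct mcs_node **tail, struct mcs_node *me)
    {
        struct mcs_node *prev;

        me->next = NULL;
        me->locked = 1;
        /* Join the queue by swapping our node into the shared tail pointer. */
        prev = __atomic_exchange_n(tail, me, __ATOMIC_ACQ_REL);
        if (prev == NULL)
            return;                 /* queue was empty: lock acquired */
        __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
        /* Spin on our own node (a local cache line), not on a global word. */
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
            ;
    }

    static void mcs_unlock(struct mcs_node **tail, struct mcs_node *me)
    {
        struct mcs_node *next = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);

        if (next == NULL) {
            struct mcs_node *expected = me;
            /* No successor visible: try to mark the queue empty. */
            if (__atomic_compare_exchange_n(tail, &expected, NULL, 0,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                return;
            /* A new waiter is linking itself in; wait until it shows up. */
            while (!(next = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)))
                ;
        }
        /* Hand the lock directly to the next waiter in the queue. */
        __atomic_store_n(&next->locked, 0, __ATOMIC_RELEASE);
    }

The cost is that the caller must supply a queue node and the unlock path must be given the same node, which is why the API does not line up with spin_lock()/spin_unlock() directly, as the reply below notes.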
Ticket spinlocks
We did take a look a few years ago. Of course, pure MCS locks have an incompatible API, but
the K42 folks at IBM Research came up with a variant of MCS locks that matched the
Linux-kernel spin_lock() primitives. If I remember correctly, MCS locks outperformed the
various NUMA-aware locking primitives, but were outperformed by simple spinlocks.
I would expect that ticket locks would out-perform MCS locks at low contention levels (the
common case in modern Linux kernels). I am not sure how MCS locks and ticket locks would
compare at high levels of contention -- the ticket locks avoid the "thundering herd" problem,
so might do quite well.
Any benchmark results out there?
Ticket spinlocks
Ticket spinlocks perform terribly for uncontested and highly contested cases. Test cases on
LKML do not show any of this (a very trivial test case was done that does not show lock *and*
unlock cost).
Here are relevant numbers on 16-core (8x2) AMD64 machine:
Highly contested, MCS has 30724 a/s (average) and ticket locks have 6315 a/s (average). In the
uncontested case, MCS has 836744 a/s (average) and ticket locks have 688817 a/s (average).
Both implement FIFO fairness, anyways, so I do not expect relevance WRT "thundering herd"
problem.
MCS will perform better than the Anderson locks implemented under the name of "Fairlocks" and
"J-Locks" for Linux, also (I assume these are the NUMA-aware locking primitives you are
talking about?).
There are other implementations that perform better than both of these. Ticket lock with
proportional back-off helps alleviate performance issues in the highly contested case.
Ticket spinlocks
This is my MCS implementation and an optimized (for pipeline) Linux ticket lock
implementation. The results are not for the K42-variant MCS locks originally provided to
Linux.
Ticket spinlocks
>> the possibility of starvation!
Ticket spinlocks and MP-guest kernels
There is also the case for multi-vCPU guests running in a VM. For regular spinlocks you have
the problem that a guest vCPU may be preempted while holding a spinlock, thereby potentially
causing large delays for other vCPUs (i.e., think spinlock wait times in the order of hundreds
of milliseconds rather than nanoseconds). A guest vCPU0 is preempted while holding a
spinlock. Guest vCPU1 is scheduled and spends its whole timeslice waiting for the usually
shortly held spinlock to be released; not knowing that the holder of the spinlock is not
really running at the time.
One might argue that it's stupid to run an MP guest system on a single physical processor, but
the above scenario is also present for MP guests running on multiple physical processors. You
only get around the problem if you perform strict gang-scheduling of all the vCPUs. The other
option is to add heuristics to prevent preempting guest vCPUs holding spinlocks.
Using ticket spinlocks the problem with preempting lock-holders is increased. Not only do you
have the problem that preempting a lock-holder is bad for performance, but since all the
waiters must be granted the lock in FIFO order one had better make sure that the different
lock contenders are scheduled in the proper order. Failure to do so will massively increase
the lock waiting time. With regular spinlocks you have the chance that any one of the waiting
vCPUs can grab the lock and be done with it quickly. With ticket spinlocks you don't have this
option.
I would expect that anyone trying to run a multi-MP guest kernel will run into this
performance problem rather quickly (subject to number of vCPUs and kernel workload, of
course).
Ticket spinlocks and MP-guest kernels
You could get a lovely cascade going. VCPU0 grabs the spinlock, then is pre-empted. VCPU1
spends its whole time slice waiting for the spinlock. VCPU2 is scheduled, and starts queuing
for the spinlock. VCPU0 is scheduled, releases the spinlock, goes to sleep. The host scheduler
looks for another thread to schedule. VCPU1 and VCPU2 have just been busy-waiting so they get
penalised. VCPU3, VCPU4, VCPU5, etc., each get scheduled in turn, each run until they hit the
spinlock in question, and start busy waiting. The cascade can continue until we run out of
virtual CPUs. If the rate at which CPUs manage to acquire the lock is slower than the rate at
which CPUs attempt to acquire the lock, the cascade can continue forever!
Although on a general-purpose system the likelihood that all the CPUs would keep trying to
acquire the same spinlock is pretty small, virtual guests are often used to run specific
workloads. Since all the virtual processors are doing the same kind of work, it is quite
likely that they would all keep contending the same spinlock. Which is exactly the situation
that ticket spinlocks are supposed to help with!
Probably anyone who's running MP guest kernels has already got each virtual CPU locked to a
different real CPU, or is running with realtime scheduling, precisely to avoid this kind of
problem. If not, then ticket spinlocks should provide them with all the motivation they need!
:-)
Ticket spinlocks and MP-guest kernels
Gang scheduling within the hypervisor would be one way to avoid this issue, and without the
need to lock the VCPUs to real CPUs.
Ticket spinlocks
FINALLY! I have always wondered why an approach like this wasn't attempted earlier. This is
the good stuff, my friends.
Why not 1 byte tickets?