LWN: Comments on "Ticket spinlocks" https://lwn.net/Articles/267968/ This is a special feed containing comments posted to the individual LWN article titled "Ticket spinlocks". en-us Fri, 19 Sep 2025 04:58:13 +0000 Fri, 19 Sep 2025 04:58:13 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Ticket spinlocks https://lwn.net/Articles/366326/ https://lwn.net/Articles/366326/ bankkus <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; And ticket locks do have nice real-time properties, as they get rid of </font><br> <font class="QuotedText">&gt;&gt; the possibility of starvation!</font><br> <p> Is that a correct statement? Not an x86 guru but whatever instruction is used to atomically increment the 16bit number is that guaranteed to complete in hw? In other words consider a 16 way processor if all threads incremented simulataneously is there a guarantee all 16 threads increment (in whatever order) or is it possible that on some of them the thread can context switch out before execution. If such is true then it does not guarantee no starvation.<br> </div> Sat, 12 Dec 2009 09:44:41 +0000 Why not 1 byte tickets? https://lwn.net/Articles/307080/ https://lwn.net/Articles/307080/ orangetide <div class="FormattedComment"> Limit yourself to 16 processors and cram the whole works into two nybbles. Sort of the opposite direction as the 65536 cpu version.<br> <p> Not sure what the advantages of the 8-bit or 16-bit version are over putting everything in a 32-bit value. Most of my systems like to talk to RAM in 32-bit chunks (or larger). And all of my linux systems have 32-bit registers (or larger).<br> </div> Fri, 14 Nov 2008 00:17:59 +0000 Ticket spinlocks https://lwn.net/Articles/291830/ https://lwn.net/Articles/291830/ okeydoke <div class="FormattedComment"><pre> This is my MCS implementation and an optimized (for pipeline) Linux ticket lock implementation. The results are not for the K42-variant MCS locks originally provided to Linux. </pre></div> Mon, 28 Jul 2008 22:37:02 +0000 Ticket spinlocks https://lwn.net/Articles/291829/ https://lwn.net/Articles/291829/ okeydoke <div class="FormattedComment"><pre> Ticket spinlocks perform terribly for uncontested and highly contested cases. Test cases on LKML do not show any of this (a very trivial test case was done that does not show lock *and* unlock cost). Here are relevant numbers on 16-core (8x2) AMD64 machine: Highly contested, MCS has 30724 a/s (average) and ticket locks have 6315 a/s (average). In the uncontested case, MCS has 836744 a/s (average) and ticket locks have 688817 a/s (average). Both implement FIFO fairness, anyways, so I do not expect relevance WRT "thundering herd" problem. MCS will perform better than the Anderson locks implemented under the name of "Fairlocks" and "J-Locks" for Linux, also (I assume these are the NUMA-aware locking primitives you are talking about?). There are other implementations that perform better than both of these. Ticket lock with proportional back-off helps alleviate performance issues in the highly contested case. </pre></div> Mon, 28 Jul 2008 22:35:20 +0000 Ticket spinlocks https://lwn.net/Articles/269696/ https://lwn.net/Articles/269696/ xorbe <div class="FormattedComment"><pre> FINALLY! I have always wondered why an approach like this wasn't attempted earlier. This is the good stuff, my friends. 
Ticket spinlocks and MP-guest kernels
https://lwn.net/Articles/269395/
PaulMcKenney:

Gang scheduling within the hypervisor would be one way to avoid this issue, and without the need to lock the VCPUs to real CPUs.

Thu, 14 Feb 2008 16:38:51 +0000

Ticket spinlocks and MP-guest kernels
https://lwn.net/Articles/269326/
goaty:

You could get a lovely cascade going. VCPU0 grabs the spinlock, then is pre-empted. VCPU1 spends its whole time slice waiting for the spinlock. VCPU2 is scheduled, and starts queuing for the spinlock. VCPU0 is scheduled, releases the spinlock, goes to sleep. The host scheduler looks for another thread to schedule. VCPU1 and VCPU2 have just been busy-waiting, so they get penalised. VCPU3, VCPU4, VCPU5, etc., each get scheduled in turn, each run until they hit the spinlock in question, and start busy-waiting. The cascade can continue until we run out of virtual CPUs. If the rate at which CPUs manage to acquire the lock is slower than the rate at which CPUs attempt to acquire the lock, the cascade can continue forever!

Although on a general-purpose system the likelihood that all the CPUs would keep trying to acquire the same spinlock is pretty small, virtual guests are often used to run specific workloads. Since all the virtual processors are doing the same kind of work, it is quite likely that they would all keep contending for the same spinlock. Which is exactly the situation that ticket spinlocks are supposed to help with!

Probably anyone who's running MP guest kernels has already got each virtual CPU locked to a different real CPU, or is running with realtime scheduling, precisely to avoid this kind of problem. If not, then ticket spinlocks should provide them with all the motivation they need! :-)

Thu, 14 Feb 2008 07:53:28 +0000

Ticket spinlocks and MP-guest kernels
https://lwn.net/Articles/268978/
eSk:

There is also the case of multi-vCPU guests running in a VM. For regular spinlocks you have the problem that a guest vCPU may be preempted while holding a spinlock, thereby potentially causing large delays for other vCPUs (i.e., think spinlock wait times on the order of hundreds of milliseconds rather than nanoseconds). A guest vCPU0 is preempted while holding a spinlock. Guest vCPU1 is scheduled and spends its whole timeslice waiting for the usually shortly-held spinlock to be released, not knowing that the holder of the spinlock is not really running at the time. One might argue that it's stupid to run an MP guest system on a single physical processor, but the above scenario is also present for MP guests running on multiple physical processors. You only get around the problem if you perform strict gang-scheduling of all the vCPUs. The other option is to add heuristics to prevent preempting guest vCPUs that hold spinlocks.

With ticket spinlocks, the problem of preempting lock holders gets worse. Not only do you have the problem that preempting a lock holder is bad for performance, but since all the waiters must be granted the lock in FIFO order, one had better make sure that the different lock contenders are scheduled in the proper order. Failure to do so will massively increase the lock waiting time. With regular spinlocks you have the chance that any one of the waiting vCPUs can grab the lock and be done with it quickly. With ticket spinlocks you don't have this option. I would expect that anyone trying to run an MP guest kernel will run into this performance problem rather quickly (subject to the number of vCPUs and the kernel workload, of course).

Tue, 12 Feb 2008 19:51:00 +0000
Ticket spinlocks
https://lwn.net/Articles/268942/
PaulMcKenney:

We did take a look a few years ago. Of course, pure MCS locks have an incompatible API, but the K42 folks at IBM Research came up with a variant of MCS locks that matched the Linux-kernel spin_lock() primitives. If I remember correctly, MCS locks outperformed the various NUMA-aware locking primitives, but were outperformed by simple spinlocks. I would expect that ticket locks would outperform MCS locks at low contention levels (the common case in modern Linux kernels). I am not sure how MCS locks and ticket locks would compare at high levels of contention -- the ticket locks avoid the "thundering herd" problem, so might do quite well. Any benchmark results out there?

Tue, 12 Feb 2008 18:08:42 +0000

Ticket spinlocks
https://lwn.net/Articles/268931/
bgamsa:

Has anyone looked at MCS locks (http://www.cs.rice.edu/~johnmc/papers/asplos91.pdf) recently in the context of Linux and modern hardware? I believe they have more requirements in terms of hardware primitives, but they do provide queueing and local spinning.

Tue, 12 Feb 2008 17:45:35 +0000

Ticket spinlocks
https://lwn.net/Articles/268669/
PaulMcKenney:

Indeed, you do have to have two tasks spinning on the lock before the NMI/IRQ delay could possibly cause a problem. So I also hope that this effect is not a problem in practice, but as you say, testing and experience will be required. And ticket locks do have nice real-time properties, as they get rid of the possibility of starvation!

Mon, 11 Feb 2008 15:27:23 +0000

Ticket spinlocks
https://lwn.net/Articles/268621/
Nick:

That's a good point. And yes, I guess if that happens, then it will cause all the other waiters to spin longer as well. Although I'm not sure that it is a problem in practice. The theory is that most of the time we have no contention, and then FIFO locks are no worse than unfair locks. Actually, we could have up to one process spinning on the lock, and the NMI/IRQ delay behaviour shouldn't be any different from unfair locks. It is only when you start getting more contention that it could start to be a problem... But the heavily contended case often means performance has tanked anyway, and we are actually willing to sacrifice some performance to ensure fairness and forward progress.

In short -- I'm hoping that the situation you ask about should not be a real problem. Unfortunately, as with a lot of these tricky interactions and uncommon workloads, it may be a release or two before we hear anything about it. But ticket locks are definitely something for kernel devs to keep in mind if investigating some unusual performance regression in future ;)

Mon, 11 Feb 2008 09:00:17 +0000
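[For context on the MCS locks that bgamsa and PaulMcKenney discuss above: an MCS lock queues waiters in a linked list so that each CPU spins only on a flag in its own node, rather than on the shared lock word. Below is a minimal, hedged C11 sketch of the classic algorithm -- the names mcs_lock/mcs_node and the memory orderings are choices made here for illustration; this is not the K42 variant or any actual kernel code. It also shows why the API is "incompatible" with spin_lock(): the caller has to supply a queue node.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each waiter supplies its own node and spins only on its own "locked"
 * flag -- the "local spinning" property mentioned above. */
struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;	/* NULL when the lock is free */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->locked, true, memory_order_relaxed);

	/* Atomically append ourselves to the queue; the old tail, if any,
	 * is our predecessor and will hand the lock directly to us. */
	struct mcs_node *prev = atomic_exchange_explicit(&lock->tail, me,
							 memory_order_acq_rel);
	if (prev == NULL)
		return;			/* queue was empty: we own the lock */

	atomic_store_explicit(&prev->next, me, memory_order_release);
	while (atomic_load_explicit(&me->locked, memory_order_acquire))
		;			/* spin on our own cache line only */
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
	struct mcs_node *next = atomic_load_explicit(&me->next, memory_order_acquire);

	if (next == NULL) {
		/* No visible successor: try to mark the lock free. */
		struct mcs_node *expected = me;
		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected,
				NULL, memory_order_acq_rel, memory_order_acquire))
			return;
		/* A new waiter has swapped itself into the tail but has not
		 * yet linked into our ->next; wait for that store to appear. */
		while ((next = atomic_load_explicit(&me->next,
						    memory_order_acquire)) == NULL)
			;
	}
	atomic_store_explicit(&next->locked, false, memory_order_release);
}
```

[The per-caller node is what forces the different calling convention, and handing the lock to an explicit successor is what avoids the thundering herd at high contention.]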
Spinlocks
https://lwn.net/Articles/268475/
willy:

For N processes, you would use a semaphore initialised to N, not a spinlock. If you wanted N CPUs to be able to do something, then I guess we could introduce a counting spinlock, similar to a semaphore. But we've not needed one yet, and I doubt we will.

We have many things implemented on top of spinlocks -- the BKL, semaphores, mutexes, rwlocks, rwsems and the late and unlamented brwlock (obsoleted by RCU). Then there are specialised beasts such as bitlocks and seqlocks.

Sat, 09 Feb 2008 14:09:11 +0000

Ticket spinlocks
https://lwn.net/Articles/268476/
PaulMcKenney:

I have to ask... What happens if a task is interrupted/NMIed/whatever-ed just as its turn arrives? Especially given a long-running handler? Or is this scenario sufficiently rare so as to have little or no performance impact?

Sat, 09 Feb 2008 14:07:53 +0000

Beautiful
https://lwn.net/Articles/268457/
Nick:

Hi, this is Nick. I must say first that Jon's writeup is very good, much better than my own attempts to explain the problem/solution! :)

Secondly, I cannot take credit for the basic algorithm. I didn't actually look at any ticket lock implementations before implementing this one; however, the basic idea of implementing a FIFO using a continually incrementing head and tail index is not new at all. If anything, I might take credit for distilling it down to somewhere close to the optimal x86 implementation. This is based just on the fact that I had a famous CPU designer ask if they could use it as their company's reference ticket lock assembly sequence! (But I don't consider myself an x86 guru, so I wouldn't be surprised if someone else coded this up well before me.)

The main trick I use, as Jon's article points out, is to increment the "next" part of the lock and load the "owner" part in a single instruction. I do this by using 16-bit wide memory operations on the 8+8-bit wide ticket lock. The basic implementation of the algorithm would first increment the next part, and then load the owner part. Doing it that way is probably fine, but it will increase the load on the L1 cache, or on some CPUs it might have to invoke the store-forwarding logic or even stall until the store exits the store queue.

To answer cpeterso's question: yes, in most situations, if you have enough contention on your lock that starvation is a problem, then sure, you have a larger problem with your algorithm. However, what can happen is that you have a common-case-uncontended lock, but 0.001% of your users will actually generate a lot of contention on that lock. Even if the critical section itself is very tiny, the cost of transferring the lock cacheline will end up being a limiting factor as the frequency of acquisition and the core count increase. It is, IMO, better to survive the corner cases gracefully than to squeeze another couple of cycles out of the algorithm in the case that you expect. (Also don't forget that sleeping locks are often implemented using a spinlock.)

I could easily be mistaken, but I believe it will result in a better kernel. We'll see.

Sat, 09 Feb 2008 07:40:16 +0000
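[To make the "single instruction" trick Nick describes above concrete, here is a hedged C sketch of an 8+8-bit ticket lock in the same spirit -- not the actual kernel code. The two bytes are overlaid on one 16-bit word so a single atomic fetch-and-add (a lock xadd on x86) both takes a ticket and returns the current owner. The union layout assumes a little-endian target, and the mixed-size atomic accesses are not blessed by strict C11; they stand in for what the x86 implementation does with inline assembly.]

```c
#include <stdatomic.h>
#include <stdint.h>

/* 8-bit "owner" and "next" fields overlaid on one 16-bit word.
 * On a little-endian x86-like target the owner byte is the low byte,
 * so adding 0x0100 to the word increments only the "next" byte. */
typedef union {
	_Atomic uint16_t head_tail;
	struct {
		_Atomic uint8_t owner;	/* ticket currently holding the lock */
		_Atomic uint8_t next;	/* next ticket to hand out */
	};
} ticket_lock_t;

static inline void ticket_lock(ticket_lock_t *lock)
{
	/* One atomic add takes our ticket and, because fetch-and-add
	 * returns the whole previous word, also tells us who owned the
	 * lock at the moment we joined the queue. */
	uint16_t old = atomic_fetch_add_explicit(&lock->head_tail, 1u << 8,
						 memory_order_acquire);
	uint8_t my_ticket = (uint8_t)(old >> 8);

	if ((uint8_t)old == my_ticket)
		return;			/* uncontended fast path */

	/* Otherwise spin until the owner byte reaches our ticket.
	 * The kernel would add a cpu_relax() ("rep; nop") in this loop. */
	while (atomic_load_explicit(&lock->owner, memory_order_acquire) != my_ticket)
		;
}

static inline void ticket_unlock(ticket_lock_t *lock)
{
	/* Only the lock holder writes the owner byte; bumping it passes
	 * the lock to the next ticket in strict FIFO order.  Wrap-around
	 * is harmless because tickets are compared modulo 256. */
	uint8_t owner = atomic_load_explicit(&lock->owner, memory_order_relaxed);
	atomic_store_explicit(&lock->owner, (uint8_t)(owner + 1),
			      memory_order_release);
}
```

[Because only the owner ever writes the owner byte, the unlock path needs no locked instruction at all, which is part of why the uncontended cost stays close to that of the old spinlocks.]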
Spinlocks
https://lwn.net/Articles/268391/
jd:

The more sophisticated the spinlock allocation mechanism, the greater the overheads. Simply having a wide enough table (and it can be as wide as the UPID mechanism) would be fast, but costly in memory. Spinlocks are also not going to be under equal demand. Perhaps spinlocks can be created with a given ticket width (to reflect expected demand), or have the ticket width grow once a high-water mark is exceeded (to reflect actual demand).

There's also the problem of what to do if a maximum of N processes can control something. Do you allocate N spinlocks? That ruins any quality of service you're trying to attain. Or, if a process has claimed one spinlock, do you bar it from reclaiming it (or claiming one of the other N) until some sort of fair-service guarantee has been achieved?

There seem to be many semi-independent problems involved here, with spinlocks ending up used for 1:1, 1:N, M:1 and M:N situations. In practice, some of these may never really apply, but even if one of the other options applies once, you've a candidate for leading the Department of Headaches.

Do we need smarter spinlocks? Something on top of "dumb" spinlocks to add any extra capabilities? Something that divides the problem space up such that smarter spinlocks aren't needed? Big spinlocks, such that they can stay as they are and survive in more generalized situations?

Fri, 08 Feb 2008 19:04:42 +0000

Ticket spinlocks
https://lwn.net/Articles/268335/
rvfh:

I think you missed the ironic reference to Bill Gates' alleged comment "640K of memory should be enough for anybody." :) Bill Gates claims this is an urban legend, though.

Fri, 08 Feb 2008 10:43:46 +0000

Ticket spinlocks
https://lwn.net/Articles/268325/
zlynx:

Right, no one could ever need more than 4G processor cores. :)

Fri, 08 Feb 2008 06:51:25 +0000

Ticket spinlocks
https://lwn.net/Articles/268276/
brianomahoney:

While I suspect it will be a little while before tightly clustered systems begin to feel this limitation, history is littered with _this_should_be_enough_ comments. However, the algorithm obviously extends to a u64 (next|current) with little extra cost, and that will surely do.

Thu, 07 Feb 2008 23:01:14 +0000

Ticket spinlocks
https://lwn.net/Articles/268241/
jengelh:

> That raises the maximum system size to 65536 processors - who could ever want more than that?

SGI.

Thu, 07 Feb 2008 21:10:05 +0000

20 years too late?
https://lwn.net/Articles/268211/
cpeterso:

I believe DEC implemented a similar lock queue algorithm in the Topaz operating system for their Firefly multiprocessor system in the 1980s. I think their system avoided polling the spinlock, thus eliminating memory bus contention. They had some sort of cascading queue of locks.

If your spinlocks have this much contention, maybe they should sleep (in a wait queue) instead of spinning?

Thu, 07 Feb 2008 18:33:30 +0000
Ticket spinlocks
https://lwn.net/Articles/268157/
corbet:

They do refer to the same thing; I've fixed it. Sorry.

Thu, 07 Feb 2008 14:28:52 +0000

Processor limit
https://lwn.net/Articles/268155/
corbet:

That was a slip of the brain as I was running out of time to get the article done. Of course the limit is 65536; the article has been fixed. Sorry for the confusion.

Thu, 07 Feb 2008 14:20:53 +0000

Ticket spinlocks
https://lwn.net/Articles/268153/
jzbiciak:

Are there other structures that limit Linux to 32768? Perhaps that's the context of the comment. I looked at the patch and didn't see anything that really tied it to signedness, since the comparisons are all equality compares.

Thu, 07 Feb 2008 13:13:05 +0000

Ticket spinlocks
https://lwn.net/Articles/268117/
and:

> That raises the maximum system size to 32768 processors.

Hm, why is this a signed 16-bit number when the "small" version of the patch uses unsigned 8-bit values? Unsigned 16-bit integers allow up to 65536 processors...

Thu, 07 Feb 2008 09:25:44 +0000

Ticket spinlocks
https://lwn.net/Articles/268092/
eru:

Isn't there a naming inconsistency in the algorithm description as it now stands (at 05:59:44 UTC)? The box diagram and the following paragraph refer to fields "next" and "owner", but the next paragraph after that talks about "next" and "current". I assume "owner" and "current" refer to the same thing?

Thu, 07 Feb 2008 06:08:53 +0000

Beautiful
https://lwn.net/Articles/268083/
walken:

I think Lamport's bakery algorithm is similar to this. There may be subtle differences though; I'm not an expert in locking algorithms...

Thu, 07 Feb 2008 05:06:13 +0000

Beautiful
https://lwn.net/Articles/268072/
elanthis:

Seeing algorithms like these come into being is nothing short of beautiful. Is this an old trick just now brought to the Linux kernel, or something new Nick came up with?

Thu, 07 Feb 2008 04:01:12 +0000

Ticket spinlocks
https://lwn.net/Articles/268060/
Lennie:

> The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.

Not only do I not disagree, I'd love to run a kernel that has this cost; well, actually, I want the advantages, not so much the disadvantages. It sounds like a really nice improvement.

Thu, 07 Feb 2008 01:57:16 +0000
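[On walken's mention of Lamport's bakery algorithm a few comments up: it is indeed FIFO-by-ticket like the ticket spinlock, but it needs no atomic read-modify-write at all; each thread computes max+1 over the visible tickets and ties are broken by thread id. A minimal C11 sketch for comparison follows -- the thread count, names, and the unbounded (non-wrapping) tickets are assumptions made here for illustration, not anything from the kernel patch.]

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 16

/* Shared state: one "choosing" flag and one ticket number per thread.
 * Default (seq_cst) atomics keep the sketch simple. */
static atomic_bool choosing[NTHREADS];
static atomic_int  number[NTHREADS];

static void bakery_lock(int i)
{
	/* Take a ticket one larger than any currently visible ticket.
	 * Two threads may pick the same number (no atomic fetch-and-add
	 * here), so ties are broken by thread id below. */
	atomic_store(&choosing[i], true);
	int max = 0;
	for (int j = 0; j < NTHREADS; j++) {
		int n = atomic_load(&number[j]);
		if (n > max)
			max = n;
	}
	atomic_store(&number[i], max + 1);
	atomic_store(&choosing[i], false);

	for (int j = 0; j < NTHREADS; j++) {
		/* Wait for thread j to finish picking its number... */
		while (atomic_load(&choosing[j]))
			;
		/* ...then wait until our (number, id) pair is the smallest. */
		while (atomic_load(&number[j]) != 0 &&
		       (atomic_load(&number[j]) < atomic_load(&number[i]) ||
			(atomic_load(&number[j]) == atomic_load(&number[i]) && j < i)))
			;
	}
}

static void bakery_unlock(int i)
{
	/* Dropping the ticket to zero lets later (larger) tickets proceed. */
	atomic_store(&number[i], 0);
}
```

[The subtle differences walken alludes to are real: the bakery scans every participant and its tickets grow without bound, whereas the x86 ticket lock relies on an atomic fetch-and-add and a pair of wrapping byte counters.]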