Weekly Edition Return to the Kernel page |
Ticket spinlocks
Spinlocks are the lowest-level mutual exclusion mechanism in the Linux
kernel. As such, they have a great deal of influence over the safety and
performance of the kernel, so it is not surprising that a great deal of
optimization effort has gone into the various (architecture-specific)
spinlock implementations. That does not mean that all of the work has been
done, though; a patch merged for 2.6.25 shows that there is always more
which can be done.
On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by an integer value. A value of one indicates that the lock is available. The spin_lock() code works by decrementing the value (in a system-wide atomic manner), then looking to see whether the result is zero; if so, the lock has been successfully obtained. Should, instead, the result of the decrement option be negative, the spin_lock() code knows that the lock is owned by somebody else. So it busy-waits ("spins") in a tight loop until the value of the lock becomes positive; then it goes back to the beginning and tries again. Once the critical section has been executed, the owner of the lock releases it by setting it to 1. This implementation is very fast, especially in the uncontended case (which is how things should be most of the time). It also makes it easy to see how bad the contention for a lock is - the more negative the value of the lock gets, the more processors are trying to acquire it. But there is one shortcoming with this approach: it is unfair. Once the lock is released, the first processor which is able to decrement it will be the new owner. There is no way to ensure that the processor which has been waiting the longest gets the lock first; in fact, the processor which just released the lock may, by virtue of owning that cache line, have an advantage should it decide to reacquire the lock quickly. One would hope that spinlock unfairness would not be a problem; usually, if there is serious contention for locks, that contention is a performance issue even before fairness is taken into account. Nick Piggin recently revisited this issue, though, after noticing:
On an 8 core (2 socket) Opteron, spinlock unfairness is extremely
noticable, with a userspace test having a difference of up to 2x
runtime per thread, and some threads are starved or "unfairly"
granted the lock up to 1 000 000 (!) times.
This sort of runtime difference is certainly undesirable. But lock unfairness can also create latency issues; it is hard to give latency guarantees when the wait time for a spinlock can be arbitrarily long. Nick's response was a new spinlock implementation which he calls "ticket spinlocks." Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes:
Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter. So, in the new scheme, the value of a lock is initialized (both fields) to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner." The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated. But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that? With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree. (Log in to post comments)
Ticket spinlocks Posted Feb 7, 2008 1:57 UTC (Thu) by Lennie (subscriber, #49641) [Link] The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree. Not only do I not disagree, I'd love to run a kernel that has this cost, well actually I want the advantages, not so much the disadvantages. It sounds like a really nice improvement.
Beautiful Posted Feb 7, 2008 4:01 UTC (Thu) by elanthis (subscriber, #6227) [Link] Seeing algorithms like these come into being is nothing short of beautiful. Is this an old trick just now brought to the Linux kernel, or something new Nick came up with?
Beautiful Posted Feb 7, 2008 5:06 UTC (Thu) by walken (subscriber, #7089) [Link] I think the "lamport's bakery" algorithm is similar to this. There may be subtle differences though, I'm not an expert in locking algorithms...
20 years too late? Posted Feb 7, 2008 18:33 UTC (Thu) by cpeterso (subscriber, #305) [Link] I believe DEC implemented a similar lock queue algorithm in the Topaz operating system for their Firefly multiprocessor system in the 1980s. I think their system avoided polling the spinlock, thus eliminating memory bus contention. They had some sort of cascading queue of locks. If your spinlocks have this much contention, maybe they should sleep (in a wait queue) instead of spinning?
Beautiful Posted Feb 9, 2008 7:40 UTC (Sat) by Nick (guest, #15060) [Link] Hi, this is Nick, I must say first that Jon's writeup is very good, much better than my own attempts to explain the problem/solution! :) Secondly, I cannot take credit for the basic algorithm. I didn't actually look at any ticket lock implementations before implementing this one, however the basic idea of implementing a fifo using a continually incrementing head and tail index is not new at all. If anything, I might take credit for distilling it down to somewhere close to the optimal x86 implementation. This is based just on the fact that I had a famous CPU designer ask if they could use it as their company's reference ticket lock assembly sequence! (but I don't consider myself an x86 guru, so I wouldn't be surprised if someone else coded this up well before me!) The main trick I use, as Jon's article points to, is to increment the "next" part of the lock and load the "owner" part in a single instruction. I do this by using 16 bit wide memory operations on the 8+8 bit wide ticket lock. The basic implementation of the algorithm would first increment the next part, and then load the owner part. Doing it like this is probably fine, but it will increase the load on the L1 cache or on some CPUs it might have to invoke the store forwarding logic or even have to stall until the store exists the store queue. To answer cpeterso's question, yes in most situations, if you have enough contention on your lock that starvation is a problem then sure you have a larger problem with your algorithm. However, what can happen is that you have a common-case-uncontended lock but 0.001% of your users will actually generate a lot of contention on that lock. Even if the critical section itself is very tiny, the cost of transferring the lock cacheline will end up being a limiting factor as the frequency of acquisition and the core count increases. It is, IMO, better to survive the corner cases gracefully than to squeeze another couple of cycles out of the algorithm in the case that you expect. (Also don't forget that often sleeping locks are implemented using a spinlock) I could easily be mistaken, but I believe it will result in a better kernel. We'll see.
Ticket spinlocks Posted Feb 7, 2008 6:08 UTC (Thu) by eru (subscriber, #2753) [Link] Isn't there a naming inconsistency in the algorithm description as it now stands (at 05:59:44 UTC)? The box diagram and the following paragraph refer to fields "next" and "owner", but the next paragraph after that talks about "next" and "current". I assume "owner" and "current" refer to the same thing?
Ticket spinlocks Posted Feb 7, 2008 14:28 UTC (Thu) by corbet (editor, #1) [Link] They do refer to the same thing; I've fixed it. Sorry.
Ticket spinlocks Posted Feb 7, 2008 9:25 UTC (Thu) by and (subscriber, #2883) [Link] > That raises the maximum system size to 32768 processors. hm, why is this a signed 16 bit number when the "small" version of the patch uses unsigned 8 bit values? Unsigned 16 bit integers allow up to 65536 processors...
Ticket spinlocks Posted Feb 7, 2008 13:13 UTC (Thu) by im14u2c (subscriber, #5246) [Link] Are there other structures that limit Linux to 32768? Perhaps that's the context of the comment. I looked at the patch and didn't see anything that really tied it to signedness, since the comparisons are all equals compares.
Processor limit Posted Feb 7, 2008 14:20 UTC (Thu) by corbet (editor, #1) [Link] That was a slip of the brain as I was running out of time to get the article done. Of course the limit is 65536; the article has been fixed. Sorry for the confusion.
Ticket spinlocks Posted Feb 7, 2008 21:10 UTC (Thu) by jengelh (subscriber, #33263) [Link] >That raises the maximum system size to 65536 processors - who could ever want more than that? SGI.
Ticket spinlocks Posted Feb 7, 2008 23:01 UTC (Thu) by brianomahoney (subscriber, #6206) [Link] While I suspect it will be a little while before tightly clustered systems begin to feel this limitation, history is littered with _this_should_be_enough_ comments however the algorithm obviously extends to u_64 (next|current) with little extra cost and that will surely do
Ticket spinlocks Posted Feb 8, 2008 6:51 UTC (Fri) by zlynx (subscriber, #2285) [Link] Right, no one could ever need more than 4G processor cores. :)
Ticket spinlocks Posted Feb 8, 2008 10:43 UTC (Fri) by rvfh (subscriber, #31018) [Link] I think you missed the ironical reference to Bill Gates' alleged comment "640K of memory should be enough for anybody." :) Bill Gates claims this is a urban legend though.
Spinlocks Posted Feb 8, 2008 19:04 UTC (Fri) by jd (subscriber, #26381) [Link] The more sophisticated the spinlock allocation mechanism, the greater the overheads. Simply having a wide enough table (and it can be as wide as the UPID mechanism) would be fast, but costly on memory. Spinlocks are also not going to be under equal demand. Perhaps spinlocks can be created with a given ticket width (to reflect expected demand) or have the ticket width grow once a high water mark is exceeded (to reflect actual demand).There's also the problem of what to do if a maximum of N processes can control something. Do you allocate N spinlocks? That ruins any quality of service you're trying to attain. Or if a process has claimed one spinlock, do you bar it from reclaiming it (or claiming one of the other N) until some sort of fair service guarantee has been achieved? There seem to be many semi-independent problems involved here, with spinlocks ending up used for 1:1, 1:N, M:1 and M:N situations. in practice, some of these may never really apply, but even if one of the other options applies once, you've a candidate for leading the Department of Headaches. Do we need smarter spinlocks? Something on top of "dumb" spinlocks to add any extra capabilities? Something that divides the problem-space up such that smarter spinlocks aren't needed? Big spinlocks, such that they can stay as they are and survive in more generalized situations?
Spinlocks Posted Feb 9, 2008 14:09 UTC (Sat) by willy (subscriber, #9762) [Link] For N processes, you would use a semaphore initialised to N, not a spinlock. If you wanted N cpus to be able to do something, then I guess we could introduce a counting spinlock, similar to a semaphore. But we've not needed one yet, and I doubt we will. We have many things implemented on top of spinlocks -- the BKL, semaphores, mutexes, rwlocks, rwsems and the late and unlamented brwlock (obsoleted by RCU). Then there's specialised beasts such as bitlocks and seqlocks.
Ticket spinlocks Posted Feb 9, 2008 14:07 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link] I have to ask... What happens if a task is interrupted/NMIed/whatever-ed just as its turn arrives? Especially given a long-running handler? Or is this scenario sufficiently rare so as to have little or no performance impact?
Ticket spinlocks Posted Feb 11, 2008 9:00 UTC (Mon) by Nick (guest, #15060) [Link] That's a good point. And yes I guess if that happens, then that will cause all other waiters to spin longer as well. Although I'm not sure that it is a problem in practice. The theory is that most of the time we have no contention, then FIFO locks are no worse than unfair locks. Actually we could have up to 1 process spinning on the lock, and NMI/IRQ delay behaviour shouldn't be any different than unfair locks. It is only when you start getting more contention that it could start to be a problem... But the heavily contended case often means performance has tanked anyway, and we are actually willing to sacrifice some performance to ensure fairness and forward progress. In short -- I'm hoping that the situation you ask about should not be a real problem. Unfortunately, as with a lot of these tricky interactions and uncommon workloads, it may be a release or two before we hear anything about it. But ticket locks are definitely something for kernel devs to keep in mind if investigating some unusual performance regression in future ;)
Ticket spinlocks Posted Feb 11, 2008 15:27 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link] Indeed, you do have to have two tasks spinning on the lock before the NMI/IRQ delay could possibly cause a problem. So I also hope that this effect is not a problem in practice, but as you say, testing and experience will be required. And ticket locks do have nice real-time properties, as they get rid of the possibility of starvation!
Ticket spinlocks Posted Feb 12, 2008 17:45 UTC (Tue) by bgamsa (subscriber, #33057) [Link] Has anyone looked at MCS locks recently in the context of Linux and modern hardware. I believe it has more requirements in terms of hardware primitives, but it does provide queueing and local spinning.
Ticket spinlocks Posted Feb 12, 2008 18:08 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link] We did take a look a few years ago. Of course, pure MCS locks have an incompatible API, but the K42 folks at IBM Research came up with a variant of MCS locks that matched the Linux-kernel spin_lock() primitives. If I remember correctly, MCS locks outperformed the various NUMA-aware locking primitives, but were outperformed by simple spinlocks. I would expect that ticket locks would out-perform MCS locks at low contention levels (the common case in modern Linux kernels). I am not sure how MCS locks and ticket locks would compare at high levels of contention -- the ticket locks avoid the "thundering herd" problem, so might do quite well. Any benchmark results out there?
Ticket spinlocks and MP-guest kernels Posted Feb 12, 2008 19:51 UTC (Tue) by eSk (subscriber, #45221) [Link] There is also the case for multi-vCPU guests running in a VM. For regular spinlocks you have the problem that a guest vCPU may be preempted while holding a spinlock, thereby potentially causing large delays for other vCPUs (i.e., think spinlock wait times in the order of hundreds of milliseconds rather than nanoseconds). A guest vCPU0 is preempted while holding a spinlock. Guest vCPU1 is scheduled and spends its whole timeslice wating for the usually shortly held spinlock to be released; not knowing that the holder of the spinlock is not really running at the time. One might argue that it's stupid to run an MP guest system on a single physical processor, but the above scenario is also present for MP guests running on multiple physical processors. You only get around the problem if you perform strict gang-scheduling of all the vCPUs. The other option is to add heuristics to prevent prempting guest vCPUs holding spinlocks. Using ticket spinlocks the problem with preempting lock-holders is increased. Not only do you have the problem that preempting a lock-holder is bad for performance, but since all the waiters must be granted the lock in FIFO order one had better make sure that the different lock contenders are scheduled in the proper order. Failure to do so will massively increase the lock waiting time. With regular spinlocks you have the chance that any one of the waiting vCPUs can grab the lock and be done with it quicky. With ticket spinlocks you don't have this option. I would expect that anyone trying to run a multi-MP guest kernel will run into this performance problem rather quickly (subject to number of vCPUs and kernel workload, of course).
Ticket spinlocks and MP-guest kernels Posted Feb 14, 2008 7:53 UTC (Thu) by goaty (guest, #17783) [Link] You could get a lovely cascade going. VCPU0 grabs the spinlock, then is pre-empted. VCPU1 spends its whole time slice waiting for the spinlock. VCPU2 is scheduled, and starts queuing for the spinlock. VCPU0 is scheduled, releases the spinlock, goes to sleep. The host scheduler looks for another thread to schedule. VCPU1 and VCPU2 have just been busy-waiting so they get penalised. VCPU3, VCPU4, VCPU5, etc., each get scheduled in turn, each run until they hit the spinlock in question, and start busy waiting. The cascade can continue until we run out of virtual CPUs. If the rate at which CPUs manage to acquire the lock is slower than the rate at which CPUs attempt to acquire the lock, the cascade can continue forever! Although on a general-purpose system the likelihood that all the CPUs would keep trying to acquire the same spinlock is pretty small, virtual guests are often used to run specific workloads. Since all the virtual processors are doing the same kind of work, it is quite likely that they would all keep contending the same spinlock. Which is exactly the situation that ticket spinlocks are supposed to help with! Probably anyone who's running MP guest kernels had already got each virtual CPU locked to a different real CPU, or is running with realtime scheduling, precisely to avoid this kind of problem. If not, then ticket spinlocks should provide them with all the motivation they need! :-)
Ticket spinlocks and MP-guest kernels Posted Feb 14, 2008 16:38 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link] Gang scheduling within the hypervisor would be one way to avoid this issue, and without the need to lock the VCPUs to real CPUs.
Ticket spinlocks Posted Feb 16, 2008 22:38 UTC (Sat) by xorbe (guest, #3165) [Link] FINALLY! I have always wondered why an approach like this wasn't attempted earlier. This is the good stuff, my friends.
|
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
Powered by Rackspace Managed Hosting.