Improving ticket spinlocks

By Jonathan Corbet
January 3, 2013

Spinlocks, being the lowest-level synchronization mechanism in the kernel, are the target of seemingly endless attempts at performance enhancement. The ticket spinlock mechanism used in the mainline has resisted such attempts for a few years. Now, though, some developers have identified a performance bottleneck associated with these locks and are busily trying to come up with an improved version.

A spinlock is so-named because a CPU waiting for a contended lock will "spin" in a tight loop, repeatedly querying the lock until it becomes available. Ticket spinlocks adjust this algorithm by having each waiting CPU take a "ticket" so that each CPU obtains the lock in the order in which it arrived. These locks thus resemble the "take a number" mechanisms found at deli counters or motor vehicle division offices worldwide — though, with luck, the wait is rather shorter than is required to renew a driver's license in your editor's part of the world. Without the ticket mechanism, which was added for the 2.6.25 release, the kernel's spinlocks were unfair; in some situations, some waiters could be starved for an extended period of time.

It has long been understood that lock contention reduces system performance considerably. The simple act of spinning for a lock clearly is not going to be good for performance, but there are also caching issues to take into account. If two CPUs are repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably. For that reason, interest in lockless algorithms has been growing for many years.

In the case of a contended lock, though, cache contention would appear to be less of an issue. A CPU spinning on a lock will cache its contents in a shared mode; no cache bouncing should occur until the CPU owning the lock releases it. Releasing the lock (and its acquisition by another CPU) requires writing to the lock, and that requires exclusive cache access. The cache line movement at that time hurts, but probably not as much as waiting for the lock in the first place. So it would seem that trying to optimize cache behavior in the contended case is not likely to produce much in the way of useful results.

That picture is not complete, though; one must take a couple of other facts into account. Processors do not cache a single value; they cache a "line" of (typically) 128 consecutive bytes as a single unit. In other words, the cache lines in any contemporary processor are almost certainly significantly larger than what is required to hold a spinlock. So when a CPU needs exclusive access to a spinlock's cache line, it also gains exclusive access to a significant chunk of surrounding data. And that is where the other important detail comes into play: spinlocks tend to be embedded within the data structures that they protect, so that surrounding data is typically data of immediate interest to the CPU holding the lock.

Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket spinlocks, Rik van Riel describes a microbenchmark that is slowed by a factor of two when there is a single contending CPU, and by as much as a factor of ten with many CPUs in the mix. That is not just a slowdown; that is a catastrophic loss of performance. Needless to say, that is not the sort of behavior that kernel developers like to see.

Rik's solution is simple enough. Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally. So his patch causes a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again. It goes without saying that picking that "number of times" correctly is the key to good performance with this algorithm. While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress. But too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance.

The first step in Rik's patch series calculates how many CPUs must release the lock before the current CPU can claim it (by subtracting the current CPU's ticket number from the number currently being served) and loops 50 times for every CPU that is ahead in the queue. That is where the "proportional backoff" comes in; the further back in line the CPU is, the longer it will wait between queries of the lock. The result should be a minimizing of idle looping while also minimizing cache traffic.

The number 50 was determined empirically, but it seems unlikely that it will be optimal for all situations. So the final part of Rik's patch set attempts to tune that number dynamically. The dynamic delay factor is increased when the lock is found to be unavailable and decreased when the lock is obtained. The goal is to have a CPU query the lock an average of 2.7 times before obtaining it. The number 2.7, once again, was obtained by running lots of tests and seeing what worked best; subsequent versions of the patch have tweaked this heuristic somewhat. Details aside, the core idea is that the delay factor (a per-CPU value that applies to all contended locks equally) will increase for workloads experiencing more contention, tuning the system appropriately.

That said, the notion of a single delay for all locks is likely to be causing a severe case of raised eyebrows for some readers, and, indeed, it turned out to be inadequate; some locks are rather more contended than others, after all. So the January 3 version of Rik's patch keeps a hashed list (based on the spinlock address) of delay values instead.

Michel Lespinasse ran some experiments of his own to see how well the proportional backoff algorithm worked. In particular, he wanted to figure out whether it was truly necessary to calculate a dynamic delay factor, or whether an optimal static value could be found. His conclusion was that, in fact, a static value is good enough; it might be possible to do a little better with a dynamic value, he said, but the improvement is not enough to justify the added complexity of the tuning mechanism. There is just one little difficulty:

Of course, one major downside in my proposal is that I haven't figured out an automatic way to find the most appropriate spinlock_delay system tunable. So there is clearly some more experimentation needed there. However, IMO the important result here is that our goal of avoiding performance cliffs seems to be reachable without the complexity (and IMO, risk) of per-spinlock tuning values.

If these results stand, and an appropriate way of picking the static value can be found, then there is probably not a case for adding dynamic backoff to the kernel's spinlock implementation. But the backoff idea in general would appear to be a significant improvement for some workloads. So the chances are good that we will see it added in some form in an upcoming development cycle.

Index entries for this article
Kernel	Spinlocks

Improving ticket spinlocks

Posted Jan 4, 2013 4:08 UTC (Fri) by jdike (subscriber, #4055) [Link] (1 responses)

> The goal is to have a CPU query the lock an average of 2.7 times before
> obtaining it. The number 2.7, once again, was obtained by running lots of
> tests and seeing what worked best

This is e, which shows up in similar contexts. I have a fuzzy recollection that in a situation where you want to buy something, i.e. a house or car, you know nothing about the market going in, and you need to decide to buy or not on the spot (i.e. the item disappears from the market after you've seen it), you need to sample the market (i.e. look without buying) before thinking about buying. The optimal number of items to sample is e, which translates to 3 for discrete things like houses and cars.

Improving ticket spinlocks

Posted Jan 5, 2013 3:27 UTC (Sat) by spigot (subscriber, #50709) [Link]

It sounds like you're describing the secretary problem. The idea is that you have n applicants for a job, each of whom has a rank from 1 (most qualified) to n (least qualified). You get to interview them in a random order, and after each interview you must decide, on the spot, whether to hire them or reject them. You would like a procedure to maximize your chances of hiring the most qualified candidate.

It turns out that you should reject the first n/e candidates, then hire the first of the remainder who is more qualified than any of the rejects (or hiring the last, if none are more qualified).

However, it's not obvious to me that this is applicable to the problem at hand.

What about separating locks and data?

Posted Jan 4, 2013 16:54 UTC (Fri) by renox (guest, #23785) [Link] (9 responses)

If the cause of the issue is that the lock is in the same cache line as the data, I wonder if it would be possible to ensure that the lock would be in a different place(different cache line)?

Either the lock would be handled separately or the data would have a pointer to the "real" lock.
It strikes me as quite funny that adding a level of indirection could help performance in this case.

What about separating locks and data?

Posted Jan 4, 2013 17:04 UTC (Fri) by corbet (editor, #1) [Link] (4 responses)

I'd pondered that a bit as I was writing the article. It would be painful.

If you move the lock out of the structure it is protecting, you have to allocate it in a separate step. The allocation could maybe be hidden in spin_lock_init(), but the freeing would require explicit action (in vast numbers of call sites) or some severe magic. Probably don't want to do that.

Regardless of whether the lock is relocated or not, the only way to avoid cache problems with surrounding data is to expand the lock to fill an entire cache line. That would bloat spinlocks considerably, and there are a lot of spinlocks in a running kernel. That overhead would hurt; among other things, the bloat, alone, would slow things by causing additional cache utilization.

Maybe I'm missing something (I often do), but, from what I can tell, trying to isolate the locks isn't really a viable solution.

What about separating locks and data?

Posted Jan 4, 2013 17:34 UTC (Fri) by renox (guest, #23785) [Link] (2 responses)

I agree with you that the management of deallocation is a significant issue/change.

As for the memory bloat, I think that these "separated spinlock" would be useful for only the most contended spinlocks which would somewhat mitigate the issue.
Of course contention isn't the same on all the systems and developers working on embedded or supercomputers would probably disagree on which spinlocks has to be "separated"..

What about separating locks and data?

Posted Jan 5, 2013 22:53 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

There will then be the problem of exactly how much space do you allocate for the spinlock

how do you know what the best cache size is to use on the processor that someone will buy next year and run the binary you compiled today on.

What about separating locks and data?

Posted Jan 5, 2013 23:50 UTC (Sat) by giraffedata (guest, #1954) [Link]

Aren't there runtime facilities in the kernel for tailoring behavior to the actual cache line size? Something like a cache line allocator?

That would make this kind of memory layout feasible until they invent a more complex form of caching.

What about separating locks and data?

Posted Jan 5, 2013 18:07 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Putting the lock and the data together increases performance at low levels of lock contention. After all, if the lock and data are in separate cache lines, you will take two cache misses, whereas if the lock and data are in the same cache line, you will only take one cache miss.

Of course, and noted earlier in this thread, the opposite holds true at high levels of contention. Life is like that sometimes! ;-)

What about separating locks and data?

Posted Jan 4, 2013 18:58 UTC (Fri) by daney (guest, #24551) [Link] (2 responses)

There are really two issues:

1) Each load from the Now-Serving counter memory location creates traffic on the internal CPU buses. This traffic decreases the amount of useful work that can be done. Since the threads blocked waiting for their Now-Serving number to come up are not doing anything useful, decreasing the amount of bus traffic they are generating makes everything else go faster.

2) Cache line contention/bouncing caused by Ticket-Counter modifications and modifications to the data protected by the lock.

The first issue is (mostly) the one addressed by the patch in question. Increasing the size/alignment of arch_spinlock_t to occupy an entire cache line might be beneficial for some use cases, but it would increase the size of many structures, thus causing increased cache pressure.

What about separating locks and data?

Posted Jan 5, 2013 10:38 UTC (Sat) by renox (guest, #23785) [Link]

Thanks for this informative reply.

For the second issue, what you're describing (having the spinlock occupying an entire cache line) isn't always necessary: in some cases you could put 'cold' data in the same cache line as the lock to get the best performance without using too much memory.

What about separating locks and data?

Posted Jan 10, 2013 18:30 UTC (Thu) by ksandstr (guest, #60862) [Link]

Having considered this topic for a bit, I wonder: given that there are two uses of CAS involved in ticket spinlocks, i.e. one on next-ticket (lock immediately or wait), and one on now-serving (unlock), which is as many as with regular locked/not spinlocks, the issue is clearly the increased _non-local_ write traffic on the data structure under contention. This seems to suggest a solution where spinlocks associated with objects smaller than a cacheline are moved out of the data and into, say, a hashed group keyed like the objects' parent data structure, trading some false contention for space.

That'd protect the significant cachelines from not only write-bouncing from ticket-acquisition, but from any spinlock-related "oops, had to flush this exclusive cache line to RAM in the meantime" cases due to read traffic also. I'm guessing that an operation to acquire locks on N objects without ordering foulups could fit on top of that as well.

What about separating locks and data?

Posted Aug 19, 2024 14:14 UTC (Mon) by 301043030 (guest, #172920) [Link]

Maybe you and your responders even dind't understand what the author says and what the paper writes about the lock, so you would never know the data of cache line means the lock, not some fileds of the data structure, your title is rather ridiculous and misleading.

Improving ticket spinlocks

Posted Jan 4, 2013 22:06 UTC (Fri) by zlynx (guest, #2285) [Link]

I wonder if you could use a profiling step to determine the ideal number of cycles to wait for each lock?

For example, record the cycle count when getting the lock and when releasing the lock, record the difference as N. The next lock spinner in line should ideally wait N+1 cycles before checking the lock.

Then on the next kernel build use those profile numbers and record them in each spinlock structure.

This would work perfectly if it weren't for the pesky problems of unexpected cache line misses, CPU interrupts and other interference.

Improving ticket spinlocks

Posted Jan 4, 2013 22:12 UTC (Fri) by zlynx (guest, #2285) [Link] (1 responses)

How expensive are inter-processor interrupts? If they're cheaper than 50 cycles and a cache-line bounce then the thing to do would be to have the lock owner CPU look at the next ticket in line and send it an interrupt while the waiting CPU just sits in an idle. From what I could pick up in a fast Google an IPI is about the same as a cache line bounce, it is the register saves and all that make it more expensive. But since the waiting CPU is in a known state the code could probably skip all that and just resume with the lock acquire.

Improving ticket spinlocks

Posted Jan 10, 2013 15:18 UTC (Thu) by Otus (subscriber, #67685) [Link]

That would shift the cost to the uncontended case, since it would have to check whether someone is waiting.

Improving ticket spinlocks

Posted Jan 10, 2013 9:44 UTC (Thu) by heijo (guest, #88363) [Link]

What about replacing ticket locks with a queue-based lock mechanism with local spinning?

Like an MCS lock with the internal state stored in a pseudo-stack of per-CPU variables searched for the lock address, or maybe something stateless if possible.

Improving ticket spinlocks

Posted Jan 10, 2013 19:54 UTC (Thu) by quanstro (guest, #77996) [Link] (1 responses)

i believe the common cache line sizes are 64-bytes for most x86
machines, and 32-bytes for arms.

Improving ticket spinlocks

Posted Jan 21, 2013 21:00 UTC (Mon) by vcunat (guest, #88938) [Link]

Yes, exactly my point. I read on Wiki that P4 had L2 with 128-byte lines, but I thought 64B is the most common. One can check individually around /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size