Improving ticket spinlocks
A spinlock is so-named because a CPU waiting for a contended lock will "spin" in a tight loop, repeatedly querying the lock until it becomes available. Ticket spinlocks adjust this algorithm by having each waiting CPU take a "ticket" so that each CPU obtains the lock in the order in which it arrived. These locks thus resemble the "take a number" mechanisms found at deli counters or motor vehicle division offices worldwide — though, with luck, the wait is rather shorter than is required to renew a driver's license in your editor's part of the world. Without the ticket mechanism, which was added for the 2.6.25 release, the kernel's spinlocks were unfair; in some situations, some waiters could be starved for an extended period of time.
It has long been understood that lock contention reduces system performance considerably. The simple act of spinning for a lock clearly is not going to be good for performance, but there are also caching issues to take into account. If two CPUs are repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably. For that reason, interest in lockless algorithms has been growing for many years.
In the case of a contended lock, though, cache contention would appear to be less of an issue. A CPU spinning on a lock will cache its contents in a shared mode; no cache bouncing should occur until the CPU owning the lock releases it. Releasing the lock (and its acquisition by another CPU) requires writing to the lock, and that requires exclusive cache access. The cache line movement at that time hurts, but probably not as much as waiting for the lock in the first place. So it would seem that trying to optimize cache behavior in the contended case is not likely to produce much in the way of useful results.
That picture is not complete, though; one must take a couple of other facts into account. Processors do not cache a single value; they cache a "line" of (typically) 128 consecutive bytes as a single unit. In other words, the cache lines in any contemporary processor are almost certainly significantly larger than what is required to hold a spinlock. So when a CPU needs exclusive access to a spinlock's cache line, it also gains exclusive access to a significant chunk of surrounding data. And that is where the other important detail comes into play: spinlocks tend to be embedded within the data structures that they protect, so that surrounding data is typically data of immediate interest to the CPU holding the lock.
Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.
How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket spinlocks, Rik van Riel describes a microbenchmark that is slowed by a factor of two when there is a single contending CPU, and by as much as a factor of ten with many CPUs in the mix. That is not just a slowdown; that is a catastrophic loss of performance. Needless to say, that is not the sort of behavior that kernel developers like to see.
Rik's solution is simple enough. Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally. So his patch causes a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again. It goes without saying that picking that "number of times" correctly is the key to good performance with this algorithm. While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress. But too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance.
The first step in Rik's patch series calculates how many CPUs must release the lock before the current CPU can claim it (by subtracting the current CPU's ticket number from the number currently being served) and loops 50 times for every CPU that is ahead in the queue. That is where the "proportional backoff" comes in; the further back in line the CPU is, the longer it will wait between queries of the lock. The result should be a minimizing of idle looping while also minimizing cache traffic.
The number 50 was determined empirically, but it seems unlikely that it will be optimal for all situations. So the final part of Rik's patch set attempts to tune that number dynamically. The dynamic delay factor is increased when the lock is found to be unavailable and decreased when the lock is obtained. The goal is to have a CPU query the lock an average of 2.7 times before obtaining it. The number 2.7, once again, was obtained by running lots of tests and seeing what worked best; subsequent versions of the patch have tweaked this heuristic somewhat. Details aside, the core idea is that the delay factor (a per-CPU value that applies to all contended locks equally) will increase for workloads experiencing more contention, tuning the system appropriately.
That said, the notion of a single delay for all locks is likely to be causing a severe case of raised eyebrows for some readers, and, indeed, it turned out to be inadequate; some locks are rather more contended than others, after all. So the January 3 version of Rik's patch keeps a hashed list (based on the spinlock address) of delay values instead.
Michel Lespinasse ran some experiments of his own to see how well the proportional backoff algorithm worked. In particular, he wanted to figure out whether it was truly necessary to calculate a dynamic delay factor, or whether an optimal static value could be found. His conclusion was that, in fact, a static value is good enough; it might be possible to do a little better with a dynamic value, he said, but the improvement is not enough to justify the added complexity of the tuning mechanism. There is just one little difficulty:
If these results stand, and an appropriate way of picking the static value
can be found, then there is probably not a case for adding dynamic backoff
to the kernel's spinlock implementation. But the backoff idea in general
would appear to be a significant improvement for some workloads. So the
chances are good that we will see it added in some form in an upcoming
development cycle.
Index entries for this article | |
---|---|
Kernel | Spinlocks |
Posted Jan 4, 2013 4:08 UTC (Fri)
by jdike (subscriber, #4055)
[Link] (1 responses)
This is e, which shows up in similar contexts. I have a fuzzy recollection that in a situation where you want to buy something, i.e. a house or car, you know nothing about the market going in, and you need to decide to buy or not on the spot (i.e. the item disappears from the market after you've seen it), you need to sample the market (i.e. look without buying) before thinking about buying. The optimal number of items to sample is e, which translates to 3 for discrete things like houses and cars.
Posted Jan 5, 2013 3:27 UTC (Sat)
by spigot (subscriber, #50709)
[Link]
Improving ticket spinlocks
> obtaining it. The number 2.7, once again, was obtained by running lots of
> tests and seeing what worked best
It sounds like you're describing the secretary problem. The idea is that you have n applicants for a job, each of whom has a rank from 1 (most qualified) to n (least qualified). You get to interview them in a random order, and after each interview you must decide, on the spot, whether to hire them or reject them. You would like a procedure to maximize your chances of hiring the most qualified candidate.
Improving ticket spinlocks
Posted Jan 4, 2013 16:54 UTC (Fri)
by renox (guest, #23785)
[Link] (9 responses)
Either the lock would be handled separately or the data would have a pointer to the "real" lock.
Posted Jan 4, 2013 17:04 UTC (Fri)
by corbet (editor, #1)
[Link] (4 responses)
If you move the lock out of the structure it is protecting, you have to allocate it in a separate step. The allocation could maybe be hidden in spin_lock_init(), but the freeing would require explicit action (in vast numbers of call sites) or some severe magic. Probably don't want to do that.
Regardless of whether the lock is relocated or not, the only way to avoid cache problems with surrounding data is to expand the lock to fill an entire cache line. That would bloat spinlocks considerably, and there are a lot of spinlocks in a running kernel. That overhead would hurt; among other things, the bloat, alone, would slow things by causing additional cache utilization.
Maybe I'm missing something (I often do), but, from what I can tell, trying to isolate the locks isn't really a viable solution.
Posted Jan 4, 2013 17:34 UTC (Fri)
by renox (guest, #23785)
[Link] (2 responses)
As for the memory bloat, I think that these "separated spinlock" would be useful for only the most contended spinlocks which would somewhat mitigate the issue.
Posted Jan 5, 2013 22:53 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
how do you know what the best cache size is to use on the processor that someone will buy next year and run the binary you compiled today on.
Posted Jan 5, 2013 23:50 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
That would make this kind of memory layout feasible until they invent a more complex form of caching.
Posted Jan 5, 2013 18:07 UTC (Sat)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Of course, and noted earlier in this thread, the opposite holds true at high levels of contention. Life is like that sometimes! ;-)
Posted Jan 4, 2013 18:58 UTC (Fri)
by daney (guest, #24551)
[Link] (2 responses)
1) Each load from the Now-Serving counter memory location creates traffic on the internal CPU buses. This traffic decreases the amount of useful work that can be done. Since the threads blocked waiting for their Now-Serving number to come up are not doing anything useful, decreasing the amount of bus traffic they are generating makes everything else go faster.
2) Cache line contention/bouncing caused by Ticket-Counter modifications and modifications to the data protected by the lock.
The first issue is (mostly) the one addressed by the patch in question. Increasing the size/alignment of arch_spinlock_t to occupy an entire cache line might be beneficial for some use cases, but it would increase the size of many structures, thus causing increased cache pressure.
Posted Jan 5, 2013 10:38 UTC (Sat)
by renox (guest, #23785)
[Link]
For the second issue, what you're describing (having the spinlock occupying an entire cache line) isn't always necessary: in some cases you could put 'cold' data in the same cache line as the lock to get the best performance without using too much memory.
Posted Jan 10, 2013 18:30 UTC (Thu)
by ksandstr (guest, #60862)
[Link]
That'd protect the significant cachelines from not only write-bouncing from ticket-acquisition, but from any spinlock-related "oops, had to flush this exclusive cache line to RAM in the meantime" cases due to read traffic also. I'm guessing that an operation to acquire locks on N objects without ordering foulups could fit on top of that as well.
Posted Aug 19, 2024 14:14 UTC (Mon)
by 301043030 (guest, #172920)
[Link]
Posted Jan 4, 2013 22:06 UTC (Fri)
by zlynx (guest, #2285)
[Link]
For example, record the cycle count when getting the lock and when releasing the lock, record the difference as N. The next lock spinner in line should ideally wait N+1 cycles before checking the lock.
Then on the next kernel build use those profile numbers and record them in each spinlock structure.
This would work perfectly if it weren't for the pesky problems of unexpected cache line misses, CPU interrupts and other interference.
Posted Jan 4, 2013 22:12 UTC (Fri)
by zlynx (guest, #2285)
[Link] (1 responses)
Posted Jan 10, 2013 15:18 UTC (Thu)
by Otus (subscriber, #67685)
[Link]
Posted Jan 10, 2013 9:44 UTC (Thu)
by heijo (guest, #88363)
[Link]
Like an MCS lock with the internal state stored in a pseudo-stack of per-CPU variables searched for the lock address, or maybe something stateless if possible.
Posted Jan 10, 2013 19:54 UTC (Thu)
by quanstro (guest, #77996)
[Link] (1 responses)
Posted Jan 21, 2013 21:00 UTC (Mon)
by vcunat (guest, #88938)
[Link]
What about separating locks and data?
It strikes me as quite funny that adding a level of indirection could help performance in this case.
I'd pondered that a bit as I was writing the article. It would be painful.
What about separating locks and data?
What about separating locks and data?
Of course contention isn't the same on all the systems and developers working on embedded or supercomputers would probably disagree on which spinlocks has to be "separated"..
What about separating locks and data?
Aren't there runtime facilities in the kernel for tailoring behavior to the actual cache line size? Something like a cache line allocator?
What about separating locks and data?
What about separating locks and data?
What about separating locks and data?
What about separating locks and data?
What about separating locks and data?
What about separating locks and data?
Improving ticket spinlocks
Improving ticket spinlocks
Improving ticket spinlocks
Improving ticket spinlocks
Improving ticket spinlocks
machines, and 32-bytes for arms.
Improving ticket spinlocks