By Jonathan Corbet
January 3, 2013
Spinlocks, being the lowest-level synchronization mechanism in the kernel,
are the target of seemingly endless attempts at performance enhancement.
The
ticket spinlock mechanism used in the
mainline has resisted such attempts for a few years. Now, though, some
developers have identified a performance bottleneck associated with these
locks and are busily trying to come up with an improved version.
A spinlock is so-named because a CPU waiting for a contended lock will
"spin" in a tight loop, repeatedly querying the lock until it becomes
available. Ticket spinlocks adjust this algorithm by having each waiting
CPU take a "ticket" so that each CPU obtains the lock in the order in which
it arrived. These locks thus resemble the "take a number" mechanisms found
at deli counters or motor vehicle division offices worldwide — though, with
luck, the wait is rather shorter than is required to renew a driver's
license in your editor's part of the world. Without the ticket mechanism,
which was added for the 2.6.25 release, the kernel's spinlocks were unfair;
in some situations, some waiters could be starved for an extended period of
time.
It has long been understood that lock contention reduces system performance
considerably. The simple act of spinning for a lock clearly is not going
to be good for performance, but there are also caching issues to take into
account. If two CPUs are repeatedly acquiring a spinlock, the memory
location representing that lock will bounce back and forth between those
CPUs' caches. Even if neither CPU ever has to wait for the lock, the
process of moving it between caches will slow things down considerably.
For that reason, interest in lockless algorithms has been growing for many
years.
In the case of a contended lock, though, cache contention would appear to
be less of an issue. A CPU spinning on a lock will cache its contents in a
shared mode; no cache bouncing should occur until the CPU owning the lock
releases it. Releasing the lock (and its acquisition by another CPU)
requires writing to the lock, and that requires exclusive cache access.
The cache line movement at that time hurts, but probably not as much as
waiting for the lock in the first place. So it would seem that trying to
optimize cache behavior in the contended case is not likely to produce much
in the way of useful results.
That picture is not complete, though; one must take a couple of other facts
into account. Processors do not cache a single value; they cache a "line"
of (typically) 128 consecutive bytes as a single unit. In other words, the
cache lines in any contemporary processor are
almost certainly significantly larger than what is required to hold a
spinlock. So when a CPU needs
exclusive access to a spinlock's cache line, it also gains exclusive access
to a significant chunk of surrounding data. And that is where the other
important detail comes into play: spinlocks tend to be embedded within the
data structures that they protect, so that surrounding data is typically
data of immediate interest to the CPU holding the lock.
Kernel code will acquire a lock to work with (and, usually, modify) a
structure's contents. Often, changing a field within the protected
structure will require access to the same cache line that holds the
structure's spinlock. If the lock is uncontended, that access is not a
problem; the CPU owning the lock probably owns the cache line as well. But
if the lock is contended, there will be one or more other CPUs constantly
querying its value, obtaining shared access to that same cache line and
depriving the lock holder of the exclusive access it needs. A subsequent
modification of data within the affected cache line will thus incur a cache
miss. So
CPUs querying a contended lock can slow the lock owner considerably, even
though that owner is not accessing the lock directly.
How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket
spinlocks, Rik van Riel describes a microbenchmark that is slowed by a
factor of two when there is a single contending CPU, and by as much as a
factor of ten with many CPUs in the mix. That is not just a slowdown; that
is a catastrophic loss of performance. Needless to say, that is not the
sort of behavior that kernel developers like to see.
Rik's solution is simple enough. Rather than spinning tightly and querying a
contended lock's status, a waiting CPU should wait a bit more patiently,
only querying the lock occasionally. So his patch causes a waiting CPU to
loop a number of times doing nothing at all before it gets impatient and
checks the lock again. It goes without saying that picking that "number of
times" correctly is the key to good performance with this algorithm. While
a CPU is looping without querying the lock it cannot be bouncing cache
lines around, so
the lock holder should be able to make faster progress. But too much
looping will cause the lock to sit idle before the owner of the next ticket
notices that its turn has come; that, too, will hurt performance.
The first step in Rik's patch series calculates how many CPUs must release
the lock before the current CPU can claim it (by subtracting the current
CPU's ticket number from the number currently being served) and loops 50
times for every CPU that is ahead in the queue. That is where the
"proportional backoff" comes in; the further back in line the CPU is, the
longer it will wait between queries of the lock. The result should be a
minimizing of idle looping while also minimizing cache traffic.
The number 50 was determined empirically, but it seems unlikely that it
will be optimal for all situations. So the final part of Rik's patch set
attempts to tune that number dynamically. The dynamic delay factor is
increased when the lock is found to be unavailable and decreased when the
lock is obtained. The goal is to have a CPU query the lock an average of
2.7 times before obtaining it. The number 2.7, once again, was obtained by
running lots of tests and seeing what worked best; subsequent versions of
the patch have tweaked this heuristic somewhat.
Details aside, the core idea is that the delay factor (a per-CPU value that
applies to all
contended locks equally) will increase for workloads experiencing more
contention, tuning the system appropriately.
That said, the notion of a single delay
for all locks is likely to be causing a severe case of raised eyebrows for
some readers, and, indeed, it turned out to be inadequate; some locks are
rather more contended than others, after all. So the January 3 version of Rik's patch
keeps a hashed list (based on the spinlock address) of delay values instead.
Michel Lespinasse ran some
experiments of his own to see how well the proportional backoff
algorithm worked. In particular, he wanted to figure out whether it was
truly necessary to calculate a dynamic delay factor, or whether an optimal
static value could be found. His conclusion was that, in fact, a static
value is good enough; it might be possible to do a little better with a
dynamic value, he said, but the improvement is not enough to justify the
added complexity of the tuning mechanism. There is just one little
difficulty:
Of course, one major downside in my proposal is that I haven't
figured out an automatic way to find the most appropriate
spinlock_delay system tunable. So there is clearly some more
experimentation needed there. However, IMO the important result
here is that our goal of avoiding performance cliffs seems to be
reachable without the complexity (and IMO, risk) of per-spinlock
tuning values.
If these results stand, and an appropriate way of picking the static value
can be found, then there is probably not a case for adding dynamic backoff
to the kernel's spinlock implementation. But the backoff idea in general
would appear to be a significant improvement for some workloads. So the
chances are good that we will see it added in some form in an upcoming
development cycle.
(
Log in to post comments)