January 30, 2013
This article was contributed by Andi Kleen
elision : the act or an instance of omitting something : omission
Contended locks are a common software scaling problem on modern multi-core
systems. As more cores are added the synchronization overhead increases,
potentially exceeding the speed-up provided by the extra
processors. Writing good, scalable, fine-grained
locking is difficult for programmers. An alternative to locking is to use
memory transactions.
A memory transaction buffers all of the memory side effects from a code region and
makes the memory changes visible in an atomic manner, but only if the transaction
commits. If, instead, there is a conflict — memory used by one thread in
the transaction is modified by another thread at the same time —
the transaction is aborted and rolled back, and its side effects are discarded.
If the transaction commits, all memory changes become visible to other
threads atomically.
Traditionally, memory transactions have been implemented in software and
were too slow for most practical uses. In order to make memory
transactions usable in real-world situations, Intel has announced "Intel
Transactional Synchronization Extension" (Intel TSX) for the upcoming
Haswell CPU core. TSX supports memory transactions in hardware to improve
their performance and make them practical. IBM has also announced similar
extensions.
Of course very few programs today are written to use memory transactions
directly. Hardware memory transactions are a best-effort model intended for fast
paths: there is no guarantee that any hardware transaction will be
successful (but most reasonable ones should be). Any transaction implemented
with TSX needs a fallback path. The simplest fallback path is to use a lock
when the transaction doesn't succeed. This technique is called "lock elision"; it
executes a lock-protected region speculatively as a transaction, and only falls back
to blocking on the lock if the transaction does not commit.
Lock elision uses the same programming model as normal locks, so it can be
directly applied to existing programs. The programmer keeps using locks,
but the locks are faster as they can use hardware transactional memory
internally for more parallelism. Lock elision uses memory transactions as a
fast path, while the slow path is still a normal lock. Deadlocks and other
classic locking problems are still possible, because the transactions may
fall back to a real lock at any time. Critical sections that do not conflict in
their memory accesses with other threads and are otherwise
transaction-friendly will execute in parallel, and not block each other
out. The lock
elision fast path makes the lock lock-less (non-blocking) if the conditions
are right.
Usually, getting good performance with locks requires a large programming
effort to do locking in a sufficiently fine-grained manner that individual locks are not
overly contended. This work complicates the program. With lock elision it is
possible to stay with coarser-grained locks and let the hardware discover
the underlying parallelism opportunities instead, saving programmer
time.
Intel TSX implements two instruction-set interfaces: HLE (Hardware Lock
Elision) and RTM (Restricted Transactional Memory). HLE can be used to add
transaction hints to existing code paths, while RTM allows control of the
transactions directly and the use of software abort handlers. HLE refers to the
specific instruction prefixes, while lock elision is a more general concept
that can be also implemented with RTM. RTM requires new code paths as new
instructions are used.
The lock library needs to be modified to enable elision. Obvious candidates
are the glibc POSIX thread (pthread) mutex and rwlock primitives. Existing binaries can
be dynamically linked to the new glibc pthreads library. Programs using
pthread locking can then directly elide their locks without modifications.
Programs that implement their own locking library need some changes to
their lock code paths to enable lock elision.
glibc pthreads elision implementation
An implementation of RTM lock elision for
glibc has been posted to the
libc-alpha mailing list. POSIX mutexes support different locking types and
attributes. Glibc already has a range of lock types, like timed, adaptive,
recursive; elided locks simply become a new lock attribute and type. The elision
can be controlled by the administrator or the programmer, or it can be automatically
controlled by the pthreads library. The current interface is documented in
the glibc manual draft. This is a preliminary draft that may change, as
the code reviewers are expressing concerns about the API design.
The pthread lock functions like pthread_mutex_lock() dispatch on the
different lock types. A new elision path has been added there for timed
and adaptive mutexes. The glibc implementation uses RTM because it is more
flexible.
Inside the locking code, elision is implemented as a wrapper around the
existing lock. The wrapper first tries to run in a transaction, and only if
it's not successful does it call the underlying lock for blocking as a
fallback path.
A simple RTM elision wrapper looks like this (the actual glibc
implementation is somewhat more complex, since it implements retries and
adaptation):
void elided_lock_wrapper(lock) {
if (_xbegin() == _XBEGIN_STARTED) { /* Start transaction */
if (lock is free) /* Check lock and put into read-set */
return; /* Execute lock region in transaction */
_xabort(0xff); /* Abort transaction as lock is busy */
} /* Abort comes here */
take fallback lock
}
void elided_unlock_wrapper(lock) {
if (lock is free)
_xend(); /* Commit transaction */
else
unlock lock;
}
The _xbegin() call is a wrapper around a special instruction that
begins the transaction. It can be seen as being similar to
setjmp() in that it may return a second time if the transaction
is aborted partway through. Unlike the setjmp() situation,
though, a transaction abort will also unwind any other changes made since
the transaction began. If the transaction begins successfully,
_XBEGIN_STARTED will be returned. Should the transaction be
aborted, _xbegin() will return with an error code after the
transaction's side effects have been unwound; that will cause the code to
take the lock instead.
Either way, the lock itself will be put into the transaction's "read set";
a change to memory in the read set by another thread will cause the
transaction to abort immediately. Other memory locations accessed during
the transaction are treated in the same way; the processor will watch for
conflicting accesses and abort the transaction if need be. So, as soon as
one thread generates a conflicting memory access, one or more threads will
take the lock, causing all ongoing transactions tied to this lock to abort
and fall back to normal locking.
In its full form, the glibc elision wrapper uses an adaptive elision algorithm. When a lock
region frequently aborts its transactions it may slow down the program due
to mis-speculation (executing transactions that do not commit). Not
all mis-speculations necessarily cost time because they often happen in the
time the thread would have blocked anyway. But transactions that run for a
long time before aborting may slow down the execution of the program as a
whole. The adaptive algorithm
reacts to an aborted transaction by disabling elision for the lock for some time,
before retrying. This avoids the need of manually annotating every lock in
programs for elision.
It is also possible for the programmer to manually opt in or out, per-lock,
using a per-lock annotation interface. That capability may be especially
useful for rwlocks, since adaptive elision is not currently implemented for
them. See the
manual for more details.
Writing to a cache line used in a transaction will cause that transaction
to abort; thus, code in critical sections protected by elided locks cannot write to the
cache lines (regions of memory stored as a unit in the CPU's cache)
containing those locks. One of the big
advantages of elision is that it can keep the lock cache line in the "shared"
state, avoiding the infamous "lock cache line bouncing" communication
overhead that often slows down programs with fine-grained locking. But this
requires not writing to the lock or any memory location in the same cache
line. Normal glibc mutexes have owner fields
for debugging purposes which they always write. The elision code path
disables changing these owner fields, to keep the mutex structure read-only
in the locking fast path. The only side effect is that
pthread_mutex_destroy() will not return an error when called on a locked
lock, as the implementation can no longer detect this case.
Code using
RTM-elided locks do not know inside the critical section that a lock is locked
(the lock appears to be free). POSIX pthreads locking does not export an
operation to query a lock's state directly, but it can be queried
indirectly by calling pthread_mutex_trylock() recursively on a lock already
locked in the same thread. With RTM elision this call will succeed (HLE has
special hardware support to handle this case). Essentially, elided locks
behave like recursive locks. There were some concerns among code reviewers
that this behavior violates the POSIX standard ("shall fail" instead of "may
fail"). The latest version aborts nested trylocks by default, unless
elision is explicitly
forced for the mutex. This prevents some elision
opportunities, especially for programs that use pthread_mutex_trylock() as their
main locking primitive; this is usually done in an attempt to do a homebrew
spin-then-sleep lock, instead of using glibc's adaptive mutexes.
/* Assuming lock is not recursive */
pthread_mutex_lock(&lock);
if (pthread_mutex_trylock(&lock) == 0)
/* lock with elision forced will come here *
else
/* non elided lock will come here, after aborting */
pthread_mutex_unlock() detects whether the current lock is executed
transactionally by checking if the lock is free. If it is free it commits
the transaction, otherwise the lock is unlocked normally. This implies that
if a broken program unlocks a free lock, it may attempt to commit outside a
transaction, an error which causes a fault in RTM. In POSIX, unlocking a free lock is
undefined (so any behavior, including starting World War 3 is acceptable). It is
possible to detect this situation by adding an additional check in the
unlock path. The current glibc implementation does not do this, but if this
programming mistake is common, the implementation may add this check in the
future.
There are some other minor semantic changes that are not expected to affect
programs. See the glibc manual for a detailed list.
Tuning and Profiling TSX
Existing programs that use pthread
locking can run unchanged with the eliding pthreads library. Many programs
have some unnecessary conflicts in common lock regions that can be fixed
relatively easily, with a corresponding improvement in performance with
elision. These changes
typically also improve performance without elision, because they eliminate
unnecessary bouncing of cache lines between CPU cores.
Memory transactions involve speculative execution, so generally a profiler
is needed to understand their performance implications. A patch kit for perf to
enable
profiling on an Intel Haswell CPU is currently pending on
linux-kernel. This patch kit is relatively large because it includes special
support for TSX profiling. A normal cycle profiler would cause additional
aborts with its profiling interrupts, so special TSX events need to be used
to profile aborts.
Basic profiling for TSX with the perf patch kit is done with:
perf stat -T ....
to measure basic transaction success.
When the abort rate is high, aborts can be profiled with:
perf record -g -e tx-aborts ....
perf report
This profiles RTM abort locations in the source (for HLE aborts,
"-e el-aborts" should be used instead). It is important to
keep in mind that the call graph displayed
with -g is after the abort, so it points to the lock library, not the abort
location inside the lock region. The code address reported at the beginning
of the call graph is inside the transaction though. Essentially, the
call graph is not continuous.
Additional information on the abort can be recorded with:
perf record -g -W --transaction -e tx-aborts ...
perf report --sort comm,vdso,symbol,weight,transaction
The transaction flag allows the categorization of aborts into different
classes: caused by the current thread (SYNC), caused by another thread
(CONFLICT), and others. weight represents the number of cycles wasted in
the transaction before they are aborted; that is, how costly the abort is. Due to
limitations in the perf infrastructure and, contrary to what the option name
suggests, these fields do not actually sort currently.
Conflict abort profiles show only the target of the conflict abort, not the
cause. There are some common program patterns that are vulnerable to
conflicts and can be relatively easily fixed when detected:
- Global statistic counters (disable or make per thread)
- False sharing
in data structures or variables
- Unnecessary changes to variables that are already the expected value
(add a test)
Lock elision in other projects
Other projects that implement their own locking would need their own
TSX-enabled lock library. The kernel has its own locking library, but, in a
sense,
lock elision is less important for the kernel because it already has
highly tuned fine-grained locks, unlike many user-level programs. So glibc
elision is more important than kernel elision. But even a highly tuned code
base like the kernel can benefit from locking improvements (and not all
locks in the kernel are highly tuned for every workload).
The kernel has a large number of locks. Benchmarking each lock individually
with different workloads and annotating all locks is impractical. One
approach is to enable lock elision for all locks and rely on automatic
adaption algorithms (and some manual overrides) to disable elision for
elision-unfriendly locks.
The kernel locking primitives (spinlocks, rw spinlocks, mutexes, rw
semaphores, bit spinlocks) can all be elided with a similar RTM wrapping
pattern as used for the glibc mutexes. The kernel has an "is-locked"
primitive, which can be handled with RTM locks only with aborting. This can
be avoided with some changes to the callers in the kernel.
Some more details are available in this Linux plumbers
talk [PDF]. The kernel elision implementation is still in development.
Next steps
The next step is to pass glibc code review and integrate the lock elision
implementation into the glibc mainline. Then testing with a large range of
applications is useful to evaluate performance and potential problems. This
will require volunteers with programs that use pthread locking and the
willingness to test them. When no TSX-enabled system is available the SDE
emulator can be used for functional testing.
Credits
The glibc elision implementation has been implemented by Andi Kleen and Hongjiu Lu.
Author
Andi Kleen is a software engineer working on Linux in the Intel Open Source
Technology Center.
This article is the author's opinion and he is not speaking for Intel.
(
Log in to post comments)