LWN.net Logo

Development

Lock elision in the GNU C library

January 30, 2013

This article was contributed by Andi Kleen

elision : the act or an instance of omitting something : omission

Contended locks are a common software scaling problem on modern multi-core systems. As more cores are added the synchronization overhead increases, potentially exceeding the speed-up provided by the extra processors. Writing good, scalable, fine-grained locking is difficult for programmers. An alternative to locking is to use memory transactions.

A memory transaction buffers all of the memory side effects from a code region and makes the memory changes visible in an atomic manner, but only if the transaction commits. If, instead, there is a conflict — memory used by one thread in the transaction is modified by another thread at the same time — the transaction is aborted and rolled back, and its side effects are discarded. If the transaction commits, all memory changes become visible to other threads atomically.

Traditionally, memory transactions have been implemented in software and were too slow for most practical uses. In order to make memory transactions usable in real-world situations, Intel has announced "Intel Transactional Synchronization Extension" (Intel TSX) for the upcoming Haswell CPU core. TSX supports memory transactions in hardware to improve their performance and make them practical. IBM has also announced similar extensions.

Of course very few programs today are written to use memory transactions directly. Hardware memory transactions are a best-effort model intended for fast paths: there is no guarantee that any hardware transaction will be successful (but most reasonable ones should be). Any transaction implemented with TSX needs a fallback path. The simplest fallback path is to use a lock when the transaction doesn't succeed. This technique is called "lock elision"; it executes a lock-protected region speculatively as a transaction, and only falls back to blocking on the lock if the transaction does not commit.

Lock elision uses the same programming model as normal locks, so it can be directly applied to existing programs. The programmer keeps using locks, but the locks are faster as they can use hardware transactional memory internally for more parallelism. Lock elision uses memory transactions as a fast path, while the slow path is still a normal lock. Deadlocks and other classic locking problems are still possible, because the transactions may fall back to a real lock at any time. Critical sections that do not conflict in their memory accesses with other threads and are otherwise transaction-friendly will execute in parallel, and not block each other out. The lock elision fast path makes the lock lock-less (non-blocking) if the conditions are right.

Usually, getting good performance with locks requires a large programming effort to do locking in a sufficiently fine-grained manner that individual locks are not overly contended. This work complicates the program. With lock elision it is possible to stay with coarser-grained locks and let the hardware discover the underlying parallelism opportunities instead, saving programmer time.

Intel TSX implements two instruction-set interfaces: HLE (Hardware Lock Elision) and RTM (Restricted Transactional Memory). HLE can be used to add transaction hints to existing code paths, while RTM allows control of the transactions directly and the use of software abort handlers. HLE refers to the specific instruction prefixes, while lock elision is a more general concept that can be also implemented with RTM. RTM requires new code paths as new instructions are used.

The lock library needs to be modified to enable elision. Obvious candidates are the glibc POSIX thread (pthread) mutex and rwlock primitives. Existing binaries can be dynamically linked to the new glibc pthreads library. Programs using pthread locking can then directly elide their locks without modifications. Programs that implement their own locking library need some changes to their lock code paths to enable lock elision.

glibc pthreads elision implementation

An implementation of RTM lock elision for glibc has been posted to the libc-alpha mailing list. POSIX mutexes support different locking types and attributes. Glibc already has a range of lock types, like timed, adaptive, recursive; elided locks simply become a new lock attribute and type. The elision can be controlled by the administrator or the programmer, or it can be automatically controlled by the pthreads library. The current interface is documented in the glibc manual draft. This is a preliminary draft that may change, as the code reviewers are expressing concerns about the API design.

The pthread lock functions like pthread_mutex_lock() dispatch on the different lock types. A new elision path has been added there for timed and adaptive mutexes. The glibc implementation uses RTM because it is more flexible. Inside the locking code, elision is implemented as a wrapper around the existing lock. The wrapper first tries to run in a transaction, and only if it's not successful does it call the underlying lock for blocking as a fallback path.

A simple RTM elision wrapper looks like this (the actual glibc implementation is somewhat more complex, since it implements retries and adaptation):

    void elided_lock_wrapper(lock) {
	if (_xbegin() == _XBEGIN_STARTED) {	/* Start transaction */
	    if (lock is free)	/* Check lock and put into read-set */
		return;		/* Execute lock region in transaction */
	
	    _xabort(0xff);		/* Abort transaction as lock is busy */
	}				/* Abort comes here */
	take fallback lock
    }

   void elided_unlock_wrapper(lock) {
	if (lock is free)
	    _xend();		/* Commit transaction */
	else
	    unlock lock;
    }

The _xbegin() call is a wrapper around a special instruction that begins the transaction. It can be seen as being similar to setjmp() in that it may return a second time if the transaction is aborted partway through. Unlike the setjmp() situation, though, a transaction abort will also unwind any other changes made since the transaction began. If the transaction begins successfully, _XBEGIN_STARTED will be returned. Should the transaction be aborted, _xbegin() will return with an error code after the transaction's side effects have been unwound; that will cause the code to take the lock instead.

Either way, the lock itself will be put into the transaction's "read set"; a change to memory in the read set by another thread will cause the transaction to abort immediately. Other memory locations accessed during the transaction are treated in the same way; the processor will watch for conflicting accesses and abort the transaction if need be. So, as soon as one thread generates a conflicting memory access, one or more threads will take the lock, causing all ongoing transactions tied to this lock to abort and fall back to normal locking.

In its full form, the glibc elision wrapper uses an adaptive elision algorithm. When a lock region frequently aborts its transactions it may slow down the program due to mis-speculation (executing transactions that do not commit). Not all mis-speculations necessarily cost time because they often happen in the time the thread would have blocked anyway. But transactions that run for a long time before aborting may slow down the execution of the program as a whole. The adaptive algorithm reacts to an aborted transaction by disabling elision for the lock for some time, before retrying. This avoids the need of manually annotating every lock in programs for elision.

It is also possible for the programmer to manually opt in or out, per-lock, using a per-lock annotation interface. That capability may be especially useful for rwlocks, since adaptive elision is not currently implemented for them. See the manual for more details.

Writing to a cache line used in a transaction will cause that transaction to abort; thus, code in critical sections protected by elided locks cannot write to the cache lines (regions of memory stored as a unit in the CPU's cache) containing those locks. One of the big advantages of elision is that it can keep the lock cache line in the "shared" state, avoiding the infamous "lock cache line bouncing" communication overhead that often slows down programs with fine-grained locking. But this requires not writing to the lock or any memory location in the same cache line. Normal glibc mutexes have owner fields for debugging purposes which they always write. The elision code path disables changing these owner fields, to keep the mutex structure read-only in the locking fast path. The only side effect is that pthread_mutex_destroy() will not return an error when called on a locked lock, as the implementation can no longer detect this case.

Code using RTM-elided locks do not know inside the critical section that a lock is locked (the lock appears to be free). POSIX pthreads locking does not export an operation to query a lock's state directly, but it can be queried indirectly by calling pthread_mutex_trylock() recursively on a lock already locked in the same thread. With RTM elision this call will succeed (HLE has special hardware support to handle this case). Essentially, elided locks behave like recursive locks. There were some concerns among code reviewers that this behavior violates the POSIX standard ("shall fail" instead of "may fail"). The latest version aborts nested trylocks by default, unless elision is explicitly forced for the mutex. This prevents some elision opportunities, especially for programs that use pthread_mutex_trylock() as their main locking primitive; this is usually done in an attempt to do a homebrew spin-then-sleep lock, instead of using glibc's adaptive mutexes.

    /* Assuming lock is not recursive */
    pthread_mutex_lock(&lock);
    if (pthread_mutex_trylock(&lock) == 0)
	/* lock with elision forced will come here *
    else
	/* non elided lock will come here, after aborting */

pthread_mutex_unlock() detects whether the current lock is executed transactionally by checking if the lock is free. If it is free it commits the transaction, otherwise the lock is unlocked normally. This implies that if a broken program unlocks a free lock, it may attempt to commit outside a transaction, an error which causes a fault in RTM. In POSIX, unlocking a free lock is undefined (so any behavior, including starting World War 3 is acceptable). It is possible to detect this situation by adding an additional check in the unlock path. The current glibc implementation does not do this, but if this programming mistake is common, the implementation may add this check in the future.

There are some other minor semantic changes that are not expected to affect programs. See the glibc manual for a detailed list.

Tuning and Profiling TSX

Existing programs that use pthread locking can run unchanged with the eliding pthreads library. Many programs have some unnecessary conflicts in common lock regions that can be fixed relatively easily, with a corresponding improvement in performance with elision. These changes typically also improve performance without elision, because they eliminate unnecessary bouncing of cache lines between CPU cores.

Memory transactions involve speculative execution, so generally a profiler is needed to understand their performance implications. A patch kit for perf to enable profiling on an Intel Haswell CPU is currently pending on linux-kernel. This patch kit is relatively large because it includes special support for TSX profiling. A normal cycle profiler would cause additional aborts with its profiling interrupts, so special TSX events need to be used to profile aborts.

Basic profiling for TSX with the perf patch kit is done with:

    perf stat -T ....

to measure basic transaction success.

When the abort rate is high, aborts can be profiled with:

    perf record -g -e tx-aborts .... 
    perf report

This profiles RTM abort locations in the source (for HLE aborts, "-e el-aborts" should be used instead). It is important to keep in mind that the call graph displayed with -g is after the abort, so it points to the lock library, not the abort location inside the lock region. The code address reported at the beginning of the call graph is inside the transaction though. Essentially, the call graph is not continuous.

Additional information on the abort can be recorded with:

    perf record -g -W --transaction -e tx-aborts ...
    perf report --sort comm,vdso,symbol,weight,transaction

The transaction flag allows the categorization of aborts into different classes: caused by the current thread (SYNC), caused by another thread (CONFLICT), and others. weight represents the number of cycles wasted in the transaction before they are aborted; that is, how costly the abort is. Due to limitations in the perf infrastructure and, contrary to what the option name suggests, these fields do not actually sort currently.

Conflict abort profiles show only the target of the conflict abort, not the cause. There are some common program patterns that are vulnerable to conflicts and can be relatively easily fixed when detected:

  • Global statistic counters (disable or make per thread)
  • False sharing in data structures or variables
  • Unnecessary changes to variables that are already the expected value (add a test)

Lock elision in other projects

Other projects that implement their own locking would need their own TSX-enabled lock library. The kernel has its own locking library, but, in a sense, lock elision is less important for the kernel because it already has highly tuned fine-grained locks, unlike many user-level programs. So glibc elision is more important than kernel elision. But even a highly tuned code base like the kernel can benefit from locking improvements (and not all locks in the kernel are highly tuned for every workload).

The kernel has a large number of locks. Benchmarking each lock individually with different workloads and annotating all locks is impractical. One approach is to enable lock elision for all locks and rely on automatic adaption algorithms (and some manual overrides) to disable elision for elision-unfriendly locks.

The kernel locking primitives (spinlocks, rw spinlocks, mutexes, rw semaphores, bit spinlocks) can all be elided with a similar RTM wrapping pattern as used for the glibc mutexes. The kernel has an "is-locked" primitive, which can be handled with RTM locks only with aborting. This can be avoided with some changes to the callers in the kernel.

Some more details are available in this Linux plumbers talk [PDF]. The kernel elision implementation is still in development.

Next steps

The next step is to pass glibc code review and integrate the lock elision implementation into the glibc mainline. Then testing with a large range of applications is useful to evaluate performance and potential problems. This will require volunteers with programs that use pthread locking and the willingness to test them. When no TSX-enabled system is available the SDE emulator can be used for functional testing.

Credits

The glibc elision implementation has been implemented by Andi Kleen and Hongjiu Lu.

Author

Andi Kleen is a software engineer working on Linux in the Intel Open Source Technology Center. This article is the author's opinion and he is not speaking for Intel.

Comments (7 posted)

Brief items

Quotes of the week

Do you remember the GNOME 1.x => 2.x transition? Similarly to how there are forks of GNOME now to 'keep the GNOME 2 candle burning,' there were forks of GNOME 1.x to 'keep the GNOME 1 candle burning.'

Do you remember what they were called? I didn't; I had to look 'em up. Do you ever wonder what happened to them? Dead projects nobody seems to remember. Do we really want to switch to a desktop that history has shown is likely to become a dead project in a few years?

Máirín Duffy

The only thing we're missing is a nice car analogy! So let me provide one.

systemd's an Edsel with the trailer and aircraft-carrier-catapult attachments, sysvinit is a Peel Trident. (I just want a Volvo.)

"nix"

Comments (57 posted)

Kdenlive 0.9.4 released

Version 0.9.4 of the Kdenlive video editor is out with a number of new features. "Kdenlive can now parse your clips to find the different scenes and add markers or cut the clip accordingly. The process is currently very slow but it's a start... Kdenlive can also now analyse an object's motion, and the result of this can be used as keyframes for a transition or an effect. For example, you can now have a title clip that follows an object."

Comments (none posted)

Seigo: Plasma.next()?

Aaron Seigo describes the development plans for the Plasma framework. "However, in Plasma Active we've made two purposeful decisions: do not expose the hierarchical file system (unless the use case dictates that as a requirement) and do not expose details that are not relevant to the usage of the device (e.g. I care that I can open that spreadsheet, it's less important at that moment that the application is Calligra Sheets). Thus to open a spreadsheet one opens the file manager and goes to Documents, or simply searches for the document from the Launch area directly. No file system, no application launchers."

Comments (107 posted)

RPM 4.11.0 released

Version 4.11.0 of the RPM package manager is out. It offers a number of performance improvements, better conflict detection, a new %license directive, and a new %autosetup macro for better automation of source unpacking and patch application.

Full Story (comments: none)

Newsletters and articles

Development newsletters from the last week

Comments (none posted)

Bacon: Community Driven Ubuntu Phone Core Apps

On his blog, Jono Bacon describes the Ubuntu Phone core apps development process while also soliciting volunteers to help on the design side. "So, we have a good set of developers assigned for each app, but we would like to invite our community to contribute design ideas for each of these apps. We have already defined a set of user stories and functional requirements, and for each app we have also defined a set of the core screens and functionality that we will need design for. We would like to invite you wonderful designers out there to contribute your design ideas, and these ideas can provide food for thought for the developers." The "core" apps are things like calendar, clock, calculator, email client, document viewer, social media apps, and so on.

Comments (5 posted)

Making a Film by using Blender (animationsupplement.com)

Animationsupplement.com has an interview with Ajit Nair who is leading a team making the 100-minute animation feature Naughty 5 using the Blender open source 3D content creation suite. The film is about five naughty children and is expected to be released by mid-2013. "Students should focus on enhancing their creativity and art rather than understanding software functions. What you want to achieve is primary and how you achieve it is secondary. For people planning to make a film should definitely give Blender a try, there will be some time spent training but it will definitely be worth it." (Thanks to Paul Wise.)

Comments (1 posted)

Clasen: GNOME 3.7 at the halfway mark

Matthias Clasen previews the changes in GNOME 3.8. "Allowing you to focus on your task and minimizing interruptions has been an important aspect of the GNOME 3 design from the start. So far, we just had a global switch to turn off notifications. The new Notification panel expands on this and allows fine-grained control over what applications get to annoy you, and how much."

Comments (179 posted)

Poettering: The Biggest Myths

Lennart Poettering decided to refute a few systemd myths on his blog, where "a few" means "30". "There's certainly some truth in that. systemd's sources do not contain a single line of code originating from original UNIX. However, we drive inspiration from UNIX, and thus there's a ton of UNIX in systemd. For example, the UNIX idea of 'everything is a file' finds reflection in that in systemd all services are exposed at runtime in a kernel file system, the cgroupfs. Then, one of the original features of UNIX was multi-seat support, based on built-in terminal support. Text terminals are hardly the state of the art how you interface with your computer these days however. With systemd we brought native multi-seat support back, but this time with full support for today's hardware, covering graphics, mice, audio, webcams and more, and all that fully automatic, hotplug-capable and without configuration."

Comments (465 posted)

Schulz: The meaning of the 4.0

Charles H. Schulz Looks forward to the LibreOffice 4.0 release, currently planned for early February. "On a more abstract level, these changes also mark a more radical departure from the OpenOffice.org codebase, and it is now becoming quite difficult to just assume that because OpenOffice.org, Apache OpenOffice behave in one specific way LibreOffice would do just the same. Of course the API changes do not make the whole work themselves, but the work we started with the 3.4 branch is paying off: LibreOffice 4.0 is becoming a different animal, and that comes with its own distinct advantages while clearly showing our ability as a community to innovate and move forward."

Comments (16 posted)

Villa: Pushing back against licensing and the permission culture

Luis Villa ponders the "post open-source" movement, which rejects licensing altogether. "If some 'no license' sharing is a quiet rejection of the permission culture, the lawyer’s solution (make everyone use a license, for their own good!) starts to look bad. This is because once an author has used a standard license, their immediate interests are protected – but the political content of not choosing a license is lost. Or to put it another way: if license authors get their wish, and everyone uses a license for all content, then, to the casual observer, it looks like everyone accepts the permission culture. This could make it harder to change that culture – to change the defaults – in the long run."

Comments (72 posted)

Page editor: Nathan Willis
Next page: Announcements>>

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds