August 1, 2012
This article was contributed by Jon Masters
Modern Linux distributions support a number of different computer
architectures. Each of these architectures has its own quirks and
implementation differences that are largely abstracted by a clever
collaboration between the kernel and system libraries, such as the GNU C
Library (glibc). However, there are still some ways in which core
architecture differences are exposed to higher-level software. One example
of this is in the implementation of atomic memory operations.
Atomic operations are necessary to ensure programming correctness in
those situations where there are multiple threads of simultaneous
execution. (Atomic operations are even necessary on uniprocessor systems,
where interrupts and asynchronous scheduling of other threads provide the
illusion of multithreading.) Atomicity means that a given operation (such
as incrementing a counter variable) takes place in an indivisible fashion;
its result is either visible to all CPUs in the system instantaneously, or
does not take place at all (it is similar in concept but less fashionable
than transactional memory).
Atomic operations are typically fairly small, hand-optimized assembly
functions that provide for atomic increment and decrement of counters,
acquisition and release of locks, and so on. Since these operations differ
from one architecture to another, typically few developers on any given
project understand the different implementations in their entirety, and
even fewer care to vouch for the code being correct across all supported
architectures. Although there are generic implementations available in
libraries such as pthreads, not all projects can make use of them, for a
variety of reasons, including a desire to be portable to non-Linux
platforms; thus a number of projects within the average Linux distribution
still contain their own custom implementations of atomic operations.
Atomic operations are particularly useful on modern systems with many
CPUs running multi-threaded applications, but even a system with a single
CPU (core) has a need for them. After all, the Linux kernel may interrupt a
running task thread "A" to service an interrupt routine, and may then
schedule a different task thread "B" before returning to the one that was
originally interrupted. Without a means to ensure certain operations have
taken place atomically, there would be no way to cope with potential
interference between task thread B and task thread A (e.g., if both threads
race for the same lock or operate on the same variable).
How do atomic operations work?
Atomic operations fundamentally require underlying hardware
support. There are, broadly speaking, two popular mechanisms used by CPUs
in implementing support for atomic operations in modern computer
architectures. The older, more traditional approach involves directly
manipulating memory locations, for example, a compare-and-swap (or
compare-and-exchange) instruction such as CMPXCHG on Intel's x86
and Itanium architectures. These instructions compare the value of a given
memory location with a value supplied as part of the instruction. If the
two values are identical, then yet another supplied value is written back
to the memory location, while the overall result is signaled in the form of
a returned value (almost universally in a register). This whole sequence
takes place as a single processor instruction, for example by locking the
processor local bus, and disabling external interrupts for its duration.
A more modern alternative to directly acting upon memory locations is to
implement a reservation engine within the processor. A reservation engine
(as used by modern RISC architectures such as ARM, POWER, etc.) is
typically implemented under the control of two special processor
instructions. The first instruction, often called load-with-reservation,
load-exclusive, or load-link, atomically loads the value of a given memory
location into a register and marks that memory location as reserved.
The loaded value can then be
manipulated arbitrarily before a second instruction, often called
store-exclusive, or store-conditional, atomically stores an updated value
from a register back to a given memory location, provided that no
modification has been made to that memory location in the interim. The
store-exclusive operation returns a value indicating whether the store
operation completed successfully or not, which is important because there is
an opportunity for external interference between the load, modification,
and subsequent store. This means that higher-level atomic operations built
using these instructions typically involve a loop, retrying the entire
operation until the store is successful.
Reservation engines are slightly more complex to work with in software
(requiring two instructions and a comparison in a loop block), but they
come with multiple benefits. Although compare-and-exchange appears simpler
because it is implemented in a single instruction, it in fact causes poor
performance in the CPU pipeline because multiple additional sub-stages are
required for the implied memory operations. By contrast, the
reservation engine approach explicitly separates memory reads and stores
into multiple operations. A reservation engine can be implemented
separately from the bulk of the core CPU logic, and can be as complex as
desired (including necessary logic to synchronize with other reservation
engines).
Some reservation engine implementations handle only a single memory location
at a time on a given processor, while others are more complex. In every
case, outstanding reservations are invalidated upon a context switch
between running tasks (often as a result of a specific invalidation in the
context-switch code). The reservation approach can also handle the "ABA
problem"—that is, it can detect any changes to the target memory
location after the atomic load, even if the original value is written back
prior to the store, because the reservation engine is aware of all memory
modifications.
The story doesn't quite end there. Some architectures lack full support
for certain atomic operations that are required by higher-level software,
such as atomic 64-bit (multiple word) load and store operations. In this
case, there are workarounds (e.g. the "kuser" helper, a VDSO-like helper on
older ARM processors), but that is a topic best saved for another article.
How are atomic operations used?
Atomic-operations libraries typically provide a set of functions that
include incrementing and decrementing a memory location, compare-and-swap
of a memory location, and higher-level operations built using these
functions, such as lock acquisition and release. All of these various
operations are built using the fundamental architecture-specific processor
instructions of the kind described above. As an example, the OpenMPI message-passing library
includes the following inline assembly code to implement an atomic 32-bit
addition operation on version 7 of the ARM Architecture:
START_FUNC(opal_atomic_add_32)
LSYM(13)
ldrex r2, [r0] @ exlusively load address at r0 into r2
add r2, r2, r1 @ increment the value of r2 with value in r1
strex r3, r2, [r0] @ attempt to store the value of r2 at the address in r0
cmp r3, #0 @ r3 contains result from store exclusive, test if successful
bne REFLSYM(13) @ repeat entire operation if it was interrupted
mov r0, r2 @ return value that was written
bx lr
END_FUNC(opal_atomic_add_32)
This atomic increment function works by using the special
ldrex and strex instructions, which control the CPU's
reservation engine, to gain exclusive access to a desired memory
location. The example code first loads the contents of a given memory
location into a general-purpose register, adds a value to the register, and
then tests the result of attempting to exclusively store this change back
to memory. If it is successful, the function returns. If it is not
successful, the operation is repeated until it completes without
interference.
OpenMPI includes a custom atomic-operations library that implements
support for 13 different base architectures. Some of those architectures
have multiple ways to achieve the same thing, depending on which version
is in use. For example, ARM processors have moved away from the deprecated
SWP (compare-and-swap) instruction in favor of a reservation-engine-based approach. Both approaches need to be supported if code is to
run on newer and older ARM processors. It is unfortunate that projects such
as OpenMPI have needed to implement their own atomic-operations libraries,
which must be periodically updated for new processors and are hard to
maintain because they require special knowledge of multiple underlying
architectures.
The C11 memory model
The main culprit for this state of affairs is the venerable C
programming language. Traditionally, C had no explicit internal notion of
multi-threaded applications, and only a very weakly ordered memory
model. That is, it was hard to guarantee that the compiler would not
reorder memory operations on shared variables because the language lacked
the built-in constructs that are necessary to inform the compiler of such
hidden data dependencies. Over the years, independent platform-specific
libraries have provided support for general threading abstractions,
including atomic operations performed on their own defined types. This is
all well and good, but not all projects can rely upon such platform
libraries for atomic operations, especially those that want to remain
highly portable to non-Linux systems.
This is where C11 comes in. C11 introduces a new memory model
explicitly designed with support for threading and atomic operations. It
introduces the new standard header stdatomic.h, atomic integer
types such as atomic_int (constructed using the _Atomic
type qualifier), and a new memory_order enumerated type that
defines various levels of memory ordering from the weakest
memory_order_relaxed (no specific ordering requirement) through to
memory_order_seq (sequentially consistent), the strongest
ordering. Using the C11 memory model, the previous inline assembly can be
reduced to defining an _Atomic typed variable and using one of the
atomic fetch-and-modify generic functions, such as
atomic_fetch_add().
Here is an example of using the C11 defined atomics:
#include <stdatomic.h>
_Atomic int magic_number = ATOMIC_VAR_INIT(42); // can also use _Atomic(int)
atomic_fetch_add(&magic_number, 5); // make Star Trek fans happy
This defines and initializes a new variable called
magic_number to the value 42 (using ATOMIC_VAR_INIT())
before correcting that value for the true answer to the ultimate question
of life, the universe, and everything, which, as everyone knows, Star Trek
correctly defined to be 47. Using the new C11 extensions, projects such as
OpenMPI do not need to implement their own atomic-operations library,
because the underlying language now provides the necessary support, already
optimized for each target architecture.
There is, however, at least one little problem with rushing to embrace
C11. As of this writing, GCC and glibc do not yet have full support for the
new atomic types. This is slated to be added in the GCC 4.8 time
frame. (The glibc maintainers are aware of the topic, and plan to
incorporate support once it is available in GCC.) Meanwhile, GCC 4.7 gained
support for a new set of built-in functions to provide memory-model-aware
atomic operations that were designed specifically to meet the requirements
of the C11 memory model. The idea is that the higher-level C11 atomic
primitives can be easily built using these built-ins in time for GCC
4.8. In the meantime, there are several alternative options. One of these
is to use a third-party macro-based implementation of the C11 types (which
already use the GCC built-ins), of which several exist.
Another option prior to broader C11 support being available is to use
the new GCC built-ins directly. For common operations, such as atomic
increment, the GCC 4.7 built-in atomic functions look very similar to those
that form part of the broader C11 standard:
__atomic_store_n(&v, 42, __ATOMIC_SEQ_CST);
__atomic_add_fetch(&v, 5, __ATOMIC_SEQ_CST);
The OpenMPI atomic-add example code could thus be replaced with a
single call to __atomic_add_fetch(), which will atomically fetch a
value from a memory location, add a supplied value, and return the result,
doing the right thing for every supported architecture. Compiling the
example and disassembling it will (unsurprisingly) produce a sequence of
operations that appears very similar to the inline assembly it replaces. Of
course, one does need to be careful in using the GCC built-ins directly
because they do not require the use of variables with an _Atomic
type qualifier. This means that it is possible to mix the use of atomic functions
with manipulations of regular variables without triggering any compiler
warnings. Still, this is no different than existing code failing to use a
call to a special inline assembly function, which is also incorrect.
In time, it is the hope of this author that most projects implementing
custom inline assembly for atomic operations can move to a standard C11
based implementation using stdatomic.h. That would be both
portable to many different platforms, and easier to maintain by
distributions and upstream projects themselves, because a specialist
knowledge of the architecture specifics can be abstracted by the
compiler. (Note, however, that projects may need to continue to support
legacy approaches to atomic operations, if they want to continue supporting
old compilers.) You can read more about the new C11 atomic operations
(including the details of the memory model and its orderings not covered in
depth here) in section 7.17.7 of the final draft version of the C11
specification [PDF].
(
Log in to post comments)