
Memory barriers for TSO architectures

By Jonathan Corbet
December 11, 2013
Developers — even kernel developers — like to think of access to memory as a nice, predictable, deterministic thing. If two memory locations are written in a given order, one would hope that any other processors would see those writes in the same order. The truth of the matter is rather murkier; compilers and processors are both happy to take liberties with the ordering of memory accesses in the name of performance. Most of this playing around is invisible to programmers, but it can interfere with the correct operation of concurrent systems, so developers must occasionally force things to happen in the right order through the use of memory barriers. The set of available barrier types looks like it will get a bit larger in the 3.14 kernel.

Introduction to memory barriers

A memory barrier is a directive that prohibits the hardware (and compiler) from reordering operations in specific ways. To see how they might be used, consider the following simple example, taken from a 2013 Kernel Summit session. The lockless insertion of a new element into a linked list can be performed in two steps. The first is to set the "next" pointer of the new item to point to the item that will follow it in the list:

[Linked list]

Once that is done, the list itself can be modified to include the new item:

[Linked list]

A thread walking the list will either see the new item or it won't, depending on the timing, but it will see a well-formed list in either case. If, however, the operations are reordered such that the second pointer assignment becomes visible before the first, there will be a period of time during which the structure of the list is corrupted. Should a thread follow that pointer at the wrong time, it will end up off in the weeds. To keep that from happening, this sort of list operation must use a memory barrier between the two writes. With a proper barrier in place, the pointer assignments will never be seen in the wrong order.
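The two-step insertion can be sketched in userspace C11, where a release store plays the role the kernel's write barrier plays above. The names (`node`, `list_insert_after`) are illustrative, not taken from the kernel:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *_Atomic next;
};

/* Insert 'new' immediately after 'prev' in the list. */
static void list_insert_after(struct node *prev, struct node *new)
{
    /* Step 1: point the new node at its successor.  A relaxed store
     * suffices; no other thread can see 'new' yet. */
    atomic_store_explicit(&new->next,
                          atomic_load_explicit(&prev->next,
                                               memory_order_relaxed),
                          memory_order_relaxed);

    /* Step 2: publish the new node.  The release ordering acts as the
     * write barrier: step 1 must be visible before this store is. */
    atomic_store_explicit(&prev->next, new, memory_order_release);
}
```

A lockless reader walking the list will see either the old list or the well-formed new one, never a half-constructed state.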

The kernel offers a wide variety of memory barrier operations adapted to specific situations, but the most commonly used barriers are:

  • smp_wmb() is a write memory barrier; it ensures that any write operation executed after the barrier will not become visible until all writes executed prior to the barrier are visible. A write memory barrier would be the appropriate type to use in the linked list example above.

  • smp_rmb() is a read memory barrier; any reads executed before the barrier are forced to complete before any reads after the barrier can happen. Code traversing a linked list that is subject to lockless modification would want to use read barriers between access to subsequent link pointers.

  • smp_mb() is a barrier for both read and write operations; it can be thought of as the combination of smp_wmb() and smp_rmb().

Memory barriers almost invariably come in pairs. If one of two cooperating threads cares about the order in which two values are written, the other side must be equally concerned about the order in which those values are read.
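The pairing can be seen in a minimal publish/subscribe sketch (illustrative userspace C11; each function is meant to run on its own thread). The writer's release store stands in for smp_wmb(), the reader's acquire load for smp_rmb():

```c
#include <stdatomic.h>

static int data;          /* plain payload */
static atomic_int ready;  /* publication flag */

/* Writer side: runs on one thread. */
static void writer(void)
{
    data = 42;                                   /* write the payload... */
    atomic_store_explicit(&ready, 1,
                          memory_order_release); /* ...then publish it */
}

/* Reader side: runs on another thread.  The acquire load pairs with
 * the release store above, so a reader that observes ready == 1 is
 * guaranteed to also observe data == 42. */
static int reader(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                        /* spin until published */
    return data;
}
```

Drop either half of the pair and the guarantee evaporates: ordering the writes without ordering the reads (or vice versa) protects nothing.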

Naturally enough, the full story is rather more complex than described here. Readers with sufficient interest and free time, along with quite a bit of excess brain power, can read Documentation/memory-barriers.txt for the full story.

The primary reason for the proliferation of memory barrier types is performance. A full memory barrier can be an expensive operation; that is something that kernel developers would prefer to avoid in fast paths. Weaker barriers are often cheaper, especially if they can be omitted altogether on some architectures. The x86 architecture, in particular, offers more ordering guarantees than some others do, making it possible to do without barriers entirely in some situations.

TSO barriers

A situation that has come up relatively recently has to do with "total store order" (TSO) architectures, where, as Paul McKenney put it, "reads are ordered before reads, writes before writes, and reads before writes, but not writes before reads". The x86 architecture has this property, though some others do not. TSO ordering guarantees are enough for a number of situations, but, in current kernels, a full memory barrier must be used to ensure those semantics on non-TSO architectures. Thus, it would be nice to have yet another memory barrier primitive to suit this situation.
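The one reordering TSO does permit is shown by the classic store-buffering litmus test: each CPU's read of the other's variable can be satisfied before its own write drains from the store buffer, so both threads can read zero. A full barrier between the write and the read (smp_mb() in the kernel; a seq_cst fence in this userspace C11 sketch) forbids that outcome, guaranteeing that at least one thread sees the other's store. Each function is meant to run on its own thread:

```c
#include <stdatomic.h>

static atomic_int x, y;

static int thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the "smp_mb()" */
    return atomic_load_explicit(&y, memory_order_relaxed);   /* r0 */
}

static int thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_relaxed);   /* r1 */
}
```

With the fences in place, r0 == 0 && r1 == 0 is impossible; remove them and real x86 hardware will occasionally produce exactly that result.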

Peter Zijlstra had originally called the new barrier smp_tmb(), but Linus was less than impressed with the descriptive power of that name. So Peter came up with a new patch set adding two new primitives:

  • smp_load_acquire() forces a read of a location in memory (in much the same way as ACCESS_ONCE()), but it ensures that the read happens before any subsequent reads or writes.

  • smp_store_release() writes a value back to memory, ensuring that the write happens after any previously-executed reads or writes.

These new primitives are immediately put to work in the code implementing the ring buffer used for perf events. That buffer has two pointers, called head and tail; head is where the kernel will next write event data, while tail is the next location user space will read events from. Only the kernel changes head, while only user space can change tail. In other words, it is a fairly standard circular buffer.

The code on the kernel side works like this (in pseudocode form):

    tail = smp_load_acquire(ring_buffer->tail);
    write_events(ring_buffer->head); /* If 'tail' indicates there is space */
    smp_store_release(ring_buffer->head, new_head);

The smp_load_acquire() operation ensures that the proper tail pointer is read before any data is written to the buffer. And, importantly, smp_store_release() ensures that any data written to the buffer is actually visible there before the new head pointer is made visible. Without that guarantee, the reader side could possibly see a head pointer indicating that more data is available before that data is actually visible in the buffer.

The code on the read side is the mirror image:

    head = smp_load_acquire(ring_buffer->head);
    read_events(ring_buffer->tail);  /* If 'head' indicates available events */
    smp_store_release(ring_buffer->tail, new_tail);

Here, the code ensures that the head pointer has been read before any data is read from the buffer; this smp_load_acquire() thus pairs with the smp_store_release() in the kernel-side code, and together they guarantee that the data user space sees matches what the kernel wrote. The smp_store_release() call here pairs with the smp_load_acquire() call in the kernel-side code; it ensures that the tail pointer does not visibly change until user space has finished reading the data from the buffer. Without that guarantee, the kernel could possibly overwrite that data before it was actually read.
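The whole pattern can be sketched as a single-producer/single-consumer ring buffer in userspace C11, substituting acquire/release atomics for the kernel primitives (sizes and names here are illustrative, not the actual perf layout):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RB_SIZE 16   /* power of two, so indexes wrap with a mask */

struct ring {
    int buf[RB_SIZE];
    atomic_size_t head;   /* written only by the producer */
    atomic_size_t tail;   /* written only by the consumer */
};

/* Producer side; returns 0 if the buffer is full. */
static int ring_put(struct ring *r, int event)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RB_SIZE)
        return 0;                               /* no space */
    r->buf[head & (RB_SIZE - 1)] = event;
    /* Release: the data store above must be visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side; returns 0 if the buffer is empty. */
static int ring_get(struct ring *r, int *event)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;                               /* nothing to read */
    *event = r->buf[tail & (RB_SIZE - 1)];
    /* Release: the slot must be fully read before the producer may
     * reuse it. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

Each side acquire-loads the index the other side writes, and release-stores its own, mirroring the pseudocode above.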

The ring buffer code worked properly before the introduction of these new operations, but it had to use full barriers, making it slower than it needed to be. The new operations allow this code to be optimized while also better describing the exact operations that are being protected by barriers. As it happens, a lot of kernel code may be able to work with the slightly weaker guarantees offered by the new barrier operations; the patch changelog says "It appears that roughly half of the explicit barriers in core kernel code might be so replaced."

The cost, of course, is that the kernel's complicated set of memory barrier operations has become even more complex. Once upon a time that might not have mattered much, since most use of memory barriers was deeply hidden within other synchronization primitives (spinlocks and mutexes, for example). With scalability pressures pushing lockless techniques into more places in the kernel, though, the need to be explicitly aware of memory barriers is growing. There may come a point where understanding memory-barriers.txt will be mandatory for working in much of the kernel.


Memory barriers for TSO architectures

Posted Dec 12, 2013 2:59 UTC (Thu) by jcm (subscriber, #18262) [Link] (1 responses)

Obligatory link to the seminal work on the topic: http://infolab.stanford.edu/pub/cstr/reports/csl/tr/95/68...

Memory barriers for TSO architectures

Posted Dec 12, 2013 21:02 UTC (Thu) by tvld (guest, #59052) [Link]

The work by Peter Sewell's group is also a very good resource for further information: http://www.cl.cam.ac.uk/~pes20/weakmemory/index.html
For example, the formalization of the C++ memory model by Batty et al. explains how acquire/release work in combination with other properties of a program.

Memory barriers for TSO architectures

Posted Dec 12, 2013 4:06 UTC (Thu) by deater (subscriber, #11746) [Link] (14 responses)

It's all well and good to say that the user code "just" needs to run head = smp_load_acquire() but have fun trying to find a definition for that function that you can include in your code (that works cross-platform).

For programs trying to access the perf_event mmap interface this often involves having to cut and paste big chunks of kernel code, and that's not really workable if your project isn't GPL2 licensed. (For example the PAPI performance library is BSD licensed. When I raised this issue on the lkml the answer I got was more or less "not our problem").

Memory barriers for TSO architectures

Posted Dec 12, 2013 18:27 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (13 responses)

Help is on the way.

In C/C++11, smp_load_acquire(x) would be "atomic_load_explicit(x, memory_order_acquire)" and smp_store_release(x, v) would be "atomic_store_explicit(x, v, memory_order_release)".
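[The mapping Paul describes could be wrapped as macros in a userspace consumer; this is an illustrative sketch, not an official kernel header — Ed.]

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-ins for the kernel primitives,
 * following the C/C++11 mapping above. */
#define my_load_acquire(p) \
    atomic_load_explicit((p), memory_order_acquire)
#define my_store_release(p, v) \
    atomic_store_explicit((p), (v), memory_order_release)
```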

Memory barriers for TSO architectures

Posted Dec 12, 2013 20:44 UTC (Thu) by luto (guest, #39314) [Link] (1 responses)

Does that mean that smp_load_acquire will actually match the C/C++11 semantics?

Memory barriers for TSO architectures

Posted Dec 12, 2013 21:23 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

It will come close enough for almost all uses.

The only significant difference that I am aware of is that smp_load_acquire() and smp_store_release() have volatile semantics, while their C/C++11 counterparts might or might not have volatile semantics, depending on usage. There will of course also be type-system differences, but these should not matter for synchronization.

Memory barriers for TSO architectures

Posted Dec 13, 2013 1:16 UTC (Fri) by deater (subscriber, #11746) [Link] (8 responses)

> Help is on the way.

Theoretically, sometime in the eventual future. How long will it take for a compliant C/C++11 compiler to appear in say RHEL6? Some of us develop software that has to be used on non-bleeding-edge systems.

The perf_event interface is bad enough trying to understand as it is, let alone having part of the interface depending on obscure memory barriers that a normal user can't even access without coding up rare inline assembly instructions. It once again shows what happens when a tool like perf is included with the kernel; it's next to impossible for outside users to keep on top of the interface.

It would be much nicer if the kernel could handle this for the user. Putting something as complicated as this at the hands of the user is just asking for multiple poorly written versions of the access code cut-and-pasted in scattered projects.

I'd be more worried if the ordering issues had ever been seen in real life. Has there been any bug reported where perf was failing due to this? In theory the slightly wrong barriers have been in use since 2.6.32 without anyone really noticing.

Memory barriers for TSO architectures

Posted Dec 13, 2013 5:20 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

It sounds like you are arguing that the required memory barriers should be buried in header files supplied by the kernel. On this, I must defer to the perf developers and maintainers.

Memory barriers for TSO architectures

Posted Dec 13, 2013 15:14 UTC (Fri) by deater (subscriber, #11746) [Link] (5 responses)

> It sounds like you are arguing that the required memory barriers should
> be buried in header files supplied by the kernel. On this, I must defer to
> the perf developers and maintainers.


No, I'm saying the interface is complicated to get right and implement.

I challenge someone, based solely on this article, to write code that properly implements the userspace part of this interface using gcc-4.7 in a way that will work on x86, Power, and ARM machines, without requiring external libraries (and ideally working with kernel headers going back as far as say linux-3.2).

I am not aware of other system call interfaces requiring a PhD in Computer Architecture plus some guesswork to actually implement, but maybe I've been just lucky with which system calls I usually have to deal with.

Memory barriers for TSO architectures

Posted Dec 13, 2013 16:33 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

I could be wrong, but I don't believe that the purpose of this article was to serve as the sole reference for writing the user-space code required to interact with perf.

Memory barriers for TSO architectures

Posted Dec 13, 2013 16:42 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

> > It sounds like you are arguing that the required memory barriers should
> > be buried in header files supplied by the kernel. On this, I must defer to
> > the perf developers and maintainers.

> No, I'm saying the interface is complicated to get right and implement.
> [...]
> I am not aware of other system call interfaces requiring a PhD in Computer Architecture plus some guesswork to actually implement, but maybe I've been just lucky with which system calls I usually have to deal with.

I don't think this is a meaningful comparison. Other syscalls have different requirements, and it's not like you need to use this syscall when developing a normal application. Pretty fundamentally it isn't interesting without a good idea of how the performance counters or tracepoints work.

> I challenge someone, based solely on this article, to write code that properly implements the userspace part of this interface using gcc-4.7 in a way that will work on x86, Power, and ARM machines, without requiring external libraries (and ideally working with kernel headers going back as far as say linux-3.2).

Why on earth should a random article on the internet, even if from a reputable source, allow you to do that? Also, afaics you can just continue to use full barriers on the consuming side, just as before. It will be a bit slower, but that's it. Doing so should just be a __sync_synchronize() in gcc.

That said, the interface really isn't trivial, so including a short LGPL/BSD licenced example consumer implementation wouldn't hurt.

Memory barriers for TSO architectures

Posted Dec 13, 2013 18:12 UTC (Fri) by tvld (guest, #59052) [Link] (2 responses)

> I challenge someone, based solely on this article, to write code that properly implements the userspace part of this interface using gcc-4.7 in a way that will work on x86, Power, and ARM machines, without requiring external libraries (and ideally working with kernel headers going back as far as say linux-3.2)

I suppose you're concerned about the atomics that have to be used. If so, this should be helpful:
http://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fato...
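[With those GCC 4.7+ __atomic builtins, an acquire/release pair needs no C11 library support and no external dependency; an illustrative sketch — Ed.]

```c
static int payload;
static int flag;

/* Publish a value: write the payload, then release-store the flag. */
static void publish(int v)
{
    payload = v;
    __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
}

/* Consume it: acquire-load the flag, then read the payload.  A reader
 * that sees flag == 1 is guaranteed to see the payload. */
static int consume(void)
{
    while (!__atomic_load_n(&flag, __ATOMIC_ACQUIRE))
        ;   /* spin until published */
    return payload;
}
```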

Memory barriers for TSO architectures

Posted Dec 14, 2013 20:41 UTC (Sat) by deater (subscriber, #11746) [Link]

what I'm really trying to do is have a concise bit of code that I can plop into the "accessing the MMAP ring-buffer" part of the perf_event_open.2 manpage that users can cut-and-paste and have confidence that it works.

Ideally this chunk of code would work for any architecture, across various compilers (gcc/llvm/intel), not require an outside library, and would be license agnostic.

I suppose I could just put a notice saying "to ensure correctness you need a smp_wb() barrier here" but that's not necessarily very helpful to someone who just wants to read a ring buffer without having to learn low-level details about their architecture's memory ordering implementation.

Memory barriers for TSO architectures

Posted Dec 15, 2013 19:25 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Or, if you would like barriers whose semantics match those of the Linux kernel: http://urcu.so

Memory barriers for TSO architectures

Posted Dec 13, 2013 9:47 UTC (Fri) by tvld (guest, #59052) [Link]

> > Help is on the way.
>
> Theoretically, sometime in the eventual future. How long will it take for a compliant C/C++11 compiler to appear in say RHEL6? Some of us develop software that has to be used on non-bleeding-edge systems.

New atomics builtins that follow the C/C++11 model are offered since GCC 4.7, including the respective C++11 support. C11 atomics should appear in GCC 4.9, I believe.

On RHEL, you can get newer compilers for RHEL6, for example (look for the Developer Tools offering). I don't know whether something similar is available for other distributions; if not, try libatomic-ops (http://www.hpl.hp.com/research/linux/atomic_ops/), which has been part of Debian for a while already and also offers acquire/release atomics.

Note that on some architectures, there might be several ways in which particular memory orders / barriers could be implemented by the compiler/kernel. They should hopefully all have chosen the same implementation approach to make them compatible, but I don't know and haven't checked myself. If you're on a rather strongly ordered architecture like x86, for example, you probably don't need to worry at all.

Memory barriers for TSO architectures

Posted Dec 13, 2013 11:23 UTC (Fri) by tvld (guest, #59052) [Link] (1 responses)

Do you know whether the kernel's and GCC's way to implement barriers / memory orders is compatible (e.g., regarding the seq-cst choice on ARM outlined by the Cambridge group: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)? If the atomics "cross" the kernel interfaces, they should be part of the ABI. I know that this has been discussed for the compilers and compatibility across them at least; I don't know of any "official" ABI document for this though.

Memory barriers for TSO architectures

Posted Dec 13, 2013 16:48 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

In Peter Zijlstra's latest patch, smp_load_acquire() and smp_store_release() use the "dmb" instruction, which is compatible with the Cambridge group's recipe. Current smp_mb() also uses "dmb", so is compatible with this recipe's sequentially consistent fence. The rcu_dereference() primitive is compatible with this recipe's memory_order_consume load. The rcu_assign_pointer() primitive is currently a bit weaker than this recipe's store release, but I expect to be upgrading rcu_assign_pointer() in the next Linux-kernel release or two.

The other Linux-kernel memory barriers are not directly comparable to C/C++11.

So we should be good, at least until someone changes something. :-)

Specifying the memory-ordering properties of those user-kernel interfaces that are sensitive to memory ordering might not be a bad idea. I have a funny feeling that continued pushes for performance in the absence of significant increases in single-threaded CPU performance will make this sort of thing more common...

Memory barriers for TSO architectures

Posted Dec 19, 2013 10:20 UTC (Thu) by dakas (guest, #88146) [Link] (1 responses)

I hope the implementation is better than the illustration. If two threads try inserting an element of their own in the same place while being nearly in sync (meaning that one thread is never ahead more than one step of the other), the end result will have only one of the new elements being added to the linked list.

Memory barriers for TSO architectures

Posted Dec 19, 2013 15:03 UTC (Thu) by corbet (editor, #1) [Link]

Obviously, locking is needed between writers, yes.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds