Masters: ARM atomic operations

[Posted November 13, 2012 by corbet]

Jon Masters has put together a summary of how atomic operations work on the ARM architecture for those who are not afraid of the grungy details. "To provide for atomic access to a given memory location, ARM processors implement a reservation engine model. A given memory location is first loaded using a special 'load exclusive' instruction that has the side-effect of setting up a reservation against that given address in the CPU-local reservation engine. When the modified value is later written back into memory, using the corresponding 'store exclusive' processor instruction, the reservation engine verifies that it has an outstanding reservation against that given address, and furthermore confirms that no external agents have interfered with the memory commit. A register returns success or failure."
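The load-exclusive/store-exclusive retry pattern described in the quote can be sketched in portable C11 rather than ARM assembly. This is a minimal illustration, not code from the article: the function name `atomic_add_return` is hypothetical, and the comparison with the reservation engine is an analogy. On ARMv7, compilers typically lower a weak compare-exchange loop like this one to a load-exclusive/store-exclusive pair, and the "weak" variant is allowed to fail spuriously in much the same way a store-exclusive can fail when the reservation is lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: atomically add `delta` to `*counter` and return
 * the new value, retrying until the update commits. */
int atomic_add_return(atomic_int *counter, int delta)
{
    assert(counter != NULL);

    /* Analogous to "load exclusive": read the current value. */
    int old = atomic_load_explicit(counter, memory_order_relaxed);

    /* Analogous to "store exclusive": try to commit old + delta.
     * On failure (including spurious failure), `old` is refreshed
     * with the value now in memory and the loop tries again. */
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + delta,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed `old` */
    }
    return old + delta;
}
```

The relaxed ordering is deliberate here: the sketch only shows atomicity of the read-modify-write, not ordering against other memory accesses, which is what barriers are for.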

Masters: ARM atomic operations

Posted Nov 13, 2012 18:25 UTC (Tue) by jcm (subscriber, #18262)

Note: I updated the entry following some useful feedback from Nicolas. I had a few problems with the last example, several of which were caused by trying to be too much like the rest of the OpenMPI bits. Should be clearer now.

Masters: ARM atomic operations

Posted Nov 15, 2012 23:25 UTC (Thu) by BenHutchings (subscriber, #37955)

You should delete this bit: 'may be sitting in a local cache within a CPU or CPU cluster (and thus visible only to a subset of processors), or may be sitting in an L2 cache that has yet to hit physical memory'. Memory barriers generally have no effect on Dcache; half the point of cache-coherency protocols is to avoid the need for cache flushes or explicit cache synchronisation.

Masters: ARM atomic operations

Posted Nov 23, 2012 1:19 UTC (Fri) by jzbiciak (guest, #5246)

Right, but an outside master that has to snoop into the ARM's memory hierarchy could see writes commit in a different order than the CPU sent them, depending on whether the snoops land in L2, L1D or the write merge buffer. A DMB effectively draws a line for snoops, too.

FWIW, another source of fun, at least on processors like A15, is the fact that snoops have to deal with the run-ahead OoO pipeline. There may be loads and stores in flight that are on a mispredicted path, and need to be unwound. That can be yet another source of memory reordering wackiness in the memory system.

What I've seen is that a processor like A15 will respond faster to a snoop that hits L2 and misses L1D than a snoop that hits L1D also, because it doesn't need to sync with the OoO pipeline. Depending on your access pattern, you could have later writes that got flushed out to L2 ahead of earlier rights. For example, A15 will stop write-allocating in L1D if you stream too many write misses. So, you could easily have some older writes in L1D, some middle-aged writes write-allocating in L2 and the youngest writes in the L1D write-merge buffer. (More info here. Look at bits 25 thru 28, which control the L1 and L2 write streaming no-allocate threshold.)

A DMB after the write stream should ensure that snoops that come in see all these writes, if the snoops could also see a write that followed the DMB, regardless of which of these three places the write stream landed.
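To make the "draws a line" guarantee concrete, here is a hedged sketch of the usual publish/consume pattern in C11. It is an illustration only: the names `payload`, `ready`, `publish` and `try_consume` are hypothetical, and it models CPU-to-CPU visibility, not an external bus master, whose view also depends on the system's shareability configuration. On ARMv7, compilers typically implement these fences with DMB instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

int payload[3];      /* plain data, written before the flag */
atomic_bool ready;   /* zero-initialized, i.e. false */

/* Writer: store the data, then the barrier, then the flag. */
void publish(int a, int b, int c)
{
    payload[0] = a;
    payload[1] = b;
    payload[2] = c;
    /* Everything written above must be visible to any observer
     * that can see the flag store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: returns false if nothing has been published yet. */
bool try_consume(int out[3])
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Pairs with the writer's fence: the payload reads below
     * cannot be satisfied by values older than the flag. */
    atomic_thread_fence(memory_order_acquire);
    for (size_t i = 0; i < 3; i++)
        out[i] = payload[i];
    return true;
}
```

Without the two fences, a reader that sees `ready` set could still observe a stale `payload`, which is the software-visible form of the reordering described above.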

Masters: ARM atomic operations

Posted Nov 23, 2012 1:22 UTC (Fri) by jzbiciak (guest, #5246)

Depending on your access pattern, you could have later writes that got flushed out to L2 ahead of earlier rights.

Earlier writes, even.

Masters: ARM atomic operations

Posted Nov 14, 2012 10:59 UTC (Wed) by etienne (guest, #25256)

Some ARMs (e.g. Sitara) also provide high-speed hardware "spinlock" logic, which is a kind of memory where you acquire the spinlock by a *read* of its location (i.e. the first read by any processor returns 0, the second read returns 1, and the holder writes 0 to release).
They are limited in number (probably like the number of outstanding "load exclusive" reservations), but they may have their uses... sometimes using high-speed logic (see also the Sitara mailboxes) is better than trying to implement complex timing requirements in software.
Obviously "high speed logic" should be at least as fast as the level 1 cache, from any processor in the system.
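The read-to-acquire behaviour can be modelled in software to show the protocol. This is a sketch under loud assumptions: it does not touch real hardware, the type and function names (`model_spinlock`, `model_lock_read`, and so on) are hypothetical, and a C11 atomic exchange stands in for the side effect that the hardware performs on a read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Software stand-in for one hardware lock register (hypothetical). */
typedef struct {
    atomic_uint taken;   /* 0 = free, 1 = held */
} model_spinlock;

/* Models a *read* of the register: returns the previous state and
 * leaves the lock marked as taken, so only the first reader sees 0. */
unsigned model_lock_read(model_spinlock *l)
{
    assert(l != NULL);
    return atomic_exchange_explicit(&l->taken, 1u, memory_order_acquire);
}

/* Models the holder writing 0 to release the lock. */
void model_lock_write_zero(model_spinlock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->taken, 0u, memory_order_release);
}

/* Spin until a read returns 0, meaning this caller now owns the lock. */
void model_lock_acquire(model_spinlock *l)
{
    while (model_lock_read(l) != 0u) {
        /* busy-wait: another processor holds the lock */
    }
}
```

On the real part, `model_lock_read` would presumably be a plain load from a memory-mapped register, with no exclusive-access instructions involved at all.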