Some ARM SoCs (e.g. Sitara) also provide a high-speed hardware "spinlock": a kind of memory where you acquire the lock by a *read* of its location (i.e. the first read by any processor returns 0, any later read returns 1, and the holder writes 0 to release it).
There are only a limited number of them (probably about as many as the number of outstanding "load exclusive" operations), but they have their uses: sometimes using high-speed logic (see also the Sitara mailboxes) is better than trying to implement complex timing requirements in software.
Obviously, such "high-speed logic" should be at least as fast as the level 1 cache, from any processor in the system.
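
To make the read-to-acquire behaviour concrete, here is a minimal software model in C. It is only a sketch: the names (`hwspin_reg`, `hwspin_read`, `hwspin_write`, and so on) are made up for illustration, and the register is imitated with a C11 atomic exchange. On real hardware the register would be a memory-mapped location at an address taken from the SoC manual, and a plain volatile load would have the side effect of taking the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one hardware spinlock register:
 * 0 = free, 1 = taken. Not a real driver or vendor API. */
typedef struct {
    atomic_uint_fast32_t taken;
} hwspin_reg;

/* Models a bus READ of the register: returns the previous state
 * and leaves the lock taken, as the hardware would do atomically. */
static inline uint32_t hwspin_read(hwspin_reg *r)
{
    return (uint32_t)atomic_exchange(&r->taken, 1);
}

/* Models a bus WRITE of the register; writing 0 frees the lock. */
static inline void hwspin_write(hwspin_reg *r, uint32_t value)
{
    atomic_store(&r->taken, value);
}

/* A read that returns 0 means "it was free, and now it is ours". */
static inline bool hwspin_trylock(hwspin_reg *r)
{
    return hwspin_read(r) == 0;
}

/* Busy-wait until a read returns 0. */
static inline void hwspin_lock(hwspin_reg *r)
{
    while (!hwspin_trylock(r)) {
        /* spin */
    }
}

/* Release by writing 0. */
static inline void hwspin_unlock(hwspin_reg *r)
{
    hwspin_write(r, 0);
}

/* Walks through the sequence described in the text. */
static inline void hwspin_selfcheck(void)
{
    hwspin_reg r = { 0 };
    assert(hwspin_read(&r) == 0);  /* first read: lock acquired  */
    assert(hwspin_read(&r) == 1);  /* second read: already taken */
    hwspin_write(&r, 0);           /* holder releases            */
    assert(hwspin_trylock(&r));    /* free again                 */
    hwspin_unlock(&r);
}
```

The point of the sketch is that acquiring the lock is a single read and releasing it is a single write, so no load-exclusive/store-exclusive retry loop is needed on the software side.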