I've just done a quick test, and it looks like gcc emits a function call when you use the __sync_* builtins on ARM. So presumably it would be possible to write a small library of functions with the right names that call the kernel helper code. Near-optimal portable code would then be possible.
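
Here is a rough sketch of what such a library could look like, in case it helps. It assumes the kernel helper in question is the ARM Linux __kuser_cmpxchg entry at 0xffff0fc0, which returns zero when the exchange happened. The sketch_ names are made up for illustration; a real library would presumably use whatever symbol names gcc actually emits (something like __sync_fetch_and_add_4), which I have left out so this doesn't clash with anything libgcc may already provide. The non-ARM branch is only a non-atomic stand-in so the code compiles elsewhere.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__linux__)
/* Assumed: ARM Linux kernel user helper at a fixed address.
   Returns 0 if *ptr was changed, non-zero if no exchange happened. */
typedef int (kuser_cmpxchg_fn)(int32_t oldval, int32_t newval,
                               volatile int32_t *ptr);
#define cas_helper (*(kuser_cmpxchg_fn *)0xffff0fc0)
#else
/* Stand-in for other hosts so the sketch builds. NOT atomic. */
static int cas_helper(int32_t oldval, int32_t newval, volatile int32_t *ptr)
{
    if (*ptr != oldval)
        return 1;           /* no exchange */
    *ptr = newval;
    return 0;               /* exchanged */
}
#endif

/* Hypothetical name; returns the value *ptr held before the add. */
int32_t sketch_sync_fetch_and_add_4(volatile int32_t *ptr, int32_t val)
{
    int32_t old;
    do {
        old = *ptr;
        /* Add as unsigned to avoid signed-overflow undefined behaviour. */
    } while (cas_helper(old, (int32_t)((uint32_t)old + (uint32_t)val), ptr) != 0);
    return old;
}

/* Hypothetical name; returns 1 if the swap happened, 0 otherwise. */
int sketch_sync_bool_compare_and_swap_4(volatile int32_t *ptr,
                                        int32_t oldval, int32_t newval)
{
    return cas_helper(oldval, newval, ptr) == 0;
}
```

The other __sync_* operations (sub, and, or, xor and so on) could be built the same way as the fetch-and-add loop: read the current value, compute the new one, and retry until the helper reports success.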