"Again, not being familiar with C11, what does the memory model give that can help multithreaded programming, then? How does §188.8.131.52/4 work in the case where the machine access granularity is larger than some primitive types and a RMW must happen?"
Setting aside volatile, which is irrelevant here, the memory model requires that the faulty behavior observed in the kernel not occur. On 64-bit architectures a compiler can meet that requirement through various pessimizations. It can use slower 32-bit accesses where the hardware provides them. Where it does not, the compiler may have to pad so that each 32-bit scalar occupies 64 bits of space, wherever that is necessary to conform to the standard.
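To make the padding option concrete, here is a rough sketch in C11 of what such a layout could look like if written by hand. The struct names, the helper share_word64(), and the use of _Alignas(8) are my own illustration, not anything gcc is known to emit, and the check on the unpadded struct assumes the struct itself starts on an 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ordinary layout: two 32-bit members packed into 8 bytes. If the struct
 * starts on an 8-byte boundary, both live in the same 64-bit word, so a
 * 64-bit read-modify-write of one member also rewrites the other. */
struct plain_pair {
    int32_t a;
    int32_t b;
};

/* Hypothetical padded layout: each member is forced onto its own 8-byte
 * boundary, so each 32-bit scalar occupies 64 bits of space. */
struct padded_pair {
    _Alignas(8) int32_t a;
    _Alignas(8) int32_t b;
};

/* Returns 1 if two byte offsets fall in the same aligned 64-bit word,
 * measured from an 8-byte-aligned base address (an assumption). */
static int share_word64(size_t off_a, size_t off_b)
{
    return (off_a / 8) == (off_b / 8);
}

int plain_members_share_word(void)
{
    return share_word64(offsetof(struct plain_pair, a),
                        offsetof(struct plain_pair, b));
}

int padded_members_share_word(void)
{
    return share_word64(offsetof(struct padded_pair, a),
                        offsetof(struct padded_pair, b));
}

/* Sanity checks on the layouts described above. */
void check_layouts(void)
{
    assert(sizeof(struct plain_pair) == 8);
    assert(sizeof(struct padded_pair) == 16);
    assert(plain_members_share_word() == 1);
    assert(padded_members_share_word() == 0);
}
```

Calling check_layouts() from a main() should pass silently on a C11 compiler. The cost is visible in the sizes: the padded struct is twice as large, which is the kind of pessimization meant above.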
None of this is new. gcc's pthreads implementations have been doing this for years. It might be instructive to look at the assembly output of the same test cases compiled with the -pthread switch.