Load/Store tearing in this day and age?

Posted Jul 16, 2019 22:11 UTC (Tue) by pr1268 (guest, #24648)
Parent article: Who's afraid of a big bad optimizing compiler?

I don't understand how or why, in this day and age, load and store tearing still occur, at least when loading or storing an n-bit value on an n-bit architecture. Isn't that what the "beauty" of larger-bit architectures is for?

> For example, the compiler could, in theory, compile the load from global_ptr on line 1 of the following code as a series of one-byte loads.

70 years of electronic computers, 64-bit architectures are the norm now, and we're still putzing around one byte at a time? Ugh.

IMO loads and stores should be atomic at the assembly level (if not at the metal) for an n-bit value on an n-bit architecture. Just my $0x02.

Thank you to all the contributors for this article. Very enlightening (if not depressing).

Load/Store tearing in this day and age?

Posted Jul 16, 2019 22:58 UTC (Tue) by excors (subscriber, #95769)

In this day and age, we still use chips and CPU architectures that were designed in a previous day and age, because they're widely available and well supported and (most importantly) cheap. They might not support misaligned loads (because that takes a lot of hardware complexity, which was an issue in olden times), and if the compiler isn't certain that a load will be aligned (e.g. it's reading a field from an __attribute__((packed)) struct, and those are pretty common in Linux), it'll have to read individual bytes.
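
To make the packed-struct case concrete, here is a minimal sketch in C. The struct name wire_hdr, its fields, and both helper functions are made up for illustration, and __attribute__((packed)) is a GCC/Clang extension. Because len sits at offset 1, the compiler can only assume byte alignment for it, so on a target without misaligned loads it may well build the value out of several narrower loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire-format header; names are illustrative only. */
struct __attribute__((packed)) wire_hdr {
	uint8_t  tag;
	uint32_t len;	/* offset 1, so not naturally aligned */
};

/* Check the layout assumption this example relies on. */
static_assert(offsetof(struct wire_hdr, len) == 1, "len should be misaligned");

/*
 * Read the field through the packed type. The compiler knows the
 * alignment is only 1 and picks whatever load sequence the target allows,
 * possibly byte by byte.
 */
static uint32_t read_len(const struct wire_hdr *h)
{
	return h->len;
}

/*
 * Portable alternative for a raw byte buffer: memcpy() into an aligned
 * local. This is also free to be compiled as several loads.
 */
static uint32_t read_len_raw(const unsigned char *buf)
{
	uint32_t v;

	memcpy(&v, buf + offsetof(struct wire_hdr, len), sizeof(v));
	return v;
}
```

Both helpers return the right value when nobody else is writing, but neither one promises a single load instruction, so a concurrent writer could be observed half-way through.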

Load/Store tearing in this day and age?

Posted Jul 17, 2019 6:58 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624)

All good points! Plus modern CPUs often have store-immediate instructions with restricted-size immediate values, which can tempt the compiler to emit a series of store-immediate instructions for a larger store. As noted in the article, this is not a theoretical scenario.
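
As a rough illustration of the store-immediate case, the sketch below contrasts a plain store of a constant with a store through a volatile-qualified pointer. The helper names are hypothetical and only loosely modeled on the kernel's READ_ONCE()/WRITE_ONCE(). The C standard does not promise that a volatile access is a single instruction, so treat this as a statement about what compilers are expected to do for aligned, machine-word-sized accesses, not as a language guarantee.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Plain store of a constant. On a target whose store-immediate
 * instructions only take 16-bit immediates, the compiler is free to emit
 * two 16-bit stores here, which a concurrent reader could see half-done.
 */
static void store_plain_u32(uint32_t *p)
{
	*p = 0x00010002u;
}

/*
 * Hypothetical helper: the volatile access asks the compiler for one
 * access of the full width. This only helps if the pointer is aligned,
 * hence the assertion.
 */
static void store_once_u32(uint32_t *p, uint32_t v)
{
	assert((uintptr_t)p % _Alignof(uint32_t) == 0);
	*(volatile uint32_t *)p = v;
}

/* Hypothetical matching load helper, with the same alignment caveat. */
static uint32_t load_once_u32(const uint32_t *p)
{
	assert((uintptr_t)p % _Alignof(uint32_t) == 0);
	return *(const volatile uint32_t *)p;
}
```

Whether the plain version actually gets split depends on the compiler, the target, and the constant, so comparing the generated assembly for both functions is the only way to know for a given build.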

Load/Store tearing in this day and age?

Posted Jul 18, 2019 4:21 UTC (Thu) by ncm (guest, #165)

It is very, very common on x86 for a value to straddle cache lines, which the hardware necessarily splits. An atomic that straddles cache lines presents the memory hardware a messy and typically slow job -- if you cared, you would have aligned! -- possibly even involving a kernel trap on simpler hardware.
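
For anyone who wants to check their own data layout, here is a small sketch of the straddling test. It assumes 64-byte cache lines, which is common on current x86 parts but not universal, and the function name is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Adjust for the hardware at hand. */
#define CACHE_LINE_SIZE 64u

/*
 * Return true if an object of `size` bytes starting at `addr` touches
 * two cache lines, i.e. its first and last bytes fall in different lines.
 */
static bool straddles_cache_line(uintptr_t addr, size_t size)
{
	assert(size > 0 && size <= CACHE_LINE_SIZE);
	return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}
```

A naturally aligned object whose size is a power of two no larger than the line size can never straddle, which is one more reason to let the compiler place things rather than packing them by hand.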

On x86, the vendors devote another few million transistors to each such case to avoid what would otherwise be a kernel trap. All the extra transistors strengthen Intel's (and, lately, AMD's and Samsung's) competitive position vs cheaper hardware. That is not free, because it enables a monopoly or oligopoly that may then extract rent or (worse) limit your choices.

In the US, only the former is ever considered actionable harm, despite the law recognizing both. It is artificially difficult to demonstrate the latter, where logically it should instead be assumed.

So, exercising care with alignment gives you more choice in hardware that can run your code fast enough, and safely. That might also save money, as well as power and heat, because those millions of transistors burn power.

Load/Store tearing in this day and age?

Posted Jul 17, 2019 9:32 UTC (Wed) by comex (subscriber, #71521)

> IMO loads and stores should be atomic at the assembly level (if not at the metal) for an n-bit value on an n-bit architecture. Just my $0x02.

They typically *are* guaranteed to be atomic at the assembly level, but only if the pointer is properly aligned.

You can see this more easily if you use C11 atomics, which are more explicit about what guarantees they provide. For example, this source code:

	#include <stdatomic.h>
	int load(_Atomic int *ptr) {
		return atomic_load_explicit(ptr, memory_order_relaxed);
	}

compiles to this assembly (GCC targeting x86-64):

	load:
			mov     eax, DWORD PTR [rdi]
			ret

All C11 atomic loads/stores are guaranteed to be, well, atomic, but GCC has decided to emit a plain mov instruction. In fact this is valid, because x86 has an architectural guarantee that regular movs (64-bit and smaller) to and from aligned pointers are atomic. (And GCC is allowed to assume that `ptr` is aligned, because using it is undefined behavior if not.) Most common architectures work the same.

On the other hand, many uses of atomics want a stronger memory ordering, e.g. memory_order_acquire. On x86, this *still* just generates a plain mov instruction, because x86 has very strong ordering guarantees for all accesses. But on other architectures it tends to require additional synchronization or a different instruction.
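
For what it's worth, the usual reason to want acquire ordering is message passing. The sketch below uses C11 atomics; the names payload, ready, producer, and consumer are hypothetical, and it is shown single-threaded only to keep it short. In real use the two functions would run on different threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;		/* plain data, published via `ready` */
static atomic_bool ready;	/* zero-initialized, i.e. false */

/* Write the data first, then publish it with a release store. */
static void producer(int value)
{
	payload = value;
	atomic_store_explicit(&ready, true, memory_order_release);
}

/*
 * The acquire load pairs with the release store above: if it sees
 * `ready` set, the earlier write to `payload` is visible as well.
 */
static bool consumer(int *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return false;
	*out = payload;
	return true;
}
```

On x86 both the release store and the acquire load should still come out as plain movs. On more weakly ordered architectures, expect a different instruction or a barrier for each.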