
Support for the TSO memory model on Arm CPUs

By Jonathan Corbet
April 26, 2024
At the CPU level, a memory model describes, among other things, the amount of freedom the processor has to reorder memory operations. If low-level code does not take the memory model into account, unpleasant surprises are likely to follow. Naturally, different CPUs offer different memory models, complicating the portability of certain types of concurrent software. To make life easier, some Arm CPUs offer the ability to emulate the x86 memory model, but efforts to make that feature available in the kernel are running into opposition.

CPU designers will do everything they can to improve performance. With regard to memory accesses, "everything" can include caching operations, executing them out of order, combining multiple operations into one, and more. These optimizations do not affect a single CPU running in isolation, but they can cause memory operations to be visible to other CPUs in a surprising order. Unwary software running elsewhere in the system may see memory operations in an order different from what might be expected from reading the code; this article describes one simple scenario for how things can go wrong, and this series on lockless algorithms shows in detail some of the techniques that can be used to avoid problems related to memory ordering.

The x86 architecture implements a model that is known as "total store ordering" (TSO), which guarantees that writes (stores) will be seen by all CPUs in the order they were executed. Reads, too, will not be reordered, but the ordering of reads and writes relative to each other is not guaranteed. Code written for a TSO architecture can, in many cases, omit the use of expensive barrier instructions that would otherwise be needed to force a specific ordering of operations.

The Arm memory model, instead, is weaker, giving the CPU more freedom to move operations around. The benefits from this design are a simpler implementation and the possibility for better performance in situations where ordering guarantees are not needed (which is most of the time). The downsides are that concurrent code can require a bit more care to write correctly, and code written for a stricter memory model (such as TSO) will have (possibly subtle) bugs when run on an Arm CPU.
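The classic "message passing" pattern shows where the two models diverge. What follows is a minimal sketch in C11, not code from the patch series; the names `payload`, `ready`, and `run_message_passing()` are invented for illustration. A writer fills in some data and then sets a flag; a reader waits for the flag and then reads the data. On TSO hardware the two stores become visible in program order, while a weakly ordered Arm CPU may let the flag become visible first unless the code asks for release/acquire ordering, as it does here. (In C the atomics are needed on any architecture, since the compiler is also free to reorder plain accesses.)

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* ordinary data, written before the flag */
static atomic_int ready;   /* flag telling the reader the data is valid */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: the payload store may not be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    int *seen = arg;

    /* Acquire: later loads may not be reordered before this load. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until the writer publishes the flag */
    *seen = payload;
    return NULL;
}

/* Runs one writer and one reader; returns the value the reader observed. */
int run_message_passing(void)
{
    pthread_t w, r;
    int seen = 0;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

With the release/acquire pair in place, the reader can only observe the fully written payload. Weakening both operations to `memory_order_relaxed` would leave the program without that guarantee on a weakly ordered CPU, which is the kind of subtle bug that TSO hardware tends to hide.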

The weaker Arm model is rarely a problem, but it seems there is one situation where problems arise: emulating an x86 processor. If an x86 emulator does not also emulate the TSO memory model, then concurrent code will likely fail, but emulating TSO, which requires inserting memory barriers, creates a significant performance penalty. It seems that there is one type of concurrent x86 code — games — that some users of Arm CPUs would like to be able to run; those users, strangely, dislike the prospect of facing the orc hordes in the absence of either performance or correctness.
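To make the cost concrete, here is a rough sketch of how an emulator might treat guest memory accesses. It is an illustration under stated assumptions rather than the approach of any particular emulator: `emu_cpu`, `guest_load32()`, and `guest_store32()` are hypothetical names, and C11 acquire/release ordering is only an approximation of x86 TSO. Without hardware help, every guest load and store is given ordering semantics; with the hardware TSO mode enabled, plain accesses would suffice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulator state: is the hardware TSO mode switched on? */
typedef struct {
    bool hw_tso;
} emu_cpu;

/* Emulate a 32-bit guest load. */
uint32_t guest_load32(const emu_cpu *cpu, _Atomic uint32_t *addr)
{
    assert(cpu != NULL && addr != NULL);
    if (cpu->hw_tso) {
        /* The CPU supplies the ordering; relaxed stands in for a plain load. */
        return atomic_load_explicit(addr, memory_order_relaxed);
    }
    /* Software fallback: keep later accesses from moving before this load. */
    return atomic_load_explicit(addr, memory_order_acquire);
}

/* Emulate a 32-bit guest store. */
void guest_store32(const emu_cpu *cpu, _Atomic uint32_t *addr, uint32_t value)
{
    assert(cpu != NULL && addr != NULL);
    if (cpu->hw_tso) {
        /* The CPU supplies the ordering; relaxed stands in for a plain store. */
        atomic_store_explicit(addr, value, memory_order_relaxed);
        return;
    }
    /* Software fallback: keep earlier accesses from moving after this store. */
    atomic_store_explicit(addr, value, memory_order_release);
}
```

A real emulator emits machine instructions directly instead of going through a C compiler, so the relaxed branch here merely stands in for ordinary load and store instructions. The point of the sketch is that the fallback path attaches an ordering constraint to every single guest access, which is where the performance penalty comes from.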

TSO on Arm

As it happens, some Arm CPU vendors understand this problem and have, as Hector Martin described in this patch series, implemented TSO memory models in their processors. Some NVIDIA and Fujitsu CPUs run with TSO at all times; Apple's CPUs provide it as an optional feature that can be enabled at run time. Martin's purpose is to make this capability visible to, and controllable by, user space.

The series starts by adding a couple of new prctl() operations. PR_GET_MEM_MODEL will return the current memory model implemented by the CPU; that value can be either PR_SET_MEM_MODEL_DEFAULT or PR_SET_MEM_MODEL_TSO. The PR_SET_MEM_MODEL operation will attempt to enable the requested memory model, with the return code indicating whether it was successful; it is allowed to select a stricter memory model than requested. For the always-TSO CPUs, requesting TSO will obviously succeed. For Apple CPUs, requesting TSO will result in the proper CPU bits being set. Asking for TSO on a CPU that does not support it will, as expected, fail.
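From user space, the interface described above might be used along these lines. This is a sketch only: the operations are not part of any released kernel header, so the numeric values defined below are placeholders chosen for illustration, and the argument layout (unused arguments passed as zero) is an assumption. The helper `request_tso()` is likewise a made-up name. On a kernel without the patches, the call simply fails and the function reports that TSO is unavailable.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Placeholder definitions: these names come from the patch series, but the
 * values here are assumptions, not taken from a released kernel header.
 */
#ifndef PR_GET_MEM_MODEL
#define PR_GET_MEM_MODEL          0x6d4d444c
#endif
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL          0x4d4d444c
#endif
#ifndef PR_SET_MEM_MODEL_DEFAULT
#define PR_SET_MEM_MODEL_DEFAULT  0
#endif
#ifndef PR_SET_MEM_MODEL_TSO
#define PR_SET_MEM_MODEL_TSO      1
#endif

/* Returns 1 if the calling thread now runs with TSO, 0 otherwise. */
int request_tso(void)
{
    int model;

    if (prctl(PR_SET_MEM_MODEL, PR_SET_MEM_MODEL_TSO, 0, 0, 0) != 0) {
        /* Expected on CPUs without TSO support and on unpatched kernels. */
        fprintf(stderr, "TSO not available: %s\n", strerror(errno));
        return 0;
    }

    /* The kernel may pick a stricter model than requested, so ask again. */
    model = prctl(PR_GET_MEM_MODEL, 0, 0, 0, 0);
    assert(model != -1);
    return model == PR_SET_MEM_MODEL_TSO;
}
```

An emulator would presumably make a call like this once at startup and fall back to inserting its own barriers when the request fails.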

Martin notes that the code is not new: "This series has been brewing in the downstream Asahi Linux tree for a while now, and ships to thousands of users". Interestingly, Zayd Qumsieh had posted a similar patch set one day earlier, but that version only implemented the feature for Linux running in virtual machines on Apple CPUs.

Unfortunately for people looking forward to faster games on Apple CPUs, neither patch set is popular with the maintainers of the Arm architecture code in the kernel. Will Deacon expressed his "strong objection", saying that this feature would result in a fragmentation of user-space code. Developers, he said, would just enable the TSO bit if it appears to make problems go away, resulting in code that will fail, possibly in subtle ways, on other Arm CPUs. Catalin Marinas, too, indicated that he would block patches making this sort of implementation-defined feature available.

Martin responded that fragmentation is unlikely to be a problem, and pointed to the different page sizes supported by some processors (including Apple's) as an example of how these incompatibilities can be dealt with. He said that, so far, nobody has tried to use the TSO feature for anything that is not an emulator, so abuse in other software seems unlikely. Keeping it out, he said, will not improve the situation:

There's a pragmatic argument here: since we need this, and it absolutely will continue to ship downstream if rejected, it doesn't make much difference for fragmentation risk does it? The vast majority of Linux-on-Mac users are likely to continue running downstream kernels for the foreseeable future anyway to get newer features and hardware support faster than they can be upstreamed. So not allowing this upstream doesn't really change the landscape vis-a-vis being able to abuse this or not, it just makes our life harder by forcing us to carry more patches forever.

Deacon, though, insisted that, once a feature like this is merged, it will find uses in other software "and we'll be stuck supporting it".

If this patch is not acceptable, it is time to think about alternatives. One is to, as Martin described, just keep it out-of-tree and ship it on the distributions that actually run on that hardware. A long history of addition by distributions can, at times, eventually ease a patch's way past reluctant maintainers. Another might be to just enable TSO unconditionally on Apple CPUs, but that comes with an overall performance penalty — about 9%, according to Martin. Another possibility was mentioned by Marc Zyngier, who suggested that virtual machines could be started with TSO enabled, making it available to applications running within while keeping the kernel out of the picture entirely.

This seems like the kind of discussion that does not go away quickly. One of the many ways in which Linux has stood out over the years is in its ability to allow users to make full use of their hardware; refusing to support a useful hardware feature runs counter to that history. The concerns about potential abuse of this feature are also based in long experience, though. This is a case where the development community needs to repeat another part of its long history by finding a solution that makes the needed functionality available in a supportable way.

Index entries for this article
Kernel: Architectures/Arm
Kernel: Memory model



Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 14:58 UTC (Fri) by marcH (subscriber, #57642) [Link] (5 responses)

> Interestingly, Zayd Qumsieh had posted a similar patch set one day earlier, but that version only implemented the feature for Linux running in virtual machines on Apple CPUs.

Does this mean it is on for the virtual machine but off for other processes running at the same time? On different cores? I'm confused about how "dynamic" this can be...

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 15:13 UTC (Fri) by corbet (editor, #1) [Link]

It's plenty dynamic; enabling it for a virtual machine (only) is entirely doable. A similar approach (mentioned in the article) was discussed for the other patch set.

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 23:51 UTC (Fri) by thoughtpolice (subscriber, #87455) [Link] (3 responses)

Yes, support for TSO on Apple Silicon is enabled on a per-process basis, and TSO-enabled threads must run on performance cores. Therefore virtual machine software (which runs just like any other normal process) can just enable it before relinquishing execution to the VMM.

This is a pretty big requirement in general because the Rosetta 2 emulator in macOS needs to be able to run x86 binaries efficiently and TSO is required for that, but you don't want to introduce a large performance penalty on the rest of the system to enable it. It would be pretty bad if your machine just randomly hit a 10% performance cliff the instant you started a single emulated x86 app anywhere (command line or a script, for instance).

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 13:57 UTC (Sun) by farnz (subscriber, #17727) [Link] (2 responses)

> TSO-enabled threads must run on performance cores

Does this mean that you can't switch the efficiency cores into TSO mode? If that's the case, then it's not just a 10% performance cliff, but also an energy consumption hit for workloads that could otherwise run just fine on the efficiency cores alone.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 15:44 UTC (Sun) by zdzichu (subscriber, #17118) [Link] (1 responses)

On the other hand, if a workload is fine running on an efficiency core, then it does not need 10% better performance.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 15:46 UTC (Sun) by farnz (subscriber, #17727) [Link]

But if TSO memory model mode means that you can't run on the efficiency cores, even though you'd be fine running on an efficiency core (despite any performance penalty), then you waste energy if you demand TSO mode when weak memory models are fine.

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 17:09 UTC (Fri) by flussence (guest, #85566) [Link]

<s>
I understand the objection! There's plenty of other examples of bad code that deserves to fail on systems that do things correctly: turning memory overcommit on, for instance, encourages a whole lot of sloppy programming practices that aren't portable to other OSes. Or trying to play sound in the Flash plugin on glibc circa 2008 and expecting it to not come out garbled - who does that? Also nobody on Linux should expect their CPU scheduler to keep their cores 100% loaded, or their disk scheduler to let them do literally anything else with the system while writing to a USB stick; I'm sure they'll revert those patches real soon now and we can go back to pretending we're a Real UNIX™.

We must punish users of stinky software by making their lives miserable through sanctions - that will convince the upstream application developers, three steps removed on another architecture, OS and economic system, to fix it promptly so that we never have to leave our elitist comfort zone.
</s>

God forbid anyone gets functioning drivers for mainstream hardware in Linux.

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 20:41 UTC (Fri) by shironeko (subscriber, #159952) [Link] (7 responses)

It's a little strange that changing the memory model needs kernel involvement at all, especially if it's a compatibility thing. It seems like something user-mode code should just be allowed to do by toggling some register or something. Is this a case of Apple's backwards engineering?

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 20:49 UTC (Fri) by dezgeg (guest, #92243) [Link] (6 responses)

It's per-process state that needs to be enabled/disabled as necessary when doing a task switch.

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 4:37 UTC (Sat) by shironeko (subscriber, #159952) [Link] (5 responses)

Right. Is the patch meant to make Linux preserve this state when task switching? That seems entirely uncontroversial.

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 10:18 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (4 responses)

The patch wants to introduce a prctl() operation to allow processes to declare they need the alternative memory model.

But since that alternative memory model DRASTICALLY changes how multi-threaded code needs to be written (broadly speaking the LTO model greatly simplifies multi-threaded code) the maintainers are basically saying "Everyone will take the lazy route and copy-paste enable that everywhere so their x86 code works right on Arm and just refuse to support systems that don't support the x86 memory model." and... yes, so what?

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 4:27 UTC (Sun) by rgb (guest, #57129) [Link] (1 responses)

Yes, those developers will also take the well deserved ~10% performance hit, which sounds about the right amount of punishment for being lazy.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 21:57 UTC (Sun) by WolfWings (subscriber, #56790) [Link]

It's not about being lazy directly, but about simply not having time to re-architect a codebase that started on x86, of which there are legion.

Being able to recompile an existing codebase by just throwing that flag on some boilerplate launch code and have it work without further fiddling?

That's worth lightyears more than the 10% performance hit to get it out the door and running. Not even for corporate stuff but just FOSS stuff that's multi-threaded.

Folks can fix the code over time to stop making the x86 LTO assumptions but getting it running fully elsewhere is worlds more important generally.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 23:37 UTC (Sun) by jeremyhetzler (subscriber, #127663) [Link] (1 responses)

What do you mean by "LTO"?

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 6:53 UTC (Mon) by roc (subscriber, #30627) [Link]

I assume they meant TSO.

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 22:17 UTC (Fri) by Heretic_Blacksheep (subscriber, #169992) [Link] (17 responses)

Every time I see a Linux kernel maintainer dogmatically dictate they will actively block any patches that enables legitimate hardware functions for no good reason other than programmers might actually use it (GASP! - even if they use it in an iffy way programmers are MEANT to use it on that platform!) I want to grab the guy by the shirt and demand who made him god (they aren't even if they're accidentally in some position of minor authority) - and at the same time I realize that sometimes it's a good thing distros regularly maintain their own kernel tree fork & override some zealot upstream overly concerned about the orthodoxy of his True Way of doing things. Don't like it? Don't touch it, but likewise don't try to stop others that want to use it! (Don't like it? Don't watch it.)

I could understand if there is an objectively demonstrable issue with patch quality or data integrity over some hardware functions: constant time guarantee violations, inherent race conditions, etc. But none of those objections were raised from what I saw. Only that someone might at some point write subtly broken code... which people do all the time, including all the kernel maintainers themselves. Might as well disable concurrent code entirely and go back to single, strictly in-order processing because of branch misprediction, race conditions, and logical process mapping violations abound - because that's exactly what that argument amounts to - and all equally fixable by "disallowing" most modern CPUs and requiring all programmers complete 3 years of computer science to learn proper programming logic/formal methods (good luck!).

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 22:52 UTC (Fri) by willy (subscriber, #9762) [Link] (16 responses)

you seem nice

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 23:21 UTC (Fri) by Wol (subscriber, #4433) [Link] (11 responses)

Dunno what I think about Heretic Blacksheep, but he has a point :-(

There's too many stories about stuff being blocked because maintainers don't like it, with minimal or no technical justification, and sometimes a blunt "over my dead body".

I can accept maintainers being sceptical and wanting to be convinced, but one only has to look at the people who don't want Rust anywhere near their subsystem, or that module that's supposed to configure hardware (can't remember the details) where it seems a single driver for a single board is unacceptable - the functionality supposedly needs to be spread across several drivers and castrated in the process ...

There are a few maintainers who seem to think that users are major inconvenience to a well functioning system - well they are, aren't they?

Cheers,
Wol

Support for the TSO memory model on Arm CPUs

Posted Apr 26, 2024 23:31 UTC (Fri) by pizza (subscriber, #46) [Link] (10 responses)

On the other hand, the maintainers usually have a far better view and understanding of the overall system (and are therefore in a better place to make educated judgements) than individual driver authors or (especially) users.

Case in point: "Android Linux" vs "Mainline Linux", and how everyone+dog cobbled stuff together and bashed at it until the former "worked" with no consideration of the cost (or even the necessity) of ongoing maintenance. Repeat that for every SoC maker, or SoC family, or individual SoC, or even individual devices.

The "Android" features that finally made it into the mainline bore little resemblance to (and are vastly superior to) what was first shipped en masse, and that doesn't even touch on how much stuff never even saw a release in source code form.

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 0:12 UTC (Sat) by Wol (subscriber, #4433) [Link] (5 responses)

> On the other hand, the maintainers usually have a far better view and understanding of the overall system (and are therefore in a better place to make educated judgements) than individual driver authors or (especially) users.

And if a maintainer says "convince me", then that's fine. It's when the maintainer says "I'm not interested in any arguments, the answer's no", that we have a problem.

The problem as far as I'm concerned, is that too many experts have been brainwashed into thinking they know best, and need bashing over the head with a clue-by-four. If they're not prepared even to ask the question "does the proposer have a point", then they ought to go.

Your Android stuff, I'm not saying it was the best way to do it, but cobbling together a system that works, and then fixing it so it's acceptable to mainline, is the way Linux works. But sometimes mainline has the attitude "to hell with users, I want an easy life". You're left with the feeling on occasion that the kernel is actively hostile to user-space, forgetting that without user-space there's no point in having a kernel.

Cheers,
Wol

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 2:05 UTC (Sat) by nksingh (subscriber, #94354) [Link] (4 responses)

I think part of the issue here is that ARM Ltd. has a strong opinion that the architecture should have a weak memory model and all software written for ARM should be ready to handle it.

They are using their maintainership over the upstream ARM kernel port to enforce their viewpoint as the owner of the architecture. But they're kind of doing it in a way that might hamper adoption for people who want to efficiently emulate x86 binaries. Maybe there are no ARM in-house cores that support any TSO extensions so they don't care yet.

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 6:26 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (3 responses)

As far as I understand, Apple is a founding member of Arm. As such, they are allowed to implement non-standard extensions whereas most other vendors can't.

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 17:59 UTC (Sat) by thoughtpolice (subscriber, #87455) [Link] (2 responses)

Yes, in practice all the cores that implemented it are custom designs, and Fujitsu, Nvidia and Apple are all architecture licensees, so they're allowed to do that. Not all of Nvidia's custom designs, at least, use TSO; Carmel (Xavier) is TSO only, while Denver2 (TX2) is weakly ordered.

TSO is a bit weird though because I was under the impression ARM's rules were something like "Non-standard extensions can't be externally 'visible' or documented for consumption and the core must fully comply with the spec." That's why Apple doesn't (and probably can't!) document extensions like this one, that allows TSO to be toggled. But the memory model is pretty visible? Maybe it's all a bit weaselly; after all, if the default memory model is stronger than the spec requires, then every correctly written weakly ordered program should behave the same anyway, so it's not "visible". On the other hand, incorrect programs may behave differently than they would under a weak model, but you could argue that TSO behaviors are "just" a subset of weakly ordered behaviors, and thus it's in spec.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 4:46 UTC (Sun) by anton (subscriber, #25547) [Link]

Yes, weak ordering is not an architectural feature, it's a lack of a defined feature. An implementation that provides stronger ordering still satisfies the architectural requirements. And unlike, say, an additional instruction, where the architecture actually requires a specific behaviour (typically an illegal instruction exception) when the encoding for the new instruction is encountered, but the implementation with the additional instruction behaves differently, a stronger ordered implementation satisfies the (lack of) architectural requirements of the weakly ordered architecture. Another way of looking at it is that a program written for a weakly ordered architecture will behave as intended on an implementation with stronger ordering, so there is no test to say that the strongly ordered implementation does not satisfy the requirements of the architecture.

My guess is that this clause in the architectural license is indeed there to prevent licensees from squatting on currently unused instruction encoding space or currently illegal behaviour that ARM intends to reserve for future extensions.

The only thing that might be an issue here is the switch that Apple implemented. I guess they implemented it in something like a model-specific register (i.e. an area that the architecture leaves for implementations to define), and then that should satisfy the architectural license. I wonder, though, why Apple bothered with the switch; why not just do what Fujitsu did according to the article, and implement TSO throughout?

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 20:28 UTC (Mon) by justincormack (subscriber, #70439) [Link]

Apple does not actually refer to their cores as Arm anywhere (it's Apple Silicon, please), and Arm does not usually refer to them at all. Of course everyone else does.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 20:22 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (1 responses)

Android has facilities vastly superior to anything available in the GNU/Linux ecosystem. Consider linker namespaces, which comprehensively solve the ELF single namespace dynamic linker mistake. Consider the extremely efficient no-IPC system properties interface. Consider Binder, which beats the pants off dbus (on both performance and functionality) any day of the week. The broader "Linux" community would do well to learn, humbly, from Android.

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 5:59 UTC (Mon) by ssmith32 (subscriber, #72404) [Link]

But the original argument was not about the facilities provided by the implementations, it was about the quality of the implementation.

I can't speak to the quality of the Android implementation, but I've definitely seen the following pattern in languages (C++ some, but particularly Java). Heck, I've seen analogs in applications and entire systems.

It goes like this:

Everyone agrees some aspect of the language/standard library is bad and horrible.

But, instead of getting a clean implementation upstreamed, they hack together some framework of horror (if one looks at the implementation), that putatively improves the situation (if one only looks at the features provided).

Eventually, the language incorporates a clean solution (since doing it right takes time), but all the consultants and folks who carved out a niche understanding the (often poorly documented) framework of horror, now have a sunk cost, and it takes ages to convince them to transition to just using the nicely implemented, performant, well-documented language/standard library features.

Maybe this isn't the situation with Android - but nothing in your argument rules it out.

Android could very well provide some nice features to the users *and* be a pile of unholy hacks. This is not an uncommon situation in many areas of programming and computer systems in general.

Worse is not necessarily better, despite what the gods have said to us.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 20:45 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Repeat that for every SoC maker, or SoC family, or individual SoC, or even individual devices.

Android now has a single image that works across all the devices: https://source.android.com/docs/core/architecture/kernel/...

> The "Android" features that finally made it into the mainline bore little resemblance (and is vastly superior) to what was first shipped en masse

Which ones?

> and that doesn't even touch on how much stuff never even saw a release in source code form.

Android core has always been Open Source.

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 6:02 UTC (Mon) by ssmith32 (subscriber, #72404) [Link]

Yeah, but Android *core* hasn't always been a terribly useful set of software. The situation has gotten better, but, that's not really what people mean when they talk about the Android ecosystem in general.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 14:47 UTC (Sun) by Heretic_Blacksheep (subscriber, #169992) [Link] (3 responses)

Arbitrariness and capricious argument are what I expect out of religious fanatics. The reasons given by two of the Arm code contributors stink of religious zealotry, not well-reasoned and thought-out objections, hence my reaction. "Who died and made you god?" is a common question when dealing with this kind of problem. It's a colloquialism just like every culture on earth has. It signifies strong negative reaction, nothing else.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 15:28 UTC (Sun) by pizza (subscriber, #46) [Link] (2 responses)

> "Who died and made you god?"

The Arm maintainers, like all other maintainers in the F/OSS world, are maintainers because they've consistently shown up and done this hard, usually thankless work for many, many years. The Arm architecture is in far, far better shape now thanks to their work.

In other words, they have earned (and continue to earn!) their godhood.

Can you say the same about yourself?

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 14:03 UTC (Mon) by hmh (subscriber, #3838) [Link] (1 responses)

Well, you have what effectively is a new, incompatible-with-others architecture variant (ARM64TSO ?) that is capable of running programs for the other architecture (ARM64), but the opposite isn't true. Just like i686 userspace running on AMD64/X86-64 kernels... But without the benefit of being actually considered a separate architecture or variant, and thus creating a massive amount of headaches later on, *including* to users, distros, etc.

But is it any different from other optional extensions to the ARM64 ISA, which not everybody provides (is there such a thing)? I don't know enough about it to even give an "IMHO" about it, but to someone without enough information, it comes to mind that maybe it is just in need of being implemented to look like, e.g., AMD64+AVX2 support is to generic AMD64 (but with the IOCTL toggle or whatever form it should take: maybe something in ELF)?

Support for the TSO memory model on Arm CPUs

Posted Apr 30, 2024 7:54 UTC (Tue) by epa (subscriber, #39769) [Link]

There's a moral dimension to it that other architecture extensions don't have. If a new CPU supports a new instruction, and you decide to use it to speed up your program, you are simply taking advantage of new hardware. No doubt you continue to provide an alternative, slower implementation for older CPUs. You're not being a bad person by using the new instruction -- except, perhaps, if it's patented and only available on processors from one litigious, grasping manufacturer. (Although in olden days there was sometimes a mild "naughtiness" associated with using undefined opcodes of an 8-bit processor, arising from "don't care" states in the design, which had strange but occasionally useful semantics.)

But here there is a true path revealed to us. The designers of ARM, in their wisdom, have provided a plain and humble concurrency model. The righteous programmer will take care to follow its strictures. Not for us the worldly luxury of TSO as practised by the followers of Intel and the AMD-ites. If one group of ARM developers is allowed to get lazy and become addicted to the easier memory model, that's unfair to those who have remained virtuous. Indeed, others might be led astray and start to demand support for TSO everywhere. And then where would we be?

Support for the TSO memory model on Arm CPUs

Posted Apr 27, 2024 20:44 UTC (Sat) by iabervon (subscriber, #722) [Link]

It seems to me like managing it on a per-VM basis would be wise, if only to avoid questions about what happens if two CPUs with different memory models share memory. Does TSO mean that other CPUs see your writes in order, or that you will see other CPUs' writes in order? If all CPUs have the same memory model (they're all x86, or all Arm cores that always use TSO), these alternatives are equivalent, so it wasn't previously important to specify. Will other vendors agree with Apple's implementation if they make CPUs that can change memory models?

It's already possible for all CPUs in an Arm system to be TSO, and having that potentially true of only some VMs on a single host isn't particularly complicated, while having a mixture of memory models seems like something that's only reliable if you're the CPU vendor.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 5:42 UTC (Sun) by anton (subscriber, #25547) [Link] (6 responses)

You are too lenient on the CPU designers here. They are prone to throw problems over the wall to the software people. E.g., Alpha gave us imprecise exceptions "because of performance". But IA-32 implementations with OoO execution gave us better performance and precise exceptions (and OoO implementations would not become faster if the requirement for precise exceptions was dropped).

I call this attitude of throwing problems over to the software people the "supercomputer attitude", because it's especially rampant in supercomputing where hardware costs still exceed software costs.

Weak memory ordering is just another example of throwing a problem over to software. The hardware people would have loved to have no cache coherency between different CPUs at all. But that was so unusable that they did the next-best thing: weak ordering, where software has to insert synchronizing instructions (that are slow if the hardware is designed to be essentially incoherent) at various places. And then people from the same company that gave us Alpha with not just imprecise exceptions but also an especially weak memory model wrote an advocacy piece that justifies the lack of architectural quality with performance (the magic word that makes lots of people overlook any misfeature).

Now we have CPUs from Intel and AMD which mostly use TSO that are in machines with hundreds of cores. Do they suffer from bad performance? Fujitsu actually designed its A64FX for a supercomputer and implemented TSO, so they obviously know how to implement TSO efficiently (they did not even bother with a weakly ordered mode; they have decades of experience implementing SPARC CPUs which offered both TSO and weak ordering, switchable). Apparently Apple is not quite there yet, as according to the numbers they have an implementation where weak ordering still provides a performance advantage, but at least they offer a TSO mode. I hope that they will improve their implementation such that the performance penalty of TSO mode vanishes. And if ARM really wants to be competition to Intel and AMD in laptops and servers, they need to implement TSO, too, and implement it efficiently.

To come back to the Alpha: On the 21264 they actually implemented precise exceptions AFAICT (it's natural for an OoO implementation and the trapb instruction was as cheap as a noop, because it then was a noop). But Alpha was canceled before they saw the light and made precise exceptions an official feature of a new version of the architecture.

Support for the TSO memory model on Arm CPUs

Posted Apr 28, 2024 20:18 UTC (Sun) by ringerc (subscriber, #3071) [Link] (5 responses)

It's not a simple "yes" or "no" situation though. AMD64 and x86 are not fully ordered models either; they're just a bit LESS weakly ordered.

It's like database isolation. READ UNCOMMITTED is weaker than READ COMMITTED which is weaker than REPEATABLE READ and so on. Some implementations support some isolation levels and not others. Almost none support totally strict ordering because that makes necessary concurrency almost impossible; instead stronger isolation levels use speculative execution where the transaction can fail and roll back if an isolation requirement cannot be satisfied.

There's more of a spectrum (or really an n-dimensional matrix) of possible ordering decisions in memory ordering and logical order of operations. There is no one right choice - "100% ordered all the time" just isn't feasible with concurrent execution with shared-anything, so *all* design decisions are compromises.

In a database I pick the isolation level most appropriate to my application's requirements. Or even for specific operations; sometimes I want to have stronger guarantees on specific things where I'm more willing to have them slow and/or require retries. So having something similar for the CPU memory model makes perfect sense to me. You use what's appropriate for the app's correctness, latency and throughput requirements.

Support for the TSO memory model on Arm CPUs

Posted Apr 29, 2024 10:04 UTC (Mon) by farnz (subscriber, #17727) [Link]

There's also the fact that, at least in theory, compilers are permitted to reorder atomic accesses based on the rules of the language memory model (not the CPU memory model), adding a whole extra layer of complexity.

I don't believe any compiler currently does so for accesses other than "relaxed" type accesses (although I've seen discussions that suggest that LLVM is thinking about this), but it's not forbidden by any language spec I've seen. I have a suspicion, though, that there's plenty of source code out there that's buggy if a compiler does start doing reorderings permitted by the language spec that aren't permitted by TSO.
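The kind of code that is at risk here can be illustrated with the classic message-passing pattern: a writer stores data and then sets a flag, and a reader checks the flag before reading the data. What follows is only a toy model in Python, not a description of any real compiler or CPU; the function name `mp_outcomes` and the simplification that only the two stores can be reordered (the reader's loads stay in program order) are assumptions made for illustration.

```python
from itertools import permutations

def mp_outcomes(store_orders):
    """Return every (flag, data) pair the reader can observe.

    store_orders lists the orders in which the writer's two stores
    may become visible in memory.
    """
    outcomes = set()
    for order in store_orders:
        n = len(order)
        # The reader loads flag once i stores are visible,
        # then loads data once j >= i stores are visible.
        for i in range(n + 1):
            for j in range(i, n + 1):
                flag = 1 if "flag" in order[:i] else 0
                data = 1 if "data" in order[:j] else 0
                outcomes.add((flag, data))
    return outcomes

# The writer's program order: data = 1, then flag = 1.
PROGRAM_ORDER = ("data", "flag")

# TSO-like: stores become visible in program order only.
tso = mp_outcomes([PROGRAM_ORDER])

# Weaker model (or a reordering compiler): either order is possible.
weak = mp_outcomes(list(permutations(PROGRAM_ORDER)))

print("TSO-like:", sorted(tso))
print("weaker:  ", sorted(weak))
```

In this model the outcome (flag=1, data=0), where the reader sees the flag but stale data, appears only in the weaker set. Code that relies on never seeing it works as long as stores stay in order and breaks silently once anything, hardware or compiler, is allowed to reorder them.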

Support for the TSO memory model on Arm CPUs

Posted Apr 30, 2024 17:04 UTC (Tue) by anton (subscriber, #25547) [Link] (3 responses)

My understanding of AMD64 is that the regular memory accesses (those in the ordinary mov instructions and that are part of load-and-operate and RMW instructions) behave according to TSO. There are some special instructions with worse semantics, but you have to choose them to be bitten by them. So it's not bad. If you don't want to be bitten, don't use these instructions. And, concerning farnz' comment, the same goes for compilers: avoid those that provide worse memory ordering than can be had on the hardware the code is running on.

If by "not fully ordered" you refer to TSO being weaker than sequential consistency, that's true, and I would prefer sequential consistency. But that's just whataboutism. The issue at hand is about TSO vs. ARM's variant of weak memory ordering.
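The gap between TSO and sequential consistency is usually shown with the "store buffering" litmus test: each of two threads stores to one variable and then loads the other. The sketch below is a toy enumeration in Python; it assumes, as a simplification, that TSO can be modelled by letting each thread's store become visible after its own later load, and the names `interleavings` and `sb_outcomes` are made up for this example.

```python
from itertools import combinations, product

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(n)]

def sb_outcomes(allow_store_load_reordering):
    """Return every (r0, r1) result of the store-buffering test."""
    # Thread 0: x = 1; r0 = y      Thread 1: y = 1; r1 = x
    t0 = [("store", "x"), ("load", "y", "r0")]
    t1 = [("store", "y"), ("load", "x", "r1")]

    def orders(thread):
        result = [thread]
        if allow_store_load_reordering:
            # The store sits in a store buffer and reaches memory
            # only after the following load has already executed.
            result.append(thread[::-1])
        return result

    outcomes = set()
    for o0, o1 in product(orders(t0), orders(t1)):
        for trace in interleavings(o0, o1):
            mem = {"x": 0, "y": 0}
            regs = {}
            for event in trace:
                if event[0] == "store":
                    mem[event[1]] = 1
                else:
                    regs[event[2]] = mem[event[1]]
            outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

print("sequentially consistent:", sorted(sb_outcomes(False)))
print("TSO-like:               ", sorted(sb_outcomes(True)))
```

The result r0 = r1 = 0 shows up only when store-to-load reordering is allowed, which is the one relaxation TSO makes relative to sequential consistency in this model.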

There is no one right choice - "100% ordered all the time" just isn't feasible with concurrent execution with shared-anything
What evidence do you have for that? Is it the same evidence that the Alpha people gave for rejecting byte and word accesses (which they added to the architecture later) and for rejecting precise exceptions?

Anyway, whatever you mean by "100% ordered all the time" (sequential consistency?), the issue at hand is TSO, and TSO is obviously feasible, as evidenced by all the hardware that implements TSO; and given that this includes even hardware designed for supercomputers (the Fujitsu A64FX), and it's not optional there, the tradeoff is obviously not one of horses for courses, but one of better (TSO) vs. worse (weaker memory models).

Support for the TSO memory model on Arm CPUs

Posted Apr 30, 2024 17:55 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

That A64FX is TSO is not evidence that TSO is feasible for all heavy compute scenarios; HPC code tends to be designed to minimize data sharing between threads, since that's historically been a really slow option. As a result, Fujitsu might well decide to throw away the last 5% to 10% peak performance in favour of fewer bugs that only exhibit on A64FX and not Intel/AMD supercomputers.

Support for the TSO memory model on Arm CPUs

Posted May 3, 2024 14:22 UTC (Fri) by anton (subscriber, #25547) [Link] (1 responses)

There are many unstated (and unsupported) and some stated (but also unsupported) assumptions in your postings, among them:
  • That the speed advantage of weak ordering on some hardware shows itself only on shared data.
  • That other "heavy compute scenarios" (which ones?) do not avoid data sharing, but HPC does.
  • That data sharing has historically (where, when?) been a really slow option, but that's no longer the case.
  • That the performance difference between TSO and a hypothetical weak ordering variant of the A64FX would be 5%-10%.
  • That supercomputer users are willing to throw away 5%-10% in performance for easier porting of code, while for the rest of the computing world performance is much more important, so it's entirely reasonable to let them suffer the problems that come from weak ordering.
  • That porting from Intel/AMD is of particular relevance for the users of A64FX-based supercomputers.
I think that most of these assumptions are not plausible or outright wrong.

Support for the TSO memory model on Arm CPUs

Posted May 3, 2024 15:31 UTC (Fri) by farnz (subscriber, #17727) [Link]

In order:

  • If there's no data sharing, then there's no ordering between CPUs to begin with. In this situation, a high-performance implementation of TSO or even sequentially consistent memory models can delay the stores needed to meet the memory model rules until a read happens from another CPU. This, in turn, means that the speed advantage of weak ordering is only guaranteed to show up on shared data, or on low performance implementations.
  • All compute scenarios attempt to avoid data sharing; however, we were talking in the context of A64FX, which is a HPC processor, and the history of HPC matters here.
  • Data sharing in HPC has historically been things like MPI over a network, instead of shared memory. That biases HPC towards a "no shared data" model, meaning that the performance penalty of TSO is less significant than in something like a database server which uses shared state to manage the consistency part of ACID.
  • On the CPU cores we have that allow you to switch between TSO and weaker models, the weaker model has at least a 10% advantage over TSO on a basket of tasks, and considerably more than that in many cases. 5% is thus a reasonable estimate for the cost of TSO mode, assuming that cores are close to optimal today.
  • It's not just about easier porting of code; weaker memory models result in silent errors, where you thought you had sufficient synchronization, but you did not. In HPC, that leads to unexpected wrong results, where the computation was incorrect.
  • HPC development is not normally done on the supercomputer itself; that's where you run your completed code. Development is done on "normal" laptops and desktops, just like everywhere else; a bug that only shows on the supercomputer but not on the developer workstations is challenging to debug, since the supercomputer is occupied by people running completed code. That makes porting from Intel/AMD relevant to the users of A64FX-based supercomputers, since the initial testing is going to be done on Intel/AMD for CPU-based code; the exception is GPU-based supercomputers, where the testing is done on a GPU from the same manufacturer.

On the other hand, your assertion, which is completely unbacked, is that if A64FX has chosen TSO, it must have done so because TSO and weak memory ordering have the same performance; and yet on every CPU out there that offers both options, including the latest Apple Silicon devices and previous Fujitsu SPARC64 CPUs, TSO has a performance penalty compared to the processor's relaxed memory ordering. However, given that A64FX is swimming against the stream in HPC (most HPC systems now use GPUs for the compute elements, with CPUs there to manage the GPUs, where Fujitsu are using CPUs without GPUs), it is entirely plausible that they chose to minimise risk of bad outputs in preference to extracting the last few percent of performance.

After all, being 4th instead of 1st in the Top500 list is preferable to being 4th but having to withdraw papers based on A64FX computations because they're demonstrably wrong; it's also possible that Fujitsu had actual data from their SPARC64 XII machines that showed that people ran in TSO mode rather than doing the work to get the last few % of performance in RMO mode.

personality(2) already exists

Posted Apr 28, 2024 18:07 UTC (Sun) by Hello71 (guest, #103412) [Link] (5 responses)

Linux already has a system call to enable special compatibility logic in the kernel for old binaries. personality(2) was added 30 years ago in Linux 1.1.20 and has gained various "weird memory" modes over the years, including "ADDR_LIMIT_32BIT (since Linux 2.2): Limit the address space to 32 bits." and "ADDR_NO_RANDOMIZE (since Linux 2.6.12): With this flag set, disable address-space-layout randomization.". There has not been a surge of poorly-written programs relying on these modes.

Fundamentally, CPU emulation is a heavy compatibility mode. A reasonable argument could be, and endlessly has been, made that Wine and other emulators are detrimental to the long-term success of portable applications, but I see no principled reason why Wine, QEMU, ADDR_LIMIT_32BIT, and ADDR_NO_RANDOMIZE should be acceptable but PR_SET_MEM_MODEL_TSO shouldn't.
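For readers who have not met personality(2), here is a minimal sketch in Python that queries the calling process's current persona through ctypes and decodes the two flags named above. The flag values are those from <linux/personality.h>; the helper names `decode` and `current_persona` are invented for this example, and the query returns None on systems where the call is unavailable. It only reads the persona and does not change it.

```python
import ctypes

# Flag values as defined in <linux/personality.h>.
ADDR_NO_RANDOMIZE = 0x0040000
ADDR_LIMIT_32BIT = 0x0800000

# Passing 0xffffffff asks for the current persona without changing it.
QUERY_ONLY = 0xFFFFFFFF

def decode(persona):
    """Return the names of the 'weird memory' flags set in persona."""
    names = []
    if persona & ADDR_NO_RANDOMIZE:
        names.append("ADDR_NO_RANDOMIZE")
    if persona & ADDR_LIMIT_32BIT:
        names.append("ADDR_LIMIT_32BIT")
    return names

def current_persona():
    """Return the current persona, or None if it cannot be queried."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        func = libc.personality
    except (OSError, AttributeError, TypeError):
        return None  # not Linux, or no personality() in the C library
    func.argtypes = [ctypes.c_ulong]
    func.restype = ctypes.c_int
    result = func(QUERY_ONLY)
    return None if result == -1 else result

persona = current_persona()
if persona is None:
    print("personality(2) is not available here")
else:
    print(hex(persona), decode(persona))
```

A process started under `setarch -R`, for instance, would be expected to report ADDR_NO_RANDOMIZE here.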

personality(2) already exists

Posted Apr 29, 2024 7:27 UTC (Mon) by pm215 (subscriber, #98099) [Link] (3 responses)

As it happens QEMU has generally either not needed or not succeeded in getting into the kernel compat tweaks for its emulation. (The one I'm thinking of is that if you're emulating a 32 bit guest readdir syscall you need to fill a 32 bit d_off in struct dirent, but a 64 bit host kernel with some filesystems will put a 64 bit hash in the d_off, which can't fit into the guest's 32 bit field, and there's no way to say "give me something I can put in 32 bits". We tried to get patches for an fcntl to toggle this into the kernel a few years back but they went nowhere.)

personality(2) already exists

Posted Apr 30, 2024 6:34 UTC (Tue) by linusw (subscriber, #40300) [Link] (2 responses)

I tried to fix this for QEMU but ended up getting ignored I think.
Personally I can live with this, but it's probably not good for QEMU.
https://lore.kernel.org/linux-fsdevel/20201117233928.2556...

personality(2) already exists

Posted Apr 30, 2024 15:53 UTC (Tue) by mb (subscriber, #50428) [Link] (1 responses)

What does qemu currently do, when it encounters this situation?

personality(2) already exists

Posted May 2, 2024 14:53 UTC (Thu) by pm215 (subscriber, #98099) [Link]

The typical failure mode is that QEMU returns the 64-bit value in the struct from the getdents64(2) syscall, and the guest libc then fails the readdir(3) library call with EOVERFLOW when it can't fit the quart into the pint pot. The guest program is usually terminally unhappy with that.
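The arithmetic behind that failure is simple enough to sketch. The Python below is an illustration only, not QEMU or libc code; `narrow_d_off` is a hypothetical helper, and it assumes the guest's field is a signed 32-bit integer.

```python
import errno

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def narrow_d_off(d_off64):
    """Fit a 64-bit d_off into a signed 32-bit field, or fail with EOVERFLOW."""
    if not INT32_MIN <= d_off64 <= INT32_MAX:
        raise OSError(errno.EOVERFLOW, "d_off does not fit in a 32-bit field")
    return d_off64

# A small offset passes through unchanged.
print(narrow_d_off(12345))

# A 64-bit hash value (made up for this example) cannot be narrowed.
try:
    narrow_d_off(0x5A3C9F0112345678)
except OSError as exc:
    print("failed:", errno.errorcode[exc.errno])
```

There is no lossless way to squeeze the larger value into the smaller field, which is why a kernel-side switch to request 32-bit-sized offsets was wanted.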

personality(2) already exists

Posted Apr 29, 2024 20:33 UTC (Mon) by justincormack (subscriber, #70439) [Link]

emacs used to disable ASLR with that, not aware of any other usage, and eventually they gave up.

Support for the TSO memory model on Arm CPUs

Posted Apr 30, 2024 12:27 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

I understand Will and he's right. A few years ago when I started to test haproxy on the Neoverse-N1 platform, I discovered that it would randomly crash and misbehave. It had weaker atomics than its predecessors and my code was just too unrigorous to accommodate that. This notion of memory models was totally unclear to me by then; nobody except a few tens of people on earth really know how each memory model works (and Will definitely is one of them BTW). After a few weeks of work (and help from the CPU designer) we addressed the problems and started to be stricter in our way to use atomics. Now it's one of the most scalable platforms, if not the most scalable one. Now I wish there was a way to disable TSO on x86!

If I had had this prctl available, what would have happened? Very easy, looking at the random crashes, Google would have directed me to other people solving this problem by placing this prctl() in the startup chain and voilà! Not only the code would never have been fixed, but it would have remained suboptimal and I would not have learned anything about my bad practice. Thus I'm really glad this prctl didn't exist!

Also there are other problems: it's not just a matter of process, it's a matter of code. Programs are dynamically linked with external libraries nowadays. What if a library was designed with TSO in mind and not the rest of the code? How will one detect that the whole code needs to be protected? IMHO that's more of an ABI problem than anything else, and if something had to be done, it should be by claiming to be a different architecture variant so that it is the kernel that recognizes a binary as working in this or that mode, and that programs can only link with compatible libraries. This way it still allows existing code to be ported using that explicit architecture variant, but it also imposes it on the whole ecosystem (the set of dependencies), just like some use compat32 for example. This would avoid the risk that users start to randomly stuff that into their programs to solve bugs, and make sure that it's only used when there is a compelling reason to do so.

Support for the TSO memory model on Arm CPUs

Posted May 1, 2024 3:02 UTC (Wed) by gmatht (subscriber, #58961) [Link]

Presumably, every library that needs TSO should enable it in its startup logic (immediately prior to where it subverts SSH).

Support for the TSO memory model on Arm CPUs

Posted May 4, 2024 7:38 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

That reasoning holds so long as one can change the code.

For emulation, one can’t. And without TSO, x86 can’t be emulated efficiently on Arm.

Support for the TSO memory model on Arm CPUs

Posted May 5, 2024 1:55 UTC (Sun) by jubal (guest, #67202) [Link]

this is going to be used on apple's processors anyways, because it's required to run x86 emulation efficiently; the question is whether it's going to be managed where it should (in the kernel), or – if the architecture maintainers manage to convince linus that he shouldn't be able to run steam and windows games efficiently on his apple laptop when running mainline kernel – outside of the kernel tree.

Support for the TSO memory model on Arm CPUs

Posted Apr 30, 2024 13:15 UTC (Tue) by farnz (subscriber, #17727) [Link]

Would it be workable to say that this state is per-thread, and that a thread in TSO mode cannot make syscalls other than to disable TSO mode?

For an emulator like FEX, this is perfectly reasonable; emit "switch to TSO mode", emit your JITted code, and at the end of a JIT block (when you're about to go back to Arm native code such as a FEX ThunkLib, or making a syscall), emit "switch out of TSO mode". It has a small performance hit (since you need to re-request TSO mode at each syscall boundary), but hopefully the advantage of running the core in TSO mode outweighs the hit from having to switch back and forth.

However, this isn't usable as a "quick hack" for applications that didn't take the memory model into account - you'd have to patch up every place that makes a syscall to switch out of TSO mode, make the syscall, and switch back, else it'll crash. If you do this, it's obvious looking at your code that you've got an evil hack in place, and thus clear that you didn't think about the impact.

Am I missing something critical here?
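The proposed rule can be sketched as a small state machine. This is a toy model in Python, not kernel or FEX code: `ToyThread`, `set_tso`, `syscall`, and `run_jit_block` are all hypothetical names, and the "crash" on a syscall made in TSO mode is modelled as an exception.

```python
class ToyThread:
    """Models a thread whose TSO mode forbids ordinary syscalls."""

    def __init__(self):
        self.tso = False
        self.log = []

    def set_tso(self, enabled):
        # The one request that is always allowed, whatever the mode.
        self.tso = enabled
        self.log.append("tso_on" if enabled else "tso_off")

    def syscall(self, name):
        if self.tso:
            raise RuntimeError(f"syscall {name!r} attempted in TSO mode")
        self.log.append(f"syscall:{name}")

def run_jit_block(thread, block):
    """Run translated guest code with TSO enabled, then switch back."""
    thread.set_tso(True)
    try:
        block()
    finally:
        thread.set_tso(False)

# An emulator brackets each JIT block and makes syscalls in between.
emu = ToyThread()
run_jit_block(emu, lambda: emu.log.append("jit"))
emu.syscall("write")
run_jit_block(emu, lambda: emu.log.append("jit"))
print(emu.log)

# The "quick hack": turn TSO on at startup and carry on as usual.
hack = ToyThread()
hack.set_tso(True)
try:
    hack.syscall("write")
except RuntimeError as exc:
    print("quick hack fails:", exc)
```

In this model the emulator pays two mode switches per syscall, while a program that merely enables TSO at startup fails at its first syscall unless every call site is wrapped by hand.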

Support for the TSO memory model on Arm CPUs

Posted May 3, 2024 12:47 UTC (Fri) by smitty_one_each (subscriber, #28989) [Link]

So, in general TSO chickening-out is bad?

(I'll show myself out.)


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds