Better than forcing it
Posted Oct 31, 2025 14:47 UTC (Fri) by archaic (subscriber, #111970)
Parent article: Ubuntu introduces architecture variants
Posted Oct 31, 2025 15:02 UTC (Fri)
by Kamiccolo (subscriber, #95159)
[Link] (14 responses)
Posted Oct 31, 2025 16:00 UTC (Fri)
by muep (subscriber, #86754)
[Link] (11 responses)
Posted Oct 31, 2025 16:04 UTC (Fri)
by ttuttle (subscriber, #51118)
[Link] (10 responses)
Posted Oct 31, 2025 18:02 UTC (Fri)
by WolfWings (subscriber, #56790)
[Link] (8 responses)
https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_te... goes into that from the perspective of AVX512 where the problem is even more exaggerated.
And at this point a lot of libraries that can benefit from AVX just do CPU detection and ship multiple code paths instead, so this side-distro is a nice idea, but I suspect it's not going to help much. The infrastructure to support sub-arches, so that -v4 packages and the like could be published, would be nice, but how much it actually helps remains to be seen.
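For reference, the run-time dispatch pattern described above tends to look roughly like this minimal sketch (the sum_* kernels and their names are hypothetical stand-ins for real optimized routines; GCC and Clang provide __builtin_cpu_supports() on x86):

    /* Minimal sketch of per-CPU run-time dispatch, roughly what many
     * libraries do by hand (often via IFUNC resolvers in practice). */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in kernels; real libraries put their hand-written
     * AVX2/AVX-512 loops here.  All three are plain C so the sketch links. */
    static uint64_t sum_scalar(const uint8_t *p, size_t n)
    { uint64_t s = 0; for (size_t i = 0; i < n; i++) s += p[i]; return s; }
    static uint64_t sum_avx2(const uint8_t *p, size_t n)   { return sum_scalar(p, n); }
    static uint64_t sum_avx512(const uint8_t *p, size_t n) { return sum_scalar(p, n); }

    /* The dispatch itself: test a feature table that libgcc fills in once
     * at startup, and pick a code path accordingly. */
    static uint64_t sum_bytes(const uint8_t *p, size_t n)
    {
        if (__builtin_cpu_supports("avx512bw")) return sum_avx512(p, n);
        if (__builtin_cpu_supports("avx2"))     return sum_avx2(p, n);
        return sum_scalar(p, n);
    }

    int main(void)
    {
        uint8_t buf[4] = {1, 2, 3, 4};
        printf("%llu\n", (unsigned long long)sum_bytes(buf, sizeof buf));
        return 0;
    }

Shipping -v3/-v4 packages would let the compiler assume the higher baseline instead of paying for this branch (and losing inlining) at every dispatched call.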
Posted Nov 1, 2025 2:55 UTC (Sat)
by jreiser (subscriber, #11027)
[Link] (7 responses)
AVX-512 is not worth it for the vast majority of packages or users. It pays off when the computation mix is at least 60% linear algebra or crypto; otherwise it is not worth the effort or the cost in storage space, build time, and administrative morass.
Posted Nov 1, 2025 6:05 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link] (5 responses)
The BMI sub-extensions around AVX2 added a TON of fine-grained data-manipulation instructions down to the bit level (thus the name), and AVX512 added more advanced masking features plus selective packing on write with VPCOMPRESS, which takes non-contiguous bytes out of the 512-bit register and stores them as a single variable-length contiguous write.
So even just dealing with 32-byte blocks of data on something as simple as adding escape backslashes to a string, or doing colorspace conversion, can benefit almost fully.
AVX512 really straddles the line between CPU SIMD and what you'd expect more from GPU compute shaders.
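As a tiny illustration of that bit-level toolbox (a sketch, not taken from the comment above): BMI2's PEXT gathers the bits selected by a mask down to the low end of a register, and PDEP scatters them back out, which is exactly the kind of fine-grained packing being described.

    /* BMI2 bit gather/scatter sketch.  Compile with e.g. gcc -O2 -mbmi2 */
    #include <immintrin.h>
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t src  = 0xDEADBEEF;  /* some input bits                   */
        uint64_t mask = 0xF0F0F0F0;  /* keep the high nibble of each byte */

        uint64_t packed    = _pext_u64(src, mask);    /* gather masked bits low     */
        uint64_t scattered = _pdep_u64(packed, mask); /* deposit them back in place */

        printf("packed    = 0x%" PRIx64 "\n", packed);    /* 0xdabe     */
        printf("scattered = 0x%" PRIx64 "\n", scattered); /* 0xd0a0b0e0 */
        return 0;
    }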
Posted Nov 1, 2025 7:27 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Nov 1, 2025 16:53 UTC (Sat)
by fishface60 (subscriber, #88700)
[Link]
Posted Nov 1, 2025 22:16 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link]
https://www.intel.com/content/www/us/en/docs/intrinsics-g...
For a simple but sufficient example of the escaping-strings idea, and of how you can POPCNT the mask used for VPCOMPRESS to get the byte count written, https://lemire.me/blog/2022/09/14/escaping-strings-faster... is a pretty decent point of reference.
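This is not Lemire's exact escaping routine, but a minimal sketch of the same mask, VPCOMPRESSB, POPCNT pattern (here just dropping spaces from one 64-byte block; the function name is made up, and it needs AVX-512BW plus AVX-512VBMI2, which x86-64-v4 does not guarantee):

    /* Compile with e.g. gcc -O2 -march=icelake-server */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copies one 64-byte block from in to out, dropping all spaces,
     * and returns how many bytes were written. */
    size_t compress_out_spaces(const uint8_t *in, uint8_t *out)
    {
        __m512i block = _mm512_loadu_si512(in);
        /* One mask bit per byte: set for bytes we want to keep. */
        __mmask64 keep = _mm512_cmpneq_epi8_mask(block, _mm512_set1_epi8(' '));
        /* VPCOMPRESSB packs the kept bytes contiguously at the low end. */
        __m512i packed = _mm512_maskz_compress_epi8(keep, block);
        _mm512_storeu_si512(out, packed);
        /* POPCNT of the mask is the number of bytes actually produced. */
        return (size_t)_mm_popcnt_u64(keep);
    }

Whatever bytes are kept or inserted, the popcount of the mask gives the variable write length, which is the trick being referred to.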
Posted Nov 1, 2025 14:32 UTC (Sat)
by khim (subscriber, #9252)
[Link] (1 responses)
AVX512 would have been great if Intel hadn't bombed its introduction so badly. Today you can expect AVX512 from AMD consistently, but not from Intel. This is extremely stupid, but hey, that's Intel for you.
Posted Nov 1, 2025 22:18 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link]
AMD's implementation? 1 VP2INTERSECT per clock cycle as of Zen5, where Intel's was over 25 clock cycles.
Posted Nov 1, 2025 19:15 UTC (Sat)
by thoughtpolice (subscriber, #87455)
[Link]
Amdahl's law doesn't really mean anything here, because the most basic way of applying it is measuring a _single_ enhancement versus the system baseline at a single point in time. But giving these instructions more features, making them more widely applicable, and improving their speed expands the number of cases where they can be applied beneficially. Thus, the overall proportion of the system where improvements are possible has increased. That fact is not captured by the basic application of the law.
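To make that distinction concrete, here is a tiny illustrative calculation (the 10%/40% figures are invented for illustration, not measurements): with the same 4x speedup on the accelerated part, what changes the outcome is how much of the workload the instructions can touch at all.

    /* Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the
     * fraction of run time that benefits and s is the local speedup. */
    #include <stdio.h>

    static double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

    int main(void)
    {
        printf("p=0.10, s=4: %.2fx overall\n", amdahl(0.10, 4.0)); /* ~1.08x */
        printf("p=0.40, s=4: %.2fx overall\n", amdahl(0.40, 4.0)); /* ~1.43x */
        return 0;
    }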
The reality is that AVX-512 is extremely nice to use but Intel completely fucked up delivering it to client systems, from what I can tell, due to their weird dysfunction and total addiction to product segmentation. We could have already been long past worrying about it if not for that.
Posted Nov 1, 2025 11:21 UTC (Sat)
by hsivonen (subscriber, #91034)
[Link]
Benchmarking shows what kind of improvements you get with code that's already there. It doesn't show what kind of code would get written if CPU capabilities could be statically assumed to be present.
Many people seem to think of SIMD as a thing you use on large numbers of pixels or audio samples at a time. In those cases you get workloads that run SIMD operations for all values in a buffer, the operation can be formulated as a leaf function, and run-time dispatch to multiple versions of the leaf function is cheap compared to the time it takes to compute the function itself.
If you wish to use SIMD for text processing, chances are that the code would run some SIMD instructions intermingled with scalar code, and the values in the buffer will determine whether SIMD or scalar operations can be run. For example, you might run SIMD code over stretches of ASCII or stretches of Basic Multilingual Plane text. In these scenarios, you'll want the SIMD code to be statically inlineable, etc.
Unfortunately, it turns out that the Sys V calling convention is a poor fit for text processing with SIMD. Unlike the Windows x86_64 calling convention, which gives 128-bit SIMD registers the same treatment that conventional wisdom gives to scalar registers, i.e. having some callee-save registers and some caller-save registers, the Sys V calling convention treats all vector registers as caller-save. This can have unfortunate effects on inlining SIMD code into text processing code that calls other functions instead of being a leaf function.
Furthermore, if x86_64-v3 capabilities can be statically assumed to be present more often, it's more justifiable for compiler developers to spend time on autovectorizing operations that could be autovectorized under x86_64-v3 but can't or don't make sense to be autovectorized under x86_64-v1.
- -
As for what kind of benefits can be had right away by recompiling, here are some off the top of my head:
* Run-time dispatch can become static dispatch, which may enable inlining in addition to removing the cost of the dispatch per se.
* Bit manipulation intrinsics like "count trailing zeros" can compile to a single instruction. (For text manipulation with SIMD, on x86/x86_64, the "count trailing zeros" operation is typically needed to pair with the operation that creates a scalar bitmask from a vector lane mask; see the sketch after this list.)
* SIMD constants that have the same non-zero value on each lane can be materialized from the instruction stream instead of having to be loaded from the constant pool.
* __builtin_shuffle can generate more efficient code. (Amazingly, very basic interleave/deinterleave aka. zip/unzip of 8-bit lanes isn't as efficient on x86_64-v1 as it is on aarch64 or x86_64-v3.)
* Intrinsics that are nominally SSE intrinsics can generate corresponding AVX instructions, which, if not more efficient in itself, is supposed to combine more efficiently with other AVX instructions (e.g. the shuffles or constant materializations from above).
* Autovectorization capabilities that require x86_64-v3 and that have already been implemented in compilers can come into use.
* Although you could enable a more Haswell-or-later-ish cost model while generating x86_64-v1-only instructions, chances are that targeting x86_64-v3 activates a cost model that is more relevant to present-day CPUs.
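As a minimal sketch of that movemask-plus-count-trailing-zeros pairing (the function name is invented; compile with e.g. gcc -O2 -mavx2 for an x86-64-v3 build):

    #include <immintrin.h>
    #include <stdint.h>

    /* Returns the index of the first byte >= 0x80 in a 32-byte block,
     * or 32 if the whole block is ASCII. */
    int first_non_ascii32(const uint8_t *p)
    {
        __m256i block = _mm256_loadu_si256((const __m256i *)p);
        /* VPMOVMSKB: one scalar bit per byte, taken from each byte's sign
         * bit, which is set exactly for bytes >= 0x80. */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(block);
        return mask ? __builtin_ctz(mask) : 32;
    }

On an x86-64-v3 build the compiler can collapse the whole "mask ? ctz : 32" expression into a single TZCNT, since TZCNT is defined to return 32 for a zero 32-bit input.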
Posted Oct 31, 2025 16:28 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
Of course you can declare a variant optional and not wait for it, but that's a short path to second-class abandonware that people should be wary of.
Posted Oct 31, 2025 21:06 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Of course not.
If you're smart, you first leave out all noarch packages (fonts, game data, etc.), and then only rerun the build for select packages that profit (or are believed to profit) from it. In openSUSE, the x86-64-v3 set is just an extra 37 MB on the download server.
Posted Oct 31, 2025 16:24 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Of course it sucks to be the proud owner of legacy hardware when it passes into the museum category (as Linus would say).
Posted Oct 31, 2025 17:41 UTC (Fri)
by carlos.odonell (subscriber, #99737)
[Link]
We already have to consider library-based multilibs (e.g. the glibc-hwcaps delivery mechanism) and function multi-versioning (e.g. __attribute__((target_clones(...)))) in testing, and this adds to the qualification costs.
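For reference, the attribute mentioned above looks roughly like this (a minimal sketch; the dot() function is invented for illustration). GCC, and recent Clang, emit one clone per listed target plus an IFUNC resolver that picks one at load time based on the running CPU:

    /* Function multi-versioning sketch (x86-64 Linux with glibc). */
    #include <stddef.h>
    #include <stdio.h>

    __attribute__((target_clones("default", "avx2", "avx512f")))
    long long dot(const int *a, const int *b, size_t n)
    {
        long long acc = 0;
        for (size_t i = 0; i < n; i++)   /* the loop can be vectorized per clone */
            acc += (long long)a[i] * b[i];
        return acc;
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        printf("%lld\n", dot(a, b, 8));  /* prints 36 */
        return 0;
    }

This keeps a single binary working everywhere, but every multi-versioned routine multiplies the surface that has to be built and qualified, which is the cost being pointed out here.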
Something that I don't see talked about is that container image creation systems make static decisions at container build time and so you also end up with the entire combinatorial matrix as built containers if you, as an ISV or IHV, can't pin down what your customers are going to require.
In summary, I agree that raising the baseline is the simplest engineering solution to a complex problem, but the level depends on the needs of your community, your resources, and many other factors.
For example, see the recent "Architecture baseline for Forky" discussion in Debian: https://lists.debian.org/debian-release/2025/10/msg00471....
Posted Nov 2, 2025 17:54 UTC (Sun)
by pschneider1968 (guest, #178654)
[Link] (1 responses)
https://almalinux.org/blog/2025-05-27-welcoming-almalinux...
I still have an Ivy-Bridge dual Xeon server, and I'm happy that I can still run Alma 10 VMs on it. We should not have to scrap hardware that is still working fine and does its job, just because it's old.
Posted Nov 2, 2025 22:02 UTC (Sun)
by pizza (subscriber, #46)
[Link]
There are sometimes very good reasons to scrap hardware that is "still working fine".
In early 2024, I replaced a pair of Xeon dual-socket 2600v2 servers (that I got for free in 2018) with a single dual-socket Xeon 2600v4 server. The "new" machine had a higher total core count than the previous pair, each individual core was faster [1], and nearly 2 years later it has paid for itself about three times over just from the difference in my power bill.
[1] In terms of raw MHz, general IPC, and being able to utilize AVX2 optimizations (i.e. x86-64-v3). Oh, and considerably faster (yet more power-efficient) RAM too.
