Better than forcing it
Better than forcing it
Posted Nov 1, 2025 11:21 UTC (Sat) by hsivonen (subscriber, #91034)In reply to: Better than forcing it by ttuttle
Parent article: Ubuntu introduces architecture variants
Benchmarking shows what kind of improvements you get with code that's already there. It doesn't show what kind of code would get written if CPU capabilities could be statically assumed to be present.
Many people seem to think of SIMD as a thing you use on large numbers of pixels or audio samples at a time. In those cases you get workloads that run SIMD operations for all values in a buffer, the operation can be formulated as a leaf function, and run-time dispatch to multiple versions of the leaf function is cheap compared to the time it takes to compute the function itself.
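To illustrate the leaf-function case, here is a minimal run-time dispatch sketch in C. The function names are mine; `__builtin_cpu_supports` and the `target` function attribute are GCC/Clang extensions, so this is one way to write such a dispatcher, not the only one:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback. */
static uint32_t sum_bytes_scalar(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += buf[i];
    return sum;
}

/* AVX2 version; the target attribute lets GCC/Clang emit AVX2
   instructions here even when the translation unit's baseline is
   plain x86_64. */
__attribute__((target("avx2")))
static uint32_t sum_bytes_avx2(const uint8_t *buf, size_t len) {
    __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        /* _mm256_sad_epu8 against zero sums each group of 8 bytes
           into one of four 64-bit lanes. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(v, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t sum = (uint32_t)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
    for (; i < len; i++) sum += buf[i];  /* scalar tail */
    return sum;
}

/* Run-time dispatch: a CPU-feature check on each call (real code
   would cache the choice, e.g. via an ifunc). If x86_64-v3 could be
   statically assumed, the branch and the fallback would disappear
   and the AVX2 body could be inlined into callers. */
uint32_t sum_bytes(const uint8_t *buf, size_t len) {
    if (__builtin_cpu_supports("avx2"))
        return sum_bytes_avx2(buf, len);
    return sum_bytes_scalar(buf, len);
}
```

Because the whole buffer goes through one call, the cost of the feature check is amortized over the entire workload, which is why this pattern works well for the pixels-and-samples style of SIMD use.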
If you wish to use SIMD for text processing, chances are that the code runs SIMD instructions intermingled with scalar code, and the values in the buffer determine whether SIMD or scalar operations can be run. For example, you might run SIMD code for stretches of ASCII or stretches of Basic Multilingual Plane text. In these scenarios, you'll want the SIMD code to be statically inlineable, etc.
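As a sketch of this intermingled style (my own illustration, using only baseline-x86_64 SSE2 so it needs no dispatch), consider finding the length of the leading ASCII prefix of a buffer:

```c
#include <emmintrin.h>  /* SSE2: part of the x86_64 baseline */
#include <stddef.h>
#include <stdint.h>

/* SIMD handles 16-byte stretches; scalar code takes over for the
   tail and pinpoints the first non-ASCII byte. */
static size_t ascii_prefix_len(const uint8_t *buf, size_t len) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* movemask collects bit 7 of each byte lane into a scalar
           bitmask; a set bit means a non-ASCII byte. */
        int mask = _mm_movemask_epi8(v);
        if (mask != 0) {
            /* count-trailing-zeros finds the first such byte. */
            return i + (size_t)__builtin_ctz((unsigned)mask);
        }
    }
    for (; i < len; i++) {
        if (buf[i] >= 0x80) break;
    }
    return i;
}
```

Note how the SIMD and scalar parts are one control flow: whether the vector loop keeps running depends on the data, which is why factoring this into a run-time-dispatched leaf function is much less natural than in the buffer-of-pixels case.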
Unfortunately, it turns out that the Sys V calling convention is a poor fit for text processing with SIMD. Unlike the Windows x86_64 calling convention, which gives 128-bit SIMD registers the same treatment that conventional wisdom gives to scalar registers, i.e. having some callee-save registers and some caller-save registers, the Sys V calling convention treats all vector registers as caller-save. This can have unfortunate effects on inlining SIMD code into text processing code that calls other functions instead of being a leaf function: every call forces the live vector values to be spilled to the stack and reloaded afterwards.
Furthermore, if x86_64-v3 capabilities can be statically assumed to be present more often, it's more justifiable for compiler developers to spend time on autovectorizing operations that could be autovectorized under x86_64-v3 but can't or don't make sense to be autovectorized under x86_64-v1.
- -
As for what kind of benefits can be had right away by recompiling, here are some off the top of my head:
* Run-time dispatch can become static dispatch, which may enable inlining in addition to removing the cost of the dispatch per se.
* Bit manipulation intrinsics like "count trailing zeros" can compile to a single instruction. (For text manipulation with SIMD, on x86/x86_64, the "count trailing zeros" operation is typically needed to pair with the operation that creates a scalar bitmask from a vector lane mask.)
* SIMD constants that have the same non-zero value on each lane can be materialized from the instruction stream instead of having to be loaded from the constant pool.
* __builtin_shuffle can generate more efficient code. (Amazingly, very basic interleave/deinterleave, a.k.a. zip/unzip, of 8-bit lanes isn't as efficient on x86_64-v1 as it is on aarch64 or x86_64-v3.)
* Intrinsics that are nominally SSE intrinsics can generate the corresponding AVX instructions, which, even if not more efficient in themselves, are supposed to combine more efficiently with other AVX instructions (e.g. the shuffles or constant materializations from above), since the three-operand VEX encoding avoids destructive register moves.
* Autovectorization capabilities that require x86_64-v3 and that have already been implemented in compilers can come into use.
* Although you could enable a more Haswell-or-later-ish cost model while generating only x86_64-v1 instructions, chances are that targeting x86_64-v3 activates a cost model that is more relevant to present-day CPUs.
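To make the autovectorization point concrete, here is the kind of loop affected (the function is my own illustration, not from any particular codebase):

```c
#include <stddef.h>
#include <stdint.h>

/* A loop that compilers can autovectorize: count bytes with the
   high bit set (i.e. non-ASCII bytes). The C semantics are the same
   either way, but under -march=x86-64-v3 GCC and Clang can use
   256-bit AVX2 registers for this, whereas under baseline x86_64
   they are limited to 128-bit SSE2, and some idioms don't get
   vectorized at all. */
size_t count_non_ascii(const uint8_t *buf, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += buf[i] >> 7;
    return n;
}
```

The point of the last two list items is that this exact source code gets better machine code simply by raising the baseline, with no source changes and no dispatch machinery.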
