Better than forcing it
Posted Oct 31, 2025 14:47 UTC (Fri) by archaic (subscriber, #111970)
Parent article: Ubuntu introduces architecture variants
Posted Oct 31, 2025 15:02 UTC (Fri)
by Kamiccolo (subscriber, #95159)
[Link] (14 responses)
Posted Oct 31, 2025 16:00 UTC (Fri)
by muep (subscriber, #86754)
[Link] (11 responses)
Posted Oct 31, 2025 16:04 UTC (Fri)
by ttuttle (subscriber, #51118)
[Link] (10 responses)
Posted Oct 31, 2025 18:02 UTC (Fri)
by WolfWings (subscriber, #56790)
[Link] (8 responses)
https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_te... goes into that from the perspective of AVX512 where the problem is even more exaggerated.
And at this point a lot of libraries that can benefit from AVX just do CPU detection and ship multiple code paths instead, so this side-distro is a nice idea, but I suspect it's not going to help much. The infrastructure to support sub-arches, so that -v4 packages and the like could be published, would be nice, but how much it actually helps remains to be seen.
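For reference, the run-time dispatch pattern described above tends to look roughly like this minimal sketch (the sum_* kernels and their names are hypothetical stand-ins for real optimized routines; GCC and Clang provide __builtin_cpu_supports() on x86):

    /* Minimal sketch of per-CPU run-time dispatch, roughly what many
     * libraries do by hand (often via IFUNC resolvers in practice). */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in kernels; real libraries put their hand-written
     * AVX2/AVX-512 loops here.  All three are plain C so the sketch links. */
    static uint64_t sum_scalar(const uint8_t *p, size_t n)
    { uint64_t s = 0; for (size_t i = 0; i < n; i++) s += p[i]; return s; }
    static uint64_t sum_avx2(const uint8_t *p, size_t n)   { return sum_scalar(p, n); }
    static uint64_t sum_avx512(const uint8_t *p, size_t n) { return sum_scalar(p, n); }

    /* The dispatch itself: test a feature table that libgcc fills in once
     * at startup, and pick a code path accordingly. */
    static uint64_t sum_bytes(const uint8_t *p, size_t n)
    {
        if (__builtin_cpu_supports("avx512bw")) return sum_avx512(p, n);
        if (__builtin_cpu_supports("avx2"))     return sum_avx2(p, n);
        return sum_scalar(p, n);
    }

    int main(void)
    {
        uint8_t buf[4] = {1, 2, 3, 4};
        printf("%llu\n", (unsigned long long)sum_bytes(buf, sizeof buf));
        return 0;
    }

Shipping -v3/-v4 packages would let the compiler assume the higher baseline instead of paying for this branch (and losing inlining) at every dispatched call.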
Posted Nov 1, 2025 2:55 UTC (Sat)
by jreiser (subscriber, #11027)
[Link] (7 responses)
AVX-512 is not worth it for the vast majority of packages or users. It pays off when the computation mix is at least 60% linear algebra or crypto; otherwise it is not worth the effort or the cost in storage space, build time, and administrative morass.
Posted Nov 1, 2025 6:05 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link] (5 responses)
The BMI sub-extensions around AVX2 added a TON of fine-grained data-manipulation instructions down to the bit level (thus the name), and AVX512 added more advanced masking features plus selective packing on write with VPCOMPRESS, which takes non-contiguous bytes out of the 512-bit register and stores them as a single variable-length contiguous write.
So even just dealing with 32-byte blocks of data on something as simple as adding escape backslashes to a string, or doing colorspace conversion, can benefit almost fully.
AVX512 really straddles the line between CPU SIMD and what you'd expect more from GPU compute shaders.
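As a tiny illustration of that bit-level toolbox (a sketch, not taken from the comment above): BMI2's PEXT gathers the bits selected by a mask down to the low end of a register, and PDEP scatters them back out, which is exactly the kind of fine-grained packing being described.

    /* BMI2 bit gather/scatter sketch.  Compile with e.g. gcc -O2 -mbmi2 */
    #include <immintrin.h>
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t src  = 0xDEADBEEF;  /* some input bits                   */
        uint64_t mask = 0xF0F0F0F0;  /* keep the high nibble of each byte */

        uint64_t packed    = _pext_u64(src, mask);    /* gather masked bits low     */
        uint64_t scattered = _pdep_u64(packed, mask); /* deposit them back in place */

        printf("packed    = 0x%" PRIx64 "\n", packed);    /* 0xdabe     */
        printf("scattered = 0x%" PRIx64 "\n", scattered); /* 0xd0a0b0e0 */
        return 0;
    }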
Posted Nov 1, 2025 7:27 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Nov 1, 2025 16:53 UTC (Sat)
by fishface60 (subscriber, #88700)
[Link]
Posted Nov 1, 2025 22:16 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link]
https://www.intel.com/content/www/us/en/docs/intrinsics-g...
For a simple but sufficient example of the escaping-strings idea, and of how you can POPCNT the mask used for VPCOMPRESS to get the byte count written, https://lemire.me/blog/2022/09/14/escaping-strings-faster... is a pretty decent point of reference.
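This is not Lemire's exact escaping routine, but a minimal sketch of the same mask, VPCOMPRESSB, POPCNT pattern (here just dropping spaces from one 64-byte block; the function name is made up, and it needs AVX-512BW plus AVX-512VBMI2, which x86-64-v4 does not guarantee):

    /* Compile with e.g. gcc -O2 -march=icelake-server */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copies one 64-byte block from in to out, dropping all spaces,
     * and returns how many bytes were written. */
    size_t compress_out_spaces(const uint8_t *in, uint8_t *out)
    {
        __m512i block = _mm512_loadu_si512(in);
        /* One mask bit per byte: set for bytes we want to keep. */
        __mmask64 keep = _mm512_cmpneq_epi8_mask(block, _mm512_set1_epi8(' '));
        /* VPCOMPRESSB packs the kept bytes contiguously at the low end. */
        __m512i packed = _mm512_maskz_compress_epi8(keep, block);
        _mm512_storeu_si512(out, packed);
        /* POPCNT of the mask is the number of bytes actually produced. */
        return (size_t)_mm_popcnt_u64(keep);
    }

Whatever bytes are kept or inserted, the popcount of the mask gives the variable write length, which is the trick being referred to.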
Posted Nov 1, 2025 14:32 UTC (Sat)
by khim (subscriber, #9252)
[Link] (1 responses)
AVX512 would have been great if Intel hadn't bombed its introduction so badly. Today you can expect AVX512 from AMD consistently, but not from Intel. This is extremely stupid, but hey, that's Intel for you.
Posted Nov 1, 2025 22:18 UTC (Sat)
by WolfWings (subscriber, #56790)
[Link]
AMD's implementation? 1 VP2INTERSECT per clock cycle as of Zen5, where Intel's was over 25 clock cycles.
Posted Nov 1, 2025 19:15 UTC (Sat)
by thoughtpolice (subscriber, #87455)
[Link]
Amdahl's law doesn't really mean anything here, because the most basic way of applying it is measuring a _single_ enhancement versus the system baseline at a single point in time. But giving these instructions more features, making them more widely applicable, and improving their speed expands the number of cases where they can be applied beneficially. Thus, the overall proportion of the system where improvements are possible has increased. That fact is not captured by the basic application of the law.
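To make that distinction concrete, here is a tiny illustrative calculation (the 10%/40% figures are invented for illustration, not measurements): with the same 4x speedup on the accelerated part, what changes the outcome is how much of the workload the instructions can touch at all.

    /* Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the
     * fraction of run time that benefits and s is the local speedup. */
    #include <stdio.h>

    static double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

    int main(void)
    {
        printf("p=0.10, s=4: %.2fx overall\n", amdahl(0.10, 4.0)); /* ~1.08x */
        printf("p=0.40, s=4: %.2fx overall\n", amdahl(0.40, 4.0)); /* ~1.43x */
        return 0;
    }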
The reality is that AVX-512 is extremely nice to use but Intel completely fucked up delivering it to client systems, from what I can tell, due to their weird dysfunction and total addiction to product segmentation. We could have already been long past worrying about it if not for that.
Posted Nov 1, 2025 11:21 UTC (Sat)
by hsivonen (subscriber, #91034)
[Link]
Benchmarking shows what kind of improvements you get with code that's already there. It doesn't show what kind of code would get written if CPU capabilities could be statically assumed to be present.
Many people seem to think of SIMD as a thing you use on large numbers of pixels or audio samples at a time. In those cases you get workloads that run SIMD operations for all values in a buffer, the operation can be formulated as a leaf function, and run-time dispatch to multiple versions of the leaf function is cheap compared to the time it takes to compute the function itself.
If you wish to use SIMD for text processing, chances are that the code would run some SIMD instructions intermingled with scalar code, and the values in the buffer will determine whether SIMD or scalar operations can be run. For example, you might run SIMD code over stretches of ASCII or stretches of Basic Multilingual Plane text. In these scenarios, you'll want the SIMD code to be statically inlineable, etc.
Unfortunately, it turns out that the Sys V calling convention is a poor fit for text processing with SIMD. Unlike the Windows x86_64 calling convention, which gives 128-bit SIMD registers the same treatment that conventional wisdom gives to scalar registers, i.e. having some callee-save registers and some caller-save registers, the Sys V calling convention treats all vector registers as caller-save. This can have unfortunate effects on inlining SIMD code into text processing code that calls other functions instead of being a leaf function.
Furthermore, if x86_64-v3 capabilities can be statically assumed to be present more often, it's more justifiable for compiler developers to spend time on autovectorizing operations that could be autovectorized under x86_64-v3 but can't or don't make sense to be autovectorized under x86_64-v1.
- -
As for what kind of benefits can be had right away by recompiling, here are some off the top of my head:
* Run-time dispatch can become static dispatch, which may enable inlining in addition to removing the cost of the dispatch per se.
* Bit manipulation intrinsics like "count trailing zeros" can compile to a single instruction. (For text manipulation with SIMD, on x86/x86_64, the "count trailing zeros" operation is typically needed to pair with the operation that creates a scalar bitmask from a vector lane mask; see the sketch after this list.)
* SIMD constants that have the same non-zero value on each lane can be materialized from the instruction stream instead of having to be loaded from the constant pool.
* __builtin_shuffle can generate more efficient code. (Amazingly, very basic interleave/deinterleave aka. zip/unzip of 8-bit lanes isn't as efficient on x86_64-v1 as it is on aarch64 or x86_64-v3.)
* Intrinsics that are nominally SSE intrinsics can generate corresponding AVX instructions, which, if not more efficient in itself, is supposed to combine more efficiently with other AVX instructions (e.g. the shuffles or constant materializations from above).
* Autovectorization capabilities that require x86_64-v3 and that have already been implemented in compilers can come into use.
* Although you could enable a more Haswell-or-later-ish cost model while generating x86_64-v1-only instructions, chances are that targeting x86_64-v3 activates a cost model that is more relevant to present-day CPUs.
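As a minimal sketch of that movemask-plus-count-trailing-zeros pairing (the function name is invented; compile with e.g. gcc -O2 -mavx2 for an x86-64-v3 build):

    #include <immintrin.h>
    #include <stdint.h>

    /* Returns the index of the first byte >= 0x80 in a 32-byte block,
     * or 32 if the whole block is ASCII. */
    int first_non_ascii32(const uint8_t *p)
    {
        __m256i block = _mm256_loadu_si256((const __m256i *)p);
        /* VPMOVMSKB: one scalar bit per byte, taken from each byte's sign
         * bit, which is set exactly for bytes >= 0x80. */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(block);
        return mask ? __builtin_ctz(mask) : 32;
    }

On an x86-64-v3 build the compiler can collapse the whole "mask ? ctz : 32" expression into a single TZCNT, since TZCNT is defined to return 32 for a zero 32-bit input.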
Posted Oct 31, 2025 16:28 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
Of course you can declare a variant optional and not wait for it, but that's a short path to second-class abandonware that people should be wary of.
Posted Oct 31, 2025 21:06 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Of course not.
If you're smart, you first leave out all noarch packages (fonts, game data, etc.), and then only rerun the build for select packages that profit (or are believed to profit) from it. In openSUSE, the x86-64-v3 set is just an extra 37 MB on the download server.
Posted Oct 31, 2025 16:24 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Of course it sucks to be the proud owner of legacy hardware when it passes into the museum category (as Linus would say).
Posted Oct 31, 2025 17:41 UTC (Fri)
by carlos.odonell (subscriber, #99737)
[Link]
We already have to consider library-based multilibs (e.g. the glibc-hwcaps delivery mechanism) and function multi-versioning (e.g. __attribute__((target_clones(...)))) in testing, and this adds to the qualification costs.
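For reference, the attribute mentioned above looks roughly like this (a minimal sketch; the dot() function is invented for illustration). GCC, and recent Clang, emit one clone per listed target plus an IFUNC resolver that picks one at load time based on the running CPU:

    /* Function multi-versioning sketch (x86-64 Linux with glibc). */
    #include <stddef.h>
    #include <stdio.h>

    __attribute__((target_clones("default", "avx2", "avx512f")))
    long long dot(const int *a, const int *b, size_t n)
    {
        long long acc = 0;
        for (size_t i = 0; i < n; i++)   /* the loop can be vectorized per clone */
            acc += (long long)a[i] * b[i];
        return acc;
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        printf("%lld\n", dot(a, b, 8));  /* prints 36 */
        return 0;
    }

This keeps a single binary working everywhere, but every multi-versioned routine multiplies the surface that has to be built and qualified, which is the cost being pointed out here.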
Something that I don't see talked about is that container image creation systems make static decisions at container build time and so you also end up with the entire combinatorial matrix as built containers if you, as an ISV or IHV, can't pin down what your customers are going to require.
In summary, I agree that raising the baseline is the simplest engineering solution to a complex problem, but the level depends on the needs of your community, your resources, and many other factors.
For example, see the recent "Architecture baseline for Forky" discussion in Debian: https://lists.debian.org/debian-release/2025/10/msg00471....
Posted Nov 2, 2025 17:54 UTC (Sun)
by pschneider1968 (guest, #178654)
[Link] (1 responses)
https://almalinux.org/blog/2025-05-27-welcoming-almalinux...
I still have an Ivy-Bridge dual Xeon server, and I'm happy that I can still run Alma 10 VMs on it. We should not have to scrap hardware that is still working fine and does its job, just because it's old.
Posted Nov 2, 2025 22:02 UTC (Sun)
by pizza (subscriber, #46)
[Link]
There are sometimes very good reasons to scrap hardware that is "still working fine".
In early 2024, I replaced a pair of Xeon dual-socket 2600v2 servers (that I got for free in 2018) with a single dual-socket Xeon 2600v4 server. The "new" machine had a higher total core count than the previous pair, each individual core was faster [1], and nearly 2 years later it has paid for itself about three times over just from the difference in my power bill.
[1] In terms of raw MHz, general IPC, and being able to utilize AVX2 optimizations (i.e. x86-64-v3). Oh, and considerably faster (yet more power-efficient) RAM too.
