
Amdahl's law, 55 years later

Posted Nov 1, 2025 2:55 UTC (Sat) by jreiser (subscriber, #11027)
In reply to: Better than forcing it by WolfWings
Parent article: Ubuntu introduces architecture variants

https://en.wikipedia.org/wiki/Amdahl%27s_law

AVX-512 is not worth it for the vast majority of packages or users. It pays off when the computation mix is at least 60% linear algebra or crypto; otherwise it is not worth the effort or the cost in storage space, build time, and administrative morass.
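As a rough, hedged illustration of that threshold (the 4x speedup on the vectorised portion below is an assumption for the sake of the arithmetic, not a number from this thread), Amdahl's law bounds the whole-system gain by the fraction p that can be accelerated:

    S(p, s) = \frac{1}{(1 - p) + p/s}

    S(0.6, 4) = \frac{1}{0.4 + 0.15}  \approx 1.8   (a 60% linear-algebra/crypto mix)
    S(0.1, 4) = \frac{1}{0.9 + 0.025} \approx 1.08  (a more typical package)

Unless the accelerable fraction is large, the whole-system speedup stays small no matter how good the vector unit is.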



Amdahl's law, 55 years later

Posted Nov 1, 2025 6:05 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (5 responses)

That's a common misconception from the days when all the SIMD stuff was just basic parallel-math functions.

The BMI extensions that arrived alongside AVX2 added a TON of fine-grained data-manipulation instructions down to the bit level (thus the name), and AVX512 added more advanced masking features plus selective packing on write with VPCOMPRESS, which takes non-contiguous bytes selected from the 512-bit register and writes them out as a single contiguous, variable-length memory write.

So even when just dealing with 32-byte blocks of data, something as simple as adding escape backslashes to a string or doing colorspace conversion can benefit almost fully.

AVX512 really straddles the line between CPU SIMD and what you'd expect from GPU compute shaders.
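A minimal sketch of the bit-level manipulation mentioned above, assuming BMI2 is available (built with something like -mbmi2); the helper names are illustrative, not from any library:

    /* BMI2's PEXT gathers the bits selected by a mask into the low end of
     * the result; PDEP scatters them back out again. Here PEXT pulls bit 0
     * of each byte of a 64-bit word into a compact 8-bit value, the kind of
     * bit shuffling that otherwise takes a chain of shifts, masks and ORs. */
    #include <immintrin.h>
    #include <stdint.h>

    #define LOW_BIT_OF_EACH_BYTE 0x0101010101010101ULL  /* bit 0 of all 8 bytes */

    static inline uint8_t gather_low_bits(uint64_t bytes)
    {
        return (uint8_t)_pext_u64(bytes, LOW_BIT_OF_EACH_BYTE);
    }

    static inline uint64_t scatter_low_bits(uint8_t bits)
    {
        return _pdep_u64(bits, LOW_BIT_OF_EACH_BYTE);
    }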

Amdahl's law, 55 years later

Posted Nov 1, 2025 7:27 UTC (Sat) by epa (subscriber, #39769) [Link] (2 responses)

Adding escape backslashes to a string… it would take a diabolically cunning compiler to vectorize that code. Or should we write assembly language for it? Is there a better, more expressive language than C that can be compiled to efficient vector code, yet is safer than assembly language?

An example of vectorisation helping a string operation

Posted Nov 1, 2025 16:53 UTC (Sat) by fishface60 (subscriber, #88700) [Link]

I misremembered https://purplesyringa.moe/blog/i-sped-up-serde-json-strin... as doing manual vectorisation, because it does some similar tricks within 32-bit registers. So it isn't relevant to exactly how to do it, but it may be of interest for how vectorisation speeds up string encoding and decoding.

Amdahl's law, 55 years later

Posted Nov 1, 2025 22:16 UTC (Sat) by WolfWings (subscriber, #56790) [Link]

This is where the 'intrinsic' pseudo-functions that Intel created for compilers greatly simplify the code: you don't need to break out raw assembly, and the compiler can still deal with register allocation and intermix its own code with yours.

https://www.intel.com/content/www/us/en/docs/intrinsics-g...

For a simple but sufficient example of the escaping-strings idea, and of how you can POPCNT the mask used for VPCOMPRESS to get the byte count written, https://lemire.me/blog/2022/09/14/escaping-strings-faster... is a pretty decent point of reference.
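A minimal sketch of that compress-store-plus-POPCNT pattern, assuming AVX512BW and AVX512_VBMI2 (built with something like -mavx512bw -mavx512vbmi2); it strips bytes rather than inserting escapes, because insertion needs the expand-side tricks covered in the linked post, and the function name is illustrative:

    /* Drop every '\r' byte from src, writing the survivors contiguously to
     * dst (which must be at least len bytes). VPCOMPRESSB, via
     * _mm512_mask_compressstoreu_epi8, packs the bytes selected by the mask
     * into a contiguous store; POPCNT of that mask says how many bytes were
     * written, so the output pointer advances by a variable amount. */
    #include <immintrin.h>
    #include <stddef.h>

    size_t strip_cr(const char *src, size_t len, char *dst)
    {
        const __m512i cr = _mm512_set1_epi8('\r');
        size_t in = 0, out = 0;

        while (in < len) {
            size_t chunk = len - in;
            __mmask64 valid = (chunk >= 64) ? ~(__mmask64)0
                                            : (((__mmask64)1 << chunk) - 1);
            /* Masked load so the tail never reads past the end of src. */
            __m512i block = _mm512_maskz_loadu_epi8(valid, src + in);
            /* Keep every valid byte that is not '\r'. */
            __mmask64 keep = _mm512_cmpneq_epi8_mask(block, cr) & valid;
            _mm512_mask_compressstoreu_epi8(dst + out, keep, block);
            out += (size_t)_mm_popcnt_u64(keep);
            in += (chunk >= 64) ? 64 : chunk;
        }
        return out;  /* number of bytes actually written */
    }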

Amdahl's law, 55 years later

Posted Nov 1, 2025 14:32 UTC (Sat) by khim (subscriber, #9252) [Link] (1 responses)

AVX512 would have been great if Intel hadn't bombed its introduction so badly. Today you can expect AVX512 from AMD in consistent fashion, but not from Intel.

This is extremely stupid, but hey, that's Intel for you.

Amdahl's law, 55 years later

Posted Nov 1, 2025 22:18 UTC (Sat) by WolfWings (subscriber, #56790) [Link]

I mean... they also introduced a niche instruction in AVX-512 that they implemented SO BADLY that other instructions could reproduce the same effect even faster, to the point that Intel has deprecated the instruction.

AMD's implementation? One VP2INTERSECT per clock cycle as of Zen 5, where Intel's was over 25 clock cycles.

Amdahl's law, 55 years later

Posted Nov 1, 2025 19:15 UTC (Sat) by thoughtpolice (subscriber, #87455) [Link]

AVX-512 adds a large set of useful features that work at all vector lengths, which expands its total applicable use cases. Some instructions are simply more powerful and more general, at lower cost in fewer cycles. In existing code, using these new instructions can make some loops or fast paths much faster, think something like a 20% improvement. And this is practically free performance: the silicon is there and the cost has already been paid for you; the expanded register file and ALU area is not the dominant cost of the die. Most of the complaints about Haswell-era AVX implementation defects, like all-core throttling, haven't been relevant since Ice Lake (google "intel cpu core power licensing"). Modern AMD systems don't have this issue either.
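As a hedged illustration of the "works at all vector lengths" point: with AVX-512VL the new masking applies to ordinary 256-bit (and 128-bit) registers too. A sketch, assuming AVX512F plus AVX512VL (built with something like -mavx512f -mavx512vl); the function is illustrative and, for brevity, expects a multiple of eight elements:

    /* Sum only the positive elements of an int32 array. The per-lane mask
     * from the compare feeds a masked add directly, replacing the usual
     * compare+blend sequence, and nothing here touches a 512-bit register. */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    int64_t sum_positive(const int32_t *v, size_t n)
    {
        const __m256i zero = _mm256_setzero_si256();
        __m256i acc = zero;

        for (size_t i = 0; i < n; i += 8) {
            __m256i x = _mm256_loadu_si256((const __m256i *)(v + i));
            __mmask8 pos = _mm256_cmpgt_epi32_mask(x, zero);  /* lanes with x > 0 */
            acc = _mm256_mask_add_epi32(acc, pos, acc, x);    /* add only those lanes */
        }

        /* Reduce the eight 32-bit partial sums (overflow ignored in this sketch). */
        int32_t lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, acc);
        int64_t sum = 0;
        for (int i = 0; i < 8; i++)
            sum += lanes[i];
        return sum;
    }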

Amdahl's law doesn't really mean anything here, because the most basic way of applying it is to measure a _single_ enhancement against the system baseline at a single point in time. But making these instructions more useful, giving them more features, making them more widely applicable, and improving their speed expands the number of cases where they can be applied beneficially. Thus the overall proportion of the system where improvements are possible has increased, and that is not captured by the basic application of the law.
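A hedged numeric illustration of that last point, reusing the formula from the first comment with the same assumed 4x speedup on the accelerated parts: widening the set of applicable cases amounts to growing p, and the whole-system bound grows with it.

    S(0.2, 4) = \frac{1}{0.8 + 0.05}  \approx 1.18
    S(0.5, 4) = \frac{1}{0.5 + 0.125} = 1.6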

The reality is that AVX-512 is extremely nice to use but Intel completely fucked up delivering it to client systems, from what I can tell, due to their weird dysfunction and total addiction to product segmentation. We could have already been long past worrying about it if not for that.

