There are largely three classes of applications that really make use of these features:
1) HPC apps, where portability of a binary is not necessary since the whole system is generally compiled for a very specific set of hardware.
2) Media processing apps, which generally offload to the GPU these days, because even the best CPU ops for media processing are slow compared to a 300-2000 core GPU (which might even have specialized circuitry for certain media processing tasks, like video encode/decode).
3) Games (the parts that can't be offloaded to the GPU, or games where the graphics alone are consuming the GPU's processing bandwidth), which often mix bits and pieces of vector code with non-vector code, and in which the overhead of dispatching to another routine completely negates the advantage of using the vector instructions in the first place.
In the last case, some games just compile the core engine multiple times, once per common end-user CPU architecture level, into separate shared libraries, and use a small loader executable to select and load the proper one. This lets the vector math be completely inlined wherever desired and lets the game use newer instruction sets like SSE4.1 when they are present, while still running on older baseline SSE3 hardware. Note that we don't generally bother supporting folks without SSE3 even on x86, since nobody who plays high-end games has a CPU old enough to lack SSE3. (Steam Hardware Survey: 99.18% of PC gaming users have SSE3, but only 57.41% have SSE4.1, so SSE3 is baseline supported but SSE4.1 must still be optional in apps.)
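A minimal sketch of the selection step such a loader might perform, assuming GCC or Clang on x86 (the library names and function names here are hypothetical, not taken from any particular game):

```c
#include <stddef.h>

/* Hypothetical file names for two builds of the same engine. */
const char *const ENGINE_SSE41 = "libengine_sse41.so";
const char *const ENGINE_SSE3  = "libengine_sse3.so";

/* Pure selection logic: return the best build the CPU can run,
 * or NULL if the CPU is below the SSE3 baseline. */
const char *pick_engine_library(int has_sse3, int has_sse41)
{
    if (has_sse41)
        return ENGINE_SSE41;
    if (has_sse3)
        return ENGINE_SSE3;
    return NULL;
}

/* Query the running CPU once at startup. The __builtin_cpu_* calls are
 * GCC/Clang builtins that only exist on x86 targets, so other targets
 * fall through to NULL in this sketch. */
const char *pick_engine_for_this_cpu(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return pick_engine_library(__builtin_cpu_supports("sse3"),
                               __builtin_cpu_supports("sse4.1"));
#else
    return NULL;
#endif
}
```

The loader would then hand the chosen name to dlopen() (or LoadLibrary() on Windows) and look up the engine's entry point. The CPU check happens exactly once, at startup, so none of the inlined vector code inside the engine pays any per-call dispatch cost.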
tl;dr: runtime dispatch has overhead, and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.
Copyright © 2018, Eklektix, Inc.