LWN: Comments on "An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)" https://lwn.net/Articles/517237/ This is a special feed containing comments posted to the individual LWN article titled "An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)". en-us Mon, 13 Oct 2025 13:20:10 +0000 Mon, 13 Oct 2025 13:20:10 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/519060/ https://lwn.net/Articles/519060/ khim <blockquote><font class="QuotedText">tl;dr: the dispatch has overhead and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.</font></blockquote> <p><a href="http://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529">GCC dispatching</a> is tied to shared libraries and has <b>no</b> overhead on top of that. Nothing. <b>Exactly zero</b>. Not one single cycle, not one single byte (except for the slow path, which is obviously not a big concern: it's called the slow path for a reason). Sure, shared libraries are slower for various reasons, yet somehow games use them anyway, just without dispatch. <b>Where is the logic</b> in your statement? Why would you build a fully separate engine when you only need specialized versions of some core parts?</p> <p>IMNSHO it's not "refuse to accept", it's "refuse to consider because of ignorance". 
Ignorance is most definitely <b>not</b> bliss in this particular case.</p> Wed, 10 Oct 2012 05:19:11 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/519014/ https://lwn.net/Articles/519014/ elanthis <div class="FormattedComment"> I'm not convinced it's useful at all.<br> <p> There are largely three classes of applications that really make use of these features:<br> <p> 1) HPC apps, where portability of a binary is not necessary since the whole system is generally compiled for a very specific set of hardware.<br> <p> 2) Media processing apps, which generally offload to the GPU these days, because even the best CPU ops for media processing are slow compared to a 300-2000 core GPU (which might even have specialized circuitry for certain media processing tasks, like video encode/decode).<br> <p> 3) Games (the parts that can't be offloaded to the GPU, or games where the graphics alone are consuming the GPU's processing bandwidth), which often mix bits and pieces of vector code into non-vector code, and in which the overhead of dispatching to another routine completely negates the advantage of using the vector instructions in the first place.<br> <p> In the last case, there are games that just compile the core engine multiple times for different common end-user CPU architectures into shared libraries, and use a small loader executable to select and load the proper shared library. This allows the vector math to be completely inlined as desired and allows the use of newer instruction sets like SSE4.1, while the game still runs on older baseline SSE3 hardware. Note that we don't generally bother supporting folks without SSE3 even on x86, since nobody who plays high-end games has a CPU old enough to lack SSE3. 
(Steam Hardware Survey: 99.18% of PC gaming users have SSE3, but only 57.41% have SSE4.1, so SSE3 is baseline supported but SSE4.1 must still be optional in apps.)<br> <p> tl;dr: the dispatch has overhead and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.<br> </div> Tue, 09 Oct 2012 18:35:38 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517510/ https://lwn.net/Articles/517510/ nix <blockquote> Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with automatic switch to pick one of them at runtime". </blockquote> You can do this with a library, a couple of #defines, and IFUNC. However, for smaller operations this is actually harmful because the hit from the indirection through the PLT exceeds the time spent in the function. (Note that since glibc itself avoids use of the PLT for internal calls, its own calls to memcpy() et al do not benefit from IFUNC at all, and always go to a maximally-generic routine. The cost of this seems to be minimal.) Tue, 25 Sep 2012 11:49:08 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517421/ https://lwn.net/Articles/517421/ dashesy <div class="FormattedComment"> And Intel's IPP, when used as a dynamic library (and upon calling &lt;i&gt;ippInit&lt;/i&gt;), dispatches based on the CPU automatically.<br> </div> Mon, 24 Sep 2012 18:29:07 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517408/ https://lwn.net/Articles/517408/ khim <p>Good point. 
Note that recent versions of GCC and GLibC support <a href="http://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529">the dispatching</a>.</p> <p>Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with automatic switch to pick one of them at runtime".</p> <p>Intel's compiler does something like this but few people use it in production...</p> Mon, 24 Sep 2012 16:39:26 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517363/ https://lwn.net/Articles/517363/ bluebugs <div class="FormattedComment"> The main problem with auto-vectorization, and with any vectorization logic done at compilation time, is that it only targets one kind of CPU. Handling the switch logic at runtime (required for any packaged application that is supposed to run on many kinds of hardware) requires a lot of hacking in the build system to produce each possible optimized loop. So very few projects do that...<br> </div> Mon, 24 Sep 2012 01:06:17 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517335/ https://lwn.net/Articles/517335/ ssam <div class="FormattedComment"> It's even better if you can get the compiler to do a good job of autovectorising your code. That way it is portable to new architectures. I found this article the other day <a href="http://locklessinc.com/articles/vectorize/">http://locklessinc.com/articles/vectorize/</a> . 
It has a few examples of how GCC autovectorises, why it sometimes thinks that it can't, and how to convince it that it can.<br> </div> Sun, 23 Sep 2012 12:22:21 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517281/ https://lwn.net/Articles/517281/ iwillneverremember <div class="FormattedComment"> Another approach is ORC <a href="http://code.entropywave.com/orc/">http://code.entropywave.com/orc/</a><br> * I haven't tried it myself.<br> <p> </div> Sat, 22 Sep 2012 06:45:09 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517279/ https://lwn.net/Articles/517279/ gmaxwell <div class="FormattedComment"> Meh. Closed-source libraries with free-software-incompatible licenses. Not my idea of stellar advice.<br> </div> Sat, 22 Sep 2012 01:45:27 +0000 An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal) https://lwn.net/Articles/517257/ https://lwn.net/Articles/517257/ dashesy A really nice article that I wished for a few years ago. Although intrinsics avoid going all the way down to assembly, there are still some gotchas, like placing <i>__builtin_ia32_emms</i> here and there to avoid some floating-point corner cases. Also, in its simplest usage it requires a fixed vector length for the processing path. That is not the best approach, since new CPUs come with larger vector support; this emphasizes the need for dynamic dispatching based on the CPU (or, better, a <a href=http://lwn.net/Articles/489337/>mixed approach</a> that also considers the GPU). <p> Overall, the advice given in the paragraph before the summary seems to be the most practical: <blockquote><font class=QuotedText>Fourth, don't re-invent the wheel. Intel, Freescale and ARM all offer libraries and code samples to help you get the most from their processors. 
These include Intel's Integrated Performance Primitives, Freescale's libmotovec and ARM's OpenMAX.</font></blockquote> </p> Fri, 21 Sep 2012 21:34:25 +0000