An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Sep 24, 2012 1:06 UTC (Mon) by bluebugs (subscriber, #71022)
Parent article: An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

The main problem with auto-vectorization, and with any vectorization logic done at compile time, is that it only targets one kind of CPU. Handling the switch logic at runtime (required for any packaged application that is supposed to run on many kinds of hardware) requires a lot of hacks in the build system to produce each possible optimized loop. So very few projects do that...



An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Sep 24, 2012 16:39 UTC (Mon) by khim (subscriber, #9252) [Link]

Good point. Note that recent versions of GCC and glibc support this kind of dispatching.

Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with an automatic switch to pick one of them at runtime".

Intel's compiler does something like this, but few people use it in production...
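
A minimal sketch of what such an attribute could look like, modeled on the target_clones attribute that newer GCC releases provide (assumptions: GCC 6 or later, an x86 target, an IFUNC-capable glibc; the function name and loop are invented for illustration):

    /* GCC emits one clone of the function per listed target plus a
     * resolver that picks among them when the symbol is first bound. */
    #include <stddef.h>

    __attribute__((target_clones("avx2", "sse4.1", "default")))
    void scale_f32(float *dst, const float *src, float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

Each clone's body is plain C, so the compiler is free to auto-vectorize it for that clone's target; callers just call scale_f32() and the choice of clone is made once, at symbol-binding time.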

An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Sep 24, 2012 18:29 UTC (Mon) by dashesy (guest, #74652) [Link]

And Intel's IPP, when used as a dynamic library (and after calling ippInit()), dispatches based on the CPU automatically.

An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Sep 25, 2012 11:49 UTC (Tue) by nix (subscriber, #2304) [Link]

Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with an automatic switch to pick one of them at runtime".
You can do this with a library, a couple of #defines, and IFUNC. However, for smaller operations this is actually harmful because the hit from the indirection through the PLT exceeds the time spent in the function. (Note that since glibc itself avoids use of the PLT for internal calls, its own calls to memcpy() et al do not benefit from IFUNC at all, and always go to a maximally-generic routine. The cost of this seems to be minimal.)
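
A rough sketch of that IFUNC arrangement, assuming a GCC new enough to provide __builtin_cpu_supports(); the function names here are invented for illustration:

    #include <stddef.h>

    /* Generic fallback, compiled for the baseline ISA. */
    static void add_f32_generic(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* AVX variant; a real library would likely build this in its own
     * file with -mavx rather than use the target attribute. */
    __attribute__((target("avx")))
    static void add_f32_avx(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Resolver: runs once, when the dynamic linker binds the symbol. */
    static void (*resolve_add_f32(void))(float *, const float *, const float *, size_t)
    {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx") ? add_f32_avx : add_f32_generic;
    }

    /* The exported symbol; every call is an indirect jump through the
     * PLT to whichever implementation the resolver returned -- that
     * indirection is the cost discussed above. */
    void add_f32(float *dst, const float *a, const float *b, size_t n)
        __attribute__((ifunc("resolve_add_f32")));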

An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Oct 9, 2012 18:35 UTC (Tue) by elanthis (guest, #6227) [Link]

I'm not convinced it's useful at all.

There are largely three classes of applications that really make use of these features:

1) HPC apps, where portability of a binary is not necessary since the whole system is generally compiled for a very specific set of hardware.

2) Media processing apps, which generally offload to the GPU these days, because even the best CPU ops for media processing are slow compared to a 300-2000 core GPU (which might even have specialized circuitry for certain media processing tasks, like video encode/decode).

3) Games (the parts that can't be offloaded to the GPU, or games where the graphics alone are consuming the GPU's processing bandwidth), which often mix bits and pieces of vector code in with non-vector code, and in which the overhead of dispatching to another routine completely negates the advantage of using the vector instructions in the first place.

In the last case, there are games that just compile the core engine multiple times for different common end-user CPU architectures into shared libraries, and use a small loader executable to select and load the proper one. This allows the vector math to be completely inlined as desired and lets the engine use newer instruction sets like SSE4.1, while the game still runs on older baseline SSE3 hardware. Note that we don't generally bother supporting folks without SSE3 even on x86, since nobody who plays high-end games has a CPU old enough to lack SSE3. (Steam Hardware Survey: 99.18% of PC gaming users have SSE3, but only 57.41% have SSE4.1, so SSE3 is the supported baseline while SSE4.1 must still be optional in apps.)
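
A hypothetical version of such a loader, assuming a GCC that provides __builtin_cpu_supports() (the library names, the engine_main entry point, and the ISA tiers are all invented for illustration; link with -ldl):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        const char *lib = "libengine_sse3.so";      /* baseline build */

        __builtin_cpu_init();
        if (__builtin_cpu_supports("sse4.1"))
            lib = "libengine_sse41.so";             /* build for newer CPUs */

        void *handle = dlopen(lib, RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* Hand control to whichever engine build was loaded; everything
         * inside it can be compiled and inlined for its target ISA. */
        int (*engine_main)(void) = (int (*)(void))dlsym(handle, "engine_main");
        if (!engine_main) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            return 1;
        }
        return engine_main();
    }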

tl;dr: the dispatch has overhead and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.

An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)

Posted Oct 10, 2012 5:19 UTC (Wed) by khim (subscriber, #9252) [Link]

tl;dr: the dispatch has overhead and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.

GCC dispatching is tied to shared libraries and has no overhead on top of that. Nothing. Exactly zero. Not one single cycle, not one single byte (except for the slow path, which is obviously not a big concern: it's called the slow path for a reason). Sure, shared libraries are slower for various reasons, yet somehow games use them - but they don't use dispatch. Where is the logic in that? Why would you build a fully separate engine when you could just build specialized versions of some core parts?

IMNSHO it's not "refuse to accept", it's "refuse to consider because of ignorance". Ignorance is most definitely not bliss in this particular case.

