An Introduction to GCC Compiler Intrinsics in Vector Processing (Linux Journal)
GCC offers an intermediate between assembly and standard C that can get you more speed and processor features without having to go all the way to assembly language: compiler intrinsics. This article discusses GCC's compiler intrinsics, emphasizing vector processing on three platforms: X86 (using MMX, SSE and SSE2); Motorola, now Freescale (using Altivec); and ARM Cortex-A (using Neon). We conclude with some debugging tips and references.
Posted Sep 21, 2012 21:34 UTC (Fri)
by dashesy (guest, #74652)
[Link] (2 responses)
Overall, the advice given in the paragraph before the summary seems to be the most practical one:
A really nice article I wished for a few years ago. Although intrinsics save you from going all the way to assembly, there are still some gotchas, like placing __builtin_ia32_emms here and there to avoid some floating-point corner cases. Also, in its simplest usage this approach requires a fixed vector length in the processing path. That is not ideal, since new CPUs come with wider vector support; it emphasizes the need for dynamic dispatch based on the CPU (or, better, a mixed approach that also considers the GPU).
Fourth, don't re-invent the wheel. Intel, Freescale and ARM all offer libraries and code samples to help you get the most from their processors. These include Intel's Integrated Performance Primitives, Freescale's libmotovec and ARM's OpenMAX.
Posted Sep 22, 2012 1:45 UTC (Sat)
by gmaxwell (guest, #30048)
[Link]
Posted Sep 22, 2012 6:45 UTC (Sat)
by iwillneverremember (guest, #65704)
[Link]
Posted Sep 23, 2012 12:22 UTC (Sun)
by ssam (guest, #46587)
[Link]
Posted Sep 24, 2012 1:06 UTC (Mon)
by bluebugs (guest, #71022)
[Link] (5 responses)
Posted Sep 24, 2012 16:39 UTC (Mon)
by khim (subscriber, #9252)
[Link] (4 responses)
Good point. Note that recent versions of GCC and GLibC support the dispatching. Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with automatic switch to pick one of them at runtime". Intel's compiler does something like this, but few people use it in production...
Posted Sep 24, 2012 18:29 UTC (Mon)
by dashesy (guest, #74652)
[Link]
Posted Sep 25, 2012 11:49 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Oct 9, 2012 18:35 UTC (Tue)
by elanthis (guest, #6227)
[Link] (1 response)
There are largely three classes of applications that really make use of these features:
1) HPC apps, where portability of a binary is not necessary since the whole system is generally compiled for a very specific set of hardware.
2) Media processing apps, which generally offload to the GPU these days, because even the best CPU ops for media processing are slow compared to a 300-2000 core GPU (which might even have specialized circuitry for certain media processing tasks, like video encode/decode).
3) Games (the parts that can't be offloaded to the GPU, or for games where the graphics alone are consuming the GPU's processing bandwidth), which are often mixing in bits and pieces of vector code with non-vector code and in which the overload of dispatch to another routine completely negates the advantage of using the vector instructions in the first place.
In the last case, there are games that just compile the core engine multiple times for different common end-user CPU architectures into shared libraries, and use a small loader executable to select and load the proper shared library. This allows the vector math to be completely inlined as desired while still allowing use with newer instruction sets like SSE4.1, while the game still runs on older baseline SSE3 hardware. Note that we don't generally bother supporting folks without SSE3 even on x86, since nobody who plays high-end games has a CPU old enough to lack SSE3. (Steam Hardware Survey: 99.18% of PC gaming users have SSE3, but only 57.41% have SSE4.1, so SSE3 is baseline supported but SSE4.1 must still be optional in apps.)
tl;dr: the dispatch has overhead and the folks who need vector math either don't care about CPU portability or refuse to accept that overhead.
Posted Oct 10, 2012 5:19 UTC (Wed)
by khim (subscriber, #9252)
[Link]
GCC dispatching is tied to shared libraries and has no overhead on top of that. Nothing. Exactly zero. Not one single cycle, not one single byte (except for the slow path, which is obviously not a big concern: it's called the slow path for a reason). Sure, shared libraries are slower for various reasons, yet somehow games use them anyway, just without dispatch. Where is the logic in that? Why build a fully separate engine when you could create specialized versions of just the core parts? IMNSHO it's not "refuse to accept", it's "refuse to consider because of ignorance". Ignorance is most definitely not bliss in this particular case.
Perhaps this can be made simpler at the GCC level? Some kind of attribute which you can attach to your function to say "I need versions for Atom, Bulldozer, Core2, K8, K10, and Pentium4 - with automatic switch to pick one of them at runtime".
You can do this with a library, a couple of #defines, and IFUNC. However, for smaller operations this is actually harmful because the hit from the indirection through the PLT exceeds the time spent in the function. (Note that since glibc itself avoids use of the PLT for internal calls, its own calls to memcpy() et al do not benefit from IFUNC at all, and always go to a maximally-generic routine. The cost of this seems to be minimal.)