A guide to inline assembly code in GCC
"I've decided to write this to consolidate my knowledge related to inline assembly here. As inline assembly statements are quite common in the Linux kernel and we may see them in linux-insides parts sometimes, I thought that it would be useful if we would have a special part which contains descriptions of the more important aspects of inline assembly. Of course you may find comprehensive information about inline assembly in the official documentation, but I like the rules all in one place."
Posted May 2, 2016 9:18 UTC (Mon)
by darwish (guest, #102479)
[Link] (5 responses)
Besides the obvious Linux kernel examples, I once had a hobby kernel with the string methods optimized using GCC inline x86-64 asm. It might be useful as a quick and easy consolidated recap after reading the guide above:
https://github.com/a-darwish/cuteOS/blob/master/lib/string.c
thanks,
Posted May 2, 2016 22:24 UTC (Mon)
by ballombe (subscriber, #9523)
[Link] (4 responses)
However, I quite like gcc inline asm.
Posted May 3, 2016 0:01 UTC (Tue)
by flussence (guest, #85566)
[Link] (1 responses)
There was another possibility though, Athlon XP CPUs have separate SSE/3DNow hardware (and registers - a big deal on 32-bit x86), and sufficiently clever asm could keep both fed in parallel. Unfortunately nobody was ever insane enough to take advantage of it, at least that I know of.
Posted May 3, 2016 5:42 UTC (Tue)
by eru (subscriber, #2753)
[Link]
A looong time ago, I did some ASM coding for 8088+8087 (and a bit later 286 + 287), and it was fun trying to arrange the instructions so that some integer instructions could progress in parallel with the floating-point math in the separate 8087.
Posted May 3, 2016 5:47 UTC (Tue)
by eru (subscriber, #2753)
[Link]
It is very common in compilers meant for embedded work to have "builtins" that basically just execute a particular instruction that is otherwise hard to trigger from the official programming language. Of course, using these is not portable between compilers.
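As a small illustration of such builtins, here is a sketch using two that GCC and Clang provide. The wrapper names `count_set_bits` and `swap_bytes32` are made up for this example; whether a single instruction is actually emitted depends on the target and the compiler flags.

```c
#include <stdint.h>

/* Count the set bits in x. GCC and Clang typically lower this to one
   POPCNT instruction when the target supports it, otherwise to a
   short software routine. */
static inline int count_set_bits(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Reverse the byte order of a 32-bit value (typically BSWAP on x86,
   REV on ARM). */
static inline uint32_t swap_bytes32(uint32_t x)
{
    return __builtin_bswap32(x);
}
```

The same source compiles on every architecture these compilers support, but it is not portable to compilers that lack these builtins.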
Posted May 3, 2016 19:53 UTC (Tue)
by jmspeex (subscriber, #51639)
[Link]
Posted May 2, 2016 13:33 UTC (Mon)
by bjacob (guest, #58566)
[Link] (6 responses)
Compiler maintainers, please make more inline asm documentation!
Posted May 2, 2016 19:07 UTC (Mon)
by adobriyan (subscriber, #30858)
[Link] (5 responses)
Posted May 2, 2016 20:33 UTC (Mon)
by daney (guest, #24551)
[Link] (3 responses)
This is not really true. The compiler has knowledge about a subset of the hardware capabilities.
Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.
To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.
Posted May 3, 2016 15:16 UTC (Tue)
by adobriyan (subscriber, #30858)
[Link] (2 responses)
Well, compiler generates correct code somehow. This "somehow" should be enough for managing register interdependencies and stuff.
> Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.
> To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.
Constraints will be there and supplying additional information could be arranged. But for this? It's 2016.
static __always_inline unsigned long long rdtsc(void)
{
        DECLARE_ARGS(val, low, high);

        asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

        return EAX_EDX_VAL(val, low, high);
}
If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. It would look something like:
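#ifdef CONFIG_X86_64
        uint64_t rdx, rax;
        asm ((rdx, rax) = rdtsc();)
        return (rdx << 32) | rax;
#else
        uint32_t edx, eax;
        asm ((edx, eax) = rdtsc();)
        return ((uint64_t)edx << 32) | eax;
#endif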
Posted May 3, 2016 17:33 UTC (Tue)
by nybble41 (subscriber, #55106)
[Link]
The compiler generates correct code for the opcodes it uses. It knows nothing about any other opcodes which may appear in inline asm statements—to the compiler these are just opaque strings to be inserted directly into the listing passed to the assembler, which in turn only knows how to convert the opcodes into machine code, not what they actually do. Everything the compiler knows about how any given inline asm statement interacts with the rest of the program is based on the constraints specified by the programmer.
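To make that concrete, here is a minimal sketch of RDTSC wrapped with explicit constraints. The function name `read_tsc_asm` is invented for this example, and the non-x86 branch is only a placeholder so that the file still compiles elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: read the x86 time-stamp counter. */
static inline uint64_t read_tsc_asm(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* "=a" and "=d" are the only way the compiler learns that RDTSC
       leaves its result in EAX and EDX. The instruction string itself
       is opaque to it. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0; /* placeholder on non-x86 targets */
#endif
}
```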
For RDTSC there is an inline intrinsic which is portable to at least GCC, Clang, and Visual C++: __rdtsc(). For GCC and Clang you need to #include <x86intrin.h> and for Visual C++ you need:
> #include <intrin.h>
> #pragma intrinsic(__rdtsc)
In all cases you simply write __rdtsc() and get a 64-bit integer, no inline asm required. Similar intrinsics exist for many other useful opcodes.
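A sketch of the intrinsic route for GCC and Clang follows. The wrapper name `read_tsc` is an assumption for this example, and the non-x86 branch is a placeholder.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* GCC and Clang; Visual C++ uses <intrin.h> */
#endif

/* Hypothetical wrapper around the __rdtsc() intrinsic. */
static inline uint64_t read_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();    /* no inline asm, no constraints to write */
#else
    return 0;            /* placeholder on non-x86 targets */
#endif
}
```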
Posted May 3, 2016 18:48 UTC (Tue)
by khim (subscriber, #9252)
[Link]
The compiler's CPU model does not have instructions like rdtsc or cpuid, thus to generate correct code the compiler must obtain that information… "somehow". If the constraints which so irritate you had been invented for the sole purpose of writing asm, it would have been strange. But the fact that you must use the same method which is used to teach the compiler about the CPU in the first place does not really look surprising to me. I mean: what else would you use?
Posted May 2, 2016 20:53 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link]
Posted May 2, 2016 18:03 UTC (Mon)
by gutschke (subscriber, #27910)
[Link] (2 responses)
While the article provides a good high-level overview, it skips a couple of the details that are needed in real-life programming. It's been a while since I had to write assembly, so I have probably forgotten a lot of the details that caused me confusion. The two issues that I do recall are 1) how to deal with register constraints that don't have an architecture-specific register name, and 2) how to deal with more than 10 input/output constraints.
The former can be solved by assigning variables to named registers, for example if you needed an input to be in %r10. The best solution for providing more than 10 parameters is to place them into a temporarily allocated struct on the stack.
Finally, specifying correct clobber constraints can be surprisingly counter-intuitive unless you have a really firm grasp of the C language standard. If in doubt, it is always safe to mark the __asm__ statement as volatile and to state that it clobbers memory.
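Both workarounds can be sketched roughly as follows. The names `copy_through_r10`, `struct asm_args` and `add_from_struct` are invented for this example, the code assumes an x86 target (with a plain C fallback elsewhere), and the struct offsets 0 and 4 are hard-coded on the assumption that `int` is four bytes.

```c
/* 1) Pin an input to %r10 with a local register variable (x86-64). */
static inline int copy_through_r10(int value)
{
#if defined(__x86_64__)
    register int in __asm__("r10") = value;
    int out;
    /* volatile plus a "memory" clobber: the conservative choice. */
    __asm__ __volatile__("movl %1, %0" : "=r"(out) : "r"(in) : "memory");
    return out;
#else
    return value;
#endif
}

/* 2) Pass many values through one pointer operand instead of many
      separate operands. */
struct asm_args { int a; int b; };

static inline int add_from_struct(const struct asm_args *args)
{
#if defined(__x86_64__) || defined(__i386__)
    int sum;
    /* "=&r": sum is written before args is read for the last time.
       The "m" operand tells the compiler that *args is read. */
    __asm__("movl 0(%1), %0\n\t"
            "addl 4(%1), %0"
            : "=&r"(sum)
            : "r"(args), "m"(*args)
            : "cc");
    return sum;
#else
    return args->a + args->b;
#endif
}
```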
Posted May 3, 2016 1:43 UTC (Tue)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted May 3, 2016 1:49 UTC (Tue)
by gutschke (subscriber, #27910)
[Link]
Named parameters are going to make the code much more readable too.
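For reference, a minimal sketch of named (symbolic) operands. The function name `add_named` is an assumption; the asm is x86-only, with a plain C fallback elsewhere.

```c
/* Add b to a, referring to the operands by name rather than %0/%1. */
static inline int add_named(int a, int b)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__("addl %[addend], %[sum]"
            : [sum] "+r"(a)        /* read-write operand */
            : [addend] "r"(b)      /* input-only operand */
            : "cc");               /* ADD changes the flags */
    return a;
#else
    return a + b;
#endif
}
```

The names also remove the numbering limit as a practical concern, since operands no longer have to be counted by position.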
Posted May 2, 2016 21:45 UTC (Mon)
by ledow (guest, #11753)
[Link] (9 responses)
At some point in my life, I wrote programs inside DOS debug, and have done inline assembly. But that's something I got out of many years ago because, pretty much, I was always beaten by the compiler.
Especially nowadays with multimedia and parallel instructions being used to calculate across whole arrays or matrices simultaneously, the compiler will almost always do a better job and knowledge of assembler is really for when you're debugging or writing such compilers in the first place, or hitting corner cases at a deep, deep hardware level.
Honestly, how much inline assembler is actually out there, and how far would it differ if you just put equivalent C in the same places?
I remember, I think it was in the emulator field, when hand-crafted MMX could win out in certain highly specialised operations. Almost inevitably, though, the code gave way to slightly slower (or even same-speed) and yet much more readable C equivalents.
Posted May 2, 2016 22:17 UTC (Mon)
by anselm (subscriber, #2796)
[Link] (4 responses)
One application where assembly code is apparently still a thing is cryptography. A colleague of mine is writing cryptographic code and he swears by doing things in assembly language because that lets him ensure more easily that stuff takes the same time no matter what is going on, in order to harden the code against timing attacks.
Posted May 3, 2016 3:59 UTC (Tue)
by raof (subscriber, #57409)
[Link] (3 responses)
Posted May 3, 2016 8:34 UTC (Tue)
by excors (subscriber, #95769)
[Link] (2 responses)
That sounds much too risky for any security-critical code. Writing assembly gives you the straightforward predictability you need.
Also, it can be quite easy to overlook an unintentional timing-dependent operation in C code - any if/for/while, ?:, &&, ||, memory access, integer division, floating point, etc, and any call into another function doing any of those things, hiding anywhere in a line of code. With assembly it's relatively easy to scan down the column of instruction names to spot any suspicious ones.
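For comparison, this is roughly what the C-only attempt looks like: a sketch of a branch-free buffer comparison that uses an empty inline asm as an optimization barrier. The name `ct_equal` is invented here, and nothing in the C standard guarantees constant-time behavior, so the generated assembly would still have to be inspected for each compiler and target.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two buffers are equal, 0 otherwise, with no
   data-dependent branch or early exit in the source. */
static int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint32_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint32_t)(a[i] ^ b[i]);
        /* Empty asm with a "+r" operand: the compiler must assume diff
           may have changed, which discourages it from short-circuiting
           the loop. */
        __asm__ __volatile__("" : "+r"(diff));
    }
    /* diff is in 0..255: map 0 -> 1 and non-zero -> 0 without comparing. */
    return (int)(1u & ((diff - 1u) >> 8));
}
```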
Posted May 3, 2016 16:45 UTC (Tue)
by mips (guest, #105013)
[Link] (1 responses)
Posted May 2, 2016 23:02 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link]
If you write cross-platform concurrent code it's often easier to resort to inline assembly than rely on intrinsics. For one, the assembler generated by the intrinsics isn't necessarily all that efficient (which can matter a lot for performance-related bits); for another, the intrinsics are often poorly specified. E.g. postgres (in an unreleased version) recently hit an issue with xlc's barrier semantics of atomic intrinsics being under-specified - we (well, Noah Misch) essentially had to disassemble the generated code to figure out what barriers have to be added in addition to the intrinsic. Not particularly future safe.
The specifications of various intrinsics and/or atomics APIs (including C11's atomics) are often much harder to make sense of than a CPU's architecture manual. E.g. the non-variable specific barrier semantics in C11 atomics are barely understandable in my opinion; yet you need them if you want to incrementally move over.
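As an illustration of the inline-assembly route (not the postgres code), here is a sketch of a compiler barrier and a full hardware barrier. The names `compiler_barrier`, `full_barrier`, `publish` and `try_consume` are assumptions for this example; on non-x86-64 targets it falls back to GCC's `__sync_synchronize()` builtin.

```c
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
   stops the compiler from moving loads and stores across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Full hardware barrier: MFENCE on x86-64, GCC's builtin elsewhere. */
static inline void full_barrier(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    __sync_synchronize();
#endif
}

static int shared_data;
static volatile int shared_ready;

/* Writer: store the data, then raise the flag. */
static inline void publish(int value)
{
    shared_data = value;
    full_barrier();          /* data must be visible before the flag */
    shared_ready = 1;
}

/* Reader: returns 1 and fills *out once the flag has been seen. */
static inline int try_consume(int *out)
{
    if (!shared_ready)
        return 0;
    full_barrier();          /* do not read the data before the flag */
    *out = shared_data;
    return 1;
}
```

A full barrier is stronger than this pattern strictly needs on x86; it is used here only to keep the sketch short.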
Posted May 3, 2016 15:34 UTC (Tue)
by ballombe (subscriber, #9523)
[Link] (2 responses)
Posted May 3, 2016 19:12 UTC (Tue)
by khim (subscriber, #9252)
[Link]
Posted May 3, 2016 20:29 UTC (Tue)
by HenrikH (subscriber, #31152)
[Link]
Posted May 3, 2016 21:51 UTC (Tue)
by dgm (subscriber, #49227)
[Link] (4 responses)
Posted May 4, 2016 8:33 UTC (Wed)
by khim (subscriber, #9252)
[Link]
RVCT compiler does that, too. In fact most compilers which only support one architecture are doing this! But for that to work your compiler must know about your CPU, its assembler and so on intimately. GCC's "crazy" syntax comes from the idea to make the compiler cross-platform. GCC itself does not know about your CPU, it's quite literally built around that "baroque" syntax as I wrote before. The fact that you could just add random assembler to your file is just a nice side-effect of that design decision. IOW: no one ever tried to invent that syntax to help users of GCC. It was invented to make GCC itself possible! No wonder that you find it rather baroque and strange…
Posted May 4, 2016 9:06 UTC (Wed)
by anselm (subscriber, #2796)
[Link] (1 responses)
This probably worked because Borland C++ contained its own assembler and could profit from its knowledge of the rest of the program (or compilation unit, anyway) when processing the assembly code. (I've never used Borland C++ so don't know for sure whether that was actually the case.)
GCC, on the other hand, produces explicit assembly code that is then run through a separate assembler which doesn't have access to the same kind of semantic information GCC had when compiling the C code. That tends to limit the amount of convenience available for the assembly code.
Posted May 4, 2016 18:55 UTC (Wed)
by ballombe (subscriber, #9523)
[Link]
Posted May 5, 2016 11:22 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]
In GCC by specifying constraints you let the compiler keep stuff in other registers and you can tell it what you want where, so you can say things like:
BTW the Microsoft compiler doesn't support inline asm for ARM and x64, probably because the simple method just doesn't work any more. GCC can support new architectures with asm support easily, but it does cost a bit of complexity.
and some operations supported by all CPUs since the m68030 and i386 are still not directly accessible in C or any more modern language, so people have to rely on assembly
Language standardizers, please consider standardizing whichever aspects of inline asm can be standardized! I know it's a tall order, but it's a very important one. Software authors pay a high cost for the current situation, where so-called "GCC-compatible" compilers don't even agree on the way to define a function-local asm label (for example, %= doesn't work in iOS toolchains), etc.
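For readers who have not met `%=`, here is a sketch of how it is used with GCC on an ELF target. The function name `sum_to` is invented for this example, the asm is x86-only with a C fallback, and the `.L` local-label prefix is an ELF assembler convention that other toolchains may spell differently.

```c
/* Sum n + (n-1) + ... + 1 with a loop written in inline asm. */
static inline unsigned sum_to(unsigned n)
{
    unsigned total = 0;
#if defined(__x86_64__) || defined(__i386__)
    /* %= expands to a number unique to each instance of this asm
       statement, so the labels do not collide when the function is
       inlined more than once. */
    __asm__("testl %[n], %[n]\n\t"
            "jz .Ldone%=\n"
            ".Lloop%=:\n\t"
            "addl %[n], %[total]\n\t"
            "decl %[n]\n\t"
            "jnz .Lloop%=\n"
            ".Ldone%=:"
            : [total] "+r"(total), [n] "+r"(n)
            :
            : "cc");
#else
    while (n != 0)
        total += n--;
#endif
    return total;
}
```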
> write down the constraints at all. Compiler has all the knowledge
> of the hardware platform already, ...
> Well, compiler generates correct code somehow.
Well, yeah.
> This "somehow" should be enough for managing register interdependencies and stuff.
Sure. But it's not called "somehow". It's called… drumroll… constraints. All the things which you could use are quite literally described in the same language you are using to describe your inline assembler. Here is the description of the x86 CPU. And here is the one for ARM.
> If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. It would look something like:
> #ifdef CONFIG_X86_64
>         uint64_t rdx, rax;
>         asm ((rdx, rax) = rdtsc();)
>         return (rdx << 32) | rax;
> #else
>         uint32_t edx, eax;
>         asm ((edx, eax) = rdtsc();)
>         return ((uint64_t)edx << 32) | eax;
> #endif
Sure. But that's not called "assembler" at this point. These are intrinsics. There are many of them and when they work, you should use them. But if they don't… then they don't: the compiler must be taught to use the capabilities it knows nothing about.
%r10:
int main() {
        register int a __asm__("r10") = 5;
        int b;
        __asm__("movl %1, %0" : "=r"(b) : "r"(a));
        return b;
}
struct on the stack. This can be a little ugly, as you often have to hard-code offsets into the structure, and it can also result in slightly less efficient code. But in practice that often doesn't matter too horribly.
__asm__ statement as volatile and to state that it clobbers memory. This results in slightly less efficient code, but it is a lot more likely that the optimizer doesn't interfere in ways that the programmer didn't anticipate.
Almost all 64-bit processors (except UltraSPARC) have a 64-bit x 64-bit to 128-bit multiply instruction, but there is no way to call it in C. Sure, you can emulate it with three 64x64->64-bit multiplies, but this is 3 times slower and actually harder to get right (long long is 64 bits even on 64-bit systems).
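On x86-64 the instruction in question can be reached with a few lines of inline asm, sketched below. The function name `mul_64x64_128` is made up for this example, and the fallback branch relies on `unsigned __int128`, which is a GCC/Clang extension on 64-bit targets rather than standard C.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit unsigned multiply, result split into two halves. */
static inline void mul_64x64_128(uint64_t a, uint64_t b,
                                 uint64_t *hi, uint64_t *lo)
{
#if defined(__x86_64__)
    uint64_t h, l;
    /* MULQ multiplies RAX by its operand and leaves the 128-bit product
       in RDX:RAX, which is exactly what the constraints state. */
    __asm__("mulq %3"
            : "=a"(l), "=d"(h)
            : "a"(a), "rm"(b)
            : "cc");
    *hi = h;
    *lo = l;
#else
    /* Assumes a 64-bit GCC or Clang target that provides __int128. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
#endif
}
```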
I have always found GCC's syntax for inline assembly rather baroque. Back in its day, Borland C++ (and possibly Turbo C, but I'm not sure) had a much simpler and nicer syntax. You put
asm {
; your assembly here
jmp loop
}
No strings, no parameters, no modifiers, no nonsense. You could address variables and labels directly from the inline assembly. Microsoft's compiler uses a very similar syntax, in fact.
Inline assembly written for gcc 2.95 still works with gcc 6.
I think the reason for this is that Borland C never supported anything other than the x86 family, which has an exceptionally rich instruction set. Almost everywhere where you can put a register you can put a variable. So when you said:
asm {
mov ax, variable
}
The compiler could substitute whatever it liked and make valid assembly. Also, it could probably even parse the assembly because the assembler was integrated. The list of opcodes was limited to the instructions the compiler knew about, otherwise the compiler would have to throw all its optimisation state out the window: all variables would have to be on the stack, nothing could be kept in registers. When Intel had only a handful of registers this wasn't a big deal. But saving/restoring all the 16 64-bit registers you have these days is expensive.
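__asm__ __volatile__ (                                          \
        "cld\n\t"                                               \
        "rep\n\t"                                               \
        "movsl"                                                 \
        /* the instruction updates all three registers, so */   \
        /* they are read-write operands, not clobbers      */   \
        : "+S" (src), "+D" (dest), "+c" (numwords)              \
        : /* no input-only operands */                          \
        : "memory", "cc"                                        \
)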
Whereas in Borland C you need to marshal all the registers yourself.
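For completeness, here is a self-contained sketch of that string-copy idiom as a function. The name `copy_words` is an assumption, and the asm path is x86-only with a `memcpy` fallback. Because the instruction modifies all three registers, they are declared as read-write operands; GCC does not allow the clobber list to overlap with operands.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy numwords 32-bit words from src to dest. */
static inline void copy_words(uint32_t *dest, const uint32_t *src,
                              size_t numwords)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("cld\n\t"
                         "rep movsl"
                         /* "S", "D", "c" pin the operands to (R|E)SI,
                            (R|E)DI and (R|E)CX as the instruction needs */
                         : "+S"(src), "+D"(dest), "+c"(numwords)
                         :
                         : "memory", "cc");
#else
    memcpy(dest, src, numwords * sizeof *dest);
#endif
}
```

Everything not named in the constraints stays under the compiler's control, so other live values can remain in registers across the copy.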