LWN: Comments on "A guide to inline assembly code in GCC"

A guide to inline assembly code in GCC

kleptog — Thu, 05 May 2016 11:22:47 +0000

I think the reason for this is that Borland C never supported anything other than the x86 family, which has an exceptionally rich instruction set. Almost everywhere you can put a register you can also put a variable, so when you wrote:

asm {
   mov ax, variable
}

The compiler could substitute whatever it liked and still produce valid assembly. It could probably even parse the assembly, because the assembler was integrated. The list of opcodes was limited to the instructions the compiler knew about; otherwise the compiler would have had to throw all its optimisation state out the window: all variables would have to be on the stack and nothing could be kept in registers. When Intel had only a handful of registers this wasn't a big deal. But saving and restoring all sixteen 64-bit registers you have these days is expensive.

In GCC by specifying constraints you let the compiler keep stuff in other registers and you can tell it what you want where, so you can say things like:

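__asm__ __volatile__ (
                       "cld\n\t"
                       "rep movsl"
                       /* rep movsl advances ESI/EDI and counts ECX down, so  */
                       /* all three are read-write operands, not clobbers;    */
                       /* GCC rejects clobbers that overlap input operands.   */
                       : "+S" (src), "+D" (dest), "+c" (numwords)
                       :
                       : "memory", "cc"
                       );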

Whereas in Borland C you need to marshal all the registers yourself.

BTW the Microsoft compiler doesn't support inline asm for ARM and x64, probably because the simple method just doesn't work any more. GCC can support new architectures with asm support easily, but it does cost a bit of complexity.

A guide to inline assembly code in GCC

ballombe — Wed, 04 May 2016 18:55:17 +0000

A comment on gcc inline assembly: it might seem awkward to use, but it has strong backward compatibility:
inline assembly written for gcc 2.95 still works with gcc 6.

A guide to inline assembly code in GCC

anselm — Wed, 04 May 2016 09:06:28 +0000

This probably worked because Borland C++ contained its own assembler and could profit from its knowledge of the rest of the program (or compilation unit, anyway) when processing the assembly code. (I've never used Borland C++ so don't know for sure whether that was actually the case.)

GCC, on the other hand, produces explicit assembly code that is then run through a separate assembler which doesn't have access to the same kind of semantic information GCC had when compiling the C code. That tends to limit the amount of convenience available for the assembly code.

A guide to inline assembly code in GCC

khim — Wed, 04 May 2016 08:33:59 +0000

The RVCT compiler does that, too. In fact, most compilers which only support one architecture do this! But for that to work, your compiler must know your CPU, its assembler and so on intimately.

GCC's "crazy" syntax comes from the idea to make the compiler cross-platform. GCC itself does not know about your CPU, it's quite literally built around that "baroque" syntax as I wrote before. The fact that you could just add random assembler to your file is just a nice side-effect of that design decision.

IOW: no one invented that syntax in order to help users of GCC. It was invented to make GCC itself possible! No wonder that you find it rather baroque and strange…

A guide to inline assembly code in GCC

dgm — Tue, 03 May 2016 21:51:34 +0000

I have always found GCC's syntax for inline assembly to be rather baroque. Back in its day, Borland C++ (and possibly Turbo C, but I'm not sure) had a much simpler and nicer syntax. You put

asm {
    ; your assembly here
    jmp loop
 }

No strings, no parameters, no modifiers, no nonsense. You could address variables and labels directly from the inline assembly. Microsoft's compiler uses a very similar syntax, in fact.

A guide to inline assembly code in GCC

HenrikH — Tue, 03 May 2016 20:29:02 +0000

Doesn't __int128_t help in that regard? I don't know enough x86/amd64 assembly to see whether the mul/imul generated by gcc is the 64x64->128 kind or not, though.
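
To make the question concrete, here is a minimal sketch of a 64x64->128 multiplication written with the GCC/Clang `unsigned __int128` extension. The names `u128_parts` and `mul_64x64` are made up for illustration, and the type is only available on 64-bit targets. Whether the compiler emits a single widening multiply for this is best confirmed by reading the generated assembly.

```c
#include <stdint.h>

/* Result of a full 64x64 -> 128-bit multiplication, split into halves. */
struct u128_parts {
    uint64_t hi;   /* upper 64 bits of the product */
    uint64_t lo;   /* lower 64 bits of the product */
};

/* Hypothetical helper: widening one operand makes the whole product
 * 128 bits wide. unsigned __int128 is a compiler extension, not ISO C. */
static inline struct u128_parts mul_64x64(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    struct u128_parts r;

    r.hi = (uint64_t)(product >> 64);
    r.lo = (uint64_t)product;
    return r;
}
```

Calling `mul_64x64(x, y)` returns both halves without any inline asm, at the cost of depending on a non-standard type.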

A guide to inline assembly code in GCC

jmspeex — Tue, 03 May 2016 19:53:32 +0000

The problem is that you can't just change a language to add more of the operations that current CPUs support. That being said, as long as most people implement these operations in the same way -- e.g. MAX(a,b) as a>b?a:b -- then the compiler can actually recognize the pattern and use the right instruction. It's hard to do for something like finding the MSB (because there are tons of ways of doing it), but rather easy for other operations. Just because the language doesn't express something as an operator doesn't mean the compiler can't implement it with a single instruction.

A guide to inline assembly code in GCC

khim — Tue, 03 May 2016 19:12:57 +0000

Or an even simpler task: long addition. All the CPUs which I know of and which are still in wide use have "add" and "adc" instructions (or similar) which make it possible to easily organize arbitrarily long additions. Yet you couldn't express that idea in C! Not even with intrinsics!
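
For comparison, this is roughly what multi-word addition looks like when written in portable C: the carry has to be reconstructed from unsigned wrap-around comparisons instead of being read from a flag. It is only a sketch with a made-up name (`add_limbs`), and nothing guarantees that a given compiler turns it into an add/adc chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-limb numbers stored least-significant limb first.
 * Returns the carry out of the top limb (0 or 1). */
static unsigned add_limbs(uint64_t *sum, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    unsigned carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;     /* wraps only if a[i] is all ones */
        unsigned c1 = t < carry;       /* 1 if that addition wrapped */
        uint64_t s = t + b[i];
        unsigned c2 = s < t;           /* 1 if this addition wrapped */

        sum[i] = s;
        carry = c1 | c2;               /* both cannot be set at once */
    }
    return carry;
}
```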

A guide to inline assembly code in GCC

khim — Tue, 03 May 2016 19:09:18 +0000

You have proposed a really nice strategy, but… I think this tale answers your question well.

A guide to inline assembly code in GCC

khim — Tue, 03 May 2016 18:48:13 +0000

> Well, compiler generates correct code somehow.

Well, yeah.

> This "somehow" should be enough for managing register interdependencies and stuff.

Sure. But it's not called "somehow". It's called… drumroll… constraints. All the things which you could use are quite literally described in the same language you are using to describe your inline assembler. Here is the description of the x86 CPU. And here is the one for ARM.

The compiler's CPU model does not have instructions like rdtsc or cpuid, so to generate correct code the compiler must obtain that information… "somehow".

> If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. It would look something like:
>
> #ifdef CONFIG_X86_64
> uint64_t rdx, rax;
> asm ((rdx, rax) = rdtsc();)
> return (rdx <<32) | rax;
> #else
> uint32_t edx, eax;
> asm ((edx, eax) = rdtsc();)
> return ((uint64_t)edx <<32) | eax;
> #endif

Sure. But that's not called "assembler" at this point. These are intrinsics. There are many of them, and when they work you should use them. But if they don't… then they don't, and the compiler must be taught to use the capabilities it knows nothing about.

If the constraints which so irritate you had been invented for the sole purpose of writing asm, that would have been strange. But the fact that you must use the same method which is used to teach the compiler about the CPU in the first place does not really look surprising to me. I mean: what else would you use?

A guide to inline assembly code in GCC

nybble41 — Tue, 03 May 2016 17:33:10 +0000

> Well, compiler generates correct code somehow.

The compiler generates correct code for the opcodes it uses. It knows nothing about any other opcodes which may appear in inline asm statements—to the compiler these are just opaque strings to be inserted directly into the listing passed to the assembler, which in turn only knows how to convert the opcodes into machine code, not what they actually do. Everything the compiler knows about how any given inline asm statement interacts with the rest of the program is based on the constraints specified by the programmer.

For RDTSC there is an inline intrinsic which is portable to at least GCC, Clang, and Visual C++: __rdtsc(). For GCC and Clang you need to #include <x86intrin.h> and for Visual C++ you need:

> #include <intrin.h>
> #pragma intrinsic(__rdtsc)

In all cases you simply write __rdtsc() and get a 64-bit integer, no inline asm required. Similar intrinsics exist for many other useful opcodes.
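
A minimal sketch of the GCC/Clang variant described above. The wrapper name `read_cycle_counter` is made up for illustration, and the non-x86 branch is only a placeholder so that the file still compiles elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* declares __rdtsc() for GCC and Clang */
#endif

/* Hypothetical wrapper: returns the time-stamp counter on x86. */
static inline uint64_t read_cycle_counter(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();    /* 64-bit value, no inline asm needed */
#else
    return 0;            /* placeholder: no RDTSC on this target */
#endif
}
```

Reading the counter before and after a piece of code and subtracting gives a rough cycle count for it.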

A guide to inline assembly code in GCC

mips — Tue, 03 May 2016 16:45:40 +0000

Surely you would test the susceptibility of your executable code to timing attacks regardless of how it was written, if that's a critical requirement, and so therefore you wouldn't particularly need to trust the compiler?

A guide to inline assembly code in GCC

ballombe — Tue, 03 May 2016 15:34:28 +0000

I can tell you.
Almost all 64bit processors (except ultrasparc) have a 64bit x 64bit to 128bit multiply instruction, but there is no way to call it in C. Sure, you can emulate it with 3 64x64->64bit multiplies, but this is 3 times slower and actually harder to get right (long long is 64bit even on 64bit systems).

A guide to inline assembly code in GCC

adobriyan — Tue, 03 May 2016 15:16:05 +0000

> The compiler has knowledge about a subset of the hardware capabilities.

Well, compiler generates correct code somehow. This "somehow" should be enough for managing register interdependencies and stuff.

> Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.

> To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.

Constraints will be there and supplying additional information could be arranged. But for this? It's 2016.

static __always_inline unsigned long long rdtsc(void)
{
    DECLARE_ARGS(val, low, high);

    asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

    return EAX_EDX_VAL(val, low, high);
}

If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. It would look something like:

#ifdef CONFIG_X86_64
uint64_t rdx, rax;
asm ((rdx, rax) = rdtsc();)
return (rdx <<32) | rax;
#else
uint32_t edx, eax;
asm ((edx, eax) = rdtsc();)
return ((uint64_t)edx <<32) | eax;
#endif

A guide to inline assembly code in GCC

excors — Tue, 03 May 2016 08:34:55 +0000

That would mean the security of your cryptographic code is contingent on the compiler correctly providing those optimisation guarantees, for every version of every compiler on every platform with every set of compiler flags that anyone might ever build your code with. Compilers are pretty complicated things, and their developers are constantly trying to improve the performance of optimised code, so it seems inevitable they will occasionally make mistakes here. And probably the only way you can verify correctness is by manually reading the assembly code generated by every version - I'm not sure how you could write an automatic regression test for the absence of timing attacks.

That sounds much too risky for any security-critical code. Writing assembly gives you the straightforward predictability you need.

Also, it can be quite easy to overlook an unintentional timing-dependent operation in C code - any if/for/while, ?:, &&, ||, memory access, integer division, floating point, etc, and any call into another function doing any of those things, hiding anywhere in a line of code. With assembly it's relatively easy to scan down the column of instruction names to spot any suspicious ones.
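
As an illustration of what C code that avoids those constructs tends to look like, here is a sketch of a comparison that never branches on the data and always touches every byte. The name `ct_compare` is made up, and, as argued above, nothing in the C standard stops a compiler from transforming it, so the generated assembly would still have to be checked.

```c
#include <stddef.h>

/* Returns 0 if the two buffers are equal and nonzero otherwise.
 * The loop count depends only on len, never on the contents. */
static int ct_compare(const void *x, const void *y, size_t len)
{
    /* volatile discourages the compiler from adding an early exit */
    const volatile unsigned char *a = x;
    const volatile unsigned char *b = y;
    unsigned char diff = 0;

    for (size_t i = 0; i < len; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);   /* accumulate differences */

    return diff;
}
```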

A guide to inline assembly code in GCC

eru — Tue, 03 May 2016 05:47:13 +0000

> and some operations supported by all CPUs since the m68030 and i386 are still not directly accessible in C or any more modern language, so people have to rely on assembly

It is very common in compilers meant for embedded work to have "builtins" that basically just execute a particular instruction that is otherwise hard to trigger from the official programming language. Of course, using these is not portable between compilers.
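
GCC and Clang have builtins in the same spirit. The sketch below uses two of them, `__builtin_clz` and `__builtin_popcount`; the wrapper names are made up, and the code assumes a 32-bit `unsigned int`. As noted, this is not portable to compilers that lack these builtins.

```c
#include <stdint.h>

/* Index of the most significant set bit (0..31).
 * x must be nonzero: __builtin_clz(0) is undefined. */
static inline int msb_index(uint32_t x)
{
    return 31 - __builtin_clz(x);
}

/* Number of set bits in x. */
static inline int count_bits(uint32_t x)
{
    return __builtin_popcount(x);
}
```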

A guide to inline assembly code in GCC

eru — Tue, 03 May 2016 05:42:24 +0000

A looong time ago, I did some ASM coding for 8088+8087 (and a bit later 286 + 287), and it was fun trying to arrange the instructions so that some integer instructions could progress in parallel with the floating-point math in the separate 8087.

A guide to inline assembly code in GCC

raof — Tue, 03 May 2016 03:59:11 +0000

This seems like the perfect thing for an annotation - __attribute__((const-time)) - that instructs the compiler to (a) not enable optimisations that make the function timing parameter-dependent, or (b) fail to compile if it can't guarantee (a). To simplify implementation you might only be able to apply it to pure functions, but IIUC that's not a huge limitation for this sort of code.

A guide to inline assembly code in GCC

gutschke — Tue, 03 May 2016 01:49:38 +0000

That's awesome news. Can't wait for this feature to be universally available in the majority of compilers that people actually use.

Named parameters are going to make the code much more readable too.

A guide to inline assembly code in GCC

pbonzini — Tue, 03 May 2016 01:43:01 +0000

GCC now has named asm operands; they can be used to bypass the 10-operand limit.
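
A small sketch of the named-operand syntax, where `%[name]` replaces the positional `%0`, `%1`, and so on. The function name `add_named` is made up for illustration; the asm is x86-only (AT&T syntax) and other targets fall back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper: adds addend to acc using named operands. */
static inline uint32_t add_named(uint32_t acc, uint32_t addend)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__("addl %[addend], %[acc]"
            : [acc] "+r"(acc)          /* read-write, any register */
            : [addend] "r"(addend)     /* input, any register */
            : "cc");                   /* addl changes the flags */
#else
    acc += addend;                     /* fallback for other targets */
#endif
    return acc;
}
```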

A guide to inline assembly code in GCC

flussence — Tue, 03 May 2016 00:01:32 +0000

Your comment reminded me of one interesting bit of trivia: a few chips from the middle of x86's life have separate hardware for separate instruction sets; GCC actually has a "-mfpmath=sse,387" but as its manpage notes it's poorly optimized, and is obsolete nowadays because 387 is ancient history and emulated on the SSE silicon.

There was another possibility though, Athlon XP CPUs have separate SSE/3DNow hardware (and registers - a big deal on 32-bit x86), and sufficiently clever asm could keep both fed in parallel. Unfortunately nobody was ever insane enough to take advantage of it, at least that I know of.

A guide to inline assembly code in GCC

andresfreund — Mon, 02 May 2016 23:02:44 +0000

> Does it explain why this is still needed in this day and age outside of kernel-programming?

If you write cross-platform concurrent code, it's often easier to resort to inline assembly than to rely on intrinsics. For one, the assembler generated by the intrinsics isn't necessarily all that efficient (which can matter a lot for performance-related bits); for another, the intrinsics are often poorly specified. E.g. postgres (in an unreleased version) recently hit an issue with xlc's barrier semantics of atomic intrinsics being under-specified - we (well, Noah Misch) essentially had to disassemble the generated code to figure out what barriers have to be added in addition to the intrinsic. Not particularly future safe.

The specifications of various intrinsics and/or atomics APIs (including C11's atomics) are often much harder to make sense of than a CPU's architecture manual. E.g. the non-variable-specific barrier semantics in C11 atomics are barely understandable in my opinion; yet you need them if you want to incrementally move over.

A guide to inline assembly code in GCC

ballombe — Mon, 02 May 2016 22:24:21 +0000

Something I find really troubling is that we are in 2016 and some operations supported by all CPUs since the m68030 and i386 are still not directly accessible in C or any more modern language, so people have to rely on assembly (or use libraries that do it for them, as long as their job is not to write such libraries), or more often emulate them using much slower C code (at best). It is a bit like doing 3D graphics on the CPU.

However, I quite like gcc inline asm.

A guide to inline assembly code in GCC

anselm — Mon, 02 May 2016 22:17:47 +0000

One application where assembly code is apparently still a thing is cryptography. A colleague of mine is writing cryptographic code and he swears by doing things in assembly language because that lets him ensure more easily that stuff takes the same time no matter what is going on, in order to harden the code against timing attacks.

A guide to inline assembly code in GCC

ledow — Mon, 02 May 2016 21:45:17 +0000

Does it explain why this is still needed in this day and age outside of kernel-programming?

At some point in my life, I wrote programs inside DOS debug, and have done inline assembly. But that's something I got out of many years ago because, pretty much, I was always beaten by the compiler.

Especially nowadays with multimedia and parallel instructions being used to calculate across whole arrays or matrices simultaneously, the compiler will almost always do a better job and knowledge of assembler is really for when you're debugging or writing such compilers in the first place, or hitting corner cases at a deep, deep hardware level.

Honestly, how much inline assembler is actually out there, and how much would it differ if you just put equivalent C in the same places?

I remember, I think it was in the emulator fields, when hand-crafted MMX could win-out in certain highly specialised operations. Almost inevitably, though, the code gave way to slightly-slower (or even same-speed) and yet much more readable C equivalents.

A guide to inline assembly code in GCC

andresfreund — Mon, 02 May 2016 20:53:48 +0000

You very well might want the compiler to use *looser* semantics than strictly required based on all possible side effects. There are e.g. plenty of cases where you might want to atomically increment something, but you don't need the compiler to write back all "dirty registers". But there are also a lot of cases (most prominently locking-related stuff) where you really, really can't have that. If you went by the strict definition, any instruction involving writing to memory would have to act as a "m" constraint, based on your definition - it could be releasing a lock.

A guide to inline assembly code in GCC

daney — Mon, 02 May 2016 20:33:26 +0000

> What always bugged me is the fact that developer has to explicitly
> write down the constraints at all. Compiler has all the knowledge
> of the hardware platform already, ...

This is not really true. The compiler has knowledge about a subset of the hardware capabilities.

Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.

To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.

A guide to inline assembly code in GCC

adobriyan — Mon, 02 May 2016 19:07:51 +0000

What always bugged me is the fact that the developer has to explicitly write down the constraints at all. The compiler has all the knowledge of the hardware platform already, so deducing that, say, STOSB clobbers memory and RDI and takes AL as input should not be rocket science. So, no, please don't standardize GCC's asm, nobody understands it fully anyway -- just read all the implementations of memset/memcpy/strchr in libcs, kernels, random code on the web and watch how developers use =&S/=&D/=&c constraints.
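
For reference, this is one way the facts listed for STOSB can be spelled out by hand as GCC constraints. It is a sketch with a made-up name (`fill_bytes`), x86-only with a memset fallback, and not taken from any particular libc or kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical memset-like helper built on "rep stosb". */
static inline void *fill_bytes(void *dst, unsigned char value, size_t count)
{
#if defined(__x86_64__) || defined(__i386__)
    void *d = dst;

    __asm__ volatile("rep stosb"
                     : "+D"(d),       /* (R|E)DI: destination, advanced */
                       "+c"(count)    /* (R|E)CX: counter, counted down */
                     : "a"(value)     /* AL: the byte to store */
                     : "memory");     /* the asm writes through d */
#else
    memset(dst, value, count);        /* fallback for other targets */
#endif
    return dst;
}
```

The sketch relies on the direction flag being clear, which the usual x86 calling conventions require on function entry.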

A guide to inline assembly code in GCC

gutschke — Mon, 02 May 2016 18:03:42 +0000

While the article provides a good high-level overview, it skips a couple of the details that are needed in real-life programming. It's been a while since I had to write assembly, so I have probably forgotten a lot of the details that caused me confusion.

The two issues that I do recall are 1) how to deal with register constraints that don't have an architecture-specific register name, and 2) how to deal with more than 10 input/output constraints.

The former can be solved by assigning variables to named registers; here is an example of what you would do, if you needed an input to be in %r10:

int main() {
  register int a __asm__("r10") = 5;
  int b;
  __asm__("movl %1, %0" : "=r"(b) : "r"(a));
  return b;
}

And the best solution for providing more than 10 parameters is to place them into a temporarily allocated struct on the stack. This can be a little ugly, as you often have to hard-code offsets into the structure, and it can also result in slightly less efficient code. But in practice that often doesn't matter too horribly.
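
A sketch of that struct approach, shrunk to three values so it stays readable. The names `sum_args` and `sum_three` are made up, the asm is x86-64 only with a plain C fallback, and the hard-coded offsets are guarded by `_Static_assert` (C11) so a layout change fails at compile time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block passed to the asm through one pointer. */
struct sum_args {
    int64_t a;
    int64_t b;
    int64_t c;
};

_Static_assert(offsetof(struct sum_args, b) == 8,  "asm hard-codes offset 8");
_Static_assert(offsetof(struct sum_args, c) == 16, "asm hard-codes offset 16");

static inline int64_t sum_three(int64_t a, int64_t b, int64_t c)
{
    struct sum_args args = { a, b, c };   /* temporary on the stack */
#if defined(__x86_64__)
    int64_t result;

    __asm__("movq 0(%[p]), %[res]\n\t"
            "addq 8(%[p]), %[res]\n\t"
            "addq 16(%[p]), %[res]"
            : [res] "=&r"(result)         /* written before p is last read */
            : [p] "r"(&args),
              "m"(args)                   /* the asm reads the struct */
            : "cc");
    return result;
#else
    return args.a + args.b + args.c;      /* fallback for other targets */
#endif
}
```

Only two operands are visible to the compiler regardless of how many fields the struct carries, which is the point of the technique.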

Finally, specifying correct clobber constraints can be surprisingly counter-intuitive unless you have a really firm grasp of the C language standard. If in doubt, it is always safe to mark the __asm__ statement as volatile and to state that it clobbers memory. This results in slightly less efficient code, but it is a lot more likely that the optimizer doesn't interfere in ways that the programmer didn't anticipate.
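
The smallest example of that conservative combination is an empty asm statement that is volatile and clobbers memory, commonly used as a compiler-level barrier. The names below are made up for illustration. It only restrains the compiler; it emits no instruction and is not a CPU memory barrier.

```c
/* Compiler barrier: memory accesses are not moved across this statement,
 * and values cached in registers are reloaded afterwards. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

static int shared_data;   /* hypothetical payload */
static int shared_flag;   /* hypothetical "data is ready" marker */

/* Stores the payload, then the flag; the barrier keeps the compiler
 * from swapping the two stores. */
static inline void publish(int value)
{
    shared_data = value;
    compiler_barrier();
    shared_flag = 1;
}
```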

A guide to inline assembly code in GCC

bjacob — Mon, 02 May 2016 13:33:43 +0000

Inline assembly in either GCC or Clang is very under-documented, so the best documentation available is "random" pieces like this scattered across the Web. Thanks for this one!

Compiler maintainers, please make more inline asm documentation!
Language standardizers, please consider standardizing whichever aspects of inline asm can be standardized! I know it's a tall order, but it's a very important one. Software authors pay a high cost for the current situation, where so-called "GCC-compatible" compilers don't even agree on the way to define a function-local asm label (for example, %= doesn't work in iOS toolchains), etc.
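
For readers who have not met `%=`: GCC replaces it with a number that is unique to each asm instance, so a label inside an inlined or duplicated asm statement does not collide with its copies. The sketch below is x86-only, uses a made-up function name, and assumes an ELF-style `.L` local-label prefix, which is exactly the kind of detail that differs between toolchains.

```c
/* Hypothetical helper: returns 1 if x is nonzero, 0 otherwise. */
static inline int is_nonzero(int x)
{
#if defined(__x86_64__) || defined(__i386__)
    int result;

    __asm__("xorl %[res], %[res]\n\t"    /* result = 0 */
            "testl %[in], %[in]\n\t"     /* set flags from x */
            "jz .Ldone%=\n\t"            /* x == 0: keep result at 0 */
            "movl $1, %[res]\n"
            ".Ldone%=:"                  /* label unique per asm instance */
            : [res] "=&r"(result)        /* written before in is read */
            : [in] "r"(x)
            : "cc");
    return result;
#else
    return x != 0;                       /* fallback for other targets */
#endif
}
```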

A guide to inline assembly code in GCC

darwish — Mon, 02 May 2016 09:18:49 +0000

Excellent. May I also add that when writing inline asm, it's always good to move as much logic as possible out of the inlined assembly code and into the C expressions of the asm constraints. This gives the GCC optimizer way more freedom, especially regarding constant propagation.

Besides the obvious Linux kernel examples, I once had a hobby kernel with the string methods optimized using gcc inline x86-64 asm. It might be useful as a quick & easy consolidated recap after reading the guide above:

https://github.com/a-darwish/cuteOS/blob/master/lib/string.c

thanks,