LWN: Comments on "A guide to inline assembly code in GCC" https://lwn.net/Articles/685739/ This is a special feed containing comments posted to the individual LWN article titled "A guide to inline assembly code in GCC". Thu, 30 Oct 2025 04:32:16 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686412/ https://lwn.net/Articles/686412/ kleptog I think the reason for this is that Borland C never supported anything other than the x86 family, which has an exceptionally rich instruction set. Almost everywhere you can put a register you can put a variable, so when you said: <pre>
asm {
    mov ax, variable
}
</pre> the compiler could substitute whatever it liked and make valid assembly. Also, it could probably even parse the assembly because the assembler was integrated. The list of opcodes was limited to the instructions the compiler knew about; otherwise the compiler would have to throw all its optimisation state out the window: all variables would have to be on the stack and nothing could be kept in registers. When Intel had only a handful of registers this wasn't a big deal. But saving/restoring all the 16 64-bit registers you have these days is expensive. <p> In GCC by specifying constraints you let the compiler keep stuff in other registers and you can tell it what you want where, so you can say things like: <pre>
__asm__ __volatile__ (
    "cld\n\t"
    "rep\n\t"
    "movsl"
    : "+S" (src), "+D" (dest), "+c" (numwords)  /* read and modified by the instruction */
    :
    : "memory"                                   /* the copy writes through dest */
);
</pre> Whereas in Borland C you need to marshal all the registers yourself. <p> BTW the Microsoft compiler doesn't support inline asm for ARM and x64, probably because the simple method just doesn't work any more. GCC can support new architectures with asm support easily, but it does cost a bit of complexity. 
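A minimal sketch of the constraint-based `rep movsl` copy discussed above, assuming GCC or Clang on an x86 target. The helper name `copy_words` is hypothetical and not taken from the comment. The three registers are declared as read-write operands (`+S`, `+D`, `+c`) because the instruction changes them, and `memory` is listed as a clobber because the copy writes through `dest`. On other targets the sketch falls back to a plain C loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy numwords 32-bit words from src to dest. */
static void copy_words(uint32_t *dest, const uint32_t *src, size_t numwords)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__ (
        "cld\n\t"          /* make the string instruction count upwards */
        "rep movsl"        /* copy (e/r)cx dwords from [si] to [di] */
        : "+S" (src), "+D" (dest), "+c" (numwords)
        :
        : "memory"
    );
#else
    /* Portable fallback for non-x86 targets. */
    while (numwords--)
        *dest++ = *src++;
#endif
}
```

Because the operands are plain C variables, the compiler stays free to keep everything else in whichever registers it likes.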
Thu, 05 May 2016 11:22:47 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686189/ https://lwn.net/Articles/686189/ ballombe <div class="FormattedComment"> A comment on gcc inline assembly: it might seem awkward to use, but it has strong backward compatibility:<br> inline assembly written for gcc 2.95 still works with gcc 6.<br> <p> </div> Wed, 04 May 2016 18:55:17 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686185/ https://lwn.net/Articles/686185/ anselm <p> This probably worked because Borland C++ contained its own assembler and could profit from its knowledge of the rest of the program (or compilation unit, anyway) when processing the assembly code. (I've never used Borland C++ so don't know for sure whether that was actually the case.) </p> <p> GCC, on the other hand, produces explicit assembly code that is then run through a separate assembler which doesn't have access to the same kind of semantic information GCC had when compiling the C code. That tends to limit the amount of convenience available for the assembly code. </p> Wed, 04 May 2016 09:06:28 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686181/ https://lwn.net/Articles/686181/ khim <p>The RVCT compiler does that, too. In fact most compilers which only support one architecture do this! But for that to work your compiler must know about your CPU, its assembler and so on intimately.</p> <p>GCC's "crazy" syntax comes from the idea of making the compiler cross-platform. GCC itself does not know about your CPU; it's quite literally built around that "baroque" syntax as I <a href="http://lwn.net/Articles/686105/">wrote before</a>. The fact that you could just add random assembler to your file is just a nice side-effect of that design decision.</p> <p>IOW: no one ever tried to invent that syntax to help <b>users</b> of GCC. It was invented to make GCC <b>itself</b> possible! 
No wonder that you find it rather baroque and strange…</p> Wed, 04 May 2016 08:33:59 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686134/ https://lwn.net/Articles/686134/ dgm I have always found GCC's syntax for inline assembly rather baroque. Back in its day, Borland C++ (and possibly Turbo C, but I'm not sure) had a much simpler and nicer syntax. You put <pre>
asm {
    ; your assembly here
    jmp loop
}
</pre> No strings, no parameters, no modifiers, no nonsense. You could address variables and labels directly from the inline assembly. Microsoft's compiler uses a very similar syntax, in fact. Tue, 03 May 2016 21:51:34 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686122/ https://lwn.net/Articles/686122/ HenrikH <div class="FormattedComment"> Doesn't __int128_t help in that regard? I don't know enough x86/amd64 assembly to see whether the mul/imul generated by gcc is the 64x64-&gt;128 kind or not, though.<br> </div> Tue, 03 May 2016 20:29:02 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686114/ https://lwn.net/Articles/686114/ jmspeex <div class="FormattedComment"> The problem is that you can't just change a language to add more operations that current CPUs support. That being said, as long as most people implement these operations in the same way -- e.g. MAX(a,b) as a&gt;b?a:b -- then the compiler can actually recognize the pattern and use the right instruction. It's hard to do for something like finding the MSB (because there are tons of ways of doing it), but rather easy for other operations. Just because the language doesn't express it as an operator doesn't mean the compiler can't implement it with a single instruction. <br> </div> Tue, 03 May 2016 19:53:32 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686109/ https://lwn.net/Articles/686109/ khim <div class="FormattedComment"> Or an even simpler task: long addition. 
All the CPUs which I know and which are still in wide use have "add" and "adc" instructions (or similar) which make it possible to easily organize arbitrarily long additions. Yet you couldn't express that idea in C! Not even with intrinsics!<br> </div> Tue, 03 May 2016 19:12:57 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686108/ https://lwn.net/Articles/686108/ khim You have proposed a really nice strategy, but… I think <a href="http://hrboutique.blogspot.de/2012/01/on-strategy-and-mice.html">this tale</a> answers your question well. Tue, 03 May 2016 19:09:18 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686105/ https://lwn.net/Articles/686105/ khim <blockquote><font class="QuotedText">Well, compiler generates correct code somehow.</font></blockquote>Well, yeah. <blockquote><font class="QuotedText">This "somehow" should be enough for managing register interdependencies and stuff.</font></blockquote>Sure. But it's not called "somehow". It's called… drumroll… <b>constraints</b>. All the things which you could use are <b>quite literally</b> described in the same language you are using to describe your inline assembler. Here is <a href="https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md">the description of the x86 CPU</a>. And here is <a href="https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/arm/arm.md">an arm</a>.</p> <p>The compiler's CPU model does not have instructions like rdtsc or cpuid, thus to generate correct code the compiler must obtain that information… "somehow".</p> <blockquote><font class="QuotedText">If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. 
It would look something like: <pre>
#ifdef CONFIG_X86_64
    uint64_t rdx, rax;
    asm ((rdx, rax) = rdtsc();)
    return (rdx &lt;&lt;32) | rax;
#else
    uint32_t edx, eax;
    asm ((edx, eax) = rdtsc();)
    return ((uint64_t)edx &lt;&lt;32) | eax;
#endif
</pre> </font></blockquote>Sure. But that's not called "assembler" at this point. These are intrinsics. There are <a href="https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html">many</a> <a href="https://gcc.gnu.org/onlinedocs/gcc-4.2.4/gcc/X86-Built_002din-Functions.html">of</a> <a href="https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html">them</a> and when they work, you should use them. But if they don't… then they don't: the compiler must be taught to use the capabilities it knows nothing about.</p> <p>If the constraints which so irritate you had been invented for the sole purpose of writing asm, it would have been strange. But the fact that you must use the same method which is used to teach the compiler about the CPU in the first place does not really look surprising to me. I mean: what <b>else</b> would you use?</p> Tue, 03 May 2016 18:48:13 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686097/ https://lwn.net/Articles/686097/ nybble41 <div class="FormattedComment"> <font class="QuotedText">&gt; Well, compiler generates correct code somehow.</font><br> <p> The compiler generates correct code for the opcodes it uses. It knows nothing about any other opcodes which may appear in inline asm statements; to the compiler these are just opaque strings to be inserted directly into the listing passed to the assembler, which in turn only knows how to convert the opcodes into machine code, not what they actually do. Everything the compiler knows about how any given inline asm statement interacts with the rest of the program is based on the constraints specified by the programmer.<br> <p> For RDTSC there is an inline intrinsic which is portable to at least GCC, Clang, and Visual C++: __rdtsc(). 
For GCC and Clang you need to #include &lt;x86intrin.h&gt; and for Visual C++ you need:<br> <p> <font class="QuotedText">&gt; #include &lt;intrin.h&gt;</font><br> <font class="QuotedText">&gt; #pragma intrinsic(__rdtsc)</font><br> <p> In all cases you simply write __rdtsc() and get a 64-bit integer, no inline asm required. Similar intrinsics exist for many other useful opcodes.<br> </div> Tue, 03 May 2016 17:33:10 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686094/ https://lwn.net/Articles/686094/ mips <div class="FormattedComment"> Surely you would test the susceptibility of your executable code to timing attacks regardless of how it was written, if that's a critical requirement, and so therefore you wouldn't particularly need to trust the compiler?<br> </div> Tue, 03 May 2016 16:45:40 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686066/ https://lwn.net/Articles/686066/ ballombe <div class="FormattedComment"> I can tell you.<br> Almost all 64bit processor (except ultrasparc) have a 64bit x 64bit to 128bit multiply instruction but there is no way to call it in C. Sure you can emulate it by 3 64x64-&gt;64bit multiply but this is 3 time slower and actually harder to do it right (long long are 64bit even on 64bit systems).<br> </div> Tue, 03 May 2016 15:34:28 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686055/ https://lwn.net/Articles/686055/ adobriyan <div class="FormattedComment"> <font class="QuotedText">&gt; The compiler has knowledge about a subset of the hardware capabilities.</font><br> <p> Well, compiler generates correct code somehow. 
This "somehow" should be enough for managing register interdependencies and stuff.<br> <p> <font class="QuotedText">&gt; Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.</font><br> <p> <font class="QuotedText">&gt; To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.</font><br> <p> Constraints will be there and supplying additional information could be arranged. But for this? It's 2016.<br> <p> static __always_inline unsigned long long rdtsc(void)<br> {<br> DECLARE_ARGS(val, low, high);<br> <p> asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));<br> <p> return EAX_EDX_VAL(val, low, high);<br> }<br> <p> If you dig out through one level of macros the "low" and "high" are "unsigned long" but they were "unsigned int" earlier and gcc generated dummy MOV instruction. Not much of a loss, of course, but knowledge about RDTSC clearing the upper half of a register will make this code efficient automatically. 
It would look something like:<br> <p> #ifdef CONFIG_X86_64<br> uint64_t rdx, rax;<br> asm ((rdx, rax) = rdtsc();)<br> return (rdx &lt;&lt;32) | rax;<br> #else<br> uint32_t edx, eax;<br> asm ((edx, eax) = rdtsc();)<br> return ((uint64_t)edx &lt;&lt;32) | eax;<br> #endif<br> </div> Tue, 03 May 2016 15:16:05 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686017/ https://lwn.net/Articles/686017/ excors <div class="FormattedComment"> That would mean the security of your cryptographic code is contingent on the compiler correctly providing those optimisation guarantees, for every version of every compiler on every platform with every set of compiler flags that anyone might ever build your code with. Compilers are pretty complicated things, and their developers are constantly trying to improve the performance of optimised code, so it seems inevitable they will occasionally make mistakes here. And probably the only way you can verify correctness is by manually reading the assembly code generated by every version - I'm not sure how you could write an automatic regression test for the absence of timing attacks.<br> <p> That sounds much too risky for any security-critical code. Writing assembly gives you the straightforward predictability you need.<br> <p> Also, it can be quite easy to overlook an unintentional timing-dependent operation in C code - any if/for/while, ?:, &amp;&amp;, ||, memory access, integer division, floating point, etc, and any call into another function doing any of those things, hiding anywhere in a line of code. 
With assembly it's relatively easy to scan down the column of instruction names to spot any suspicious ones.<br> </div> Tue, 03 May 2016 08:34:55 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686003/ https://lwn.net/Articles/686003/ eru <i> and some operation supported by all CPUs since the m68030 and i386 are still not directly accessible in C or any more modern language, so people has to rely on assembly </i> <p> It is very common in compilers meant for embedded work to have "builtins" that basically just execute a particular instruction that is otherwise hard to trigger from the official programming language. Of course, using these is not portable between compilers. Tue, 03 May 2016 05:47:13 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/686002/ https://lwn.net/Articles/686002/ eru <p>A looong time ago, I did some ASM coding for 8088+8087 (and a bit later 286 + 287), and it was fun trying to arrange the instructions so that some integer instructions could progress in parallel with the floating-point math in the separate 8087. Tue, 03 May 2016 05:42:24 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685993/ https://lwn.net/Articles/685993/ raof <div class="FormattedComment"> This seems like the perfect thing for an annotation - __attribute__((const-time)) - that instructs the compiler to (a) not enable optimisations that make the function timing parameter-dependent, or (b) fail to compile if it can't guarantee (a). To simplify implementation you might only be able to apply it to pure functions, but IIUC that's not a huge limitation for this sort of code.<br> <p> </div> Tue, 03 May 2016 03:59:11 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685979/ https://lwn.net/Articles/685979/ gutschke <div class="FormattedComment"> That's awesome news. 
Can't wait for this feature to be universally available in the majority of compilers that people actually use.<br> <p> Named parameters are going to make the code much more readable too.<br> </div> Tue, 03 May 2016 01:49:38 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685977/ https://lwn.net/Articles/685977/ pbonzini <div class="FormattedComment"> GCC now has named asm operands; they can be used to bypass the 10 operand limit.<br> </div> Tue, 03 May 2016 01:43:01 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685960/ https://lwn.net/Articles/685960/ flussence <div class="FormattedComment"> Your comment reminded me of one interesting bit of trivia: a few chips from the middle of x86's life have separate hardware for separate instruction sets; GCC actually has a "-mfpmath=sse,387" option, but as its manpage notes it's poorly optimized, and it is obsolete nowadays because 387 is ancient history and emulated on the SSE silicon.<br> <p> There was another possibility though: Athlon XP CPUs have separate SSE/3DNow hardware (and registers - a big deal on 32-bit x86), and sufficiently clever asm could keep both fed in parallel. Unfortunately nobody was ever insane enough to take advantage of it, at least that I know of.<br> </div> Tue, 03 May 2016 00:01:32 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685945/ https://lwn.net/Articles/685945/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Does it explain why this is still needed in this day and age outside of kernel-programming?</font><br> <p> If you write cross-platform concurrent code it's often easier to resort to inline assembly than to rely on intrinsics. For one, the assembler generated by the intrinsics isn't necessarily all that efficient (which can matter a lot for performance-related bits); for another, the intrinsics are often poorly specified. E.g. 
postgres (in an unreleased version) recently hit an issue with xlc's barrier semantics of atomic intrinsics being under-specified - we (well, Noah Misch) essentially had to disassemble the generated code to figure out what barriers have to be added in addition to the intrinsic. Not particularly future safe.<br> <p> The specifications of various intrinsics and/or atomics APIs (including C11's atomics) are often much harder to make sense of than a CPU's architecture manual. E.g. the non-variable specific barrier semantics in C11 atomics are barely understandable in my opinion; yet you need them if you want to incrementally move over.<br> </div> Mon, 02 May 2016 23:02:44 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685944/ https://lwn.net/Articles/685944/ ballombe <div class="FormattedComment"> Something I find really troubling is that we are in 2016 and some operations supported by all CPUs since the m68030 and i386 are still not directly accessible in C or any more modern language, so people have to rely on assembly (or use libraries that do it for them, as long as their job is not to write such libraries), or more often emulate them using much slower C code (at best). It is a bit like doing 3D graphics on the CPU.<br> <p> However, I quite like gcc inline asm.<br> <p> <p> <p> </div> Mon, 02 May 2016 22:24:21 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685941/ https://lwn.net/Articles/685941/ anselm <p> One application where assembly code is apparently still a thing is cryptography. A colleague of mine is writing cryptographic code and he swears by doing things in assembly language because that lets him ensure more easily that stuff takes the same time no matter what is going on, in order to harden the code against timing attacks. 
</p> Mon, 02 May 2016 22:17:47 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685933/ https://lwn.net/Articles/685933/ ledow <div class="FormattedComment"> Does it explain why this is still needed in this day and age outside of kernel-programming?<br> <p> At some point in my life, I wrote program inside DOS debug, and have done inline-assembly. But that's something I got out of many years ago because, pretty much, I was always beaten by the compiler.<br> <p> Especially nowadays with multimedia and parallel instructions being used to calculate across whole arrays or matrices simultaneously, the compiler will almost always do a better job and knowledge of assembler is really for when you're debugging or writing such compilers in the first place, or hitting corner cases at a deep, deep hardware level.<br> <p> Honestly, how much inline assembler is actually out there, and how far would it differ if you just put equivalent C in the same places.<br> <p> I remember, I think it was in the emulator fields, when hand-crafted MMX could win-out in certain highly specialised operations. Almost inevitably, though, the code gave way to slightly-slower (or even same-speed) and yet much more readable C equivalents.<br> </div> Mon, 02 May 2016 21:45:17 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685918/ https://lwn.net/Articles/685918/ andresfreund <div class="FormattedComment"> You very well might want the compiler to use *looser* semantics than strictly required based on all possible side effects. There's e.g. plenty cases where you might want to atomically increment something, but you don't need the compiler to write back all "dirty registers". But there's also a lot of cases (most prominently locking related stuff) where you really, really can't have that. If you went by the strict definition any instruction involving writing to memory would have to act as a "m" constraint, based on your definition - it could be releasing a lock. 
<br> </div> Mon, 02 May 2016 20:53:48 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685914/ https://lwn.net/Articles/685914/ daney <div class="FormattedComment"> <font class="QuotedText">&gt; What always bugged me is the fact that developer has to explicitly</font><br> <font class="QuotedText">&gt; write down the constraints at all. Compiler has all the knowledge</font><br> <font class="QuotedText">&gt; of the hardware platform already, ...</font><br> <p> This is not really true. The compiler has knowledge about a subset of the hardware capabilities.<br> <p> Almost by definition, we write inline assembly to do things outside of what the compiler has knowledge about, because if the compiler knew, we could just write our code using the compiler input language.<br> <p> To do things outside of the set of things the compiler understands, we have to communicate some aspects of what we need and what we are doing back to the compiler so that it can interact with our inline assembly in a sensible manner. For this there are the constraints.<br> </div> Mon, 02 May 2016 20:33:26 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685902/ https://lwn.net/Articles/685902/ adobriyan <div class="FormattedComment"> What always bugged me is the fact that developer has to explicitly write down the constraints at all. Compiler has all the knowledge of the hardware platform already, so deducing that, say, STOSB clobbers memory and RDI and takes AL as input should not be a rocket science. 
So, no, please don't standardize GCC's asm, nobody understands it fully anyway -- just read all the implementations of memset/memcpy/strchr in libcs, kernels, random code in the web and watch how developers use =&amp;S/=&amp;D/&amp;=c constraints.<br> </div> Mon, 02 May 2016 19:07:51 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685891/ https://lwn.net/Articles/685891/ gutschke <p>While the article provides a good high-level overview, it skips a couple of the details that are needed in real-life programming. It's been a while since I had to write assembly, so I have probably forgotten a lot of the details that caused me confusion.</p> <p>The two issues that I do recall are 1) how to deal with registers that don't have a dedicated constraint letter, and 2) how to deal with more than 10 input/output constraints.</p> <p>The former can be solved by assigning variables to named registers; here is an example of what you would do, if you needed an input to be in <code>%r10</code>:</p> <pre>
int main() {
  /* Pin a to %r10; the generic "r" constraint then has to use that register. */
  register int a __asm__("r10") = 5;
  int b;
  __asm__("movl %1, %0" : "=r"(b) : "r"(a));
  return b;
}
</pre> <p>And the best solution for providing more than 10 parameters is to place them into a temporarily allocated <code>struct</code> on the stack. This can be a little ugly, as you often have to hard-code offsets into the structure, and it can also result in slightly less efficient code. But in practice that often doesn't matter too horribly.</p> <p>Finally, specifying correct clobber constraints can be surprisingly counter-intuitive unless you have a really firm grasp of the C language standard. If in doubt, it is always safe to mark the <code>__asm__</code> statement as <code>volatile</code> and to state that it clobbers <code>memory</code>. 
This results in slightly less efficient code, but it is a lot more likely that the optimizer doesn't interfere in ways that the programmer didn't anticipate.</p> Mon, 02 May 2016 18:03:42 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685804/ https://lwn.net/Articles/685804/ bjacob <div class="FormattedComment"> Inline assembly in either GCC or Clang is very under-documented, so the best documentation available is "random" pieces like this scattered across the Web. Thanks for this one!<br> <p> Compiler maintainers, please make more inline asm documentation!<br> Language standardizers, please consider standardizing whichever aspects of inline asm can be standardized! I know it's a tall order, but it's a very important one. Software authors pay a high cost for the current situation, where so-called "GCC-compatible" compilers don't even agree on the way to define a function-local asm label (for example, %= doesn't work in iOS toolchains), etc.<br> </div> Mon, 02 May 2016 13:33:43 +0000 A guide to inline assembly code in GCC https://lwn.net/Articles/685762/ https://lwn.net/Articles/685762/ darwish <div class="FormattedComment"> Excellent.. may I also add that when writing inline asm, it's always good to move as much logic as possible out of the inline assembly code and into the C expressions of the asm constraints. This gives the GCC optimizer way more freedom, especially regarding constant propagation ..<br> <p> Besides the obvious linux kernel examples, I once had a hobby kernel with the string methods optimized using gcc inline x86-64 asm .. it might be useful as a quick &amp; easy consolidated recap after reading the guide above:<br> <p> <a href="https://github.com/a-darwish/cuteOS/blob/master/lib/string.c">https://github.com/a-darwish/cuteOS/blob/master/lib/string.c</a><br> <p> thanks,<br> <p> </div> Mon, 02 May 2016 09:18:49 +0000
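The advice in the last comment, keeping logic in the C expressions of the operands and out of the assembly string, can be sketched roughly as follows. This assumes GCC or Clang on x86. The function name `rotl32` is hypothetical and not taken from the linked code. The masking of the shift count is done in C so the optimizer can fold it, and the `cI` constraint lets the compiler emit an immediate when the count is a known constant in the range 0 to 31, or use `%cl` otherwise.

```c
#include <stdint.h>

/* Hypothetical helper: rotate x left by n bits (n taken modulo 32). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    /* Computed in C, not in the asm string, so it can be constant-folded. */
    uint8_t count = (uint8_t)(n & 31);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ ("roll %1, %0"
             : "+r" (x)        /* value is read and written in place */
             : "cI" (count)    /* %cl, or an immediate in 0..31 */
             : "cc");          /* the rotate updates the flags */
    return x;
#else
    /* Portable fallback; the second shift is masked to avoid shifting by 32. */
    return (x << count) | (x >> ((32 - count) & 31));
#endif
}
```

The assembly string stays a single instruction, and everything the compiler can reason about stays visible to it.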