
Boucher: rustc_codegen_gcc can now bootstrap rustc

On his blog, Antoni Boucher updates the status of rustc_codegen_gcc, which "is a GCC codegen for rustc, meaning that it can be loaded by the existing rustc frontend, but benefits from GCC by having more architectures supported and having access to GCC’s optimizations". A significant milestone has been reached: "the GCC codegen has made enough progress to be able to compile rustc itself". For the Rust programming language, rustc is the standard compiler, so this work will eventually allow programs to be built for a number of architectures that are not supported by rustc. He also made progress beyond just building the compiler as he "was able to compile rustc using the GCC codegen and use the resulting rustc to compile a Hello World".


Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 15:08 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

If there's anyone keeping track of the details I have a question: how much does this shorten the bare-metal bootstrap chain to get to a modern Rust by? I'm vaguely aware there was a project to do this for an entire OS but afaik Rust needed to be built up via its own lineage previously.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 16:03 UTC (Fri) by Gaelan (guest, #145108) [Link]

The previous state of the art was mrustc [0], which is implemented in C++ and is generally capable of compiling a rustc a few versions behind the latest; then you use that to compile a newer version of rustc, then use that version of rustc to compile an even newer rustc, and so on until you’re at the latest version.

[0]: https://github.com/thepowersgang/mrustc

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 16:09 UTC (Fri) by moltonel (subscriber, #45207) [Link]

This doesn't change the bootstrap chain, it "only" opens up the possibility to (cross)compile to platforms only supported by GCC. Note that platform support also needs to be added to rustc/stdlib itself, regardless of the backend used.

Bootstrapping is typically done by cross-compiling using rustc version N-1. If you want to bootstrap from a C compiler, you can use mrustc which compiles rustc-1.54 as C source and use that to build 1.55, 1.56 etc up to the version you need. That mrustc chain gets shortened about once a year.

There's a longterm goal to be able to compile rustc N using rustc N-2 or older, but it'll be a while yet. There's also gccrs which will use gcc's bootstrap machinery, but it's unclear how desirable it'll be as a Rust compiler.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 15:15 UTC (Fri) by artefact (guest, #154379) [Link]

>this work will eventually allow programs to be built for a number of architectures that are not supported by rustc

By rustc's LLVM codegen, which is the current default. rustc_codegen_gcc allows rustc to target architectures supported by gcc.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 18:32 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (44 responses)

I understand it is the early days but have there been any comparisons [yet] of the asm generated between the llvm and gcc backends? The README includes the statement "A secondary goal is to check if using the gcc backend will provide any run-time speed improvement for the programs compiled using rustc.".

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 19:15 UTC (Fri) by david.a.wheeler (subscriber, #72896) [Link]

> I understand it is the early days but have there been any comparisons [yet] of the asm generated between the llvm and gcc backends?

I wouldn't bother checking those comparisons right now. They just got it working at *all*. GCC's back-end does a lot, but I expect that this new front-end will need to provide more information & be refined further to fully use the GCC back-end's optimizations.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 1, 2022 22:10 UTC (Fri) by developer122 (guest, #152928) [Link] (42 responses)

What I'd like to know is about correctness. One recent rust community discovery is that GCC and LLVM don't agree on how 128 bit numbers should be expressed in memory, for example. The stuff of nightmares: https://gankra.github.io/blah/c-isnt-a-language/

Unfortunately, if going through several rounds of GCC optimization that may be hard to verify just by comparing binaries.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 2, 2022 12:06 UTC (Sat) by Vipketsh (guest, #134480) [Link] (36 responses)

The int128 thing isn't great but I would hazard a guess that it's not a big deal because it has not been widely used in ABIs. It has been found and, I would hope, the appropriate fix will be made to whichever compiler needs to be fixed.

Otherwise, I have to say that article seems to be more about generating hysteria against C rather than trying to get people to be aware of the issues. What that article paints as being C's evil spewed onto the world (ABI) is actually being used by *every* compiled language, not because it originates in C, but because it has been tuned to the platforms involved. I shudder to imagine the mess if C, rust, etc. all had their own private calling conventions and structure layouts.

All examples given are not wrong, but painted in a way that the difficulty in matching their calling convention is exclusively C's fault and that somehow calling C code can not be avoided. It can, it just may be easier to, say, call into GTK (despite all the pains involved) than write a new toolkit in your new language. The author's first example is possibly the worst: if you want to interact with an OS (do I/O) you need to match *some* convention -- the OS defined one (system calls) or some wrapping thereof (e.g. C). Neither may be easy, but that is not C's fault. The rant about parsing C being difficult is about singling out a specific language and making it look as bad as possible. Firstly, there is no reason to have to interact with C (as mentioned above) and secondly every single programming language is filled with quirks and is hard to parse.

The intended humor about a long target list is dishonest at best and manipulative at worst. ABIs are matched to the target architecture (ARM, x86, etc.) not only for performance reasons (e.g. endianness) but also out of necessity coming from inherent differences in how the machines operate (e.g. where return addresses are placed). It can not be avoided, nor is it C's fault. Then there is the historical context that ABIs have been changed at various times to get some more convenient property (e.g. performance). There is no suggestion of what an alternative could be.

The example about opaque structs and symbol versioning is just plain wrong. The whole point of using an opaque struct in an ABI is so you can change the struct without changing the ABI. There is no reason that you have to version symbols and very few libraries do so. Furthermore, if you wish to maintain old and new versions of your APIs, there is no reason to hide them behind symbol versions -- just expose both and the user can select (at their leisure) which one to use so the whole problem explained is side-stepped.
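
To make the opaque-struct point concrete, here is a minimal sketch in C. The `counter` type and its functions are hypothetical names invented for illustration; they do not come from the article or from any real library. Callers only ever hold a pointer to an incomplete type, so the private definition can grow a field without changing anything callers were compiled against.

```c
#include <assert.h>
#include <stdlib.h>

/* Public interface: callers see only an incomplete (opaque) type. */
struct counter;                              /* hypothetical type */
struct counter *counter_new(void);
void counter_increment(struct counter *c);
long counter_value(const struct counter *c);
void counter_free(struct counter *c);

/* Private implementation: fields can be added or reordered here
 * without affecting callers, who never see the size or layout. */
struct counter {
    long value;
    long increments;    /* imagine this was added in a later release */
};

struct counter *counter_new(void)
{
    /* calloc zero-initialises both fields. */
    return calloc(1, sizeof(struct counter));
}

void counter_increment(struct counter *c)
{
    assert(c != NULL);
    c->value++;
    c->increments++;
}

long counter_value(const struct counter *c)
{
    assert(c != NULL);
    return c->value;
}

void counter_free(struct counter *c)
{
    free(c);
}
```

In a real library the first half would live in a public header and the second half in a source file, so the struct definition would never reach the caller's compiler at all.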

The minidump example tries to shoehorn a problem (fixed binary file layout) into a structure layout ABI problem. Whoever designed it made a choice to carefully have the same structure layout as the file layout, but that is not the only way nor is it impossible to handle if the structure would not match the file layout. Pretty much every function from the windows API could have been used as an example here, with the difference that the hysteria about reserved fields and structure size alignments could not be written.
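
As a rough illustration of handling a fixed file layout without matching it to a struct layout, the sketch below decodes a record field by field. The record format (a 2-byte type followed by a 4-byte length, both little-endian, no padding) and all names are assumptions made up for this example; this is not the minidump format.

```c
#include <stddef.h>
#include <stdint.h>

/* In-memory form. The compiler is free to pad this however it likes,
 * because it is never overlaid on the file bytes. */
struct record_header {
    uint16_t type;
    uint32_t length;
};

/* Hypothetical on-disk size: 2 + 4 bytes, no padding. */
enum { RECORD_HEADER_WIRE_SIZE = 6 };

static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode one header from buf. Returns 0 on success, -1 if buf is
 * shorter than the on-disk header. */
int record_header_decode(const unsigned char *buf, size_t len,
                         struct record_header *out)
{
    if (len < RECORD_HEADER_WIRE_SIZE)
        return -1;
    out->type = read_le16(buf);
    out->length = read_le32(buf + 2);
    return 0;
}
```

The cost is a few lines of decoding per structure; the benefit is that the result does not depend on the compiler's padding, alignment, or byte order.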

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 2, 2022 13:01 UTC (Sat) by excors (subscriber, #95769) [Link] (1 responses)

> The int128 thing isn't great but I would hazard a guess that it's not a big deal because it has not been widely used in ABIs. It has been found and, I would hope, the appropriate fix will be made to whichever compiler needs to be fixed.

The linked article does spend a lot of time obscuring any useful technical information behind pages of invective, but it appears to be a bug in LLVM/Clang that was raised in 2017 and hasn't been resolved yet: https://reviews.llvm.org/D86310 . (Specifically it seems to be that when an __int128 argument is passed via the stack (i.e. it's not one of the early arguments that go through registers), GCC will align it to 16 bytes but Clang will only align to 8 bytes.)
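
For readers who want to see the shape of the case being described, here is a small sketch, assuming GCC or Clang on a 64-bit target where the `__int128` extension exists. The function name is hypothetical. With seven leading integer arguments, the `__int128` can no longer travel in registers on x86-64 System V and has to be passed on the stack, which is where the two compilers are said to disagree about alignment.

```c
#include <stdint.h>

/* Assumes the GCC/Clang __int128 extension on a 64-bit target.
 * The leading arguments exist only to use up the integer argument
 * registers so that `wide` is passed on the stack. */
uint64_t low_half_of_stack_arg(long a, long b, long c, long d,
                               long e, long f, long g, __int128 wide)
{
    (void)a; (void)b; (void)c; (void)d; (void)e; (void)f; (void)g;
    /* Conversion to uint64_t keeps the low 64 bits. */
    return (uint64_t)wide;
}
```

Built entirely with one compiler this behaves the same everywhere; a mismatch could only show up when the caller and the callee are built by compilers that place `wide` at different stack offsets.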

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 2, 2022 18:14 UTC (Sat) by khim (subscriber, #9252) [Link]

It's not just alignment. Compare.

Note that if you use stdarg.h then clang doesn't work.

And no, that's not because stdarg.h comes from glibc, each compiler brings its own stdarg.h.

Thus I think clang should just be ignored when __int128 is discussed, it's just a bug in the compiler, plain and simple.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 2, 2022 22:47 UTC (Sat) by mjg59 (subscriber, #23239) [Link] (33 responses)

You can't get away from C by simply writing new infrastructure libraries yourself - at some point you have to deal with system calls only being defined in terms of C, and unless you want to reimplement a bunch of corner case support yourself (like how the setuid() syscall only affects the current thread, and extending that out to the entire process is handled in glibc) you're going to end up relying on glibc as well.

The argument isn't fundamentally about C, it's about system design. Nobody deliberately set out to make anything bad, but since C's ABIs aren't generally defined in any kind of machine parsable manner any other language that wants to interoperate with C (which it's going to have to at some point unless it's going to write its own kernel as well) has to end up reimplementing that ABI by hand.

I don't think it's unreasonable to say that in an ideal universe, the interface definitions that provide the de-facto implementation of a platform's standard functionality would be easy to parse. It's not C's *fault* that it's in the position it's in, but the world is more difficult for people as a result.

(And the range of C ABIs to care about is absolutely not just down to inherent differences in the hardware! Windows and Linux have entirely incompatible C ABIs on amd64, which isn't something a language that wants to support both gets to ignore)

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 1:38 UTC (Sun) by developer122 (guest, #152928) [Link] (8 responses)

It is *absolutely* C's fault that we're in this mess because these "unparsable" ABIs and "system designs" are in fact codified C-isms and always have been. They're unparsable because they often still change with the weather.

No system since the invention of Unix has been written in anything but C. They were not designed in a vacuum. The ABI/calling convention/etc started out at "whatever the C compiler does" and inherited C's undefined nature.

"How many bits are there in a byte?"
"how long is a long?"
"How much padding should X have? How should it be aligned?"

For a long time this was completely left to chance and would freely change with each platform/OS/compiler combination. It was only later codified into de facto "standards" (hey triples!) by writing down the behaviour of the most prominent implementation(s).

The #1 rule of C standards-writing is that the implementation determines the standard and *never* the other way around. In areas where implementations disagree, it becomes defined as undefined behaviour and this has happened more often than anyone would like to admit.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 12:32 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> The #1 rule of C standards-writing is that the implementation determines the standard and *never* the other way around. In areas where implementations disagree, it becomes defined as undefined behaviour and this has happened more often than anyone would like to admit.

> "How many bits are there in a byte?"

So you clearly haven't been around long enough. How would you like your program, written in C, to run like a snail on tranquillisers because the HARDWARE defined a byte as six bits. Or as nine. Or whatever other funny the hardware guys gave you.

True, I think today's standard and compiler writers have forgotten that the purpose of a language is to make life easy for the USERS, but pretty much all of what you've complained about in this post actually IS defined. It's "Whatever the hardware gives us".

What the standards writers *should* do now is say "hardware is standardised. Let's formalise that state of affairs. The new standard says a byte is 8-bit and if you want something else you have to require an earlier version of the standard".

Cheers,
Wol

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 14:11 UTC (Sun) by atnot (guest, #124910) [Link]

> So you clearly haven't been around long enough. How would you like your program, written in C, to run like a snail on tranquillisers because the HARDWARE defined a byte as six bits.

I'm not sure who you're arguing with so aggressively here. The entire point of this discussion is that programming languages make for poor interface definition languages. Indeed, ambiguities are a great opportunity for performance in a programming language. But in a binary interface definition, they are profoundly undesirable.

I can appreciate that you're fond of C's specific ambiguities and feel the need to defend them, but that doesn't really change anything in this context.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 18:26 UTC (Sun) by mti (subscriber, #5390) [Link]

> No system since the invention of Unix has been written in anything but C.

Not true. CP/M, MS-DOS and many, many others were written in assembler and used calling conventions not compatible with C compilers. MacOS was written in Pascal with lots of assembler and used Pascal calling convention. I don't know what language early Windows used apart from assembler but the calling convention was Pascal-compatible, not C-compatible. When writing C-code you had to insert the nonstandard 'pascal' keyword in function prototypes.

The Transputer was programmed in Occam.

I assume lisp machines used Lisp.

And so on.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 23:40 UTC (Sun) by ilammy (subscriber, #145312) [Link] (1 responses)

> The #1 rule of C standards-writing is that the implementation determines the standard and *never* the other way around.

That’s not a necessarily bad thing. Most of the Internet is defined in the same way, by practical “implementation-first” standards in the form of RFCs.

The crux of the approach is how willing you are to grind the implementations against each other to make them interoperate. With internet, you have a strong incentive to keep implementations interoperable. With C compilers, you don’t give a damn about other compilers on other OSes, and often on the same OS too, because they won’t likely have to interoperate with the compiled code your compiler produces.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 7:28 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> That’s not a necessarily bad thing. Most of the Internet is defined in the same way, by practical “implementation-first” standards in the form of RFCs.

In practice, this has balkanized the internet[1] into "Blink" and "not Blink," but nobody cares because "not Blink" is barely large enough to matter,[2] so if supporting "not Blink" is nontrivial, you just say "ah, fuck 'em," and now your website is Blink-only. I hardly think that's a good thing. The upside, one might argue, is that HTML5 has grown by leaps and bounds, but web developers have used these new features primarily for evil, in one of the worst examples of Wirth's law that I have ever seen. A typical modern website is much slower and more painful to use than anything that existed in (say) the mid 2000's.[3]

Disclaimer: I work for Google; views are my own.

[1]: Technically, the web. But email is de facto web because everyone uses HTML emails by default, which they then proceed to render in their web browsers. Non-web non-email internet services barely exist outside of low-level stuff like BGP and proprietary crap like online gaming and walled gardens. Sure, you *can* fire up an NNTP client and doodle around on Usenet, but the vast majority of users aren't doing that. To a first approximation, the web (plus email) is the only end-user-visible part of the internet where interoperability still means anything.
[2]: "Not Blink" can be further subdivided into Gecko, WebKit, and "this will definitely never be supported by anybody so don't bother asking."
[3]: By that point, everybody blocked old-school popups, JavaScript popups either hadn't been invented or were still uncommon, and ad-blocking basically worked on most sites without displaying a "please turn off your ad-blocker" message.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 23:53 UTC (Sun) by milesrout (subscriber, #126894) [Link] (2 responses)

It's not even clear what you're complaining about here. How would this be different? What is a 'byte'? It's the smallest addressable unit of memory. There are systems out there where that is not an 8-bit quantity. When you are specifying a language you have two options: you can require that bytes are 8 bits or you can leave it up to the platform. Most languages specified these days do the former: they sacrifice any chance of compatibility with non-8-bit platforms to make life simpler for people writing software for the most common types of platform. That makes some level of sense: something like ripgrep is unlikely to ever be run on a 12-bit-byte DSP anyway, so forcing its author to write his Rust code while remembering "CHAR_BIT isn't necessarily 8" is pointless.

When C was specified though, it chose a different path: to maintain compatibility with as many platforms as possible. You can still write C code that assumes that CHAR_BIT=8. There's nothing stopping you. It's not "undefined behaviour". It's implementation-specified what the size of a char is. You are only going to run it on platforms where that is true anyway. It might exhibit undefined behaviour when compiled with a compiler for a platform that is specified with CHAR_BIT=16 or CHAR_BIT=12 or something, but so what? That's the problem of whatever muppet decided to try to compile your code somewhere that doesn't comply with the range of platforms you have decided to support.
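
A minimal sketch of what "assume CHAR_BIT is 8 and say so" can look like in C11. The function name is made up for the example. The static assertion turns the assumption into a compile-time check, so an unsupported platform gets a build error rather than misbehaviour at run time.

```c
#include <limits.h>
#include <stdint.h>

/* Refuse to compile outside the range of platforms this code supports. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Pack four bytes into a 32-bit word, most significant byte first.
 * Relies on each array element holding exactly 8 bits. */
uint32_t pack_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```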

This is what people seem to fail to understand about C and its "undefined behaviour" that they dislike so much. Nothing is forcing you to write code that is totally agnostic to the size of a byte or the size of a long. If you want to write code that is portable across a wide variety of platforms, you can do, because that's how the C standard is defined. It's IMPOSSIBLE to write such portable code in Rust. You simply CANNOT write code that will transparently work with a platform with 12-bit bytes. In C you can. But you can also choose *not* to do this. You can write code that works only on a restricted set of platforms. You just aren't forced to do so.

> For a long time this was completely left to chance and would freely change with each platform/OS/compiler combination. It was only later codified into de facto "standards" (hey triples!) by writing down the behaviour of the most prominent implementation(s).

This is not true. This behaviour naturally changes with different platforms. How many bits there are in a byte is not "left to chance". It does not "change with each platform". It just is different on different platforms. It's an inherent property of the platform. You can emulate 8-bit bytes on top of a platform where this is not naturally true but it will lead to very inefficient code. The natural choices for these values vary because the platforms themselves vary. What C does is that it allows you to write code that works on a variety of platforms that work differently.

Different platforms have different sized bytes, different sized addresses and different sized registers. Different platforms have different capabilities to load values at different alignments. For example there are processors that have no problem loading 8-byte values that are only 4-byte aligned, or unaligned completely. You seem to be in favour of a single rule for every platform, which would presumably mean that on platforms where alignment is less of a concern, we're still forced to have bloated structs full of padding even where they are completely unnecessary. Would that be a good thing? I don't think so. It would just result in people using a specialised language with a specialised compiler for efficient code.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 8:08 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

8-bit bytes is a silly example because nobody really loses sleep over it anyway (we all just assume CHAR_BIT == 8 as you suggest). A better example would be LP64 vs. LLP64. This is a real disparity between systems that people actually use on a regular basis, and it has nothing to do with hardware differences. Windows says sizeof(long) == 4 and Linux says sizeof(long) == 8, on exactly the same hardware (x86_64). There is no logical reason that this disparity had to exist. C could just as easily have said "on 64-bit platforms with 8-bit bytes, char is 1 byte, short is 2, int is 4, long is 8, also int_fastNN_t exists if you really want it, but please don't use that to specify a real API that anyone is going to rely on." But they didn't do that, instead electing to make int a shorter way of writing int_fast16_t.[1]
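
The LP64/LLP64 split can be observed directly from C. The sketch below is illustrative only: `data_model` and `struct file_extent` are hypothetical names, and the second half shows what an interface looks like when it is written with fixed-width types so that its field widths no longer depend on the platform's choice for `long`.

```c
#include <stdint.h>

/* Report which common data model the current compiler target uses. */
const char *data_model(void)
{
    if (sizeof(void *) == 8 && sizeof(long) == 8)
        return "LP64";      /* e.g. Linux on x86_64 */
    if (sizeof(void *) == 8 && sizeof(long) == 4)
        return "LLP64";     /* e.g. Windows on x86_64 */
    if (sizeof(void *) == 4 && sizeof(long) == 4)
        return "ILP32";
    return "other";
}

/* Hypothetical interface struct using fixed-width types: 8 + 4 + 4
 * bytes whatever the width of long happens to be. */
struct file_extent {
    int64_t offset;
    int32_t length;
    int32_t flags;
};
```

Had `offset` been declared as `long`, the same source would describe a 16-byte struct on Linux and a 12-byte one on 64-bit Windows.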

But even that isn't *really* necessary. We could just as easily have said "open(2) returns int32_t," (int16_t might have been more likely at the time this was getting standardized...), and the same for every other interface under the sun, but even if that were a standard POSIX policy (it's not), lots of other (library) interfaces would end up using int anyway because int is easier to type. So, while you can make a case for this being an interface problem and not a C problem, C is driving the getaway car.

[1]: Yes, I know that it's really the other way around, that int_fast16_t is really a longer way of writing int, but that's beside the point. I care about the semantics of these types; I don't care which one is fundamental and which one is a typedef to the other.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 8, 2022 1:07 UTC (Fri) by bartoc (guest, #124262) [Link]

C23 (2x really, pretty likely 23) adds new `_BitInt(XX)` types that let you just specify the bit-width of types (kinda like Ada's `type Int32 is range -2**31 .. 2**31-1`), but they don't require the oddball "maybe the system doesn't use base-2 arithmetic" range thing, and have more permissive conversion rules (it's Ada, can't really get less permissive than that, although I think even Ada allows widening in some situations, unlike some other languages).

Anyway, they are pretty cool, non-power-of-two widths can give the optimizer interesting busywork, and importantly THEY ARE NOT SUBJECT TO INTEGER PROMOTION (but still to widening, as noted). So if you're writing code that needs to do a bunch of bit-twiddling on stuff smaller than an int you no longer have to insert casts all over the place!
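
For anyone who hasn't been bitten by integer promotion, here is a small sketch of the kind of cast being referred to. It uses plain `uint8_t` rather than `_BitInt`, since compiler support for the new types still varies; the function names are invented for the example.

```c
#include <stdint.h>

/* Looks like it clears the high nibble, but x is promoted to int
 * before the shift, so the bits shifted above bit 7 are kept and
 * come straight back with the right shift. */
uint8_t low_nibble_promoted(uint8_t x)
{
    return (uint8_t)((x << 4) >> 4);
}

/* The inner cast truncates the intermediate value back to 8 bits,
 * which is what the author of the expression above probably meant. */
uint8_t low_nibble_truncated(uint8_t x)
{
    return (uint8_t)((uint8_t)(x << 4) >> 4);
}
```

With an input of 0xAB the first function returns 0xAB unchanged, while the second returns 0x0B.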

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 9:16 UTC (Sun) by roc (subscriber, #30627) [Link] (23 responses)

> at some point you have to deal with system calls only being defined in terms of C

This confuses me, because the syscall calling convention and C calling conventions are different on most (maybe all!) architectures.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 10:25 UTC (Sun) by atnot (guest, #124910) [Link] (22 responses)

If you look at some random syscall, say epoll_ctl, you will see it defined as:

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)

Where epoll_event is further defined as a C struct and union I won't paste here.

This is subject to all of the other C ABI semantics aside from calling convention, including the implicit size of the types, general layout, alignment and padding of the struct, union and all of its members.

This means that unless you are using an existing C compiler, you need to parse C, which is impossible, and match all of these precise semantics identically, which is almost impossible.
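
To show how much of a struct's binary layout is implicit, here is a sketch with two hypothetical structs (stand-ins, not the real `epoll_event` definition). Both have the same two fields; the only difference is a packing attribute, which is a GCC/Clang extension and an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler may insert padding after `events`
 * so that `data` meets its alignment requirement. */
struct event_natural {
    uint32_t events;
    uint64_t data;
};

/* Same fields with padding suppressed (GCC/Clang extension). */
struct event_packed {
    uint32_t events;
    uint64_t data;
} __attribute__((packed));

/* Byte offset of `data` in each variant. */
size_t natural_data_offset(void)
{
    return offsetof(struct event_natural, data);
}

size_t packed_data_offset(void)
{
    return offsetof(struct event_packed, data);
}
```

On a typical x86_64 compiler the natural variant puts `data` at offset 8 in a 16-byte struct, while the packed variant puts it at offset 4 in a 12-byte struct. A non-C language calling such an interface has to reproduce whichever of these the header implies, attribute and all.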

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 20:33 UTC (Sun) by wahern (subscriber, #37304) [Link] (9 responses)

And it would be better how if those interfaces were defined using Rust or Swift layouts? Or some other specification? Why have I not read any solid suggestions for changes that can actually be critiqued?

The entire C ABI rant-fest only begs the question: what then? It's like WebAssembly folks pining for the day WASM defines a tracing GC facility, assuming doing so will then make it trivial to port all GC'd languages to WASM. Words can't even describe how utterly naive that is.

I have my own beef with typical platform ABIs: fixed-sized stacks make it extremely difficult for most languages to introduce better threading and concurrency semantics. Go did, but of course everybody laments Go's "slow" FFI to C code (i.e. C and everything else with contiguous, non-movable stacks). But I also admit that it's a real pickle because those fixed-sized stacks exist for myriad reasons, few of which have to do with C specifically--in fact, almost none, because the C standard requires neither contiguous nor non-movable stacks.

Nothing is stopping languages from having better semantics. Nothing is stopping languages from having better FFI semantics. Well, nothing except time and motivation. If the Rust and Swift communities want better FFI between them, then do it! And if they want to make it generic enough that other languages can join in, go for it! But I must warn you--the more generic and universal it becomes, the closer it will parallel existing ABIs.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 21:35 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> And it would be better how if those interfaces were defined using Rust or Swift layouts? Or some other specification? Why have I not read any solid suggestions for changes that can actually be critiqued?

Some microkernel OSes in the past used formally defined IDLs to describe the kernel-user interface. Which makes total sense, because messages go well with formal structure definitions.

> The entire C ABI rant-fest only begs the question: what then?

An IDL that can formally describe structures, including precise bit widths, byte order, alignment and padding requirements. And a formally defined calling convention that strictly defines the registers used for argument passing (for each architecture), stack usage, etc.

That's at minimum. Ideally we would also want to have a formally specified way to unroll stacks for backtraces and exception handling.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 23:38 UTC (Sun) by milesrout (subscriber, #126894) [Link] (7 responses)

> An IDL that can formally describe structures, including precise bit widths, byte order, alignment and padding requirements. And a formally defined calling convention that strictly defines the registers used for argument passing (for each architecture), stack usage, etc.

Let me present to you: the System V ABI! All of this is exceedingly well-documented and specified already. The calling convention is not exactly complicated, and nobody actually disagrees about things like how you pass 32-bit unsigned integers and 64-bit signed integers around, or how structs are laid out. It's all really quite simple.

The problem is that every language implementation that wants things that aren't in C does stuff its own way. How do vtables work? How do 128-bit integers work? Etc. That depends on the language, and the implementation of that language. But of course it needs to: the languages have different semantics *for a good reason* presumably, and that means different implementations that are incompatible for things like virtual functions etc.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 23:53 UTC (Sun) by mjg59 (subscriber, #23239) [Link] (4 responses)

It's all quite simple, other than having to parse it out of human readable text that isn't necessarily even a complete reference itself (the amd64 ABI doc incorporates the ia32 ABI doc by reference and only defines the differences!) and then do the same for every other ABI you want to target. But knowledge of the platform ABI isn't sufficient to interpret C headers, you still need to deal with macros and other preprocessor behaviour to turn them into something that another language can make use of, so what ends up happening a lot is that people write bindings by hand.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 0:05 UTC (Mon) by milesrout (subscriber, #126894) [Link] (3 responses)

Now you're talking about something completely different. API compatibility is a different set of issues. Different languages are different. If you want to interface with Rust code using Rust language features you need to virtually reimplement Rust. If you want to interface with C code using C language features you need to virtually reimplement C. This is not special to C at all. Are we talking about API or ABI? They're different.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 0:17 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (2 responses)

If I want to use a C library from another language, knowing the ABI for that library depends on knowing what the layout of any structures it defines are, and that (potentially) depends on having to either implement the full behaviour of the C preprocessor or reconstruct it by hand. This wouldn't matter hugely except for C being the de-facto interface for FFI - even libraries written in other languages may offer a C interface for cross-language compatibility. This isn't a criticism of C, it's just pointing out that C is being used for purposes it was never intentionally designed for, and the world is a little more awkward as a result.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 3:26 UTC (Mon) by wahern (subscriber, #37304) [Link] (1 responses)

I think both LLVM and GCC now support CTF, which is a minimalist data structure description developed by Sun as an alternative to DWARF, specifically for simple and easy reflection. It found its way to BSDs through D-Trace, which provided the original impetus for inclusion into GCC and LLVM. But mindshare in Linux-land blossomed with eBPF and BTF--BTF being a dialect of CTF employed as a way for eBPF to reflect on the running kernel without having to parse headers or load externalized symbol tables.

Anyhow, the obvious path forward, IMO, would be to have major distributions enable built-in (i.e. not externalized to separate files) CTF symbols by default. That's literally why it was invented--so people would have less reason to strip them. You could do the same with DWARF, but realistically far fewer people would agree to making built-in DWARF descriptions standard practice as they're more complex. And you want them to be built-in, because it creates uniformity--if you can load a program or library, you're guaranteed to be able to load the API description.

IIRC, last time LWN reported on it there was still moderate resistance to making in-kernel BTF symbols mandatory for using eBPF. But it's a much easier sell when it comes to userland binaries. And for userland access to CTF descriptions of public Linux kernel interfaces, there's much less need for CTF descriptions to be included in the running image--because it's stable, they could just as well be kept separately on disk, and anyhow accessing them would necessarily need to be different than for userland binaries--you don't dlopen the syscall API.

I don't know the finer details (haven't worked with CTF directly), but I believe it can be used as-is on Windows and most other platforms. It looks like Microsoft already has a port of D-Trace, including CTF, for Windows: https://github.com/microsoft/DTrace-on-Windows/tree/windo...

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 13, 2022 11:55 UTC (Wed) by nix (subscriber, #2304) [Link]

> I think both LLVM and GCC now support CTF

Up to a point, Lord Copper. GCC 12 will be able to emit a form of CTF which can be read by a libctf library shipped in recent GNU binutils, and GNU ld can link such objects together and deduplicate them. But this is *not* the same format as Solaris, FreeBSD, etc use: those formats had severe limitations on the number of types they could support which more or less required a format break (either that, or force the Linux kernel people to go through the same conniptions the FreeBSD kernel devs do to keep the total number of types down: just no), and since we're doing one of those why not fix a bunch of other problems too?

There is no support in binutils libctf for reading Solaris, FreeBSD etc CTF, for the simple reason that nobody has ever asked for it. It could be added, in theory, if someone had the time.

There is also no support in LLVM for generating CTF, though it probably wouldn't be hard to add. (Equally, libctf does all the work of linking CTF dicts together itself, so licenses permitting lld could use it just like GNU ld does. Of course 'licenses permitting' is the big part here.)

> to have major distributions enable built-in (i.e. not externalized to separate files) CTF symbols by default

This is the design intent of CTF and why it sacrifices a lot of outwardly sensible things (like knowing which specific TUs a type was defined in) to save space. This is also why strip doesn't strip .ctf sections. CTF is meant to Just Work, Dammit, with no symbol servers or huge debuginfo files or anything like that.

So up until yesterday I'd have said, yeah, go for it! Now I'm afraid I have to say "let me fix one last bug first", since it turned out that objcopy --{keep,strip}*-symbols corrupts the CTF symbol->type tables, and oh look what does RPM's find-debuginfo.sh call? :/ So, maybe next week? I guess I'll backport the fix to binutils 2.38 and possibly 2.37 and any earlier releases anyone asks for, since this has fairly nasty impacts on RPM-based distros trying to use CTF.

> And for userland access to CTF descriptions of public Linux kernel interfaces, there's much less need for CTF descriptions to be included in the running image--because it's stable, they could just as well be kept separately on disk

This is what DTrace for Linux does, using a dedicated archive format which is also used inside .ctf sections when the inputs contain types with multiple distinct definitions and the same name. (libctf abstracts over all this stuff for you). The kernel never needs to know about CTF at all: only its build system does, and the impact there is minimal.

> It looks like Microsoft already has a port of D-Trace, including CTF, for Windows

This is the old, Solaris-era CTF, with all its size limitations: 32k types max in a single CTF dictionary, etc. It will definitely not suffice for the Windows kernel, even if CTF supported C++, which it doesn't. (At least, not yet -- I hope to add enough support to our version in time that we can at least encode ABI differences that might impact C++ programs using it.)

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 12:32 UTC (Mon) by vasvir (subscriber, #92389) [Link]

Also strings can be quite different across languages. But of course if we can't agree on what int128 looks like then what chances do we have to agree on a common string representation?

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 5, 2022 20:56 UTC (Tue) by khim (subscriber, #9252) [Link]

>Let me present to you: the System V ABI! All of this is exceedingly well-documented and specified already.

Can you tell me where I can download that wonderful documentation, and which part of it describes the second argument of the signal syscall?

>It's all really quite simple.

Only if you are dealing with C because all calling conventions and data structures are defined in terms of C.

You can only know whether seek returns 2, 4, or 8 bytes by parsing the headers supplied with the system C compiler.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 23:32 UTC (Sun) by milesrout (subscriber, #126894) [Link] (11 responses)

The epoll_ctl system call is not that function prototype. That is a prototype for a library function in glibc which wraps the actual system call. There is no requirement to interface with the Linux kernel using glibc. Go, for example, is quite well known for not using the standard library to interface with the kernel because of its special runtime issues: it uses quite small stacks and glibc doesn't provide any guarantees as to how much stack space is required to call its functions.

Now on some systems, system calls are actually defined as library functions. On the BSDs, you are required to interface with the system using C library functions. There, the interface really is C. But on Linux, the interface is "make a raw system call with one of these stable system call numbers we guarantee will be around forever using the SysV ABI".

> This means that unless you are using an existing C compiler, you need to parse C, which is impossible, and match all of these precise semantics identically, which is almost impossible.

This is the whole point of the independently documented and language-agnostic System V ABI. For amd64 this is specified here: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psAB...

This includes all the information about where to put arguments, struct layout, etc. Struct layout is also really not that complicated on amd64 at least: everything is in the order specified with the minimum offset that is properly aligned. Bitfields and packed structs etc. are a bit more complicated, but still documented appropriately.
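As a rough illustration of the "minimum offset that is properly aligned" rule, here is a minimal Python sketch. The function name `layout` and the example struct are made up for illustration; the sizes and alignments are the usual amd64 values for `char`, `int32_t`, `int64_t` and `int16_t`. It ignores bitfields and packed structs, and it adds the tail padding that rounds the total size up to the struct's own alignment.

```python
def layout(fields):
    """Compute plain C struct layout from (name, size, alignment) tuples.

    Fields are placed in declaration order, each at the lowest offset
    that is a multiple of its alignment. Bitfields and packed structs
    are deliberately not handled here.
    """
    offset = 0
    max_align = 1
    offsets = {}
    for name, size, align in fields:
        # Round the running offset up to the next multiple of `align`.
        offset = (offset + align - 1) // align * align
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    # The struct's size is padded to a multiple of its largest alignment.
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total, max_align


# Hypothetical example:
# struct { char tag; int32_t count; int64_t when; int16_t flags; }
example = [("tag", 1, 1), ("count", 4, 4), ("when", 8, 8), ("flags", 2, 2)]
offsets, size, align = layout(example)
print(offsets, size, align)
```

What this does not give you is the list of fields and their C types in the first place; that still has to come from somewhere.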

It's quite irrelevant how int128 is laid out. It's not standardised. It's not used in the Linux kernel system call interface for that reason. It's no more relevant to the Linux system call interface than gcc's layout of vtables in C++ is. It's a non-standard language extension on top of C.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 18:12 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Now on some systems, system calls are actually defined as library functions.

To the best of my understanding, this is true of Windows, macOS, and pretty much all other systems that large* numbers of people actually care about. Linux is the odd one out here.

* Yes, I'm sure lots of people use weird OSes like Haiku and Plan 9, but I'm deliberately ignoring those, partly because I don't know how they work, and partly because they're basically hobbyist affairs at this point.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 5, 2022 21:32 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> Linux is the odd one out here.

It's slightly odd, yes, but not odd enough. Yes, the actual syscall parameter passing is different from the C ABI, but the data structures which you have to pass to the kernel are only defined in a set of .h files, which effectively means you are dealing with a C library of a slightly inconvenient form.

The only way to get any information about how that ABI works is to parse these headers.

Compare that to the MS-DOS interface. There all the data structures are described with byte offsets and sizes, not in terms of C.
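To make "described with byte offsets and sizes" concrete, here is a small Python sketch. The record below is hypothetical and is not an actual MS-DOS structure; the point is only that a table of offsets and sizes can be consumed from any language without a C parser.

```python
import struct

# Hypothetical record, documented purely as offsets and sizes:
#   offset 0, 1 byte : attribute flags
#   offset 1, 2 bytes: time stamp, little-endian
#   offset 3, 4 bytes: file size, little-endian
#   offset 7, 8 bytes: name, NUL-padded ASCII
RECORD = struct.Struct("<BHI8s")  # "<" = little-endian, no implicit padding


def parse_record(buf):
    """Decode one record from a bytes-like object, starting at offset 0."""
    attr, stamp, size, raw_name = RECORD.unpack_from(buf, 0)
    return {
        "attr": attr,
        "stamp": stamp,
        "size": size,
        "name": raw_name.rstrip(b"\0").decode("ascii"),
    }


sample = (
    bytes([0x20])
    + (0x1234).to_bytes(2, "little")
    + (1000).to_bytes(4, "little")
    + b"README\0\0"
)
print(parse_record(sample))
```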

I'm not saying the MS-DOS ABI was good. Far from it. But that one wasn't a C library ABI. Linux syscalls? Nope, they are still built around C.

My favorite example is Vulkan. Its API is defined in a machine-readable XML file here.

It works like this. Here is the definition of a data structure:

<type category="struct" name="VkDebugMarkerMarkerInfoEXT">
    …
    <member><type>float</type> <name>color</name>[4]</member>
</type>

Field color here is an array of four floats.

And here is a Vulkan command definition:

<command queues="graphics" renderpass="both" cmdbufferlevel="primary,secondary">
    <proto><type>void</type> <name>vkCmdSetBlendConstants</name></proto>
    …
    <param>const <type>float</type> <name>blendConstants</name>[4]</param>
</command>

Command argument blendConstants here is a pointer to an array of four floats.

Nice, easy to parse, totally not similar to C header IDL.

</sarcasm mode off>

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 5, 2022 23:02 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

> My favorite example is Vulkan. Its API is defined in a machine-readable XML file here. [...]

Vulkan doesn't claim that its machine-readable XML is a programming-language-independent IDL. It explicitly says:

> The XML schema is not pure XML all the way down. In particular, command return types/names and parameters, and structure members, are described in mixed-mode tag containing C declarations of the appropriate information, with some XML nodes annotating particular parts of the declaration such as its base type and name. This choice is based on prior experience with the SGI .spec file format used to describe OpenGL, and greatly eases human reading and writing the XML, and generating C-oriented output. The cost is that people writing output generators for other languages will have to include enough logic to parse the C declarations and extract the relevant information.
(https://www.khronos.org/registry/vulkan/specs/1.3/registr...)
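To show what "mixed-mode" means in practice, here is a minimal Python sketch that runs the standard library's `xml.etree.ElementTree` over the two fragments quoted earlier in the thread (with the elided parts left out). The helper name `describe` is made up for illustration. The annotated parts (`<type>`, `<name>`) come out easily, but the qualifiers and the `[4]` suffix are loose text that a generator has to interpret as C.

```python
import xml.etree.ElementTree as ET

# Fragments taken from the thread, without the surrounding elements.
MEMBER = '<member><type>float</type> <name>color</name>[4]</member>'
PARAM = '<param>const <type>float</type> <name>blendConstants</name>[4]</param>'


def describe(xml_text):
    """Split one mixed-mode declaration into its annotated and loose parts."""
    elem = ET.fromstring(xml_text)
    name = elem.find("name")
    return {
        "type": elem.find("type").text,
        "name": name.text,
        # Text before the first child element, e.g. "const".
        "qualifiers": (elem.text or "").strip(),
        # Text after <name>, e.g. "[4]"; this is raw C syntax.
        "suffix": (name.tail or "").strip(),
        # Concatenating all text nodes gives back the C declaration.
        "c_decl": "".join(elem.itertext()),
    }


print(describe(MEMBER))
print(describe(PARAM))
```

Both fragments yield the same `[4]` suffix; that it means an inline array in the struct but a pointer in the parameter list is a C rule the XML does not spell out.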

I expect a large majority of Vulkan users (including drivers and applications) are in C/C++, because it's a fundamentally unsafe low-level API and any application in a higher-level language should be using a higher-level wrapper or graphics engine. API performance was one of the design goals for Vulkan (because call overhead had become a real problem in OpenGL), so they couldn't afford extra argument encoding/decoding per call and there was no real alternative to depending on the platform's default C ABI. So the C-centric specification was a deliberate, pragmatic choice based on who would be using the API, and was not because C is the (terrible) de facto standard FFI IDL that is expected to be supported by all languages (which I think is the complaint that started this thread).

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 6, 2022 15:38 UTC (Wed) by khim (subscriber, #9252) [Link]

> So the C-centric specification was a deliberate, pragmatic choice based on who would be using the API, and was not because C is the (terrible) de facto standard FFI IDL that is expected to be supported by all languages (which I think is the complaint that started this thread).

You made a tiny typo. If you remove the word not from that sentence then it would be correct.

Vulkan made that choice precisely because C is a terrible FFI IDL but it is the de facto standard FFI IDL.

The only thing that can be done easily with that XML is generating a human-readable PDF. Which is, actually, the only thing The Khronos Group wanted.

If you look at the scripts which have to deal with it for anything else (e.g. here or here) you'll see the large, messy, badly defined parsers needed to, somehow, process that mess. This is despite the fact that the first one, actually, produces .h files! Even that cannot be done without massaging it with regexps and other such nonsense.

It's really a perfect example both of C being used as the FFI IDL today and of it not working well in that role.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 5, 2022 21:02 UTC (Tue) by khim (subscriber, #9252) [Link] (6 responses)

> But on Linux, the interface is "make a raw system call with one of these stable system call numbers we guarantee will be around forever using the SysV ABI".

…but pass around a data structure which is only defined in a certain C header and may include pointers to C functions.

At which point you are basically working with a C library of a slightly unusual form.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 7, 2022 19:37 UTC (Thu) by ballombe (subscriber, #9523) [Link] (5 responses)

No, the SysV ABI is defined at the binary level.
It is just that C has a convenient way to map raw binary data.
Any C replacement must be able to do the same.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 7, 2022 19:57 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

And that's exactly what we're all complaining about. Take fstat for example - if I can parse sys/stat.h, I then know how the buffer pointed to by the second parameter is laid out in memory (because the SysV ABI describes how the C structure definition maps to a binary layout). But the documented way to do that requires a full C preprocessor complete with predefined preprocessor macros that match those a supported C compiler would set, so that I get a file that I can parse as C that includes a definition of struct stat - and of course, I need a full C parser to make sense of that structure and map it back to the SysV ABI.

Which is problematic, because it means that to implement a new language, I first need to implement a complete C preprocessor and parser, to allow me to get to a point where the SysV ABI actually matters.

Ideally, we'd have some form of language-independent ABI document from which C headers are generated, but that's a lot of work that's not guaranteed to be accepted since as long as C or C++ is your language of choice, it's "obfuscation" of the "real" definition that the compiler actually works with.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 7, 2022 21:02 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (3 responses)

> Which is problematic, because it means that to implement a new language, I first need to implement a complete C preprocessor and parser, to allow me to get to a point where the SysV ABI actually matters.

It's a bit ugly, but for structure layouts you could use the system C compiler to produce a binary which prints the offsets and sizes of the fields in a standard, easy-to-parse format for other languages to consume. I imagine that would be *much* easier than reimplementing the C preprocessor and parser. You would need to know the field names up front, and bitfields might present some special challenges.
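A minimal sketch of that approach, in Python: generate the C probe program as text, and parse the lines it would print. The helper names and the `name offset size` output format are assumptions made up for this example; compiling and running the probe with the system C compiler is left out.

```python
def make_probe_source(header, struct_name, fields):
    """Return C source that prints 'name offset size' for each field.

    The first output line describes the whole struct, with offset 0.
    The field names must be known up front, as noted above.
    """
    lines = [
        "#include <stddef.h>",
        "#include <stdio.h>",
        f"#include <{header}>",
        "",
        "int main(void) {",
        f'    printf("{struct_name} 0 %zu\\n", sizeof(struct {struct_name}));',
    ]
    for field in fields:
        lines.append(
            f'    printf("{field} %zu %zu\\n", '
            f"offsetof(struct {struct_name}, {field}), "
            f"sizeof(((struct {struct_name} *)0)->{field}));"
        )
    lines += ["    return 0;", "}"]
    return "\n".join(lines) + "\n"


def parse_probe_output(text):
    """Turn the probe's output into {name: (offset, size)}."""
    result = {}
    for line in text.splitlines():
        name, offset, size = line.split()
        result[name] = (int(offset), int(size))
    return result


print(make_probe_source("sys/stat.h", "stat", ["st_mode", "st_size"]))
```

Bitfields would need different handling, since `offsetof` and `sizeof` cannot be applied to them.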

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 7, 2022 21:08 UTC (Thu) by johill (subscriber, #25196) [Link]

Probably easier yet to make it emit debug information and use tools similar to pahole.

Yeah, I've actually done this in the past. And doesn't BTF in the kernel do something like that too?

(I've also got a separate similar tool that uses pycparser, but .. yuck)

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 7, 2022 21:12 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

The difficulty is that you'd want this for any system ABI struct, union, #defined constant, enum etc. Which means that you're now writing a codegen thing that takes a function definition that you want to treat as ABI, and using the C compiler to then discover what the layout and field names of the function parameters are.

Which isn't impossible - it's effectively looking at debug information that the system C compiler can output - but it's a bit messy that the way to discover what the system ABI actually is (at the machine code level) is to compile C snippets, then extract the debug information from the resulting binary. And you still, even in that case, have the pain of determining what constants are valid for a given flags field, because the C way to do that is to #define suitably named constants and hope that the user is clever enough to only OR together valid combinations.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 8, 2022 1:07 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

> Which isn't impossible - it's effectively looking at debug information that the system C compiler can output

It also means if you want your language to be cross-compiling out-of-the-box (like, say, Go), you need C cross compilers for every target you care about.

Speaking of Go, the Go compiler apparently used to call the C compiler (in this case, `clang`) with things like `-Dint=__some_invalid_sentinal` and *parse the error messages* to tell "what things are of type `int` in this snippet". Rinse and repeat for all kinds of built-in things. Apparently there wasn't much communication and the LLVM/Clang team learned of this…ingenuity when an issue was filed about a change to the error message format(s).

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 3, 2022 1:41 UTC (Sun) by developer122 (guest, #152928) [Link] (4 responses)

People don't seem to be paying attention, so I'll re-iterate:

How much can we rely on the code being *correct* after having passed through GCC's code-gen backend?

The one thing that has held up a GCC implementation of Rust the most has been the inability to express certain constraints and concepts in the GCC IR. It may not matter how much checking your language does if it gets mangled by the GCC backend's optimizations.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 10:13 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

And from following the llvm mailing list, I can vouch for the fact that's a (lesser) problem also with llvm/clang.

Unfortunately, because the majority of llvm guys seem to be C programmers, the IR is biased towards C. What else do you expect? But because there's a significant minority of non-C guys there, these things get picked up and sorted fairly quickly.

Who knows what C-specific horrors are lurking in GCC that need to be fixed?

Cheers,
Wol

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 12:21 UTC (Mon) by ballombe (subscriber, #9523) [Link]

Remember that the gcc IR was designed by lisp-addicts.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 14:13 UTC (Mon) by dvdeug (subscriber, #10998) [Link]

GCC has had an Ada compiler for 25 years, and the GCC developer pre-EGCS was employed by the Ada company. It has had a Fortran 77 frontend since forever, and a modern Fortran front end for 15 years; it had Java and CHILL frontends designed by the people who basically created EGCS and have run the compiler since for 20 years.

Back in the day, when the Fortran 77 frontend was first written, it couldn't quite be one pass, even though C was and Fortran 77 could be, due to C assumptions. There are probably still things like that, but any limits on what can be done aren't going to be C-specific. Possibly 20th century procedural/OO language specific, but not just C.

Boucher: rustc_codegen_gcc can now bootstrap rustc

Posted Apr 4, 2022 14:14 UTC (Mon) by antoyo (guest, #141125) [Link]

(Author of rustc_codegen_gcc here.)

There are other differences with the LLVM codegen than that: for instance stuff regarding NaN [1] and those are things that the rustc developers seem to care about: https://github.com/rust-lang/unsafe-code-guidelines/issue...

My guess is that whatever inconsistencies we find between the LLVM and GCC codegen, they will want to address them.

Do you have some examples of concepts that can't be expressed in the GCC IR?

[1] https://github.com/rust-lang/rustc_codegen_gcc/issues/75


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds