Improving bindgen for the kernel
Bindgen is a widely used tool that automatically generates Rust bindings from C headers. The Rust-for-Linux project uses it to create some of the bindings between Rust code and the rest of the kernel. John Baublitz presented at Kangrejos about the improvements that he has made to the tool in order to make the generated bindings easier to use, including improved support for macros, bitfields, and enums.
Baublitz noted that there has been a wishlist of features to add to bindgen for the Rust-for-Linux project for some time. After he ran into some of the same problems in his own projects, he decided to tackle them. There are three main problems that he wants to address: macro expansion, accessing bitfields via raw pointers, and supporting better conversions for Rust enums.
Macro expansion
There is no way that bindgen can usefully support the full richness of C macros. But there is a subset of macros that is useful to have represented in the generated Rust code: macros that are just used as a name for a constant value. Currently, bindgen specially recognizes simple macros and turns them into constants:
#define NAME 3 // becomes pub const NAME: u32 = 3;
However, it's relatively common for a macro to be defined in terms of other macros, which requires expanding the macro to determine its value. Since bindgen doesn't include a reimplementation of the C preprocessor, it can't handle these more complex macros. Baublitz gave the example of cryptsetup, which added the UINT32_C macro around some of its constants and broke the generated Rust bindings.
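As a hypothetical illustration (the macro names here are invented), a header containing both a simple macro and one defined in terms of UINT32_C would, with the plain recognizer, only produce a binding for the simple one; the Clang-based fallback described below can evaluate both and emit ordinary constants:

    // For a hypothetical header containing:
    //     #define MAX_RETRIES 8
    //     #define DEFAULT_TIMEOUT UINT32_C(30)
    // the simple recognizer only handles the first macro; with the Clang
    // fallback, both can be emitted as Rust constants along these lines:
    pub const MAX_RETRIES: u32 = 8;
    pub const DEFAULT_TIMEOUT: u32 = 30;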
He has come up with a way to make it work, however. With his changes, bindgen can now capture the name of the macro, create a temporary C file with a main function that returns the value of the macro, and then use Clang to compile it. Baublitz described this as "a bit hacky", but working. For now, the new code remains opt-in using the --clang-macro-fallback flag, for two reasons. First of all, even small changes to generated bindings can cause problems, such as by introducing duplicated names, so bindgen tries not to change the default behavior. Secondly, the approach does have performance implications, since it involves invoking Clang.
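For projects that want to experiment with the new behavior, opting in from a build script looks roughly like the sketch below. It assumes bindgen 0.70 or later as a build-dependency, a hypothetical wrapper.h header, and that the builder mirrors the CLI flag as a clang_macro_fallback() method; check the documentation of the bindgen version you actually use.

    // build.rs (sketch): generate bindings with the Clang macro fallback enabled.
    fn main() {
        let bindings = bindgen::Builder::default()
            .header("wrapper.h")        // hypothetical header that includes the C API
            .clang_macro_fallback()     // opt in to Clang-based macro evaluation
            .generate()
            .expect("failed to generate bindings");

        bindings
            .write_to_file("src/bindings.rs")
            .expect("failed to write bindings");
    }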
The performance impact isn't that bad, however. Baublitz measured the time taken to evaluate the macros in a consolidated header file containing all of the constants defined in the kernel's headers; it took 3-5 seconds. His initial prototype was significantly worse, taking nearly 35 minutes. The majority of that time was spent doing I/O; switching to Clang's in-memory API made things much faster, but still too slow for practical work. His final design takes advantage of Clang's support for precompiled headers: the headers are compiled once, and then multiple C files are generated in memory to evaluate the different constants.
There is one complication to using precompiled headers. Clang actually only supports using one precompiled header per source file, and silently ignores any others passed. So, Baublitz generates a synthetic header that imports all of the others, and then pre-compiles that. Still, despite the problems, the new option was released in bindgen 0.70 on August 16, and is available to users. In the future, Baublitz would like to add a Clang API that retains macro information when parsing, and use that directly, instead of maintaining this workaround. Miguel Ojeda confirmed that the two of them had spoken to a Clang maintainer, who had approved of that approach. For now, however, this solution works, and makes many more constants available between the two languages.
Bitfield access
Since C does not have Rust's lifetime tracking, programmers often need to refer to structures shared between Rust and C using raw pointers instead of Rust's references. This poses a problem for bitfields. Rust doesn't have a native concept of bitfields, so when a C structure contains a bitfield, bindgen generates accessor functions to access the value correctly. The generated functions take a reference to the structure, since that is the idiomatic way to define methods for a type in Rust. This poses a problem for structures that need to be referred to with raw pointers.
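As a simplified sketch of the shape of the problem (this is not bindgen's actual generated code, and the _raw naming is only illustrative), compare a reference-taking accessor with a pointer-taking helper that never creates an intermediate reference:

    use core::ffi::c_uint;

    #[allow(non_camel_case_types)]
    #[repr(C)]
    pub struct config {
        bits: c_uint, // packed storage standing in for a C bitfield
    }

    impl config {
        // The style of accessor bindgen has always generated: it needs a
        // &config, which callers holding only a *const config must create.
        pub fn enabled(&self) -> c_uint {
            self.bits & 0x1
        }

        // The spirit of the new helpers: take the raw pointer directly, so no
        // intermediate reference (with its validity requirements) is formed.
        pub unsafe fn enabled_raw(this: *const Self) -> c_uint {
            unsafe { (*this).bits & 0x1 }
        }
    }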
Baublitz addressed this problem by adding an additional set of unsafe helper functions to access bitfields using raw pointers. At the time of his talk, the Rust-for-Linux developers had reviewed his code and agreed that it would be helpful, but it still needed a review from the bindgen maintainer.
Luckily that maintainer, Christian Poveda Ruiz, was also in attendance, and agreed to look at the pull request shortly. As of September 24, the new helpers have been merged, and they should be available in the next release.
Enum conversions
The last item Baublitz discussed was improving how bindgen represents enums. The problem in this case has to do with a mismatch between how C and Rust treat invalid enum values. In C, enums are essentially named constants, and it is not undefined behavior to assign a value to an enum variable that has not been defined for the enum type. In Rust, creating an enum with an invalid bit pattern, such as a nonexistent variant, is instant undefined behavior. Because of that, bindgen currently translates C enums to compile-time constants.
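For instance, a C declaration like enum color { RED, GREEN }; currently becomes an integer alias plus freestanding constants, along these lines (a simplified sketch; the real output also carries lint attributes, and the names and integer width are configurable):

    pub type color = core::ffi::c_uint;
    pub const color_RED: color = 0;
    pub const color_GREEN: color = 1;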
It would be more convenient to translate them directly to Rust enums, since the compiler could then perform exhaustiveness checking and so on. Baublitz's solution is to have two types: a raw type that is just an alias for the C enum's storage type (such as an unsigned integer), and another type that is a normal Rust enum. Then bindgen can generate two sets of conversion functions: safe functions that check that the value is valid and can return an error, and unsafe, unconditional functions for when the programmer can guarantee that there won't be any invalid values.
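A sketch of what the two-type scheme could look like for the same hypothetical color enum; the names, error type, and signatures here are illustrative rather than the API bindgen will actually generate:

    // Raw alias matching the C storage type (u32 stands in for the C type here).
    pub type color_raw = u32;

    // A real Rust enum covering only the declared variants.
    #[repr(u32)]
    #[derive(Copy, Clone, Debug, PartialEq, Eq)]
    pub enum Color {
        Red = 0,
        Green = 1,
    }

    impl Color {
        // Safe, checked conversion: rejects any value that does not
        // correspond to a declared variant.
        pub fn from_raw(raw: color_raw) -> Result<Self, color_raw> {
            match raw {
                0 => Ok(Color::Red),
                1 => Ok(Color::Green),
                other => Err(other),
            }
        }

        // Unchecked conversion for callers that can guarantee validity.
        // SAFETY: `raw` must be the discriminant of a declared variant.
        pub unsafe fn from_raw_unchecked(raw: color_raw) -> Self {
            unsafe { core::mem::transmute(raw) }
        }
    }

Call sites that trust the C side can use the unchecked variant; everything else goes through the checked conversion and handles the error.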
Changing the way enums are translated would be a breaking change, so Baublitz has added a command-line flag — --rustified-enum — that lets users select whether they want the old behavior, safe conversions, or unsafe conversions. There were some challenges to making this code work, he added. He needed to change how bindgen does its command-line parsing, and adapt some of the internals to handle both translated and untranslated types.
The updated enum code is still in progress, however, because there are some questions that Baublitz wants feedback on. In particular, he would like to still generate constants for enum values, to make switching between the different enum translations as small a change as possible — but that could lead to problems with namespacing. Gary Guo suggested using associated constant items, but Baublitz explained that bindgen currently doesn't do that in other cases, so it wouldn't be consistent. Also, the constants would clash with the names of the actual variants.
Alice Ryhl had further questions about how the new enum translation interacts with control-flow-integrity (CFI) protections. While there are many CFI techniques, she specifically referred to type-based CFI, where the compiler inserts checks that a call through a function pointer is only made to a function of a compatible type. This cuts down on the amount of unintended control flow an attacker can cause by overwriting function pointers. She was worried specifically about the case where, using the new translation, the Rust compiler sees an FFI function as taking a c_int, while the C side sees it as taking an enum type. These types might have compatible storage layouts, but they have different type names, which would generate different CFI tags. Baublitz was unfamiliar with the details of CFI, and after a short back-and-forth agreed with Ryhl's suggestion to add a wrapper type with the correct name.
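One possible shape of such a wrapper, for the hypothetical color enum used in the earlier sketches (whether the resulting CFI tags actually match depends on the CFI scheme in use and on how each compiler mangles the type):

    use core::ffi::c_int;

    // A newtype that reuses the C enum's name and keeps its storage layout,
    // so the Rust signature mentions a dedicated `color` type rather than a
    // bare c_int.
    #[allow(non_camel_case_types)]
    #[repr(transparent)]
    #[derive(Copy, Clone, PartialEq, Eq)]
    pub struct color(pub c_int);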
Benno Lossin wanted to take the opportunity to explain why the new enum translations would be helpful in the driver he is working on: currently, it has a lot of manual checks that could ideally be simplified by having the tooling do it. Poveda Ruiz clarified that he thinks Baublitz's style would be a sensible default, but that every time the bindgen project changes the defaults, things break and people complain. So while the new style may become an option, it will not be the default.
In all, it seems like users of bindgen should have more options for correct, ergonomic translation of C interfaces, but they will need to be aware of those options in order to take advantage of them. Readers who use bindgen in their own projects might wish to keep an eye out for Baublitz's changes.
Index entries for this article
Kernel: Development tools/bindgen
Conference: Kangrejos/2024
editions?
Posted Oct 9, 2024 17:59 UTC (Wed) by shironeko (subscriber, #159952)

editions?
Posted Oct 9, 2024 21:20 UTC (Wed) by intelfx (guest, #130118)

This was my first thought as I was finishing reading the article.

Sounds like even what was described in the article is already enough changes to warrant bundling them together into some sort of `--new-and-better` flag. And, indeed, such a flag should be generalized to an edition selection flag instead.

enum conversion
Posted Oct 9, 2024 18:12 UTC (Wed) by JoeBuck (subscriber, #2330)

enum conversion
Posted Oct 9, 2024 20:15 UTC (Wed) by WolfWings (subscriber, #56790)

enum conversion
Posted Oct 10, 2024 0:27 UTC (Thu) by JoeBuck (subscriber, #2330)

enum conversion
Posted Oct 10, 2024 3:51 UTC (Thu) by NYKevin (subscriber, #129325)

If we know (or at least reasonably believe) that C will never give us a value outside of the range of possible flag values, then in principle, we could define a newtype around the flag, add functions that validate the flag is within the expected range, and even write a custom Debug impl to automatically expand the flag out into symbolic names (instead of displaying a number). Whether all of that trouble is worth it is a harder question.

Unfortunately, what we can't do is make it compatible with match expressions. It would be nice to be able to match a flag on "is the nth bit set?" with some sort of match pattern, but I don't believe that is practical with Rust's existing match syntax. As such, probably the sensible thing to do is to either work with the flag as-is with typical bit fiddling, or, if it is sufficiently complicated and we don't need to share memory with C on an ongoing basis, construct a proper enum or struct to represent the full state of the interface (which is probably more than just a flag variable).

enum conversion
Posted Oct 10, 2024 10:34 UTC (Thu) by farnz (subscriber, #17727)

Note that there's a neat trick to make match statements work in limited cases (i.e. not "is the nth bit set", but "is only the nth bit set"):

    #[repr(transparent)]
    #[derive(Copy, Clone, Eq, PartialEq)]
    struct Thing(u32);

    impl Thing {
        const ONE: Self = Self(1);
        const USER: Self = Self(0x1000);
        const ALL_USER: Self = Self(0x1001);
        const KERNEL: Self = Self(0x2000);
        …
    }

This is sometimes useful, if the C enum contains all the combinations you want to match on already - but it's not useful if you want a "don't care" bit.

enum conversion
Posted Oct 10, 2024 13:16 UTC (Thu) by gdt (subscriber, #6284)

"If we know (or at least reasonably believe) that C will never give us a value outside of the range of possible flag values..." That assumption is shaky if the source of the value is an I/O register. New or formerly unknown bits can appear in the register. The question is if the language allows that situation to be handled with grace: it's often safe for the program to continue, since the hardware vendor does not want its new hardware to cause shipped drivers to fail.

enum conversion
Posted Oct 10, 2024 16:50 UTC (Thu) by NYKevin (subscriber, #129325)

The other question is how you implement Eq and Hash for such a type. Do you mask out the unused bits before doing the comparison, or do you leave them as-is? Both options feel like they might make sense in some circumstances, so probably the best way is to compare all bits, but also provide a method that explicitly masks out the unexpected bits (and returns a new instance, rather than modifying the existing one in-place, since I assume this object would be Copy anyway).

enum conversion
Posted Oct 10, 2024 17:03 UTC (Thu) by farnz (subscriber, #17727)

A nit - C++ doesn't require infallible constructor functions, either; they are expected to throw exceptions for errors. Many places forbid this (Google definitely does, for example), because you either need everything to provide some level of exception safety (weak is enough - you don't need strong here), or you get into a major mess trying to keep track of which things are initialized, and which things aren't.

enum conversion
Posted Oct 11, 2024 3:25 UTC (Fri) by NYKevin (subscriber, #129325)

* Force everything to use a Factory or Builder interface (you too can write Java EE in C++ if you try hard enough).
* Create the object in an "empty" state and then initialize/populate it as a separate step after it has been constructed (the object's empty state is exposed to client code, and not just in a moved-from temporary variable that nobody is going to look at or use, so now all client code has to defensively account for the possibility of an empty object).

There is also the saner option of making the constructor private and exposing a public static method that constructs instances in one step, with validation, and returns std::optional. But the latter was only introduced in C++17, so it's hardly idiomatic in preexisting codebases.

enum conversion
Posted Oct 9, 2024 23:42 UTC (Wed) by guillemj (subscriber, #49706)

This could perhaps be used in Linux by annotating such enums via a macro when the attribute is supported (so when using clang); then I assume bindgen could probably use that to decide how to best map these enums into Rust?

glibc's approach
Posted Oct 9, 2024 18:41 UTC (Wed) by fw (subscriber, #26023)

I think it takes a fraction of a second per header file, pretty much independently of the number of macro values we need to extract.

glibc's approach
Posted Oct 9, 2024 19:25 UTC (Wed) by comex (subscriber, #71521)

To be fair, bindgen by default tries to generate bindings for all macros, rather than only a subset filtered by regular expression. If you try to group everything into one C file, there will almost certainly be at least one macro that expands to some series of tokens other than a constant expression, causing compilation errors.

But with libclang it should be possible to parse a C file that contains errors, and still extract information about the non-erroneous parts. Or you could run multiple passes (but still fewer than one for every macro).

There would still be the risk of macro expansions 'escaping' from their references and producing unexpected definitions. Suppose bindgen generated a single source file that looked like

    enum { foo = (FOO) };
    enum { bar = (BAR) };
    enum { baz = (BAZ) };

and so on for every macro. What if one of the macros had a definition like this?

    #define FOO ) }; enum { baz = (1000

…That's probably too pathological a case to care about, though.

There's also the option of just biting the bullet and adding a regex filter. Bindgen already supports such filters ('allowlists'); they're just not enabled by default. I don't know whether Linux uses them.

glibc's approach
Posted Oct 10, 2024 8:02 UTC (Thu) by taladar (subscriber, #68407)

glibc's approach
Posted Oct 10, 2024 8:20 UTC (Thu) by fw (subscriber, #26023)

macro are not const
Posted Oct 9, 2024 20:47 UTC (Wed) by ballombe (subscriber, #9523)

Why

    #define NAME 3

instead of

    const int NAME=3;

? I can see several reasons, but all of them are incompatible with

    pub const NAME: u32 = 3;

macro are not const
Posted Oct 9, 2024 21:36 UTC (Wed) by roc (subscriber, #30627)

The C-compatible way to do this is using enums:

    enum { NAME = 3 };

This can be used where C requires a constant expression, such as:

    int foo[NAME + 1];

Also, `const int NAME = 3;` defines the `NAME` variable, so if you #include that from multiple translation units that are then linked together, you get an error at link time due to multiple definitions.

macro are not const
Posted Oct 9, 2024 21:51 UTC (Wed) by Sesse (subscriber, #53779)

macro are not const
Posted Oct 10, 2024 4:55 UTC (Thu) by milesrout (subscriber, #126894)

macro are not const
Posted Oct 10, 2024 8:08 UTC (Thu) by taladar (subscriber, #68407)

Which new features in C23 cause you so much worry anyway? Most of the changes seem quite small judging by the feature overview on Wikipedia.

macro are not const
Posted Oct 10, 2024 10:12 UTC (Thu) by tialaramex (subscriber, #21167)

But I'm not aware of any complaints about Rust. WG21 cares what WG14 standardises because it's awkward to pretend that C++ is somehow a "superset" of a language which has instead evolved in different ways. They would prefer WG14 to take care of all the awkward low-level problems they don't care about, treating it as a junior partner, which is not appreciated. But Rust doesn't care about C at all, and especially not about the ISO document. To the extent Rust interfaces with C, it's via the implementations, for ABI reasons. The ISO standard may say A and B are different, but if in practice they're the same on every real platform, Rust needs to know that.

macro are not const
Posted Oct 9, 2024 23:19 UTC (Wed) by neggles (subscriber, #153254)

C has semantic differences between an inserted literal (which a macro definition becomes, since `#define NAME 3` is effectively shorthand for "replace all instances of `NAME` with `3`") and a constant variable, especially when the `const int` is declared in a header file. I'm not sufficiently familiar with the spec to know what the specific differences actually are, so I'll defer to others on that.

Rust lacks an exact equivalent to `#DEFINE`, but semantically treats a `const u32 NAME` almost the same as C treats `#define NAME`; it's a constant value (albeit with an explicit type) and can thus be folded with other constants in an expression at compile time, etc.

macro are not const
Posted Oct 10, 2024 3:37 UTC (Thu) by NYKevin (subscriber, #129325)

Reason for clang behavior?
Posted Oct 10, 2024 7:44 UTC (Thu) by taladar (subscriber, #68407)

Is there a good reason for that or did someone just decide that it wouldn't matter to ignore the others because it would "only" impact performance? I would have expected a well-written tool to give an error in this situation.