Improving bindgen for the kernel
Bindgen is a widely used tool that automatically generates Rust bindings from C headers. The Rust-for-Linux project uses it to create some of the bindings between Rust code and the rest of the kernel. John Baublitz presented at Kangrejos about the improvements that he has made to the tool in order to make the generated bindings easier to use, including improved support for macros, bitfields, and enums.
Baublitz noted that there has been a wishlist of features to add to bindgen for the Rust-for-Linux project for some time. After he ran into some of the same problems in his own projects, he decided to tackle them. There are three main problems that he wants to address: macro expansion, accessing bitfields via raw pointers, and supporting better conversions for Rust enums.
Macro expansion
There is no way that bindgen can usefully support the full richness of C macros. But there is a subset of macros that is useful to have represented in the generated Rust code: macros that are just used as a name for a constant value. Currently, bindgen specially recognizes simple macros and turns them into constants:
#define NAME 3 // becomes pub const NAME: u32 = 3;
However, it's relatively common for a macro to be defined in terms of other macros, which requires expanding the macro to determine its value. Since bindgen doesn't include a reimplementation of the C preprocessor, it can't handle these more complex macros. Baublitz gave the example of cryptsetup, which added the UINT32_C macro around some of its constants and broke the generated Rust bindings.
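As a hypothetical illustration (the macro names here are invented), a header containing both a simple macro and one defined in terms of UINT32_C would, with the plain recognizer, only produce a binding for the simple one; the Clang-based fallback described below can evaluate both and emit ordinary constants:

    // For a hypothetical header containing:
    //     #define MAX_RETRIES 8
    //     #define DEFAULT_TIMEOUT UINT32_C(30)
    // the simple recognizer only handles the first macro; with the Clang
    // fallback, both can be emitted as Rust constants along these lines:
    pub const MAX_RETRIES: u32 = 8;
    pub const DEFAULT_TIMEOUT: u32 = 30;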
He has come up with a way to make it work, however. With his changes, bindgen can now capture the name of the macro, create a temporary C file with a main function that returns the value of the macro, and then use Clang to compile it. Baublitz described this as "a bit hacky", but working. For now, the new code remains opt-in using the --clang-macro-fallback flag, for two reasons. First of all, even small changes to generated bindings can cause problems, such as by introducing duplicated names, so bindgen tries not to change the default behavior. Secondly, the approach does have performance implications, since it involves invoking Clang.
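For projects that want to experiment with the new behavior, opting in from a build script looks roughly like the sketch below. It assumes bindgen 0.70 or later as a build-dependency, a hypothetical wrapper.h header, and that the builder mirrors the CLI flag as a clang_macro_fallback() method; check the documentation of the bindgen version you actually use.

    // build.rs (sketch): generate bindings with the Clang macro fallback enabled.
    fn main() {
        let bindings = bindgen::Builder::default()
            .header("wrapper.h")        // hypothetical header that includes the C API
            .clang_macro_fallback()     // opt in to Clang-based macro evaluation
            .generate()
            .expect("failed to generate bindings");

        bindings
            .write_to_file("src/bindings.rs")
            .expect("failed to write bindings");
    }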
The performance impact isn't that bad, however. Baublitz measured the time taken to evaluate the macros in a consolidated header file containing all of the constants defined in the kernel's headers; it took 3-5 seconds. His initial prototype was significantly worse, taking nearly 35 minutes. The majority of that time was spent doing I/O; switching to Clang's in-memory API made things much faster, but still too slow for practical work. His final design takes advantage of Clang's support for precompiled headers: the headers are compiled once, and then multiple C files are generated in memory to evaluate the different constants.
There is one complication to using precompiled headers. Clang actually only supports using one precompiled header per source file, and silently ignores any others passed. So, Baublitz generates a synthetic header that imports all of the others, and then pre-compiles that. Still, despite the problems, the new option was released in bindgen 0.70 on August 16, and is available to users. In the future, Baublitz would like to add a Clang API that retains macro information when parsing, and use that directly, instead of maintaining this workaround. Miguel Ojeda confirmed that the two of them had spoken to a Clang maintainer, who had approved of that approach. For now, however, this solution works, and makes many more constants available between the two languages.
Bitfield access
Since C does not have Rust's lifetime tracking, programmers often need to refer to structures shared between Rust and C using raw pointers instead of Rust's references. This poses a problem for bitfields. Rust doesn't have a native concept of bitfields, so when a C structure contains a bitfield, bindgen generates accessor functions to access the value correctly. The generated functions take a reference to the structure, since that is the idiomatic way to define methods for a type in Rust. This poses a problem for structures that need to be referred to with raw pointers.
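As a simplified sketch of the shape of the problem (this is not bindgen's actual generated code, and the _raw naming is only illustrative), compare a reference-taking accessor with a pointer-taking helper that never creates an intermediate reference:

    use core::ffi::c_uint;

    #[allow(non_camel_case_types)]
    #[repr(C)]
    pub struct config {
        bits: c_uint, // packed storage standing in for a C bitfield
    }

    impl config {
        // The style of accessor bindgen has always generated: it needs a
        // &config, which callers holding only a *const config must create.
        pub fn enabled(&self) -> c_uint {
            self.bits & 0x1
        }

        // The spirit of the new helpers: take the raw pointer directly, so no
        // intermediate reference (with its validity requirements) is formed.
        pub unsafe fn enabled_raw(this: *const Self) -> c_uint {
            unsafe { (*this).bits & 0x1 }
        }
    }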
Baublitz addressed this problem by adding an additional set of unsafe helper functions to access bitfields using raw pointers. At the time of his talk, the Rust-for-Linux developers had reviewed his code and agreed that it would be helpful, but it still needed a review from the bindgen maintainer.
Luckily that maintainer, Christian Poveda Ruiz, was also in attendance, and agreed to look at the pull request shortly. As of September 24, the new helpers have been merged, and they should be available in the next release.
Enum conversions
The last item Baublitz discussed was improving how bindgen represents enums. The problem in this case has to do with a mismatch between how C and Rust treat invalid enum values. In C, enums are essentially named constants, and it is not undefined behavior to assign a value to an enum variable that has not been defined for the enum type. In Rust, creating an enum with an invalid bit pattern, such as a nonexistent variant, is instant undefined behavior. Because of that, bindgen currently translates C enums to compile-time constants.
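For instance, a C declaration like enum color { RED, GREEN }; currently becomes an integer alias plus freestanding constants, along these lines (a simplified sketch; the real output also carries lint attributes, and the names and integer width are configurable):

    pub type color = core::ffi::c_uint;
    pub const color_RED: color = 0;
    pub const color_GREEN: color = 1;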
It would be more convenient to translate them directly to Rust enums, since the compiler could then perform exhaustiveness checking and so on. Baublitz's solution is to have two types: a raw type that is just an alias for the C enum's storage type (such as an unsigned integer), and another type that is a normal Rust enum. Then bindgen can generate two sets of conversion functions: safe functions that check that the value is valid and can return an error, and unsafe, unconditional functions for when the programmer can guarantee that there won't be any invalid values.
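A sketch of what the two-type scheme could look like for the same hypothetical color enum; the names, error type, and signatures here are illustrative rather than the API bindgen will actually generate:

    // Raw alias matching the C storage type (u32 stands in for the C type here).
    pub type color_raw = u32;

    // A real Rust enum covering only the declared variants.
    #[repr(u32)]
    #[derive(Copy, Clone, Debug, PartialEq, Eq)]
    pub enum Color {
        Red = 0,
        Green = 1,
    }

    impl Color {
        // Safe, checked conversion: rejects any value that does not
        // correspond to a declared variant.
        pub fn from_raw(raw: color_raw) -> Result<Self, color_raw> {
            match raw {
                0 => Ok(Color::Red),
                1 => Ok(Color::Green),
                other => Err(other),
            }
        }

        // Unchecked conversion for callers that can guarantee validity.
        // SAFETY: `raw` must be the discriminant of a declared variant.
        pub unsafe fn from_raw_unchecked(raw: color_raw) -> Self {
            unsafe { core::mem::transmute(raw) }
        }
    }

Call sites that trust the C side can use the unchecked variant; everything else goes through the checked conversion and handles the error.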
Changing the way enums are translated would be a breaking change, so Baublitz has added a command-line flag — --rustified-enum — that lets users select whether they want the old behavior, safe conversions, or unsafe conversions. There were some challenges to making this code work, he added. He needed to change how bindgen does its command-line parsing, and adapt some of the internals to handle both translated and untranslated types.
The updated enum code is still in progress, however, because there are some questions that Baublitz wants feedback on. In particular, he would like to still generate constants for enum values, to make switching between the different enum translations as small a change as possible — but that could lead to problems with namespacing. Gary Guo suggested using associated constant items, but Baublitz explained that bindgen currently doesn't do that in other cases, so it wouldn't be consistent. Also, the constants would clash with the names of the actual variants.
Alice Ryhl had further questions about how the new enum translation interacts with control-flow-integrity (CFI) protections. While there are many CFI techniques, she specifically referred to type-based CFI, where the compiler inserts checks that a call through a function pointer is only made to a function of a compatible type. This cuts down on the amount of unintended control flow an attacker can cause by overwriting function pointers. She was worried specifically about the case where, using the new translation, the Rust compiler sees an FFI function as taking a c_int, while the C side sees it as taking an enum type. These types might have compatible storage layouts, but they have different type names, which would generate different CFI tags. Baublitz was unfamiliar with the details of CFI, and after a short back-and-forth agreed with Ryhl's suggestion to add a wrapper type with the correct name.
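One possible shape of such a wrapper, for the hypothetical color enum used in the earlier sketches (whether the resulting CFI tags actually match depends on the CFI scheme in use and on how each compiler mangles the type):

    use core::ffi::c_int;

    // A newtype that reuses the C enum's name and keeps its storage layout,
    // so the Rust signature mentions a dedicated `color` type rather than a
    // bare c_int.
    #[allow(non_camel_case_types)]
    #[repr(transparent)]
    #[derive(Copy, Clone, PartialEq, Eq)]
    pub struct color(pub c_int);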
Benno Lossin wanted to take the opportunity to explain why the new enum translations would be helpful in the driver he is working on: currently, it has a lot of manual checks that could ideally be simplified by having the tooling do it. Poveda Ruiz clarified that he thinks Baublitz's style would be a sensible default, but that every time the bindgen project changes the defaults, things break and people complain. So while the new style may become an option, it will not be the default.
In all, it seems like users of bindgen should have more options for correct, ergonomic translation of C interfaces, but they will need to be aware of those options in order to take advantage of them. Readers who use bindgen in their own projects might wish to keep an eye out for Baublitz's changes.
Index entries for this article
Kernel: Development tools/bindgen
Conference: Kangrejos/2024
editions?
Posted Oct 9, 2024 17:59 UTC (Wed) by shironeko (subscriber, #159952)

editions?
Posted Oct 9, 2024 21:20 UTC (Wed) by intelfx (guest, #130118)

This was my first thought as I was finishing reading the article.

Sounds like even what was described in the article is already enough changes to warrant bundling them together into some sort of `--new-and-better` flag. And, indeed, such a flag should be generalized to an edition selection flag instead.

enum conversion
Posted Oct 9, 2024 18:12 UTC (Wed) by JoeBuck (subscriber, #2330)

enum conversion
Posted Oct 9, 2024 20:15 UTC (Wed) by WolfWings (subscriber, #56790)

enum conversion
Posted Oct 10, 2024 0:27 UTC (Thu) by JoeBuck (subscriber, #2330)

enum conversion
Posted Oct 10, 2024 3:51 UTC (Thu) by NYKevin (subscriber, #129325)

If we know (or at least reasonably believe) that C will never give us a value outside of the range of possible flag values, then in principle, we could define a newtype around the flag, add functions that validate the flag is within the expected range, and even write a custom Debug impl to automatically expand the flag out into symbolic names (instead of displaying a number). Whether all of that trouble is worth it is a harder question.

Unfortunately, what we can't do is make it compatible with match expressions. It would be nice to be able to match a flag on "is the nth bit set?" with some sort of match pattern, but I don't believe that is practical with Rust's existing match syntax. As such, probably the sensible thing to do is to either work with the flag as-is with typical bit fiddling, or, if it is sufficiently complicated and we don't need to share memory with C on an ongoing basis, construct a proper enum or struct to represent the full state of the interface (which is probably more than just a flag variable).

enum conversion
Posted Oct 10, 2024 10:34 UTC (Thu) by farnz (subscriber, #17727)

Note that there's a neat trick to make match statements work in limited cases (i.e. not "is the nth bit set", but "is only the nth bit set"):

    #[repr(transparent)]
    #[derive(Copy, Clone, Eq, PartialEq)]
    struct Thing(u32);

    impl Thing {
        const ONE: Self = Self(1);
        const USER: Self = Self(0x1000);
        const ALL_USER: Self = Self(0x1001);
        const KERNEL: Self = Self(0x2000);
        …
    }

This is sometimes useful, if the C enum contains all the combinations you want to match on already - but it's not useful if you want a "don't care" bit.

enum conversion
Posted Oct 10, 2024 13:16 UTC (Thu) by gdt (subscriber, #6284)

"If we know (or at least reasonably believe) that C will never give us a value outside of the range of possible flag values..." That assumption is shaky if the source of the value is an I/O register. New or formerly unknown bits can appear in the register. The question is if the language allows that situation to be handled with grace: it's often safe for the program to continue, since the hardware vendor does not want its new hardware to cause shipped drivers to fail.

enum conversion
Posted Oct 10, 2024 16:50 UTC (Thu) by NYKevin (subscriber, #129325)

The other question is how you implement Eq and Hash for such a type. Do you mask out the unused bits before doing the comparison, or do you leave them as-is? Both options feel like they might make sense in some circumstances, so probably the best way is to compare all bits, but also provide a method that explicitly masks out the unexpected bits (and returns a new instance, rather than modifying the existing one in-place, since I assume this object would be Copy anyway).

enum conversion
Posted Oct 10, 2024 17:03 UTC (Thu) by farnz (subscriber, #17727)

A nit - C++ doesn't require infallible constructor functions, either; they are expected to throw exceptions for errors. Many places forbid this (Google definitely does, for example), because you either need everything to provide some level of exception safety (weak is enough - you don't need strong here), or you get into a major mess trying to keep track of which things are initialized, and which things aren't.

enum conversion
Posted Oct 11, 2024 3:25 UTC (Fri) by NYKevin (subscriber, #129325)

* Force everything to use a Factory or Builder interface (you too can write Java EE in C++ if you try hard enough).
* Create the object in an "empty" state and then initialize/populate it as a separate step after it has been constructed (the object's empty state is exposed to client code, and not just in a moved-from temporary variable that nobody is going to look at or use, so now all client code has to defensively account for the possibility of an empty object).

There is also the saner option of making the constructor private and exposing a public static method that constructs instances in one step, with validation, and returns std::optional. But the latter was only introduced in C++17, so it's hardly idiomatic in preexisting codebases.

enum conversion
Posted Oct 9, 2024 23:42 UTC (Wed) by guillemj (subscriber, #49706)

This could perhaps be used in Linux by annotating such enums via a macro when the attribute is supported (so when using clang); then I assume bindgen could probably use that to decide how to best map these enums into Rust?

glibc's approach
Posted Oct 9, 2024 18:41 UTC (Wed) by fw (subscriber, #26023)

I think it takes a fraction of a second per header file, pretty much independently of the number of macro values we need to extract.

glibc's approach
Posted Oct 9, 2024 19:25 UTC (Wed) by comex (subscriber, #71521)

To be fair, bindgen by default tries to generate bindings for all macros, rather than only a subset filtered by regular expression. If you try to group everything into one C file, there will almost certainly be at least one macro that expands to some series of tokens other than a constant expression, causing compilation errors.

But with libclang it should be possible to parse a C file that contains errors, and still extract information about the non-erroneous parts. Or you could run multiple passes (but still fewer than one for every macro).

There would still be the risk of macro expansions 'escaping' from their references and producing unexpected definitions. Suppose bindgen generated a single source file that looked like

    enum { foo = (FOO) };
    enum { bar = (BAR) };
    enum { baz = (BAZ) };

and so on for every macro. What if one of the macros had a definition like this?

    #define FOO ) }; enum { baz = (1000

…That's probably too pathological a case to care about, though.

There's also the option of just biting the bullet and adding a regex filter. Bindgen already supports such filters ('allowlists'); they're just not enabled by default. I don't know whether Linux uses them.

glibc's approach
Posted Oct 10, 2024 8:02 UTC (Thu) by taladar (subscriber, #68407)

glibc's approach
Posted Oct 10, 2024 8:20 UTC (Thu) by fw (subscriber, #26023)

macro are not const
Posted Oct 9, 2024 20:47 UTC (Wed) by ballombe (subscriber, #9523)

Why

    #define NAME 3

instead of

    const int NAME=3;

? I can see several reasons, but all of them are incompatible with

    pub const NAME: u32 = 3;

macro are not const
Posted Oct 9, 2024 21:36 UTC (Wed) by roc (subscriber, #30627)

The C-compatible way to do this is using enums:

    enum { NAME = 3 };

This can be used where C requires a constant expression, such as:

    int foo[NAME + 1];

Also, `const int NAME = 3;` defines the `NAME` variable, so if you #include that from multiple translation units that are then linked together, you get an error at link time due to multiple definitions.

macro are not const
Posted Oct 9, 2024 21:51 UTC (Wed) by Sesse (subscriber, #53779)

macro are not const
Posted Oct 10, 2024 4:55 UTC (Thu) by milesrout (subscriber, #126894)

macro are not const
Posted Oct 10, 2024 8:08 UTC (Thu) by taladar (subscriber, #68407)

Which new features in C23 cause you so much worry anyway? Most of the changes seem quite small judging by the feature overview on Wikipedia.

macro are not const
Posted Oct 10, 2024 10:12 UTC (Thu) by tialaramex (subscriber, #21167)

But I'm not aware of any complaints about Rust. WG21 cares what WG14 standardises because it's awkward to pretend that C++ is somehow a "superset" of a language which has instead evolved in different ways. They would prefer WG14 to take care of all the awkward low-level problems they don't care about, treating it as a junior partner, which is not appreciated. But Rust doesn't care about C at all, and especially not about the ISO document. To the extent Rust interfaces with C, it's via the implementations, for ABI reasons. The ISO standard may say A and B are different, but if in practice they're the same on every real platform, Rust needs to know that.

macro are not const
Posted Oct 9, 2024 23:19 UTC (Wed) by neggles (subscriber, #153254)

C has semantic differences between an inserted literal (which a macro definition becomes, since `#define NAME 3` is effectively shorthand for "replace all instances of `NAME` with `3`") and a constant variable, especially when the `const int` is declared in a header file. I'm not sufficiently familiar with the spec to know what the specific differences actually are, so I'll defer to others on that.

Rust lacks an exact equivalent to `#DEFINE`, but semantically treats a `const u32 NAME` almost the same as C treats `#define NAME`; it's a constant value (albeit with an explicit type) and can thus be folded with other constants in an expression at compile time, etc.

macro are not const
Posted Oct 10, 2024 3:37 UTC (Thu) by NYKevin (subscriber, #129325)

Reason for clang behavior?
Posted Oct 10, 2024 7:44 UTC (Thu) by taladar (subscriber, #68407)

Is there a good reason for that or did someone just decide that it wouldn't matter to ignore the others because it would "only" impact performance? I would have expected a well-written tool to give an error in this situation.