Efficient Rust tracepoints

By Daroc Alden
October 8, 2024

Kangrejos 2024

Alice Ryhl has been working to enable tracepoints — which are widely used throughout the kernel — to be seamlessly placed in Rust code as well. She spoke about her approach at Kangrejos. Her patch set enables efficient use of static tracepoints, but supporting dynamic tracepoints will take some additional effort.

Ryhl described tracepoints as a kind of logging that records information from specific places in the kernel when they are reached. She gave binder_ioctl() as an example of a trace event in her slides; that tracepoint is triggered every time an ioctl() for Android's binderfs filesystem occurs. A developer trying to debug kernel problems can look at the log of tracepoints hit by a driver to figure out what's happening.

In C, the programmer places a tracepoint with a line that looks like a normal function call. Most of the time, this call does nothing. When the tracepoint is needed, a programmer can attach an arbitrary function to it at run time, which will then be called each time the tracepoint is hit. Since most tracepoints are disabled most of the time, Linux uses static keys (patching the call into the code at run time) to make this efficient.
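The following is a user-space model of that behavior, not kernel code. `Tracepoint`, `attach()`, `detach()`, and `hit()` are hypothetical names, and an atomic flag stands in for the static key. A real static key goes further: it patches the instruction stream so that even the flag load disappears from the disabled path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

/// Hypothetical user-space model of a single tracepoint.
pub struct Tracepoint {
    /// Stand-in for the static key: false means "disabled".
    enabled: AtomicBool,
    /// The function attached at run time, if any.
    probe: RwLock<Option<fn(u32)>>,
}

impl Tracepoint {
    pub const fn new() -> Self {
        Tracepoint {
            enabled: AtomicBool::new(false),
            probe: RwLock::new(None),
        }
    }

    /// Attach `f`; it will be called on every subsequent hit.
    pub fn attach(&self, f: fn(u32)) {
        *self.probe.write().unwrap() = Some(f);
        self.enabled.store(true, Ordering::Release);
    }

    /// Disable the tracepoint and drop the attached function.
    pub fn detach(&self) {
        self.enabled.store(false, Ordering::Release);
        *self.probe.write().unwrap() = None;
    }

    /// The "function call" placed in the traced code.
    #[inline(always)]
    pub fn hit(&self, value: u32) {
        // Disabled (the common case): one load and a branch that is not taken.
        if self.enabled.load(Ordering::Acquire) {
            if let Some(f) = *self.probe.read().unwrap() {
                f(value);
            }
        }
    }
}
```

Because `new()` is a `const fn`, a tracepoint can live in a `static`, much as C tracepoints are global objects.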

Production-ready Rust drivers must be able to support the same standard of debugging, and therefore be able to place tracepoints, Ryhl said. That could be done today by having Rust code call existing C tracepoints through wrapper functions, but this loses one of the most important benefits of tracepoints: their low overhead. Ideally, hitting a disabled tracepoint from Rust should have the same performance cost as C (i.e., almost none).

Her solution is a small Rust macro that creates the necessary static-key machinery on the Rust side. Rust code uses declare_trace!() to refer to a tracepoint defined in C; the macro creates an inline unsafe function on the Rust side that can be used to trigger the tracepoint. The generated function uses inline assembly to define a place for the static-key machinery to patch in a call to the C tracepoint when necessary.
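A rough sketch of the general shape of such a macro, assuming a `macro_rules!` design. The actual `declare_trace!()` syntax and expansion may differ; in particular, it uses the static-key inline assembly where this sketch checks an atomic flag. The key static, the `HITS` counter, and the `__trace_binder_ioctl()` stand-in for the C side are all hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical, simplified `declare_trace!`: generates an always-inlined
/// unsafe wrapper that only calls into the "C side" when the key is enabled.
macro_rules! declare_trace {
    (fn $name:ident($($arg:ident : $ty:ty),*) => $emit:ident, key = $key:ident;) => {
        #[inline(always)]
        unsafe fn $name($($arg: $ty),*) {
            // Disabled path: one relaxed load and a not-taken branch.
            // (The kernel version uses static-key inline assembly instead.)
            if $key.load(::std::sync::atomic::Ordering::Relaxed) {
                $emit($($arg),*);
            }
        }
    };
}

// Stand-in for the static key that the kernel would patch at run time.
static BINDER_IOCTL_KEY: AtomicBool = AtomicBool::new(false);
// Counts how many events the stand-in "C side" recorded.
static HITS: AtomicU64 = AtomicU64::new(0);

// Stand-in for the C function that actually records the event.
fn __trace_binder_ioctl(cmd: u32, arg: u64) {
    let _ = (cmd, arg);
    HITS.fetch_add(1, Ordering::Relaxed);
}

declare_trace! {
    fn binder_ioctl(cmd: u32, arg: u64) => __trace_binder_ioctl, key = BINDER_IOCTL_KEY;
}
```

Callers write `unsafe { binder_ioctl(cmd, arg) }` at the trace site; the wrapper is inlined there, so only the cheap check sits in the hot path.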

Ryhl took this approach because it keeps the Rust side to the bare minimum, leaving most of the tracepoint implementation in unchanged C, she said. The static-key functionality has to be implemented on the Rust side for performance, but this way she does not have to reimplement any of the functionality for defining tracepoints, and can instead just link to the C code.

There is a catch, though. Static keys in C also use inline assembly to create a target for the patched-in jump. In her first attempt, Ryhl copied the inline assembly to use on the Rust side. This was rejected for introducing code duplication, which is usually frowned upon in the kernel.

To solve that, Ryhl took the "horrible" approach of using the C preprocessor to generate a Rust source file that is then included by the macro. The original C sources contain a comment marking where the shared inline assembly is located, and the build system uses sed to extract it and put it in the generated Rust file. This avoids any code duplication, at the cost of complicating the build.
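The extraction step is roughly the job of a `sed` range expression. As an illustration only, here is the same idea in Rust: return the lines between two marker comments. The marker strings and the `extract_between()` name are hypothetical; the article does not say what the real comment looks like.

```rust
/// Return the lines strictly between a line equal to `begin` and the next
/// line equal to `end` (both compared after trimming whitespace), or None
/// if either marker is missing.
fn extract_between<'a>(source: &'a str, begin: &str, end: &str) -> Option<Vec<&'a str>> {
    let mut lines = source.lines();
    // Skip everything up to and including the begin marker.
    lines.by_ref().find(|line| line.trim() == begin)?;
    let mut body = Vec::new();
    for line in lines {
        if line.trim() == end {
            return Some(body);
        }
        body.push(line);
    }
    None // begin marker found, but no end marker
}
```

Failing loudly when a marker is missing matters here: a silently empty extraction would produce a Rust file that builds but has no patch site.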

The attendees were a bit surprised at the presented solution. Paul McKenney gave some background information on the reason that kernel developers care so much about avoiding code duplication: in addition to the normal reasons of code quality, it makes rebasing changes much easier. The kernel deals with a lot of patches flying around, and any code that exists in two places can easily get out of sync. Ryhl agreed, saying that there are good reasons not to duplicate code. It made her life difficult, she joked, but she sees why the static-key maintainer insisted.

Gary Guo said that it is probably not a good idea to use the C preprocessor to generate Rust code. Ryhl replied that it might be possible to generate both the C and Rust from a common format, if that would be preferable. An alternative would be to teach Rust to read C header files itself, but that is much more work. A few other alternatives were floated as well. McKenney was of the opinion that any approach was acceptable, as long as it actually gets documented, because otherwise all this unusual code-sharing is going to confuse future programmers.

Dynamic tracing

Richard Weinberger asked about dynamic tracepoints (Kprobes) — which allow the user to attach a tracepoint anywhere in the code using BPF. Does this work with Rust? Ryhl was unfamiliar with the mechanism. Andreas Hindborg suggested that addressing static tracepoints first, and then looking into dynamic tracepoints later would make sense. Weinberger did think that support for dynamic tracepoints would be needed eventually, because people want their debug tooling to work throughout the whole kernel.

Ryhl thought that support for dynamic tracing would need to be added to the Rust compiler, based on Hindborg's description of the kernel's function tracing code. Static tracepoints would still be needed, however, since they are also used as a way for vendors to hook into the functions of a driver in some cases. (Some Android hardware vendors rely on tracepoints to react to events in the kernel, for example.) Boqun Feng agreed, saying that both kinds of tracepoint were needed for different use cases. Hindborg pointed out that function tracing also interacts strangely with function inlining — finding the location of the hook after inlining depends on having BTF information available. So Rust will need native BTF support before that is possible.

Hindborg was worried that having a solution which requires defining the tracepoint in C as well will make it harder to have a pure-Rust solution in the future. Ryhl responded that, although she has so far only tackled the declaration of tracepoints in Rust, someone could in the future add the definition of tracepoints as well.

Despite the discussion of future work, the attendees had no problems with Ryhl's current design. It seems likely that static tracepoints will soon be usable with Rust code in the kernel, which will enable vendor integration with drivers written in Rust. Dynamic tracepoints and other debugging features will take some more time.

Index entries for this article
Kernel	Development tools/Kernel tracing
Kernel	Development tools/Rust
Kernel	Releases/6.13
Conference	Kangrejos/2024

Uhh what?

Posted Oct 8, 2024 14:29 UTC (Tue) by atnot (subscriber, #124910)

> This was rejected for introducing code duplication

I was curious and clicked on through to see the more detailed reasoning, since in my recollection, the asm involved is just a nop instruction with a linker annotation. I found none of that, only:

> I really think that whoever created rust was an esoteric language freak. Hideous crap

Which I found a bit confusing because I have recently been assured that this sort of thing was merely a single isolated incident. It does not seem conducive to a productive discussion either way.

Not up to me to decide what code a maintainer accepts of course. But if anyone has a more substantive reason why a nop instruction is an undue burden on the whole kernel, more so than the described horrible sed and preprocessor hacks, I'd love to know.

Uhh what?

Posted Oct 8, 2024 15:39 UTC (Tue) by jgg (subscriber, #55211)

Each arch has its own asm, so it is not just one asm, but 21 copies. Then there is the overall slippery-slope principle: if rust can duplicate C code because it has technical issues with consuming it directly, then where does it end?

Presumably the horrible sed will work on all arches and scale as we add more arches. But somehow I think this is just the tip of the iceberg on these issues and the sed script will have to evolve into something much more powerful. We have many little tricky inline assembly things and wrapping them in function calls is not the right thing to do; they are tricky inline assembly for a good reason.

Uhh what?

Posted Oct 8, 2024 17:55 UTC (Tue) by NYKevin (subscriber, #129325)

I have to wonder if Rust (or maybe even C) could somehow grow a compiler intrinsic to emit "exactly N bytes of NOPs that won't be optimized out, or else a compiler error if no single-byte NOP is available and N can't be made from the available multi-byte NOPs." Is that good enough for tracepoint support, or do you need more specific guarantees than that?

The other question, I suppose, is whether LLVM IR and/or GCC's IR have support for such a thing, or if it would need to be invented first.

Uhh what?

Posted Oct 8, 2024 18:30 UTC (Tue) by daroc (editor, #160859)

What exactly is required for a tracepoint depends on architecture — and in particular, the details of how the instruction decoder synchronizes with memory — but for x86_64, just a sequence of nops is not enough, it needs to be one single nop of the right size (and aligned, if I recall correctly). Otherwise, you can end up executing part of the jump target as though it was an instruction, if the instruction pointer is inside the sequence of nops when the replacement happens.

Luckily, x86_64 has nops of every size up to ... 12, I think it was? So in practice, you just need to make sure you choose the right size nop.
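To make the size constraint concrete: on x86-64 a near jump with a 32-bit displacement (opcode `E9`) is five bytes, so the disabled site should hold one five-byte NOP that can be overwritten as a single instruction. The sketch below uses the single-instruction NOP encodings that Intel's manual recommends for lengths 1 through 9. The function names are hypothetical, and it ignores the cross-CPU synchronization that real kernel text patching has to handle.

```rust
/// Recommended single-instruction NOP encodings for x86-64,
/// indexed by length minus one (1 to 9 bytes).
const NOPS: [&[u8]; 9] = [
    &[0x90],
    &[0x66, 0x90],
    &[0x0F, 0x1F, 0x00],
    &[0x0F, 0x1F, 0x40, 0x00],
    &[0x0F, 0x1F, 0x44, 0x00, 0x00],
    &[0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00],
    &[0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00],
    &[0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    &[0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
];

/// One NOP instruction of exactly `len` bytes, if a recommended form exists.
fn single_nop(len: usize) -> Option<&'static [u8]> {
    if (1..=NOPS.len()).contains(&len) {
        Some(NOPS[len - 1])
    } else {
        None
    }
}

/// Encode `jmp rel32` placed at address `site` and jumping to `target`.
/// The displacement is relative to the end of the 5-byte instruction.
fn jmp_rel32(site: u64, target: u64) -> [u8; 5] {
    let rel = target.wrapping_sub(site.wrapping_add(5)) as i64;
    let rel32 = i32::try_from(rel).expect("jump target out of rel32 range");
    let d = rel32.to_le_bytes();
    [0xE9, d[0], d[1], d[2], d[3]]
}

/// Replace the 5-byte NOP at `offset` in `text` with a jump to `target`.
/// `site_addr` is the address that `text[offset]` will have when executed.
fn patch_site(text: &mut [u8], offset: usize, site_addr: u64, target: u64) {
    let nop = single_nop(5).unwrap();
    assert_eq!(&text[offset..offset + 5], nop, "site does not hold a 5-byte NOP");
    text[offset..offset + 5].copy_from_slice(&jmp_rel32(site_addr, target));
}
```

The NOP and the jump are the same length, so after patching there is never a partial instruction left behind for the CPU to decode, which is the hazard described above.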

Compiler-generated NOPs

Posted Oct 8, 2024 20:50 UTC (Tue) by riking (subscriber, #95706)

This does actually exist already: https://doc.rust-lang.org/beta/unstable-book/compiler-fla... places NOPs at or before the start of a function. The problem is putting them in the middle of a function.

Compiler-generated NOPs

Posted Oct 31, 2024 13:43 UTC (Thu) by sammythesnake (guest, #17693)

How about an inlined function that starts with the required NOPs and then does nothing? Is that something available in Rust, and would it do the job...?

Uhh what?

Posted Oct 9, 2024 12:26 UTC (Wed) by intelfx (subscriber, #130118)

>> I really think that whoever created rust was an esoteric language freak. Hideous crap

Sad. This is disappointing (yet not really unexpected).

Uhh what?

Posted Oct 10, 2024 5:00 UTC (Thu) by milesrout (subscriber, #126894)

Speak for yourself. It doesn't disappoint me. There is no rule that everyone has to like everything. People have said the same sort of thing about every programming language. If you are going to be upset that some people don't like your pet language, you are going to be perpetually upset.

Uhh what?

Posted Oct 10, 2024 8:36 UTC (Thu) by fishface60 (subscriber, #88700)

> There is no rule that everyone has to like everything.

This isn't disappointment that someone doesn't like something you do, it's disappointment that they have forgotten all their manners.

Uhh what?

Posted Oct 31, 2024 13:45 UTC (Thu) by sammythesnake (guest, #17693)

Paddington would suggest that a Long Hard Stare might provide time to remember them :-P

Uhh what?

Posted Oct 10, 2024 5:34 UTC (Thu) by mb (subscriber, #50428)

>> Urgh, more unreadable gibberish :-(

>> I really think that whoever created rust was an esoteric language freak. Hideous crap

>> the creator of Rust must've been an esoteric language freak and must've wanted to make this unreadable on purpose

Well, thanks for giving me yet another confirmation that it was correct for me to leave the kernel development community behind.
I'm not interested in these kinds of nontechnical, nonsense replies anymore.

I would actually like to work on R4L, but I don't like being insulted anymore. I'm too old to waste my time on things like that. Thanks.

Cost vs benefit?

Posted Oct 8, 2024 15:47 UTC (Tue) by kleptog (subscriber, #1183)

I understand the desire to avoid code duplication, but not at the cost of making everything else more complicated.

I'd have just accepted a few lines of duplicated ASM, and added a test case that fails if the ASM goes out of sync and called it a day.

Then again, this is C and perhaps a test case is even more complicated than this solution.

Cost vs benefit?

Posted Oct 8, 2024 18:08 UTC (Tue) by raven667 (subscriber, #5198)

I'd say it's still duplicating the few lines of ASM needed, just wrapped in automation rather than done once by hand, so that it can automagically update itself if any one of the arch sources changes. This makes things more brittle in one way (changing the format of the source can break the build step that extracts it), but it's a compromise. Maybe in the future you could turn the extraction process into an audit, where you manually copy into both places but use the same automation to fail the build if they don't match; either way there is probably some maintenance cost down the road. Maybe as dynamic tracepoints are added or the structure changes, this will become unnecessary because someone will refactor the relevant feature so that it can be cleanly consumed by both C and Rust at the same time. But it doesn't all need to happen at once: getting to a working, reliable stopping point that can be shipped, then revisiting later with the benefit of experience and hindsight, is better than waiting for a perfect future-proof design now.
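kleptog's build-time test and raven667's audit amount to the same check: keep two copies, and fail if they drift apart. A minimal sketch with hypothetical function names; it compares the copies after collapsing whitespace and leaves out how a build would actually locate and read them.

```rust
/// Collapse runs of whitespace and drop blank lines, so that purely
/// cosmetic differences between the two copies do not fail the check.
fn normalize(snippet: &str) -> Vec<String> {
    snippet
        .lines()
        .map(|line| line.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|line| !line.is_empty())
        .collect()
}

/// Ok(()) if the copies agree; otherwise the first pair of lines that
/// differ (None on one side means that copy ran out of lines first).
fn check_in_sync(c_copy: &str, rust_copy: &str) -> Result<(), (Option<String>, Option<String>)> {
    let mut a = normalize(c_copy).into_iter();
    let mut b = normalize(rust_copy).into_iter();
    loop {
        match (a.next(), b.next()) {
            (None, None) => return Ok(()),
            (x, y) if x == y => continue,
            (x, y) => return Err((x, y)),
        }
    }
}
```

Reporting the first mismatching line, rather than a bare pass/fail, gives whoever breaks the build a pointer to which copy needs updating.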

Cost vs benefit?

Posted Oct 8, 2024 18:48 UTC (Tue) by NYKevin (subscriber, #129325)

Perhaps it would be simplest to just have both Rust and C grow the ability to #include asm files (without any C or Rust syntax, just a raw foo.s file). I suspect that C's preprocessor can in fact do that as-is, but Rust would probably have to use some kind of proc macro hack, which is not ideal but also not completely terrible.

Cost vs benefit?

Posted Oct 8, 2024 20:10 UTC (Tue) by iabervon (subscriber, #722)

One potential issue with using literally just foo.s is that architectures with caller-saved registers will presumably want the targets to specify that these registers are clobbered by the inline assembly, so that the compiler doesn't try keeping anything in them across the tracepoint. That information is normally in the qualifiers after the code in an inline asm statement, and it's also going to be architecture-dependent, and could get updated if there are future changes to the implementation.

Cost vs benefit?

Posted Oct 8, 2024 20:18 UTC (Tue) by NYKevin (subscriber, #129325)

Ugh, so what you're saying is that they'd need to invent a full-blown inline assembly file format that specifies all of this information in a way that can be ingested by Rust proc macros (relatively easy, those are Turing complete) and also the C compiler/preprocessor (hahaha, no, unless the format is C or extremely C-like... which is the status quo anyway).

Cost vs benefit?

Posted Oct 9, 2024 11:29 UTC (Wed) by ianmcc (subscriber, #88379)

I'd have thought that it would be better, for tracepoint functions, to require that they don't clobber registers, even if that isn't the usual convention on some particular arch? If the tracepoints are almost always not active, it doesn't make much sense to have the inactive tracepoint implemented as (save registers, nop, restore registers). Well, I guess in the inactive case you could replace the save/restore with nops as well, but that still makes a much longer sequence than necessary.

Cost vs benefit?

Posted Oct 9, 2024 13:20 UTC (Wed) by daroc (editor, #160859)

Tracepoints aren't implemented that way. They are just a single nop, the same size as a jump. To activate the tracepoint, the nop is replaced by a jump to a bit of code that does the saving and restoring of registers.

Cost vs benefit?

Posted Oct 8, 2024 20:53 UTC (Tue) by riking (subscriber, #95706)

Yes, Rust can do that:

> asm!(include_str!("nops.s"), options(preserves_flags))

You would then need to specify which registers the inputs are taken in, which would have to be arch-specific.
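Rust's `asm!` keeps the template string separate from the operands and the `options(...)` list, which is where the register and flag information iabervon mentions would go. The sketch below uses a literal template rather than `include_str!` so that it is self-contained; the bytes are the recommended 5-byte x86-64 NOP, and `tracepoint_site()` and `traced_add()` are hypothetical names. It only emits the placeholder; nothing records the site's address, so nothing could patch it.

```rust
/// A hypothetical tracepoint site: one 5-byte NOP on x86-64.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn tracepoint_site() {
    // SAFETY: these bytes decode to `nopl 0x0(%rax,%rax,1)`, which accesses
    // no memory, does not touch the stack, and leaves the flags intact,
    // matching the options declared below.
    unsafe {
        std::arch::asm!(
            ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00",
            options(nomem, nostack, preserves_flags)
        );
    }
}

/// Other architectures: no placeholder in this sketch.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn tracepoint_site() {}

/// Ordinary code with a tracepoint site in its body.
fn traced_add(a: u64, b: u64) -> u64 {
    tracepoint_site();
    a + b
}
```

If the patched-in code could make a call, the operand list is also where the clobbered registers would be declared; `asm!` accepts `clobber_abi("C")` for that purpose, and which registers it covers is, as noted above, architecture-dependent.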