
A strange BPF error message

By Daroc Alden
April 4, 2025

LSFMM+BPF

Yonghong Song brought a story about tracking down the cause of a strange verifier error message to the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. He then presented some possible ways to improve Clang's user experience for anyone running into the same class of error in the future. Toward the end of his allotted time, he also discussed the problems with optimizations that change the signature of functions — a problem that José Marchesi had also brought up in the previous session.

An unhelpful error

Song started by presenting an example taken from a real program. The example was a bit dense, but the problem basically comes down to this code:

    bool icmp6_ndisc_validate() {
        __u8 nexthdr;
        // ...
        int offset = ipv6_hdrlen_offset(&nexthdr);
        // ...
    }

    static __always_inline int ipv6_hdrlen_offset(__u8 *nexthdr) {
        __u8 nh = *nexthdr;
        // ...
        switch (nh) {
        case NEXTHDR_NONE:
            return DROP_INVALID_EXTHDR;
        // ...
        }
    }

The code features an uninitialized variable (nexthdr) that was passed by reference into another function. This is not invalid in C, because the other function might initialize the variable by writing to it, so Clang doesn't issue a warning. In this case, though, ipv6_hdrlen_offset() does not initialize it, and instead reads from it in order to decide which branch of a switch statement to take. Clang doesn't warn in that function either, because it assumes the function argument points to initialized memory.

At that point, the code is passed to the optimizer, and everything goes wrong. The optimizer inlines one function into the other function, notices that the program is reading from an uninitialized variable (which would be undefined behavior, which it assumes cannot happen), and decides that this code must be unreachable. It turns the entire tail of the function into a single unreachable instruction in the LLVM IR, and then hands that off to the BPF code-generation backend. That backend ignores the unreachable instruction, but since the function's original return has been subsumed into it, the code generator ends the function without emitting a return instruction. That leads in turn to this somewhat confusing error from the BPF verifier:

    last insn is not an exit or jmp

While this error makes sense with an understanding of the sequence of events that led to it, at first Song found it a good deal more puzzling. It's not intuitive that an uninitialized variable would cause this error message, he said. He actually ran into this same problem helping someone with another program — so this isn't an isolated incident. People are seeing this message and being justifiably confused.
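For orientation, the shape of the check behind that message can be approximated in a few lines of C. This is a simplified sketch, not the kernel's code: the reduced `struct insn` and the function name `last_insn_ok()` are hypothetical, and only the opcode constants are taken from the BPF instruction encoding. The real verifier does considerably more bookkeeping around this test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode fields from the BPF instruction encoding. */
#define BPF_JMP   0x05  /* instruction class: 64-bit jumps */
#define BPF_JMP32 0x06  /* instruction class: 32-bit jumps */
#define BPF_ALU64 0x07  /* instruction class: 64-bit arithmetic */
#define BPF_JA    0x00  /* unconditional jump */
#define BPF_EXIT  0x90  /* return from the program */
#define BPF_MOV   0xb0  /* register move */
#define BPF_K     0x00  /* source operand is an immediate */

/* Hypothetical, reduced instruction: only the opcode byte matters here. */
struct insn {
    uint8_t code;
};

/*
 * Returns 1 if the final instruction is an exit or an unconditional jump,
 * and 0 otherwise (the case reported as "last insn is not an exit or jmp").
 */
static int last_insn_ok(const struct insn *prog, size_t len)
{
    assert(prog != NULL);
    if (len == 0)
        return 0;

    uint8_t code = prog[len - 1].code;
    return code == (BPF_JMP | BPF_EXIT) ||
           code == (BPF_JMP | BPF_JA) ||
           code == (BPF_JMP32 | BPF_JA);
}
```

A function whose tail was folded into unreachable simply stops after its last ordinary instruction, so a check of this kind is the first thing it trips over.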

Marchesi and David Faust said that GCC does pretty much the same thing, and therefore has pretty much the same problem. One audience member asked why LLVM was generating an unreachable instruction instead of inferring that the value of the variable was undef (LLVM's representation of a value which could be anything). Song answered that LLVM's undef has "a lot of interesting semantics" that made it not always the right fit.


There have been a few attempts to avoid this kind of error, Song said. One option is to use -ftrivial-auto-var-init=zero to make the compiler initialize all variables with zero, where possible. This sort of works, in the sense that the generated program is no longer rejected, but it may hide a real bug. It's also a performance problem for some express data path (XDP) programs that may need to initialize lots of IP headers.
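To make the "may hide a real bug" point concrete, here is a small plain-C sketch contrasting zero-initialization with an actual fix. It assumes, hypothetically, that the original intent was to read the Next Header field from the fixed IPv6 header. The `verdict` codes and the function names are invented for illustration, while the NEXTHDR_* values are the standard protocol numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP   0   /* hop-by-hop options header */
#define NEXTHDR_TCP   6
#define NEXTHDR_NONE  59  /* no next header */

#define IPV6_HDR_LEN     40
#define IPV6_NEXTHDR_OFF 6   /* offset of the Next Header field */

/* Hypothetical result codes for this sketch. */
enum verdict { V_HAS_EXTHDR, V_UPPER_LAYER, V_DROP };

static enum verdict classify(uint8_t nh)
{
    switch (nh) {
    case NEXTHDR_HOP:  return V_HAS_EXTHDR;
    case NEXTHDR_NONE: return V_DROP;
    default:           return V_UPPER_LAYER;
    }
}

/* What zero-initialization amounts to: the packet is never consulted. */
static enum verdict classify_zero_init(const uint8_t *pkt, size_t len)
{
    uint8_t nexthdr = 0;  /* satisfies the verifier, keeps the logic bug */
    (void)pkt;
    (void)len;
    return classify(nexthdr);  /* 0 is NEXTHDR_HOP, for every packet */
}

/* An actual fix: load the field from the packet after a bounds check. */
static enum verdict classify_fixed(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < IPV6_HDR_LEN)
        return V_DROP;

    uint8_t nexthdr = pkt[IPV6_NEXTHDR_OFF];
    return classify(nexthdr);
}
```

Because zero happens to be a meaningful protocol number, the zero-initialized version treats every packet as if it carried a hop-by-hop options header, and nothing in the toolchain complains.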

Another approach that he tried was to have the BPF backend recognize the unreachable instruction and emit an error at that point. This is better, but it's not an airtight defense, because there's no guarantee that the optimizer won't do something else in the future that results in different code being generated. For example, it could have just chosen to assume that the value of the variable matched whichever case of the switch statement it found most convenient.

If the presence of unreachable could be relied on, the BPF backend could emit a useful error message when it sees it. So the approach Song is currently pursuing is to try and make it so that the optimizer will not use transformations that can eliminate unreachable instructions when compiling for BPF. He also has a pull request for LLVM open that tries to generate unreachable in more cases, although it looks unlikely to be accepted in its current form.

One attendee suggested using LLVM's poison value, which is subtly different from undef (as a presentation from the 2020 LLVM Developers' Meeting explains). Song agreed that it was possible in theory, but said that it wasn't likely to be accepted by the LLVM maintainers for various reasons.

Marchesi wondered whether this same kind of behavior could manifest in other verifier errors, or whether it was always the same message. Song answered that he had only observed this specific error in testing, but that in general there was no reason to assume that other verifier errors were impossible. Eduard Zingerman said that he had actually seen some sched_ext code that did not result in the "last insn is not an exit or jmp" message, but had caused a verification failure in a different place in the program. Marchesi suggested that this specific case could be caught by examining the program's control-flow graph at compile time. Song said that was not possible, because LLVM's BPF backend doesn't have access to the control-flow graph. Marchesi asserted that this was a problem with LLVM's design, and that the backend needs access to the program's control-flow graph for several reasons.

As a partial solution that would at least deliver better error messages, Song proposed having the BPF backend generate a call to the non-existent bpf_unreachable() kernel function when it sees an unreachable instruction. This would still result in a verifier failure on existing kernels, but hopefully one that is more specific and therefore easily searched for. Future kernels could recognize calls to bpf_unreachable() and supply a nicer failure message. Specifically, he proposed:

    last isns marked as unreachable, maybe due to uninitialized variable?

Some other alternatives he considered included adding an unreachable instruction to the BPF virtual machine, adding a bpf_unreachable() kernel function, or actually making the Clang frontend detect all uninitialized variable usage across functions. The first two are not really necessary, he said. Someone working at Google actually had a patch that implemented the last option, but it never got merged. At the time, the project didn't consider it a priority because people normally use a sanitizer to detect problems with uninitialized variables. Unfortunately, that's not really an option for BPF programs.

Faust commented that this sounded like another use case where it would be helpful to have the rules of the verifier extracted out of the kernel so they could be run elsewhere. If that were done, the compiler could check the binary itself, and then use its context on the program to produce a more helpful error message.

Signature changes

With the time remaining in the session, Song turned to another topic: how optimization can change the signatures of functions, and how to represent that in BPF's debugging information format, BTF. According to an analysis of the DWARF debug information of a recent kernel, there are 64,129 functions in the kernel. Of those, 635 have arguments changed, 306 have the return value removed, and 18 have both.

The DWARF debug format does actually have a way to represent that information, in the form of the DW_AT_calling_convention attribute, but it's not specific enough: it only tells the user that something changed, not what changed. Song then briefly described two proposed new ways of representing the original signature of an optimized function in BTF. Unfortunately, the group didn't have much time to dig into the topic before it was time for the next session.
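As an illustration of what such a signature change looks like, consider the following sketch; the `stats` structure and both function names are hypothetical. An optimizing compiler may notice that `flags` is never read and that no caller uses the return value, and emit `record_event()` as though it had been declared `void record_event(struct stats *s)`. GCC, for example, often gives clones rewritten this way a suffix such as `.isra.0`. Whether that happens for a given function depends on the compiler, its version, and the optimization level.

```c
#include <assert.h>
#include <stddef.h>

struct stats {
    long events;
};

/*
 * 'flags' is never read, and the only caller ignores the return value,
 * so both are candidates for removal by interprocedural optimization.
 * noinline keeps the function as a separate symbol for the example.
 */
static __attribute__((noinline)) int record_event(struct stats *s, int flags)
{
    assert(s != NULL);
    s->events++;
    return 0;
}

void on_packet(struct stats *s)
{
    record_event(s, 0);  /* return value discarded */
}
```

A tracing program attached to the emitted function would then see one argument and no return value, while the source-level signature still promises two arguments and an int.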


Index entries for this article
Kernel: BPF/Verifier
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



Isn't this a language problem?

Posted Apr 4, 2025 18:21 UTC (Fri) by geofft (subscriber, #59789) [Link] (3 responses)

Isn't this fundamentally a (classic) weakness in the C language, that there's no way to distinguish a pointer to an uninitialized value that you need to write to first and a pointer to an initialized value that is okay to read from?

BPF is its own environment, and there's no particular requirement to be fully backwards compatible with the large body of existing C code. (I would bet that the overwhelming majority of existing C code won't pass the verifier!) Why is using standard C the answer here?

Of course one approach is to use another language that can distinguish these at the type level (like Rust's MaybeUninit<T>, and using Rust would unsurprisingly be my preferred solution), but wouldn't it also be feasible to write BPF programs in some variant of C that addresses these sorts of issues? If you have no need to compile existing, unmodified C code, couldn't you do something along the lines of Cyclone and add some new pointer types that capture the information needed to avoid these sorts of issues?

Isn't this a language problem?

Posted Apr 4, 2025 23:23 UTC (Fri) by cesarb (subscriber, #6266) [Link] (1 responses)

> Isn't this fundamentally a (classic) weakness in the C language, that there's no way to distinguish a pointer to an uninitialized value that you need to write to first and a pointer to an initialized value that is okay to read from?

Some Microsoft library headers use annotations like __in or __out to specify whether a pointer is supposed to be an input or an output for a function call, probably borrowed from IDL. I don't know whether gcc or clang have an equivalent __attribute__ which could be used to implement these annotations.

In the example on this article, if the nexthdr argument of ipv6_hdrlen_offset were annotated as __in, the compiler could know that passing it a pointer to an uninitialized value is a mistake, allowing it to output a warning on the call in icmp6_ndisc_validate; if, on the other hand, the nexthdr argument of ipv6_hdrlen_offset were annotated as __out, the compiler could generate a warning when that function tries to read from that argument. Either way, it would be easier to see that something is wrong with the code.

Isn't this a language problem?

Posted Apr 5, 2025 16:51 UTC (Sat) by ABCD (subscriber, #53650) [Link]

GCC (but apparently not Clang) does have an attribute that might be useful here; unfortunately, it is applied to the function as a whole, not a particular parameter (although it does reference a specific parameter, and can be applied multiple times for multiple parameters). For the function in question, it would probably look something like:

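    /* GCC rejects attributes placed after the declarator in a function
       definition, so the attribute comes first. */
    __attribute__((access(read_only, 1)))
    static __always_inline int ipv6_hdrlen_offset(__u8 *nexthdr) {
        /* ... */
    }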

For further information, see the GCC documentation for the access attribute.

Isn't this a language problem?

Posted Apr 5, 2025 9:42 UTC (Sat) by dottedmag (subscriber, #18590) [Link]

Or nag the people working on the C standard to strike another item off the "undefined behaviour" list, so that demons stop flying off one's nose in this particular case. Fortunately, this work is already being done.

Trivial for static code analysis

Posted Apr 5, 2025 2:32 UTC (Sat) by jreiser (subscriber, #11027) [Link] (2 responses)

Almost any tool for static [offline] code analysis will detect the error in the exhibited code that deals with nexthdr. In many cases, running BPF code requires privileges. The administrator of the privileges can establish a requirement for a satisfactory report from a static analysis tool, perhaps authenticated using cryptography or sourced from within a restricted area of a filesystem. It is too much to expect BPF to do everything dynamically [online].

Trivial for static code analysis

Posted Apr 5, 2025 3:06 UTC (Sat) by intelfx (subscriber, #130118) [Link] (1 responses)

> Almost any tool for static [offline] code analysis will detect the error in the exhibited code that deals with nexthdr. In many cases, running BPF code requires privileges. The administrator of the privileges can establish a requirement for a satisfactory report from a static analysis tool, perhaps authenticated using cryptography or sourced from within a restricted area of a filesystem. It is too much to expect BPF to do everything dynamically [online].

If this mechanism existed, there would have been no need for BPF in the first place. Why bother with BPF at all, with its verifier and all its idiosyncrasies, if it were possible, feasible, and acceptable to "just" require an unforgeable report about the safety of a given program?

Trivial for static code analysis

Posted Apr 5, 2025 11:15 UTC (Sat) by ballombe (subscriber, #9523) [Link]

They serve a different purpose. The BPF verifier checks that the program is harmless, the static checker checks that it has no obvious bugs.

Is this an IPv6 problem?

Posted Apr 5, 2025 3:22 UTC (Sat) by buck (subscriber, #55985) [Link] (1 responses)

Perhaps a wee bit off-topic, but successful compilation resulting in dropping IPv6 packets would arguably be expected behavior here

https://www.rfc-editor.org/rfc/rfc9098.html#name-introduc...

On a slightly more serious note, if this code is just peeking at the next header, then, OK, but, in general, can one chase nexthdrs and still have the BPF verifier accept the code? I thought it was somehow responsible for verifying the code is guaranteed to terminate (without solving the halting problem)

Is this an IPv6 problem?

Posted Apr 7, 2025 12:06 UTC (Mon) by daroc (editor, #160859) [Link]

In this case I think the code was just peeking at the next header, yes. But modern BPF does allow pointer chasing — the trick is that any loop must be verifiably bounded. Depending on the exact way you write the loop this is done in different ways. Options include: having a bounded maximum number of iterations, or using the may_goto instruction to allow the BPF runtime to terminate a loop early if it wants to.
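The first option, a bounded maximum number of iterations, might look roughly like the following. This is a plain-C sketch over a byte buffer rather than a real BPF program: the function name `skip_ext_hdrs()` and the cap `MAX_EXT_HDRS` are hypothetical, only a few extension-header types are handled, and an actual BPF program would have to express its bounds checks against the packet's data and data_end pointers in a form the verifier can follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP      0
#define NEXTHDR_ROUTING  43
#define NEXTHDR_FRAGMENT 44
#define NEXTHDR_DEST     60

#define IPV6_HDR_LEN  40
#define MAX_EXT_HDRS  4   /* hypothetical cap; this is what bounds the loop */

/*
 * Returns the offset of the upper-layer header and stores its protocol
 * number in *proto. Returns -1 if the packet is truncated or carries
 * more than MAX_EXT_HDRS extension headers.
 */
static int skip_ext_hdrs(const uint8_t *pkt, size_t len, uint8_t *proto)
{
    assert(pkt != NULL && proto != NULL);
    if (len < IPV6_HDR_LEN)
        return -1;

    uint8_t nh = pkt[6];        /* Next Header field of the fixed header */
    size_t off = IPV6_HDR_LEN;  /* invariant: off <= len */

    /* One extra pass so the header after the last allowed one is examined. */
    for (int i = 0; i <= MAX_EXT_HDRS; i++) {
        size_t hdr_len;

        switch (nh) {
        case NEXTHDR_HOP:
        case NEXTHDR_ROUTING:
        case NEXTHDR_DEST:
            if (len - off < 2)
                return -1;
            /* Length is in 8-octet units, not counting the first 8. */
            hdr_len = ((size_t)pkt[off + 1] + 1) * 8;
            break;
        case NEXTHDR_FRAGMENT:
            hdr_len = 8;        /* fixed size */
            break;
        default:
            *proto = nh;        /* not an extension header handled here */
            return (int)off;
        }

        if (hdr_len > len - off)
            return -1;
        nh = pkt[off];          /* first byte of every header: next type */
        off += hdr_len;
    }
    return -1;                  /* too many extension headers */
}
```

The loop count never depends on packet contents, so the worst case is known up front; a malicious chain of headers simply runs into the cap.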

Difference in Culture

Posted Apr 8, 2025 12:46 UTC (Tue) by jeremyhetzler (subscriber, #127663) [Link] (3 responses)

What's interesting to me is the culture difference.

In C, reading from an uninitialized variable is a cardinal sin and is punishable by undefined behavior. If a program does that, compilers choose to exercise their standard-given right to provide the least-helpful behavior possible: ignoring statements or entire functions, exhibiting unrelated errors far away from the actual problem, etc. Certainly not emitting a diagnostic along the lines of "hey, just fyi, you're trying to read an uninitialized variable".

BPF is a restricted variant of C. Yet in BPF, compiler writers seem to feel like letting your program crash and burn in the face of undefined behavior is not good enough. Here if your program reads from an uninitialized variable, you're not on your own. The compiler is going to try and help you.

I wonder why.

Difference in Culture

Posted Apr 8, 2025 13:33 UTC (Tue) by pizza (subscriber, #46) [Link] (2 responses)

> In C, reading from an uninitialized variable is a cardinal sin and is punishable by undefined behavior. If a program does that, compilers choose to exercise their standard-given right to provide the least-helpful behavior possible: ignoring statements or entire functions, exhibiting unrelated errors far away from the actual problem, etc. Certainly not emitting a diagnostic along the lines of "hey, just fyi, you're trying to read an uninitialized variable".

Sorry to spoil your rant, but that should read "compilers MAY choose"

I can't speak to other compilers, but GCC will happily emit diagnostics about uninitialized variable access if the user so desires. -Wuninitialized has been available since 1992 (in GCC 2.1), and has been part of the (still-growing) set enabled by -Wall since 2005 (in GCC 4.0).

Difference in Culture

Posted Apr 9, 2025 8:48 UTC (Wed) by chris_se (subscriber, #99706) [Link] (1 responses)

> I can't speak to other compilers, but GCC will happily emit diagnostics about uninitialized variable access if the user so desires. -Wuninitialized has been available since 1992 (in GCC 2.1), and has been part of the (still-growing) set enabled by -Wall since 2005 (in GCC 4.0).

In this specific example, -Wuninitialized for GCC actually warns about this, but only if the optimizer is turned on to at least -O1. However, clang (at least versions 14 through current trunk) doesn't warn about this at all, even with -O3, see the following example in godbolt:

> https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCc...

So it appears to me that clang's diagnostics could be better here (my guess is the warning is only ever emitted in clang before the optimizer is even run, thus not catching this case that is spread over two functions), which is why this compiler warning wouldn't have caught this here.

That said: that issue with clang is fixable, and in my eyes one should always compile C/C++ code with -Werror=uninitialized -- if the compiler is absolutely sure that something is uninitialized, it should be an error to read from it.

-Wmaybe-uninitialized is a different beast, I've had plenty of false positives for that, especially once switching to a newer compiler version; there I'd not want to promote that to an error. (Still I take it seriously if it does appear.)

Difference in Culture

Posted Apr 9, 2025 11:46 UTC (Wed) by pizza (subscriber, #46) [Link]

> In this specific example, -Wuninitialized for GCC actually warns about this, but only if the optimizer is turned on to at least -O1.

That is because the optimizer is what figures this (and most other potential UB) out. The corollary is that if the optimizer is not used, the compiler won't have any way of recognizing that a variable access is uninitialized, and thus won't be able to helpfully replace it with an invocation of your PC's self-destruct feature.

> That said: that issue with clang is fixable, and in my eyes one should always compile C/C++ code with -Werror=uninitialized -- if the compiler is absolutely sure that something is uninitialized, it should be an error to read from it.

FWIW I generally agree with you but naive ways of preventing this warning (such as always initializing the variable upon declaration) can hide logic bugs that this warning would have potentially exposed.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds