
Footguns

Posted Jul 19, 2021 10:26 UTC (Mon) by farnz (subscriber, #17727)
In reply to: Footguns by mrugiero
Parent article: Rust for Linux redux

The problem with UB is not when all ways to interpret the code invoke UB - everyone, even compiler writers, agrees that there should be diagnostics for that.

The problem is when there are multiple ways to interpret the code, such that some interpretations don't lead to UB, but others do. The classic example is:


struct useful {
    int value;
    /* And a lot more bits */
};

void ub(void * forty) {
    struct useful * ptr = (struct useful *) forty;
    int value = ptr->value;
    if (!forty) {
        return;
    }
    /* Do stuff with value */
}

There are two interpretations here; either forty is NULL, or it isn't. If forty is NULL, then ptr->value is UB, and thus the code snippet invokes UB. If forty is not NULL, then ptr->value is not UB, and so the compiler reasons that !forty must be false, always.

And generated code, among other uses of C compilers, often depends on the compiler being able to spot that an expression must be true, and optimising accordingly. As a result, there's plenty of real code that depends on the optimizations that fire when forty cannot be NULL, and the compiler simply works backwards from what it knows - that if forty was NULL, then there would be UB, ergo forty must be known not to be NULL.
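
A sketch of what the optimizer is then entitled to treat ub() as, once it has concluded that forty cannot be NULL (illustrative only, not the output of any particular compiler; struct useful is the one from the snippet above):

void ub(void * forty) {
    struct useful * ptr = (struct useful *) forty;
    int value = ptr->value; /* the dereference implies forty != NULL... */
    /* ...so the "if (!forty) return;" check is dead code and disappears */
    /* Do stuff with value */
}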

In this particular example, the compiler's reasoning chain (and the user's mistake) is fairly obvious, and it wouldn't be challenging to have a warning that says "hey, !forty is always false because ptr->value would be UB otherwise, so this if statement is not meaningful" and let the user spot the error from there; it gets much more challenging with complex code. And neither compiler developers nor standards committees seem willing to say that good C compilers give a friendly warning when they find a codepath that invokes UB - in part because it'd trigger a lot of noise on legacy codebases.



Footguns

Posted Jul 19, 2021 11:20 UTC (Mon) by smurf (subscriber, #17840) (7 responses)

> There are two interpretations here

Three. The third is that whoever wrote the program mis-ordered these statements. So the compiler might as well rearrange them (the code has UB, so it's allowed to do that), and emit a warning while it's at it.
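
For reference, a sketch of the presumably intended ordering, with the NULL check hoisted above the dereference (illustrative only; struct useful is the one from the snippet above):

void not_ub(void * forty) {
    struct useful * ptr = (struct useful *) forty;
    if (!forty) {
        return;
    }
    int value = ptr->value; /* only reached when forty is not NULL */
    /* Do stuff with value */
}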

Footguns

Posted Jul 19, 2021 11:45 UTC (Mon) by farnz (subscriber, #17727)

That is not an interpretation of the code as written - that's a guess at the authorial intention. And while this is a simple case, chosen because it's painfully obvious both what the author meant and how the compiler gets to "!forty must be false", one of the things that makes UB so painful is that the chain of reasoning the compiler uses from "if this is true, then UB" to "the meaning of this program is nowhere near what the author intended" can be huge, and can even span multiple files if you're unlucky (an assumption of not-NULL in a header file resulting in the only non-UB case being "pointer is not NULL", or "unsigned value is less than 64", or more complex things around signed overflow).

Which is why I think that there should be a way to mark areas where UB could happen (Rust uses the unsafe keyword for this), and the compiler should be on the hook for defining all behaviour outside those blocks. If that's not reasonable (e.g. due to legacy code), then as a QoI issue, I'd like compilers to explicitly call out in a warning when they're reasoning backwards from "anything else would be UB".

And yes, I know this is not easy. But it'd help out the people who want to write good modern C or C++ if their compilers could be trusted to alert them when there's a risk of UB resulting in badly behaved programs.

Footguns

Posted Jul 19, 2021 12:46 UTC (Mon) by excors (subscriber, #95769) (5 responses)

> Three. The third is that whoever wrote the program mis-ordered these statements. So the compiler might as well rearrange them (the code has UB, so it's allowed to do that), and emit a warning while it's at it.

Reordering the statements means it would have to emit instructions to perform the NULL check. If the function is never called with NULL, that's an unnecessary performance cost. It's not a "might as well" situation - it's choosing to sacrifice performance of correct code, in exchange for less surprising behaviour if the code is buggy.

I think farnz's point about "generated code" is particularly relevant for C++ because many libraries (including the standard library) depend heavily on template metaprogramming, which is basically a program that runs at compile-time where the inputs are C++ types and the outputs are C++ functions, so most C++ programs will include a lot of generated code in that sense. And that relies heavily on inlining, constant propagation, dead code elimination, etc, to get good performance - the metaprogram itself is (usually) functional and very recursive and has huge amounts of dead code, but needs to end up compiling the same as a simple imperative function.

Because the library developers know they can rely on the compiler doing that optimisation, they can design the libraries to be generic and type-safe and easy-to-use, while still getting as good performance as specialised hand-tuned code (sometimes purely by relying on clever compiler optimisation, sometimes by explicitly selecting different algorithms based on properties of the type it's being specialised for). And then users of the library can write much safer programs than if they tried to implement it by hand, because the library has been carefully designed and tested and reviewed by experts for years.

If you try to sacrifice performance for safety, then those libraries may become much slower and developers would stop using them, so they'd write bad manual replacements, and then you'd end up losing performance *and* safety.

(That doesn't apply so much to C, which doesn't have that metaprogramming facility (the closest thing is macros, which are terrible). But I guess the issue there is that few people care enough about C to write a dedicated C compiler, they just write a C++ compiler and stick a C frontend on it, and if C programmers refuse to use C++ then they'll share the costs of that C++-focused optimisation but will miss out on most of the high-level safety benefits.)

Footguns

Posted Jul 19, 2021 13:22 UTC (Mon) by mrugiero (guest, #153040) (4 responses)

> Reordering the statements means it would have to emit instructions to perform the NULL check. If the function is never called with NULL, that's an unnecessary performance cost. It's not a "might as well" situation - it's choosing to sacrifice performance of correct code, in exchange for less surprising behaviour if the code is buggy.
If the programmer put the check there then the programmer told the compiler the pointer could be NULL, so it shouldn't assume the opposite unless it can absolutely prove the programmer was wrong.

> If you try to sacrifice performance for safety, then those libraries may become much slower and developers would stop using them, so they'd write bad manual replacements, and then you'd end up losing performance *and* safety.
The template metaprogramming case is special in the sense that the compiler knows whether, for that particular invocation, the argument would be NULL (although you can't pass pointers as template arguments AFAIR; if we assume that it relies on inlining, then we know for sure that once the check has been performed at the outermost call we can erase it from all of the recursive calls). So you don't really sacrifice that much performance.
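
The same effect can be sketched in plain C with inlining (the names here are illustrative, not from the thread): once the caller's check is visible to the compiler, the helper's defensive check folds away on that path.

struct item {
    int n;
};

static inline int get_n_or_zero(const struct item *p) {
    return p ? p->n : 0; /* defensive check inside the helper */
}

int sum_first_two(const struct item *a, const struct item *b) {
    if (!a || !b) {
        return 0; /* the caller's explicit check */
    }
    /* After inlining, the compiler can drop the helper's NULL tests,
     * because a and b are known to be non-NULL on this path. */
    return get_n_or_zero(a) + get_n_or_zero(b);
}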

Footguns

Posted Jul 19, 2021 13:58 UTC (Mon) by farnz (subscriber, #17727) (2 responses)

Trivial plain C counterexample, where the check is seen by the compiler, but the programmer did not intend to tell the compiler that the pointer could be NULL:


/* In a header file, for config handling */
struct configs {
/* Lots of pointers to config members here, omitted */
};

#define GET_STR_CONFIG_OR_DEFAULT(config, member, default) \
    (config ? config->member ? config->member : default : default)

#define GET_INT_CONFIG_OR_DEFAULT(config, member, default) \
    (config ? config->member : default)

/* In a C file, using the header file */
void do_amazing_things(struct configs *config) {
    int rating = GET_INT_CONFIG_OR_DEFAULT(config, rating, 0);
    char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");

    /* Do the amazing things */
}

In this case, the call to GET_INT_CONFIG_OR_DEFAULT (a macro) has already established that config is not NULL on the path where it didn't bail out to the default; a compiler that doesn't optimize on this will be issuing repeated NULL checks for something it knows must be non-NULL on that path, or complaining because the user has an apparently unnecessary NULL check in the expansion of GET_STR_CONFIG_OR_DEFAULT.

The code the compiler sees when the preprocessor hands it off is:


struct configs {
/* Lots of pointers to config members here, omitted */
};
/* Rest of headers etc */

void do_amazing_things(struct configs *config) {
    int rating = (config ? config->rating : 0);
    char * expertise = (config ? config->expertise ? config->expertise : "novice" : "novice");

    /* Do the amazing things */
}

Which leads you into the question of provenance for your "always true" if statement: in this case, the compiler has to know that it knows config is not NULL because the programmer did an explicit check, not because of an "if it's NULL, then we have UB" inference; otherwise you get a noisy warning because the programmer included an apparently unnecessary NULL check in the macro. But they did that because that way the macro is safe to use even if it's the only config access in a function, and if the user instead does:


/* In a C file, using the header file */
void do_amazing_things(struct configs *config) {
    int rating = config->rating;
    char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");

    /* Do the amazing things */
}

then the compiler will need to warn, because its "config is not NULL" assumption came from "if config is NULL, then there is UB in the execution". But first we need compiler authors to get on board with the idea that whenever UB is invoked to permit an optimization, the compiler should say so - noting that this can require complex tracking to get from your low-level IR (MachineInstr in the LLVM case, for example) all the way back to the C source, so that you can confirm that a given decision in codegen is not causing issues because it rests on (for example) assumptions about signed overflow.

Footguns

Posted Jul 19, 2021 19:11 UTC (Mon) by mrugiero (guest, #153040) (1 response)

> In this case, the call to GET_INT_CONFIG_OR_DEFAULT (a macro) has already established that config is not NULL on the path where it didn't bail out to the default; a compiler that doesn't optimize on this will be issuing repeated NULL checks for something it knows must be non-NULL on that path, or complaining because the user has an apparently unnecessary NULL check in the expansion of GET_STR_CONFIG_OR_DEFAULT.
The compiler can prove the extra check is unnecessary there. I'm not sure it would be wrong for it to complain, but it's an entirely different case than adding a test after a dereference of something you can't prove to not be NULL.

> Which leads you into the question of provenance of your "always true" if statement; in this case, the compiler has to know that it knows config is not NULL because the programmer did something that's an explicit check, not a "if it's NULL, then we have UB" case, otherwise you get a noisy warning because the programmer included an unnecessary NULL check in the macro. But they did that because that way the macro is safe to use even if it's only used once in a function, and if the user does do:

> /* In a C file, using the header file */
> void do_amazing_things(struct configs *config) {
> int rating = config->rating;
> char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");

> /* Do the amazing things */
> }

> then the compiler will need to warn, because its "config is not NULL" assumption came from "if config is NULL, then there is UB in the execution". But first we need compiler authors to get on board with the idea that whenever UB is invoked to permit an optimization, the compiler should say so - noting that this can require complex tracking to get from your low-level IR (MachineInstr in the LLVM case, for example) all the way back to the C source, so that you can confirm that a given decision in codegen is not causing issues because it rests on (for example) assumptions about signed overflow.

The only thing I said is always true is that you can't dereference something you can't prove not to be NULL and expect it not to be UB; the check reordering is allowed because the code was UB before, and we can assume at that point what the author wanted. This doesn't mean a warning should not be emitted, unless (somehow) specified otherwise. If there's a check earlier in the same function, it's trivial to prove the pointer can no longer be NULL after that. Besides, you don't need to know whether an optimization resulted from the UB; knowing the UB exists is enough. And AFAICT that can be known before transforming to IR.

Footguns

Posted Jul 20, 2021 11:54 UTC (Tue) by farnz (subscriber, #17727)

> The only thing I said is always true is that you can't dereference something you can't prove not to be NULL and expect it not to be UB; the check reordering is allowed because the code was UB before, and we can assume at that point what the author wanted. This doesn't mean a warning should not be emitted, unless (somehow) specified otherwise. If there's a check earlier in the same function, it's trivial to prove the pointer can no longer be NULL after that. Besides, you don't need to know whether an optimization resulted from the UB; knowing the UB exists is enough. And AFAICT that can be known before transforming to IR.

This is the difference between your model, and the C model. In C, you can dereference something that you can't prove not to be NULL, and expect it to not be UB; it's only UB if, on a specific execution, the thing is NULL. If it happens not to be NULL, then there's no UB. This comes down to whole program analysis - as a programmer, you might ensure that you only call ub(config) with non-NULL pointers, and thus it's fine.
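
A sketch of that situation (hypothetical call sites, not from the article): every caller guards against NULL before calling ub(), so no real execution ever dereferences a NULL pointer, and the program contains no UB even though ub() itself cannot prove its argument is non-NULL.

extern void ub(void * forty);

void caller_one(void *cfg) {
    if (cfg != NULL) {
        ub(cfg); /* cfg is known non-NULL here */
    }
}

void caller_two(void *cfg) {
    if (cfg == NULL) {
        return;
    }
    ub(cfg); /* likewise */
}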

So the C compiler is able to reason backwards - if the thing *is* NULL, then there is UB; ergo it must not be NULL, because UB isn't allowed, which permits optimization on the basis that it's not NULL. This is fine if the explicit NULL check is (e.g.) from a macro, or in generated code, or left over from refactoring and not yet removed; you've taken out something that's always false in this context, and just deleted the dead code.

It only becomes a problem where the human is surprised by the deletion of dead code - i.e. where the human thought the check was still important, but it's actually considered always false (or always true) by the compiler. And as deletion of dead code takes place at every level - from AST, through IR, through machine instruction choice - the compiler needs to link back its elimination of dead code and determine whether or not the human would be surprised by this particular elimination.

And that's what makes it a hard problem - we may know that ub(config) is only ever called with a non-NULL pointer, but that needs a whole program analysis which the compiler cannot perform. We may also know this simply because we've refactored so that callers of ub(config) do the NULL check themselves, and the extra NULL check inside the function may be there because a preprocessor (which can be outside the compiler!) has used a general form of code that works regardless of the presence of the NULL checks; why use two macros and force the human to think about whether or not a NULL check has happened already, when you can use one?
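
A sketch of the two-macro alternative being alluded to (illustrative names, not from the thread): one variant for call sites that may pass NULL, and one for call sites that have already checked.

/* For callers that may pass NULL: */
#define GET_INT_CONFIG_OR_DEFAULT(config, member, default) \
    ((config) ? (config)->member : (default))

/* For callers that have already checked config against NULL: */
#define GET_INT_CONFIG_UNCHECKED(config, member) \
    ((config)->member)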

Footguns

Posted Jul 19, 2021 14:56 UTC (Mon) by excors (subscriber, #95769)

> The template metaprogramming case is special in the sense that the compiler knows whether, for that particular invocation, the argument would be NULL (although you can't pass pointers as template arguments AFAIR; if we assume that it relies on inlining, then we know for sure that once the check has been performed at the outermost call we can erase it from all of the recursive calls)

But the compiler may not know that. E.g. you could have code like:

#include <memory>
#include <vector>

struct S {
    int n;
    // ...
};

template <typename T, typename Pred>
void reset_if(T &items, Pred pred) {
    for (auto &item : items)
        if (pred(item))
            item = nullptr;
}

// Precondition: all items are not nullptr
// Postcondition: all items are even-numbered or are nullptr
void remove_odd(std::vector<std::unique_ptr<S>> &items)
{
    reset_if(items, [](auto &item) { return item->n % 2 != 0; });
}

(i.e. there's a dynamically-allocated array of pointers to S, and the function deallocates all the Ss that match some condition. This is somewhat artificial, but I don't think it's totally implausible as a component of some larger operation). The function's precondition is guaranteed by the programmer but is not visible to the compiler. (Maybe the caller is in a separate translation unit, or a dynamically-loaded library.)

On "item = nullptr", the unique_ptr will deallocate its original contents (to prevent memory leaks). Because unique_ptr is generic and supports custom deallocator functions, and the custom deallocator might not accept nullptr as an argument, the unique_ptr destructor is specified to do something like "if (get() != nullptr) get_deleter()(get());".

In this function, the unique_ptr destructor is only called after the "item->n % 2" check, which is dereferencing the pointer. Then the compiler knows the pointer can't be nullptr, so it can omit the "if (get() != nullptr)" test and save some time and code size. (In this case it's only a couple of instructions, but it's not hard to imagine similar cases where the 'if' is followed by an 'else' with a much larger chunk of code, and eliminating the whole 'else' clause could make a big difference.)

The null check wasn't intentionally added to this code by any programmer, it's just a part of the (relatively) safe abstraction provided by unique_ptr, so it's included in the code that's generated by the template instantiation. In this function the code is used in a way where the check is redundant and the compiler is able to figure that out because it sees the programmer was deliberately dereferencing the pointer, so the abstraction has zero cost and the programmer doesn't have to revert to a less-safe kind of pointer to get the same performance.

Footguns

Posted Jul 19, 2021 13:16 UTC (Mon) by mrugiero (guest, #153040) (3 responses)

> There are two interpretations here; either forty is NULL, or it isn't. If forty is NULL, then ptr->value; is UB, and thus the code snippet invokes UB. If forty is not NULL, then ptr->value is not UB, and so the compiler reasons that !forty must be false, always.
AFAICT, the logical conclusion here is that every pointer argument can be NULL unless proved otherwise. Since you can't prove that for non-static functions, that's necessarily UB. This is why many compilers have extensions to explicitly tell the compiler an argument can never be NULL.
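
For instance, GCC and Clang accept the nonnull function attribute; a minimal sketch of its use (applying it to the earlier ub() example is my illustration, not something from the article):

/* Tells the compiler that the first argument must never be NULL; it may
 * warn about calls that pass a literal NULL and may optimize on the
 * assumption that the argument is non-NULL. */
void ub(void * forty) __attribute__((nonnull(1)));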

> In this particular example, the compiler's reasoning chain (and the user's mistake) is fairly obvious, and it wouldn't be challenging to have a warning that says "hey, !forty is always false because ptr->value would be UB otherwise, so this if statement is not meaningful", and have the user spot the error from there, but it gets more challenging with complex code. And neither compiler developers nor standard committees seem willing to say that good C compilers give a friendly warning when they find a codepath that invokes UB - in part because it'd trigger a lot of noise on legacy codebases.
Which was exactly my point. Besides, the point of triggering noise is moot IMO, since you can always add a flag to ignore the warning like with any other warning.

Footguns

Posted Jul 19, 2021 13:27 UTC (Mon) by farnz (subscriber, #17727) (2 responses)

> > There are two interpretations here; either forty is NULL, or it isn't. If forty is NULL, then ptr->value; is UB, and thus the code snippet invokes UB. If forty is not NULL, then ptr->value is not UB, and so the compiler reasons that !forty must be false, always.

> AFAICT, the logical conclusion here is that every pointer argument can be NULL unless proved otherwise. Since you can't prove that for non-static functions, that's necessarily UB. This is why many compilers have extensions to explicitly tell the compiler an argument can never be NULL.

This is why the function included the check "if (!forty) { return; }"; after that test, forty is provably not NULL, and the compiler can optimize knowing that forty is not NULL. The gotcha is that the user invoked UB before that check by dereferencing ptr, and the compiler used that information to know that forty cannot be NULL after the dereference, because if it is, UB is invoked.

> > In this particular example, the compiler's reasoning chain (and the user's mistake) is fairly obvious, and it wouldn't be challenging to have a warning that says "hey, !forty is always false because ptr->value would be UB otherwise, so this if statement is not meaningful", and have the user spot the error from there, but it gets more challenging with complex code. And neither compiler developers nor standard committees seem willing to say that good C compilers give a friendly warning when they find a codepath that invokes UB - in part because it'd trigger a lot of noise on legacy codebases.

> Which was exactly my point. Besides, the point of triggering noise is moot IMO, since you can always add a flag to ignore the warning like with any other warning.

And my point is that this is entirely a social problem - nobody with the power to say "compilers that use UB to optimize must tell the user what UB the compiler is exploiting" is willing to do so. If compiler writers for C were doing that, even if it was the preserve of the very best C compilers (so GCC and Clang), it'd be enough to reduce the number of developers who get surprised by compiler use of UB.

Footguns

Posted Jul 19, 2021 18:57 UTC (Mon) by mrugiero (guest, #153040) (1 response)

> This is why the function included the check "if (!forty) { return; }"; after that test, forty is provably not NULL, and the compiler can optimize knowing that forty is not NULL. The gotcha is that the user invoked UB before that check by dereferencing ptr, and the compiler used that information to know that forty cannot be NULL after the dereference, because if it is, UB is invoked.
Of course. My point was that _on entry_ it isn't provably not NULL, so a NULL dereference is a possibility on that line, and thus the conclusion should consistently be that the first dereference is UB. My point is that there's no true ambiguity about that.

> And my point is that this is entirely a social problem - nobody with the power to say "compilers that use UB to optimize must tell the user what UB the compiler is exploiting" is willing to do so. If compiler writers for C were doing that, even if it was the preserve of the very best C compilers (so GCC and Clang), it'd be enough to reduce the number of developers who get surprised by compiler use of UB.
So we agree, then?

Footguns

Posted Jul 19, 2021 21:10 UTC (Mon) by farnz (subscriber, #17727)

And the problem with your point is that C is not very amenable to whole program analysis (for a variety of reasons to do with only having a global and a local static namespace, plus using textual inclusion instead of modules), which means that there's a lot of legacy code where the first dereference is not UB. It's only UB if the program execution can call ub with forty being NULL, which is not guaranteed. If the rest of the code ensures that ub is always called with a valid pointer to configs, then no UB is present.

As I've pointed out in another comment, that check could be the consequence of a macro expansion - in which case, pessimising the output machine code just because you've used a macro to access configs isn't a good look. There was no UB before; now you're adding code that verifies something the human writing the code has already verified is true outside ub, and the mere use of a macro has been enough to introduce extra machine instructions.

This is part of why it's a damn hard social problem - it's possible that ub is only called from code in 3 different C files, and all of those files already check for NULL before calling ub, so in the actual final application, there is no UB. The compiler using the fact that the check can only be true and cause an early return if UB has already happened to optimize ub simply speeds things up at no cost.

