Footguns
Posted Jul 19, 2021 11:20 UTC (Mon) by smurf (subscriber, #17840)
In reply to: Footguns by farnz
Parent article: Rust for Linux redux
Three. The third is that whoever wrote the program mis-ordered these statements. So the compiler might as well rearrange them (the code has UB, so it's allowed to do that), and emit a warning while it's at it.
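The snippet under discussion isn't reproduced in this subthread, but the shape of the problem is a dereference that precedes its NULL check; a minimal hypothetical sketch (the names are invented, not from the parent comment):

    #include <stddef.h>

    int get_value(int *p)
    {
        int v = *p;        /* dereference first: UB if p is NULL */
        if (p == NULL)     /* provably dead once the line above has executed */
            return -1;
        return v;
    }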
Posted Jul 19, 2021 11:45 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
That is not an interpretation of the code as written - that's a guess at the authorial intention. And while this is a simple case - chosen because it's painfully obvious what the author meant, and it's equally obvious how the compiler gets to "!forty must be false" - one of the things that makes UB so painful is that the chain of reasoning the compiler uses from "if this is true, then UB" to "the meaning of this program is nowhere near what the author intended" can be huge, and even span multiple files if you're unlucky (a not-NULL assumption in a header file resulting in the only non-UB case being one where the pointer is not NULL, or the unsigned value is less than 64, or more complex things around signed overflow).
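As a hedged illustration of the cross-file case (the names here are invented, not taken from the thread): a not-NULL assumption created by an inline helper in a header can let the compiler delete a check written in a caller elsewhere:

    #include <stddef.h>

    /* widget.h - hypothetical header; the helper dereferences unconditionally */
    struct widget { int refs; };

    static inline int widget_refs(const struct widget *w)
    {
        return w->refs;            /* only non-UB if w is not NULL */
    }

    /* caller.c - after inlining, the compiler may conclude the check below
     * can never be true, and silently delete the error path. */
    int report_refs(const struct widget *w)
    {
        int n = widget_refs(w);
        if (w == NULL)             /* dead once w->refs has executed */
            return -1;
        return n;
    }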
Which is why I think that there should be a way to mark areas where UB could happen (Rust uses the unsafe keyword for this), and the compiler should be on the hook for defining all behaviour outside those blocks. If that's not reasonable (e.g. due to legacy code), then as a QoI issue, I'd like compilers to explicitly call out in a warning when they're reasoning backwards from "anything else would be UB".
And yes, I know this is not easy. But it'd help out the people who want to write good modern C or C++ if their compilers could be trusted to alert them when there's a risk of UB resulting in badly behaved programs.
Posted Jul 19, 2021 12:46 UTC (Mon)
by excors (subscriber, #95769)
[Link] (5 responses)
Reordering the statements means it would have to emit instructions to perform the NULL check. If the function is never called with NULL, that's an unnecessary performance cost. It's not a "might as well" situation - it's choosing to sacrifice performance of correct code, in exchange for less surprising behaviour if the code is buggy.
I think farnz's point about "generated code" is particularly relevant for C++ because many libraries (including the standard library) depend heavily on template metaprogramming, which is basically a program that runs at compile-time where the inputs are C++ types and the outputs are C++ functions, so most C++ programs will include a lot of generated code in that sense. And that relies heavily on inlining, constant propagation, dead code elimination, etc, to get good performance - the metaprogram itself is (usually) functional and very recursive and has huge amounts of dead code, but needs to end up compiling the same as a simple imperative function.
Because the library developers know they can rely on the compiler doing that optimisation, they can design the libraries to be generic and type-safe and easy-to-use, while still getting as good performance as specialised hand-tuned code (sometimes purely by relying on clever compiler optimisation, sometimes by explicitly selecting different algorithms based on properties of the type it's being specialised for). And then users of the library can write much safer programs than if they tried to implement it by hand, because the library has been carefully designed and tested and reviewed by experts for years.
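A hedged miniature of that pattern (this example is not from the comment): a generic function picks an algorithm from a compile-time property of the type, and relies on inlining and dead-code elimination to make the untaken branch free:

    #include <algorithm>
    #include <cstring>
    #include <type_traits>
    #include <vector>

    template <typename T>
    void copy_into(std::vector<T> &dst, const std::vector<T> &src)
    {
        dst.resize(src.size());
        if constexpr (std::is_trivially_copyable_v<T>) {
            // Bulk copy for trivial types; guard against the (ironic) UB of
            // calling memcpy with null pointers and a size of zero.
            if (!src.empty())
                std::memcpy(dst.data(), src.data(), src.size() * sizeof(T));
        } else {
            // Element-wise copy for types with non-trivial copy semantics.
            std::copy(src.begin(), src.end(), dst.begin());
        }
    }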
If you try to sacrifice performance for safety, then those libraries may become much slower and developers would stop using them, so they'd write bad manual replacements, and then you'd end up losing performance *and* safety.
(That doesn't apply so much to C, which doesn't have that metaprogramming facility (the closest thing is macros, which are terrible). But I guess the issue there is that few people care enough about C to write a dedicated C compiler, they just write a C++ compiler and stick a C frontend on it, and if C programmers refuse to use C++ then they'll share the costs of that C++-focused optimisation but will miss out on most of the high-level safety benefits.)
Posted Jul 19, 2021 13:22 UTC (Mon)
by mrugiero (guest, #153040)
[Link] (4 responses)
> If you try to sacrifice performance for safety, then those libraries may become much slower and developers would stop using them, so they'd write bad manual replacements, and then you'd end up losing performance *and* safety.

If the programmer put the check there, then the programmer told the compiler the pointer could be NULL, so it shouldn't assume the opposite unless it can absolutely prove the programmer was wrong.

The template metaprogramming case is special in the sense the compiler knows if for that particular invocation the argument would be NULL (although you can't pass pointers as template arguments AFAIR, if we assume that it relies on inlining then we know for sure once the check has been performed at the outermost call we can erase it from all of the recursive calls). So you don't really sacrifice that much performance.
Posted Jul 19, 2021 13:58 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (2 responses)
Trivial plain C counterexample, where the check is seen by the compiler, but the programmer did not intend to tell the compiler that the pointer could be NULL:

    /* In a header file, for config handling */
    struct configs {
        /* Lots of pointers to config members here, omitted */
    };

    #define GET_STR_CONFIG_OR_DEFAULT(config, member, default) \
        (config ? config->member ? config->member : default : default)

    #define GET_INT_CONFIG_OR_DEFAULT(config, member, default) \
        (config ? config->member : default)

    /* In a C file, using the header file */
    void do_amazing_things(void *config) {
        int rating = GET_INT_CONFIG_OR_DEFAULT(config, rating, 0);
        char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");
        /* Do the amazing things */
    }

In this case, the call to GET_INT_CONFIG_OR_DEFAULT (a macro) has promised that config is not NULL because it does an early return; a compiler that doesn't optimize this will be issuing repeated NULL checks for something that it knows must be non-NULL, or complaining because the user has an unnecessary NULL check in the expansion of GET_STR_CONFIG_OR_DEFAULT.

The code the compiler sees when the preprocessor hands it off is:

    struct configs {
        /* Lots of pointers to config members here, omitted */
    };
    /* Rest of headers etc */

    void do_amazing_things(void *config) {
        int rating = (config ? config->rating : 0);
        char * expertise = (config ? config->expertise ? config->expertise : "novice" : "novice");
    }

Which leads you into the question of provenance of your "always true" if statement; in this case, the compiler has to know that it knows config is not NULL because the programmer did something that's an explicit check, not an "if it's NULL, then we have UB" case, otherwise you get a noisy warning because the programmer included an unnecessary NULL check in the macro. But they did that because that way the macro is safe to use even if it's only used once in a function, and if the user does do:

    /* In a C file, using the header file */
    void do_amazing_things(void *config) {
        int rating = ((struct configs *) config)->rating;
        char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");
        /* Do the amazing things */
    }

then the compiler will need to warn because its "config is not NULL" assumption came from "if config is NULL, then there is UB in the execution". But first we need compiler authors to get on board with the idea that whenever UB is invoked to permit an optimization, the user should be warned (noting that this can require you to do complex tracking to get from your low-level IR (MachineInstr in the LLVM case, for example) all the way back to C, so that you can confirm that a given decision in codegen is not causing issues because it results in bad signed overflow behaviour, for example).
Posted Jul 19, 2021 19:11 UTC (Mon)
by mrugiero (guest, #153040)
[Link] (1 response)
> Which leads you into the question of provenance of your "always true" if statement; in this case, the compiler has to know that it knows config is not NULL because the programmer did something that's an explicit check, not an "if it's NULL, then we have UB" case, otherwise you get a noisy warning because the programmer included an unnecessary NULL check in the macro. But they did that because that way the macro is safe to use even if it's only used once in a function, and if the user does do:

>     /* In a C file, using the header file */
>     void do_amazing_things(void *config) {
>         int rating = ((struct configs *) config)->rating;
>         char * expertise = GET_STR_CONFIG_OR_DEFAULT(config, expertise, "novice");
>         /* Do the amazing things */
>     }

> then the compiler will need to warn because its "config is not NULL" assumption came from "if config is NULL, then there is UB in the execution". But first we need compiler authors to get on board with the idea that whenever UB is invoked to permit an optimization, the user should be warned (noting that this can require you to do complex tracking to get from your low-level IR (MachineInstr in the LLVM case, for example) all the way back to C, so that you can confirm that a given decision in codegen is not causing issues because it results in bad signed overflow behaviour, for example).

The only thing I said is always true is that you can't dereference something you can't prove not to be NULL and expect it to not be UB; the check reordering is allowed because it was UB before, and we can assume at that point what the author wanted. This doesn't mean a warning should not be emitted unless (somehow) specified otherwise. If there's a check earlier in the same function, it's trivial to prove the pointer is not NULL after that. Besides, you don't need to know whether an optimization resulted from the UB; knowing it exists is enough. And AFAICT that can be known before transforming to IR.

The compiler can prove the extra check is unnecessary there. I'm not sure it would be wrong for it to complain, but it's an entirely different case from adding a test after a dereference of something you can't prove not to be NULL.
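A small sketch of the distinction being drawn (example invented here): a later redundant test can be removed purely on the strength of an earlier explicit check, with no appeal to undefined behaviour:

    #include <stddef.h>

    struct configs { int rating; };

    int safe_rating(struct configs *config)
    {
        if (config == NULL)
            return 0;
        /* From here on, config is provably non-NULL because of the explicit
         * check above, so the redundant test below can be folded away without
         * any appeal to "dereferencing NULL would be UB". */
        return config ? config->rating : 0;
    }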
Posted Jul 20, 2021 11:54 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
This is the difference between your model, and the C model. In C, you can dereference something that you can't prove not to be NULL, and expect it to not be UB; it's only UB if, on a specific execution, the thing is NULL. If it happens not to be NULL, then there's no UB. This comes down to whole program analysis - as a programmer, you might ensure that you only call ub(config) with non-NULL pointers, and thus it's fine.
So the C compiler is able to reason backwards - if the thing *is* NULL, then there is UB. Ergo it must not be NULL, because UB isn't allowed, which permits optimization on the basis that it's not NULL. This is fine if the explicit NULL check is (e.g.) from a macro, or in generated code, or left over from refactoring and not yet removed; you've taken out something that's always false in this context, and just deleted the dead code.
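To make the backwards step concrete, a hypothetical shape for the ub() function being discussed (the body is assumed, not taken from the earlier comments):

    #include <stddef.h>

    struct configs { int rating; };

    int ub(struct configs *config)
    {
        int rating = config->rating;   /* the only non-UB executions have config != NULL */
        if (config == NULL)            /* ...so this test is dead code */
            rating = 0;                /* ...and the fallback can be deleted */
        return rating;
    }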
It only becomes a problem where the human is surprised by the deletion of dead code - i.e. where the human thought the check was still important, but it's actually considered always false (or always true) by the compiler. And as deletion of dead code takes place at every level - from AST, through IR, through machine instruction choice - the compiler needs to link back its elimination of dead code and determine whether or not the human would be surprised by this particular elimination.
And that's what makes it a hard problem - we may know that ub(config) is only ever called with a non-NULL pointer, but that needs a whole program analysis which the compiler cannot perform. We may also know this simply because we've refactored so that callers of ub(config) do the NULL check themselves, and the extra NULL check inside the function may be there because a preprocessor (which can be outside the compiler!) has used a general form of code that works regardless of the presence of the NULL checks; why use two macros and force the human to think about whether or not a NULL check has happened already, when you can use one?
Posted Jul 19, 2021 14:56 UTC (Mon)
by excors (subscriber, #95769)
[Link]
> The template metaprogramming case is special in the sense the compiler knows if for that particular invocation the argument would be NULL (although you can't pass pointers as template arguments AFAIR, if we assume that it relies on inlining then we know for sure once the check has been performed at the outermost call we can erase it from all of the recursive calls)

But the compiler may not know that. E.g. you could have code like:

    #include <memory>
    #include <vector>

    struct S {
        int n;
        // ...
    };

    template <typename T, typename Pred>
    void reset_if(T &items, Pred pred) {
        for (auto &item : items)
            if (pred(item))
                item = nullptr;
    }

    // Precondition: all items are not nullptr
    // Postcondition: all items are even-numbered or are nullptr
    void remove_odd(std::vector<std::unique_ptr<S>> &items)
    {
        reset_if(items, [](auto &item) { return item->n % 2 != 0; });
    }

(i.e. there's a dynamically-allocated array of pointers to S, and the function deallocates all the Ss that match some condition. This is somewhat artificial, but I don't think it's totally implausible as a component of some larger operation.)

The function's precondition is guaranteed by the programmer but is not visible to the compiler. (Maybe the caller is in a separate translation unit, or a dynamically-loaded library.)

On "item = nullptr", the unique_ptr will deallocate its original contents (to prevent memory leaks). Because unique_ptr is generic and supports custom deallocator functions, and the custom deallocator might not accept nullptr as an argument, the unique_ptr destructor is specified to do something like "if (get() != nullptr) get_deleter()(get());".

In this function, the unique_ptr destructor is only called after the "item->n % 2" check, which is dereferencing the pointer. Then the compiler knows the pointer can't be nullptr, so it can omit the "if (get() != nullptr)" test and save some time and code size. (In this case it's only a couple of instructions, but it's not hard to imagine similar cases where the 'if' is followed by an 'else' with a much larger chunk of code, and eliminating the whole 'else' clause could make a big difference.)

The null check wasn't intentionally added to this code by any programmer, it's just a part of the (relatively) safe abstraction provided by unique_ptr, so it's included in the code that's generated by the template instantiation. In this function the code is used in a way where the check is redundant, and the compiler is able to figure that out because it sees the programmer was deliberately dereferencing the pointer, so the abstraction has zero cost and the programmer doesn't have to revert to a less-safe kind of pointer to get the same performance.