IFNDR
IFNDR
Posted Aug 13, 2024 15:24 UTC (Tue) by kid_meier (subscriber, #93987)
In reply to: IFNDR by tialaramex
Parent article: Rust Project goals for 2024
I can't remember who posted it, but some time ago the paper "What every compiler writer should know about programmers" [1] was linked here in a comment thread about UB. It is fairly old now, and I am not expert enough to know whether compiler optimization has advanced so far that the gains at this point are significantly greater than the state of the art at the time it was written, but I certainly have my doubts about it.
I just wanted to re-raise this side of the UB debate and push back against the fatalism that keeps being brought up here, namely that there are no options for handling UB other than the way clang/LLVM and gcc have chosen to handle it, i.e. by exploiting it as a logical impossibility and optimizing from there. Again, that is in no way inevitable. The C/C++ standards (AFAIK) make no such claim; it's entirely a result of the (compiler) industry's interpretation of the relevant standards, and it seems likely we'd be better off with a softer stance.
I get it, I'm not suggesting security problems would go away if this maximalist interpretation of UB were not the default, but certainly there are a number of well publicised vulnerabilities that wouldn't have occurred if it weren't the industry norm. The horse has left the barn, so indeed this is somewhat pointless -- the mitigations for this maximalist interpretation are well known, etc.
But it could have been another way, and other languages/compilers certainly could strike a different balance.
IFNDR
Posted Aug 13, 2024 16:23 UTC (Tue) by farnz (subscriber, #17727)
That paper is badly written - it makes two false assumptions (that optimizations rely on UB, and that they are working with something that closely resembles C source code), and then stacks up a bunch of strawmen on top of that.
In practice, any modern compiler works by translating from C to some IR such that the semantics of the IR are the same, for all defined behaviour in the C source, as the semantics of the original C source. Optimizations then transform the IR in a way that maintains the IR's semantics, but which is either more optimal itself, or sets up the preconditions for a later optimization to transform the IR in a useful way; in the latter case, there's usually also a much later optimization pass that undoes the optimization if it didn't do something useful (e.g. LLVM has a pass that transforms loops into LCSSA form, various passes that can only fire if the loop is in LCSSA form, and an instcombine pass that removes the "surplus" IR involved in LCSSA form).
As a consequence of this, no optimization actually depends on UB; rather, the optimizations depend on the IR correctly capturing the meaning of the input code, so that they don't optimize in the wrong direction; you can use godbolt.org to examine the optimization passes that LLVM applies to your C or C++ code, and it'll show you the diff to the LLVM IR each optimization produces.
In that godbolt link, I've set the filters to "hide inconsequential passes" (those which don't change the IR for whatever reason), and changed the options to show the full module at a time, so that you can see how LLVM goes from the unoptimized input to the optimized machine code. Every one of those passes in green is doing something to the LLVM IR, and ignoring the C input; if the C to LLVM IR translation is wrong (e.g. because it gives the "wrong" defined behaviour to something that's UB), then the output will be wrong, too.
The fix is for compilers to define more UB in interesting and useful fashion; to choose everyone's favourite whipping-boy, C says that arithmetic on signed numbers cannot overflow, which means that when Clang lowers C signed arithmetic to LLVM IR, it puts the "nsw" (no signed wrap) modifier on the arithmetic operation. If Clang redefined overflow as "either calls abort() or wraps in two's complement" (which ISO is perfectly happy for Clang to do), then Clang would have to remove the "nsw" modifier from the LLVM arithmetic operations that it emits.
That then defines the arithmetic as two's complement wrapping, and would ensure that LLVM optimizations that depend on the arithmetic not wrapping do not fire unless the compiler already knows via some other path that the arithmetic cannot wrap (e.g. because you're doing a / 10 + b / 10 + c / 10, which cannot wrap).
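As a concrete illustration of the "nsw" point (an example of mine, not taken from the comment above): with the default undefined-overflow rules a compiler may fold the comparison below to a constant, whereas with a wrapping definition of signed overflow (for instance GCC's and Clang's -fwrapv) it has to keep the real comparison.

    /* Illustrative only: whether a given compiler folds this depends on its
     * version and flags.  With signed overflow undefined (the default), the
     * comparison may be folded to "return 1"; with -fwrapv, x == INT_MAX makes
     * x + 1 wrap to INT_MIN and the function must return 0 for that input. */
    int always_greater(int x)
    {
        return x + 1 > x;
    }

Roughly, a default Clang build emits this addition as "add nsw" in LLVM IR, which is what licenses the fold; a -fwrapv build emits a plain "add", so the comparison has to survive.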
IFNDR
Posted Aug 13, 2024 17:06 UTC (Tue) by mussell (subscriber, #170320)
To add onto "not all optimizations rely on UB", both GCC and Clang will kill writes to non-volatile memory locations if there are no subsequent reads, as this is allowed by the as-if rule (ISO C section 5.1.2.3). Depending on how the compiler orders its passes, this can lead to interesting behaviour. Just as an example, here is a C program that compiles to an infinite loop in GCC but terminates immediately in Clang, as GCC propagates constants before killing empty loops whereas Clang does the opposite. Not only is this not UB, it is a completely acceptable optimization by ISO C, as section 6.8.5.2 says:
An iteration statement may be assumed by the implementation to terminate if its controlling expression is not a constant expression, and none of the following operations are performed in its body, controlling expression or (in the case of a for statement) its expression-3:
— input/output operations
— accessing a volatile object
— synchronization or atomic operations.
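The program being described above is linked rather than quoted in the original comment, so the following is only a hedged sketch of the kind of code involved; the names are mine, and whether a particular GCC or Clang release really produces an infinite loop or an immediate return depends on that release and on how it orders dead-store elimination, constant propagation and loop deletion.

    /* Hypothetical sketch, not the linked example: the loop body only stores
     * to a non-volatile object that is never read again, so the store can be
     * deleted under the as-if rule.  As written, the controlling expression is
     * not a constant expression, so 6.8.5.2 lets the implementation assume the
     * loop terminates; once keep_going has been constant-propagated, though,
     * the condition is always true.  Which transformation runs first can
     * determine whether the build spins forever or returns immediately. */
    int main(void)
    {
        int keep_going = 1;   /* effectively a constant */
        int counter = 0;      /* only ever written, never read */

        while (keep_going)
            counter = 42;     /* dead store: there is no subsequent read */

        return 0;
    }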
IFNDR
Posted Aug 13, 2024 17:03 UTC (Tue) by mb (subscriber, #50428)
> the way clang/LLVM and gcc have chosen to handle
That's a wrong assumption already.
Compilers do not "handle" UB. They do not "see UB" and then cause havoc just because they want to. They do the *opposite* thing of that: they assume that UB is not present in the program, and they do *not* see UB in the source code and react to it.
There are plenty of languages available with sane UB or without UB. Just use them.
C/C++ are obsolete.
IFNDR
Posted Aug 13, 2024 17:14 UTC (Tue) by Wol (subscriber, #4433)
Unfortunately, at least in part, this seems driven by an obsession with EXECUTABLE speed, not development speed.
What's the saying, premature optimisation is the root of all evil? It probably makes sense from their point of view as compiler writers: they want fast runtimes. But it's a nightmare for the developer who makes a trivial change and then waits all day for the program to be rebuilt ...
It's like the way relational databases target GUARANTEED response times at the cost of making ALL response times much slower - "you're guaranteed your data within 10 cache misses" - as opposed to MV, where you're 95% guaranteed your data with one cache miss (and 99% with two). Relational says "you have to suffer a slow database"; MV says "you have to risk an unlikely pathological case".
But the poor end user is rarely allowed to pick the most appropriate option - "tech knows best" and they suffer what they're given.
"Premature optimisation" as I know all too well at work ... or rather, no attempt at optimisation at all because "the database won't let us do it" or "we can't change current practice", or excuse excuse excuse ... nobody can look at the big picture and say "hey you're ruining it for everyone else!".
Cheers,
Wol