Development quote of the week

Posted Dec 5, 2022 15:31 UTC (Mon) by Wol (subscriber, #4433)
In reply to: Development quote of the week by khim
Parent article: Development quote of the week

> Do you want to see 100 lines of warnings for every line of source code? Any source code, not just code with manifested UB? Because that's the only thing that compiler can easily offer. It can only list assumptions that very uses and most of them are trivial: it assumed that variables were initialized, that memory was allocated, that pointers are not dangning and that 200 other UBs haven't happened, too.

THIS THIS THIS.

ALL the UBs you have listed here are BUGS. They are all "Don't Do It".

But things like signed integer arithmetic? That's a SchrodinUB - you don't know if it's UB or not until you do do it. And if you look at the laws of Pure Mathematics, it's not UB *at*all*.

There's a reason us grey beards get all pissed off at this. We understand logic. We can do Maths. You said your typical programmer of today loves rules. "Monkey See Monkey Do" rules. Like "Don't do signed integer arithmetic!". WHAT!?!? Why on earth does C even HAVE signed integer arithmetic if you're not supposed to use it!? Why doesn't the compiler just optimise it ALL away if it's guaranteed to screw up at the most inopportune moments?

When I design software, I do NOT set out to solve the specific problem in front of me. At work at the moment I am trying to unravel a horrendous mess where everybody has solved their own problem, and we have loads of programs all interacting in weird and unexpected ways. (Compounded by people not using common sense and blaming me for "the output looks wrong", so when I investigate my reaction is "what is this garbage you are putting in?")

When I program I always - even if not consciously - do a truth table. What are the possible inputs? What are the possible outputs? Even if my problem is a small subset of the output, I DON'T close off the input space, I just mark it "here be dragons" (or, usually, more work for later), and trap or block it so I can deal with it later.

The majority of your examples have a simple, single value truth table. It's called "Garbage In, Garbage Out". I have absolutely no problem with that being UB.

But signed integer arithmetic? It appears to collapse to a simple two-way table - "Valid In, Valid Out", or "Valid In, Garbage Out". To treat that the same as a simple "Garbage In, Garbage Out" is crazy. (I've ignored the other two states, because "Garbage In" is indistinguishable from "Valid In".) A scenario where all possible inputs are valid, is not the same at all as a scenario where all possible inputs are invalid.

Cheers,
Wol

Development quote of the week

Posted Dec 5, 2022 17:03 UTC (Mon) by khim (subscriber, #9252) [Link] (10 responses)

> And if you look at the laws of Pure Mathematics, it's not UB *at*all*.

But laws of Pure Mathematics don't apply to computers unless you make them. C language does pretty natural assumption is that you would do so and would avoid computations which may overflow.

> Like "Don't do signed integer arithmetic!"

No, that's not the rule. The rule is “do the bounds checking, damn it”. Yes, this includes signed integer arithmetic, too.

> Why doesn't the compiler just optimise it ALL away if it's guaranteed to screw up at the most inopportune moments?

Because that would be against the rules, obviously.

> Even if my problem is a small subset of the output, I DON'T close off the input space, I just mark it "here be dragons" (or, usually, more work for later), and trap or block it so I can deal with it later.

For that to work you need some kind of isolation between hardware and abstract machine that language uses.

When you try to do that layer thin enough (like C does) you invariably end up with “bad code which may work for some time and then blow up in your place” (like using already freed object which may work for some time till, one unlucky day, interrupt would come at different time and that object would be overwritten). And then you have to describe that subset and avoid it.

It seems that you understand that. But somehow you refuse to accept that bounds of that subset are, to some degree, arbitrary and there needs to be common description of where they are!

That common description is called C standard. If you think that some things don't belong to that subset — use the switch and make them defined to you, personally (compilers already support that, clang and gcc provide plenty of such switches… there are many such controversial UBs and thus many such switches) or change the standard (yes, it's harder than swearing on various forums but have at least theoretical chance to work).

> It appears to collapse to a simple two-way table - "Valid In, Valid Out", or "Valid In, Garbage Out".

C doesn't support such things. It's either “Valid In, Valid Out” (like with unsigned numbers) or “Garbage In, Garbage Out” (like with signed ones). “Valid In, Garbage Out” is not in the cards, sorry.

> A scenario where all possible inputs are valid, is not the same at all as a scenario where all possible inputs are invalid.

Yes. And it's one of the reason why C standard classifies all inputs either as “Valid In, Valid Out” (fully-specified things, unspecified things and implementation-specified things are all in that class) or “Garbage In, Garbage Out”. Case of “Valid In, Garbage Out” shouldn't happen in a valid C program.

Even with unsigned numbers division by zero is not allowed (it's UB) because that would be “Valid In, Garbage Out” case. You have to ensure that divisor is non-zero. Only people rarely object against that, for some reason. Maybe because it's defined as “Garbage In, Garbage Out” in math, not just in C standard.

The fact that you don't ever need to support that “Valid In, Garbage Out” is something they really wanted to have because it simplifies reasoning for the compilers and wasn't supposed to be a big deal for humans.

That was, probably, miscalculation ANSI committee did 30 years ago. But I don't think it's easy to change: neither flags for clang/gcc nor Rust do that. Instead of declaring integer overflow UB Rust makes it IB and carefully describes all possible results in it's reference and the same with clang/gcc switches.

The rule there are no such thing as “Valid In, Garbage Out”, there are only “Valid In, Valid Out” and “Garbage In, Garbage Out” stays unchallenged.

Development quote of the week

Posted Dec 6, 2022 10:10 UTC (Tue) by farnz (subscriber, #17727) [Link] (9 responses)

The laws of Pure Mathematics do apply to computers, very much so. C is, for example, doing its operations on a finite set with abelian groups for multiplication and addition, and not on a field (since some operations such as multiplication and addition are not defined for all pairs of inputs from the set).

The fun comes in when people assume (falsely) that + and * in C are the field operators for school arithmetic. They're not - they're still set operations, and they still form an abelian group, but they're not field operators, and they don't behave like field operators in all cases.

Development quote of the week

Posted Dec 6, 2022 14:03 UTC (Tue) by khim (subscriber, #9252) [Link] (8 responses)

Only unsigned types are defined to work like you describe. Signed types work differently.

Development quote of the week

Posted Dec 6, 2022 14:11 UTC (Tue) by farnz (subscriber, #17727) [Link] (7 responses)

I'm sorry, you have me completely confused. I described a system where addition and multiplication are only defined for a subset of pairs of inputs, which is pretty much signed types in C - if I have a 32 bit integer type, then 2**17 * 2 ** 17 is UB in C and in Pure Mathematics for a type that's not a field, but is an abelian group, whereas 4 * 3 is defined, and is the same as 3 * 4 (the property that makes this an abelian group).

Unsigned types are fields, I thought? The gap between what I described and a field is that in a field, all input values for addition and multiplication result in an element of the set, whereas in what I described, some input values for addition and multiplication result in undefined behaviour.

Development quote of the week

Posted Dec 6, 2022 14:49 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> I described a system where addition and multiplication are only defined for a subset of pairs of inputs

No. I haven't read that in your description and, worse, group is defined like this:

A group is a set 𝐆 together with a binary operation on 𝐆, here denoted "·", that combines any two elements 𝓪 and 𝓫 to form an element of 𝐆, denoted 𝓪·𝓫, such that the following three requirements, known as group axioms, are satisfied: associativity, identity element, inverse element.

You have started talking about group axioms without checking whether result is defined for all elements of supposed group. Unsigned numbers are a group. Signed numbers are not group.

> Unsigned types are fields, I thought?

No, they don't form a field. uint32_t(65536) doesn't have an inverse element (that is: you can not find any 𝔁 to make uint32_t(𝔁)*uint32_t(65536)=uint32_t(1)), it's a ring. Also a well defined mathematical object, but not a field.

> The gap between what I described and a field is that in a field, all input values for addition and multiplication result in an element of the set, whereas in what I described, some input values for addition and multiplication result in undefined behaviour.

Semigroup, group, abelian group, ring… they all call for operations to be defined for all elements.

But field definition includes UB: reciprocal not defined for zero (and only for zero, that's why unsigned numbers in C are not a field).

That, somehow, never changes the opitions of O_PONIES lovers: they, usually, continue to support the notion that it doesn't mean anything (or, worse, start claiming that the fact that it's UB to divide by zero even for unsigned numbers is also a problem… even if rarely bites anyone because most people test for zero, but not for overflows).

> whereas in what I described, some input values for addition and multiplication result in undefined behaviour

If something doesn't follow the definition then you can't say that it's “𝓧, but not 𝓧”. Well… technically you can say that, but it's a tiny bit pointless: math doesn't work like that, you couldn't use theorems proven for 𝓧 with “𝓧, but not 𝓧”.

You may say that while unsigned numbers properly model ℤ₂ᴺ ring, while signed numbers correctly models ℤ if (and only if) you ensure that operations on them don't overflow, but they don't form semigroup, group, abelian group, or ring.

I guess the idea was that people may need ℤ₂ᴺ, but since ℤ is not feasible to provide in low-level lamguage like C (even python have trouble with ℤ) thus asking developer to use them carefully and avoid them when result is not guaranteed wouldn't too problematic. After all they have to remember not to divide by zero in math, why couldn't be they taught not to overflow in C?

As you can see some people don't like that idea (to put it midly).

Development quote of the week

Posted Dec 6, 2022 14:56 UTC (Tue) by farnz (subscriber, #17727) [Link]

I made mistakes in the terminology (I'd forgotten the details of groups) - but I stand by my claim that it's mathematically reasonable to have a finite set, F, with binary operations + and * defined only for a subset of pairs of elements of F, and that maths does not always claim that because I have a finite set F, with a binary operation +, it must follow all the axioms of arithmetic.

Rather, C's signed integers are a finite set that behaves like the integers in some respects but not others, and results that hold for mathematical integers do not necessarily hold for C signed integers, but that this doesn't mean that C signed integers are not a mathematically acceptable set - so reaching for mathematics and saying "but maths says!" just highlights that you're not particularly good at mathematics, and expect C's signed integers to behave like a set you're familiar with from school, rather than like the sort of sets that you might start discussing in a bachelor's degree.

Development quote of the week

Posted Dec 6, 2022 15:54 UTC (Tue) by anselm (subscriber, #2796) [Link] (1 responses)

The gap between what I described and a field is that in a field, all input values for addition and multiplication result in an element of the set, whereas in what I described, some input values for addition and multiplication result in undefined behaviour.

ISTR that even in a group, the result of the operation on two members of the group must always be a member of the group. “Undefined behaviour” where applying the operation on two members of the group doesn't yield a result within the group isn't allowed. Nor does it work to exclude, e.g., 2**17 from the set on the grounds that 2**17 * 2**17 isn't in the set; you need it in the set so that (non-overflowing) operations like 2 * 2 ** 16 can have a result. (Fields get that property from the fact that they're basically built on groups.)

Development quote of the week

Posted Dec 6, 2022 16:37 UTC (Tue) by farnz (subscriber, #17727) [Link]

Yes - that's my error. I stand by the claim that C signed integers are a perfectly reasonable mathematical object, and that their behaviour is happily described by pure mathematics.

But I have to retract my claim that they're a group - they're still a finite set, with several operations that give them superficial similarity to the set of integers, but they're not a group.

Development quote of the week

Posted Dec 6, 2022 21:09 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Unsigned types are fields, I thought?

They are rings, not fields (no multiplicative inverse, since we're not looking at floats).

Signed integers are not even rings, because at least one element doesn't have an inverse (e.g. there's -32768 but not 32768).

Development quote of the week

Posted Dec 6, 2022 22:33 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

That's not how rings behave. -32768 is inverse of itself. In fact it's exact same ring which you have with unsigned numbers (most CPUs these days don't even bother to give you separate operations for signed or unsigned numbers, they only distinguish signed and unsigned numbers when you compare them, but ring only deals with addition, multiplication, and, like any ring, equality… ordering is not part of it).

That maybe one reason for why C standard defines signed operations like it does: if you want ring — you already have one, unsigned numbers, why would you need another, identical one?

Development quote of the week

Posted Dec 7, 2022 17:31 UTC (Wed) by kleptog (subscriber, #1183) [Link]

Unsigned integers in C form a well defined ring: Z_2^16 or whatever your bit size is, the integers modulo a power of two. This gives you commutative, associative and distributive properties. The operations don't preserve ordering though.

Signed integers on computers don't form a ring, not in a useful sense anyway. For example 10*19=-66 on an 8-bit machine. You can argue that if you look at the bit patterns it's the same as unsigned integers but that's just relabelling. What I think the C designers wanted was to consider the signed integers a subset of the natural numbers. Because that gives you a meaningful way to deal with ordered comparisons.

For example, the implication a<b => 2a<2b is *not true* in the unsigned integers. This is however super annoying for a compiler optimiser. The standard compiler transformation in C where you change loop counters to count by the width of the object you're looping over is strictly not allowed unless you can prove the loop bound won't overflow. The compiler often can't prove it, so that would leave a simple optimisation on the floor.

Unless you assume overflow is UB.

"a+b > a if b is positive" is false in the unsigned integers. But a very useful thing to be able to assume when optimising code. The classic example of code that changes with -fwrapv:

int f(int i) {
return i+1 > i;
}

Is optimised to true without it.

Pointer arithmetic is another case, you don't want to treat this as a full on ring either, because pointer comparison is so useful (can we assume a[b] is always later in memory than a[c] if b>c?). So the compiler simply assumes overflow cannot happen.

I don't think the compiler writers are out to get you. But if you want signed integers to act 100% like CPU registers, you do need to set the compiler flag.