Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 8:40 UTC (Sun) by epa (subscriber, #39769)
In reply to: Better handling of integer wraparound in the kernel by marcH
Parent article: Better handling of integer wraparound in the kernel

That’s an interesting article. It points out that if integer overflow were defined to wrap around, then a simple for-loop could become an infinite loop if the upper bound is INT_MAX. Declaring that overflow “cannot” happen lets the compiler optimize better, knowing the loop must terminate.

And this finally explains why common C programming style is to use signed integer types, even for quantities that will never be negative in practice. Naively it makes sense to use an unsigned int both as the loop counter and as the upper bound. But if you do so, the counter is allowed to wrap around to zero, so the loop exit condition would never be hit if the upper bound happens to be the largest possible unsigned value. And the compiler has to allow for that.
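
To make that concrete (my illustration, not part of the original comment; demo() and the volatile sink are just scaffolding):

/* Hypothetical demo: the loop bodies merely observe the counter. */
void demo(int n, unsigned int m)
{
    volatile unsigned int sink;

    /* Signed counter: if n == INT_MAX, i++ would eventually overflow,
     * which is undefined behaviour, so the compiler may assume this
     * loop terminates. */
    for (int i = 0; i <= n; i++)
        sink = (unsigned int)i;

    /* Unsigned counter: if m == UINT_MAX, u <= m holds for every value
     * of u, and u++ wrapping to 0 is well defined, so this loop really
     * never ends - the compiler has to allow for that. */
    for (unsigned int u = 0; u <= m; u++)
        sink = u;
}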

(I’m sure it’s not just for-loops and there are other important optimizations, just that this is one of the easier cases to explain.)

But these arcane optimization side effects have very little to do with the choice between a signed quantity and an unsigned quantity. That ought to be orthogonal: so you’d have signed integers with defined wraparound semantics, signed with “undefined” overflow, and similarly two kinds of unsigned integer.



Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 10:46 UTC (Sun) by khim (subscriber, #9252)

> And this finally explains why common C programming style is to use signed integer types, even for quantities that will never be negative in practice.

Nah, that's simply because C grew out of B, and the latter only had one data type: the word.

When B evolved into C, loops started using int for indexes, and then CPUs (sic!) grew various hacks to make these efficient (AMD and Intel have macro-op fusion, RISC-V has special instructions to cope, etc.).

> I’m sure it’s not just for-loops and there are other important optimizations, just that this is one of the easier cases to explain.

Nope. Most of the time there is no difference, but without various hacks at all levels, size_t would have been faster, of course.

On base RV64G, size_t indexes are faster, but who uses that?

Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 11:32 UTC (Sun) by excors (subscriber, #95769)

> And this finally explains why common C programming style is to use signed integer types, even for quantities that will never be negative in practice.

I suspect that's almost never the real motivation. Early versions of C didn't even have unsigned types - the only types were int and char, and later float and double, so the only one suitable for a loop counter was int. The `int i; for (i=0; i<n; i++) ...` idiom appears to have already been established back then (https://www.bell-labs.com/usr/dmr/www/cman.pdf), and was widely used in the K&R book (after unsigned had been added to the language). K&R seems to use int as the default for pretty much everything, and only uses unsigned where it really needs all 16/32 bits (e.g. hash values) - and you'd almost never need all those bits for a loop counter because you wouldn't have enough RAM for such a huge array.

Also, `int` is much shorter and easier to type than `unsigned`. So I think people have continued using int for 50+ years because it's idiomatic and easy and there's been little motivation to change, not because they really care about modern compiler micro-optimisations.

Meanwhile in C++ it's idiomatic to use `for (size_t i = 0; ...)` (which is unsigned). As far as I can tell, that's largely because of STL: its designer preferred size_t as it means the library doesn't have to worry about being passed negative values for array sizes etc (C++ is much more interested in type-system-enforced correctness than C is), and it can always represent the largest possible object in memory (which by that time was approaching the 32-bit limit). STL was widely used, so everyone else used size_t for consistency, and to avoid implicit type conversions that could trigger overflows or compiler warnings.

In C++ it's also idiomatic to use `++i` instead of `i++` in loops, because `i` might be an STL iterator that has to do more work to return a copy of the old value in post-increment (at least in old compilers that optimise poorly), and there's no downside to preferring pre-increment, so pre-increment was widely recommended for performance. But I've never heard anybody make a general recommendation of using ssize_t instead of size_t to let the compiler optimise non-overflowing bounds checks - nobody cares about performance _that_ much, except when they've got a specific benchmark to profile and optimise.

Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 12:10 UTC (Sun) by wtarreau (subscriber, #51152)

> Also, `int` is much shorter and easier to type than `unsigned`. So I think people have continued using int for 50+ years because it's idiomatic and easy and there's been little motivation to change, not because they really care about modern compiler micro-optimisations.

Absolutely! That's been the main reason I've continued to use it so much, and for this precise reason I added "uint", "uchar", "ulong", "ullong", etc. to my programs to reduce the cost of switching to unsigned. And it works.
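
Presumably something like this (my reconstruction; the actual definitions aren't shown here):

typedef unsigned char      uchar;
typedef unsigned int       uint;
typedef unsigned long      ulong;
typedef unsigned long long ullong;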

That's exactly why anything based on annotations or complex type definitions has no chance of working: it's too much pain for the developer and makes the code totally unreadable. I have already *introduced* bugs into my code because of GCC warnings that I was trying to silence in good faith, ending up with a much more complex construct that had bugs. Data-related bugs are one thing, but when it becomes so complicated to tell the compiler there's no issue that you end up with control bugs, that's another story, and a much more serious one.

I think that some tooling is needed, and the kernel provides a lot already. For example, min_t(), max_t() are convenient to use and perform lots of checks. Anything that saves the developer from introducing casts is good, and overflow checks are good, if they're easy to use and type (i.e. not __builtin_add_overflow_foo_bar()).
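
For example, GCC's existing builtin is already short enough to use directly, and the kernel wraps the same builtins as check_add_overflow() and friends in <linux/overflow.h>. A minimal sketch (add_or_saturate() is my hypothetical helper):

#include <limits.h>

/* __builtin_add_overflow() stores the (possibly wrapped) sum through its
 * third argument and returns true if the mathematical result did not fit. */
static unsigned int add_or_saturate(unsigned int a, unsigned int b)
{
    unsigned int sum;

    if (__builtin_add_overflow(a, b, &sum))
        return UINT_MAX; /* overflowed: saturate rather than wrap */
    return sum;
}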

Readability is super critical to produce safe code and to spot bugs in others' code during reviews. Long lines and ultra-long identifiers severely affect this and can easily result in more bugs.

Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 21:14 UTC (Sun) by willy (subscriber, #9762)

> It points out that if integer overflow were defined to wrap around then a simple for-loop could be an infinite loop, if the upper bound is INT_MAX

A fun interview question is to ask the candidate to write a program which prints out every number between 0 and UINT_MAX inclusive, using only 32 bit arithmetic.

The object, of course, is not to see whether they can, but what mistakes they make along the way, and how they handle Socratic questions about the flaws. Some candidates are quite overconfident that they know what they're doing.

I've seen some very innovative correct answers over the years too. And that's also cool because you can discuss the pitfalls and advantages of their approach over a more conventional approach.

Better handling of integer wraparound in the kernel

Posted Jan 28, 2024 22:23 UTC (Sun) by Wol (subscriber, #4433)

> A fun interview question is to ask the candidate to write a program which prints out every number between 0 and UINT_MAX inclusive, using only 32 bit arithmetic.

There's all sorts of questions you can ask like that :-)

> The object, of course, is not to see whether they can, but what mistakes they make along the way, and how they handle Socratic questions about the flaws. Some candidates are quite overconfident that they know what they're doing.

When I've done that, it's striking how many people clearly didn't hear when you said you wanted to see how they handle it, and that you don't expect the answer to be correct (or even compile!).

(The problem with that is managers. I've been in interviews where they assumed that because I knew dialect A, I could write a working program in dialect B, having met it for the first time in the interview!)

Cheers,
Wol

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 0:54 UTC (Mon) by tialaramex (subscriber, #21167)

There will be infinitely many such numbers: since 0 and UINT_MAX are definitionally different numbers, we can always find a new, different number between them, and so on forever. Almost all of these infinitely many numbers are also non-computable, which isn't good news.

Perhaps you meant specifically integers? Computers much prefer integers, but now it's a boring problem. These days I mostly write Rust, so I'd want to write:

let mut number: u32 = 0;
loop {
    println!("{number}");
    if number == u32::MAX {
        break;
    }
    number += 1;
}

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 8:43 UTC (Mon) by gspr (subscriber, #91542)

> There will be infinitely many such numbers since 0 and UINT_MAX are definitionally different numbers, we can definitely always find a new different number between them, and so on forever.

As a mathematician, I'd find this response silly. If you want to be pedantic: "number" does not imply "real number" any more than it implies "integer". In fact, I'd argue that without any further context, the fact that the name UINT_MAX seems to carry the implication of being an integer is enough for a reasonable person to at least assume that the question is about integers.

Your answer would come off to me as quite obtuse. Sure, nothing wrong with clarifying along the lines of "you mean integers, right?" But assuming reals and stating that the problem cannot be solved is just silly. And problematic because you, too, made a whole bunch of assumptions about the (somewhat imprecisely specified) task.

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 19:11 UTC (Mon) by tialaramex (subscriber, #21167)

That's a fair point. Unlike INT_MAX (whose definition is just a number), UINT_MAX is typically defined as an actual unsigned integer, using the suffix U for parsing reasons; thus (on a platform where that's the case) it really is an unsigned integer like Rust's u32::MAX, not just a number.
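
For illustration, a typical <limits.h> spells it like this (the exact definition varies by implementation):

#define UINT_MAX 4294967295U /* the U suffix gives the constant type unsigned int */

Without the suffix, 4294967295 would not fit in a 32-bit int and would be parsed as a signed long or long long, which is exactly the parsing problem the U avoids.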

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 2:35 UTC (Mon) by NYKevin (subscriber, #129325)

Do you allow solutions that cheat?

#include <stdio.h>

int main(void) {
    puts("0");
    for (unsigned int i = 1; i != 0; i++) /* terminates when i wraps back to 0 */
        printf("%u\n", i);
}

No arithmetic other than 32-bit is done here*, but I'm not sure this is the answer you had in mind.

* Assuming sizeof(unsigned int) == 4 as on the vast majority of real systems. You can write uint32_t, but I would have to look up how to pass that to printf portably. It's probably some ugly preprocessor macro.

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 2:53 UTC (Mon) by willy (subscriber, #9762)

I accept any answer because it's just a starting point for the discussion that lets me get to know you as a programmer. I'd ask you how you'd go about testing it, for example. Or if you considered other algorithms and why you chose this one over any other. Or I'd write out one that was wrong and ask what feedback you'd give in code review.

I'm not looking for bug-free code in a whiteboard interview. I'm looking for a programmer who I can work with, and that means giving and receiving feedback.

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 20:24 UTC (Mon) by jwilk (subscriber, #63328)

> You can write uint32_t, but I would have to look up how to pass that to printf portably. It's probably some ugly preprocessor macro.

printf("%" PRIu32 "\n", i); // macro defined in <inttypes.h>

Better handling of integer wraparound in the kernel

Posted Jan 29, 2024 3:47 UTC (Mon) by wtarreau (subscriber, #51152)

That's a good idea. I, too, like to make candidates discuss pitfalls. I don't care if they can't do something; they'll have plenty of time to find a solution. What matters is how they search and what they care about.

BTW, regarding your challenge, for me that's a typical use case for do ... while, since you can only test for the end after the operation, e.g. i=0; do { printf("%u\n", i); } while (++i);
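
Spelled out (the same code reformatted, with my note on why it terminates):

unsigned int i = 0;
do {
    printf("%u\n", i);
} while (++i); /* after UINT_MAX is printed, ++i wraps to 0, which is false, so the loop ends */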

Better handling of integer wraparound in the kernel

Posted Feb 2, 2024 17:48 UTC (Fri) by anton (subscriber, #25547)

The stuff about this particular for loop is symptomatic of this blog series. It gives a (non-idiomatic) example fragment and claims that allowing the compiler to assume that this loop is not infinite "allows a broad range of loop optimizations to kick in", but does not say what they are, nor how little benefit they provide. So I filled in some missing parts and tried out the result on Godbolt, and I found that gcc-13.2 -O3 delivers the same result with and without -fwrapv.

Wang et al. found that in only one loop in one of the SPEC Cint 2006 benchmarks there was a measurable difference between undefined and wraparound behaviour, due to sign-extending the int in the wrapping case (condensed Godbolt example). You can avoid this problem (on AMD64) by using size_t (as suggested by Wang et al.) or any other unsigned type for the loop index (but RV64G may need zero-extensions if the index type is smaller than an address), or by using an address-sized type (e.g., long, unsigned long on Linux targets).
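
For illustration, the effect shows up in loops of roughly this shape (my sketch, assuming x86-64, where int is 32 bits but addresses are 64 bits; sum_stride() is hypothetical, not Wang et al.'s actual code):

/* With -fwrapv, i += stride may legitimately wrap around 32 bits, so the
 * compiler must keep i as a true 32-bit value and re-sign-extend it to 64
 * bits for the address computation a + i on every iteration. With signed
 * overflow undefined, it can instead widen i to a 64-bit induction
 * variable once, before the loop. */
long sum_stride(const long *a, int n, int stride)
{
    long s = 0;
    for (int i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

Declaring i as size_t (or another address-sized type) makes the whole question moot, which is the suggestion above.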

