Shrinking the kernel with link-time optimization

Posted Jan 26, 2018 22:46 UTC (Fri) by mirabilos (subscriber, #84359)
In reply to: Shrinking the kernel with link-time optimization by giraffedata
Parent article: Shrinking the kernel with link-time optimization

The problem isn’t even about fixing vs. not fixing; the problem is that
GCC developers seem to be uninterested in LTO bugs, and for its predecessor
(-fwhole-program --combine) they outright said it won’t get fixed.

I actually find the idea of using LTO to eliminate dead code, in the
Linux kernel or elsewhere, great — I just wanted to point out that GCC
might, with its current bugs, history of bugs, and history of attitude
towards said bugs¹, be a tad too unreliable to do so without extensive
tests that catch miscompiled builds.

¹ I read “low-hanging fruits” in an LWN article today. One of these,
for the GCC/LTO problem, would be to make building mksh part of the
usual pre-release tests; mksh has a history of spotting compiler,
toolchain, libc, etc. bugs via its testsuite.

Now, with both footnote 1 and the first paragraph in mind, let’s get to
the point: isolating the issue is *hard*. The mksh testsuite is a
bunch of shell scripts together with flags and expected output, with
a Perl driver, run through the shell compiled with the to-be-tested
compiler/toolchain/libc. That’s a few levels of indirection. The latest
LTO bug occurs in only one testcase: arith-ternary-prec-1, which is:

$ mksh -c 'typeset -i x=2; y=$((1 ? 20 : x+=2))'
mksh: 1 ? 20 : x+=2: += requires lvalue

Basically, ?: binds more tightly than +=, so this is '(1 ? 20 : x) += 2',
which is an error because the left-hand side is not an lvalue, and a
miscompiled shell silently accepts this. This is *very* hard to isolate.
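
As a minimal sketch, the one-liner above could be wrapped into a pass/fail
check for a given shell binary. The function name check_ternary_prec is
hypothetical and not part of the mksh testsuite; the only assumption is that
a correctly built shell prints the “requires lvalue” diagnostic shown above
on standard error.

```shell
# Hypothetical helper (not part of mksh or its testsuite):
#   check_ternary_prec /path/to/shell
# Returns 0 if the shell rejects the expression with the lvalue
# diagnostic, 1 if the diagnostic is missing (possible miscompile).
check_ternary_prec() {
    shell_under_test=$1
    # Capture standard error only; the exit status is deliberately
    # ignored, the check relies on the diagnostic text alone.
    err=$("$shell_under_test" -c 'typeset -i x=2; y=$((1 ? 20 : x+=2))' 2>&1 >/dev/null) || true
    case $err in
    *"requires lvalue"*)
        echo "ok: expression rejected"
        return 0 ;;
    *)
        echo "FAIL: no lvalue diagnostic (possible miscompile)"
        return 1 ;;
    esac
}
```

It would be called with the path of the freshly built binary, for example
check_ternary_prec ./mksh, once for an LTO build and once for a non-LTO build.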

GCC developers prefer isolated small test cases. Now, with LTO,
isolating gets even more complicated. I can accept that not having
a small isolated test case is not desirable.

On the other hand, a change in testsuite output between two different
versions of the same compiler, ceteris paribus (i.e. you try the same
version of the testsuite, shell, toolchain, libc, …), *does* indicate
a problem (not necessarily in the compiler, but it’s a prime suspect),
and in the age of “git bisect” it’s at least often possible, for someone
with beefy enough hardware to actually build GCC that often, to figure
out which compiler change introduced the breakage. (Then, it’s still a
matter of deciding whether the bug is actually in the compiler or
elsewhere, but the GCC developers at least know their compiler, and each
other on the development team.)
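
A rough sketch of how such a bisection could be automated: “git bisect run”
treats exit status 0 as good, 125 as “skip this commit”, and any other status
from 1 to 127 as bad. The function below only maps two outcomes onto those
codes; its name, and the build and test scripts mentioned in the comments,
are hypothetical placeholders, since the actual GCC and mksh build steps
depend on the setup.

```shell
# Hypothetical helper for a `git bisect run` step script.
#   $1 = exit status of building the compiler at the current commit
#   $2 = exit status of the testsuite run with that compiler
bisect_verdict() {
    build_status=$1
    test_status=$2
    if [ "$build_status" -ne 0 ]; then
        return 125    # compiler did not build: cannot judge, skip commit
    fi
    if [ "$test_status" -ne 0 ]; then
        return 1      # testsuite output changed: mark commit as bad
    fi
    return 0          # testsuite passed: mark commit as good
}

# Intended use inside a step script (placeholder script names):
#   ./build-gcc.sh;          build=$?
#   ./build-and-test-mksh.sh; tests=$?
#   bisect_verdict "$build" "$tests"; exit $?
```

The step script would then be driven by “git bisect start” with a known-bad
and a known-good revision, followed by “git bisect run” and the script name.
Returning 125 for commits where the compiler itself fails to build keeps
unrelated build breakage from being blamed for the regression.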

Oh, and: the Linux kernel does not have such a testsuite. Several GNU
distributions’ mksh package maintainers have come to me, independently,
with a testsuite failure report about the above test, and the advice
found after the first analysis (LTO is at fault, GCC miscompiled mksh)
made them compile mksh without LTO, preventing their users from getting
a faulty binary that might misbehave in other situations as well. Now,
the Linux kernel, not so much.

Food for thought?

Shrinking the kernel with link-time optimization

Posted Feb 8, 2018 10:49 UTC (Thu) by dharding (subscriber, #6509)

Idle curiosity: I'm wondering (though I'm not expecting anyone in this thread to have a ready answer) how many of the problems in LTO builds are specific to LTO, and how many are generic optimization bugs exposed because LTO provides more opportunity for optimization.

Shrinking the kernel with link-time optimization

Posted Feb 8, 2018 20:03 UTC (Thu) by mirabilos (subscriber, #84359)

That’s an extremely interesting point.

And, yes, sorry, I don’t have even the beginning of an answer for you,
but someone with a powerful enough machine could certainly bisect this
between GCC 5 and 6, I think…