Old compilers and old bugs
On January 5, Russell King reported
on a problem he had been chasing for a long time. Some of his 64-bit Arm
systems running 5.4 or later kernels would, on rare occasion, report a
checksum failure on the ext4
root filesystem. It could take up to three months of uptime for the
problem to manifest itself, making it, as King described it,
"unrealistic to bisect". He had, however, found a way to more
reliably reproduce the failure, making the task of finding out when the
problem was introduced plausible, at least.
Starting with King's findings, a number of developers working in the Arm subsystem looked into the issue; their efforts seemed to point to a single commit as the culprit. That change, applied in 2019, relaxed the memory barriers used around I/O accessors, optimizing accesses to I/O memory. Reverting this patch made the problem go away.
Some developers might have applied the revert and called the problem solved, but that is not what happened here. Will Deacon, the author of the patch in question, was convinced of its correctness; if the Arm architecture is behaving as specified, there should be no need for the stronger barriers, so something else was going on. Reverting the patch, in other words, made the issue go away by papering over a real problem somewhere else.
Where might that "somewhere else" be? King suggested that it could be somewhere else in the kernel, in the Arm processor itself, or in the cache-coherent interconnect that ties together processor clusters and memory. He thought that a problem in the hardware was relatively unlikely, and that the bug thus lurked somewhere within the kernel. That, naturally, led to a lot of code examination, especially within the ext4 filesystem.
Two days later, King announced that the problem had been found; it indeed was an issue within the ext4 filesystem, but not of the variety that had been expected. A look at the assembly code generated for ext4_chksum() revealed that the compiler was freeing the function's stack frame prior to the end of the function itself. The last line of the function is:
return *(u32 *)desc.ctx;
Here, desc is a local variable, living on the stack. The compiled function was resetting the stack pointer above this variable immediately before fetching desc.ctx. That led to a window of exactly one instruction where the function was using stack space that had already been freed.
This is a compiler bug of the worst type. The miscompiled code will work as expected almost every time; there is, after all, no other code trying to allocate stack space in that one-instruction window. All bets are off, though, if an interrupt arrives exactly between the two instructions; then the interrupt handler will overwrite the freed stack space and the load of desc.ctx will return corrupted data, leading to the observed checksum failure. This is something that will almost never happen, but when it does, things will go badly wrong.
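The race can be illustrated with a toy model. What follows is only a sketch under stated assumptions, not the kernel's code or the actual generated assembly: the stack is modeled as a small array that grows downward, the "interrupt" is an ordinary function that scribbles on the words just below the stack pointer, and all of the names (toy_stack, toy_interrupt(), epilogue_correct(), epilogue_buggy(), toy_run()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 16

/* Toy model of a downward-growing stack: words at index >= sp are live. */
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    int sp;
};

/* Simulated interrupt handler: uses the two words just below sp. */
static void toy_interrupt(struct toy_stack *s)
{
    assert(s->sp >= 2);
    s->mem[s->sp - 1] = 0xdeadbeef;
    s->mem[s->sp - 2] = 0xdeadbeef;
}

/* Correct order: load the local variable, then release the frame. */
static uint32_t epilogue_correct(struct toy_stack *s, int irq)
{
    uint32_t result;

    if (irq)
        toy_interrupt(s);      /* only touches words below the live frame */
    result = s->mem[s->sp];    /* load while the frame is still allocated */
    s->sp += 1;                /* release the one-word frame */
    return result;
}

/* Miscompiled order: release the frame first, then load from it. */
static uint32_t epilogue_buggy(struct toy_stack *s, int irq)
{
    int slot = s->sp;          /* where the local variable lives */

    s->sp += 1;                /* frame released: slot is now below sp */
    if (irq)
        toy_interrupt(s);      /* lands in the window and overwrites slot */
    return s->mem[slot];       /* load from already-freed stack space */
}

/* Run one epilogue with the value 0x12345678 stored in the local variable. */
uint32_t toy_run(int buggy, int irq)
{
    struct toy_stack s;

    memset(&s, 0, sizeof(s));
    s.sp = 8;
    s.mem[s.sp] = 0x12345678;
    return buggy ? epilogue_buggy(&s, irq) : epilogue_correct(&s, irq);
}
```

In this model, toy_run(1, 0) still returns the stored value, which mirrors why the miscompiled kernel code works nearly all the time; only toy_run(1, 1), where the interrupt arrives inside the window, returns garbage.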
This miscompilation was done by GCC 4.9.4, which was released in August 2016 (4.9.0, the major release on which it is based, came out in April 2014). The relevant bug, though, was reported in 2014 and fixed in November of that year. That fix was seemingly never backported from the (then) under-development 5.x release to 4.9.x, so the 4.9.4 release did not contain it. Interestingly, versions of 4.9.4 shipped by distributors like Red Hat, Android, and Linaro all did have the fix backported, so it only affected developers not using those versions. The bug lurked there for years until finally turning up in King's builds.
One outcome from this episode is a clear illustration of the potential downside of supporting old toolchains. A great deal of effort went into tracking down a bug that had, in fact, been fixed six years ago; that would not have been necessary if developers were not still using 4.9.x compilers.
As it happens, GCC 4.9 is the oldest compiler supported by the kernel, but even that requirement is relatively recent. As of 2018, the kernel still claimed (not entirely truthfully) that it could be built with GCC 3.2, which was released in 2002. As a result of discussions held in 2018, the minimum GCC version was moved forward to 4.6; later it became 4.9.
Fixing GCC 4.9 to address this bug is out of the question; the GCC developers have long since moved on from that release. So, at a minimum, the oldest version of the compiler that can be used for the arm64 architecture will have to be moved forward to 5.1. But that immediately led to the question of whether the oldest version for all architectures should be moved forward.
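A minimum-version requirement of this sort comes down to a simple comparison. The sketch below assumes that a compiler version is folded into a single integer as major * 10000 + minor * 100 + patch; the function names and the MIN_GCC_VERSION_CODE constant are hypothetical and are not taken from the kernel's build system.

```c
#include <assert.h>

/* Hypothetical minimum: GCC 5.1.0, encoded as described above. */
#define MIN_GCC_VERSION_CODE 50100

/* Fold a version into one comparable integer, e.g. 4.9.4 becomes 40904. */
static int gcc_version_code(int major, int minor, int patch)
{
    return major * 10000 + minor * 100 + patch;
}

/* Return nonzero if the given compiler version meets the minimum. */
int compiler_is_new_enough(int major, int minor, int patch)
{
    /* The encoding only works while minor and patch stay below 100. */
    assert(minor >= 0 && minor < 100);
    assert(patch >= 0 && patch < 100);
    return gcc_version_code(major, minor, patch) >= MIN_GCC_VERSION_CODE;
}
```

When building with GCC, the three arguments could be supplied from the predefined __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros; under this scheme 4.9.4 is rejected while 5.1.0 passes.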
Ted Ts'o was in favor of that change, but he also pointed out that RHEL 7 (and thus CentOS 7) systems are still stuck with GCC 4.8. As Peter Zijlstra noted, though, it is already necessary to install a newer compiler than the distribution provides to build the kernel on those systems. Arnd Bergmann said that the other known users of GCC 4.9 were Android and Debian 8. Android has since switched over to Clang to build its kernels, and Debian 8 went unsupported at the end of June 2020. So it would appear that relatively few users would be inconvenienced by raising the minimum GCC version to 5.1.
There are also some advantages to such a move beyond leaving an unpleasant bug behind. Bergmann argued for this change because it would allow compiling the kernel with -std=gnu11, making it possible to rely on bleeding-edge C11 features. Currently, kernel builds use -std=gnu89, based on the rather less shiny C89 standard. Zijlstra and Deacon both added that moving to 5.1 would allow the removal of a number of workarounds for GCC 4.9 problems.
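For readers who have not looked at the newer standards in a while, here is a small sketch of the kind of code that -std=gnu11 permits. It is not drawn from the discussion or from any kernel patch; struct csum_ctx, type_bits(), and sum_words() are hypothetical names. GCC may accept some of these constructs as extensions in older modes, but a declaration inside a for statement is rejected under -std=gnu89.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical structure holding a 32-bit checksum. */
struct csum_ctx {
    uint32_t value;
};

/* C11: check a layout assumption at compile time. */
_Static_assert(sizeof(struct csum_ctx) == sizeof(uint32_t),
               "csum_ctx must be exactly one 32-bit word");

/* C11: pick a result based on the static type of the argument. */
#define type_bits(x) _Generic((x), uint32_t: 32, uint64_t: 64, default: 0)

/* C99 and later: the loop counter is declared in the for statement. */
uint32_t sum_words(const uint32_t *words, int count)
{
    uint32_t total = 0;

    assert(count >= 0);
    for (int i = 0; i < count; i++)
        total += words[i];
    return total;
}
```

The compile-time assertion costs nothing at run time, and the scoped loop counter cannot be misused after the loop ends.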
Given all that, it seems unlikely that there will be much opposition to
moving the kernel as a whole to the 5.1 minimum version. That said, Linus
Torvalds is
unconvinced about the value of such a change and may yet need some
convincing. Even if the shift to 5.1 does not happen right away, the
writing would seem to be on the wall that GCC 4.9 will not be
supported indefinitely. GCC 5.1,
released in April 2015, is not the newest thing on the planet either, of
course. But hopefully it has fewer lurking bugs while simultaneously
making some welcome new features available. Supporting old toolchains has
its value, but so does occasionally dropping the oldest of them.
