Old compilers and old bugs
On January 5, Russell King reported on a problem he had been chasing for a long time. Some of his 64-bit Arm systems running 5.4 or later kernels would, on rare occasion, report a checksum failure on the ext4 root filesystem. It could take up to three months of uptime for the problem to manifest itself, making it, as King described it, "unrealistic to bisect". He had, however, found a way to more reliably reproduce the failure, making the task of finding out when the problem was introduced plausible, at least.
Starting with King's findings, a number of developers working in the Arm subsystem looked into the issue; their efforts seemed to point to this commit as the culprit. That change, applied in 2019, relaxed the memory barriers used around I/O accessors, optimizing accesses to I/O memory. Reverting this patch made the problem go away.
Some developers might have applied the revert and called the problem solved, but that is not what happened here. Will Deacon, the author of the patch in question, was convinced of its correctness; if the Arm architecture is behaving as specified, there should be no need for the stronger barriers, so something else was going on. Reverting the patch, in other words, made the issue go away by papering over a real problem somewhere else.
Where might that "somewhere else" be? King suggested that it could be somewhere else in the kernel, in the Arm processor itself, or in the cache-coherent interconnect that ties together processor clusters and memory. He thought that a problem in the hardware was relatively unlikely, and that the bug thus lurked somewhere within the kernel. That, naturally, led to a lot of code examination, especially within the ext4 filesystem.
Two days later, King announced that the problem had been found; it indeed was an issue within the ext4 filesystem, but not of the variety that had been expected. A look at the assembly code generated for ext4_chksum() revealed that the compiler was freeing the function's stack frame prior to the end of the function itself. The last line of the function is:
return *(u32 *)desc.ctx;
Here, desc is a local variable, living on the stack. The compiled function was resetting the stack pointer above this variable immediately before fetching desc.ctx. That led to a window of exactly one instruction where the function was using stack space that had already been freed.
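In assembly terms, the miscompiled epilogue looked something like the following sketch (the registers and offsets here are invented for illustration; they are not the actual compiler output):

    add    sp, sp, #64       // free the stack frame; sp now points above desc
    ldr    w0, [sp, #-16]    // only now is desc.ctx loaded, from below sp
    ret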
This is a compiler bug of the worst type. The miscompiled code will work as expected almost every time; there is, after all, no other code trying to allocate stack space in that one-instruction window. All bets are off, though, if an interrupt arrives exactly between the two instructions; then the stack will be overwritten and the load of desc.ctx will be corrupted, leading to the observed checksum failure. This is something that will almost never happen, but when it does things will go badly wrong.
This miscompilation was done by GCC 4.9.4, which was released in August 2016 (4.9.0, the major release on which it is based, came out in April 2014). The relevant bug, though, was reported in 2014 and fixed in November of that year. That fix was seemingly never backported from the (then) under-development 5.x release to 4.9.x, so the 4.9.4 release did not contain it. Interestingly, the versions of 4.9.4 shipped by distributors like Red Hat, Android, and Linaro all had the fix backported, so the bug only affected developers building with the unpatched upstream releases. It lurked there for years until finally turning up in King's builds.
One outcome from this episode is a clear illustration of the potential downside of supporting old toolchains. A great deal of effort went into tracking down a bug that had, in fact, been fixed six years ago; that would not have been necessary if developers were not still using 4.9.x compilers.
As it happens, GCC 4.9 is the oldest compiler supported by the kernel, but even that requirement is relatively recent. As of 2018, the kernel still claimed (not entirely truthfully) that it could be built with GCC 3.2, which was released in 2002. As a result of discussions held in 2018, the minimum GCC version was moved forward to 4.6; later it became 4.9.
Fixing GCC 4.9 to address this bug is out of the question; the GCC developers have long since moved on from that release. So, at a minimum, the oldest version of the compiler that can be used for the arm64 architecture will have to be moved forward to 5.1. But that immediately led to the question of whether the oldest version for all architectures should be moved forward.
Ted Ts'o was in favor of that change, but he also pointed out that RHEL 7 (and thus CentOS 7) systems are still stuck with GCC 4.8. As Peter Zijlstra noted, though, it is already necessary to install a newer compiler than the distribution provides to build the kernel on those systems. Arnd Bergmann said that the other known users of GCC 4.9 were Android and Debian 8. Android has since switched over to Clang to build its kernels, and Debian 8 went unsupported at the end of June 2020. So it would appear that relatively few users would be inconvenienced by raising the minimum GCC version to 5.1.
On the other hand, there are some advantages to such a move beyond leaving an unpleasant bug behind. Bergmann argued for this change because it would allow compiling the kernel with -std=gnu11, making it possible to rely on bleeding-edge C11 features. Currently, kernel builds use -std=gnu89, based on the rather less shiny C89 standard. Zijlstra and Deacon both added that moving to 5.1 would allow the removal of a number of workarounds for GCC 4.9 problems.
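One concrete illustration (a generic example of the standards difference, not code from the kernel): gnu89 rejects declaring the loop iterator inside the for statement, while gnu11 accepts it.

    /* Compiles with -std=gnu11; -std=gnu89 rejects the declaration of
     * 'i' inside the for statement. */
    int sum(const int *v, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += v[i];
        return total;
    }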
Given all that, it seems unlikely that there will be much opposition to moving the kernel as a whole to the 5.1 minimum version. That said, Linus Torvalds is unconvinced about the value of such a change and may yet need some convincing. Even if the shift to 5.1 does not happen right away, the writing would seem to be on the wall that GCC 4.9 will not be supported indefinitely. GCC 5.1, released in April 2015, is not the newest thing on the planet either, of course. But hopefully it has fewer lurking bugs while simultaneously making some welcome new features available. Supporting old toolchains has its value, but so does occasionally dropping the oldest of them.
Index entries for this article: Kernel/Build system
Posted Jan 11, 2021 22:09 UTC (Mon) by JoeBuck (subscriber, #2330)
But a given vendor might have backported the GCC fix to 4.9. Guess there could be a config option to force it on.
Posted Jan 11, 2021 22:24 UTC (Mon) by zlynx (guest, #2285)
The memory barrier probably masked the problem by changing how the compiler ordered its machine code. That doesn't mean that adding a memory barrier would fix the same bug in other locations.
Posted Jan 11, 2021 22:34 UTC (Mon) by dxin (guest, #136611)
Posted Jan 12, 2021 0:56 UTC (Tue) by ndesaulniers (subscriber, #110768)
Posted Jan 12, 2021 8:12 UTC (Tue) by marcH (subscriber, #57642)
bleeding edge features like.... C99 comments! No?
Posted Jan 12, 2021 10:25 UTC (Tue) by quietbritishjim (subscriber, #114117)
Posted Jan 13, 2021 8:39 UTC (Wed) by error27 (subscriber, #8346)
Posted Jan 12, 2021 10:51 UTC (Tue) by vstinner (subscriber, #42675)
I love a thriller story like this! With the article, the whole issue makes perfect sense, but I love the puzzle that people had to solve to put the pieces together.
I'm always amazed by developers able to spot in the machine/assembly code that the C code is "miscompiled". The compiler is not the usual suspect.
It's also "funny" that it's still possible to use a variable allocated on the stack after the stack pointer is "reset". It's funny since even if the code is "wrong", well, it works in practice "most of the time".
> That led to a window of exactly one instruction where the function was using stack space that had already been freed.
How likely is it to get an interrupt at one precise instruction? "Unlikely" if I understood correctly, but "it happens" because of Murphy's law ;-)
--
Maybe a special kernel debug mode should send signals frequently to make such bugs more likely to show up?
In Python, I wrote functional tests for EINTR handling. The tests use setitimer() to trigger a SIGALRM signal every 100 ms while the test is running; some tests "sleep" for 200 ms and so should get at least one signal.
Posted Jan 12, 2021 15:09 UTC (Tue) by eru (subscriber, #2753)
This type of bug is not that unusual, but normally it is caused by a programmer returning a pointer to a local variable from a function. Depending on what the caller does, the buggy code may "work" for a long time, and then blow up, when some unrelated code changes are made, or the compiler is changed. Actually this bug was more common in the past. Nowadays smarter compilers and code analysis tools can usually warn about it.
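A minimal, hypothetical example of that pattern; current GCC and Clang diagnose it (-Wreturn-local-addr and -Wreturn-stack-address, respectively):

    /* Returns the address of an automatic variable: the storage is
     * released when the function returns, so the caller gets a dangling
     * pointer that may appear to work until the stack is reused. */
    static const char *greeting(void)
    {
        char buf[16] = "hello";
        return buf;
    }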
Posted Jan 12, 2021 15:58 UTC (Tue) by NYKevin (subscriber, #129325)
> It's also "funny" that it's still possible to use a variable allocated on the stack after the stack pointer is "reset". It's funny since even if the code is "wrong", well, it works in practice "most of the time".
For "regular userspace code," this may be entirely legal depending on how big your stack frames are. But the kernel uses -mno-red-zone to forbid it, precisely because it needs to be interrupt-safe.
See also this Linus rant (which is the only reason I know these things): https://lkml.org/lkml/2014/7/24/584
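As a sketch of what that means (illustrative codegen, not a guarantee of what any particular compiler emits): on x86-64, the SysV ABI reserves the 128 bytes below the stack pointer as a "red zone" that a leaf function may use without moving the stack pointer.

    /* A leaf function: a compiler may spill 'tmp' into the red zone,
     * e.g. "movl %edi, -4(%rsp)", without ever adjusting %rsp.  Signal
     * delivery must skip the red zone, so this is safe in userspace;
     * kernel-mode interrupts make no such promise, hence -mno-red-zone
     * for kernel builds. */
    static int square(int x)
    {
        volatile int tmp = x;   /* volatile forces an actual stack slot */
        return tmp * tmp;
    }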
Posted Jan 12, 2021 16:34 UTC (Tue) by eru (subscriber, #2753)
Posted Jan 12, 2021 20:00 UTC (Tue) by NYKevin (subscriber, #129325)
Which is the most popular architecture by far in terms of desktops and servers. Obviously, phones are a different story, but characterizing it as "only" x86_64 sounds a bit odd to me.
> But any C code that depends on the stack below the stack frame of the current function remaining undisturbed is clearly buggy.
The C abstract machine has no notion of "below the stack frame," so this sentence is meaningless.
Under C, a pointer can point to one of five things:
1. A static object.
2. An automatic object.
3. A dynamic object.
4. NULL or one-past-the-end-of-an-array.
5. Anything else.
For cases 1 and 3, the stack frame is obviously irrelevant. Case 2 is only possible while the function is still executing, because when it returns, we immediately transition to case 5 (according to the C abstract machine). And cases 4 and 5 are always illegal to dereference, regardless of whether you have red-zones or not, meaning an optimizing compiler can interpret your code as unreachable and do wacky things even if the arch would have supported it. In fact, it's generally illegal to even *have* a case 5 pointer, because all of the operations you can do to create a case 5 pointer are themselves UB.
Instead, the entire context of this discussion is around *assembler* code, which has (usually) been emitted by a compiler. The compiler is absolutely entitled to use red zoning on archs where it is supported, unless the user instructs it otherwise (with -mno-red-zone). But this has nothing to do with the "bugginess" of the C code it started with. Buggy C code will obviously result in buggy assembler, but there is no reasonable way for C code to cause a red-zoning violation unless it has wandered pretty far into UB territory to begin with. At that point, the C code isn't so much buggy as utterly meaningless.
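A small illustration of the line between cases 4 and 5 (plain standard-C semantics, nothing kernel-specific):

    /* Case 4 vs. case 5. */
    int sum4(void)
    {
        int a[4] = {1, 2, 3, 4};
        int *end = a + 4;                /* case 4: legal to form and compare */
        int total = 0;
        for (int *p = a; p != end; p++)  /* comparing against 'end' is fine */
            total += *p;
        /* *end would be UB; even computing a + 5 (case 5) is already UB. */
        return total;
    }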
Posted Jan 13, 2021 7:15 UTC (Wed) by jem (subscriber, #24231)
One could argue that variables in the "red zone" *are* part of the stack frame of the function. The compiler writer didn't want to waste a clock cycle to decrement the stack pointer, but that is just a technicality. On the other hand, any code that assumes the stack area below the stack pointer is free to use for some other purpose is clearly buggy.
Posted Jan 13, 2021 9:54 UTC (Wed) by Wol (subscriber, #4433)
Are you talking about the code that executes a new function, and hence is assuming that that space is free to use for allocating a new stack frame? Because that's exactly what this bug is ...
Cheers,
Wol
Posted Jan 13, 2021 14:10 UTC (Wed) by jem (subscriber, #24231)
Posted Jan 12, 2021 21:07 UTC (Tue) by cesarb (subscriber, #6266)
> We do *not* follow the x86-64 ABI wrt redzoning, because we *cannot*: interrupts while in kernel mode *will* use the stack without a redzone.
This is a peculiarity of the x86 architecture (which was kept when AMD extended it to 64 bits): to handle an interrupt, the CPU stores the return address and the processor state (flags) in the stack. Other architectures (including IIRC ARM and MIPS) store them in special registers instead, keeping the stack untouched.
Posted Jan 13, 2021 17:27 UTC (Wed) by Wol (subscriber, #4433)
Cheers,
Wol
Posted Jan 13, 2021 19:17 UTC (Wed) by excors (subscriber, #95769)
(The nesting only allows interrupt handlers to be interrupted by a strictly higher priority interrupt, and there's a limited number of priority levels (e.g. I think 8 is common on Cortex-M4), so the nesting can't get arbitrarily deep. It's useful since you can use high-priority interrupts for real-time tasks, with a guarantee the handler will execute within a certain number of cycles and will never be held up by lower-priority interrupt handlers. That wouldn't be possible if the lower-priority handler was required to manually save its state before allowing any nesting.)
Posted Jan 25, 2021 19:07 UTC (Mon) by immibis (subscriber, #105511)
I'm not sure how that works with exceptions. I think ARM has a separate register set for "aborts", so you can nest an abort within an IRQ, but nesting an abort within an abort may cause problems for you - of course, if your abort handler was unable to save the state without causing another abort, you *already* have problems. That's basically a triple fault in software.
Posted Jan 13, 2021 19:10 UTC (Wed) by jem (subscriber, #24231)
Posted Jan 12, 2021 18:56 UTC (Tue) by pbonzini (subscriber, #60935)
By the time the compiler becomes the suspect, spotting the miscompiled code is easy compared to the steps before!
(That said, I know you worked on a ProactorEventLoop that was probably just as nasty as this kind of bug. I found your blog post while trying to figure out some random hangs that only showed up on Windows and only with Python 3.5. Fortunately the 3.5 EOL meant that *my* bug fixed itself just by waiting a few weeks!).
Posted Jan 13, 2021 3:55 UTC (Wed) by plugwash (subscriber, #29694)
Let's try and put an order of magnitude on this by making some crude assumptions:
1. the processor executes on-average one instruction per clock cycle (IIRC modern cores can perform more than one instruction per clock sometimes, but they can also have to wait for memory)
2. the processor runs at about 2GHz (seems reasonable for an arm64 core).
3. there is a timer interrupt at 1000Hz.
Which would give us an assumed hit-rate from timer interrupts alone of 1 in 2 million executions. That sounds frequent enough to cause problems on a frequently executed codepath, but rare enough to be a massive PITA to debug.
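Spelling that out with plugwash's assumed numbers (assumptions, not measurements):

    #include <stdio.h>

    int main(void)
    {
        const double instrs_per_second = 2e9;  /* assumptions 1+2: 2GHz, ~1 instruction/cycle */
        const double timer_hz = 1000.0;        /* assumption 3: 1000Hz timer */

        /* Instructions between timer ticks = the odds that one particular
         * single-instruction window is hit by any given tick. */
        printf("1 in %.0f\n", instrs_per_second / timer_hz);  /* prints: 1 in 2000000 */
        return 0;
    }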
Posted Jan 12, 2021 12:07 UTC (Tue) by ncm (guest, #165)
Few in that world use ARM, yet. But quite a large number of other bugs motivated the later 4.8.x releases, and 4.9. One might expect that there would be a careful catalog of Known Bugs, with means to step around them, and reviews of code to ensure they are not stepped in, but such an expectation would be wholly wrong.
Posted Jan 12, 2021 15:12 UTC (Tue) by eru (subscriber, #2753)
I'm surprised that they use GCC at all, and not COBOL...
Posted Jan 12, 2021 16:16 UTC (Tue) by ncm (guest, #165)
Finance people generally *prefer* things that are out of maintenance, because they experience even fewer incoming changes. Any infrastructural bug that seems like it ought to be physically possible to work around reliably is A-OK, if the alternative would involve changing something.
Posted Jan 12, 2021 17:35 UTC (Tue) by eru (subscriber, #2753)
Posted Jan 12, 2021 19:27 UTC (Tue) by jwarnica (subscriber, #27492)
Red Hat (in RHEL vWhatever) or Debian (in vWhatever), or similar, by definition, contains a cohesive set of obsolete (but internally cross compatible) things. If you want to backport a particular patch to your obsolete world, feel free, but then compile it with the included and matching obsolete compiler.
Posted Jan 12, 2021 21:06 UTC (Tue) by mathstuf (subscriber, #69389)
While this is the extreme end of the spectrum, there is a line that needs to be drawn. I don't know how many times bug reports have come down to things like "I have some random path in LD_LIBRARY_PATH because of some other problem I hit months ago" or (just this week) "the build picked up a Python2 tool in a Python3 PYTHONPATH-using environment". If someone who doesn't know how to gather the required information (or even just read error messages) has their problem solved by a clean build or other from-scratch setup, that really is the best solution. I don't even know how to document "make sure that tool X that is found is using Python3, not Python2" in a way that someone who couldn't determine from the error message that this is what's happening would be able to act upon effectively.
> "Someone wants to use an obsolete dependency because their world is built on mostly obsolete things? OK, let them!"
Again, there's a limit sometimes as to how varied an environment a single project can reliably support. At some point when someone is asking for a 5 year old deployment of tool X to support other tools that came out yesterday and finding issues, well, what is one to do?
Posted Jan 12, 2021 20:31 UTC (Tue) by atai (subscriber, #10977)
Posted Jan 12, 2021 21:09 UTC (Tue) by Wol (subscriber, #4433)
The problem isn't the code, it's people's inertia.
Cheers,
Wol
Posted Jan 12, 2021 21:10 UTC (Tue) by mathstuf (subscriber, #69389)
Posted Jan 12, 2021 23:16 UTC (Tue) by nivedita76 (subscriber, #121790)
Posted Jan 12, 2021 18:31 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Posted Jan 22, 2021 0:25 UTC (Fri) by nix (subscriber, #2304)
(There is a nice historical timeline at https://gcc.gnu.org/releases.html so you don't have to dig around on ancient ftp servers because all the ancient historical digging has already been done for us. It's at times like this that I'm reminded of how old this stuff is, how it was made by a generation that is now fast approaching retirement: I'm not exactly young any more, old enough to have had no access to computers at all for the first few years of my life -- but when GCC was first released, I was only ten years old, and it had been under development for some years by then. Free software is older than the children of the microcomputer age. For that matter even the *text editors* that a lot of us use are older than we are: both vi and Emacs date back in some form to roughly 1976.)
Posted Jan 22, 2021 0:43 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Posted Jan 12, 2021 18:47 UTC (Tue) by alonz (subscriber, #815)
Posted Jan 12, 2021 22:59 UTC (Tue) by ndesaulniers (subscriber, #110768)
Posted Jan 21, 2021 11:59 UTC (Thu) by Hi-Angel (guest, #110915)
Posted Jan 13, 2021 0:04 UTC (Wed) by dvdeug (guest, #10998)
Posted Jan 24, 2021 0:17 UTC (Sun) by david.a.wheeler (subscriber, #72896)
The Ada standard includes "pragma reviewable", a compiler mechanism specifically to enable reviews of the generated code:
https://www.adaic.org/resources/add_content/standards/12r...
It's meant for developing high-assurance software (software has to be pretty critical to justify this level of review, but if it's really critical, this does help counter these kinds of errors).