Linux 5.12's very bad, double ungood day
Posted Mar 8, 2021 23:18 UTC (Mon) by roc (subscriber, #30627)
In reply to: Linux 5.12's very bad, double ungood day by airlied
Parent article: Linux 5.12's very bad, double ungood day
Posted Mar 9, 2021 4:26 UTC (Tue) by airlied (subscriber, #9104) [Link] (18 responses)

Posted Mar 9, 2021 12:01 UTC (Tue) by roc (subscriber, #30627) [Link] (17 responses)
I am constantly frustrated by the kernel testing culture, or lack of it. rr has approximately zero resources behind it and our automated tests are still responsible for detecting an average of about one kernel regression every release.
Every time I bring this up, people have a string of excuses, like how hard it is to test drivers, etc. Some of those are fair, but the bug in this article and pretty much all the regressions found by rr aren't in drivers; they're in core kernel code that can easily be tested, even from user space.
Posted Mar 9, 2021 13:27 UTC (Tue) by pizza (subscriber, #46) [Link] (16 responses)
Okay, so... when exactly?
There are 10K commits (give or take a couple of thousand) that land in every -rc1. Indeed, until -rc1 lands, nobody can really be sure whether a given pull request (or even a specific patch) will be accepted. This is why nearly all upstream tooling treats "-rc1" as the "time to start looking for regressions" inflection point [1], and the following *two months* are spent fixing whatever comes up. This has been the established process for over a decade now.
So what if there's a (nasty) bug that takes down a test rig? That's what the test rigs are for! The only thing unusual about this bug is that it leads to silent corruption, to the point where "testing" in and of itself wasn't enough; the test would have had to be robust enough to ensure nothing unexpected was written anywhere on the disk. That's a deceptively hairy testing scenario, arguably going well beyond the tests folks developing filesystems run.
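To make that concrete, a check of that kind would have to verify that a region nothing was supposed to touch is still bit-for-bit identical after the workload. A minimal sketch, purely illustrative (the image file, loop device, and workload here are placeholders, not anything a real kernel test rig runs):

    # Create a canary device that the workload should never write to.
    dd if=/dev/urandom of=canary.img bs=1M count=256
    LOOPDEV=$(losetup --find --show canary.img)   # attach it, but never mount or swap on it
    sha256sum "$LOOPDEV" > before.sum

    # ... run the workload under test against a *different* device ...

    sha256sum "$LOOPDEV" > after.sum
    diff before.sum after.sum && echo "canary untouched" || echo "unexpected writes detected"

Even that only catches corruption on the canary; proving that a filesystem holding real data wasn't silently damaged is a much taller order.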
Note I'm not making excuses here; it is a nasty bug, and clearly the tests that its developers ran were insufficient. But it is ridiculous to expect "release-quality" regression testing to be completed at the start of the designated testing period.
[1] Indeed, many regressions are due to combinations of unrelated changes in a given -rc1; each of those 10K patches is fine on its own, but (eg) patch #3313 could lead to data loss, though only with a specific kernel option enabled, on a system containing an old 3Ware RAID controller and a specific motherboard with a PCI-X bridge that can't pass through MSI interrupts due to how it was physically wired up. [2] [3]
[2] It's sitting about four feet away from me as I type this.
[3] Kernel bugzilla #43074
Posted Mar 9, 2021 14:11 UTC (Tue) by epa (subscriber, #39769) [Link]

Posted Mar 9, 2021 15:12 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)
Posted Mar 9, 2021 15:22 UTC (Tue) by geert (subscriber, #98403) [Link] (1 responses)
$ git tag --contains 48d15436fde6
next-20210128
next-20210129
next-20210201
next-20210202
next-20210203
next-20210204
next-20210205
next-20210208
next-20210209
next-20210210
next-20210211
next-20210212
next-20210215
next-20210216
next-20210217
next-20210218
next-20210219
next-20210222
next-20210223
next-20210224
next-20210225
next-20210226
next-20210301
next-20210302
next-20210303
next-20210304
next-20210305
next-20210309
v5.12-rc1
v5.12-rc1-dontuse
v5.12-rc2

Three weeks passed between the buggy commit entering linux-next and upstream.
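For anyone who wants to redo that arithmetic, the dates are easy to pull out of git itself (assuming a checkout with both the mainline and linux-next tags fetched; the commit id is the one from the listing above):

    $ git tag --contains 48d15436fde6 | head -1             # earliest linux-next tag containing it
    $ git log -1 --format=%cd --date=short 48d15436fde6     # committer date of the commit
    $ git log -1 --format=%cd --date=short v5.12-rc1        # date of the first mainline tag containing it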
Posted Mar 9, 2021 15:35 UTC (Tue) by pizza (subscriber, #46) [Link]

> Three weeks passed between the buggy commit entering linux-next and upstream.

So the "problem" here isn't that nothing was being tested; it's just that none of the tests run during that window caught this particular issue. It's also not clear that there was even a test out there that could have caught it, except by pure happenstance.
But that's the reality of software work: a bug turns up, you write a test to catch it (and hopefully others of the same class), you add it to the test suite (which runs as often as your available resources allow)... and repeat endlessly.
Posted Mar 9, 2021 15:25 UTC (Tue) by pizza (subscriber, #46) [Link] (8 responses)
Not that it will stop folks complaining when "5.32-alpha0-rc4-pre3" fails to boot on their production system, obviously because it should have been tested first, and we need a pre-pre-pre-pre-pre release snapshot to start testing against.
Posted Mar 9, 2021 15:26 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Posted Mar 9, 2021 17:47 UTC (Tue) by Wol (subscriber, #4433) [Link] (5 responses)
Horse to water and all that ...
Cheers,
Wol
Posted Mar 9, 2021 19:55 UTC (Tue) by mathstuf (subscriber, #69389) [Link]
But this kind of one-off code is itself annoying to test, and someone will script adding it to their boot command lines anyway.
Posted Mar 10, 2021 21:58 UTC (Wed) by sjj (guest, #2020) [Link] (3 responses)
I don’t think I’ve built a kernel in 10 years, or maybe that one time 7-8 years ago.
Posted Mar 10, 2021 22:22 UTC (Wed) by roc (subscriber, #30627) [Link] (1 responses)

Posted Mar 11, 2021 8:43 UTC (Thu) by pbonzini (subscriber, #60935) [Link]
Posted Mar 10, 2021 23:20 UTC (Wed) by Wol (subscriber, #4433) [Link]
You clearly don't run gentoo :-)
Cheers,
Wol
Posted May 2, 2021 2:58 UTC (Sun) by pizza (subscriber, #46) [Link]
I saw this scroll by when I upgraded this system to Fedora 34:
$ rpm -q icedtea-web
icedtea-web-2.0.0-pre.0.3.alpha16.patched1.fc34.3.x86_64
Posted Mar 9, 2021 22:44 UTC (Tue) by dbnichol (subscriber, #39622) [Link] (1 responses)

Posted Mar 10, 2021 2:43 UTC (Wed) by roc (subscriber, #30627) [Link]
Posted Mar 10, 2021 2:34 UTC (Wed) by roc (subscriber, #30627) [Link]
In practice, large projects often try to maximise bang-for-the-buck by dividing tests into tiers, e.g. tier 1 tests run on every push, tier 2 every day, maybe a tier 3 that runs less often. Many projects use heuristics or machine learning to choose which tests to run in each run of tier 1.
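A toy illustration of the tiering idea (not any particular project's real CI configuration; the trigger variable and script names are invented):

    # Pick a test tier based on how the CI run was triggered.
    case "$CI_TRIGGER" in
      push)    ./run-tests.sh --suite smoke ;;        # tier 1: minutes, on every push
      nightly) ./run-tests.sh --suite functional ;;   # tier 2: hours, once a day
      weekly)  ./run-tests.sh --suite stress ;;       # tier 3: long-running, less often
    esac

The heuristic/ML part then amounts to trimming tier 1 further, e.g. running only the tests whose historical failures correlate with the files touched by the push.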
Yes, I understand that it's difficult to thoroughly test weird hardware and configuration combinations. Ideally organizations that produce hardware with Linux support would contribute testing on that hardware. But even if we ignore all those bugs, there are still lots of core kernel bugs not being caught by kernel CI.