
Ext4 data corruption in stable kernels

Ext4 data corruption in stable kernels

Posted Dec 10, 2023 23:27 UTC (Sun) by bgilbert (subscriber, #4738)
In reply to: Ext4 data corruption in stable kernels by wtarreau
Parent article: Ext4 data corruption in stable kernels

> The same type of people who complain about unanticipated storms and who then complain about mistaken weather forecast when it announces rain that doesn't come.

"Stable" is not a prediction of forces beyond developer control. It's an assertion of a quality bar, which needs to be backed by appropriate tools, testing, and developer time.

> For my stable kernel usages, I *tend* to pick one or two versions older than the last one if I see that the recent fixes are not important for me (i.e. I won't miss them).

As I understand Greg KH's position, anyone applying such a policy is irresponsible for not immediately installing the newest batch of patches.



Ext4 data corruption in stable kernels

Posted Dec 11, 2023 13:47 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (16 responses)

> > The same type of people who complain about unanticipated storms and who then complain about mistaken weather forecast when it announces rain that doesn't come.

> "Stable" is not a prediction of forces beyond developer control. It's an assertion of a quality bar, which needs to be backed by appropriate tools, testing, and developer time.

Which is exactly the case. Look at the latest 6.6.5-rc1 thread for example:
https://lore.kernel.org/all/20231205031535.163661217@linu...

I've counted 17 people responding to that thread with test reports, some indicating boot failures, others successes, on a total of around 910 systems covering lots of architectures, configs and setups. I think this definitely qualifies as "appropriate tools", "testing" and "developer time", and I doubt many other projects devote that much effort to weekly releases.

> > For my stable kernel usages, I *tend* to pick one or two versions older than the last one if I see that the recent fixes are not important for me (i.e. I won't miss them).
>
> As I understand Greg KH's position, anyone applying such a policy is irresponsible for not immediately installing the newest batch of patches.

No. Having already discussed this topic with him, I'm pretty sure he never said this. I even remember him once explaining that he doesn't want to advertise severity levels in his releases, precisely so that users upgrade when they feel confident rather than only when a release is flagged as really important. Use cases differ so much between users that some might absolutely need to upgrade to fix a driver that's going to ruin their data, while others might prefer to wait because a later fix could cause serious availability issues.

Periodically applying updates is a healthy approach; what matters is that severe bugs do not live long in the wild and that releases are frequent enough to help narrow down an occasional regression based on the various reports. I personally rebuild every time I reboot my laptop (which is quite rare thanks to suspend), and phone vendors tend to update only once every few months, and that's already OK.
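To illustrate the narrowing-down point (my own sketch, not anything from the thread): with frequent releases, "good"/"bad" reports against specific versions let you bisect the release history to find where a regression appeared, much like `git bisect` does over commits. The version list and the `is_bad` predicate below are made up for the example.

```python
def first_bad_release(releases, is_bad):
    """Return the earliest release for which is_bad() is true,
    assuming every release after it is also bad (a regression that
    landed once and stayed). Returns None if nothing in the range
    is bad."""
    lo, hi = 0, len(releases) - 1
    if not is_bad(releases[hi]):
        return None  # no regression in this range
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid      # regression is at mid or earlier
        else:
            lo = mid + 1  # regression appeared after mid
    return releases[lo]

# Hypothetical data: pretend the regression landed in 6.1.63.
releases = ["6.1.60", "6.1.61", "6.1.62", "6.1.63", "6.1.64", "6.1.65"]
bad = {"6.1.63", "6.1.64", "6.1.65"}
print(first_bad_release(releases, lambda r: r in bad))  # -> 6.1.63
```

The denser the release history, the fewer patches separate the last good version from the first bad one, which is exactly what makes the occasional regression tractable.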

Ext4 data corruption in stable kernels

Posted Dec 11, 2023 15:44 UTC (Mon) by bgilbert (subscriber, #4738) [Link] (9 responses)

> I've counted 17 people responding to that thread with test reports, some indicating boot failures, others successes, on a total of around 910 systems covering lots of architectures, configs and setups. I think this definitely qualifies as "appropriate tools", "testing" and "developer time", and I doubt many other projects devote that much effort to weekly releases.

Many other projects have CI tests that are required to pass before a new release can ship. If that had been the case for LTP, this regression would have been avoided. What's more, the problem was reported to affect 6.1.64 during its -rc period, but no action was taken to fix that release. 6.1.64 was released with the problem four days later.

Mistakes happen! But this is an opportunity to improve processes to prevent a recurrence, rather than accepting the status quo.

> No. Having already discussed this topic with him, I'm pretty sure he never said this. I even remember him once explaining that he doesn't want to advertise severity levels in his releases, precisely so that users upgrade when they feel confident rather than only when a release is flagged as really important. Use cases differ so much between users that some might absolutely need to upgrade to fix a driver that's going to ruin their data, while others might prefer to wait because a later fix could cause serious availability issues.

I have personally been complained at by Greg for fixing a stable kernel regression via cherry-pick, rather than shipping the latest release directly to distro users. I've seen similarly aggressive messaging in other venues. In fact, the standard release announcement says:

All users of the x.y kernel series must upgrade.

If downstream users are intended to take a more cautious approach, the messaging should be clarified to reflect that.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 4:52 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (8 responses)

> but no action was taken to fix that release. 6.1.64 was released with the problem four days later.

You should really see this as a pipeline. Even if the issue was reported, you don't know whether it was noticed before 6.1.64 went out. What matters is that the issue was quickly fixed. Sure, we're still missing a way to tag certain versions as broken, as happened for 2.4.11, which was marked "dontuse" in the download repository. But it's important to understand that the constant flow of fixes makes it hard to cancel a release instantly.

I would not be shocked to see three consecutive kernels released and tagged as "ext4 broken" for the time it takes to learn of the breakage and fix it.

> I have personally been complained at by Greg for fixing a stable kernel regression via cherry-pick, rather than shipping the latest release directly to distro users.

Here you're speaking about cherry-picking fixes. That's something extremely dangerous that nobody should ever do, yet some distros have been doing it for a while, sometimes shipping kernels that remained vulnerable for months or years due to this bad practice. The reason for recommending against cherry-picking is very simple (and has been explained at length at multiple conferences): the ONLY combinations of kernel patches that are both tested and supported by the subsystem maintainers are the mainline and stable ones. If you assemble any other combination of patches, nobody knows whether they work well together or whether another important patch is missing (as happened above). Here the process worked because developers reported the missing patches. Imagine if you had taken that single patch yourself: nobody would have known, and you could have corrupted a lot of your users' filesystems.

So please, for your users, never cherry-pick random patches from stable. Take the whole stable release, possibly a slightly older one if you're not comfortable with the latest changes, add your distro-specific patches on top of it, but do not pick what seems relevant to you; that will eventually end in disaster, and nobody will support you for having done it.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 9:58 UTC (Tue) by bgilbert (subscriber, #4738) [Link] (3 responses)

> Even if the issue was reported, you don't know whether it was noticed before 6.1.64 went out. What matters is that the issue was quickly fixed.

The message I linked above is dated November 24 and reported a regression in v6.1.64-rc1. The testing deadline for 6.1.64 was November 26, and it was released on November 28. That report was sufficient to cause a revert in 5.10.y and 5.15.y, so I don't think there can be an argument that not enough information was available.

The users who had data corruption, or who had to roll out an emergency fix to avoid data corruption, don't care that the issue was quickly fixed. They can always roll back to an older kernel if they need to. They care that the problem happened in the first place.

> The reason for recommending against cherry-picking is very simple (and has been explained at length at multiple conferences): the ONLY combinations of kernel patches that are both tested and supported by the subsystem maintainers are the mainline and stable ones. [...] Take the whole stable release, possibly a slightly older one if you're not comfortable with the latest changes, add your distro-specific patches on top of it, but do not pick what seems relevant to you; that will eventually end in disaster, and nobody will support you for having done it.

What are you talking about? If I ship a modified kernel and it breaks, of course no one will support me for having done so. If I ship an unmodified stable kernel and it breaks, no one will support me then either! The subsystem maintainers aren't going to help with my outage notifications, my users, or my emergency rollout. As with any downstream, I'm ultimately responsible for what I ship.

In the case mentioned upthread, my choices were: a) cherry-pick a one-line fix for the userspace ABI regression, or b) take the entire diff from 4.14.96 to 4.14.97: 69 patches touching 92 files, +1072/-327 lines. Option b) is simply not defensible release engineering. If I can't hotfix a regression without letting in a bunch of unrelated code, I'll never converge to a kernel that's safe to ship. That would arguably be true even if stable kernels didn't have a history of user-facing regressions, which they certainly did.

This discussion is a great example of the problem I'm trying to describe. Stable kernels are aggressively advertised as the only safe kernels to run, but there's plenty of evidence that they aren't safe, and the stable maintainers tend to denigrate and dismiss users' attempts to point out the structural problems — or even to work around them! These problems can be addressed, as I said, with tools, testing, and developer time. There is always, always, always room for improvement. But that will only happen if the stable team decides to make improvement a priority.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 17:55 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (2 responses)

> The message I linked above is dated November 24 and reported a regression in v6.1.64-rc1. The testing deadline for 6.1.64 was November 26, and it was released on November 28. That report was sufficient to cause a revert in 5.10.y and 5.15.y, so I don't think there can be an argument that not enough information was available.

Yes, but if you read Greg's response, it's obvious there was a misunderstanding, and no one else jumped on that thread to ask about the other kernels. Sh*t happens:

> > and on the following RC's:
> > * v5.10.202-rc1
> > * v5.15.140-rc1
> > * v6.1.64-rc1
> >
> > (Note that the list might not be complete, because some branches failed to execute completely due to build issues reported elsewhere.)
> >
> > Bisection in linux-5.15.y pointed to:
> >
> > commit db85c7fff122c14bc5755e47b51fbfafae660235
> > Author: Jan Kara <jack@suse.cz>
> > Date: Fri Oct 13 14:13:50 2023 +0200
> >
> > ext4: properly sync file size update after O_SYNC direct IO
> > commit 91562895f8030cb9a0470b1db49de79346a69f91 upstream.
> >
> >
> > Reverting that commit made the test pass.
>
> Odd. I'll go drop that from 5.10.y and 5.15.y now, thanks.

I mean, it's always the same every time there is a regression: users jump the gun and explain what OUGHT to have been done, except that, unsurprisingly, they were not there to do it at the time either. I don't know when everyone will understand that maintaining a working kernel is a collective effort, and that when there's a failure, it's a collective failure.

> If I can't hotfix a regression without letting in a bunch of unrelated code, I'll never converge to a kernel that's safe to ship.

There are two safe possibilities for this:
- either you identify the offending commit, ask its author what he thinks about removing it, and do that;
- or you roll back to the latest known-good kernel. Upgrades are frequent enough to allow rollbacks. Seriously...

And in both cases it's important to insist on having a fixed version so that the people involved have their say on the topic (including "take this fix instead, it's ugly but safer for now"). What matters in the end is end users' safety, so picking a bunch of fixes that have not yet been subject to all these tests is not a good solution at all. And by the way, the problem was found during the test period, which proves that testing is useful and effective at finding some regressions. It's "just" that the rest of the process messed up here.

> Stable kernels are aggressively advertised as the only safe kernels to run, but there's plenty of evidence that they aren't safe, and the stable maintainers tend to denigrate and dismiss users' attempts to point out the structural problems

No, not at all. There's no such thing as "they are safe" or "they aren't safe". Safety is not a boolean, it's a metric. And maintainers do not dismiss users' attempts; on the contrary, these attempts are welcome and adopted when they prove useful, such as all the tests that are run for each and every release. It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "give me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).

Ext4 data corruption in stable kernels

Posted Dec 13, 2023 4:59 UTC (Wed) by bgilbert (subscriber, #4738) [Link] (1 responses)

> Yes, but if you read Greg's response, it's obvious there was a misunderstanding, and no one else jumped on that thread to ask about the other kernels. Sh*t happens:

Yup, agreed. Process failures happen; they should lead to process improvements. Asking for more testers isn't going to solve this one.

> And in both cases it's important to insist on having a fixed version so that the involved people have their say on the topic (including "take this fix instead, it's ugly but safer for now").

I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.

> And maintainers do not dismiss whatever users' attempts, on the opposite, these attempts are welcome and adopted when they prove to be useful, such as all the tests that are run for each and every release. It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "let me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).

Every open-source maintainer gets complaints that the software is not meeting users' needs. Those users often aren't in a position to fix the software themselves, they may have suggestions which don't account for the full complexity of the problem, and they may not even fully understand their own needs. Even when a maintainer needs to reject a suggestion (and they should, often!) the feedback is still a great source of information about where improvements might be useful. And sometimes a suggestion contains the seed of a good idea. Even if the people in this comment section are wrong about a lot of the details, I'm sure there's at least one idea here that's worth exploring.

As you said in another subthread, the existing stable kernel process has worked remarkably well for its scale. But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers. BitKeeper and then Git allowed kernel development to scale to today's levels, but those tools could never have succeeded if key maintainers hadn't actively embraced them and encouraged their use. At the end of the day, while a lot of the day-to-day work can be handled by any skilled contributor, the direction of a project must be set by its maintainers.

Ext4 data corruption in stable kernels

Posted Dec 13, 2023 5:44 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

> I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.

OK, got it. And yes, for such rare cases where the fix is already accepted by maintainers and validated, I agree that it remains a reasonable approach.

> But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers.

That's totally true, but it's also important to keep in mind that maintainers are scarce and already overloaded, and that asking them to experiment with random changes is the best way to waste their time or make them feel their work is useless. Coming with a PoC and saying "don't you think something like this could improve your work?" is a lot different from "you should just do this or that". Maintainers are not short of suggestions; they come from everywhere, all the time. Remember how many times it was suggested that Linus switch to SVN before Git appeared? If all those who suggested it had actually tried it before speaking, they would have had their answer and avoided looking like fools.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 10:19 UTC (Tue) by geert (subscriber, #98403) [Link] (3 responses)

Playing the devil's advocate (which can be considered appropriate for v6.6.6 ;-) )

> Here you're speaking about cherry-picking fixes. That's something extremely dangerous that nobody must ever do [...]

But stable is also cherry-picking some changes and not others?!?!? Nobody knows if they work well together or if another important patch is missing...

The only solution is to follow mainline ;-)

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 18:05 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (2 responses)

> But stable is also cherry-picking some changes, but not others?!?!? Nobody knows if they work well together or if another important patch is missing...

That has always been the case. For a long time I used to say that the kernels I was releasing were certainly full of bugs (otherwise there would be no need to issue future releases), but the difference with the ones people build in their garage is that the official stable ones are the result of:
- reviews from all the patch authors
- tests from various teams and individuals.

I.e. they are much better-known quantities than other combinations.

One might say that patch authors do not send a lot of feedback, but there are regularly one or two responses in a series, either asking for another patch if one is picked, or suggesting not to pick one, so that works as well. And the tests are invaluable. When I picked up 2.6.32 and Greg insisted that I now had to follow the whole review process, I was really annoyed because it doubled my work. But seeing suggestions to fix around 10 patches per series based on review and testing showed me the garbage I used to provide before this process.

That's why I'm saying: people complain, but the process works remarkably well given the number of patches and the number of regressions. I remember the era of early 2.6, where you would have been foolish to run a stable version before .10 or so. I've had on one of my machines a 4.9.2 that I never updated for 5 years for some reason, and it never messed up on me. I'm not advocating for not updating, but I mean that mainline is much more stable than it used to be, and stable sees very few regressions.

> The only solution is to follow mainline ;-)

That's what Linus sometimes says as well. That's where you get the latest fixes and the latest bugs as well. It doesn't mean the balance is necessarily bad, but it's more adventurous :-)

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 18:49 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

As an aside, I've noted more than once in my career that there's a deep tradeoff in dependency handling here:

  • I can stick with an old version, and keep trying to patch it to have fewer bugs but no new features. This is less work week-by-week, but when I do hit a significant bug where I can't find a fix myself, upstream is unlikely to be much help (because I'm based on an "ancient" codebase from their point of view).
  • I can keep up to date with the latest version, with new features coming in all the time, doing more work week-by-week, but not having the "big leaps" to make, and having upstream much more able to help me fix any bugs I find, because I'm basing my use on a codebase that they work on every day.

For example, keeping up with latest Fedora releases is harder week-by-week than keeping up with RHEL major releases; but getting support from upstreams for the versions of packages in Fedora is generally easier than getting support for something in the last RHEL major release, because it's so much closer to their current code; further, it's generally easier to go from a "latest Fedora" version of something to "latest upstream development branch" than to go from "latest RHEL release" to "latest upstream development branch" and find patches yourself that way.

Ext4 data corruption in stable kernels

Posted Dec 13, 2023 5:54 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

For me that's a tooling problem above all. For users' convenience, most of the time you can just upgrade to the latest version, since it's supposedly better. The fact is that it's *often* better but not always, and for some users the few cases where it's not are so much worse that they'd prefer not to *take the risk* of updating. This is exactly the root cause of the problem.

What I'm suggesting is to update in small jumps, but not to the latest version, which still lacks feedback. If you see, say, 6.1.66 being released, and you consider that the month-old 6.1.62 looks correct because nobody complained about it, and nobody recently announced a critical update that absolutely everyone must apply, then you can just update to 6.1.62 (or any surrounding release that was reported to work well). This gives you a month of feedback on the kernels you choose from, doesn't require too-frequent updates, and doesn't require living on the bleeding edge (i.e. less risk of regressions).

That's obviously not rocket science and will not always work, but this approach lets you skip big regressions with immediate impact, and generally saves you from having to update twice in a row.
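As a rough illustration of this "small jumps" policy (entirely my own sketch; the version numbers, ages, and issue flags are made up), the selection rule could look like:

```python
def pick_update(releases, min_age_days=30):
    """releases: list of (version, age_days, has_known_issue) tuples,
    newest last. Return the newest release that is old enough to have
    accumulated feedback and has no reported problem, or None if
    nothing qualifies (in which case: stay on the current kernel)."""
    for version, age, has_issue in reversed(releases):
        if age >= min_age_days and not has_issue:
            return version
    return None

# Hypothetical release history at the moment 6.1.66 comes out:
releases = [
    ("6.1.62", 35, False),
    ("6.1.63", 28, False),
    ("6.1.64", 14, True),   # e.g. a known regression
    ("6.1.65", 7,  False),
    ("6.1.66", 1,  False),  # just released, little feedback yet
]
print(pick_update(releases))  # -> 6.1.62
```

The heuristic trades a few weeks of fixes for a few weeks of other people's feedback, which is exactly the bargain described above.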

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 0:16 UTC (Tue) by roc (subscriber, #30627) [Link] (5 responses)

> I've counted 17 people responding to that thread with test reports, some of which indicate boot failures, others successes, on a total of around 910 systems covering lots of architectures, configs and setup.

Relying on volunteers to manually build and boot RC kernels is both inefficient and inadequate. There should be dedicated machines that automatically build and boot those kernels AND run as many automated tests as can be afforded given the money and time available. With some big machines and 48 hours you can run a lot of tests.

This isn't asking for much. This is what other mature projects have been doing for years.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 4:55 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (4 responses)

> There should be dedicated machines that automatically build and boot those kernels AND run as many automated tests as can be afforded given the money and time available. With some big machines and 48 hours you can run a lot of tests.
>
> This isn't asking for much. This is what other mature projects have been doing for years.

Well, if you and/or your employer can provide this (hardware and the manpower to operate it), I'm sure everyone will be extremely happy. Greg is constantly asking for more testers. You're speaking as if some offer of help had been rejected; resources like this don't fall from the sky. Also, you seem to know what tests to run on them, so please do! All the testers I mentioned run their own tests from different (and sometimes overlapping) sets, and that's extremely useful.

But when someone says "this or that should be done", the question remains: by whom, if not by the one suggesting it?

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 7:16 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

Regarding which tests to run: as bgilbert said: "Many other projects have CI tests that are required to pass before a new release can ship. If that had been the case for LTP, this regression would have been avoided." Of course the LTP test *did* run; it's not just about having the tests and running the tests, but also gating the release on positive test results.

As it happens my rr co-maintainer Kyle Huey does regularly test RC kernels against rr's regression test suite, and has found (and reported) a few interesting bugs that way. But really the Linux Foundation or some similar organization should be responsible for massive-scale automated testing of upstream kernels. Lots of companies stand to benefit financially from more reliable Linux releases, and as I understand it, the LF exists to channel those common interests into funding.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 18:11 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

> But really the Linux Foundation or some similar organization should be responsible for massive-scale automated testing of upstream kernels

But why is it that, every time something happens, lots of people assume there is surely an entity somewhere whose job should be to fix it? Why?

You seem to have an idea of the problem and its solution, so why are you not offering your help? Because you don't have the time? And what makes you think the problem is too large for you but very small for someone else? What makes you think there are people idling all day, waiting for this task to be assigned to them? And what if, instead, you hadn't analyzed it completely in its environment, with all of its dependencies and impacts, and it was much harder to put in place?

It's easy to complain, really easy. If all the energy spent complaining about the current state every time there's a problem had been spent fixing it, maybe we wouldn't be speaking about this issue in the first place.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 10:11 UTC (Tue) by bgilbert (subscriber, #4738) [Link] (1 responses)

Suppose the stable team announced their intention to gate stable releases on automated testing, and put out a call for suitable test suites. Test suites could be required to meet a defined quality bar (low false positive rate, completion within the 48-hour review period, automatic bisection), and any suite that repeatedly failed to meet the bar could be removed from the test program. If no one at all stepped up to offer their tests, I would be shocked.

The stable team wouldn't need to own the test runners, just the reporting API, and the API could be quite simple. I agree with roc that the Linux Foundation should take some financial responsibility here, but I suspect some organizations would run tests and contribute results even if no funding were available.
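To make the proposal concrete, here is a minimal sketch (purely hypothetical; no such tool exists in the stable workflow, and the suite names and data are invented) of the gating logic: a release ships only if every suite currently meeting the quality bar reports a pass within the review window, and suites that repeatedly fail the bar are dropped from the program.

```python
def gate_release(reports, deadline_hours=48, max_strikes=3):
    """reports: dict mapping suite name to a dict with
    'passed' (True/False, or None if no result arrived),
    'hours' (time the suite took), and 'strikes' (past failures
    to meet the quality bar). Returns (ship, trusted_suites)."""
    # Suites with too many strikes no longer gate the release.
    trusted = {name: r for name, r in reports.items()
               if r["strikes"] < max_strikes}
    # Ship only if every trusted suite passed within the deadline.
    # (An empty trusted set would vacuously ship; a real gate would
    # also require a minimum quorum of suites.)
    ship = all(r["passed"] is True and r["hours"] <= deadline_hours
               for r in trusted.values())
    return ship, sorted(trusted)

reports = {
    "ltp":         {"passed": False, "hours": 12, "strikes": 0},  # regression caught
    "kselftest":   {"passed": True,  "hours": 20, "strikes": 0},
    "flaky-suite": {"passed": None,  "hours": 60, "strikes": 3},  # ejected from program
}
print(gate_release(reports))  # -> (False, ['kselftest', 'ltp'])
```

Under this rule the ext4 case would have blocked the release: the failing LTP run is a trusted suite reporting a failure before the deadline, so the -rc cannot be promoted.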

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 18:13 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

> Suppose the stable team announced their intention to gate stable releases on automated testing, and put out a call for suitable test suites. Test suites could be required to meet a defined quality bar (low false positive rate, completion within the 48-hour review period, automatic bisection), and any suite that repeatedly failed to meet the bar could be removed from the test program.

OK so something someone has to write and operate.

> If no one at all stepped up to offer their tests, I would be shocked.

Apparently it's just been proposed, by you. When will we benefit from your next improvements to the process?

Ext4 data corruption in stable kernels

Posted Dec 11, 2023 14:02 UTC (Mon) by farnz (subscriber, #17727) [Link]

Greg's position is a lot less concrete than that - it's "I make no assertions about whether or not any given batch of patches fixes bugs you care about; if you want all the fixes I think you should care about, then you must take the latest batch". Whether you want all the fixes that Greg thinks you should is your decision - but he makes no statement about what subset of stable patches you should pick in that case.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds