Ext4 data corruption in stable kernels

Posted Dec 12, 2023 17:55 UTC (Tue) by wtarreau (subscriber, #51152)
In reply to: Ext4 data corruption in stable kernels by bgilbert
Parent article: Ext4 data corruption in stable kernels

> The message I linked above is dated November 24 and reported a regression in v6.1.64-rc1. The testing deadline for 6.1.64 was November 26, and it was released on November 28. That report was sufficient to cause a revert in 5.10.y and 5.15.y, so I don't think there can be an argument that not enough information was available.

Yes but if you read Greg's response, it's obvious there has been a misunderstanding, and noone else jumped on that thread to ask for the other kernels. Sh*t happens:

> > and on the following RC's:
> > * v5.10.202-rc1
> > * v5.15.140-rc1
> > * v6.1.64-rc1
> >
> > (Note that the list might not be complete, because some branches failed to execute completely due to build issues reported elsewhere.)
> >
> > Bisection in linux-5.15.y pointed to:
> >
> > commit db85c7fff122c14bc5755e47b51fbfafae660235
> > Author: Jan Kara <jack@suse.cz>
> > Date: Fri Oct 13 14:13:50 2023 +0200
> >
> > ext4: properly sync file size update after O_SYNC direct IO
> > commit 91562895f8030cb9a0470b1db49de79346a69f91 upstream.
> >
> >
> > Reverting that commit made the test pass.
>
> Odd. I'll go drop that from 5.10.y and 5.15.y now, thanks.

I mean, it's always the same every time there is a regression: users jump on their gun and explain what OUGHT to have been done, except that unsurprisingly they were not there either to do it by then. I don't know when everyone will understand that maintaining a working kernel is a collective effort, and that when there's a failure it's a collective failure.

> If I can't hotfix a regression without letting in a bunch of unrelated code, I'll never converge to a kernel that's safe to ship.

There are two safe possibilities for this:
- either you ask the identified wrong commit and ask its author what he thinks about removing it and you do that;
- or you roll back to the latest known good kernel. Upgrades are frequent enough to allow rollbacks. Seriously...

And in both cases it's important to insist on having a fixed version so that the involved people have their say on the topic (including "take this fix instead, it's ugly but safer for now"). What matters in the end is end-users' safety, so picking a bunch of fixes that have not yet been subject to all these tests is not a good solution at all. And by the way the problem was found during the test period, which proves that testing is useful and effective at finding some regressions. It's "just" that the rest of process messed up there.

> Stable kernels are aggressively advertised as the only safe kernels to run, but there's plenty of evidence that they aren't safe, and the stable maintainers tend to denigrate and dismiss users' attempts to point out the structural problems

No, not at all. There's no such "they are safe" nor "they aren't safe". Safety is not a boolean, it's a metric. And maintainers do not dismiss whatever users' attempts, on the opposite, these attempts are welcome and adopted when they prove to be useful, such as all the tests that are run for each and every release. It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "let me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).

Ext4 data corruption in stable kernels

Posted Dec 13, 2023 4:59 UTC (Wed) by bgilbert (subscriber, #4738) [Link] (1 responses)

> Yes but if you read Greg's response, it's obvious there has been a misunderstanding, and noone else jumped on that thread to ask for the other kernels. Sh*t happens:

Yup, agreed. Process failures happen; they should lead to process improvements. Asking for more testers isn't going to solve this one.

> And in both cases it's important to insist on having a fixed version so that the involved people have their say on the topic (including "take this fix instead, it's ugly but safer for now").

I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.

> And maintainers do not dismiss whatever users' attempts, on the opposite, these attempts are welcome and adopted when they prove to be useful, such as all the tests that are run for each and every release. It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "let me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).

Every open-source maintainer gets complaints that the software is not meeting users' needs. Those users often aren't in a position to fix the software themselves, they may have suggestions which don't account for the full complexity of the problem, and they may not even fully understand their own needs. Even when a maintainer needs to reject a suggestion (and they should, often!) the feedback is still a great source of information about where improvements might be useful. And sometimes a suggestion contains the seed of a good idea. Even if the people in this comment section are wrong about a lot of the details, I'm sure there's at least one idea here that's worth exploring.

As you said in another subthread, the existing stable kernel process has worked remarkably well for its scale. But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers. BitKeeper and then Git allowed kernel development to scale to today's levels, but those tools could never have succeeded if key maintainers hadn't actively embraced them and encouraged their use. At the end of the day, while a lot of the day-to-day work can be handled by any skilled contributor, the direction of a project must be set by its maintainers.

Ext4 data corruption in stable kernels

Posted Dec 13, 2023 5:44 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

> I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.

OK got it, and yes for such rare cases where the fix is already accepted by maintainers and validated, I agree that it remains a reasonable approach.

> But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers.

That's totally true, but it's also important to keep in mind the fact that maintainers are scarce and already overloaded, and that asking them to experiment with random changes is the best way to waste their time or make them feel their work is useless. Coming with a PoC saying "don't you think something like this could improve your work" is a lot different from "you should just do this or that". Maintainers do not miss suggestions that come from everywhere all the time. Remember how many times Linus was suggested to switch to SVN before Git appeared ? If all those who suggested it had actually just tried prior to speaking, they would have had their response and avoided to look like fools.