Ext4 data corruption in stable kernels

Posted Dec 11, 2023 13:57 UTC (Mon) by wtarreau (subscriber, #51152)
In reply to: Ext4 data corruption in stable kernels by birdie
Parent article: Ext4 data corruption in stable kernels

> The real larger problem Linux fans never talk about: very poor/inadequate/missing QA/QC in the Linux kernel.

Compared to what, and by which metric and unit?

The latest kernel was run on ~910 systems by 17 people, who found issues that were fixed before the release:

https://lore.kernel.org/all/20231205031535.163661217@linu...

If you have a good plan for something better that doesn't bring the process to a halt, I'm sure everyone would be interested to hear about it. The stable team is always seeking more testers; feel free to join.

Ext4 data corruption in stable kernels

Posted Dec 11, 2023 19:54 UTC (Mon) by mat2 (guest, #100235)

From the mail you quoted:

> Subject: [PATCH 6.6 000/134] 6.6.5-rc1 review
> Date: Tue, 5 Dec 2023 12:14:32 +0900

[snip]

> Responses should be made by Thu, 07 Dec 2023 03:14:57 +0000.
> Anything received after that time might be too late.

I think that the time available for testing stable release candidates is too short (~48 hours). Some bugs (such as this one) are visible only after some usage period.
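As a quick check of the ~48 hour figure, here is a minimal sketch that computes the window from the two timestamps in the quoted mail. It assumes the `Date:` header marks the start of the review period; the variable names are only illustrative.

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the quoted 6.6.5-rc1 announcement.
# Assumption: the Date: header is when the review window opened.
posted = parsedate_to_datetime("Tue, 5 Dec 2023 12:14:32 +0900")
deadline = parsedate_to_datetime("Thu, 07 Dec 2023 03:14:57 +0000")

# Both datetimes carry their UTC offset, so the subtraction is timezone-safe.
window = deadline - posted
hours = window.total_seconds() / 3600
print(f"review window: {window} (~{hours:.1f} hours)")
# prints "review window: 2 days, 0:00:25 (~48.0 hours)"
```

Note the +0900 offset on the announcement: in UTC it went out at 03:14:32 on Dec 5, so the window is almost exactly two days.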

Longer windows also mean more testers. For example, Ubuntu's mainline PPA ( https://kernel.ubuntu.com/mainline/ ) might build such stable RCs, similar to how it builds normal kernels.

So perhaps the stable testing period should be made longer, like 4-5 days.

I'll try to test stable RCs myself. Is there a mailing list one can subscribe to for notifications about these releases (just notifications, without all the patches)?

Ext4 data corruption in stable kernels

Posted Dec 11, 2023 19:55 UTC (Mon) by pizza (subscriber, #46)

> So perhaps the stable testing period should be made longer, like 4-5 days.

No matter what period is chosen, it will simultaneously be too short for some, and too long for others.

Ext4 data corruption in stable kernels

Posted Dec 12, 2023 5:08 UTC (Tue) by wtarreau (subscriber, #51152)

That's a valid point, though over time the period has adapted to how long it takes active participants to send their reports. If you wait too long, testers only start testing at the end of the period, and during that whole time users stay needlessly exposed to unfixed bugs (including the one that this fix was meant to address). And I agree that if the period is too short, you get fewer opportunities to test.

If you think you could regularly participate in testing with a bit more time, you should suggest this publicly. I'm sure Greg is open to adjusting his cycle a little to permit more testing, but it needs to be done for something real, not just suppositions. Keep in mind that he's the person who has released the largest number of kernels and has accumulated a lot of experience about what happens before and after, and by now he definitely knows how people react at various points in the cycle.

When I was maintaining extended LTS kernels, I also got used to how my users would react. I knew that one distro would test during the week after -rc, so I would leave one week for testing. I also knew that nobody would test the release for a month after it came out; that was specific to these use cases, where users don't upgrade often and prefer to wait for the right moment. So in my head a release was not confirmed until about one month after it was issued, which often required quickly issuing another one to fix some issues.

And nowadays I'm pretty sure that the feedback and behavior on 6.6 is not the same at all as with 5.4 or 4.14!

In haproxy we have far fewer changes per stable release, and we announce our own level of trust in the version. That's possible because the maintainers doing the backports have already been involved in a lot of these fixes and have hesitated over some of the backports. So we just indicate whether we're really confident in the release or whether it should be taken with special care. Users appreciate it a lot and help us in return by reporting suspected issues. I don't think it would work well for the kernel, because stable maintainers receive an avalanche of fixes from various sources and it's very hard to have an idea of the impact of these patches. Subsystem maintainers are the ones who know best, immediately followed by testers, so it's quite hard to give an assessment of how much a version can be trusted. In an ideal world, some subsystem maintainers could indicate "be careful" and that would raise a warning. But here it wouldn't have worked, since that was already a fix for a serious problem.

Fixes that break stuff are the worst ones to deal with because they create a lot of confusion. And BTW security fixes are very often in this category, which is why we insist a lot on having them discussed publicly as much as possible.