LWN: Comments on "Ext4 data corruption in stable kernels" https://lwn.net/Articles/954285/ This is a special feed containing comments posted to the individual LWN article titled "Ext4 data corruption in stable kernels".
Ext4 data corruption in stable kernels https://lwn.net/Articles/955510/ https://lwn.net/Articles/955510/ farnz <p>I would marginally disagree; it is possible for a bug not to be a security bug. The difficulty is distinguishing bugs with no security relevance from those with security relevance, given that the kernel's overall threat model is very broad. <p>For example, a bug where the kernel sometimes clears the LSB of the blue channel of 16 bpc RGB colour on a DisplayPort link is almost certainly completely irrelevant; at the sorts of brightness levels monitors can reach today, the difference between 16 bits each of R and G plus 15 bits of B, and 16 bits each of R, G and B, is below human perception. <p>But the challenge is that, from my perspective, a bug in the kernel driver for a 100G Ethernet chip that connects via PCIe is completely irrelevant - I have no systems with that sort of hardware, nor is there a way for an attacker to add that hardware without my knowledge. Similarly, a bug in iSCSI that can only be tickled once iSCSI is in use is not security-relevant to me, since I have no iSCSI set up, so to tickle the bug, the attacker needs remote code execution already. From the perspective of a company running big servers that access bulk storage over iSCSI using 100G Ethernet, however, both of those bugs can be security bugs. <p>Should those bugs be "security" bugs, since if you happen to have the problematic setup, they're relevant? Or should they not be security bugs, since most people don't have either 100G Ethernet or iSCSI setups? Mon, 18 Dec 2023 11:38:37 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/955490/ https://lwn.net/Articles/955490/ jschrod <div class="FormattedComment"> As a Debian user, I want to thank you for the work that you're doing.<br> <p> I have no other venue -- please know that I'm amazed at the fantastic job that you and your Debian colleagues are doing!!!<br> </div> Mon, 18 Dec 2023 01:49:10 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/955489/ https://lwn.net/Articles/955489/ jschrod <div class="FormattedComment"> Well, *all* bug fixes are security fixes.<br> <p> You cannot know in advance which bugs might be exploited, be it on a technical or a social level. This has been demonstrated so many times that I cannot believe it has to be spelled out.<br> <p> *Every* bug is a security bug. If you think it isn't, the future will tell you differently.<br> </div> Mon, 18 Dec 2023 01:43:42 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954841/ https://lwn.net/Articles/954841/ jan.kara <div class="FormattedComment"> Correct. Only written files could have been corrupted. Furthermore, the corruption can happen only if the application uses direct IO (open with the O_DIRECT flag) to write to the file, which is not that common. Finally, the data corruption happened because the file position was not properly updated after the write. So cases where the file position is always set to the desired value before starting the write (such as using AIO, which always requires an offset; using the pwrite(2) call; or calling lseek(2)) were not affected. 
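<p> As a purely illustrative, hypothetical sketch of the affected and unaffected patterns described above (the file name and sizes are invented, and error handling is reduced to early exits):
<pre>
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);

    /* O_DIRECT requires suitably aligned buffers and sizes (4096 is typical). */
    if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /*
     * Affected pattern: plain write(2) depends on the kernel advancing the
     * file position after each call.  With the buggy backport the position
     * was not updated, so the second write could land on top of the first
     * 4 KiB block instead of after it.
     */
    if (write(fd, buf, 4096) != 4096 || write(fd, buf, 4096) != 4096)
        return 1;

    /*
     * Unaffected pattern: the offset is supplied explicitly on every call,
     * so the (possibly stale) file position is never consulted.
     */
    if (pwrite(fd, buf, 4096, 0) != 4096 ||
        pwrite(fd, buf, 4096, 4096) != 4096)
        return 1;

    free(buf);
    return close(fd);
}
</pre>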
So far I'm not aware of any application that would actually end up corrupting its data due to this bug. But the potential is certainly there...<br> </div> Wed, 13 Dec 2023 10:50:16 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954819/ https://lwn.net/Articles/954819/ wtarreau <div class="FormattedComment"> For me that's a tooling problem above all. For users' convenience, most of the time you can just upgrade to the latest version, since it's supposedly better. The fact is that it's *often* better but not always, and for some users, the few cases where it's not better are so much worse that they'd prefer not to *take the risk* of updating. This is exactly the root cause of the problem.<br> <p> What I'm suggesting is to update in small jumps, but not to the latest version, which still lacks feedback. If you see, say, 6.1.66 being released, and you consider that the month-old 6.1.62 looks correct because nobody complained about it, and nobody recently spoke about a critical urgent update that everyone absolutely must apply, then you could just update to 6.1.62 (or any surrounding one that was reported to work pretty well). This leaves you a month of feedback on these kernels to choose from, doesn't require too-frequent updates and doesn't require living on the bleeding edge (i.e. fewer risks of regressions).<br> <p> That's obviously not rocket science and will not always work, but this approach allows you to skip big regressions with immediate impact, and generally saves you from having to update twice in a row.<br> </div> Wed, 13 Dec 2023 05:54:59 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954818/ https://lwn.net/Articles/954818/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.</span><br> <p> OK, got it, and yes, for such rare cases where the fix is already accepted by maintainers and validated, I agree that it remains a reasonable approach.<br> <p> <span class="QuotedText">&gt; But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers.</span><br> <p> That's totally true, but it's also important to keep in mind that maintainers are scarce and already overloaded, and that asking them to experiment with random changes is the best way to waste their time or make them feel their work is useless. Coming with a PoC and saying "don't you think something like this could improve your work?" is a lot different from "you should just do this or that". Maintainers are not short of suggestions; those come from everywhere all the time. Remember how many times people suggested that Linus switch to SVN before Git appeared? If all those who suggested it had actually tried it before speaking, they would have had their answer and avoided looking like fools.<br> </div> Wed, 13 Dec 2023 05:44:50 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954813/ https://lwn.net/Articles/954813/ bgilbert <div class="FormattedComment"> <span class="QuotedText">&gt; Yes, but if you read Greg's response, it's obvious there has been a misunderstanding, and no one else jumped on that thread to ask for the other kernels. Sh*t happens:</span><br> <p> Yup, agreed. Process failures happen; they should lead to process improvements. 
Asking for more testers isn't going to solve this one.<br> <p> <span class="QuotedText">&gt; And in both cases it's important to insist on having a fixed version so that the involved people have their say on the topic (including "take this fix instead, it's ugly but safer for now").</span><br> <p> I think we're talking past each other here. The fix for 4.14.96 had already landed in 4.14.97. I backported one patch from it, rather than taking the entire release.<br> <p> <span class="QuotedText">&gt; And maintainers do not dismiss users' attempts; on the contrary, these attempts are welcome and adopted when they prove useful, such as all the tests that are run for each and every release. It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "give me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).</span><br> <p> Every open-source maintainer gets complaints that the software is not meeting users' needs. Those users often aren't in a position to fix the software themselves; they may have suggestions which don't account for the full complexity of the problem, and they may not even fully understand their own needs. Even when a maintainer needs to reject a suggestion (and they should, often!), the feedback is still a great source of information about where improvements might be useful. And sometimes a suggestion contains the seed of a good idea. Even if the people in this comment section are wrong about a lot of the details, I'm sure there's at least one idea here that's worth exploring.<br> <p> As you said in another subthread, the existing stable kernel process has worked remarkably well for its scale. But processes don't scale forever, and processes can't be improved without the participation (and probably the active commitment) of the actual maintainers. BitKeeper and then Git allowed kernel development to scale to today's levels, but those tools could never have succeeded if key maintainers hadn't actively embraced them and encouraged their use. At the end of the day, while a lot of the day-to-day work can be handled by any skilled contributor, the direction of a project must be set by its maintainers.<br> </div> Wed, 13 Dec 2023 04:59:48 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954758/ https://lwn.net/Articles/954758/ farnz <p>As an aside, I've noted more than once in my career that there's a deep tradeoff in dependency handling here: <ul> <li>I can stick with an old version, and keep trying to patch it to have fewer bugs but no new features. This is less work week-by-week, but when I do hit a significant bug where I can't find a fix myself, upstream is unlikely to be much help (because I'm based on an "ancient" codebase from their point of view). <li>I can keep up to date with the latest version, with new features coming in all the time, doing more work week-by-week, but not having the "big leaps" to make, and having upstream much more able to help me fix any bugs I find, because I'm basing my use on a codebase that they work on every day. 
</ul> <p>For example, keeping up with the latest Fedora releases is harder week-by-week than keeping up with RHEL major releases; but getting support from upstreams for the versions of packages in Fedora is generally easier than getting support for something in the last RHEL major release, because it's so much closer to their current code; further, it's generally easier to go from a "latest Fedora" version of something to "latest upstream development branch" than to go from "latest RHEL release" to "latest upstream development branch" and find patches yourself that way. Tue, 12 Dec 2023 18:49:47 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954757/ https://lwn.net/Articles/954757/ wtarreau <div class="FormattedComment"> So are you going to step up to review all these patches and categorize them yourself? Because most of the time their authors themselves have no idea that the bug they're fixing can have a security impact. That's the first part of the problem with the CVE circus.<br> </div> Tue, 12 Dec 2023 18:15:01 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954756/ https://lwn.net/Articles/954756/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; Suppose the stable team announced their intention to gate stable releases on automated testing, and put out a call for suitable test suites. Test suites could be required to meet a defined quality bar (low false positive rate, completion within the 48-hour review period, automatic bisection), and any suite that repeatedly failed to meet the bar could be removed from the test program.</span><br> <p> OK, so something someone has to write and operate.<br> <p> <span class="QuotedText">&gt; If no one at all stepped up to offer their tests, I would be shocked.</span><br> <p> Apparently it's just been proposed, by you. When will we benefit from your next improvements to the process?<br> </div> Tue, 12 Dec 2023 18:13:24 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954755/ https://lwn.net/Articles/954755/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; But really the Linux Foundation or some similar organization should be responsible for massive-scale automated testing of upstream kernels</span><br> <p> But why is it that every time something happens, lots of people consider that there is surely an entity somewhere whose job should be to fix it? Why?<br> <p> You seem to have an idea of the problem and its solution, so why are you not offering your help? Because you don't have the time for this? And what makes you think the problem is too large for you but very small for someone else? What makes you think that there are people idling all day, waiting for this task to be assigned to them so they can start working on it? And what if, instead, you haven't analyzed it completely in its environment, with all of its dependencies and impacts, and it is much harder to put in place?<br> <p> It's easy to always complain, really easy. If all the energy spent complaining about the current state every time there's a problem had been spent fixing it, maybe we wouldn't have been speaking about this issue in the first place.<br> <p> </div> Tue, 12 Dec 2023 18:11:54 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954754/ https://lwn.net/Articles/954754/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; But stable is also cherry-picking some changes, but not others?!?!? 
Nobody knows if they work well together or if another important patch is missing...</span><br> <p> That has always been the case. For a long time I used to say myself that the kernels I was releasing were certainly full of bugs (otherwise there would be no need to issue future releases), but the difference from the ones people build in their garage is that the official stable ones are the result of:<br> - reviews from all the patch authors<br> - tests from various teams and individuals.<br> <p> I.e. they are much better known than other combinations.<br> <p> One might say that patch authors do not send a lot of feedback, but there are regularly one or two responses in a series, either asking for another patch if one is picked, or suggesting not to pick one, so that works as well. And the tests are invaluable. When I picked up 2.6.32 and Greg insisted that now I had to follow the whole review process, I was really annoyed because it doubled my work. But seeing suggestions to fix around 10 patches per series based on review and testing showed me the garbage I used to provide before this process. That's why I'm saying: people complain, but the process works remarkably well given the number of patches and the number of regressions. I remember the era of early 2.6, where you would have been foolish to run a stable version before .10 or so. I've had on one of my machines a 4.9.2 that I never updated for 5 years for some reason, and it never messed up on me. I'm not advocating for not updating, but I mean that mainline is much stabler than it used to be and stable sees very few regressions.<br> <p> <span class="QuotedText">&gt; The only solution is to follow mainline ;-)</span><br> <p> That's what Linus sometimes says as well. That's where you get the latest fixes and the latest bugs as well. It doesn't mean the balance is necessarily bad, but it's more adventurous :-)<br> <p> </div> Tue, 12 Dec 2023 18:05:54 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954751/ https://lwn.net/Articles/954751/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; The message I linked above is dated November 24 and reported a regression in v6.1.64-rc1. The testing deadline for 6.1.64 was November 26, and it was released on November 28. That report was sufficient to cause a revert in 5.10.y and 5.15.y, so I don't think there can be an argument that not enough information was available.</span><br> <p> Yes, but if you read Greg's response, it's obvious there has been a misunderstanding, and no one else jumped on that thread to ask for the other kernels. 
Sh*t happens:<br> <p> <span class="QuotedText">&gt; &gt; and on the following RC's:</span><br> <span class="QuotedText">&gt; &gt; * v5.10.202-rc1</span><br> <span class="QuotedText">&gt; &gt; * v5.15.140-rc1</span><br> <span class="QuotedText">&gt; &gt; * v6.1.64-rc1</span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; (Note that the list might not be complete, because some branches failed to execute completely due to build issues reported elsewhere.)</span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; Bisection in linux-5.15.y pointed to:</span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; commit db85c7fff122c14bc5755e47b51fbfafae660235</span><br> <span class="QuotedText">&gt; &gt; Author: Jan Kara &lt;jack@suse.cz&gt;</span><br> <span class="QuotedText">&gt; &gt; Date: Fri Oct 13 14:13:50 2023 +0200</span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; ext4: properly sync file size update after O_SYNC direct IO</span><br> <span class="QuotedText">&gt; &gt; commit 91562895f8030cb9a0470b1db49de79346a69f91 upstream.</span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; </span><br> <span class="QuotedText">&gt; &gt; Reverting that commit made the test pass.</span><br> <span class="QuotedText">&gt; </span><br> <span class="QuotedText">&gt; Odd. I'll go drop that from 5.10.y and 5.15.y now, thanks.</span><br> <p> I mean, it's always the same every time there is a regression: users jump the gun and explain what OUGHT to have been done, except that, unsurprisingly, they were not there to do it at the time either. I don't know when everyone will understand that maintaining a working kernel is a collective effort, and that when there's a failure it's a collective failure.<br> <p> <span class="QuotedText">&gt; If I can't hotfix a regression without letting in a bunch of unrelated code, I'll never converge to a kernel that's safe to ship.</span><br> <p> There are two safe possibilities for this:<br> - either you take the identified bad commit, ask its author what he thinks about removing it, and do that;<br> - or you roll back to the latest known good kernel. Upgrades are frequent enough to allow rollbacks. Seriously...<br> <p> And in both cases it's important to insist on having a fixed version so that the involved people have their say on the topic (including "take this fix instead, it's ugly but safer for now"). What matters in the end is end-users' safety, so picking a bunch of fixes that have not yet been subject to all these tests is not a good solution at all. And by the way, the problem was found during the test period, which proves that testing is useful and effective at finding some regressions. It's "just" that the rest of the process messed up there.<br> <p> <span class="QuotedText">&gt; Stable kernels are aggressively advertised as the only safe kernels to run, but there's plenty of evidence that they aren't safe, and the stable maintainers tend to denigrate and dismiss users' attempts to point out the structural problems</span><br> <p> No, not at all. There's no such thing as "they are safe" or "they aren't safe". Safety is not a boolean, it's a metric. And maintainers do not dismiss users' attempts; on the contrary, these attempts are welcome and adopted when they prove useful, such as all the tests that are run for each and every release. 
It's just that there's a huge difference between proposing solutions and whining. Saying "you should have done that" or "if I were doing your job I would certainly not do it this way" is just whining. Saying "give me one extra day to run some more advanced tests myself" can definitely be part of a solution to improve the situation (and then you will be among those criticized for messing up from time to time).<br> </div> Tue, 12 Dec 2023 17:55:11 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954588/ https://lwn.net/Articles/954588/ geert <div class="FormattedComment"> Playing the devil's advocate (which can be considered appropriate for v6.6.6 ;-)<br> <p> <span class="QuotedText">&gt; Here you're speaking about cherry-picking fixes. That's something extremely dangerous that nobody must ever do [...]</span><br> <p> But stable is also cherry-picking some changes, but not others?!?!? Nobody knows if they work well together or if another important patch is missing...<br> <p> The only solution is to follow mainline ;-)<br> </div> Tue, 12 Dec 2023 10:19:35 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954563/ https://lwn.net/Articles/954563/ bgilbert <div class="FormattedComment"> Suppose the stable team announced their intention to gate stable releases on automated testing, and put out a call for suitable test suites. Test suites could be required to meet a defined quality bar (low false positive rate, completion within the 48-hour review period, automatic bisection), and any suite that repeatedly failed to meet the bar could be removed from the test program. If no one at all stepped up to offer their tests, I would be shocked.<br> <p> The stable team wouldn't need to own the test runners, just the reporting API, and the API could be quite simple. I agree with roc that the Linux Foundation should take some financial responsibility here, but I suspect some organizations would run tests and contribute results even if no funding were available.<br> </div> Tue, 12 Dec 2023 10:11:25 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954566/ https://lwn.net/Articles/954566/ bgilbert <blockquote> Even if the issue was reported you don't know if it was noticed before 6.1.64 was emitted. What matters is that the issue was quickly fixed. </blockquote> <p>The <a href="https://lore.kernel.org/stable/81a11ebe-ea47-4e21-b5eb-536b1a723168@linaro.org/">message I linked above</a> is dated November 24 and reported a regression in v6.1.64-rc1. The testing deadline for 6.1.64 was November 26, and it was released on November 28. That report was sufficient to cause a revert in 5.10.y and 5.15.y, so I don't think there can be an argument that not enough information was available. <p>The users who had data corruption, or who had to roll out an emergency fix to avoid data corruption, don't care that the issue was quickly fixed. They can always roll back to an older kernel if they need to. They care that the problem happened in the first place. <blockquote> The reason for recommending against cherry-picking is very simple (and was explained at length at multiple conferences): the ONLY combinations of kernel patches that are both tested and supported by the subsystem maintainers are the mainline and stable ones. [...] 
Take the whole stable, possibly a slightly older one if you don't feel comfortable with the latest changes, add your distro-specific patches on top of it, but do not pick what seems relevant to you; that will eventually result in a disaster and nobody will support you for having done this. </blockquote> <p>What are you talking about? If I ship a modified kernel and it breaks, of <i>course</i> no one will support me for having done so. If I ship an unmodified stable kernel and it breaks, no one will support me then either! The subsystem maintainers aren't going to help with my outage notifications, my users, or my emergency rollout. As with any downstream, I'm ultimately responsible for what I ship. <p>In the case mentioned upthread, my choices were: a) cherry-pick a one-line fix for the userspace ABI regression, or b) take the entire diff from 4.14.96 to 4.14.97: 69 patches touching 92 files, +1072/-327 lines. Option b) is simply not defensible release engineering. If I can't hotfix a regression without letting in a bunch of unrelated code, I'll never converge to a kernel that's safe to ship. That would arguably be true even if stable kernels didn't have a history of user-facing regressions, which they <a href="https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2018-September/005287.html">certainly did</a>. <p>This discussion is a great example of the problem I'm trying to describe. Stable kernels are aggressively advertised as the only safe kernels to run, but there's plenty of evidence that they aren't safe, and the stable maintainers tend to denigrate and dismiss users' attempts to point out the structural problems &mdash; or even to work around them! These problems can be addressed, as I said, with tools, testing, and developer time. There is always, always, always room for improvement. But that will only happen if the stable team decides to make improvement a priority. Tue, 12 Dec 2023 09:58:22 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954561/ https://lwn.net/Articles/954561/ roc <div class="FormattedComment"> Regarding which tests to run: as bgilbert said: "Many other projects have CI tests that are required to pass before a new release can ship. If that had been the case for LTP, this regression would have been avoided." Of course the LTP test *did* run; it's not just about having the tests and running the tests, but also gating the release on positive test results.<br> <p> As it happens, my rr co-maintainer Kyle Huey does regularly test RC kernels against rr's regression test suite, and has found (and reported) a few interesting bugs that way. But really the Linux Foundation or some similar organization should be responsible for massive-scale automated testing of upstream kernels. Lots of companies stand to benefit financially from more reliable Linux releases, and as I understand it, the LF exists to channel those common interests into funding.<br> </div> Tue, 12 Dec 2023 07:16:25 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954556/ https://lwn.net/Articles/954556/ wtarreau <div class="FormattedComment"> That's a valid point, though over its history the window has adapted to the time it takes for active participants to send their reports. If you wait too long, testers start testing at the end of the period, and during all that time frame users stay needlessly exposed to unfixed bugs (including the very one the pending release was meant to fix). 
And I agree that if the period is too short, you get fewer opportunities to test.<br> <p> If you think you could regularly participate in tests given a bit more time, you should suggest this publicly. I'm sure Greg is open to adjusting his cycle a little bit to permit more testing, but it needs to be done for something real, not just suppositions. Keep in mind that he's the person who has released the largest number of kernels and has accumulated a lot of experience about what happens before and after, and by now he definitely knows how people react at various periods of the cycle.<br> <p> When I was maintaining extended LTS kernels, I also got used to how my users would react. I knew that one distro would test during the week after the -rc, so I would leave one week for testing; then I knew that nobody would test it for one month following the release. That was specific to those use cases, where users don't upgrade often and prefer to wait for the right moment. So in my head a release was not confirmed until about one month after it was emitted, which often required quickly emitting another one to fix some issues.<br> <p> And nowadays I'm pretty sure that the feedback and behavior on 6.6 is not the same at all as with 5.4 or 4.14!<br> <p> In haproxy we have far fewer changes per stable release, and we announce our own level of trust in the version. That's possible because the maintainers doing the backports have already been involved in a lot of these fixes and have hesitated over some backports themselves. So we just indicate whether we're really confident in the release or whether it should be taken with special care. Users appreciate it a lot and help us in return by reporting suspected issues. I don't think it would work well for the kernel, because stable maintainers receive an avalanche of fixes from various sources and it's very hard to have an idea of the impact of these patches. Subsystem maintainers are the ones who know best, immediately followed by testers, so it's quite hard to give an assessment of how much a version can be trusted. In an ideal world, some subsystem maintainers could indicate "be careful" and that would raise a warning. But here it wouldn't have worked, since the culprit was already a fix for a serious problem.<br> <p> Fixes that break stuff are the worst ones to deal with because they create a lot of confusion. And BTW, security fixes are very often in this category, which is why we insist a lot on having them discussed publicly as much as possible.<br> </div> Tue, 12 Dec 2023 05:08:51 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954555/ https://lwn.net/Articles/954555/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; There should be dedicated machines that automatically build and boot those kernels AND run as many automated tests as can be afforded given the money and time available. With some big machines and 48 hours you can run a lot of tests.</span><br> <span class="QuotedText">&gt;</span><br> <span class="QuotedText">&gt; This isn't asking for much. This is what other mature projects have been doing for years.</span><br> <p> Well, if you and/or your employer can provide this (hardware and manpower to operate it), I'm sure everyone will be extremely happy. Greg is constantly asking for more testers. You're speaking as if some proposal for help had been rejected; resources like this don't fall from the sky. Also, you seem to know what tests to run on them, so please do! 
All the testers I mentioned run their own tests from different (and sometimes overlapping) sets, and that's extremely useful.<br> <p> But when someone says "this or that should be done", the question remains: by whom, if not the one suggesting it?<br> <p> </div> Tue, 12 Dec 2023 04:55:48 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954554/ https://lwn.net/Articles/954554/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; but no action was taken to fix that release. 6.1.64 was released with the problem four days later. </span><br> <p> You should really see that as a pipeline. Even if the issue was reported you don't know if it was noticed before 6.1.64 was emitted. What matters is that the issue was quickly fixed. Sure, we're still missing a way to tag certain versions as broken, as happened for 2.4.11, which was marked "dontuse" in the download repository. But it's important to understand that the constant flow of fixes doesn't make it easy to cancel a release instantly.<br> <p> I would not be shocked to see 3 consecutive kernels being emitted and tagged as "ext4 broken" there for the time it takes to learn of the breakage and fix it.<br> <p> <span class="QuotedText">&gt; I have personally been complained at by Greg for fixing a stable kernel regression via cherry-pick, rather than shipping the latest release directly to distro users.</span><br> <p> Here you're speaking about cherry-picking fixes. That's something extremely dangerous that nobody must ever do, yet something some distros have been doing for a while, sometimes shipping kernels that remained vulnerable for months or years due to this bad practice. The reason for recommending against cherry-picking is very simple (and was explained at length at multiple conferences): the ONLY combinations of kernel patches that are both tested and supported by the subsystem maintainers are the mainline and stable ones. If you perform any other assembly of patches, nobody knows if they work well together or if another important patch is missing (as happened above). Here the process worked fine because developers reported the missing patches. Imagine if you had taken that single patch yourself: nobody would have known, and you could have corrupted a lot of your users' FSes.<br> <p> So please, for your users, never ever cherry-pick random patches from stable. Take the whole stable, possibly a slightly older one if you don't feel comfortable with the latest changes, add your distro-specific patches on top of it, but do not pick what seems relevant to you; that will eventually result in a disaster and nobody will support you for having done this.<br> </div> Tue, 12 Dec 2023 04:52:13 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954552/ https://lwn.net/Articles/954552/ ro1 <div class="FormattedComment"> Is kernel 6.2 affected?<br> </div> Tue, 12 Dec 2023 02:51:54 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954551/ https://lwn.net/Articles/954551/ roc <div class="FormattedComment"> <span class="QuotedText">&gt; I've counted 17 people responding to that thread with test reports, some of which indicate boot failures, others successes, on a total of around 910 systems covering lots of architectures, configs and setup.</span><br> <p> Relying on volunteers to manually build and boot RC kernels is both inefficient and inadequate. 
There should be dedicated machines that automatically build and boot those kernels AND run as many automated tests as can be afforded given the money and time available. With some big machines and 48 hours you can run a lot of tests.<br> <p> This isn't asking for much. This is what other mature projects have been doing for years.<br> </div> Tue, 12 Dec 2023 00:16:26 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954542/ https://lwn.net/Articles/954542/ Bosch <div class="FormattedComment"> Question: Does anyone know if the latest Manjaro XFCE release (manjaro-xfce-23.0.4-minimal-231015-linux65) is safe from this bug?<br> <p> I'm planning to make the jump from Mint to that in the next few days and I don't know where to look to find which kernel it uses on install.<br> </div> Mon, 11 Dec 2023 21:27:08 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954530/ https://lwn.net/Articles/954530/ pizza <div class="FormattedComment"> <span class="QuotedText">&gt; So perhaps the stable testing period should be made longer, like 4-5 days.</span><br> <p> No matter what period is chosen, it will simultaneously be too short for some, and too long for others.<br> </div> Mon, 11 Dec 2023 19:55:53 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954523/ https://lwn.net/Articles/954523/ mat2 <div class="FormattedComment"> From the mail you quoted:<br> <p> <span class="QuotedText">&gt; Subject: [PATCH 6.6 000/134] 6.6.5-rc1 review</span><br> <span class="QuotedText">&gt; Date: Tue, 5 Dec 2023 12:14:32 +0900</span><br> <p> [snip]<br> <p> <span class="QuotedText">&gt; Responses should be made by Thu, 07 Dec 2023 03:14:57 +0000.</span><br> <span class="QuotedText">&gt; Anything received after that time might be too late.</span><br> <p> I think that the time available for testing stable release candidates is too short (~48 hours). Some bugs (such as this one) are visible only after some usage period.<br> <p> Longer times also mean more testers. For example, Ubuntu's mainline PPA ( <a rel="nofollow" href="https://kernel.ubuntu.com/mainline/">https://kernel.ubuntu.com/mainline/</a> ) might run such stable RCs similarly to how it compiles normal kernels.<br> <p> So perhaps the stable testing period should be made longer, like 4-5 days.<br> <p> I'll try to test stable RCs myself. Is there some mailing list available that one may subscribe to in order to get notifications about these releases (just notifications, without all the patches)?<br> </div> Mon, 11 Dec 2023 19:54:27 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954511/ https://lwn.net/Articles/954511/ csigler <div class="FormattedComment"> Neither. I have the Arch LTS kernel pkg installed in case the latest current has "issues." So I could easily have been bitten by this. Losing data is never something to joke about, kindly or hatefully.<br> <p> I think I was just relieved that the current stable kernel which I run has not been affected!<br> </div> Mon, 11 Dec 2023 17:02:25 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954507/ https://lwn.net/Articles/954507/ farnz <p>Without first determining whether a given patch is, or is not, a fix for a CVE/!CVE, how do I state the CVE number in the changelog? Bear in mind that at the point I write the patch, I may just be fixing a bug I've seen, without realising that it's security relevant, or indeed that someone has applied for a CVE number for it. 
Mon, 11 Dec 2023 16:50:42 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954474/ https://lwn.net/Articles/954474/ cloehle <div class="FormattedComment"> <span class="QuotedText">&gt;Using that logic ("all-users-must-upgrade"), all patches in a given stable release are both security fixes and not security fixes at the same time.</span><br> <p> And that is kind of the current situation, although strangely worded.<br> The kernel doesn't make the distinction: don't run a kernel containing bugs that have already been fixed.<br> <p> To get a CVE, many vendors require you to actually prove an exploit, and that is often orders of magnitude more effort for both the reporter and the CNA to verify; but for now the kernel community would rather spend the potential days to months fixing stuff instead of thinking "How could this bug be exploited somehow?".<br> </div> Mon, 11 Dec 2023 15:50:40 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954471/ https://lwn.net/Articles/954471/ bgilbert <blockquote> I've counted 17 people responding to that thread with test reports, some of which indicate boot failures, others successes, on a total of around 910 systems covering lots of architectures, configs and setup. I think this definitely qualifies for "appropriate tools", "testing" and "developer time", and I doubt many other projects devote that amount of effort to weekly releases. </blockquote> <p>Many other projects have CI tests that are <i>required</i> to pass before a new release can ship. If that had been the case for LTP, this regression would have been avoided. What's more, the problem was <a href="https://lore.kernel.org/stable/81a11ebe-ea47-4e21-b5eb-536b1a723168@linaro.org/">reported to affect 6.1.64</a> during its -rc period, but <a href="https://lore.kernel.org/stable/2023112502-supernova-copier-7615@gregkh/">no action was taken</a> to fix that release. 6.1.64 was <a href="https://lore.kernel.org/stable/2023112826-glitter-onion-8533@gregkh/">released with the problem</a> four days later. <p>Mistakes happen! But this is an opportunity to improve processes to prevent a recurrence, rather than accepting the status quo. (A toy sketch of such a release gate appears at the end of this comment.) <blockquote> No, for having already discussed this topic with him, I'm pretty sure he never said this. I even remember that once he explained that he doesn't want to advertise severity levels in his releases so that users upgrade when they feel confident and not necessarily immediately nor when it's written that now's a really important one. Use cases differ so much between users that some might absolutely need to upgrade to fix a driver that's going to ruin their data while others might prefer not to as a later fix could cause serious availability issues. </blockquote> <p>I have <a href="https://github.com/coreos/linux/pull/302#issuecomment-462918750">personally been complained at</a> by Greg for fixing a stable kernel regression via cherry-pick, rather than shipping the latest release directly to distro users. I've seen similarly aggressive messaging in other venues. In fact, the standard release announcement says: <blockquote> All users of the x.y kernel series must upgrade. </blockquote> <p>If downstream users are intended to take a more cautious approach, the messaging should be clarified to reflect that. 
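<p> A toy sketch of gating a release on test results (entirely hypothetical: the wrapper script names are invented, and this is not any project's real tooling) -- run each suite and refuse to proceed unless every one passes:
<pre>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Hypothetical wrapper scripts around suites such as LTP or xfstests. */
    const char *suites[] = {
        "./run-ltp.sh",
        "./run-xfstests.sh",
        "./run-kselftest.sh",
    };
    size_t n = sizeof(suites) / sizeof(suites[0]);

    for (size_t i = 0; i < n; i++) {
        printf("running %s\n", suites[i]);
        /* system() returns non-zero when the suite script fails or cannot run. */
        if (system(suites[i]) != 0) {
            fprintf(stderr, "FAILED: %s -- blocking the release\n", suites[i]);
            return 1;   /* non-zero exit: do not tag the release */
        }
    }
    printf("all suites passed; the release may proceed\n");
    return 0;
}
</pre>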
Mon, 11 Dec 2023 15:44:47 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954473/ https://lwn.net/Articles/954473/ Kamiccolo <div class="FormattedComment"> It was posted in this thread some time ago:<br> <a href="https://lore.kernel.org/stable/81a11ebe-ea47-4e21-b5eb-536b1a723168@linaro.org/">https://lore.kernel.org/stable/81a11ebe-ea47-4e21-b5eb-53...</a><br> <p> I'd say LTP deserves at least a little bit more love ;)<br> </div> Mon, 11 Dec 2023 15:39:54 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954472/ https://lwn.net/Articles/954472/ jalla <div class="FormattedComment"> Happened again today with 6.6.5 and 6.1.66.<br> <p> <a href="https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.67">https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6....</a><br> <a href="https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.6">https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.6</a><br> </div> Mon, 11 Dec 2023 15:32:28 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954470/ https://lwn.net/Articles/954470/ rolexhamster <p> Using that logic ("all-users-must-upgrade"), all patches in a given stable release are both security fixes and not security fixes at the same time. (In other words, heisen-patches, fashioned after heisenbugs.) That's a cop-out. </p> <p> If a patch (or collection of patches) fixes an existing CVE/!CVE, why not simply state that in the changelog? This is distinct and separate from asking to categorize each bug/patch as "Very likely not security-related" and "Security-related". </p> Mon, 11 Dec 2023 15:19:38 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954466/ https://lwn.net/Articles/954466/ farnz <p>That requires someone (who's willing to stand behind their effort) to look over all of the changes, and tell you which ones have <em>no</em> security relevance. And thinking this way reveals the problem with "I only apply the security-relevant bugfixes"; to do that, you first need to know which bugfixes <em>are</em> security relevant, which in turn implies that you know which bugfixes are not security relevant. <p>If you merely take all bugfixes that are known to be security relevant, then you're engaging in theatre; there will always be security relevant bugfixes that aren't known to be security relevant, either because no-one in the chain from the bug finder to Greg recognised that this bug had security relevance, or because people who recognised that it was security relevant chose to hide that fact for reasons of their own (e.g. because they work for the NSA, want future kernels to be fixed, but benefit from people not rushing to backport the fix to a vendor's 3.3 kernel). Mon, 11 Dec 2023 15:13:07 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954460/ https://lwn.net/Articles/954460/ intgr <div class="FormattedComment"> <span class="QuotedText">&gt; existing well-known test suites like LTP (which revealed the bug)</span><br> <p> Interesting fact. Do you have a link to it? And any discussions that followed?<br> <p> </div> Mon, 11 Dec 2023 14:31:11 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954458/ https://lwn.net/Articles/954458/ Paf <div class="FormattedComment"> Right, that’s exactly what I was referring to with “developed in the context of” and “you get it”. This problem - of an unknown dependency on an older commit - is not terribly solvable in general. 
One thing that can help you sometimes is having robust tests.<br> </div> Mon, 11 Dec 2023 14:25:18 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954457/ https://lwn.net/Articles/954457/ cloehle <div class="FormattedComment"> Absolutely not: the kernel community rejects the CVE system, and for very good reasons.<br> <p> It is a completely unreasonable amount of effort to categorize bugs into "Very likely not security-related" and "Security-related"; in fact, everyone who attempts this (most vendors) messes up regularly, which is a huge weakness of the CVE (and !CVE) systems for that matter.<br> </div> Mon, 11 Dec 2023 14:20:36 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954447/ https://lwn.net/Articles/954447/ farnz <p>Greg's position is a lot less concrete than that - it's "I make no assertions about whether or not any given batch of patches fixes bugs you care about; if you want all the fixes I think you should care about, then you must take the latest batch". Whether you want all the fixes that Greg thinks you should is your decision - but he makes no statement about what subset of stable patches you should pick in that case. Mon, 11 Dec 2023 14:02:51 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954438/ https://lwn.net/Articles/954438/ wtarreau <div class="FormattedComment"> That's why, for me, data corruption bugs are the worst ones. When you discover one, it's generally too late.<br> <p> I abandoned reiserfs about 10 years ago after I found corrupted file tails 3 times in the same week. The FS was really neat, but I stopped trusting it, so I immediately turned tail merging off and switched when I had the opportunity to. Nobody wants to play games with their data; that kind of problem only strikes back much later, after it appeared.<br> </div> Mon, 11 Dec 2023 14:00:13 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954436/ https://lwn.net/Articles/954436/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; The real larger problem Linux fans never talk about: very poor/inadequate/missing QA/QC in the Linux kernel.</span><br> <p> Compared to what, and by which metric and unit?<br> <p> The latest kernel was run on ~910 systems by 17 people, who found issues that were fixed before the release:<br> <p> <a href="https://lore.kernel.org/all/20231205031535.163661217@linuxfoundation.org/#r">https://lore.kernel.org/all/20231205031535.163661217@linu...</a><br> <p> If you have good plans to propose something better that doesn't bring the process to a halt, I'm sure everyone would be interested to know about it. The stable team is always seeking more testers; feel free to join.<br> </div> Mon, 11 Dec 2023 13:57:01 +0000
Ext4 data corruption in stable kernels https://lwn.net/Articles/954434/ https://lwn.net/Articles/954434/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; No. It would just delay a lot of fixes. </span><br> <span class="QuotedText">&gt;</span><br> <span class="QuotedText">&gt; Perhaps that would be a good thing, especially when it comes to critical subsystems.</span><br> <p> No, it would just leave users exposed to the bugs for longer, and make the fixes land together with many more related ones, making it even harder to spot the culprit. 
The problem is that some users absolutely want to push the responsibility onto someone else:<br> - a fix is missing: what are you doing, maintainers, couldn't you pick it for stable?<br> - a fix broke my system: what are you doing, maintainers, couldn't you postpone it?<br> <p> It will never change anyway, but it will continue to add lines here on LWN :-)<br> </div> Mon, 11 Dec 2023 13:52:06 +0000