
The core of the -stable debate

By Jonathan Corbet
July 22, 2021
Disagreements over which patches should find their way into stable updates are not new — or uncommon. So when the topic came up again recently, there was little reason to expect anything but more of the same. And, for the most part, that is what ensued but, in this exchange, we were also able to see the core issue that drives these discussions. There are, in the end, two fundamentally different views of what the stable tree should be.

The 5.13.2 stable update was not a small one; it contained an even 800 patches. That is 5% of the total size of the mainline 5.13 development cycle, which was not small either. With the other stable kernels going out for consideration on the same day, there were over 2,000 stable-bound patches in need of review; that is a somewhat tall order for even a large community to handle in the two days that are allowed. Even so, Hugh Dickins was able to raise an objection over the inclusion of several memory-management patches that had not been specifically marked for inclusion in the stable releases. Those patches, he thought, were not suitable for a stable kernel and should not have been selected.

Stable-kernel maintainer Greg Kroah-Hartman responded that the size of the update was due to maintainers holding onto fixes until the merge window opens. Once the -rc1 release comes out, those fixes all land in the stable updates, which are, as a result, huge. But it is clear that the amount of change going into the stable kernels has been growing over time. If one looks at the number of changes going into the first five updates for each release (enough updates to include the merge-window fixes), the result is:

Release   Updates   Changes (first 5)   Changes (total)
4.19      198       754                 19,682
5.0       21        408                 2,387
5.1       18        353                 1,747
5.2       21        779                 2,429
5.3       18        561                 2,178
5.4       134       435                 14,414
5.5       19        653                 2,516
5.6       19        364                 1,864
5.7       19        581                 2,984
5.8       18        899                 2,755
5.9       16        1,244               2,339
5.10      52        841                 7,295
5.11      22        948                 3,588
5.12      19        1,446               3,843
5.13      4         1,416               1,416

While 5.13 has not yet reached five stable releases as of this writing, it seems safe to predict that it will collect more changes in its first five stable updates than any of its predecessors. The number of patches going into the stable updates is increasing in general, at a rate that would seem to exceed the growth in the rate at which changes are applied during mainline kernel development cycles. Long-term stable kernels now receive more patches during their "stable" period than during the development cycle leading up to their "final" release. In other words, the development cycle is not even close to being finished when Linus Torvalds applies a tag and leaves the building.
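
As a rough illustration of where such numbers come from, here is a minimal sketch that counts the commits in the first five stable updates of a release by walking the v5.x and v5.x.y tags in a kernel tree with the stable tags fetched; the tag names are the real kernel.org ones, but the script itself is only an example, not the tooling used for the table above.

```ruby
# Count the commits that went into the first five stable updates of a given
# release (e.g. 5.12) by diffing consecutive stable tags. Assumes a kernel
# tree with the stable tags fetched; a missing tag simply counts as zero.
release = ARGV.fetch(0, '5.12')

total = (1..5).sum do |n|
  from = n == 1 ? "v#{release}" : "v#{release}.#{n - 1}"
  to   = "v#{release}.#{n}"
  # Each stable update is a linear pile of cherry-picked patches on top of
  # the previous tag, so a simple rev-list count gives its size.
  `git rev-list --count #{from}..#{to}`.to_i
end

puts "#{release}: #{total} commits in the first five stable updates"
```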

Developers and maintainers can indicate that a mainline patch should be backported to the stable releases by including a "CC: stable@vger.kernel.org" tag, but that is not how most patches get there; of the 1,416 commits in 5.13.4, only 259 — 18% — contained such a tag. The stable maintainers have become increasingly aggressive about seeking out mainline patches that look like fixes and putting them into the stable releases. The Fixes tag found on many patches is used as a cue that a patch is a fix, but machine learning is also being used to select patches. The result is a lot of commits going into stable updates without ever having been explicitly marked for that treatment.
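
A minimal sketch of how such a count can be made, assuming a kernel tree with the stable tags fetched: it scans every commit that has accumulated in 5.13.4 since the 5.13 release and looks for an explicit stable tag or, failing that, a Fixes: tag. The regular expressions are simplified approximations of the real tag formats, so the exact numbers may differ slightly.

```ruby
# For the commits accumulated in one stable kernel (5.13.4 by default),
# count how many carry an explicit stable tag and how many only carry a
# Fixes: tag. The regexps are simplified versions of the real tag formats.
range = ARGV.fetch(0, 'v5.13..v5.13.4')

stable_tagged = 0
fixes_only    = 0
shas = `git rev-list #{range}`.split

shas.each do |sha|
  body = `git log -1 --format=%B #{sha}`
  if body =~ /^\s*cc:\s*<?stable@(vger\.)?kernel\.org/i
    stable_tagged += 1
  elsif body =~ /^fixes:\s*[0-9a-f]{6,}/i
    fixes_only += 1
  end
end

puts "#{shas.size} commits: #{stable_tagged} explicitly tagged for stable, " \
     "#{fixes_only} more with only a Fixes: tag"
```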

This work has always been controversial, especially when regressions slip through. Regressions are inevitable regardless, but it is hard to imagine a way to add over 3,000 changes to a kernel during a three-month short-term stable cycle without a few of them being bad. The patches singled out by Dickins this time around are not responsible for any regressions (that anybody knows about yet), but the memory-management developers are clearly worried about the possibility.

To avoid any such outcome, memory-management maintainer Andrew Morton has requested that patches carrying his Signed-off-by tag not be included in stable updates in the absence of a specific request: "At present this -stable promiscuity is overriding the (sometime carefully) considered decisions of the MM developers, and that's a bit scary". Kroah-Hartman asked how the decision to mark a patch for stable backporting is made and, specifically, why a number of clear fixes were not selected; Morton explained the thinking this way:

Broadly speaking: end-user impact. If we don't have reports of the issue causing a user-visible problem and we don't expect such things to occur, don't backport. Why risk causing some regression when we cannot identify any benefit?

Kroah-Hartman's position, instead, is that, if a patch fixes a bug, it should be included in the stable updates:

But it really feels odd that you all take the time to add a "Hey, this fixes this specific commit!" tag in the changelog, yet you do not actually want to go and fix the kernels that have that commit in it. This is an odd signal to others that watch the changelogs for context clues.

Sasha Levin (who also works on the stable updates) added that a lot of important fixes are missed if only the explicitly tagged patches are backported into the stable kernels. Some of those fixes then find their way into distributor kernels via other paths, which doesn't seem ideal.

In the end, this is what the disagreement comes down to: a difference of opinion on what is the best way to create stable updates that are truly stable and free of problems.

  • Many developers see the stable updates as a carefully curated collection of hand-selected fixes, each of which has been extensively reviewed for importance and the lack of regression potential. These kernels should be safe to update to, since they should have a minimal chance of introducing problems not seen in their predecessors. This position tends to be taken by the developers of complex, core subsystems that have a high potential for subtle regressions.

    The memory-management subsystem is a classic example; there was also a similar discussion with the XFS filesystem developers in late 2020. Memory management requires predicting the future; as a result, the code is a large collection of complicated heuristics that have to work with a huge variety of workloads. It is not uncommon for an innocent-seeming change to create a performance regression for some private customer workload that won't surface until years have passed. Memory-management developers have learned that their lives run much more smoothly in the absence of such regressions, so they go out of their way to avoid making unnecessary changes to stable kernels.

  • Others, including the stable maintainers, feel that the best kernel is to be had by including every fix that can be reasonably backported. Many bugs have user impacts — including security problems — that are not obvious to the developers when those bugs are being fixed; including all of the fixes will head off a lot of problems before they are discovered.

    Some distributors have taken this position and, as a result, are happy with how things are working; Justin Forbes wrote that "the current stable process has fixed more bugs than it has introduced". The Android kernel is increasingly tied to the stable updates as a way of getting as many fixes as possible; based on this experience, Kroah-Hartman said: "I have numbers to back up the other side, along with the security research showing that to ignore these stable releases puts systems at documented risk".

Which position is "correct" is not entirely clear, but there is no doubt about which position is "winning". As always in the Linux world, the people who are doing the work will decide how the work is to be done, and the stable maintainers have opted for the "promiscuous" approach. It would be interesting to see if there would be a user community for ultra-stable kernels maintained using the more restrictive approach, but it is doubtful that anybody has the time to create such a thing.

There is always room for tweaking around the edges and opting out certain subsystems; this seems likely to happen with regard to memory management. More testing would also certainly help; the testing picture for stable releases has improved considerably over the years but could still get better. Ted Ts'o suggested that there could be a role for a "perfbot" system that looked for performance regressions in particular, if the resources could be found to create that sort of facility. Performance regressions are particularly difficult to test for, though; the resource requirements are large and it is nearly impossible to simulate every type of workload.

In any case it seems that large stable updates will continue to be the rule going forward. With luck they will continue to become less regression-prone, but they will never be completely regression-free. So it is safe to predict that the debates over what should and should not go into stable updates will continue indefinitely.

Index entries for this article
Kernel: Development model/Stable tree



The core of the -stable debate

Posted Jul 22, 2021 17:37 UTC (Thu) by jgg (subscriber, #55211) [Link]

-stable is probably better described as -secure; as the article explains, the motivation and goal seem to be to get all bug fixes, because any of them could be a security issue. It is not about the very low-change, highly predictable, safe upgrade that you might expect from an enterprise distro's stable stream.

On the other side we also have distros and other consumers with their own, quite different, policies.

IMHO this debate basically boils down to the fact there are lots of people in the backporting world consuming the "cc: stable" and "Fixes" tags and all have their own incompatible expectations on what they should mean.

What I've sort of settled on is "cc: stable" means this is really important and we know it needs attention for some reason. Everyone doing backports should look at this.

"Fixes" means this fixes something, and should be evaluated according to each back porter's criteria. Commit messages are supposed to be good enough to help them decide this. No tag means this probably is new features/functions.

With all the AI and algorithms, -stable is really strongly QA-dependent. If you are using a part of the kernel that doesn't have QA coverage on the -stable tree, run away. This has created something of a chicken-and-egg problem: in my world most users are not using -stable, and I can't justify investing in more QA without users.

The core of the -stable debate

Posted Jul 22, 2021 19:27 UTC (Thu) by post-factum (subscriber, #53836) [Link] (16 responses)

FWIW, it got funny recently with -stable and -mm.

This series was backported partially (2 out of 4 patches were picked for some reason), obviously introducing a visible regression because of broken locking. Those 2 patches are set to be reverted in 5.13.5 so that the regression will be addressed.

But what's the logic behind this? First, a partial fix for a real problem is backported, introducing yet another issue. Then, this fix, instead of being backported fully, gets completely reverted.

Stable is not stable any more?

The core of the -stable debate

Posted Jul 23, 2021 1:42 UTC (Fri) by willy (subscriber, #9762) [Link] (14 responses)

That is an insanely rare race. When was the last time you invoked "swapoff" *while the system was actively swapping to that swap device*? ie not part of shutdown.

I think I've done that maybe twice in my entire life. And the race window is several instructions long.

I honestly wonder if it's worth fixing at all. It's certainly not worth backporting to -stable. I've asked Sasha to see if his bot can notice that a patch is part of an N part series and refuse to backport M/N if part 1 to M-1 are not going to be backported.
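
Such a check is at least mechanically possible, since stable backports record the mainline commit they correspond to in their changelog (a "commit <sha> upstream." or "[ Upstream commit <sha> ]" line). A hypothetical sketch, assuming you already have the mainline SHAs of the series and a local copy of the stable branch:

```ruby
# Given the mainline SHAs of a patch series plus a stable branch name, report
# which patches of the series were backported. Stable backports reference
# their mainline commit in the changelog, which is what --grep finds here.
*series, branch = ARGV          # e.g. <sha1> <sha2> <sha3> <sha4> linux-5.13.y

picked = series.map do |sha|
  [sha, !`git log --oneline --grep=#{sha} #{branch}`.empty?]
end

picked.each { |sha, found| puts "#{found ? 'picked ' : 'MISSING'}  #{sha}" }

if picked.any? { |_, f| f } && !picked.all? { |_, f| f }
  puts 'Warning: only part of the series was picked up'
end
```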

The core of the -stable debate

Posted Jul 23, 2021 6:01 UTC (Fri) by post-factum (subscriber, #53836) [Link] (6 responses)

So, -stable is for fixes, but not for all fixes. But what fixes, specifically, is -stable for?

The core of the -stable debate

Posted Jul 23, 2021 9:34 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (3 responses)

For those that the maintainers deem worthy, depending on risk of regressions, risk of botched backports, and likelihood of actually encountering the bug in the wild, among other things.

The core of the -stable debate

Posted Jul 23, 2021 10:24 UTC (Fri) by post-factum (subscriber, #53836) [Link] (2 responses)

> maintainers

Subsystem maintainers? -stable maintainers?

The core of the -stable debate

Posted Jul 23, 2021 14:39 UTC (Fri) by lsl (subscriber, #86508) [Link] (1 responses)

Those that do the work, so -stable maintainers.

The core of the -stable debate

Posted Jul 24, 2021 13:51 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

Nope. Those that know the code best and are on the hook if somebody reports a regression, that is subsystem maintainers.

The core of the -stable debate

Posted Jul 23, 2021 11:39 UTC (Fri) by willy (subscriber, #9762) [Link]

If only somebody had written those rules down.

https://www.kernel.org/doc/html/latest/process/stable-ker...

* It must fix a real bug that bothers people (not a, “This could be a problem…” type thing).

The core of the -stable debate

Posted Jul 23, 2021 13:21 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

> But what fixes, specifically, is -stable for?

All those where the benefit/risk ratio is extremely high. I.e. any fix whose risk is zero, and fixes of high importance when there is a non-negligible risk. For example, Andy's ESPFIX series a few years ago was not trivial to backport and required modifying some of the exception-handling code that otherwise is never touched in a patch series. I remember that I was really scared when touching such areas (I don't remember if that was on 2.6.32 or 3.10, but I still remember the topic), but regardless of the risks, that was required, and the risks were addressed with a good dose of cross-maintainer help and testing. All the spectre/meltdown/l1tf etc stuff was orders of magnitude more complex and more risky, but had to be done as well.

In short, -stable is for those who use Linux. If you're not sure you can gauge the risk/benefit ratio yourself, let those whose job it is do it.

It's worth noting that for those dealing with patched kernels (distros, product vendors, etc.), the situation is even worse. While -stable maintainers have the freedom to decide how to modify their kernel to reduce some risks, those at the end of this supply chain do not have this luxury if some of their patches rely on the modified parts. This can sometimes imply more modifications in their own patches just to adapt to the latest changes, and such modifications will be done with less chance of peer review, let alone public review, and sometimes even no chance to try the code in the field if the issue is still under embargo. This alone is a great motivation for upstreaming your local work as much as possible.

The core of the -stable debate

Posted Jul 26, 2021 13:18 UTC (Mon) by immibis (subscriber, #105511) [Link] (6 responses)

How would you tell the system not to actively swap to that swap device, if not by using swapoff?

The core of the -stable debate

Posted Jul 26, 2021 13:28 UTC (Mon) by willy (subscriber, #9762) [Link] (5 responses)

swapoff is, of course, how you tell the system to stop using a swap device.

My question is, how often do you do this (other than as part of shutdown)? Once an hour? Once a week? Ever?

The core of the -stable debate

Posted Jul 27, 2021 16:00 UTC (Tue) by flussence (guest, #85566) [Link] (2 responses)

I've had scenarios along the lines of “oh shit, this linker process is about to fill up zram swap after a day long chromium build, better move it to disk so the loadavg doesn't zoom off into 3 digits forever”.

If the kernel decides to panic at that point because I'm holding the bailing-out bucket wrong I'd not be too happy.

The core of the -stable debate

Posted Jul 27, 2021 16:08 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)

That's swapon though, right? Not swapoff?

The core of the -stable debate

Posted Aug 2, 2021 18:51 UTC (Mon) by flussence (guest, #85566) [Link]

That's kind of both. Adding a higher priority swap space when zram can't keep up only stops things getting exponentially worse - zram won't shrink its RAM usage without a kick from an operator, and the kernel isn't great with cases where swap also contributes to memory use. It's like the swap-on-NFS problem but worse.

(Fortunately for me I've got enough RAM nowadays to not care about that kind of thing, but I'd spare a thought for people that still do.)

The core of the -stable debate

Posted Jul 30, 2021 13:16 UTC (Fri) by droundy (subscriber, #4559) [Link] (1 responses)

I've used "swapoff -a && swapon -a" relatively frequently to recover from the situation where my system has written a few gigabytes to swap and, when the memory pressure receded, zoom was still being incredibly slow, presumably due to slowly swapping memory back in. Without this trick it takes ages to get things back up to speed. Not sure what zoom does with all that RAM, maybe garbage collection?

Of course, I try to do this *after* the swap storm and when there's plenty of memory free, but I'm a human who is not fully able to predict the future.

The core of the -stable debate

Posted Jul 31, 2021 1:35 UTC (Sat) by pabs (subscriber, #43278) [Link]

I've used that trick before too, it does help quite a bit in some situations.

The core of the -stable debate

Posted Jul 25, 2021 18:06 UTC (Sun) by willy (subscriber, #9762) [Link]

... and at least one of those two patches is buggy. Hugh asked for it to be reverted upstream. Reverting from stable was clearly the right decision.

The core of the -stable debate

Posted Jul 22, 2021 21:42 UTC (Thu) by roc (subscriber, #30627) [Link] (24 responses)

"fixed more bugs than it has introduced" seems like a weird way to evaluate "stable" updates. For users, the cost of an introduced regression is typically much higher than the value of a fixed bug. Something that was broken before starting to work -> nice to have. Something that was working before doesn't work -> wailing and gnashing of teeth, time spent diagnosing the problem and possibly downgrading the kernel.

Also, each regression teaches users to be afraid of updating the kernel, and even to be afraid of upgrading other software components, making them less likely to accept future updates, and thus exposed to ongoing security issues.

The core of the -stable debate

Posted Jul 23, 2021 3:24 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (23 responses)

That's why it's generally recommended not to immediately upgrade sensitive machines to the latest -stable, especially if it looks quite fat. Others can see these regressions before you and you stay safe.

Stable kernels have become extremely reliable these days. The likelihood of facing a regression there is extremely low, and these kernels undergo a lot of testing by various teams before being released. Even if a regression is introduced, it's often unlikely that you'll actually face it, due to the wide spectrum of areas covered by each version.

As I once mentioned, in many years of rebasing our products onto the latest -stable, we've hit a regression only once, and it was not caused by the official kernel but by one of our patches being applied in the wrong place. That's also something important to remember: many stable users do have extra patches on top of the kernel. Some of these patches are extra features/drivers, and some used to be fixes relevant to the use case, which can often be dropped after they're merged into -stable.

From an ex-stable-maintainer's perspective I find the size of the current updates scary; for me the limit was around ~200 patches at once, entirely handled manually and with limited testing. But with all the current tooling and tests, I'm not shocked by 800 patches. The current kernel development process is probably one of the most efficient, scalable, and robust in the entire software industry, and I'm sure it still has some margin to improve further!

The core of the -stable debate

Posted Jul 23, 2021 6:41 UTC (Fri) by bgilbert (subscriber, #4738) [Link] (7 responses)

At CoreOS in late 2017/early 2018 we switched to shipping the latest stable LTS kernels, with a single-digit number of (small) patches on top. We learned to be more cautious after we had six regressions in seven months.

The last straw was a bug that caused large downloads to run at ~300 bytes/sec, which rather interfered with shipping a fix. That change had landed in the stable tree — against the policy of the networking subsystem — after someone submitted it to the stable@ mailing list. It turned out to be part two of a two-patch series.

The core of the -stable debate

Posted Jul 24, 2021 9:44 UTC (Sat) by wtarreau (subscriber, #51152) [Link]

That's indeed rather bad and doesn't reflect my experience. Don't you feel that the quality has improved overall in the 3 years since you faced this?

The core of the -stable debate

Posted Jul 24, 2021 9:54 UTC (Sat) by pabs (subscriber, #43278) [Link] (5 responses)

Do you have automated testing in place to detect the next set of regressions? Can that testing be done upstream in KernelCI and other places?

The core of the -stable debate

Posted Jul 24, 2021 19:39 UTC (Sat) by bgilbert (subscriber, #4738) [Link] (4 responses)

We had automated tests for common uses of the distro, and those did occasionally catch regressions before release. (Those tests were unfortunately pretty tied to our custom test framework.) The regressions we didn't catch usually affected a non-default configuration or network environment. These were often things that an upstream subsystem test suite might reasonably be expected to test, but from our perspective as a downstream consumer, we had no hope of QA'ing an entire kernel ourselves. We ultimately addressed this with a policy change: no upstream stable release would be promoted to our stable channel until it had spent a week in our beta channel, being exposed to user workloads.

I agree with roc that regressions in stable kernels should be viewed as a failure of upstream QA, not of downstream QA. I wonder if some of the friction could be addressed by setting expectations properly. If the stable maintainers don't intend their kernels to be used without significant additional validation, and consistently said so, then others wouldn't keep discovering these problems the hard way. But in my experience, the current maintainers tend to push users to run the latest releases (sometimes quite directly) while being less than entirely responsive to stability concerns.

The core of the -stable debate

Posted Jul 25, 2021 22:34 UTC (Sun) by rodgerd (guest, #58896) [Link] (3 responses)

> I agree with roc that regressions in stable kernels should be viewed as a failure of upstream QA, not of downstream QA.

Otherwise it's just an attempt to have it both ways: harangue users for not using "stable" kernels instead of distro kernels, while refusing to provide users the same level of guarantees that Debian, Red Hat, Ubuntu, etc offer their users.

The core of the -stable debate

Posted Jul 26, 2021 16:48 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (2 responses)

A regression is always a failure of all those involved in the chain that led to it: the author of the original code that had to be patched; the author of the patch, who possibly overlooked some side effects or was not precise enough about the backport instructions; those who thought they understood the changes well enough to apply them or to ignore a companion patch, etc.

There is no excuse for a regression, and it's pointless to try to blame this one or that one, everyone should work together to get it fixed ASAP and limit the risk that the same situation can repeat. It will always happen but it has to remain exceptionally rare.

But there's something worse than regressions, it's not applying fixes. And that's precisely the problem regressions cause: through lack of trust, users tend to apply fixes less often or to pick only some of them; and an excess of upstream validation effort can slow down the creation or adoption of fixes. A good balance is required, and ideally a target number of regressions per year should be established as a reference to gauge overall quality (e.g. no more than 3 minor and 1 major per year).

The core of the -stable debate

Posted Jul 26, 2021 17:37 UTC (Mon) by bgilbert (subscriber, #4738) [Link] (1 responses)

I agree with most of your post, but it's framed around a premise I don't think is right:

> But there's something worse than regressions, it's not applying fixes.

We do ourselves a disservice by thinking of fixes as the greatest possible good. As roc pointed out upthread, lots of users don't see things that way. They think in terms of risk/reward tradeoffs, and thus so should we.

The core of the -stable debate

Posted Jul 26, 2021 20:46 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

> We do ourselves a disservice by thinking of fixes as the greatest possible good. As roc pointed out upthread, lots of users don't see things that way. They think in terms of risk/reward tradeoffs, and thus so should we.

They think like this until they're seriously hit by a bug and ask for their data to come back. The worst issues you can have (IMHO) are those which silently destroy your data. Memory corruption bugs can be part of them sometimes, file-system bugs as well. Bugs where a network driver can receive packets larger than the allocated buffer can cause this. All of such bugs are totally ignored by the vast majority of the users, which is why they believe they'd rather not apply fixes. But when they find zeroes or random stuff inside their source files or their holidays photos, and that all their backups are corrupted because what was backed up is what was written, then they seriously complain that it's criminal not to integrate fixes for such issues.

However I do agree that the vast majority don't care about bugs that cause crashes from time to time, or instabilities that require a reboot to stay on the safe side. But the frontier between the two categories is often extremely blurry.

The core of the -stable debate

Posted Jul 23, 2021 9:32 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (10 responses)

> That's why it's generally recommended not to immediately upgrade sensitive machines to the latest -stable

Then how do you ensure they receive relevant security updates?

The core of the -stable debate

Posted Jul 23, 2021 13:32 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (9 responses)

> Then how do you ensure they receive relevant security updates?

How do you ensure that any security update that's relevant to you isn't irrelevant to the rest of the world, or that any fix for a file-system corruption, or for a bug that causes 3 reboots per hour, is in?

The process is always the same: a quick glance at the first column of the changelog, looking for the subsystems you rely on; if they're not present you possibly don't care. If you see a ton of them, you probably care, but it can represent a risk. That's where you have to plan to "upgrade soon". If something breaks badly, usually within a day or two, a new version follows with very few fixes. Subtle breakage can take a week or two.
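
A mechanical version of that "quick glance" might look like the sketch below, which filters a stable update's shortlog by subject prefix; the prefix list is only an example of subsystems one might care about, not a recommendation.

```ruby
# Filter a stable update's shortlog down to the subsystems you actually use.
# Kernel commit subjects conventionally start with "subsystem: description";
# the prefix list here is purely illustrative.
range    = ARGV.fetch(0, 'v5.13.1..v5.13.2')
prefixes = %w[mm ext4 xfs net tcp kvm x86]

`git log --format=%s #{range}`.lines.each do |subject|
  prefix = subject.split(':').first.to_s.strip.downcase
  puts subject if prefixes.any? { |p| prefix == p || prefix.start_with?("#{p}/") }
end
```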

And the extremely rare, critical security updates that require everyone to upgrade are always publicly discussed and relayed everywhere (like here) so that everyone knows that it's about time to upgrade.

The core of the -stable debate

Posted Jul 23, 2021 20:58 UTC (Fri) by roc (subscriber, #30627) [Link] (8 responses)

It is absurd to expect Linux end users to inspect kernel changelogs to decide whether or when to apply a kernel update.

The core of the -stable debate

Posted Jul 23, 2021 23:13 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (6 responses)

Indeed. I'm pretty tech-savvy, but if someone were to ask me if the frobnitz subsystem is important to me, I'd have to say…not sure? Given the obtuseness of many driver names, it's not something one is likely to be able to remember easily.

The core of the -stable debate

Posted Jul 24, 2021 9:49 UTC (Sat) by wtarreau (subscriber, #51152) [Link] (5 responses)

In my opinion, when you build your own kernels it's because you're not interested in the distro's kernel so you know what you're using.

The situation is problematic for distros, though, who have to swallow everything because everything is used in the field. But in this case, being late by a week or two avoids shipping the rare versions that are extremely broken, like the one @bgilbert faced above.

The core of the -stable debate

Posted Jul 26, 2021 4:58 UTC (Mon) by roc (subscriber, #30627) [Link] (4 responses)

Unfortunately, shipping security updates late by a week or two is a huge problem these days. The bad guys can reverse engineer bugs from patches and deploy exploits in under a week.

The core of the -stable debate

Posted Jul 26, 2021 6:09 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (3 responses)

> The bad guys can reverse engineer bugs from patches and deploy exploits in under a week.

The bad guys have often discovered these vulns long before they were reported, because there's a real economy behind finding and selling vulnerabilities. When one does get reported, it can be one that was already in active use (or waiting to be used), too bad for the bad guys then, or one that was already worn out and about to be disclosed, thus no longer interesting enough for them. E.g. http://zerodium.com/program.html

When you figure you have an opportunity to sell your discovery for $2.5M, would you report it or sell it? Hard to tell. Probably the second person who tries to sell it fails because it's already known to the buyer; then it's time to report it, give it a name and a logo, and become famous :-/

For sure there are a number of vulnerabilities that are first discovered by the reporter, but I'm far from being convinced they're the majority.

In fact what embargoes protect against are mostly script kiddies having fun with their OS, leaving no quick solution to their admin.
In addition, even in sensitive places, lots of admins don't update and reboot immediately after a vulnerability is disclosed and fixed, even if there's a known exploit. That's why in practice reducing the attack surface by disabling all unused features remains way more of an effective protection than instantly jumping on the last update, because it protects you *before* the vulnerability is public.

The core of the -stable debate

Posted Jul 27, 2021 2:27 UTC (Tue) by roc (subscriber, #30627) [Link] (2 responses)

It's certainly true that many vulnerabilities are discovered by *someone* before the fix is released, but it is also true that they are discovered by many more people *after* the fix is released.

> That's why in practice reducing the attack surface by disabling all unused features remains way more of an effective protection than instantly jumping on the last update

This only makes sense for kernel users who are also kernel developers. *Maybe* conscientious developers of embedded devices would do this before they ship (though developers of such devices are not known for being conscientious). Anyone using third-party software is very unlikely to disable standard kernel features because the risk of breakage is very high. I've certainly never heard of cloud or desktop Linux users doing this.

The core of the -stable debate

Posted Jul 27, 2021 6:28 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (1 responses)

> This only makes sense for kernel users who are also kernel developers.

Not kernel developers, but anyone interested in doing a bit of tuning. Rebuilding one's kernel is often considered a milestone in one's discovery of Linux.

> *Maybe* conscientious developers of embedded devices would do this before they ship (though developers of such devices are not known for being conscientious).

In most embedded devices, it's very common for two reasons:
- resources usage / kernel size reduction
- you know exactly what is required by your machine and you precisely prefer to reduce the attack surface because it buys you some time to deliver updates. It's very convenient to be able to tell a customer "don't worry, you're not affected".

> Anyone using third-party software is very unlikely to disable standard kernel features because the risk of breakage is very high.

Absolutely!

> I've certainly never heard of cloud or desktop Linux users doing this.

In clouds it's pretty rare because if your machine doesn't restart sometimes it's hard to access the console to fix it. But on desktop it used to be unavoidable for some time at the beginning: you had to compile your kernel to set the NIC's address on the ISA slot, to set the sound card's address/IRQ/DMA, then later to enable frame buffer only for your own graphics card because there were multiple drivers available and some didn't work as modules or would take over yours, etc. Things have progressed a lot since, but those who were used to such practices have probably continued. At least all my machines continue to run LTS kernels that I configure and build myself because they perfectly match what I need or the tunings I prefer. "make oldconfig" is also a great way to discover new features, new filesystems, etc.

The core of the -stable debate

Posted Jul 29, 2021 20:16 UTC (Thu) by Wol (subscriber, #4433) [Link]

Don't forget Gentoo ...

Cheers,
Wol

The core of the -stable debate

Posted Jul 27, 2021 16:10 UTC (Tue) by flussence (guest, #85566) [Link]

How do people cope in the Windows world?

The core of the -stable debate

Posted Jul 23, 2021 21:21 UTC (Fri) by roc (subscriber, #30627) [Link] (3 responses)

> The current kernel development process probably is one of the most efficient, scalable and robust of all the software industry

That may be true but only because most of the software industry is a disaster area. In reality, the kernel development process is pretty far behind well-run projects. E.g.

There is no systematic bug tracking. You report bugs by emailing a bunch of people and LKML and hoping that they don't all ignore you. If it's a regression, you have to manually track whether it has been fixed in time for a release, nag people if it hasn't, etc. It's often unclear who, if anyone, is taking responsibility for the bug. Good projects have a bug tracking system that they actually use, so at any point in time anyone can pull up a list of the bugs that need to be fixed before the release and who's responsible for them.

Testing is still very weak. Good projects expect that when you submit a bug fix, you also submit an automated test for the bug. The kernel has no such expectation. And the kernel tests are still not being run frequently enough: there should be a large set of tests running on every significant change. Progress is being made, but there's a long way to go.

One thing that the kernel dev process definitely is *not*: efficient. Maybe it's efficient for the core developers but for people outside that core --- wrangling bugs, doing QA, dealing with LKML, inspecting changelogs to decide whether they should update to -stable today --- it is horrendously inefficient.

The core of the -stable debate

Posted Jul 24, 2021 0:07 UTC (Sat) by pizza (subscriber, #46) [Link] (1 responses)

> Testing is still very weak. Good projects expect that when you submit a bug fix, you also submit an automated test for the bug. The kernel has no such expectation. And the kernel tests are still not being run frequently enough: there should be a large set of tests running on every significant change. Progress is being made, but there's a long way to go.

There are a _huge_ number of tests being routinely run on the Linux kernel, and more are constantly being written. But those doing the testing tend to focus on the use cases they care about, which may or may not overlap with what anyone else cares about.

While many of the kernel subsystems can be independently tested, >80% of the kernel sources consists of device drivers whose corresponding hardware exhibits complex behaviors that depend heavily on what is going on outside the device, even before the huge combinatorial challenge of platform/arch-specific CONFIG_* options is factored in.

As an example of this, I spent most of the last few days trying to hunt down the cause of a firmware panic with an Intel AX200 card on a custom aarch64-based board -- turns out that at the time the SoC vendor kernel was forked, the iwlwifi driver (and that card+firmware combo) had not been tested on a system that didn't have ACPI.

But how exactly are we to ensure a similar regression doesn't occur, without actually executing that particular set of kernel options on that particular board with that hardware? No matter where you draw the line, it's going to necessarily exclude some usecases.

The core of the -stable debate

Posted Jul 26, 2021 5:03 UTC (Mon) by roc (subscriber, #30627) [Link]

Indeed, you can't test all combinations of features. But the kernel is a long way from that being the biggest problem. For example AFAIK there is still no automated testing of all 32-bit-compat wrappers for 64-bit syscalls. (We have such testing in the rr testsuite, which has detected quite a few kernel bugs.)

A huge number of tests is not impressive by itself, since the kernel is a huge project.

The core of the -stable debate

Posted Aug 5, 2021 17:31 UTC (Thu) by tuna (guest, #44480) [Link]

Isn't this what you pay Redhat etc to do for you (if you do not want to invest the engineering resources yourself)?

The core of the -stable debate

Posted Jul 22, 2021 22:05 UTC (Thu) by ojab (guest, #54051) [Link] (2 responses)

With a highly scientific script:
```ruby
require 'open3'
require 'csv'

# Collect the release tags (v3.0 and later) as Gem::Version objects,
# skipping 2.6.x and release candidates.
tags = `git tag`
  .lines
  .grep_v(/v2\.6/)
  .grep(/^v[0-9]\./)
  .map { |tag| Gem::Version.new(tag[1..-1]) }
  .reject(&:prerelease?)

# Group the tags by major.minor series (all of the 5.13.y tags together);
# `git tag` output keeps each series lexically contiguous, so chunk works.
releases = tags.chunk { |version| version.segments[0, 2] }

# Count the commits between each pair of consecutive tags within a series.
versions = {}
releases.each do |_series, minors|
  minors.sort.each_cons(2) do |prev, current|
    stdout_str, _status = Open3.capture2('git', 'rev-list', '--count', "v#{prev}..v#{current}")
    versions[current] = Integer(stdout_str.strip)
  end
end

CSV.open('/tmp/kern.csv', 'wb') do |csv|
  versions.each do |gem_version, commits_count|
    csv << [gem_version.to_s, commits_count.to_s]
  end
end
```
I got data in https://docs.google.com/spreadsheets/d/1ieNTu9qyaOSx00A8B... (hopefully it works, google docs is hard).

So while we have many commits here, kernel-3.16.35 has 3960 more commits than 3.16.34 which is way bigger than any of the mentioned kernels.

The core of the -stable debate

Posted Jul 22, 2021 23:00 UTC (Thu) by ojab (guest, #54051) [Link] (1 responses)

And if we count rc and add tag date https://gist.github.com/e92a2fe68405f3acb621b1e1b5d5c863 we would have https://docs.google.com/spreadsheets/d/1LJL1o6IBRZb3bb2Pn... table.

While 5.13.2 after 5.14-rc1 has 803 commits, it's not really unusual number at all. 5.12.4 after 5.13-rc1 had 676 commits, 5.11.3 after 5.12-rc1 had 773 commits, etc.

The core of the -stable debate

Posted Jul 22, 2021 23:08 UTC (Thu) by ojab (guest, #54051) [Link]

Overall it looks like the maintainer/gregkh scripts were changed after 5.10-rc1; there is a ~2x jump, from ~450 to 750+ patches going into the current stable after each rc1.

The core of the -stable debate

Posted Jul 22, 2021 22:51 UTC (Thu) by zblaxell (subscriber, #26385) [Link] (3 responses)

> the development cycle is not even close to being finished when Linus Torvalds applies a tag and leaves the building.

"One kernel release good enough for everyone" has never happened before in human history, and it's certainly not going to happen for the first time within the lifetime of the Linux kernel project.

Every project I've ever worked on, witnessed, or even _heard of_ follows one of two patterns for at-scale user deployments:

  1. Start from upstream (Linus in this case) and run it through weeks, months, or years of review and testing in a multi-stage QA pipeline, until it's finally ready for general user workloads, i.e. not worse for some specific workload than what your specific users are running now.
  2. Ship it downstream without testing because upstream signed off on it, and wonder why the UX is so terrible that users have started following you around with pitchforks and torches--or switched to some other upstream, and no longer following you at all.
If you don't want pattern #2, you have to drastically narrow the scope to the parts of the software you can test, and (for security) make sure software you're not using can't be brought into scope. There's finite resources available and choices have to be made. Users downstream from you might disagree with your choices. One of the awesome things about the Linux development model is that this disagreement doesn't matter, because users can choose the QA filter path between them and Linus. As long as downstream QA sends feedback upstream so that upstream eventually improves, and users can reposition themselves at the output end of a QA filter appropriate for their needs (or build their own custom filter), it all more or less works.

It follows that all downstream users are QA filters, all the way to the end user. Distro follows kernel.org, maybe applies a few extra fixes or drops entire releases, or imposes a 6-week delay so that users of other distros can report bugs. Corporate IT follows distro, runs more specialized tests for product owners. Product owners run their own product's tests against new kernels. If the product stops working, there shall be no kernel upgrade that week ("concrete outage" usually trumps "theoretical vulnerability"). App users who fail to get service because one of the above didn't do their QA diligence will migrate to some other startup's product. IoT vendors who let a bad kernel upgrade slip through might experience a business-terminating event. Even individual hobbyists will back out a bad kernel upgrade after the fact, and try upgrading again next week.

10-20% of kernel.org kernels currently fail product regression tests every year, so this filtering process is more or less mandatory for users running Linux in production. With more aggressive -rc kernel fix scraping, that defect rate might rise to 40% and not cause any real problems for anyone: some users might discover they are not positioned at the output end of QA filter appropriate for their needs, and maybe it's a few months between usable upstream kernels instead of a few weeks, but the overall flow won't change. It will still be necessary to distill every -stable kernel before use even if the defect rate was magically orders of magnitude lower.

On the other side of Linus (i.e. where Linus is downstream) it's a very different problem. Subsystem maintainers spend a lot of resources on testing changes in groups (i.e. the entire content of a pull request) and comparatively little on testing all possible combinations of individual commits. Scraping individual patches out of -rc loses the benefit of that pre-integration testing (at some level of abstraction, this is what led to the late-2020 XFS regression). Subsystem maintainers' QA tests will pass because the PR has all the critical dependencies in place, while the scraped-commits version fails because critical dependencies weren't declared by tags. Sometimes it's only possible to know what a commit truly depends on by bisection testing, so it's too burdensome to ask the maintainer to declare dependencies--it would be less work to have the maintainer send the PR, or a bespoke modified version of it, to -stable, than to fill in tags everywhere in case a robot might one day use them.

If the scraping becomes too aggressive,

  1. subsystem maintainers will have to divert some resources to testing the scraper's output in addition to their existing work (the scraper's input),
  2. someone will have to contribute resources to replicate the subsystem tests on the scraper's output side, or
  3. everyone will have to accept the net decrease in -stable kernel quality.
I think if we want more aggressive -stable backporting, we really have to directly ask subsystem maintainers to provide that, and maybe contribute some humans to specialize in that work. The robot is like a mediocre intern, creating more work for others than it accomplishes itself.

The core of the -stable debate

Posted Jul 22, 2021 23:27 UTC (Thu) by roc (subscriber, #30627) [Link] (2 responses)

This "let downstream be responsible for QA" approach creates a lot of problems.

One problem is, a lot of users use distros that ship -stable releases with minimal testing. So users *do* get burned because their QA path is too short.

On the other hand, distros kind of *need* to ship -stable releases with minimal testing because you need to ship security updates quickly and kernel upstream intentionally conceals information about which commits are security fixes. I guess in some cases this information is circulated on private mailing lists, but AFAIK that's not true for all security fixes.

Another problem is, users experiencing kernel regressions often misattribute the issue to other software, and developers of that software are burdened. Happens to us with rr from time to time.

Another problem is, the more "downstream QA" is required, the slower and less reliable is the process of reporting and fixing regressions.

Another problem is, all the downstream kernel testing and bug reporting duplicates work. When users find a regression it's usually not just one user, but many who hit the bug, diagnose it, track down whether it has been reported, and often end up reporting it multiple times. I imagine the same is true to a lesser extent at the distro level.

But "Downstream is responsible for QA" is a great model for kernel developers and that trumps all.

The core of the -stable debate

Posted Jul 23, 2021 7:48 UTC (Fri) by taladar (subscriber, #68407) [Link] (1 responses)

As long as downstream wants to change which commits are included in their kernel they will have to do QA. As the person you replied to said, you can not expect developers to test every possible combination of commits for possible bugs or regressions.

What is really at the core of the issue here is the false concept of "stable" versions based on backports. A "stable" version is nothing more than a completely new and untested version of the software that no developer has seen in that form. The fact that its newness comes not from new development but from selecting a subset of the patches between the old "stable" version and the latest release does not make it any more stable.

The core of the -stable debate

Posted Jul 23, 2021 21:25 UTC (Fri) by roc (subscriber, #30627) [Link]

Downstream has to do some QA, yes, but that can and should be minimised.

The attitude should be "if a bug introduced upstream reaches downstream, that is a failure of upstream QA which should be fixed".

The core of the -stable debate

Posted Jul 23, 2021 3:34 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (2 responses)

I know this trouble pretty well from facing it in haproxy. There, we ask that bug fixes be explicitly marked as such and that, when known, developers mention how far they should be backported (not surprising that I was one of the proponents of "Fixes:" long ago :-))

What I'm used to doing there is to *explicitly* mention in a commit message "no backport needed" if I prefer it not to be backported. I know that some devs will not make any mention if they don't want it to be backported; it mostly depends on what developers work on (i.e. perpetual core development usually causes bugs that require a quick fix that has no place in stable, while peripheral areas tend to be more stable and should get their fixes backported unless specified otherwise).

Here we're still missing the notion of "this is a fix"; it often comes via the "Fixes:" tag, but given that the tag has to contain a commit ID, it's not always trivial to get, or developers are lazy. I really think that including "fix" or "bug" in the subject helps a lot to convey that information.
And from this, depending on the indication *and* the person, you know the policy: explicit no-backport (written), implicit no-backport (person), implicit backport (person), explicit backport (written).

This approach gives us a lot of flexibility. For example some developers occasionally provide significant improvements in certain areas that are considered worth backporting after some observation time. They can mention it in their commit message like "it might be worth backporting; if done, this patch also requires the following patches to be backported: <list>". And that's used quite a bit because it's actually rewarding for developers to see their improvements adopted faster. For example when you change a test from O(n^2) to O(n*log(n)), sometimes it's worth backporting if O(n^2) was causing stability trouble to some users even if it's not technically a fix.
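
For illustration only, a backporter following the convention described above could triage candidate commits with something like the sketch below; the BUG/ subject prefixes follow haproxy's commit-message conventions and the "no backport needed" string is the marker mentioned above, but this is a sketch of the policy, not haproxy's actual backporting tooling.

```ruby
# Triage a range of commits following the policy described above: the subject
# prefix says whether it is a bug fix, "no backport needed" in the body opts
# it out, and a Fixes: tag points at the commit that introduced the bug.
range = ARGV.fetch(0, 'v2.4.0..master')

`git rev-list #{range}`.split.each do |sha|
  subject = `git log -1 --format=%s #{sha}`.strip
  body    = `git log -1 --format=%B #{sha}`
  next unless subject.start_with?('BUG/')   # not a fix, nothing to backport

  if body =~ /no backport needed/i
    puts "skip     #{subject}"              # author explicitly opted out
  elsif body =~ /^fixes:/i
    puts "backport #{subject}"              # target every branch with the buggy commit
  else
    puts "review   #{subject}"              # apply the backporter's own criteria
  end
end
```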

The core of the -stable debate

Posted Jul 23, 2021 12:35 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

It seems like it'd be useful to drop a note to the original patch submission thread that "hi, patches A, B, and C from this series have been selected for stable trees X, Y, and Z" (bonus points if the bit that hinted it should be backported could be teased out of the ML tool and mentioned as well) to start discussion about it. Then authors and reviewers can give feedback of "you can't just pull a subset of these patches" or "your machine is wrong, please add this as a negative case for training purposes". It's certainly better than expecting everyone to be on top of an 800-patch review that's go-or-no-go in two days.

However, I believe this likely requires improved metadata tracking throughout the kernel development workflow which opens *that* topic.

The core of the -stable debate

Posted Jul 23, 2021 12:59 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

I was about to say that what you're proposing is already handled by being CCed when the patches are sent to the queue, but no, I get your point and I think you're right. Having a synthetic summary as a response to an old thread, making it clear that only a subset was picked, can be useful. In addition, reviewers/authors/participants could also respond "wait a minute, that was broken, we added two extra patches later". Remember the Spectre series of fixes?

With lore.kernel.org in place, and the ability to identify threads and patch series, I think it should not be too difficult anymore to spot the relevant thread, message ID, and CC list and respond there. Of course it's definitely extra work to implement, but probably worth it.

The core of the -stable debate

Posted Jul 23, 2021 8:42 UTC (Fri) by mkubecek (guest, #130791) [Link] (1 responses)

> Many developers see the stable updates as a carefully curated collection of hand-selected fixes, each of which has been extensively reviewed for importance and the lack of regression potential.

One can hardly blame them if we still carry file Documentation/process/stable-kernel-rules.rst in the kernel tree and its first section still says essentially what these people think. It's really interesting to go through these rules and compare them to how stable backports are actually done these days (and have been for a few years). Things can certainly change but if stable maintainers decided that stable branches should work in a completely different way than they were originally meant to, they should IMHO start by rewriting those rules to avoid confusion and make it clear that stable is no longer what it used to be.

The core of the -stable debate

Posted Jul 23, 2021 9:42 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

I'm comforted that I am not the only one having had discussions similar to https://lore.kernel.org/linux-mm/20190808163928.118f8da4f...

The core of the -stable debate

Posted Jul 26, 2021 14:58 UTC (Mon) by tlamp (subscriber, #108540) [Link] (8 responses)

> As always in the Linux world, the people who are doing the work will decide how the work is to be done, and the stable maintainers have opted for the "promiscuous" approach.

But they do *not* have to do the work of diagnosing weird regressions in a specific subsystem, finding out what -stable even backported (as often preparatory stuff is missed) and then seeing how to fix it; that's done either by the maintainers themselves after getting reports from users, or by some downstream project that was naive enough to think the name of the -stable tree and its documentation[0] actually resemble reality.

This is asymmetric work generation: a few people use semi-transparent, automated processes, which cannot be that much work (how else could one churn out multiple stable releases with hundreds to thousands of commits almost daily?), to create quite some amount of pain and diagnostic and fixing work for a lot of people.

It seems that even those producing the actual fixes for a completely different code tree cannot request otherwise, i.e., that some patches be excluded from those dumb scripts; they have no choice and are then forced to react and face the negative image, as from the outside one mostly sees that the respective subsystem took in the "bad patch" (even if it went bad only due to backporting).

So, the people that have to do the actual work do not seem like being able to decide much...

[0]: https://www.kernel.org/doc/html/v5.10/process/stable-kern...

The core of the -stable debate

Posted Jul 26, 2021 17:04 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (1 responses)

> This is asymmetric work generation, a few people use semi-transparent, automated processes, which cannot be that much work (as how else could one churn out multiple stable releases with hundreds to thousands of commits almost daily), to create quite some amount of pain and diagnostic + fixing work for a lot of people.

I strongly disagree with the oversimplification you're making here.

Having extended a few kernels in the past, the last two being 2.6.32 and 3.10, I can tell you that backporting requires a huge amount of work. In order to propose 200 patches for review, I had to work two weekends full-time.

In such a situation you try to automate whatever you can, but that only solves the easy parts. There's a lot of background work: discussions with authors to ask whether something that looks important and resists the backport is really important, etc. For sure, Sasha developed some tools to ease the work. They're mostly about pre-selecting fixes that look relevant but were not tagged (since the identified problem in the first place is that maintenance is far from being a primary concern for many submitters). But the backport work has to be done anyway. You can instrument your tools to try to fit a patch into a version, to try to figure out which of the surrounding patches helps fix the context and whether it's relevant or not. Yet there's a lot of manual work.

And I would say that most of the patches that backport without effort are unlikely to be the ones causing the most regressions, because they're trivial and in code that almost never changes.

When Greg recently threatened not to extend 5.10 if he didn't get more hands to help him, it was not to make noise; it was because this really is a tough job. I personally offered to try to do a little bit more, but I don't have the energy to work as much as I used to on this. Two 16-hour-a-day weekends per release to review 6,000 patches and pick 200 is way too much for me now.

Now if you think that you'd do better than what the stable team currently does, I strongly encourage you to contact stable@kernel.org to propose your help, which will really be welcome.

The core of the -stable debate

Posted Jul 27, 2021 7:15 UTC (Tue) by tlamp (subscriber, #108540) [Link]

It was really an oversimplification, sorry about that.

But the point I rather wanted to convey is: why not allow maintainers their say in the matter?

If a maintainer, i.e., one of the people who knows their subsystem best, objects to backporting stuff they do not deem fit for it, shouldn't that be respected? Yes, assembling stable kernels is surely more work than my oversimplification may have made it look like, but it's also a lot of work to be forced into fixing up something you objected to in the first place.

So why not keep the automatic process, scooping up as much stuff as one thinks could possibly fix something, but also honor the objections of some maintainers?

> Now if you think that you'd do better than what the stable team currently does, I strongly encourage you to contact stable@kernel.org to propose your help, which will really be welcome.

I do not necessarily think that, and I also did not state that.
FWIW, my downstream kernel was bitten by such backports, which I spent quite some time diagnosing and untangling, and I also sent the result to the stable list.

This is also something a newcomer could not really help with or change; it's a more general decision, and a single person declining to propose backports that a maintainer does not deem fit won't stop the existing ones from doing so.

The core of the -stable debate

Posted Jul 26, 2021 21:24 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (5 responses)

> But they do *not* have to do the work to diagnose weird regressions in a specific sub system, find out what -stable even did backport (as often preparatory stuff is missed) and then see how to fix, that's done by either the maintainer themselves after getting reports from users or from some downstream project, that was naive enough to think the name of the -stable tree and its documentation[0] are actually resembling reality.

What would happen if the various maintainers and downstreams just stopped doing that work? "Nope, sorry, it doesn't repro on Linus's tree, so this bug is WONTFIX. Go complain to whoever backported your kernel for you, whether that's -stable or your distro."

I tend to imagine this would be disruptive, but it also strikes me that, ultimately, this is a volunteer project. If you don't want to do the work, you always have the option of, well, not doing it.

The core of the -stable debate

Posted Jul 27, 2021 7:04 UTC (Tue) by tlamp (subscriber, #108540) [Link] (4 responses)

> I tend to imagine this would be disruptive, but it also strikes me that, ultimately, this is a volunteer project. If you don't want to do the work, you always have the option of, well, not doing it.

Sure, but as mentioned your subsystem would also face "bad press"; users only see that the specific hardware, file system, or whatnot is brittle, not the reasoning behind that. So, as most maintainers want their work to be actually useful and not riddled with issues, you have some pressure to do that work to a certain degree.

I do not want to say the stable maintainers are doing this intentionally, and I get their reasoning, but if a maintainer cannot even opt out of patches they do not see as fit for backporting (who would be better placed to judge that than the maintainer?), this (badly worded) "getting forced to act on something you did not want" can be the result, at least sometimes.

The process can be Ok in general, but why not allow the actual maintainers their say, as they are the ones that would actually do the work cleaning up any mess?

The core of the -stable debate

Posted Jul 27, 2021 14:32 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

> The process can be Ok in general, but why not allow the actual maintainers their say, as they are the ones that would actually do the work cleaning up any mess?

They're always consulted about this; it is exactly what's asked during the patch review:

.... "If anyone has any issues with these being applied, please let me know." ....

Maybe it's just that the review duration is too short. But it's hard to see how the subject could be made any more visible; it precisely indicates the version and the patch. If maintainers don't want to check because they don't care, there's nothing that can be done for their subsystems. I can understand that some don't want to be bothered with this and turn backports off by default, though.

The core of the -stable debate

Posted Jul 27, 2021 15:19 UTC (Tue) by tlamp (subscriber, #108540) [Link] (2 responses)

> They're always consulted about this, this is exactly what's asked during the patch review:

But the article we're commenting on states:

> "At present this -stable promiscuity is overriding the (sometime carefully) considered decisions of the MM developers, and that's a bit scary."
> -- Andrew Morton

So the maintainer would need to track all proposed series for all stable releases, check for any patches that they did not ack for backporting, and explicitly NAK them; and reading GKH's response here, it feels a bit like they would then probably need an actual Good Reason™ for each NAK. That sounds a bit backwards to me and like quite a lot of work (but granted, I'm only contributing occasionally, so that does not have to mean much; good filters and a bot could automate that if an actual maintainer really felt that much pressure).

The core of the -stable debate

Posted Jul 28, 2021 3:08 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

As I said previously, the only purpose of the bot is to collect the fixes from all those not used to explicitly marking their patches for backporting. As long as some maintainers are trusted and state that they take the responsibility for mentioning that and do not want to be bothered for each of their patches, that's great. That's exactly where the bot needs to support filters basically saying "let's trust this maintainer when he says he does not want his patches backported". But based on what's currently merged, that just happens to be a small minority.

The core of the -stable debate

Posted Jul 30, 2021 15:40 UTC (Fri) by ecree (guest, #95790) [Link]

I've definitely experienced at least once that I NAK'ed an autosel patch and then a couple of weeks later it showed up again in autosel and I had to NAK it again. Idk whether the process has been fixed to prevent that — it hasn't come up recently, probably because I've not been sending much upstream — but it certainly left a sour taste as to the workload of watching autosel like a hawk to catch it when it tries to screw up our driver. I can well understand why people with responsibility for subsystems with a larger patch volume are getting frustrated.

The core of the -stable debate

Posted Jul 27, 2021 2:03 UTC (Tue) by calumapplepie (guest, #143655) [Link] (3 responses)

Let's add some better tags for these poor patches.

There are other comments and people talking about better ways to structure the kernel workflow than "lob everything at a mailing list and hope for the best". Let's assume, for now, that they all fail miserably and we are forever stuck with the current email-based workflow.

Why not add a few more tags, to better indicate the nature of a given fix? For instance:

Severity:{cosmetic,minor,normal,important,grave}: describes how bad the bug is. Grave patches are always backported, cosmetic ones are never backported, and minor patches are only backported as part of a larger set. Obviously, this should be paired with a Fixes:. Secret security issues can be marked normal or important; public ones should (obviously) be marked grave.

Stable-suitability:{certain,possible,doubtful,unsuitable,breaking}: certain should come with a CC to the stable address; doubtful indicates a patch that the maintainer believes is not suitable for stable. Unsuitable and breaking both indicate patches that shouldn't be backported, but an "unsuitable" patch might be able to be backported with effort, rewriting, and (possibly) rethinking, while a "breaking" patch should just not be backported. Declaring "breaking" would require some justification: "this might break stuff" isn't one.

(If a person can't decide between two options, then stringing them together is acceptable. For instance, "Stable-suitability:certain-certain-possible" means "I'm 66% certain this should go in a backport, but it's possible I'm wrong, so read the commit message".)

Impact:{None,Insignificant,Small,Large}: indicates the amount of other code that could be broken by the change. "None" is for changes that have no impact on the external behavior of the function (so, adding handling for out of memory would be None, since the situation is rare and the alternative is Fiery Death, but adding a new error code wouldn't be, since callers might have assumed the only possible error is OOM).
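For illustration only, a hypothetical commit footer using these proposed tags (none of which exist today; the values and the placeholder commit reference are made up) might look something like:

    Severity: important
    Stable-suitability: certain
    Impact: Small
    Fixes: <commit that introduced the bug>
    Cc: stable@vger.kernel.org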

----

Obviously a more rigorous spec would be needed, and probably some better names, but I think this would serve as a stopgap for kernel patchers to make their intent clearer while smarter folks try to completely rework the workflow. It wouldn't be obligatory, and I don't think it adds much to the workload of a patcher: it's just a machine-parsable way of saying things that a reviewer should already be able to tell from the patch and the notes. The stable-suitability and severity tags are probably the most needed, since they directly address significant problems we have right now.

- just some nonsense from a maniac

The core of the -stable debate

Posted Jul 27, 2021 6:01 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (2 responses)

I generally agree with this, not necessarily with the details, but with the principle. I've defended "fixes" as a way to improve this in the past. The only thing is that "fixes" mentions a commit ID, and backportability is inferred from this, which is not the same as giving instructions.

In haproxy we don't use tags for this, in order to force the persons doing the backport to read the commit message, where the backporting instructions are, and figure it out by themselves. But at a larger scale, tags would definitely help.

However having just tokens in the tags doesn't work well because very often you'll see this pattern:
- a bug is introduced by patch X merged into, say, 2.6.35
- a fix for it is introduced in 3.4 and backported to stable releases between X and 3.4
- later in 5.14-rc it's found that the fix introduced a vulnerability.
- the analysis shows that it's only been an exploitable vulnerability since 5.8, that it possibly presented a severe risk of random crashes since 5.3, that it could have caused oopses since 4.12 and that it could only have caused misleading debug messages since 4.0.
- the fix requires changing the way a recent API is used

=> the severity often depends on the version, due to interactions with other code parts
=> the solution requires tricky adaptations for older versions, and the trickier the adaptations, the less severe the bug was
=> the risks and benefits of the backport are also dependent on the version.

Thus, we definitely need to encourage people to write instructions for the stable team. If only 10% of the patches have full-text instructions, and these patches are the most painful and riskiest to backport, it makes sense to stay away from a machine-readable format for these instructions and to rely on a human to quickly sort them out (sometimes suggestions like "might be easier if commit XXX is also backported" are helpful).

In addition, for complicated backports, some maintainers may want to observe before going further. That's a "cool down" period where only the latest stable and latest LTS get the patch but not previous ones.

The impact on the code size is not always easy to gauge without doing the backport, and sometimes the amount of code is not a valid metric, because you could imagine just backporting some library code, for example, which usually applies cleanly. But code that relies on changed semantics is much trickier: it can be a two-liner relying on a new locking scheme that didn't exist in the past and that would require a totally different approach, for example.

Thus we could have a "severity" tag with estimates at the time of writing, plus another value such as "depends", which indicates that it depends on the version and that the text has to be read; a "risk" field indicating the estimated risk of backporting the fix, including references to interactions with other subsystems; and possibly just a "stable:" tag for instructions to the humans. This is the one that could also be used to write "stable: please do not backport this beyond 5.4", "no backport at all", or "wait one month before 5.4". And it saves the person from having to read the whole commit message when the tag is present.
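Purely as a sketch (these tag names are hypothetical and the values are made up, apart from the "stable:" example above), a footer along those lines could read:

    Severity: depends (see commit message)
    Risk: moderate, interacts with other subsystems on older releases
    Stable: please do not backport this beyond 5.4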

Also, the field names should be short and easy to remember. But given that "Fixes:" is already seldom used, just like "cc: stable", I'm wondering how well that would work without a bit more education :-/

The core of the -stable debate

Posted Jul 27, 2021 14:40 UTC (Tue) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

Another problem is that more information about the effects of a patch becomes available after commit, so the commit message becomes out-of-date. It might be useful to have the ability to append more information to a commit after it has been accepted, in some kind of online database that supplements git. If you submit a patch that caused an issue, you can append some notes to it later. If you do a git bisect to find a bad commit, you can add a note to that commit, even if no one else bothers to fix the problem. If you change your mind about whether or not a patch is suitable for backporting to -stable, you can add a note about it. etc.

The core of the -stable debate

Posted Aug 10, 2021 1:09 UTC (Tue) by calumapplepie (guest, #143655) [Link]

*pokes git-notes* you got anything to say for yourself?

Of course, that can't be (easily) pushed/pulled from origin, but it might help. Perhaps someone could tweak Git to do that by default?
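For what it's worth, notes can already be shared today with explicit refspecs; it's just not automatic. A quick sketch with stock git (the commit and the note text are placeholders):

    # attach a note to an existing commit without rewriting it
    git notes add -m "backport to 5.4 caused a regression, see list discussion" <commit>

    # publish and fetch the notes refs explicitly
    git push origin refs/notes/commits
    git fetch origin 'refs/notes/*:refs/notes/*'

    # the note then shows up alongside the commit
    git log -1 --notes <commit>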

The core of the -stable debate

Posted Jul 28, 2021 19:09 UTC (Wed) by johannbg (guest, #65743) [Link]

Andrew Morton and other devs who share his views seem to be basing their position on the assumption that users actually report the issues they are faced with, which, more often than not, is not the case...


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds