
Revisiting stable-kernel regressions

By Jonathan Corbet
February 13, 2020
Stable-kernel updates are, unsurprisingly, supposed to be stable; that is why the first of the rules for stable-kernel patches requires them to be "obviously correct and tested". Even so, for nearly as long as the kernel community has been producing stable update releases, said community has also been complaining about regressions that make their way into those releases. Back in 2016, LWN did some analysis that showed the presence of regressions in stable releases, though at a rate that many saw as being low enough. Since then, the volume of patches showing up in stable releases has grown considerably, so perhaps the time has come to see what the situation with regressions is with current stable kernels.

As an example of the number of patches going into the stable kernel updates, consider that, as of 4.9.213, 15,648 patches have been added to the original 4.9 release — that is an entire development cycle's worth of patches added to a "stable" kernel. Reviewing all of those to see whether each contains a regression is not practical, even for the maintainers of the stable updates. But there is an automated way to get a sense for how many of those stable-update patches bring regressions with them.
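Such counts are easy to check. A minimal Python sketch, assuming a linux-stable checkout with the v4.9 and v4.9.213 tags available, might look like this:

    import subprocess

    # Count the commits added between the original release and a stable
    # update; this should print a number close to the 15,648 quoted above.
    count = subprocess.run(
        ['git', 'rev-list', '--count', 'v4.9..v4.9.213'],
        capture_output=True, text=True, check=True).stdout.strip()
    print(count)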

The convention in the kernel community is to add a Fixes tag to any patch fixing a bug introduced by another patch; that tag includes the commit ID for the original, buggy patch. Since stable kernel releases are supposed to be limited to fixes, one would expect that almost every patch would carry such a tag. In the real world, between 36% and 61% of the commits to a stable series carry Fixes tags (as the table below shows); the proportion appears to be increasing over time as the discipline of adding those tags improves.
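Such a tag names the abbreviated commit ID (and title) of the buggy commit; the fix for the ACPICA commit discussed later in this article, for example, would carry a tag along the lines of:

    Fixes: 4abb951b73ff ("ACPICA: AML interpreter: add region addresses in global list during initialization")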

It is a relatively straightforward task (for a computer) to look at the Fixes tag(s) in any patch containing them, extract the commit IDs of the buggy patches, and see if those patches, too, were added in a stable update. If so, it is possible to conclude that the original patch was buggy and caused a regression in need of fixing. There are, naturally, some complications, including the fact that stable-kernel commits have different IDs than those used in the mainline (where all fixes are supposed to appear first); associating fixes with commits requires creating a mapping between the two. Outright reverts of buggy patches tend not to have Fixes tags, so they must be caught separately. And so on. The end result will necessarily contain some noise, but there is a useful signal there as well.

For the curious, this analysis was done with the stablefixes tool, part of the gitdm collection of repository data-mining hacks. It can be cloned from git://git.lwn.net/gitdm.git.
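What follows is a highly simplified sketch of the idea in Python; it is not the actual stablefixes code. It assumes a linux-stable checkout, and relies on the two common forms of the upstream-reference line that stable backports carry ("commit <id> upstream." and "[ Upstream commit <id> ]") to map stable commits back to mainline IDs:

    import re
    import subprocess

    # Find Fixes tags whose target commit was itself part of the stable
    # series. Real tooling must also handle short commit IDs, reverts
    # without Fixes tags, and similar noise; this sketch does not.
    UPSTREAM = re.compile(r'commit ([0-9a-f]{40})(?: upstream| \])', re.I)
    FIXES = re.compile(r'^\s*Fixes:\s*([0-9a-f]{12,40})', re.I | re.M)

    def commits(rev_range):
        log = subprocess.run(
            ['git', 'log', '--no-merges', '--format=%H%x00%B%x01', rev_range],
            capture_output=True, text=True, check=True).stdout
        for entry in log.split('\x01'):
            if '\x00' in entry:
                sha, body = entry.split('\x00', 1)
                yield sha.strip(), body

    in_series = set()   # mainline IDs of commits backported into this series
    fixes = []          # (stable commit, mainline ID of the commit it fixes)
    for sha, body in commits('v4.9..v4.9.213'):
        m = UPSTREAM.search(body)
        if m:
            in_series.add(m.group(1)[:12])
        for target in FIXES.findall(body):
            fixes.append((sha, target[:12]))

    # A fix whose target also landed in the series marks a regression.
    regressions = [f for f in fixes if f[1] in in_series]
    print(f'{len(regressions)} of {len(fixes)} tagged fixes '
          'repair another commit in the series')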

Back in 2016, your editor came up with a regression rate of at least 2% for the longer-term stable kernels that were maintained at that time. The 4.4 series, which had 1,712 commits then, showed a regression rate of at least 2.3%. Since then, the number of commits has grown considerably — to 14,211 in 4.4.213 — as a result of better discipline and the use of automated tools (including a machine-learning system) to select fixes that were not explicitly earmarked for stable backporting. Your editor fixed up his script, ported it to Python 3, and reran the analysis for the currently supported stable kernels; the results look like this.

Series      Commits   Tags           Fixes   Reverts
5.4.18        2,423   1,482 (61%)       74        29
4.19.102     11,758   5,647 (48%)      588       100
4.14.170     15,527   6,727 (43%)      985       134
4.9.213      15,647   6,286 (40%)      951       139
4.4.213      14,210   5,110 (36%)      834       124

In the above table, Series identifies the stable kernel that was looked at. Commits is the number of commits in that series, while Tags is the number and percentage of those commits with a Fixes tag. The count under Fixes is the number of commits in that series that explicitly fix another commit applied to that series. Reverts is the number of those fixes that were outright reverts; a famous person might once have said that reversion is the sincerest form of patch criticism.

Looking at those numbers would suggest that, for example, 3% of the commits in 5.4.18 are fixing other commits, so the bad commit rate would be a minimum of 3%. The situation is not actually that simple, though, for a few reasons. One of those is that a surprising number of the regression fixes appear in the same stable release as the commits they are fixing. In a case like that, while the first commit can indeed be said to have introduced a regression, no stable release actually contained the regression and no user will have ever run into it. Counting those is not entirely fair. If one subtracts out the same-release fixes, the results look like this:

Series      Fixes   Same-release fixes   Visible regressions
5.4.18         74                   29                    45
4.19.102      588                  176                   412
4.14.170      985                  253                   732
4.9.213       951                  229                   722
4.4.213       834                  232                   602
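In 5.4.18, for example, 29 of the 74 fixes landed in the same release as the commit they repair, leaving 45 regressions that were actually visible in at least one release.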

Another question to keep in mind is what to do with all those commits without Fixes tags. Many of them are certainly fixes for bugs introduced in other patches, but nobody went to the trouble of figuring out how the bugs happened. If the numbers in the table above are taken as the total count of regressions in a stable series, that implies that none of the commits without Fixes tags are fixing regressions, which will surely lead to undercounting regression fixes overall. On the other hand, if one assumes that the untagged commits contain regression fixes in the same proportion as the tagged ones, the result could well be a count that is too high.

Perhaps the best thing that can be done is to look at both numbers, with a reasonable certainty that the truth lies somewhere between them:

Series      Visible regressions   Regression rate (low)   Regression rate (high)
5.4.18                       45                    1.9%                     3.0%
4.19.102                    412                    3.5%                     7.3%
4.14.170                    732                    4.7%                    10.9%
4.9.213                     722                    4.6%                    11.5%
4.4.213                     602                    4.2%                    11.8%
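The two bounds can be reproduced from the first table's counts: the low rate divides the visible regressions by all commits in the series (the undercounting assumption above), while the high rate divides by only the commits carrying Fixes tags (the proportional assumption). A few lines of Python confirm the percentages:

    # Reproducing the low/high regression-rate bounds from the tables:
    # low = visible regressions / all commits (untagged commits assumed clean),
    # high = visible regressions / tagged commits (untagged assumed similar).
    data = {  # series: (commits, commits with Fixes tags, visible regressions)
        '5.4.18':   (2423,  1482,  45),
        '4.19.102': (11758, 5647, 412),
        '4.14.170': (15527, 6727, 732),
        '4.9.213':  (15647, 6286, 722),
        '4.4.213':  (14210, 5110, 602),
    }
    for series, (commits, tagged, visible) in data.items():
        print(f'{series:<9} low {visible/commits:5.1%}  high {visible/tagged:5.1%}')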

So that is about as good as the numbers are going to get, though there are still some oddball issues. Consider the case of mainline commit 4abb951b73ff ("ACPICA: AML interpreter: add region addresses in global list during initialization"). This commit included a "Cc: stable@vger.kernel.org" tag, so it was duly included (as commit 22083c028d0b) in the 4.19.2 release. It was then reverted in 4.19.3, with the complaint that it didn't actually fix a bug but did cause regressions. The same change returned in 4.19.6 after an explicit request. Then, two commits followed in 4.19.35: commit d4b4aeea5506 addressed a related issue, naming the original upstream commit in a Fixes tag, while f8053df634d4 claimed to be a backport of the original upstream commit, which had already been applied. That last one looks like a fix for a partially done backport. How does one try to account for a series of changes like that? Honestly, one doesn't even try.

So what can we conclude from all this repository digging? The regression rates seen in 2016 were quite a bit lower than what we are seeing now; that would suggest that the increasing volume of patches being applied to the stable trees is not just increasing the number of regressions, but also the rate of regressions. That is not a good sign. On the other hand, the amount of grumbling about stable regressions seems to have dropped recently. Perhaps that's just because people have gotten used to the situation. Or perhaps the worst problems, such as filesystem-destroying regressions, are no longer getting through, while the problems that do slip through are relatively minor.

Newer kernels have a visibly lower regression rate than the older ones. There are two equally plausible explanations for that. Perhaps the process of selecting patches for stable backporting is getting better, and fewer regressions are being introduced than were before. Or perhaps those kernels just haven't been around for long enough for all of the regressions already introduced to be found and fixed yet. The 2016 article looked at 4.4.14, which had 39 regression fixes (19 fixed in the same release). 4.4.213 now contains 110 fixes for regressions introduced in 4.4.14 or earlier (still 19 fixed in the same release). So there is ample reason to believe that the regression rate in 5.4.18 is higher than indicated above.

In any case, it seems clear that the push to get more and more fixes into the stable trees is unlikely to go away anytime soon. And perhaps that is a good thing; a stable tree with thousands of fixes and a few regressions may still be far more stable than one without all those patches. Even so, it would be good to keep an eye on the regression rate; if that is allowed to get too high, the result is likely to be users moving away from stable updates, which is definitely not the desired result.




Revisiting stable-kernel regressions

Posted Feb 13, 2020 17:35 UTC (Thu) by arjan (subscriber, #36785) (2 responses)

Another angle might also be that backports to very recent kernels (so a small delta in the general code base) are less regression-prone than "much further back" backports, which take code tested in one code base into a very different codebase.

Revisiting stable-kernel regressions

Posted Feb 13, 2020 17:59 UTC (Thu) by josh (subscriber, #17465) (1 response)

Or more recent stable kernels have had less time for people to notice regressions, and their numbers will get worse over time.

Revisiting stable-kernel regressions

Posted Feb 13, 2020 18:29 UTC (Thu) by arjan (subscriber, #36785)

Yeah, many options.

Maybe the analysis to answer that is how many regressions are in the early stable releases (so .1 to, say, .20 or whatever) compared to the higher-numbered ones.

Revisiting stable-kernel regressions

Posted Feb 14, 2020 4:44 UTC (Fri) by sashal (✭ supporter ✭, #81842) (6 responses)

I keep running this sort of analysis myself to make sure that what we're doing with stable trees makes sense, and one gotcha that I feel this article missed is the rate of "Fixes:" tags in upstream commits.

Either we're getting better at finding bugs (and we are!), or people are getting more disciplined about tagging commits with the Fixes: tag, but consider the following:

$ git log --oneline --no-merges -i --grep "fixes:" v4.4..v4.9 | wc -l
4912
$ git log --oneline --no-merges v4.4..v4.9 | wc -l
67476
$ git log --oneline --no-merges -i --grep "fixes:" v4.14..v4.19 | wc -l
8562
$ git log --oneline --no-merges v4.14..v4.19 | wc -l
69363
$ git log --oneline --no-merges -i --grep "fixes:" v5.0..v5.5 | wc -l
10635
$ git log --oneline --no-merges v5.0..v5.5 | wc -l
70632

So while only 7.3% of the commits between 4.4 and 4.9 had a Fixes: tag, we see that rate jump to 12.3% of the commits between 4.14 and 4.19, and jump again to 15% between 5.0 and 5.5 - more than double(!) what we saw between 4.4 and 4.9.

I'd argue that if we're seeing an increase of Fixes: tags upstream, we're bound to see a similar increase in stable trees, even if the actual regression rate in stable trees has remained the same (or has gone down - which could explain your observation regarding less grumbling :) ).

Revisiting stable-kernel regressions

Posted Feb 14, 2020 7:43 UTC (Fri) by cpitrat (subscriber, #116459) (5 responses)

The increase in the proportion of patches with Fixes tags is mentioned in the article, and the numbers take that into account (at least the new ones; I'm not sure about the old ones). Did I miss a subtle difference in what you're pointing out?

Revisiting stable-kernel regressions

Posted Feb 14, 2020 14:43 UTC (Fri) by sashal (✭ supporter ✭, #81842) (4 responses)

It looked to me like the article only looks at the increase of Fixes tags in the context of stable trees, without looking at a corresponding increase in the upstream tree.

An interesting comparison might be to analyze how many upstream Fixes: tags fix something from a "current" merge window vs older release.

Revisiting stable-kernel regressions

Posted Feb 14, 2020 21:21 UTC (Fri) by tytso (subscriber, #9993) (3 responses)

My understanding of the article was that it was looking at Fixes: tags which point at a commit which was introduced in the stable kernel series. The analysis then excluded those commits which introduced a bug in the stable kernel, but where the Fix was added before it was visible --- that is, where the regression and the fix for the regression both happened between X.Y.Z and X.Y.Z+1, so that the regression was not visible to the user. This might happen if the first commit fixed a real problem, but had a side effect which was bad, and then the fix which fixed the side effect was backported in the same stable kernel release.

It would seem to me that a really interesting thing to do would be to identify those commits in stable kernels which caused a regression (i.e., which had a commit with an applicable Fixes line later on), see if we can identify any kind of machine-learning features for commits that are likely to be problematic, and perhaps use that to extend the time between when a commit that might be at risk of introducing a regression lands in Linus's tree and when it gets picked up by a stable branch.

Revisiting stable-kernel regressions

Posted Feb 14, 2020 21:38 UTC (Fri) by sashal (✭ supporter ✭, #81842) (2 responses)

Right, but if more "Fixes:" tags appear upstream, does it mean we introduce more regressions, or are we better at fixing/tagging?

With regards to your question, I've actually looked into that and did a talk last year (LWN covered it here: https://lwn.net/Articles/803695/). Based on the results, it seemed to me that letting -rc patches (and especially late -rc cycle patches) spend more time in -next would be valuable as those tend to be buggy.

I raised it with Linus at the Maintainer's summit (https://lwn.net/Articles/799219/): "Sasha Levin asked about whether the same sort of checking happens after -rc1 comes out; the answer was "generally not". Code entering the mainline after the merge window is supposed to be limited to important fixes, and linux-next is less useful for those. As far as Torvalds is concerned, fixes that do not appear in linux-next are not an issue at all. Levin protested that fixes are often broken; putting them in linux-next at least gives continuous-integration systems a chance to find the problems".

So Linus is just fine with taking patches during -rc cycle that weren't in -next even for a single day, and he isn't too interested in changing that.

Revisiting stable-kernel regressions

Posted Feb 14, 2020 22:17 UTC (Fri) by tytso (subscriber, #9993) (1 response)

Yes, I understand that Linus doesn't have a problem with letting things drop into stable right away. Then again, Linus may not be using the stable kernel series, or at least not in the same way as, say, Google's Container-Optimized OS (COS), which is trying to be upstream-first and based on the stable kernel. There *have* been customer-visible regressions where some stable kernel X.Y.Z caused more problems than if COS had stayed on X.Y.Z-1. I've told them that this means they need to do more testing, and not trust that X.Y.Z+1 will have fewer bugs that they care about than X.Y.Z --- because that's simply not true, and I'm not sure there's anything that can be done to reduce the bug introduction rate to zero.

But if there is something we can do to decrease the bug introduction rate, that would certainly be a good thing. And that's why I'm suggesting that if we can use ML to figure out which commits contain bug fixes, maybe there is a way we can use a training set of commits that landed in the stable kernels *and* which apparently had regressions, and see if we can find some features that tell us that those commits should get more careful screening. Whether that's "wait longer", or creating a list of commits that can be sent around for humans to take a closer look at, I don't have any concrete proposals, because I'm not sure what's the best thing we could do with that information. But I think it's worth some consideration and reflection to see if there's something we can do to further leverage ML; not just to select commits, but to flag commits for special care/handling/testing/review.

Finally, Sasha, please don't take this as a criticism of the job you are currently doing. Bugs and regressions in Linus's tree are inevitable; that's why we have thousands of commits flowing into the stable kernels. But this also means that bugs caused by bug fixes are also inevitable, and so the question is whether there is something we can do to improve the process to deal with the fact that we are all humans. Trying to improve any kind of development or ops process is best done in a blame-free manner.

Revisiting stable-kernel regressions

Posted Feb 15, 2020 1:29 UTC (Sat) by sashal (✭ supporter ✭, #81842)

Not trying to blame anyone, just pointing out that I've already done the research you suggested looking into, but I couldn't convert it into any practical result. I'd be happy to discuss it further if you have input on how to improve the process.

There is more information about the work here: https://lwn.net/Articles/753329/ .

Revisiting stable-kernel regressions

Posted Feb 20, 2020 21:27 UTC (Thu) by smfrench (subscriber, #124116)

Seems like the easiest step would be to introduce more automated testing of stable trees (especially for filesystems, network drivers, etc.). Some components have public automated tests (the cifs.ko CIFS/SMB3 client, for example, has automated tests run against various servers: Samba, Windows, the new SMB3 Linux kernel server, the cloud, etc.), but how could maintainers like me pass information on the recommended automated tests for our component to someone involved in the stable-kernel validation process? And who could run those tests? Is there dedicated hardware or VMs for validating stable builds?

