LWN: Comments on "Revisiting stable-kernel regressions"
https://lwn.net/Articles/812231/
This is a special feed containing comments posted to the individual LWN article titled "Revisiting stable-kernel regressions".

Revisiting stable-kernel regressions (https://lwn.net/Articles/812946/)
Posted by smfrench on Thu, 20 Feb 2020 21:27:58 +0000:

Seems like the easiest step would be to introduce more automated testing of stable trees (especially for file systems, network drivers, etc.). Some of the components have public automated tests (the cifs.ko CIFS/SMB3 client, for example, has automated tests run against various servers: Samba, Windows, the new SMB3 Linux kernel server, the cloud, etc.), but how could maintainers like me pass the information on the recommended automated tests for our component to someone who is involved in the stable-kernel validation process? And who could run those? Is there dedicated hardware, or VMs, for validating stable builds?

Revisiting stable-kernel regressions (https://lwn.net/Articles/812525/)
Posted by sashal on Sat, 15 Feb 2020 01:29:33 +0000:

Not trying to blame anyone, just pointing out that I've already done the research you suggested looking into, but I couldn't convert it into any practical result. I'd be happy to discuss it further if you have input on how to improve the process.

There is more information about the work here: https://lwn.net/Articles/753329/

Revisiting stable-kernel regressions (https://lwn.net/Articles/812521/)
Posted by tytso on Fri, 14 Feb 2020 22:17:46 +0000:

Yes, I understand that Linus doesn't have a problem with letting things drop into stable right away. Then again, Linus may not be using the stable kernel series, or at least not in the same way as, say, Google's Container-Optimized OS (COS), which is trying to be upstream-first and based on the stable kernel. There *have* been customer-visible regressions where some stable kernel X.Y.Z caused more problems than if COS had stayed on X.Y.Z-1. I've told them that this means they need to do more testing, and not trust that X.Y.Z+1 will have fewer bugs that they care about than X.Y.Z --- because that's simply not true, and I'm not sure there's anything that can be done to reduce the bug-introduction rate to zero.

But if there is something we can do to decrease the bug-introduction rate, that would certainly be a good thing. And that's why I'm suggesting that, if we can use ML to figure out which commits contain bug fixes, maybe there is a way we can use a training set of commits that landed in the stable kernels *and* which apparently had regressions, and see if we can find some features that tell us that those commits should get more careful screening. Whether that means "wait longer", or creating a list of commits that can be sent around for humans to take a closer look at, I don't have any concrete proposals, because I'm not sure what's the best thing we could do with that information.
But I think it's worth some consideration and reflection to see if there's something we can do to further leverage ML; not just to select commits, but to flag commits for special care/handling/testing/review.

Finally, Sasha, please don't take this as a criticism of the job you are currently doing. Bugs and regressions in Linus's tree are inevitable; that's why we have thousands of commits flowing into the stable kernels. But this also means that bugs caused by bug fixes are also inevitable, and so the question is whether there is something we can do to improve the process to deal with the fact that we are all humans. Trying to improve any kind of development or ops process is best done in a blame-free manner.

Revisiting stable-kernel regressions (https://lwn.net/Articles/812517/)
Posted by sashal on Fri, 14 Feb 2020 21:38:53 +0000:

Right, but if more "Fixes:" tags appear upstream, does that mean we are introducing more regressions, or that we are getting better at fixing/tagging?

With regard to your question, I've actually looked into that and gave a talk last year (LWN covered it here: https://lwn.net/Articles/803695/). Based on the results, it seemed to me that letting -rc patches (and especially late -rc-cycle patches) spend more time in -next would be valuable, as those tend to be buggy.

I raised it with Linus at the Maintainers Summit (https://lwn.net/Articles/799219/): "Sasha Levin asked about whether the same sort of checking happens after -rc1 comes out; the answer was "generally not". Code entering the mainline after the merge window is supposed to be limited to important fixes, and linux-next is less useful for those. As far as Torvalds is concerned, fixes that do not appear in linux-next are not an issue at all. Levin protested that fixes are often broken; putting them in linux-next at least gives continuous-integration systems a chance to find the problems".

So Linus is just fine with taking patches during the -rc cycle that weren't in -next for even a single day, and he isn't too interested in changing that.

Revisiting stable-kernel regressions (https://lwn.net/Articles/812514/)
Posted by tytso on Fri, 14 Feb 2020 21:21:10 +0000:

My understanding of the article was that it was looking at Fixes: tags which point at a commit that was introduced in the stable kernel series. The analysis then excluded those commits which introduced a bug in the stable kernel, but where the fix was added before the bug was visible --- that is, where the regression and the fix for the regression both happened between X.Y.Z and X.Y.Z+1, so that the regression was never visible to the user. This might happen if the first commit fixed a real problem but had a bad side effect, and the fix for that side effect was backported in the same stable kernel release.

It would seem to me that a really interesting thing to do would be to identify those commits in stable kernels which caused a regression (e.g., commits that a later commit's Fixes: line points at), and see if we can identify any kind of machine-learning features for commits that are likely to be problematic, and perhaps use that to extend the delay between when a commit which might be at risk of introducing a regression lands in Linus's tree and when it gets picked up by a stable branch.
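As a very rough illustration of how such a training set might be assembled, the sketch below labels commits in a stable branch that a later commit's Fixes: tag points back at. This is only a starting point, not the methodology behind the article's numbers: the v4.19.y ranges, the upstream-SHA patterns, and the prefix matching of abbreviated SHAs are all simplifying assumptions, and the script is meant to be run from a linux-stable checkout.

import re
import subprocess

def git_log(rev_range):
    """Yield (sha, message) pairs for non-merge commits in rev_range."""
    out = subprocess.run(
        ["git", "log", "--no-merges", "--format=%H%x00%B%x1e", rev_range],
        capture_output=True, text=True, check=True).stdout
    for record in out.split("\x1e"):
        if record.strip():
            sha, _, body = record.lstrip("\n").partition("\x00")
            yield sha, body

# Stable backports usually quote their upstream commit ID in one of two forms.
UPSTREAM_RE = re.compile(r"^commit ([0-9a-f]{40}) upstream"
                         r"|\[\s*[Uu]pstream commit ([0-9a-f]{40})", re.M)
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})", re.M | re.I)

# Map each stable commit to the upstream SHA it was backported from, falling
# back to its own SHA for stable-only patches.  The range is an assumption.
backports = {}
for sha, body in git_log("v4.19..v4.19.100"):
    m = UPSTREAM_RE.search(body)
    backports[next(g for g in m.groups() if g) if m else sha] = sha

# A later Fixes: tag naming one of those commits marks the backport as having
# (apparently) introduced a bug -- a candidate "regression" label.
suspects = set()
for _, body in git_log("v4.19..v4.19.200"):
    for fix in FIXES_RE.findall(body):
        for upstream, stable_sha in backports.items():
            if upstream.startswith(fix.lower()):
                suspects.add(stable_sha)

print(f"{len(suspects)} of {len(backports)} backports were later fixed")

The resulting set could then serve as labels, with features drawn from the diff and commit metadata, for the kind of screening model suggested above.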
<br> </div> Fri, 14 Feb 2020 21:21:10 +0000 Revisiting stable-kernel regressions https://lwn.net/Articles/812495/ https://lwn.net/Articles/812495/ sashal <div class="FormattedComment"> It looked to me like the article only looks at the increase of Fixes tags in the context of stable trees, without looking at a corresponding increase in the upstream tree.<br> <p> An interesting comparison might be to analyze how many upstream Fixes: tags fix something from a "current" merge window vs older release.<br> </div> Fri, 14 Feb 2020 14:43:06 +0000 Revisiting stable-kernel regressions https://lwn.net/Articles/812454/ https://lwn.net/Articles/812454/ cpitrat <div class="FormattedComment"> The increase of proportion of patches with Fixes tags is mentioned in the article, and numbers take that into account (at least the new ones, not sure about the old ones). Did I miss a subtle difference in what you point out ?<br> </div> Fri, 14 Feb 2020 07:43:11 +0000 Revisiting stable-kernel regressions https://lwn.net/Articles/812448/ https://lwn.net/Articles/812448/ sashal <div class="FormattedComment"> I keep running this sort of analysis myself quite often to make sure that what we're doing with stable trees makes sense, and one gotcha that I feel that this article missed is the rate of "Fixes:" tags in upstream commits.<br> <p> Either we're getting better at finding bugs (and we are!), or people are getting more disciplined about tagging commits with the Fixes: tag, but consider the following:<br> <p> $ git log --oneline --no-merges -i --grep "fixes:" v4.4..v4.9 | wc -l<br> 4912<br> $ git log --oneline --no-merges v4.4..v4.9 | wc -l<br> 67476<br> $ git log --oneline --no-merges -i --grep "fixes:" v4.14..v4.19 | wc -l<br> 8562<br> $ git log --oneline --no-merges v4.14..v4.19 | wc -l<br> 69363<br> $ git log --oneline --no-merges -i --grep "fixes:" v5.0..v5.5 | wc -l<br> 10635<br> $ git log --oneline --no-merges v5.0..v5.5 | wc -l<br> 70632<br> <p> So while only 7.3% of the commits between 4.4 and 4.9 had a Fixes: tag, we see that rate jump to %12.3 of the commits between 4.14 and 4.19, and again jump to %15 between 5.0 and 5.5 - more than double(!) 
Revisiting stable-kernel regressions (https://lwn.net/Articles/812423/)
Posted by arjan on Thu, 13 Feb 2020 18:29:30 +0000:

Yeah, many options.

Maybe the analysis to answer that is how many regressions are in the early stable releases (so .1 to, say, .20 or whatever) compared to the higher-numbered ones.

Revisiting stable-kernel regressions (https://lwn.net/Articles/812419/)
Posted by josh on Thu, 13 Feb 2020 17:59:05 +0000:

Or more recent stable kernels have had less time for people to notice regressions, and their numbers will get worse over time.

Revisiting stable-kernel regressions (https://lwn.net/Articles/812416/)
Posted by arjan on Thu, 13 Feb 2020 17:35:48 +0000:

Another angle might be that backports to very recent kernels (so a small delta from the general code base) are less regression-prone than "much further back" backports, which take code tested in one code base into a very different codebase.
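The point-release comparison arjan suggests above could be prototyped roughly as follows. This is a sketch under the same assumptions as the earlier one (a linux-stable checkout, the v4.19.y branch as an example, simplified upstream-SHA extraction and abbreviated-SHA matching), and its output would still be subject to josh's caveat that newer point releases have had less time to accumulate fix references.

import re
import subprocess
from collections import Counter

def run(*args):
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

UPSTREAM_RE = re.compile(r"^commit ([0-9a-f]{40}) upstream"
                         r"|\[\s*[Uu]pstream commit ([0-9a-f]{40})", re.M)
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})", re.M | re.I)

# Point-release tags in numeric order: v4.19.1, v4.19.2, ... (assumed branch).
tags = sorted(run("tag", "-l", "v4.19.*").split(),
              key=lambda t: int(t.rsplit(".", 1)[1]))

release_of = {}   # upstream (or stable) SHA -> point release that added it
fix_refs = set()  # SHA prefixes named by Fixes: tags anywhere in the branch
for n, (prev, cur) in enumerate(zip(["v4.19"] + tags, tags), start=1):
    log = run("log", "--no-merges", "--format=%H%x00%B%x1e", f"{prev}..{cur}")
    for record in filter(str.strip, log.split("\x1e")):
        sha, _, body = record.lstrip("\n").partition("\x00")
        m = UPSTREAM_RE.search(body)
        release_of[next(g for g in m.groups() if g) if m else sha] = n
        fix_refs.update(f.lower() for f in FIXES_RE.findall(body))

# Tally apparently regression-introducing commits by the release they landed in.
per_release = Counter(rel for sha, rel in release_of.items()
                      if any(sha.startswith(f) for f in fix_refs))
for rel in sorted(per_release):
    print(f"v4.19.{rel}: {per_release[rel]} commits later needed a fix")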