Revisiting stable-kernel regressions

Posted Feb 14, 2020 22:17 UTC (Fri) by tytso (subscriber, #9993)
In reply to: Revisiting stable-kernel regressions by sashal
Parent article: Revisiting stable-kernel regressions

Yes, I understand that Linus doesn't have a problem with letting things drop into stable right away. Then again, Linus may not be using the stable kernel series, or at least not in the same way as, say, Google's Container-Optimized OS (COS), which is trying to be upstream-first and based on the Stable kernel. There *have* been customer-visible regressions where some stable kernel X.Y.Z caused more problems than if COS had stayed on X.Y.Z-1. I've told them that this means they need to do more testing, and not trust that X.Y.Z+1 will have fewer bugs that they care about than X.Y.Z --- because that's simply not true, and I'm not sure there's anything that can be done to reduce the bug introduction rate to zero.

But if there is something we can do to decrease the bug introduction rate, that would certainly be a good thing. And that's why I'm suggesting that if we can use ML to figure out which commits contain bug fixes, maybe there is a way that we can use a training set of commits that landed in the stable kernels *and* which apparently had regressions, and see if we can find some features that tell us that those commits should get more careful screening. Whether that's "wait longer", or creating a list of commits that can be sent around for humans to take a closer look, I don't have any concrete proposals, because I'm not sure what the best thing we could do with that information is. But I think it's worth some consideration and reflection to see if there's something we can do to further leverage ML; not just to select commits, but to flag commits for special care/handling/testing/review.

Finally, Sasha, please don't take this as a criticism of the job you are currently doing. Bugs and regressions in Linus's tree are inevitable; that's why we have thousands of commits flowing into the stable kernels. But this also means that bugs caused by bug fixes are inevitable, and so the question is whether there is something we can do to improve the process to deal with the fact that we are all human. Trying to improve any kind of development or ops process is best done in a blame-free manner.

Revisiting stable-kernel regressions

Posted Feb 15, 2020 1:29 UTC (Sat) by sashal (✭ supporter ✭, #81842)

Not trying to blame anyone, just pointing out that I've already done the research you've suggested looking into, but I couldn't convert it into any practical result. I'd be happy to discuss it further if you have input as to how to improve the process.

There is more information about the work here: https://lwn.net/Articles/753329/ .