
Identifying buggy patches with machine learning

By Jonathan Corbet
November 4, 2019

OSS EU
The stable kernel releases are meant to contain as many important fixes as possible; to that end, the stable maintainers have been making use of a machine-learning system to identify patches that should be considered for a stable update. This exercise has had some success but, at the 2019 Open Source Summit Europe, Sasha Levin asked whether this process could be improved further. Might it be possible for a machine-learning system to identify patches that create bugs and intercept them, so that the fixes never become necessary?

Any kernel patch that fixes a bug, Levin began, should include a tag marking it for the stable updates. Relying on that tag turns out to miss a lot of important fixes, though. About 3-4% of the mainline patch stream was being marked, but the number of patches that should be put into the stable releases is closer to 20% of the total. Rather than try to get developers to mark more patches, he developed his machine-learning system to identify fixes in the mainline patch stream automatically and queue them for manual review.

This system uses a number of heuristics, he said. If the changelog contains language like "fixes" or "causes a panic", it's likely to be an important fix. Shorter patches tend to be candidates. Another indicator is the addition of code like:

    if (x == NULL)
        return -ESOMETHING;
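
Heuristics like these are easy to combine into a rough score. As a purely illustrative sketch in Python (not Levin's actual tooling; the function, weights, and thresholds are made up), one might weigh fix-like wording in the changelog, an explicit stable tag, patch size, and newly added NULL checks:

    import re

    # Illustrative only: a crude scorer in the spirit of the heuristics above.
    FIX_WORDS = re.compile(r"\b(fix(es|ed)?|causes a panic|oops|use-after-free)\b", re.I)
    NULL_CHECK = re.compile(r"^\+\s*if\s*\(.*==\s*NULL\)", re.M)

    def fix_likelihood(message, diff):
        """Return a rough 0..1 estimate of 'this commit looks like a fix'."""
        score = 0.0
        if FIX_WORDS.search(message):
            score += 0.4        # changelog talks about fixing something
        if "Cc: stable@vger.kernel.org" in message:
            score += 0.4        # author already marked it for stable
        added = [l for l in diff.splitlines()
                 if l.startswith("+") and not l.startswith("+++")]
        if len(added) < 30:
            score += 0.1        # short patches are more often fixes
        if NULL_CHECK.search(diff):
            score += 0.1        # newly added error checks like the one above
        return min(score, 1.0)

A score like this only nominates candidates; as described above, the selected patches still go through manual review.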

In the end, it does turn out to be possible to automatically identify a number of fixes. But if that can be done, could it be possible to use a similar system to find bugs? That turns out to be a harder problem. Levin complained that nobody includes text like "this commit has a bug" or "this will crash your server" in their changelogs — a complaint that Andrew Morton beat him to by many years. Just looking at code constructs can only catch the simplest bugs, and there are already static-analysis systems using that technique. So he needed to look for something else.

That "something else" turns out to be review and testing — or the lack thereof. A lot can be learned by looking at the reviews that patches get. Are there a lot of details in the review? Is there an indication that the [Sasha Levin] reviewer actually tried the code? Does it go beyond typographic errors? Sentiment analysis can also be used to get a sense for how the reviewer felt about the patch in general.

Not all reviewers are equal, so this system needs to qualify each reviewer. Over time, it is possible to conclude how likely it is that a patch reviewed by a given developer contains a bug. The role of the reviewer also matters; if the reviewer is a maintainer of — or frequent contributor to — the relevant subsystem, their review should carry more weight.
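
The weighting of reviewers could be pictured along these lines (again a hedged sketch; the numbers and the maintainer bonus are invented for illustration):

    # Hypothetical reviewer weighting: a track record of how often patches
    # with this person's Reviewed-by later turned out to be buggy, plus a
    # bonus when they maintain the affected subsystem.
    def reviewer_weight(reviewed_total, reviewed_buggy, is_subsystem_maintainer):
        if reviewed_total == 0:
            return 0.5                      # unknown reviewer: neutral weight
        track_record = 1.0 - reviewed_buggy / reviewed_total
        bonus = 0.2 if is_subsystem_maintainer else 0.0
        return min(track_record + bonus, 1.0)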

A system can look at how long any given patch has been present in linux-next, how many iterations it has been through, and what the "quality" of the conversation around it was. Output from automated testing systems has a place, but only to an extent; KernelCI is a more reliable tester for ARM patches, but the 0day system is better for x86 patches. Manual testing tends to be a good indicator of patch quality; if a patch indicates that it has been tested on thousands of machines in somebody's data center for months, it is relatively unlikely to contain a bug.
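
Gathered together, those process-level signals might look roughly like the following (a sketch with made-up field names and weights, not the real system's schema):

    from dataclasses import dataclass

    @dataclass
    class ProcessSignals:
        days_in_linux_next: int     # how long the patch soaked in linux-next
        revision_count: int         # how many iterations (v1, v2, ...) it went through
        ci_results: dict            # e.g. {"kernelci": "pass", "0day": "warn"}
        manual_test_claim: bool     # "tested for months in our data center"

    def process_score(s, arch="arm"):
        # Per the talk, KernelCI results are more telling for ARM and 0day
        # for x86, so weight whichever tester fits the architecture.
        ci = s.ci_results.get("kernelci" if arch == "arm" else "0day")
        score = 0.4 * min(s.days_in_linux_next / 14, 1.0)
        score += 0.2 * min(s.revision_count / 5, 1.0)   # arbitrarily treated as positive
        score += 0.2 if ci == "pass" else 0.0
        score += 0.2 if s.manual_test_claim else 0.0
        return score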

Then, one can also try to look at code quality, but that is hard to quantify. Looking at the number of problems found in the original posting of a patch might offer some useful information. But Levin is unsure about how much can be achieved in this area.

Once the data of interest has been identified, it is necessary to create a training set for the system. That is made up of a big pile of patches, of course, along with a flag saying whether each contains a bug or not. The Fixes tags in patches can help here, but not all bugs really matter for the purposes of this system; spelling fixes or theoretical races are not the sort of problem he is looking for. In the end, he took a simple approach, training the system on patches that were later reverted or which have a Fixes tag pointing to them.
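
The labeling side of that training set can be approximated with git alone. The sketch below (assuming a kernel checkout and the usual Fixes: <sha> ("subject") trailer format, and ignoring reverts for brevity) collects the commits that later commits claim to fix:

    import re
    import subprocess

    FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})\b", re.I | re.M)

    def commits_with_known_bugs(rev_range="v5.3..v5.4"):
        """Return SHA prefixes of commits that a later Fixes: tag points at."""
        log = subprocess.run(
            ["git", "log", "--no-merges", "--format=%H%n%B%x00", rev_range],
            capture_output=True, text=True, check=True).stdout
        buggy = set()
        for entry in log.split("\x00"):
            # Real tooling would normalize abbreviated SHAs; a sketch keeps
            # them as written in the tag.
            buggy.update(FIXES_RE.findall(entry))
        return buggy

Commits in that set get the "buggy" label for training purposes; everything else is treated as clean.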

That led to some interesting information about where and when bugs are introduced. He had thought that bugs would generally be added during the merge window, then fixed in the later -rc releases, but that turned out to be wrong. On a lines-of-code basis, a patch merged for one of the late -rc releases is three times more likely to introduce a bug than a merge-window patch.
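
The "lines-of-code basis" comparison amounts to something like the following back-of-the-envelope calculation (illustrative only, not Levin's actual methodology), given labeled commits and a way to tell merge-window commits from late -rc ones:

    def bug_rate_per_line(commits):
        """commits: iterable of (lines_changed, introduced_bug, merged_late_rc)."""
        totals = {"merge-window": [0, 0], "late-rc": [0, 0]}   # [buggy lines, all lines]
        for lines, buggy, late_rc in commits:
            bucket = totals["late-rc" if late_rc else "merge-window"]
            if buggy:
                bucket[0] += lines
            bucket[1] += lines
        return {k: (b[0] / b[1] if b[1] else 0.0) for k, b in totals.items()}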

Patches queued for the merge window, it seems, are relatively well tested. Those added late in the cycle, instead, are there to fix some other problem and generally get much less testing — or none at all. Levin said that things shouldn't be this way. There is no reason to rush fixes late in the development cycle; nobody runs mainline kernels in production anyway, so it is better to give those patches more testing, then push them into the stable updates when they are really ready. Developers should, he said, trust the system more and engage in less "late-rc wild-west stuff".

Levin complained to Linus Torvalds about this dynamic; Torvalds agreed with the explanation but said that the system was designed that way. Late-cycle problems tend to be more complex, so the fixes will also be more complex and more likely to create a new bug. Levin agreed that this is the case, but disagreed with the conclusion; he thinks that the community should be more strict with late-cycle patches.

Back to the machine-learning system, he said that he is currently using it to flag patches that need more careful review; that has enabled him to find a number of bugs in fixes that were destined for stable updates. Parts of this system are also being used to qualify patches for the stable releases. The goal of detecting buggy patches in general still seems rather distant, though.

Levin concluded with some thoughts on improving the kernel development process. The late-rc problem needs to be addressed; we know there is a problem there, he said, so we should do something about it. Testing of kernel patches needs to be improved across the board; the testing capability we have now is rather limited. More testing needs to happen on actual hardware to be truly useful. He would also like to see some standardization in the policy for the acceptance of patches, including how long they should be in linux-next, the signoffs and reviews needed, etc. These policies currently vary widely from one subsystem to the next, and some maintainers seem to not care much at all. That, he said, is not ideal and needs to be improved.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the event.]

Index entries for this article
Kernel: Development model/Stable tree
Conference: Open Source Summit Europe/2019



Identifying buggy patches with machine learning

Posted Nov 4, 2019 19:19 UTC (Mon) by darwi (subscriber, #131202) [Link] (8 responses)

> nobody runs mainline kernels in production anyway.

Arch *does* push latest kernel.org releases to users.

Identifying buggy patches with machine learning

Posted Nov 4, 2019 19:27 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (1 responses)

Corollary: nobody runs Arch in production.

Identifying buggy patches with machine learning

Posted Nov 9, 2019 17:31 UTC (Sat) by gerdesj (subscriber, #5446) [Link]

I'll provide a counter example: me.

Identifying buggy patches with machine learning

Posted Nov 4, 2019 19:55 UTC (Mon) by sashal (✭ supporter ✭, #81842) [Link]

Sure, distros provide -rc (or even "git") kernels to users, but no one actually deploys them in production.

Users *can* deploy them in production, but I believe that if you have -rc kernels deployed at scale for reasons other than testing and validating the upcoming release, you're doing something wrong.

Identifying buggy patches with machine learning

Posted Nov 5, 2019 9:03 UTC (Tue) by zdzichu (subscriber, #17118) [Link] (3 responses)

The Fedora kernel is almost mainline (no big scary patches, mostly fixes cherry-picked for mainline), provided without a lag; at the moment of writing, 5.3.8 is ready for stable F31.
And Fedora is certainly intended for production.

Identifying buggy patches with machine learning

Posted Nov 5, 2019 12:22 UTC (Tue) by sashal (✭ supporter ✭, #81842) [Link]

5.3.8 is a stable kernel, which is exactly the point I was trying to make: people use stable kernels in production, not Linus's tree.

Identifying buggy patches with machine learning

Posted Nov 5, 2019 17:11 UTC (Tue) by zblaxell (subscriber, #26385) [Link] (1 responses)

$ git log --oneline v5.3..v5.3.8 | wc -l
1018

"mainline + 1018 patches + whatever cherries Fedora puts on top" is not, in any literal or practical sense, "mainline". It's not "production" either. At best, it's a late-stage CI artifact, an input to downstream integration and verification.

Identifying buggy patches with machine learning

Posted Nov 5, 2019 19:11 UTC (Tue) by zdzichu (subscriber, #17118) [Link]

I must have misunderstood. For me, everything coming from kernel.org is "mainline", but in this discussion the term seems to be used only for 5.x.0 releases, not even -stable releases.
As the opposite of "mainline" I see the so-called "distro" kernels with hundreds or thousands of patches and backports.

Identifying buggy patches with machine learning

Posted May 8, 2021 17:21 UTC (Sat) by Smon (guest, #104795) [Link]

Arch does not push mainline kernels.
They wait for x.x.1, and according to kernel.org, x.x.1 is stable (x.x.0 is mainline).

Identifying buggy patches with machine learning

Posted Nov 4, 2019 19:52 UTC (Mon) by error27 (subscriber, #8346) [Link] (2 responses)

One difference between early and late patches is that, if you send a fix to an early patch, it gets folded in and there is no record of it in the git log. I've argued before that reviewers who notice a real bug should get credit.

The other question, for the late patches, is whether the fixes come before or after the kernel release. If they come before, then that's fine and the system is working as designed. It's better to push those fixes to the Linus tree quite quickly so they get as much testing as possible before the release.

Identifying buggy patches with machine learning

Posted Nov 5, 2019 10:01 UTC (Tue) by mst@redhat.com (subscriber, #60682) [Link] (1 responses)

There's Reviewed-by; if it's a one-liner, then what matters is the review, not who coded it up, right?

Identifying buggy patches with machine learning

Posted Nov 5, 2019 10:40 UTC (Tue) by error27 (subscriber, #8346) [Link]

Reviewed-by means you reviewed the whole patch. It doesn't necessarily mean anything. If you see a Reviewed-by from me, that means it was a thank you to someone who redid their patch like I asked them to. Reviewed-by tags are not required for most of staging but I think people appreciate a little thank you note for their hard work.

Identifying buggy patches with machine learning

Posted Nov 6, 2019 16:46 UTC (Wed) by riking (guest, #95706) [Link] (2 responses)

> he said that he is currently using it to flag patches that need more careful review

IMO, this is the best use of machine learning systems: using them to sift through bunches of data and raise anomalies for human review. Systems without a human in the loop at the end are prone to silent and undetected bad biases.

Identifying buggy patches with machine learning

Posted Nov 6, 2019 19:52 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

This still rests on the assumption that the "machine-learning model" will at least find all cases which possibly need "more review".

Which made me recall a nice (online) newspaper story from some weeks ago: someone was using a "machine learning system" to help identify children who might end up being sexually exploited. It was fed with data on about 7000 people and produced an ordered list supposedly ranking them from most to least likely. Of those 7000 people, five actually ended up being sexually exploited. Three of the five were among the first 100 on the list but, considering the way this was worded, certainly not among the first 10 and very likely not even among the first 50. A fourth was among the first 200, i.e., probably somewhere between 150 and 200. No information about the classification of the fifth was mentioned in the article, presumably because it was so outrageously wrong that not even the developers could spin it as something positive anymore.

Put another way: the output of the computer program was completely wrong, and any correlations with observable reality are probably happenstance.

Identifying buggy patches with machine learning

Posted Nov 10, 2019 16:56 UTC (Sun) by jezuch (subscriber, #52988) [Link]

It's a known problem that most "machine learning" models are crap: https://thegradient.pub/nlps-clever-hans-moment-has-arrived/

Identifying buggy patches with machine learning

Posted Nov 7, 2019 1:52 UTC (Thu) by ajdlinux (subscriber, #82125) [Link]

Could we hook up autosel with the Patchwork API and see whether it's useful for identifying patches that are missing a Cc: stable tag before they are applied?

Identifying buggy patches with machine learning

Posted Nov 11, 2019 0:07 UTC (Mon) by mina86 (guest, #68442) [Link]

One possible solution to the late-rc patch problem is to prefer reverts over fixes during the -rc period.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds