Identifying buggy patches with machine learning
Any kernel patch that fixes a bug, Levin began, should include a tag marking it for the stable updates. Relying on that tag turns out to miss a lot of important fixes, though. About 3-4% of the mainline patch stream was being marked, but the number of patches that should be put into the stable releases is closer to 20% of the total. Rather than try to get developers to mark more patches, he developed his machine-learning system to identify fixes in the mainline patch stream automatically and queue them for manual review.
This system uses a number of heuristics, he said. If the changelog contains language like "fixes" or "causes a panic", it's likely to be an important fix. Shorter patches tend to be candidates. Another indicator is the addition of code like:
    if (x == NULL)
        return -ESOMETHING;
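A minimal sketch of that kind of heuristic might look like the following; the keyword list, regular expression, weights, and threshold are illustrative guesses rather than Levin's actual code:

    import re

    FIX_KEYWORDS = ("fixes", "causes a panic", "oops", "use-after-free")
    NULL_CHECK = re.compile(r'\+\s*if\s*\(\s*\w+\s*==\s*NULL\s*\)')

    def looks_like_fix(changelog, diff):
        """Crude score: higher means more likely to be a stable candidate."""
        score = sum(2 for kw in FIX_KEYWORDS if kw in changelog.lower())
        if diff.count("\n") < 50:       # short patches tend to be targeted fixes
            score += 1
        if NULL_CHECK.search(diff):     # an added NULL check is a classic fix pattern
            score += 2
        return score >= 3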
In the end, it does turn out to be possible to automatically identify a number of fixes. But if that can be done, could it be possible to use a similar system to find bugs? That turns out to be a harder problem. Levin complained that nobody includes text like "this commit has a bug" or "this will crash your server" in their changelogs — a complaint that Andrew Morton beat him to by many years. Just looking at code constructs can only catch the simplest bugs, and there are already static-analysis systems using that technique. So he needed to look for something else.
That "something else" turns out to be review and testing — or the lack
thereof. A lot can be learned by looking at the reviews that patches get.
Are there a lot of details in the review? Is there an indication that the
reviewer actually tried the code? Does it go beyond typographic errors?
Sentiment analysis can also be used to get a sense for how the reviewer
felt about the patch in general.
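A rough idea of how such review signals could be extracted, with a trivial word-count stand-in for real sentiment analysis (the function name, word lists, and thresholds below are made up for illustration):

    POSITIVE = {"great", "nice", "clean", "correct", "works"}
    NEGATIVE = {"wrong", "broken", "confusing", "nak", "crash"}

    def review_features(review_text):
        """Turn one review email into a handful of crude signals."""
        words = review_text.lower().split()
        return {
            "length": len(words),          # detailed reviews tend to be longer
            "tried_it": "tested" in words or "reproduced" in words,
            "typo_only": "typo" in words and len(words) < 30,
            # Stand-in for sentiment analysis: positive minus negative word counts.
            "sentiment": sum(w in POSITIVE for w in words)
                         - sum(w in NEGATIVE for w in words),
        }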
Not all reviewers are equal, so this system needs to qualify each reviewer. Over time, it is possible to conclude how likely it is that a patch reviewed by a given developer contains a bug. The role of the reviewer also matters; if the reviewer is a maintainer of — or frequent contributor to — the relevant subsystem, their review should carry more weight.
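One plausible way to qualify reviewers, sketched here with invented names, is to track the historical bug rate of the patches each person has reviewed, then give extra weight to maintainers of the affected subsystem:

    from collections import defaultdict

    class ReviewerModel:
        """Per-reviewer record of how often reviewed patches later proved buggy."""

        def __init__(self):
            self.reviewed = defaultdict(int)
            self.buggy = defaultdict(int)

        def record(self, reviewer, was_buggy):
            self.reviewed[reviewer] += 1
            self.buggy[reviewer] += int(was_buggy)

        def weight(self, reviewer, is_subsystem_maintainer=False):
            # Laplace-smoothed estimate of "a review by this person still let a bug through".
            p_bug = (self.buggy[reviewer] + 1) / (self.reviewed[reviewer] + 2)
            weight = 1.0 - p_bug
            # A maintainer's (or frequent contributor's) review carries more weight.
            return weight * 2.0 if is_subsystem_maintainer else weight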
A system can look at how long any given patch has been present in linux-next, how many iterations it has been through, and what the "quality" of the conversation around it was. Output from automated testing systems has a place, but only to an extent; KernelCI is a more reliable tester for ARM patches, but the 0day system is better for x86 patches. Manual testing tends to be a good indicator of patch quality; if a patch indicates that it has been tested on thousands of machines in somebody's data center for months, it is relatively unlikely to contain a bug.
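Those signals could plausibly be collected into a per-patch feature vector along these lines; the field names are illustrative, not taken from Levin's system:

    from dataclasses import dataclass

    @dataclass
    class PatchFeatures:
        days_in_linux_next: int    # how long the patch soaked in linux-next
        review_iterations: int     # how many versions were posted before merging
        discussion_quality: float  # e.g. averaged sentiment of the review thread
        ci_result: float           # KernelCI weighted higher for arm, 0day for x86
        manual_test_scale: int     # rough scale of reported manual testing

    def as_vector(f: PatchFeatures):
        """Flatten into the numeric vector a classifier would consume."""
        return [f.days_in_linux_next, f.review_iterations,
                f.discussion_quality, f.ci_result, f.manual_test_scale]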
Then, one can also try to look at code quality, but that is hard to quantify. Looking at the number of problems found in the original posting of a patch might offer some useful information. But Levin is unsure about how much can be achieved in this area.
Once the data of interest has been identified, it is necessary to create a training set for the system. That is made up of a big pile of patches, of course, along with a flag saying whether each contains a bug or not. The Fixes tags in patches can help here, but not all bugs really matter for the purposes of this system; spelling fixes or theoretical races are not the sort of problem he is looking for. In the end, he took a simple approach, training the system on patches that were later reverted or which have a Fixes tag pointing to them.
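That labeling scheme can be approximated straight from git history. The sketch below is simplified and omits the filtering of trivial fixes that Levin described; it marks a commit as buggy if a later commit reverts it or carries a Fixes: tag naming it:

    import re
    import subprocess

    def buggy_commits(repo, rev_range):
        """Commits later reverted, or named in a Fixes: tag, within rev_range."""
        log = subprocess.run(
            ["git", "-C", repo, "log", "--no-merges", "--format=%H%n%B%x00", rev_range],
            capture_output=True, text=True, check=True).stdout
        buggy = set()
        for entry in log.split("\x00"):
            body = entry.strip()
            if not body:
                continue
            # "Fixes: 123456789abc (...)": the named commit introduced a bug.
            # (Abbreviated SHAs would need git rev-parse to expand in a real system.)
            for sha in re.findall(r"^Fixes:\s*([0-9a-f]{12,40})", body,
                                  re.IGNORECASE | re.MULTILINE):
                buggy.add(sha)
            # "This reverts commit <sha>.": the reverted commit counts as buggy too.
            for sha in re.findall(r"This reverts commit ([0-9a-f]{40})", body):
                buggy.add(sha)
        return buggy

Running it as, say, buggy_commits("/path/to/linux", "v5.2..v5.3") would, under those assumptions, yield the positive examples for the training set.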
That led to some interesting information about where and when bugs are introduced. He had thought that bugs would generally be added during the merge window, then fixed in the later -rc releases, but that turned out to be wrong. On a lines-of-code basis, a patch merged for one of the late -rc releases is three times more likely to introduce a bug than a merge-window patch.
Patches queued for the merge window, it seems, are relatively well tested. Those added late in the cycle, instead, are there to fix some other problem and generally get much less testing — or none at all. Levin said that things shouldn't be this way. There is no reason to rush fixes in late in the development cycle; nobody runs mainline kernels in production anyway, so it is better to give those patches more testing, then push them into the stable updates when they are really ready. Developers should, he said, trust the system more and engage in less "late-rc wild-west stuff".
Levin complained to Linus Torvalds about this dynamic; Torvalds agreed with the explanation but said that the system was designed that way. Late-cycle problems tend to be more complex, so the fixes will also be more complex and more likely to create a new bug. Levin agreed that this is the case, but disagreed with the conclusion; he thinks that the community should be more strict with late-cycle patches.
Back to the machine-learning system, he said that he is currently using it to flag patches that need more careful review; that has enabled him to find a number of bugs in fixes that were destined for stable updates. Parts of this system are also being used to qualify patches for the stable releases. The goal of detecting buggy patches in general still seems rather distant, though.
Levin concluded with some thoughts on improving the kernel development process. The late-rc problem needs to be addressed; we know there is a problem there, he said, so we should do something about it. Testing of kernel patches needs to be improved across the board; the testing capability we have now is rather limited. More testing needs to happen on actual hardware to be truly useful. He would also like to see some standardization in the policy for the acceptance of patches, including how long they should be in linux-next, the signoffs and reviews needed, etc. These policies currently vary widely from one subsystem to the next, and some maintainers seem to not care much at all. That, he said, is not ideal and needs to be improved.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the event.]