
The role of LLMs in patch review

By Daroc Alden
March 31, 2026

Discussion of a memory-management patch set intended to clean up a helper function for handling huge pages spiraled into something else entirely after it was posted on March 19. Memory-management maintainer Andrew Morton proposed making changes to the subsystem's review process, to require patch authors to respond to feedback from Sashiko, the recently released LLM-based kernel patch review system. Other sub-maintainers, particularly Lorenzo Stoakes, objected. The resulting discussion about how and when to adopt Sashiko is potentially relevant to many other parts of the kernel.

Morton began by saying that the current way Sashiko integrates into the memory-management workflow isn't working. He merges patches to his tree, and "then half a day later a bunch of potential issues are identified." Morton stated that he was going to further increase the lag between seeing a patch set on the mailing list and merging it to his tree, to give Sashiko time to produce feedback and patch authors time to respond to it. He also wanted its reviews distributed to a wider audience — partly to better determine how useful its comments are, which he is "paying close attention to".

Stoakes said that he would look at the Sashiko reviews for the patch set, but asked Morton to hold off on incorporating it into the subsystem's workflow. He said that he appreciates the tooling, but that it is currently too noisy to use in that way. Stoakes referenced his message in the thread introducing Sashiko (that began on March 17) where he expressed the opinion that its false-positive rate was higher than his own experience using Chris Mason's kernel-review prompts. David Hildenbrand agreed that the false-positive rate was too high to be useful.

Roman Gushchin, Sashiko's creator, told Morton that he was actively working on integrating Sashiko with the kernel's email-based workflow, and that he hoped to have it sending reviews to appropriate recipients within the next week. Morton took the opportunity to ask about another problem with the tool — that many patch sets sent to the mailing list fail to apply in Sashiko's environment. In a follow-up message, he expressed his intention not to apply patches to his tree that the system could not. Gushchin explained that Sashiko tries to apply patch sets to several bases, in order. For memory-management patches, it uses the patch set's base commit (if specified), then the mm-new tree, followed by mm-unstable, mm-stable, linux-next, and finally Linus Torvalds's tree. The review system evaluates the code in the first tree where the patches apply successfully. He didn't address why mailing-list patches would fail to apply to any of these trees, however.
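
A minimal sketch of that fallback order, assuming a local checkout with all of those trees available and using git am; this illustrates the behavior Gushchin described, not Sashiko's actual implementation:

    import subprocess

    # Trees in the order Gushchin listed them; "torvalds/master" is an
    # assumed remote-tracking name for Linus's tree.
    FALLBACK_TREES = ["mm-new", "mm-unstable", "mm-stable",
                      "linux-next", "torvalds/master"]

    def find_applicable_base(mbox_path, repo_dir, declared_base=None):
        """Return the first base the series applies to cleanly, or None."""
        candidates = ([declared_base] if declared_base else []) + FALLBACK_TREES
        for base in candidates:
            checkout = subprocess.run(
                ["git", "-C", repo_dir, "checkout", "--detach", base],
                capture_output=True)
            if checkout.returncode != 0:
                continue        # tree not available locally; try the next one
            applied = subprocess.run(
                ["git", "-C", repo_dir, "am", mbox_path],
                capture_output=True)
            if applied.returncode == 0:
                return base     # review the code as applied to this tree
            # clean up the failed application before trying the next base
            subprocess.run(["git", "-C", repo_dir, "am", "--abort"],
                           capture_output=True)
        return None             # the failure mode Morton asked about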

Stoakes asked Morton to hold off on integrating Sashiko so deeply into his workflow:

Andrew, for crying out loud. Please don't do this.

2 of the 3 series I respun on Friday, working a 13 hour day to do so, don't apply to Sashiko, but do apply to the mm tree.

I haven't the _faintest clue_ how we are supposed to factor a 3rd party experimental website applying or not applying series into our work??

He went on to say that he was not attempting to disrespect Gushchin or his efforts, but that even Gushchin had agreed that the tool was not ready to become a blocking component of the development process. Gushchin replied to say that working on Sashiko had increasingly shown him the subjectivity of reviews, and the importance of social context in providing good reviews. He acknowledged that it wasn't "perfect in any way" but suggested that some level of false positives (for example, 20%) was acceptable from a tool that catches the majority of bugs before they're merged. He suggested that this might be a reasonable lens through which to view Sashiko's current performance and future development.

Stoakes replied to clarify that he was objecting to Morton's unilateral demand that "every single point Sashiko raises is responded to". He was emphatically not blaming Gushchin for failures of the memory-management subsystem's review model, and thought the tool was promising. He reiterated his perception that the tool's false-positive rate was much higher than other people were claiming, and that — given its inexhaustible ability to produce new reviews that require human attention — it was important to think critically about what role it can play in the review process. Incorporating the tool in its present state, as anything other than purely advisory, would increase the workload on the already overworked memory-management maintainers, he said.

This sentiment resonated with Pedro Falcato, who agreed that Sashiko should remain optional for the time being. Morton disagreed:

Rule #1 is, surely, "don't add bugs". This thing finds bugs. If its hit rate is 50% then that's plenty high enough to justify people spending time to go through and check its output.

[...]

Look, I know people are busy. If checking these reports slows us down and we end up merging less code and less buggy code then that's a good tradeoff.

Avoiding bugs is important, Falcato agreed, but: "I simply don't think either contributors or maintainers will be particularly less stressed with the introduction of obligatory AI reviews." He suggested that simply codifying the memory-management review process (as the netdev review process has been) would be more helpful than mandating the use of Sashiko (a suggestion that Mike Rapoport later supported). Falcato also pointed out that Sashiko is experimental, untested software, and it should probably not be made critical to the process yet on those grounds.

Morton responded by looking at actual replies to Sashiko's reviews on the linux-mm mailing list. Out of about 35 emails, 22 received replies indicating that alterations were definitely needed, with the rest being more ambiguous, being false reports, or not being responded to. He expressed the opinion that such a hit rate of finding actual problems in patches was worth the pain of figuring out how to integrate Sashiko into the process. "That's a really high hit rate! How can we possibly not use this, if we care about Rule #1?"

Stoakes disagreed with Morton's interpretation of the data, pointing out that those 22 emails indicate cases where the tool was correct in at least one individual observation. Since it normally sends multiple suggestions and questions per review, the actual rate of false positives for individual comments must be substantially worse than that.
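
Some rough arithmetic makes the disagreement concrete. The 22-of-35 figure is Morton's; the number of comments per review below is purely an assumption for illustration, since the actual per-comment rate was not established in the thread:

    # Morton's survey: 22 of roughly 35 Sashiko reviews drew replies
    # saying that changes were definitely needed.
    reviews = 35
    reviews_with_a_real_hit = 22
    print(f"per-review hit rate: {reviews_with_a_real_hit / reviews:.0%}")  # 63%

    # Stoakes's point: each review carries several comments. If each
    # review held, say, five comments (an assumed figure) and each "hit"
    # review contained just one valid comment, the per-comment picture
    # looks very different:
    comments_per_review = 5                       # assumption
    total_comments = reviews * comments_per_review
    valid_lower_bound = reviews_with_a_real_hit   # at least one per hit review
    print(f"per-comment hit rate, lower bound: "
          f"{valid_lower_bound / total_comments:.0%}")  # 13%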

Stoakes again reiterated that he found Sashiko useful, and was using it in his own reviews to some degree. The problem was in making it a mandatory part of the process. He suggested that Morton should delegate the decision of whether and how to use Sashiko to the sub-maintainers, avoid requiring that its automation cleanly apply a patch before accepting it, avoid requiring that every element of its reviews be responded to, and trust sub-maintainers to discard any parts of the reviews that are not both valid and important.

Sashiko is, at the time of writing, not even a month old. According to its statistics page, it has written over 10,000 reviews in that time, with an average of approximately 3,500 words of output (counting quoted source code) per patch. It is, quite literally, producing words faster than any individual person could reasonably read. But approximately half of those words, by several different measures, are about bugs that no human reviewer spotted ahead of time — whether because of the difficulty of reviewing complex kernel code, or simply a lack of time to dedicate to it. And several kernel contributors are finding the reviews useful.
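
A back-of-the-envelope calculation, using the figures quoted above plus an assumed reading speed, shows the scale involved (a sketch, not a statistic from Sashiko's page):

    # Figures quoted above from Sashiko's statistics page; the reading
    # speed is an assumed typical pace for technical text.
    reviews = 10_000
    words_per_review = 3_500                  # average, including quoted code
    total_words = reviews * words_per_review  # 35 million words in under a month

    reading_speed_wpm = 250                   # assumption
    minutes = total_words / reading_speed_wpm
    print(f"{total_words:,} words is about {minutes / 60 / 24:.0f} days "
          f"of nonstop reading")              # roughly 97 days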

As is unfortunately often the case, the problem posed by the use of Sashiko is a social one, not a technical one: how much extra reading, hallucination-chasing, review delay, robotic gatekeeping, and reliance on proprietary models is acceptable in order to make sure the kernel accepts less buggy code? It's a question that will eventually touch every subsystem of the kernel and beyond, not just the memory-management code, and one that undoubtedly deserves a lot of discussion. There are no easy answers, but hopefully the kernel community will eventually be able to reach a consensus.


Index entries for this article
Kernel: Development tools/Large language models



Sashiko

Posted Mar 31, 2026 20:50 UTC (Tue) by Paf (subscriber, #91811) [Link]

It poses interesting problems, but as someone outside the kernel project with a related codebase - the Lustre parallel file system - I'm very excited to have it to play with. I've also been making a few contributions to Sashiko to let it work with our workflows.

Fundamentally: I am very, very gratified to have this tool, and for the work going into it. Wiring up an LLM review harness is not especially difficult. Building up a multi-step pipeline and doing the validation to find real bugs? That's of enormous value, can't be done with a quick wave of the LLM coding wand, and I'm happy to be able to use it for our project.

Everyone insane or what?

Posted Apr 1, 2026 4:13 UTC (Wed) by mirabilos (subscriber, #84359) [Link] (8 responses)

Yes, of course a false positive rate of 20% is okay if it finds some bugs; yes of course make people respond to the slop thing…

smh

Everyone insane or what?

Posted Apr 1, 2026 15:14 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (6 responses)

Based on exchanges I've had with other developers, most found the false-positive rate to be acceptable. The effort to look at it is not far from the effort to attend to a maintainer's review (which, it's worth remembering, *does* have "false positives" - we just call them questions instead).

On a recent patch series of mine it found six things worth changing, including a few small bugs and places where it was worth adding better comments, which will likely be helpful for humans as well.

So: it's not yet ready, if it ever will be, for maintainers to look at reports; but it's certainly worth looking at for authors of complex changes, to get a second opinion before the proverbially overworked maintainers get to it.

Everyone insane or what?

Posted Apr 1, 2026 20:17 UTC (Wed) by pm215 (subscriber, #98099) [Link] (5 responses)

Yes, human reviewers can have false positives -- but resolving those has the benefit of passing knowledge to that human reviewer. False positives from an automated tool help nobody, they're pure loss.

The original Coverity authors had a paper decades back noting the importance of a low false positive rate -- if you have too many false positives then users will decide your tool isn't worth paying attention to, and are likely also to ignore any genuine problems it flags up.

Everyone insane or what?

Posted Apr 1, 2026 20:27 UTC (Wed) by mb (subscriber, #50428) [Link] (1 responses)

>The original Coverity authors had a paper decades back noting the importance of a low false positive rate

What was the number?

Everyone insane or what?

Posted Apr 1, 2026 22:08 UTC (Wed) by pm215 (subscriber, #98099) [Link]

Digging out the ACM article I had in mind: https://www.cs.columbia.edu/~junfeng/18sp-e6121/papers/co... the part about false positives is on the last page. They say:

* above 30% is definitely bad
* they aimed for below 20%
* when forced to choose between more bugs and fewer false positives, choose the latter
* the initial reports are really important -- if the first few are bad then the response is "this tool sucks" and people reject it
* "you never want an embarrassing false positive. A stupid false positive implies the tool is stupid"

(My personal experience of Coverity today is that its false positive rate is way higher than I would like.)

Everyone insane or what?

Posted Apr 2, 2026 6:36 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (1 responses)

Absolutely; however, the Coverity paper is about a different kind of issue and report. A tool that looks at a higher level, is able to look up related code, and understands the names of variables can (at least for me) afford a higher rate.

That said, I have used Coverity a lot more than Sashiko, so I admit my picture might be excessively rosy.

> False positives from an automated tool help nobody, they're pure loss.

Not entirely - it can suggest that a comment is necessary, for example. See the second report for patch 10 at https://sashiko.dev/#/patchset/20260326181723.218115-1-pb..., which is correct but impossible *now*.

Everyone insane or what?

Posted Apr 2, 2026 8:19 UTC (Thu) by pm215 (subscriber, #98099) [Link]

For me, the false-positive problem holds regardless of the tool and what level of analysis it performs, because the cost is the same -- I have to go through the bogus reports, figure out what each one is suggesting, determine that it's wrong, and dismiss it. I might hope that a tool capable of higher-level analysis has a lower false-positive rate (often Coverity f.p. reports are a result of an inability to see the higher level), but if it doesn't in practice have a low f.p. rate then it's just as bad and timewasting as any other.

Everyone insane or what?

Posted Apr 2, 2026 8:19 UTC (Thu) by khim (subscriber, #9252) [Link]

> False positives from an automated tool help nobody, they're pure loss.

That's not true with LLMs, surprisingly enough — and precisely for the same reason that's usually perceived as LLMs' weakness.

LLMs can't “think”, but they are the world's best generators of bullshit (the term is from the 1986 paper).

That makes their rate of “true false positives” surprisingly low. If you define “false positive” not as “there are no issues with the code but the LLM says there are” but as “there are no issues with the code and it's obvious why there are no issues with the code”, then the false-positive rate drops to almost zero, because most places where the LLM finds “something fishy” (when, in fact, everything is fine) are quite tricky and deserve at least a comment, if not a change to the code.

Everyone insane or what?

Posted Apr 1, 2026 19:44 UTC (Wed) by tchernobog (subscriber, #73595) [Link]

Even compilers have false positives; quite a few, in fact.

Here I would say this is just a static analysis tool.

As long as a human is in the driving seat and can override the false positives, I think this is a good use for LLMs.

False positive identification rate

Posted Apr 1, 2026 14:45 UTC (Wed) by iabervon (subscriber, #722) [Link] (4 responses)

22 of the 35 messages got responses indicating that changes were needed, but that doesn't necessarily mean they found bugs. It's possible that the person who wrote the patch tried making the change and found that the code wasn't actually wrong in the first place, or made a real bug harder to see, or made a change to correct code and introduced a bug. However good LLMs are at being correct, they're better at being convincing, so it's particularly important to get a demonstration that the problems they talk about are real.

False positive identification rate

Posted Apr 1, 2026 18:00 UTC (Wed) by Paf (subscriber, #91811) [Link] (3 responses)

I suggest reading the previous articles on Sashiko or checking out its pages - it was extensively validated against patches which later received a Fixes: tag. It caught about 50% of the bugs in that testing - bugs in patches that were merged and then later fixed; that is, bugs that were not caught in human review, and were important enough to later receive a fix.

From my perspective as a user, Google's most valuable contribution here is not the harness, it is the results-validated process. I could most likely vibe-code up a (less good, but workable) harness in a few days. It is the detailed sequence of multi-stage review with specific prompts, and the concrete testing against real known bugs, that is the most valuable part.

False positive identification rate

Posted Apr 1, 2026 19:17 UTC (Wed) by iabervon (subscriber, #722) [Link] (2 responses)

That's one direction of validation, but Sashiko obviously produces many more comments than just those corresponding to the later-fixed patches. If it catches half of the bugs in the code it reviews, but induces developers to introduce twice as many bugs in response to its other comments, it's not making the code better. The general assumption is that if a reviewer says something about a real bug, the developer will fix it, and if the reviewer says something that's not about a real bug, the developer will do nothing; that means false positives are a cost to the development process, but a safe one. However, developers presumably don't react correctly to all comments, so we should be worried about overly convincing false positives as well as true positives buried in too much noise.

False positive identification rate

Posted Apr 2, 2026 0:14 UTC (Thu) by Paf (subscriber, #91811) [Link]

That’s fair in the abstract, but seems overblown. In my usage I have yet to see it say “this is a bug and must be fixed” and be wrong. There’s a certain amount of “I am not sure about this?”, and those are lower quality, though often linked to questionable readability in the source code. I’m sure it does make the first category of error, but it seems to be rare.

Of note, Morton thinks this tool is so good it should be made mandatory. He has some significant experience of code review processes.

False positive identification rate

Posted Apr 17, 2026 7:03 UTC (Fri) by daenzer (subscriber, #7050) [Link]

Is there any information about how leakage was prevented when measuring that ~50% rate, i.e. how it was ensured the LLM didn't have knowledge of the corresponding fixes?

Per https://www.normaltech.ai/p/scientists-should-use-ai-as-a... leakage causes the effectiveness of machine-learning models to be overestimated in many scientific papers.

Neat

Posted Apr 2, 2026 23:34 UTC (Thu) by calvinowens (subscriber, #100757) [Link] (1 responses)

My first experience with it was positive: it correctly identified an obvious nesting case I should have caught. It proposed some confusing solutions, which I allowed to confuse me, but that's my fault. I've only just started using these tools for review; it has taken some practice, but I feel like I'm quickly getting better at it.

To put a positive spin on it, forcing myself to reject the false positives has frequently been a productive learning exercise for me. But I can certainly understand why somebody under time pressure to finish something could see that as a strict negative.

Neat

Posted Apr 4, 2026 6:51 UTC (Sat) by tsoni.lwn (subscriber, #139617) [Link]

Qualcomm released its PatchWise tool well before Sashiko; it provides a facility to plug in any model and, beyond that, lets you run many static checkers too. It just didn't have a free review service like Sashiko's. I prefer running Ollama with it instead. It should be easy to write a dashboard and run it with a free model?

https://github.com/qualcomm/PatchWise


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds