The role of LLMs in patch review
Discussion of a memory-management patch set intended to clean up a helper function for handling huge pages spiraled into something else entirely after it was posted on March 19. Memory-management maintainer Andrew Morton proposed making changes to the subsystem's review process, to require patch authors to respond to feedback from Sashiko, the recently released LLM-based kernel patch review system. Other sub-maintainers, particularly Lorenzo Stoakes, objected. The resulting discussion about how and when to adopt Sashiko is potentially relevant to many other parts of the kernel.
Morton began by saying that the current way Sashiko integrates into the
memory-management workflow isn't working. He merges patches to his tree, and
"then half a day later a bunch of potential issues are identified.
"
Morton stated that he was going to further increase the lag between seeing a
patch set on the mailing list and merging it to his tree, to give Sashiko
time to produce feedback and patch authors
time to respond to it. He also wanted its reviews
distributed to a wider audience — partly to better determine how useful its
comments are, which he is "paying close attention to
".
Stoakes said that he would look at the Sashiko reviews for the patch set, but asked Morton to hold off on incorporating it into the subsystem's workflow. He said that he appreciates the tooling, but that it is currently too noisy to use in that way. Stoakes referenced his message in the thread introducing Sashiko (that began on March 17) where he expressed the opinion that its false-positive rate was higher than his own experience using Chris Mason's kernel-review prompts. David Hildenbrand agreed that the false-positive rate was too high to be useful.
Roman Gushchin, Sashiko's creator, told Morton that he was actively working on integrating Sashiko with the kernel's email-based workflow, and that he hoped to have it sending reviews to appropriate recipients within the next week. Morton took the opportunity to ask about another problem with the tool — that many patch sets sent to the mailing list fail to apply in Sashiko's environment. In a follow-up message he expressed his intention not to apply patches to his tree that the system couldn't. Gushchin explained that Sashiko tries to apply patch sets to several bases, in order. For memory-management patches, it uses the patch set's base commit (if specified), then the mm-new tree, followed by mm-unstable, mm-stable, linux-next, and finally Linus Torvalds's tree. The review system evaluates the code in the first tree where the application attempt is successful. He didn't address why mailing-list patches would be failing to apply to these trees, however.
Stoakes asked Morton to hold off on integrating Sashiko so deeply into his workflow:
Andrew, for crying out loud. Please don't do this.
2 of the 3 series I respun on Friday, working a 13 hour day to do so, don't apply to Sashiko, but do apply to the mm tree.
I haven't the _faintest clue_ how we are supposed to factor a 3rd party experimental website applying or not applying series into our work??
He went on to say that he was not attempting to disrespect Gushchin or his
efforts, but that even Gushchin had agreed that the tool was not ready to become a
blocking component of the development process. Gushchin
replied to say
that working on Sashiko had increasingly shown him the subjectivity of reviews,
and the importance of social context in providing good reviews. He acknowledged
that it wasn't "perfect in any way
" but suggested that some level of
false positives (for example, 20%) was acceptable from a tool that catches the
majority of bugs before they're merged. He suggested that this might be a
reasonable lens through which to view Sashiko's current performance and future
development.
Stoakes
replied to clarify that he was objecting to Morton's unilateral demand
that "every single point Sashiko raises is responded to
". He was
emphatically not blaming Gushchin for failures of the memory-management
subsystem's review model, and thought the tool was promising. He reiterated his
perception that the tool's false-positive rate was much higher than other people
were claiming, and that — given its inexhaustible ability to produce new
reviews that require human attention — it was important to think critically
about what role it can play in the review process. Incorporating the tool in its
present state, as anything other than a simple advisory, would increase the
workload on the already overworked memory-management maintainers, he said.
This sentiment resonated with Pedro Falcato, who agreed that Sashiko should remain optional for the time being. Morton disagreed:
Rule #1 is, surely, "don't add bugs". This thing finds bugs. If its hit rate is 50% then that's plenty high enough to justify people spending time to go through and check its output.
[...]
Look, I know people are busy. If checking these reports slows us down and we end up merging less code and less buggy code then that's a good tradeoff.
Avoiding bugs is important, Falcato
agreed, but: "I simply don't think either
contributors or maintainers will be particularly less stressed with the
introduction of obligatory AI reviews.
" He suggested that simply codifying
the memory-management review process (as the netdev review
process has been) would be more helpful than mandating the use of Sashiko
(a suggestion that Mike Rapoport later
supported).
Falcato also pointed out that Sashiko is experimental, untested software, and it
should probably not be made critical to the process yet on those grounds.
Morton
responded by looking at actual replies to Sashiko's reviews on the linux-mm
mailing list. Out of about 35 emails, 22 received replies indicating that
alterations were definitely needed, with the rest being more ambiguous, being
false reports, or not being responded to. He expressed the opinion that such a
hit rate of finding actual problems in patches was worth the pain of figuring
out how to integrate Sashiko into the process. "That's a really high hit
rate! How can we possibly not use this, if we care about Rule #1?
"
Stoakes disagreed with Morton's interpretation of the data, pointing out that those 22 emails indicate cases where the tool was correct in at least one individual observation. Since it normally sends multiple suggestions and questions per review, the actual rate of false positives for individual comments must be substantially worse than that.
Stoakes again reiterated that he found Sashiko useful, and was using it in his own reviews to some degree. The problem was in making it a mandatory part of the process. He suggested that Morton should delegate the decision of whether and how to use Sashiko to the sub-maintainers, avoid requiring that its automation cleanly applies a patch before accepting it, avoid requiring that every element of its reviews be responded to, and trust sub-maintainers to discard any parts of the reviews that are not both valid and important.
Sashiko is, at the time of writing, not even a month old. According to its statistics page, it has written over 10,000 reviews in that time, with an average of approximately 3,500 words of output (counting quoted source code) per patch. It is, quite literally, producing words faster than any individual person could reasonably read. But approximately half of those words, according to various different statistics, are about bugs that no human reviewer spotted ahead of time — either due to the difficulty of reviewing complex kernel code, or just due to a lack of infinite time to dedicate to the prospect. And several kernel contributors are finding the reviews useful.
As is the case unfortunately often, the problem posed by the use of Sashiko is a social one, not a technical one: how much extra reading, hallucinated problems, delays in the review process, robotic gatekeeping, and reliance on proprietary models is acceptable in order to make sure the kernel accepts less buggy code? It's a question that will eventually touch every subsystem of the kernel and beyond, not just the memory-management code, and one that undoubtedly deserves a lot of discussion. There are no easy answers, but hopefully the kernel community will eventually be able to reach a consensus.
| Index entries for this article | |
|---|---|
| Kernel | Development tools/Large language models |
