
False positive identification rate

Posted Apr 1, 2026 14:45 UTC (Wed) by iabervon (subscriber, #722)
Parent article: The role of LLMs in patch review

22 of the 35 messages got responses indicating that changes were needed, but that doesn't necessarily mean the review found bugs. It's possible that the person who wrote the patch tried making the change and found that the code wasn't actually wrong in the first place, or that the change made a real bug harder to see, or that it altered correct code and introduced a bug. However good LLMs are at being correct, they're better at being convincing, so it's particularly important to get a demonstration that the problems they talk about are real.



False positive identification rate

Posted Apr 1, 2026 18:00 UTC (Wed) by Paf (subscriber, #91811)

I suggest reading the previous articles on Sashiko or checking out their pages on it. It was extensively validated by checking patches which later received a Fixes: label, and it caught about 50% of the bugs in that testing. Those are bugs in patches that were merged and then later fixed: bugs that were not caught in human review, and that were important enough to later receive a fix.

From my perspective as a user, Google's most valuable contribution here is not the harness, it is the results-validated process. I could vibe code up a (less good, but workable) harness in a few days, most likely. It is the specific detailed sequence of multi-stage review with specific prompts and the concrete testing against real known bugs that is the most valuable part.

False positive identification rate

Posted Apr 1, 2026 19:17 UTC (Wed) by iabervon (subscriber, #722)

That's one direction of validation, but Sashiko obviously produces more comments than just the ones matching 50% of the patches that later needed fixes. If it catches half of the bugs in the code it reviews, but induces developers to introduce twice as many bugs in response to other comments, it's not making the code better. The general assumption is that if a reviewer says something about a real bug, the developer will fix it, and if the reviewer says something that's not about a real bug, the developer will do nothing, which means that false positives are a cost to the development process but otherwise safe. However, developers presumably don't really react correctly to all comments, so we should be worried about overly convincing false positives as well as true positives buried in too much noise.
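
To make the arithmetic concrete, here is a rough sketch of that trade-off. Everything in it is a made-up assumption for illustration: the `ReviewModel` fields, the `net_bug_change` helper, and all of the numbers are hypothetical, not measurements of Sashiko or of any real review data.

```python
from dataclasses import dataclass


@dataclass
class ReviewModel:
    """Hypothetical inputs; none of these values come from real data."""
    real_bugs: int          # bugs actually present in the reviewed patches
    recall: float           # fraction of real bugs the reviewer flags
    false_positives: int    # comments that do not describe a real bug
    fix_rate: float         # chance a flagged real bug gets fixed correctly
    harm_rate: float        # chance a false positive leads to a new bug


def net_bug_change(m: ReviewModel) -> float:
    """Return bugs introduced minus bugs removed (positive means worse code)."""
    removed = m.real_bugs * m.recall * m.fix_rate
    introduced = m.false_positives * m.harm_rate
    return introduced - removed


# The usual assumption: developers ignore every false positive.
ideal = ReviewModel(real_bugs=10, recall=0.5, false_positives=40,
                    fix_rate=1.0, harm_rate=0.0)

# Same reviewer, but a quarter of the false positives are convincing
# enough that the developer changes correct code and breaks it.
convincing = ReviewModel(real_bugs=10, recall=0.5, false_positives=40,
                         fix_rate=1.0, harm_rate=0.25)

print(net_bug_change(ideal))       # -5.0: five bugs removed, none added
print(net_bug_change(convincing))  # 5.0: five removed, ten added
```

With the same 50% catch rate, the sign of the result depends entirely on `harm_rate`, which is the number that validation against later Fixes: patches doesn't measure.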

False positive identification rate

Posted Apr 2, 2026 0:14 UTC (Thu) by Paf (subscriber, #91811)

That’s fair in the abstract, but seems overblown. In my usage I have yet to see it say “this is a bug and must be fixed” and be wrong. There’s a certain amount of “I am not sure about this?”, and those comments are lower quality, though often linked to questionable code readability in the source. I’m sure it makes the first category of error, but it seems to be rare.

Of note, Morton thinks this tool is so good it should be made mandatory. He has some significant experience of code review processes.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds