Everyone insane or what?

Posted Apr 1, 2026 15:14 UTC (Wed) by pbonzini (subscriber, #60935)
In reply to: Everyone insane or what? by mirabilos
Parent article: The role of LLMs in patch review

Based on what I've exchanged with other developers, most found the false positive rate to be acceptable. The effort to look at the reports is not far from the effort of attending to a maintainer's review (which, it's worth remembering, *does* have "false positives" - we just call them questions instead).

On a recent patch series of mine it found six things worth changing, including a few small bugs and places where it was worth adding better comments that will likely be helpful for humans as well.

So: it's not yet ready, if it ever will be, for maintainers to look at the reports; but it's certainly worth looking at for authors of complex changes, to get a second opinion before the proverbially overworked maintainers get to your patch.



Everyone insane or what?

Posted Apr 1, 2026 20:17 UTC (Wed) by pm215 (subscriber, #98099) [Link] (5 responses)

Yes, human reviewers can have false positives -- but resolving those has the benefit of passing knowledge to that human reviewer. False positives from an automated tool help nobody, they're pure loss.

The original Coverity authors had a paper decades back noting the importance of a low false positive rate -- if you have too many false positives then users will decide your tool isn't worth paying attention to, and are likely also to ignore any genuine problems it flags up.

Everyone insane or what?

Posted Apr 1, 2026 20:27 UTC (Wed) by mb (subscriber, #50428) [Link] (1 responses)

>The original Coverity authors had a paper decades back noting the importance of a low false positive rate

What was the number?

Everyone insane or what?

Posted Apr 1, 2026 22:08 UTC (Wed) by pm215 (subscriber, #98099) [Link]

Digging out the ACM article I had in mind: https://www.cs.columbia.edu/~junfeng/18sp-e6121/papers/co... the part about false positives is on the last page. They say:

* above 30% is definitely bad
* they aimed for below 20%
* when forced to choose between more bugs and fewer false positives, choose the latter
* the initial reports are really important -- if the first few are bad then the response is "this tool sucks" and people reject it
* "you never want an embarrassing false positive. A stupid false positive implies the tool is stupid"
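The cutoffs quoted above can be sketched as a trivial check. This is my own illustration, not code from the paper; the function name, labels, and exact threshold handling are assumptions:

```python
# Illustrative sketch: classify a static-analysis tool's false positive
# rate against the rough ~20% / ~30% cutoffs cited from the Coverity paper.
# Function name and verdict labels are invented for illustration.

def classify_fp_rate(false_positives: int, total_reports: int) -> str:
    """Return a rough verdict on a tool's false positive rate."""
    if total_reports == 0:
        raise ValueError("no reports to classify")
    rate = false_positives / total_reports
    if rate > 0.30:
        return "definitely bad"   # above 30% is definitely bad
    if rate > 0.20:
        return "above target"     # they aimed for below 20%
    return "within target"

print(classify_fp_rate(5, 100))   # 5% false positives
print(classify_fp_rate(25, 100))  # 25% false positives
print(classify_fp_rate(40, 100))  # 40% false positives
```

The exact boundary behavior (e.g. whether 20% counts as "within target") is arbitrary here; the paper only gives rough guidance.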

(My personal experience of Coverity today is that its false positive rate is way higher than I would like.)

Everyone insane or what?

Posted Apr 2, 2026 6:36 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (1 responses)

Absolutely; however, the Coverity paper is about a different kind of issue and report. A tool that looks at a higher level, is able to look up related code, and understands the names of variables can (at least for me) afford a higher rate.

That said, I have used Coverity a lot more than Sashiko, so I admit my picture might be excessively rosy.

> False positives from an automated tool help nobody, they're pure loss.

Not entirely - it can suggest that a comment is necessary, for example. See the second report for patch 10 at https://sashiko.dev/#/patchset/20260326181723.218115-1-pb..., which is correct but impossible *now*.

Everyone insane or what?

Posted Apr 2, 2026 8:19 UTC (Thu) by pm215 (subscriber, #98099) [Link]

For me, the false positive argument holds regardless of the tool and what level of analysis it performs, because the cost is the same -- I have to go through the bogus reports, figure out what each one is suggesting, determine that it's wrong, and dismiss the report. I might hope that a tool capable of higher level analysis has a lower false positive rate (often Coverity f.p. reports are a result of an inability to see the higher level), but if it doesn't in practice have a low f.p. rate then it's just as bad and timewasting as any other.

Everyone insane or what?

Posted Apr 2, 2026 8:19 UTC (Thu) by khim (subscriber, #9252) [Link]

> False positives from an automated tool help nobody, they're pure loss.

That's not true with LLMs, surprisingly enough, and precisely for the same reason that's usually perceived as an LLM weakness.

LLMs can't “think”, but they are the world's best generators of bullshit (the term from the 1986 paper).

That makes their “true false positive” rate surprisingly low. If you define “false positive” not as “there are no issues with the code but the LLM says there are” but as “there are no issues with the code and it's obvious why there are no issues with the code”, then the false positive rate drops to almost zero, because most places where an LLM finds “something fishy” (when, in fact, everything is fine) are quite tricky and deserve at least a comment if not a change to the code.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds