
Using LLMs to find Python C-extension bugs

By Jake Edge
April 21, 2026

The open-source world is currently awash in reports of LLM-discovered bugs and vulnerabilities, which makes for a lot more work for maintainers, but many of the current crop are being reported responsibly with an eye toward minimizing that impact. A recent report on an effort to systematically find bugs in Python extensions written in C has followed that approach. Hobbyist Daniel Diniz used Claude Code to find more than 500 bugs of various sorts across nearly a million lines of code in 44 extensions; he has been working with maintainers to get fixes upstream and his methodology serves as a great example of how to keep the human in the loop—and the maintainers out of burnout—when employing LLMs.

The numbers are fairly eye-opening: "575+ confirmed bugs (~10-15% false positive rate after review, ~140 reproduced from Python) and fixes already merged in 14 projects". The types of the bugs range widely: "from hard crashes and memory corruption to correctness issues and spec violations". Meanwhile, Diniz would like to work with maintainers to make the effort "more useful and scalable for maintainers"; the goal is to provide high-quality reports of "a large class of non-trivial bugs" that are difficult to find manually.

To do that, Diniz created a Claude Code plugin, cext-review-toolkit, that is tuned for Python-specific problems that might be found in C extensions, such as mistakes in reference counting, in handling the global interpreter lock (GIL), and in managing exception state. It uses "13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class".

Results

The lengthy report is worth reading in its entirety, but we will highlight a few parts of it here. The tool found lots of bugs, as noted, many of which resulted in bug reports and pull requests (PRs). There are lots of links to both for more than a dozen different C extension projects, including Cython, Guppy 3, regex, Pillow, and more. The Guppy 3 maintainer, YiFei Zhu, was highlighted for digging into the extensive report for that project, fixing 24 of 30 issues found, and finding "additional bugs the tool missed". In addition, the feedback provided in the umbrella issue for the findings was "invaluable", leading to improvements to the tools to reduce false positives.

The report describes how the tool and process work: the agents are run for a project, the findings are reviewed, pure-Python reproducers are created when possible, and then a report is shared with the maintainers via a secret GitHub gist. There is another document that describes techniques for creating reproducers in Python and the report itself describes the specific types of bugs targeted by the agents.

More importantly, given the widespread problems with maintainers being buried under slop bug reports and PRs, Diniz is clearly trying to ensure that his work is worthwhile to the projects:

Reports like these can be time and energy-intensive for maintainers to investigate. Historically, automated bug-finding tools have produced far more false positives than useful information, and AI can make those false positives look incredibly convincing.

[...] When a maintainer points out a false positive, I immediately update the agents' prompts so that specific pattern is avoided in the future.

Beyond polishing the tools, I try to communicate in a non-invasive, helpful manner. The maintainer always holds the reins: I ask them how they prefer to receive the information (an umbrella issue? individual issues? direct PRs? or do nothing at all) and let them decide exactly what to do with the findings.

There is more to the report, including an example of a bug and reproducer, a look at things that did not work, and so on. He ended with a set of questions for the community about whether the effort is useful, how to improve the tools and reports, and ideas for future tools. He mentions several other projects he is working on, such as an analysis tool aimed at C extensions with regard to free-threaded Python and another tool to analyze the CPython source code.

Reaction

The reaction has been quite positive, which is no surprise, with a few Python developers and maintainers popping up to talk about the experience and to suggest ideas for further refinements. James Parrott wondered about the number of bugs that would be eliminated if Rust had been used instead. Cython maintainer David Woods thought that Rust could eliminate things like reference-counting problems, but probably not the exception-handling bugs that were prevalent in the report for Cython. Diniz put the Rust question to Claude Code, which estimated that 60-70% of the bugs would not be prevented by Rust; Diniz cautioned that "given LLM's troubles with numbers and estimates, I wouldn't trust the percentages too much". But even the broad categorization may be suspect, as Matthias Urlichs said that he thought Rust could prevent more types of problems "if the Rust API is designed safely (in the Rust sense) instead of literally following the C API".

Parrott also suggested using the GitHub Actions system to reproduce the bugs. That would improve the tool's reports, which are less than ideal for him: "I don't want to have to read a huge machine generated report and work out what's what." Diniz was appreciative of the suggestions and thought that he could implement them relatively easily. In particular, customizing reports is already on his radar: "I'd like to tailor the reports to what maintainers need, some like having reproducers and suggested fixes, others would prefer just a short description and code locations."

Eric Soroos, one of the Pillow maintainers, thought it was "one of the better sets of reports that we've gotten about potential security/correctness issues". He did note that the coverage was incomplete, as he spotted similar bugs in related functions that were not found. Some of the bugs were difficult to reproduce because they required a memory-allocation failure to occur in a specific place, leading to a tooling suggestion:

It would be interesting as a test run to have a fuzzer that used coverage guidance to fail mallocs (or c-api python methods) to test the error handling in those cases. It would need to run under valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if(ptr==null) {free everything allocated in the function} c level error handling.

The idea was met with approval, so Soroos expanded on it later in the thread.
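The core of that suggestion can be illustrated without any C at all. What follows is a minimal Python sketch, not Soroos's proposal or anything from Diniz's toolkit: it swaps coverage guidance for simply failing each allocation in turn, and it stands in for malloc() with a hypothetical `FailingAllocator` class. The names `build_pair_buggy()`, `build_pair_fixed()`, and `find_leaks()` are likewise invented for illustration.

```python
class FailingAllocator:
    """Hypothetical allocator that fails on its Nth call (counting from 0)."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.calls = 0
        self.live = set()  # handles that are allocated and not yet freed

    def alloc(self):
        index = self.calls
        self.calls += 1
        if index == self.fail_at:
            raise MemoryError(f"injected failure at allocation {index}")
        self.live.add(index)
        return index

    def free(self, handle):
        self.live.remove(handle)


def build_pair_buggy(allocator):
    """Allocates two buffers, but leaks the first if the second fails."""
    first = allocator.alloc()
    second = allocator.alloc()  # if this raises, `first` is never freed
    return first, second


def build_pair_fixed(allocator):
    """Same work, but releases the first buffer on the error path."""
    first = allocator.alloc()
    try:
        second = allocator.alloc()
    except MemoryError:
        allocator.free(first)
        raise
    return first, second


def find_leaks(func, max_allocations=8):
    """Fail each allocation in turn; return the indices that leave a leak."""
    leaky = []
    for fail_at in range(max_allocations):
        allocator = FailingAllocator(fail_at)
        try:
            func(allocator)
        except MemoryError:
            if allocator.live:
                leaky.append(fail_at)
        else:
            # Nothing failed, so every allocation site has been exercised.
            break
    return leaky


buggy_leaks = find_leaks(build_pair_buggy)
fixed_leaks = find_leaks(build_pair_fixed)
```

The harness flags the second allocation in `build_pair_buggy()` as the one whose failure leaks the first buffer, and reports nothing for the fixed variant. A real tool would need to interpose on the C allocator and run under something like valgrind, as Soroos notes.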

The severity of the bugs being found, and whether they are worth the maintainer attention needed to fix them, may also factor into the question about the reports, as Maurycy Pawłowski-Wieroński noted. He had tried using Diniz's LLM tool for CPython and had mixed results, in part because some of the bugs are only reproducible in ways that users are unlikely to ever hit:

Unless the issue is critical (even if perfectly reproducible), many fixes are just distracting. Maintainers have their own projects, plans, schedules etc., and some pathological refleak is not really that important. I believe that such PRs used to make it in the past, because they were seen as an investment (education) in a potential maintainer, a future colleague. Now, it's "Contributor" badge hunting.

Diniz had a thoughtful, and seemingly characteristic, reply, agreeing that "not all findings are worth fixing". Maintainers will draw their own lines of what warrants a fix, so he is not in a position to decide which bugs merit addressing. "The best I can do is offer a listing of what the tools find and let them decide what to fix." He said that so far he has not gotten much feedback on whether "tiny PRs targeting nits, leaks, etc." are valuable or not, but he is open to discussing it.

This issue is likely to recur. Finding and fixing bugs in the handling of memory-allocation failures, for example, is certainly important, but it may well not be as important as other things that maintainers are trying to accomplish. Tuning LLMs to prioritize their reports based on the likelihood of real-world exploitation would be another helpful step. Those who are using these tools for ill are surely pointing them toward exploitable bugs; LLM providers could potentially use those prompts (or share them) for defensive purposes. The LLM providers just might have their own tools and models that could be loosed on such a task as well.

Keeping maintainers fully in control is perhaps the most important element of this effort; giving them the ability to opt out is particularly key. There is a balance to be struck there, of course, because there may be bugs found that need escalation even when the project and its maintainers are not interested in the machine-generated reports. These are the early days for LLM bug-finding—and machines can generate far more reports than mere humans can process—so we are likely to see a variety of approaches, both good and ill. For now, this seems like a nice example of the "good" side of the coin.


Index entries for this article
Python: Bug reporting
Python: Large language models (LLMs)



Maintainer input and code review

Posted Apr 21, 2026 16:15 UTC (Tue) by jorgegv (subscriber, #60484) [Link] (2 responses)

From my experience using LLMs for development, I'd suggest two improvements (I have found that they dramatically improve the quality of the result):

First, it would be great if the maintainers specified, broadly, which categories or types of bugs they are _not_ interested in. Then the LLM can be instructed to assess each bug against these restrictions and modify or discard its output accordingly.

As a second improvement, I always have a prompt request similar to the following: "For every piece of code developed (feature, bugfix, test, etc.), have an independent agent review the code. Be very critical. An agent should NEVER review its own code". This definitely increases the number of tokens consumed per feature, but I have found that the reviewer often finds things that the original developer has missed or implemented incorrectly. In extreme cases (e.g. when designing a complicated architecture refactor and test plan) I have even requested a second review, to break any ties between the first two agents.

Maintainer input and code review

Posted Apr 21, 2026 23:32 UTC (Tue) by karath (subscriber, #19025) [Link]

In important work, I’d suggest using an entirely different model to perform the reviews. Defensive review of code is important.

Maintainer input and code review

Posted Apr 22, 2026 9:41 UTC (Wed) by devdanzin (subscriber, #183390) [Link]

> First, it would be great if the maintainers specified, broadly, which categories or types of bugs they are _not_ interested in. Then the LLM can be instructed to assess each bug against these restrictions and modify or discard its output accordingly.

That's a great suggestion, thank you! I've been tailoring the report format and style (removing reproducers, making lists of actionable items, etc.) according to what maintainers request, but asking them about insignificant bug classes is a clear improvement. I'll start doing that.

> As a second improvement, I always have a prompt request similar to the following: "For every piece of code developed (feature, bugfix, test, etc.), have an independent agent review the code. Be very critical. An agent should NEVER review its own code".

There's something vaguely similar to this in place, and I'm working on a new plugin that addresses it more directly.

Right now, the analysis is run three times: two naive passes, in which the agents don't know about the others' results, and one informed pass, where the agent is fed the previous agents' findings. This allows checking for convergence (do two agents reach the same conclusions about a given bug?) and differential analysis (what do they disagree on?). And when we reproduce findings, that works as independent confirmation from the main Claude Code instance.
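The convergence and differential checks amount to set operations over the findings of the two naive passes. A minimal sketch, assuming each finding can be reduced to a hypothetical (file, function, bug class) key; the toolkit's real data model is not described here, and `compare_passes()` and the sample findings are invented for illustration:

```python
def compare_passes(pass_a, pass_b):
    """Compare the findings of two independent (naive) analysis passes."""
    a, b = set(pass_a), set(pass_b)
    return {
        "converged": a & b,  # both passes reported the finding
        "only_a": a - b,     # disputed: reported by the first pass only
        "only_b": b - a,     # disputed: reported by the second pass only
    }


# Invented sample findings, keyed by (file, function, bug class).
naive_1 = [
    ("buf.c", "buf_new", "refcount leak"),
    ("buf.c", "buf_resize", "unchecked NULL"),
]
naive_2 = [
    ("buf.c", "buf_new", "refcount leak"),
    ("io.c", "io_read", "GIL not held"),
]

report = compare_passes(naive_1, naive_2)
```

Findings in `converged` are the ones two independent passes agree on; the `only_a` and `only_b` sets are the disagreements that would deserve a closer look.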

I'm working on a related plugin, report-quality-gate, which goes through a report assessing its relevance, tone, factual correctness, etc. In doing this, it reviews the report (as opposed to the findings). Adding an independent "adversarial" finding reviewer before this phase could be interesting; once the new plugin is done I can give it a try and see how it works. Thank you again for the suggestion!

Report author here

Posted Apr 22, 2026 9:23 UTC (Wed) by devdanzin (subscriber, #183390) [Link] (2 responses)

So happy with this article! :)

Please let me know if you have any questions.

And here's my standing invitation: if you have a Python C extension and want it to be analyzed, just point me to it.

Report author here

Posted Apr 22, 2026 11:31 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 response)

Thanks! There is also the GCC project of a CPython API analyzer as well. How feasible would it be to reify the rules into the GCC scanner itself so that we can offload detection of many things from the cloud models and instead make detection "cheap" again?

I'd selfishly request a review of VTK's generated Python bindings, but I don't want to waste cycles on it as I have a(n unscheduled) plan to update them to be abi3-compliant anyways. It is a "critical" package on PyPI if you want to do a scan; just don't put it at the top of a list because of the planned work likely obviating much of it (unless checkpointing for a before/after comparison is also doable).

Report author here

Posted Apr 23, 2026 9:02 UTC (Thu) by devdanzin (subscriber, #183390) [Link]

> There is also the GCC project of a CPython API analyzer as well. How feasible would it be to reify the rules into the GCC scanner itself so that we can offload detection of many things from the cloud models and instead make detection "cheap" again?

I was going to say that I don't believe it would be feasible, but decided to ask Claude for confirmation and actually got a response[0] indicating it would be very feasible and pretty useful for my analysis project! Thank you for asking, it may well be our main future goal.

> I'd selfishly request a review of VTK's generated Python bindings,

Thank you for taking the offer! :)

The main consumable of the review is the list of actionable items[1], but I include all reports and appendices in case you want to see the detailed findings, methodology and other context. Feel free to not act on any of these, we can re-run the tools after your update to abi3.

> I don't want to waste cycles on it as I have a(n unscheduled) plan to update them to be abi3-compliant anyways.

Good news is one of the agents specializes in checking abi3 compliance and migration feasibility, and it generated a report[2] for you. I hope it helps you when you update to abi3.

[0] https://gist.github.com/devdanzin/06c2e34dac411d9a5f3cf45...
[1] https://gist.github.com/devdanzin/bb73cbfce9e421fd853c536...
[2] https://gist.github.com/devdanzin/bb73cbfce9e421fd853c536...

Allocation failures

Posted Apr 23, 2026 6:09 UTC (Thu) by maniax (subscriber, #4509) [Link] (7 responses)

Maybe this needs explicit clarification, but is a Python program expected to continue working after an allocation failure? What would be the use case for that, and shouldn't it be handled with an abort()?

The kernel is supposed to have failing allocations, but other than that, at least under Linux, an allocation failure is a symptom of a dying system, and aborting the processes seems to be the sanest option. And adding error handling for conditions that are not handle-able is a way to add complexity to a probably already complex code base.

Allocation failures

Posted Apr 23, 2026 9:46 UTC (Thu) by devdanzin (subscriber, #183390) [Link] (6 responses)

> is a Python program expected to continue working after an allocation failure? What would be the use case for that, and shouldn't it be handled with an abort()?

TL;DR: OOM in CPython can mean many different things, be recoverable, lead to wrong results, etc., and Python deployments may see legitimate ENOMEM routinely.

First I'd point to a bit of evidence in the fact that extension maintainers and CPython maintainers think it's a worthwhile kind of fix. For maintainers that don't think so, I'll suppress this kind of finding (indeed I've had a case where the whole report was dismissed because it led with this kind of issue).

Second, CPython's API is explicit: return NULL, set an exception, let the caller propagate. You can survive a `MemoryError` just fine. If in your domain it means the system is dying you can still clean things up, log the issue, exit cleanly, etc. You can also have transient allocation failures in other domains, like when asking for too much memory (NumPy asked for 50GB, you tried to parse 1GB of JSON, etc.) that could be handled without bringing the whole program down. OOM-kill of a web worker handling 1000 concurrent requests takes down 999 innocent requests if one request tries to over allocate. An `abort()` in an extension running inside an embedding host (Blender, Jupyter, game engines) kills the host application, not just Python.
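The per-request case can be sketched in a few lines. This assumes 64-bit CPython, where the oversized string multiplication fails immediately instead of exhausting memory; `handle_request()` and `serve()` are hypothetical names, not part of any framework:

```python
import sys


def handle_request(size):
    """Serve one hypothetical request that needs a payload of `size` characters."""
    try:
        payload = "a" * size
    except MemoryError:
        # Only this request fails; the worker process keeps running.
        return "error: request too large"
    return f"ok: {len(payload)} characters"


def serve(sizes):
    """Handle each request in turn inside the same process."""
    return [handle_request(size) for size in sizes]


# The middle request asks for far more memory than can be allocated.
results = serve([16, sys.maxsize, 32])
```

The requests before and after the oversized one are served normally, which is the behavior an abort() in the allocation path would rule out.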

Third, OOM injection is a bug-finding technique. We frequently discover OOM failure paths leading to segfaults, which are strictly worse than handling an allocation error situation cleanly. So, even if we assume OOMs are the system dying, we'd be finding places where an `abort()` should be added. The choice is between a segfault and `abort()`, and we're finding the segfaults. Given you can cleanly handle even a system-is-dying OOM situation where all allocations will fail (by closing resources, logging, etc., without allocating, by using pre-allocated error buffers or direct write(2)), finding places where the OOM leads to segfaults seems worthwhile to me.

Fourth, the cost of handling OOM correctly is mostly mechanical (`if (!p) goto fail;` + RAII cleanup) that will help on Linux whether the system is dying or not, and also apply to other OSs with different OOM semantics.

And fifth, we also find mishandling of OOM that leads to wrong results instead of a crash. The program survives the OOM, but a calculation or another kind of result is silently wrong, e.g. an error path that skips the sanity check, or leaves a partially-initialized object, or drops a result without signaling. These pernicious correctness bugs will affect programs going through transient allocation failures; finding them is valuable IMO.
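A minimal Python sketch of the "drops a result without signaling" case, with a hypothetical `load()` function standing in for code whose allocation can fail; `total_buggy()` and `total_fixed()` are illustrative names only, and real instances of this bug class live in C error paths:

```python
def load(chunk):
    """Hypothetical loader: pretend that negative chunks fail to allocate."""
    if chunk < 0:
        raise MemoryError("injected allocation failure")
    return [chunk] * 2


def total_buggy(chunks):
    """Sums all loaded values, but silently skips chunks that failed."""
    total = 0
    for chunk in chunks:
        try:
            total += sum(load(chunk))
        except MemoryError:
            continue  # BUG: the caller never learns that data is missing
    return total


def total_fixed(chunks):
    """Same computation, but lets the MemoryError reach the caller."""
    return sum(sum(load(chunk)) for chunk in chunks)


# The buggy version returns a plausible-looking number despite the failure.
partial = total_buggy([1, -1, 3])
```

The buggy version survives the failure and returns the same total it would give if the failing chunk had never existed, while the fixed version raises `MemoryError` so that the caller can decide what to do.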

I hope this makes sense.

Allocation failures

Posted Apr 23, 2026 13:23 UTC (Thu) by maniax (subscriber, #4509) [Link]

Yes, thanks, this does clear a few things :)

How common are allocation failures

Posted Apr 25, 2026 2:11 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (2 responses)

See title. Are they something that you see happen in production, or on users’ systems?

How common are allocation failures

Posted Apr 25, 2026 14:49 UTC (Sat) by devdanzin (subscriber, #183390) [Link] (1 response)

You can say that with a well behaved program + system combo, they should be rare.

But in Python it's simple to run into them if your program isn't prepared to, e.g., handle problematic input. A too large string multiplication or an attempt to create a gigantic NumPy array (or even a plain list) will result in a MemoryError that is recoverable and can just result in a log entry or a message to the user.

With very resource starved VMs, you can hit MemoryErrors even in well behaved programs, but they'll probably be gracefully handled by them (including aborting if it makes sense). And as I said elsewhere, it might just make a single request fail and let the program continue running.

I'm gathering a few examples of MemoryErrors in production for another answer here, should be able to post it tonight.

How common are allocation failures

Posted Apr 26, 2026 14:14 UTC (Sun) by devdanzin (subscriber, #183390) [Link]

I've tried to collect a few links showing real world OOM situations, but couldn't find many. I found a few more about processing large images, pandas, etc. But here are the main ones, many of them only showing triggerable and recoverable `MemoryError`.

- https://discuss.python.org/t/memoryerror-despite-having-e..., which shows a case that raises `MemoryError` despite it looking like the array should fit in RAM.
- https://github.com/msgpack/msgpack-python/issues/239, a real world issue on VMs with low memory amounts resulting in wrong tracebacks.
- https://github.com/gmpy2/gmpy2/issues/280, where GMP aborts on memory error and the user would like to be able to catch an exception instead.
- https://blog.stackademic.com/python-in-production-the-15-..., where a recoverable pandas `MemoryError` brought everything down.
- https://medium.com/@ryan_forrester_/understanding-and-fix..., showing examples of handling `MemoryError` and a plausible situation.
- https://medium.com/brexeng/debugging-and-preventing-memor..., an interesting article giving an enhanced way of handling `MemoryError`.
- https://pythonspeed.com/articles/python-out-of-memory/ shows how easy it is to get a segfault on an OOM that should be easily recoverable.
- https://discuss.python.org/t/how-must-we-handle-integer-o..., where it's pointed out that `string = "a" * 9223372036854775807` raises a `MemoryError`.
- https://discuss.python.org/t/trying-to-understand-roundin..., where it's pointed out that `import decimal; decimal.getcontext().prec = decimal.MAX_PREC; decimal.Decimal(1) / 3` raises a `MemoryError`.
- https://discuss.python.org/t/faster-large-integer-multipl..., where Tim Peters shows that `x = 1 << 1000000000000` raises a recoverable `MemoryError`, unlike GMP.
- https://discuss.python.org/t/a-product-function-which-sup..., where it's pointed out that `from itertools import product; next(product(range(1 << 30), repeat=2))` causes a real `MemoryError` (exhausts memory in the system), but it's recoverable if not OOM terminated.

Allocation failures

Posted Apr 25, 2026 8:55 UTC (Sat) by mb (subscriber, #50428) [Link] (1 responses)

> OOM-kill of a web worker handling 1000 concurrent requests takes down 999 innocent requests if one request tries to over allocate.

If the one process caused the OOM, the 999 are *not* innocent. They used up all the memory.
The system is designed incorrectly, if this can happen.
OOM is an emergency situation that cannot be handled in a sane way. Even if the one process handles its NULL pointers correctly, the system's about-to-be OOM state persists and the next request will run into it.
The system is already dead.
The correct handling is to kill processes to free up significant amounts of memory instead of handling the failures that will keep happening.

Allocation failures

Posted Apr 26, 2026 13:20 UTC (Sun) by devdanzin (subscriber, #183390) [Link]

> If the one process caused the OOM, the 999 are *not* innocent. They used up all the memory.
> The system is designed incorrectly, if this can happen.

In Python, it's possible to trigger a `MemoryError` without the memory being all used up. So the 999 can be innocent IMO. And that is without going into transient OOMs, where the system memory is exhausted by another process that gets OOM terminated. Your innocent Python process may well keep running correctly after getting a few allocation errors.

Examples of synthetic code that will raise a `MemoryError` with plenty of memory left (in fact, these work as the first line typed in the REPL):
>>> string = "a" * 9223372036854775807
>>> x = 1 << 1000000000000
>>> import decimal; decimal.getcontext().prec = decimal.MAX_PREC; decimal.Decimal(1) / 3
>>> from itertools import product; next(product(range(1 << 30), repeat=2)) # causes a "real" `MemoryError` (exhausts memory in the system), but it may be recoverable if not OOM terminated. Best fit for aborting though.

You may say a system that would let something like this happen is incorrectly designed, but it isn't such a far-fetched situation. People do get this kind of `MemoryError` in production, where expected input works but untrusted or problematic input causes the error to happen. And aborting in all of them may create a DoS where one need not exist.

> OOM is an emergency situation that cannot be handled in a sane way. Even if the one process handles its NULL pointers correctly, the system's about-to-be OOM state persists and the next request will run into it.
> The system is already dead.
> The correct handling is to kill processes to free up significant amounts of memory instead of handling the failures that will keep happening.

I do not agree; as shown above, you can get a `MemoryError` in CPython because the requested allocation is too big, even with plenty of memory left. That is a recoverable situation. The system isn't necessarily near OOM nor dead. So killing processes isn't always the right call.

So, all in all, I think there are plenty of situations where defending against and handling `MemoryError` in CPython makes sense. Of course, there are situations like you describe where aborting would be the right choice. Given all the above, what do you think?


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds