Allocation failures
Posted Apr 23, 2026 9:46 UTC (Thu) by devdanzin (subscriber, #183390)
In reply to: Allocation failures by maniax
Parent article: Using LLMs to find Python C-extension bugs
TL;DR: OOM in CPython can mean many different things, be recoverable, lead to wrong results, etc., and Python deployments may see legitimate ENOMEM routinely.
First, I'd point to a bit of evidence: extension maintainers and CPython maintainers think it's a worthwhile kind of fix. For maintainers who don't think so, I'll suppress this kind of finding (indeed, I've had a case where the whole report was dismissed because it led with this kind of issue).
Second, CPython's API is explicit: return NULL, set an exception, let the caller propagate. You can survive a `MemoryError` just fine. If in your domain it means the system is dying, you can still clean things up, log the issue, exit cleanly, etc. In other domains you can also have transient allocation failures, like when asking for too much memory (NumPy asked for 50GB, you tried to parse 1GB of JSON, etc.), that could be handled without bringing the whole program down. An OOM-kill of a web worker handling 1000 concurrent requests takes down 999 innocent requests if one request tries to over-allocate. An `abort()` in an extension running inside an embedding host (Blender, Jupyter, game engines) kills the host application, not just Python.
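To make the "one bad request shouldn't take down the others" point concrete, here is a minimal plain-C sketch. Everything in it is hypothetical: `capped_alloc` and `ALLOC_CAP` stand in for a real memory limit, and `handle_request`/`serve_all` stand in for a worker loop. In an actual CPython extension the failing branch would typically be `return PyErr_NoMemory();` after a failed `PyMem_Malloc()`, which this sketch does not use so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed per-allocation cap, standing in for a real memory limit. */
#define ALLOC_CAP ((size_t)1 << 20)

/* Hypothetical allocator: refuses oversized requests, the way a
 * transient ENOMEM would for a single over-allocating caller. */
static void *capped_alloc(size_t n)
{
    if (n == 0 || n > ALLOC_CAP)
        return NULL;
    return malloc(n);
}

/* Returns 0 on success, -1 if this request's buffer could not be
 * allocated. The failure is reported, not turned into abort(). */
static int handle_request(size_t payload_size)
{
    unsigned char *buf = capped_alloc(payload_size);
    if (buf == NULL)
        return -1;                    /* fail this request only */
    memset(buf, 0, payload_size);     /* stand-in for real work */
    free(buf);
    return 0;
}

/* Serve every request; one failed allocation does not stop the rest.
 * Returns the number of requests that succeeded. */
static size_t serve_all(const size_t *sizes, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (handle_request(sizes[i]) == 0)
            ok++;
    }
    return ok;
}
```

With three requests where only the middle one exceeds the cap, `serve_all` reports two successes: the oversized request fails alone.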
Third, OOM injection is a bug-finding technique. We frequently discover OOM failure paths leading to segfaults, which are strictly worse than handling an allocation error cleanly. So, even if we assume an OOM means the system is dying, we'd be finding places where an `abort()` should be added. The choice is between a segfault and `abort()`, and we're finding the segfaults. Given that you can cleanly handle even a system-is-dying OOM situation where all allocations will fail (by closing resources, logging, etc., without allocating, using pre-allocated error buffers or direct write(2)), finding places where the OOM leads to segfaults seems worthwhile to me.
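For readers who haven't seen OOM injection, this is roughly the idea, as a simplified sketch and not the actual tooling discussed in the article. The names `test_malloc`, `fail_at`, `pair_init` and `run_injection` are all made up for illustration: a wrapper fails the k-th allocation, and a driver walks k upward until the code under test succeeds, checking each failure path along the way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical injection hook: fail the allocation whose 1-based
 * index equals fail_at (0 means never fail). */
static unsigned long alloc_count = 0;
static unsigned long fail_at = 0;

static void *test_malloc(size_t n)
{
    alloc_count++;
    if (fail_at != 0 && alloc_count == fail_at)
        return NULL;
    return malloc(n);
}

/* Code under test: copies two strings into a pair. */
struct pair { char *first; char *second; };

static int pair_init(struct pair *p, const char *a, const char *b)
{
    p->first = NULL;
    p->second = NULL;
    p->first = test_malloc(strlen(a) + 1);
    if (!p->first) goto fail;
    p->second = test_malloc(strlen(b) + 1);
    if (!p->second) goto fail;
    strcpy(p->first, a);
    strcpy(p->second, b);
    return 0;
fail:
    free(p->first);       /* free(NULL) is a no-op */
    p->first = NULL;
    return -1;
}

/* Driver: fail each allocation in turn. Returns how many injected
 * failures were reported cleanly (error code, object left empty). */
static unsigned long run_injection(void)
{
    unsigned long handled = 0;
    for (unsigned long k = 1; ; k++) {
        struct pair p;
        alloc_count = 0;
        fail_at = k;
        int rc = pair_init(&p, "key", "value");
        fail_at = 0;
        if (rc == 0) {    /* k is past the last allocation: done */
            free(p.first);
            free(p.second);
            break;
        }
        if (p.first == NULL && p.second == NULL)
            handled++;
    }
    return handled;
}
```

`pair_init` makes two allocations, so the driver exercises two failure points. A version of `pair_init` that dereferenced `p->second` without the check would crash under the same driver, which is the kind of path the injection is meant to surface.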
Fourth, handling OOM correctly is mostly mechanical work (`if (!p) goto fail;` plus RAII-style cleanup) that helps on Linux whether the system is dying or not, and also applies to other OSes with different OOM semantics.
And fifth, we also find mishandling of OOM that leads to wrong results instead of a crash. The program survives the OOM, but a calculation or another kind of result is silently wrong, e.g. an error path that skips the sanity check, or leaves a partially-initialized object, or drops a result without signaling. These pernicious correctness bugs will affect programs going through transient allocation failures, so finding them is valuable IMO.
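A contrived illustration of the "silently wrong" case, not taken from any real extension: `scratch_alloc` and `force_alloc_failure` are a hypothetical test switch, and the two `sum_of_squares` functions are invented for the example. The buggy variant survives the failed allocation but returns a value that looks legitimate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical test switch to simulate a failed allocation. */
static bool force_alloc_failure = false;

static void *scratch_alloc(size_t n)
{
    return force_alloc_failure ? NULL : malloc(n);
}

/* Buggy: when the scratch buffer cannot be allocated, the work is
 * skipped and 0 is returned, which looks like a valid sum. */
static long sum_of_squares_buggy(const int *v, size_t n)
{
    long total = 0;
    long *sq = scratch_alloc(n * sizeof *sq);
    if (sq != NULL) {
        for (size_t i = 0; i < n; i++)
            sq[i] = (long)v[i] * v[i];
        for (size_t i = 0; i < n; i++)
            total += sq[i];
        free(sq);
    }
    return total;         /* silently wrong when sq == NULL */
}

/* Fixed: status and value are separate, so failure is visible.
 * Assumes n > 0 (malloc(0) is allowed to return NULL). */
static int sum_of_squares(const int *v, size_t n, long *out)
{
    long total = 0;
    long *sq = scratch_alloc(n * sizeof *sq);
    if (sq == NULL)
        return -1;        /* *out is left untouched */
    for (size_t i = 0; i < n; i++)
        sq[i] = (long)v[i] * v[i];
    for (size_t i = 0; i < n; i++)
        total += sq[i];
    free(sq);
    *out = total;
    return 0;
}
```

For the input `{1, 2, 3}` the correct answer is 14. With the failure switch on, the buggy version returns 0 with no indication that anything went wrong, while the fixed version returns -1 and lets the caller decide what to do.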
I hope this makes sense.
