
Allocation failures

Posted Apr 23, 2026 6:09 UTC (Thu) by maniax (subscriber, #4509)
Parent article: Using LLMs to find Python C-extension bugs

Maybe this needs explicit clarification, but is a Python program expected to continue working after an allocation failure? What would be the use case for that, and shouldn't it be handled with an abort()?

The kernel is supposed to have failing allocations, but other than that, at least under Linux, an allocation failure is a symptom of a dying system, and aborting the process seems to be the sanest option. And adding error handling for conditions that are not handle-able is a way to add complexity to a probably already complex code base.


Allocation failures

Posted Apr 23, 2026 9:46 UTC (Thu) by devdanzin (subscriber, #183390) [Link] (6 responses)

> is a Python program expected to continue working after an allocation failure? What would be the use case for that, and shouldn't it be handled with an abort()?

TL;DR: OOM in CPython can mean many different things, be recoverable, lead to wrong results, etc., and Python deployments may see legitimate ENOMEM routinely.

First, I'd point to a bit of evidence: extension maintainers and CPython maintainers think it's a worthwhile kind of fix. For maintainers that don't think so, I'll suppress this kind of finding (indeed, I've had a case where the whole report was dismissed because it led with this kind of issue).

Second, CPython's API is explicit: return NULL, set an exception, let the caller propagate. You can survive a `MemoryError` just fine. If in your domain it means the system is dying, you can still clean things up, log the issue, exit cleanly, etc. You can also have transient allocation failures in other domains, like when asking for too much memory (NumPy asked for 50GB, you tried to parse 1GB of JSON, etc.), that could be handled without bringing the whole program down. An OOM-kill of a web worker handling 1000 concurrent requests takes down 999 innocent requests if one request tries to over-allocate. An `abort()` in an extension running inside an embedding host (Blender, Jupyter, game engines) kills the host application, not just Python.
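To make the web worker point concrete, here is a minimal sketch of isolating an over-allocating request. The names `handle_request` and `serve` and the response strings are hypothetical, made up for illustration; the only real behaviour relied on is that CPython rejects an absurdly large string repetition with `MemoryError` before touching system memory.

```python
import sys


def handle_request(payload_size: int) -> str:
    """Hypothetical request handler that builds a buffer of the requested size."""
    try:
        buffer = "a" * payload_size  # raises MemoryError for absurd sizes
    except MemoryError:
        # Only this request fails; the worker keeps serving the others.
        return "413 payload too large"
    return f"200 ok ({len(buffer)} chars)"


def serve(requests: list[int]) -> list[str]:
    """Process every request, even if one of them over-allocates."""
    return [handle_request(size) for size in requests]


responses = serve([10, sys.maxsize, 20])
print(responses)
```

The request in the middle fails on its own, and the two around it complete normally, which is the opposite of what an `abort()` or an OOM-kill of the worker would give you.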

Third, OOM injection is a bug-finding technique. We frequently discover OOM failure paths leading to segfaults, which are strictly worse than handling an allocation error cleanly. So, even if we assume OOMs are the system dying, we'd be finding places where an `abort()` should be added. The choice is between a segfault and `abort()`, and we're finding the segfaults. Given that you can cleanly handle even a system-is-dying OOM situation where all allocations will fail (by closing resources, logging, etc., without allocating, using pre-allocated error buffers or direct write(2)), finding places where the OOM leads to segfaults seems worthwhile to me.
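A rough Python-level sketch of the "pre-allocated error buffer plus direct write(2)" idea follows. `EMERGENCY_MESSAGE`, `emergency_log` and `run` are hypothetical names, and this is only an approximation: the interpreter itself may still allocate while handling the exception, so the faithful version of this pattern lives in C with a static buffer.

```python
import os
import sys

# Built at import time, while memory is still available, so the error
# path does not have to format or allocate a new message.
EMERGENCY_MESSAGE = b"fatal: out of memory, shutting down cleanly\n"


def emergency_log(fd: int) -> int:
    """Write the pre-built message straight to a file descriptor (write(2))."""
    return os.write(fd, EMERGENCY_MESSAGE)


def run(task, fd: int = 2) -> int:
    """Run task(); on MemoryError, log without formatting and return an exit code."""
    try:
        task()
    except MemoryError:
        emergency_log(fd)
        return 1
    return 0


# The oversized string repetition stands in for any failing allocation.
exit_code = run(lambda: "a" * sys.maxsize)
```

Here the failure is reported on stderr and turned into an exit code, instead of a traceback built from freshly allocated strings or a crash.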

Fourth, the cost of handling OOM correctly is mostly mechanical (`if (!p) goto fail;` + RAII cleanup); it helps on Linux whether the system is dying or not, and also applies to other OSs with different OOM semantics.

And fifth, we also find mishandling of OOM that leads to wrong results instead of a crash. The program survives the OOM, but a calculation or another kind of result is silently wrong, e.g. an error path that skips the sanity check, or leaves a partially-initialized object, or drops a result without signaling. These pernicious correctness bugs will affect programs going through transient allocation failures; finding them is valuable IMO.
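As an illustration of how allocation-failure injection can surface this class of bug, here is a pure-Python simulation. It is not the tooling discussed in the article: `FailingAllocator`, `sum_of_squares_buggy` and `find_silent_corruption` are hypothetical names, and real injection would hook the C allocator instead of passing an allocator object around.

```python
class FailingAllocator:
    """Hypothetical allocator stand-in that fails on one chosen call."""

    def __init__(self, fail_at=None):
        self.fail_at = fail_at  # index of the call that fails, or None
        self.calls = 0

    def alloc(self, n):
        index = self.calls
        self.calls += 1
        if index == self.fail_at:
            raise MemoryError(f"injected failure at allocation #{index}")
        return [0] * n


def sum_of_squares_buggy(values, chunk_size, allocator):
    """Sum squares chunk by chunk; the error path silently drops a chunk."""
    total = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        try:
            scratch = allocator.alloc(len(chunk))
        except MemoryError:
            continue  # BUG: the chunk's result is dropped without signaling
        for i, value in enumerate(chunk):
            scratch[i] = value * value
        total += sum(scratch)
    return total


def find_silent_corruption(func, values, chunk_size):
    """Fail each allocation in turn; report the ones that give a wrong answer."""
    baseline = FailingAllocator()
    expected = func(values, chunk_size, baseline)
    wrong = []
    for fail_at in range(baseline.calls):
        try:
            result = func(values, chunk_size, FailingAllocator(fail_at))
        except MemoryError:
            continue  # a clean, signaled failure is acceptable
        if result != expected:
            wrong.append((fail_at, result))
    return expected, wrong


expected, wrong = find_silent_corruption(sum_of_squares_buggy, [1, 2, 3, 4, 5, 6], 2)
print(expected, wrong)
```

Every injected failure in this toy function produces a plausible-looking but wrong total instead of an exception, which is exactly the kind of outcome a crash-only view of OOM would miss.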

I hope this makes sense.

Allocation failures

Posted Apr 23, 2026 13:23 UTC (Thu) by maniax (subscriber, #4509) [Link]

Yes, thanks, this does clear a few things :)

How common are allocation failures

Posted Apr 25, 2026 2:11 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (2 responses)

See title. Are they something that you see happen in production, or on users’ systems?

How common are allocation failures

Posted Apr 25, 2026 14:49 UTC (Sat) by devdanzin (subscriber, #183390) [Link] (1 responses)

You can say that with a well behaved program + system combo, they should be rare.

But in Python it's simple to run into them if your program isn't prepared to, e.g., handle problematic input. A too-large string multiplication or an attempt to create a gigantic NumPy array (or even a plain list) will result in a `MemoryError` that is recoverable and can just result in a log entry or a message to the user.

With very resource-starved VMs, you can hit `MemoryError` even in well-behaved programs, but they'll probably handle it gracefully (including aborting if it makes sense). And as I said elsewhere, it might just make a single request fail and let the program continue running.

I'm gathering a few examples of MemoryErrors in production for another answer here, should be able to post it tonight.

How common are allocation failures

Posted Apr 26, 2026 14:14 UTC (Sun) by devdanzin (subscriber, #183390) [Link]

I've tried to collect a few links showing real world OOM situations, but couldn't find many. I found a few more about processing large images, pandas, etc. But here are the main ones, many of them only showing triggerable and recoverable `MemoryError`.

- https://discuss.python.org/t/memoryerror-despite-having-e..., which shows a case that raises `MemoryError` even though the array looks like it should fit in RAM.
- https://github.com/msgpack/msgpack-python/issues/239, a real world issue on VMs with low memory amounts resulting in wrong tracebacks.
- https://github.com/gmpy2/gmpy2/issues/280, where GMP aborts on memory error and the user would like to be able to catch an exception instead.
- https://blog.stackademic.com/python-in-production-the-15-..., where a recoverable pandas `MemoryError` brought everything down.
- https://medium.com/@ryan_forrester_/understanding-and-fix..., showing examples of handling `MemoryError` and a plausible situation.
- https://medium.com/brexeng/debugging-and-preventing-memor..., an interesting article giving an enhanced way of handling `MemoryError`.
- https://pythonspeed.com/articles/python-out-of-memory/ shows how easy it is to get a segfault on an OOM that should be easily recoverable.
- https://discuss.python.org/t/how-must-we-handle-integer-o..., where it's pointed out that `string = "a" * 9223372036854775807` raises a `MemoryError`.
- https://discuss.python.org/t/trying-to-understand-roundin..., where it's pointed out that `import decimal; decimal.getcontext().prec = decimal.MAX_PREC; decimal.Decimal(1) / 3` raises a `MemoryError`.
- https://discuss.python.org/t/faster-large-integer-multipl..., where Tim Peters shows that `x = 1 << 1000000000000` raises a recoverable `MemoryError`, unlike GMP.
- https://discuss.python.org/t/a-product-function-which-sup..., where it's pointed out that `from itertools import product; next(product(range(1 << 30), repeat=2))` causes a real `MemoryError` (exhausts memory in the system), but it's recoverable if not OOM terminated.

Allocation failures

Posted Apr 25, 2026 8:55 UTC (Sat) by mb (subscriber, #50428) [Link] (1 responses)

> OOM-kill of a web worker handling 1000 concurrent requests takes down 999 innocent requests if one request tries to over allocate.

If the one process caused the OOM, the 999 are *not* innocent. They used up all the memory.
The system is designed incorrectly, if this can happen.
OOM is an emergency situation that cannot be handled in a sane way. Even if the one process handles its NULL pointers correctly, the system's about-to-be OOM state persists and the next request will run into it.
The system is already dead.
The correct handling is to kill processes to free up significant amounts of memory instead of handling the failures that will keep happening.

Allocation failures

Posted Apr 26, 2026 13:20 UTC (Sun) by devdanzin (subscriber, #183390) [Link]

> If the one process caused the OOM, the 999 are *not* innocent. They used up all the memory.
> The system is designed incorrectly, if this can happen.

In Python, it's possible to trigger a `MemoryError` without the memory being all used up. So the 999 can be innocent IMO. And that is without going into transient OOMs, where the system memory is exhausted by another process that gets OOM terminated. Your innocent Python process may well keep running correctly after getting a few allocation errors.

Examples of synthetic code that will raise a `MemoryError` with plenty of memory left (in fact, these work as the first line typed in the REPL):
>>> string = "a" * 9223372036854775807
>>> x = 1 << 1000000000000
>>> import decimal; decimal.getcontext().prec = decimal.MAX_PREC; decimal.Decimal(1) / 3
>>> from itertools import product; next(product(range(1 << 30), repeat=2)) # causes a "real" `MemoryError` (exhausts memory in the system), but it may be recoverable if not OOM terminated. Best fit for aborting though.

You may say a system that would let something like this happen is incorrectly designed, but it isn't such a far-fetched situation. People do get this kind of `MemoryError` in production, where expected input works but untrusted or problematic input causes the error to happen. And aborting in all of these cases may create a DoS where one need not exist.

> OOM is an emergency situation that cannot be handled in a sane way. Even if the one process handles its NULL pointers correctly, the system's about-to-be OOM state persists and the next request will run into it.
> The system is already dead.
> The correct handling is to kill processes to free up significant amounts of memory instead of handling the failures that will keep happening.

I do not agree: as shown above, you can get a `MemoryError` in CPython because the requested allocation is too big, even with plenty of memory left. That is a recoverable situation. The system isn't necessarily near OOM nor dead, so killing processes isn't always the right call.

So, all in all, I think there are plenty of situations where defending against and handling `MemoryError` in CPython makes sense. Of course, there are situations like you describe where aborting would be the right choice. Given all the above, what do you think?


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds