
Great!

Posted May 18, 2015 15:06 UTC (Mon) by epa (subscriber, #39769)
In reply to: Great! by jameslivingston
Parent article: Rust 1.0 released

Maybe abort-on-OOM is after all the best choice for 'general purpose libraries', but the core standard library for the language needs to be held to a higher standard. Then at least programmers have the choice of whether to use the abort-on-OOM style or do something more elaborate (with all the extra effort and trade-offs that requires). If even the core container classes will kaboom your program on OOM, it's not really possible to build anything on top of that.
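
For illustration, a minimal sketch of what the checked style can look like; Rust did later stabilize a fallible path (Vec::try_reserve), so the caller decides what allocation failure means instead of the container aborting:

    use std::collections::TryReserveError;

    // Fallible insertion: the caller decides how to handle OOM,
    // instead of the container aborting the whole process.
    fn try_push(buf: &mut Vec<u8>, byte: u8) -> Result<(), TryReserveError> {
        buf.try_reserve(1)?; // report allocation failure as a Result
        buf.push(byte);      // cannot reallocate: capacity was just reserved
        Ok(())
    }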

To take your suggestion - splitting the application into pieces is a fine solution. But the small supervisor core which co-ordinates the pieces and is in charge of restarting them on failure must itself be robust, and cannot just die and be restarted - otherwise it's turtles all the way down... So there needs to be the possibility of writing code in a provably safe way, even if the amount of code that ends up being like that is quite small.
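
For illustration, a minimal sketch of such a supervisor core, assuming a hypothetical ./worker binary; note that even this small loop allocates, so a provably robust version would need stronger guarantees than the standard library gives:

    use std::process::Command;

    // Restart the worker whenever it dies (e.g. from an abort-on-OOM);
    // only this small loop has to be trusted to stay alive.
    fn supervise() -> std::io::Result<()> {
        loop {
            let status = Command::new("./worker").status()?;
            if status.success() {
                return Ok(()); // the worker asked for a clean shutdown
            }
            eprintln!("worker died ({status}); restarting");
        }
    }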



Great!

Posted May 18, 2015 15:42 UTC (Mon) by Limdi (guest, #100500)

> it's not really possible to build anything on top of that.

What is the current strategy to avoid OOM, apart from praying to god that the guy in the front seat managed to calculate the allowed memory use for every application correctly, that the applications obey, and that he did not overcommit on purpose?

Great!

Posted May 18, 2015 15:45 UTC (Mon) by epa (subscriber, #39769)

I admit that, running on Linux, there isn't much you can do to avoid being squashed by the OOM killer. There are, however, operating systems which take a more cautious approach.

Avoiding the OOM Killer by Quotas?

Posted May 20, 2015 17:42 UTC (Wed) by gmatht (guest, #58961)

It seems to me that (1) allocating memory and (2) reserving memory are two different things: (1) adds memory to an address space, while (2) requests a guarantee that some amount of memory is reserved for your use. As I understand it, Linux allows you to do (1), but doesn't allow you to do (2) without disabling overcommit (which comes with its own problems).
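
A sketch of the distinction on Linux, assuming the libc crate: MAP_NORESERVE explicitly asks for sense (1), address space only, while a plain mapping is accounted against the commit limit - but that accounting is only enforced when overcommit is disabled (vm.overcommit_memory=2):

    use std::ptr;

    // Anonymous read-write mapping with the given extra flags.
    fn map(bytes: usize, extra: libc::c_int) -> *mut libc::c_void {
        unsafe {
            libc::mmap(
                ptr::null_mut(),
                bytes,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | extra,
                -1,
                0,
            )
        }
    }

    fn main() {
        let _space_only = map(1 << 30, libc::MAP_NORESERVE); // sense (1): address space
        let _accounted  = map(1 << 30, 0); // sense (2), but only with overcommit disabled
    }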

However, (2) could be a syscall of its own. Roughly, the idea is that, to avoid the OOM killer, a process is responsible for reserving memory prior to allocating it. This allows a process to choose, for example, to:
1) not bother reserving memory, since it probably won't be OOM-killed anyway,
2) permanently reserve 640K (which should be enough, right?),
3) release all reservations when in a safe idle state, allowing an OOM kill,
4) make sure it has 10MB extra in reserve before accepting a new connection (sketched below), or
5) wrap every call to malloc, fork, etc. to make sure it has enough in reserve for all children.
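
For concreteness, a sketch of option (4) in Rust, with reserve() and unreserve() as stand-ins for the proposed, entirely hypothetical syscalls:

    // Hypothetical syscall wrappers: nothing like these exists on
    // Linux today; they only make the proposal above concrete.
    fn reserve(bytes: usize) -> Result<(), ()> {
        let _ = bytes; // a real version would invoke the new syscall
        Ok(())
    }
    fn unreserve(bytes: usize) {
        let _ = bytes;
    }

    const HEADROOM: usize = 10 * 1024 * 1024; // 10MB per connection

    fn accept_connection() -> bool {
        // Refuse the connection up front if the reservation fails,
        // instead of risking an OOM kill halfway through servicing it.
        if reserve(HEADROOM).is_err() {
            return false;
        }
        // ... service the connection against the reservation ...
        unreserve(HEADROOM);
        true
    }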

Doing (5) may mean the process is denied reservations long before the machine is low on memory, but that's already what happens with overcommit disabled, and it may be exactly what the application writer wants. (1) can be quite nice too; sometimes it is much easier to recover from an abort than to recover from low memory. Allowing some processes to choose (1) and others (5) already seems like an advantage. For lots of user-space code, (4) seems cleaner to me than checking the return of every malloc(sizeof(int)). When a bug is found, fixing the aftermath of an abort may be easier than fixing the corruption from a mishandled null pointer. How often do we really need (5) in userspace?

Avoiding the OOM Killer by Quotas?

Posted May 23, 2015 8:53 UTC (Sat) by epa (subscriber, #39769)

Reserving memory in advance is a useful operation - so you can fail earlier if memory is short. Even if your app does have full checking of all allocations, with rollback when necessary, it may just be easier for all concerned to refuse the incoming connection if there probably won't be the resources to service it. I think I would still prefer that a malloc() of ten megabytes actually reserve that space - not just pretend to succeed and then dynamite my process at some undetermined later point. But for handling legacy or 'lazy' code which prefers to rely on overcommit, a mixed model such as the one you suggest could work.
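
For illustration, a sketch of the "pretend to succeed" behaviour under default Linux overcommit: the ten-megabyte allocation returns immediately, and the kernel only has to find real memory once the pages are written:

    fn main() {
        // Succeeds as address space under overcommit, even when ten
        // real megabytes may not be available.
        let mut v: Vec<u8> = Vec::with_capacity(10 * 1024 * 1024);
        // Only this write forces the kernel to commit pages; this is
        // where the OOM killer can strike, long after the "malloc".
        v.resize(v.capacity(), 0);
        println!("committed {} bytes", v.len());
    }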

Great!

Posted May 18, 2015 21:17 UTC (Mon) by roc (subscriber, #30627)

Yes, the runtime (if there is one; Rust doesn't necessarily need one) that enforces isolation should be able to handle OOM and ensure that it can kill an OOM-ing isolate without bringing down the others. But that is orthogonal to whether the standard library APIs abort on OOM.

As others have pointed out, on Linux and other mainstream OSes handling OOM in userspace is pointless because your process is likely to be killed by the system before it sees any out-of-memory errors. Therefore having the standard library return OOM errors is a bad tradeoff, because you're adding complexity for all Rust users that most of them will never be able to benefit from.

For the niche users who can benefit from explicit OOM errors, it would make sense to offer an alternative standard library which does return those errors. Fortunately, Rust is one of the few languages which lets you rip out the entire standard library and replace it with something else (or nothing).
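
For illustration, that opt-out is a single attribute; a minimal sketch of a library crate that drops std (and with it the abort-on-OOM containers), keeping only libcore:

    // A no_std library crate: no heap, no collections, no OOM aborts.
    #![no_std]

    pub fn checksum(data: &[u8]) -> u8 {
        data.iter().fold(0, |acc, b| acc.wrapping_add(*b))
    }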

Great!

Posted May 19, 2015 10:39 UTC (Tue) by epa (subscriber, #39769)

This is something of a vicious circle: userspace doesn't check allocation success, and so tends to allocate more than it needs (since allocating more memory than you will use doesn't cause any test failures, it will naturally tend to happen, just as other classes of error inevitably creep in if the test suite and everyday usage do not cover them). So the kernel has to allow overcommit - which means that userspace doesn't bother to check allocation... (There is also the vexed issue of fork() requiring overcommit, which has been discussed previously.)

I do agree that, in practice, doing unchecked allocations may be the best tradeoff for a lot of code, although I suggest that it needs better tools and runtime support for setting limits: in my process, allocations made from *this* particular shared library should not exceed 100 megabytes total, while *that* function may only allocate at most 2 megs each time it is called. If libpng goes mad and develops a memory leak, I would much rather have the application die quickly (and with an informative message) than have it get slower and slower, thrashing the disk more and more until finally the OOM killer puts it out of its misery. Of course, breaking the program into several independent processes is one way to do this, but with a bit more userspace accounting of memory usage the same goal could perhaps be achieved without needing separate processes.
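
For illustration, a coarse version of the process-wide part is expressible in today's Rust with a wrapping global allocator; a sketch (per-library budgets would need considerably more plumbing than this):

    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    const CAP: usize = 100 * 1024 * 1024; // 100MB process-wide budget
    static USED: AtomicUsize = AtomicUsize::new(0);

    // Wraps the system allocator and aborts with an informative
    // message once the budget is exceeded, instead of thrashing
    // until the OOM killer intervenes.
    struct CappedAlloc;

    unsafe impl GlobalAlloc for CappedAlloc {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            if USED.fetch_add(layout.size(), Ordering::Relaxed) + layout.size() > CAP {
                // Note: this message path must not itself allocate.
                eprintln!("allocation budget of {CAP} bytes exceeded");
                std::process::abort();
            }
            System.alloc(layout)
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            USED.fetch_sub(layout.size(), Ordering::Relaxed);
            System.dealloc(ptr, layout);
        }
    }

    #[global_allocator]
    static ALLOCATOR: CappedAlloc = CappedAlloc;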

However, safe allocation is not just for 'niche users', or if so, kernel programming is quite a large niche. And there may well be a case for writing small parts of your program in the checked-allocation style while leaving other parts to assume allocation never fails. So then if my app does go kaboom, at least I can be certain it wasn't my string class that did it.

