Reservations for must-succeed memory allocations

Posted Mar 17, 2015 22:11 UTC (Tue) by fandingo (guest, #67019)
In reply to: Reservations for must-succeed memory allocations by amonnet
Parent article: Reservations for must-succeed memory allocations

This seems like the preferable solution. Make a hard reservation of emergency memory that ordinary allocations cannot touch. Give it a kernel parameter so the administrator can set the number of reserved pages. When the OOM killer is invoked, it can dip into this reservation if needed, and as it decides which processes to kill, it can tell the allocator that those specific processes (and system calls made on their behalf) have access to this memory.
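A minimal sketch of that idea as a user-space model, not kernel code. The names `page_pool`, `task`, `oom_victim` and `pool_alloc` are all made up for illustration; the only thing taken from the comment is the rule that ordinary allocations stop at the reserve while tasks picked by the OOM killer may go below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: free pages with an administrator-set hard reserve. */
struct page_pool {
    size_t free_pages;      /* pages currently free */
    size_t reserved_pages;  /* floor set by the (assumed) kernel parameter */
};

/* Hypothetical per-task state; oom_victim would be set by the OOM killer. */
struct task {
    int pid;
    bool oom_victim;
};

/*
 * Try to take nr pages from the pool on behalf of task t.
 * Ordinary tasks may not push free_pages below reserved_pages;
 * OOM victims may use every remaining page.
 */
bool pool_alloc(struct page_pool *pool, const struct task *t, size_t nr)
{
    assert(pool != NULL && t != NULL);

    size_t floor = t->oom_victim ? 0 : pool->reserved_pages;

    /* Checked in two steps so the subtraction cannot underflow. */
    if (pool->free_pages < floor || pool->free_pages - floor < nr)
        return false;

    pool->free_pages -= nr;
    return true;
}
```

With 100 free pages and a reserve of 64, an ordinary task can take at most 36 pages. Once it has, only a task flagged as an OOM victim can allocate further.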

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 14:23 UTC (Wed) by mm7323 (subscriber, #87386)

I was thinking something similar - that there should be a reservation area for these occasions. That memory doesn't have to be totally ring-fenced and idle though - just trivially reclaimable without needing more allocations, locks or IO. Read-only pages of mmap()'d files (e.g. text and rodata segments of running programs), which could be re-read from disk later when needed, and any pages which have already been flushed to disk and are clean (e.g. write-through caches) could be accounted as in the reservation area already.

In general there's probably enough stuff floating around in the system that there would always be a sizeable reservation area, but the rare occasion when there isn't could still be problematic, so an API would be needed to check that the reservation area is at least a certain size before XFS or anything else goes off on a path of no return. The reservation request function could have a blocking variant (which tries to increase the reservation area to meet demand if needed, or waits for other reservation users to complete), or return a failure which could be propagated back to userspace well before any critical actions have taken place in the caller. For example, open() might return ENOMEM if the reservation area isn't large enough to guarantee that the system call can make progress in the worst case. After a critical operation completes, the reservation should be released.

Some other API changes may be needed so that a caller can request pages that use the reservation area, and book-keeping to ensure callers don't request more from the reservation than they have previously 'reserved' would be prudent.
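To make the shape of such an API concrete, here is a rough user-space sketch. All of the names (`reserve_area`, `reservation`, `reserve_pages`, `reserved_alloc`, `release_reservation`) are invented for illustration and nothing here exists in the kernel. It shows only the non-blocking variant; the blocking one would wait or try to grow the area instead of returning -ENOMEM. It also assumes that pages actually drawn stay in use until they are freed through the normal path, so only the unused part of a grant goes back on release.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical: pages in the reservation area not yet promised to anyone. */
struct reserve_area {
    size_t available;
};

/* Hypothetical per-caller book-keeping. */
struct reservation {
    size_t granted;  /* pages promised to this caller */
    size_t used;     /* pages the caller has drawn so far */
};

/* Non-blocking request: fail early, before any critical work starts. */
int reserve_pages(struct reserve_area *area, struct reservation *r, size_t nr)
{
    if (area->available < nr)
        return -ENOMEM;
    area->available -= nr;
    r->granted = nr;
    r->used = 0;
    return 0;
}

/* Draw pages against an existing reservation, never beyond the grant. */
int reserved_alloc(struct reservation *r, size_t nr)
{
    assert(r->used <= r->granted);
    if (r->granted - r->used < nr)
        return -ENOMEM;
    r->used += nr;
    return 0;
}

/* Critical operation finished: hand back whatever was not drawn. */
void release_reservation(struct reserve_area *area, struct reservation *r)
{
    assert(r->used <= r->granted);
    area->available += r->granted - r->used;
    r->granted = 0;
    r->used = 0;
}
```

A caller such as a filesystem would call `reserve_pages()` with its worst-case need at the top of the operation, pass the failure back to userspace if it gets one, and call `release_reservation()` on the way out.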

Finally, I was thinking that a simple swap device could also help. Simple swap would mean that pages can be read and written trivially without calling memory allocators or introducing lock dependencies. If a block device could indicate that it was 'simple swap' compatible according to these requirements, then any of its free space could be accounted to the 'reservation area' by enabling dirty pages to be swapped out without ending up in the circular allocation and lock dependency battles which seem to be the cause of all these problems. Directly accessed partitions on a locally attached hard drive, or zram, could probably be made to flag as 'simple swap' compatible.
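The accounting could look something like the following sketch, again a user-space model with invented names (`swap_dev`, `simple_swap`, `reservation_area_pages`). It adds the free slots of every device flagged as 'simple swap' compatible to the count of clean, trivially reclaimable pages. It deliberately ignores the question of whether there are enough dirty pages to fill those slots, which a real implementation would have to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a swap device. */
struct swap_dev {
    bool simple_swap;   /* assumed flag: IO needs no allocations or extra locks */
    size_t free_slots;  /* free page-sized slots on the device */
};

/*
 * Size of the reservation area in pages: clean reclaimable pages plus
 * the free space on every 'simple swap' compatible device.
 */
size_t reservation_area_pages(size_t clean_pages,
                              const struct swap_dev *devs, size_t ndevs)
{
    assert(devs != NULL || ndevs == 0);

    size_t total = clean_pages;
    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].simple_swap)
            total += devs[i].free_slots;
    }
    return total;
}
```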