Expedited memory reclaim from killed processes

Posted Apr 15, 2019 15:01 UTC (Mon) by farnz (subscriber, #17727)
In reply to: Expedited memory reclaim from killed processes by epa
Parent article: Expedited memory reclaim from killed processes

I believe opting in to overcommit is already possible, with the MAP_NORESERVE flag - which essentially says that the mapped range can be overcommitted, and defines behaviour if you write to it when there is insufficient commit available.

There's a bit of a chicken-and-egg problem here, though - heuristic overcommit exists because it's easier for system administrators to tell the OS to lie to applications that demand too much memory than it is for those self-same administrators to have the applications retooled to handle overcommit sensibly.

And even if you are retooling applications, it's often easier to simply turn on features like Kernel Same-page Merging to cope with duplication (e.g. in the Suricata ruleset in-memory form) than it is to handle all the fun that comes from opt-in overcommit.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:26 UTC (Thu) by thestinger (guest, #91827)

MAP_NORESERVE is a no-op with overcommit disabled or full overcommit enabled. It only has an impact on heuristic overcommit, by bypassing the immediate failure heuristic.
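As a rough illustration of the point above, the sketch below reads the current accounting mode from /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always overcommit, 2 = strict accounting) and creates an anonymous mapping with MAP_NORESERVE. The helper names overcommit_mode and map_noreserve are made up for this example. Per the comment above, the flag should only matter when the mode is 0.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Return the value of vm.overcommit_memory
 * (0 = heuristic, 1 = always, 2 = strict), or -1 if it cannot be read. */
int overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}

/* Map len bytes of private anonymous memory with MAP_NORESERVE.
 * Returns NULL on failure. */
void *map_noreserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

A caller could check overcommit_mode() first and only expect MAP_NORESERVE to change the outcome of a very large request when it returns 0.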

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 7:26 UTC (Thu) by farnz (subscriber, #17727)

Ah - on other systems (Solaris, at least, and IRIX had the same functionality under a different name), which do not normally permit any overcommit, it allows you to specifically flag a memory range as "can overcommit". If application-controlled overcommit ever becomes a requirement on Linux, supporting the Solaris (and documented) semantics would be a necessary part.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:32 UTC (Thu) by thestinger (guest, #91827)

See https://www.kernel.org/doc/Documentation/vm/overcommit-ac... and the sources.

The linux-man-pages documentation is often inaccurate, as it is in this case. MAP_NORESERVE does not do what it describes at all:

> When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:40 UTC (Thu) by thestinger (guest, #91827)

To clarify, I'm quoting the inaccurate description of MAP_NORESERVE. The actual functionality is omitting the memory from heuristic overcommit, which has no impact in the non-overcommit memory accounting mode.

Mappings that aren't committed and cannot be committed without changing protections don't have an accounting cost (see the official documentation that I linked) so the way to reserve lots of address space is by mapping it as PROT_NONE.
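A minimal sketch of that reserve-then-commit pattern, assuming Linux and page-aligned ranges; reserve_address_space and commit_range are illustrative names, not an existing API:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Reserve len bytes of address space. The mapping is PROT_NONE, so it is
 * not charged against the commit limit. Returns NULL on failure. */
void *reserve_address_space(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Make a page-aligned sub-range of the reservation usable. This is the
 * step at which the memory becomes accounted, so under strict accounting
 * it can fail (returns -1 with errno set) instead of the initial mmap. */
int commit_range(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}
```

The design point is that failure is moved from first touch to an explicit call the application can check.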

To stop memory that has been used from being accounted while keeping the address space, you clobber it with a new PROT_NONE mapping using mmap with MAP_FIXED. It may seem that you achieve the same thing with madvise MADV_DONTNEED + mprotect to PROT_NONE, but that doesn't work, since the kernel doesn't actually go through it all to check whether it can reduce the accounted memory (for good reason).
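The clobbering step might look like the following sketch. decommit_range is a hypothetical helper name, and the range is assumed to be page-aligned and to lie entirely within a mapping the caller owns, since MAP_FIXED silently replaces whatever is mapped there.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Replace [addr, addr + len) with a fresh PROT_NONE mapping. The old
 * pages are discarded and the address range stays reserved.
 * Returns 0 on success, -1 on failure. */
int decommit_range(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}
```

If the range is later made accessible again with mprotect, it reads back as zero-filled pages, because the old contents went away with the old mapping.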

Man pages

Posted Apr 18, 2019 12:54 UTC (Thu) by corbet (editor, #1)

The man pages are actively maintained. I am sure that Michael would appreciate a patch fixing the error.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 15:24 UTC (Thu) by rweikusat2 (subscriber, #117920)

JFTR: On Linux, applications can actually handle SIGSEGV,

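#include <signal.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>

/* SIGSEGV handler: grow the heap so the faulting store can succeed. */
static void do_brk(int unused)
{
    (void)unused;
    sbrk(128);
}

int main(int argc, char **argv)
{
    unsigned *p;

    /* Without an argument, atoi(NULL) would fault and the handler
       would be re-entered forever. */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <number>\n", argv[0]);
        return 1;
    }

    signal(SIGSEGV, do_brk);

    /* Current program break: the first address past the heap. */
    p = sbrk(0);
    *p = atoi(argv[1]);   /* faults; restarted after do_brk returns */
    printf("%u\n", *p);

    return 0;
}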

If the signal handler is disabled, this program segfaults. Otherwise, the handler extends the heap and the faulting instruction then succeeds when it is restarted. SIGSEGV is a synchronous signal, so this would be entirely sufficient to implement some sort of OOM-handling strategy in an application, e.g. free some memory and retry, or wait some time and retry.
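For a handler that needs to know which address faulted, a variant using sigaction with SA_SIGINFO could look like the sketch below. It is an illustration, not part of the program above: on_segv and install_segv_handler are made-up names, and calling mprotect from a signal handler is an assumption that holds in practice on Linux (it is a plain system call) but is not on the POSIX list of async-signal-safe functions.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* On a fault, make the page containing the faulting address readable and
 * writable. If that fails (e.g. the address is not mapped at all), restore
 * the default action so the restarted access terminates the process. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t addr = (uintptr_t)info->si_addr & ~((uintptr_t)page - 1);

    (void)ctx;
    if (mprotect((void *)addr, (size_t)page, PROT_READ | PROT_WRITE) != 0)
        signal(sig, SIG_DFL);
}

/* Install on_segv for SIGSEGV. Returns 0 on success, -1 on failure. */
int install_segv_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Combined with a PROT_NONE reservation, this gives commit-on-first-touch with a point where the application could instead free memory, wait, or give up.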

Expedited memory reclaim from killed processes

Posted Apr 19, 2019 15:03 UTC (Fri) by lkundrak (subscriber, #43452)

This is, in fact, how the original Bourne shell infamously managed memory: https://www.in-ulm.de/~mascheck/bourne/segv.html

Expedited memory reclaim from killed processes

Posted Apr 25, 2019 14:30 UTC (Thu) by nix (subscriber, #2304)

> JFTR: On Linux, applications can actually handle SIGSEGV,

I'd be surprised if there were any Unixes on which this was not true, given that SIGSEGV in particular was one of the original motivations for the existence of signal handling in the first place.