Expedited memory reclaim from killed processes

By Jonathan Corbet
April 12, 2019

Running out of memory puts a Linux system into a difficult situation; in the worst cases, there is often no way out other than killing one or more processes to reclaim their memory. This killing may be done by the kernel itself or, on systems like Android, by a user-space out-of-memory (OOM) killer process. Killing a process is almost certain to make somebody unhappy; the kernel should at least try to use that process's memory expeditiously so that, with luck, no other processes must die. That does not always happen, though, in current kernels. This patch set from Suren Baghdasaryan aims to improve the situation, but the solution that results in the end may take a different form.

The kernel requires processes to clean up their own messes when they exit. So, before a process targeted by an OOM killer can rest in peace, it must first run (in the kernel) to free its memory resources and return them to the system. If the process is running with a low priority, or if it gets hung up waiting for a resource elsewhere in the system, this cleanup work may take a long time. Meanwhile, the system is still struggling, since the hoped-for influx of free memory has not yet happened; at some point, another process will also be killed in the hope of getting better results.

This problem led to the development of the OOM reaper in 2015. When the kernel's OOM killer targets a process, the OOM reaper will attempt to quickly strip that process of memory that the process will never attempt to access again. But the OOM reaper is not available to user-space OOM killers, which can only send a SIGKILL signal and hope that the target exits quickly. The user-space killer never really knows how long it will take for a killed process to exit; as Baghdasaryan pointed out, that means that it must maintain a larger reserve of free pages and kill processes sooner than might be optimal.

Baghdasaryan's proposal is to add a new flag (SS_EXPEDITE) to the (new in 5.1) pidfd_send_signal() system call. If that flag is present, the caller has the CAP_SYS_NICE capability, and the signal is SIGKILL, then the OOM reaper will be unleashed on the killed process. That should result in quicker and more predictable freeing of the target's memory, regardless of anything that might slow down the target itself.

The comments on the proposal were heavily negative, which is interesting because most of the people involved were supportive of the objective itself. The strongest critic, perhaps, was Michal Hocko (the author of the OOM reaper), who complained that it "is abusing an implementation detail of the OOM implementation and makes it an official API". He questioned whether this capability was useful at all, saying that relying on cleanup speed is "a fundamental design problem". Johannes Weiner, though, argued that the idea was, instead, "just optimizing the user experience as best as it can". Others generally agreed that freeing memory quickly when a process is on its way out is a good thing.

Daniel Colascione liked the idea but wanted a different interface. Rather than adding semantics to a specific signal, he suggested a new system call along the lines of:

    size_t try_reap_dying_process(int pidfd, int flags, size_t max_bytes);

This call would attempt to strip memory from the process indicated by pidfd, but only if the process is currently dying. The max_bytes parameter would tell the kernel how much memory the caller would liked to see freed; that would allow the kernel to avoid doing the full work of stripping the target process if max_bytes has been reached. Notably, the reaping would be done in the context of the calling process, allowing user space to determine how important this work is relative to other tasks.

Matthew Wilcox had a simpler question: why not just expedite the reclaim of memory every time a process exits, rather than on specific request? Weiner agreed with that idea, noting that the work has to be done anyway, and doing it always would avoid the need to add a new interface for that purpose. Daniel Colascione pointed out, though, that such a mechanism would slow down the delivery of SIGKILL in general; that slowdown might be felt when running something like killall.

Baghdasaryan didn't think that expedited reaping makes sense all the time. But, he said, there are times when it is critical. "It would be great if we can identify urgent cases without userspace hints, so I'm open to suggestions that do not involve additional flags". If such a solution can be found, it is likely to end up as the preferred alternative here. Cleaning up after a killed process is, in the end, the kernel's responsibility, and there is little desire to create a new interface to control how that responsibility is carried out. Solutions that "just work" are thus to be preferred over the addition of yet another Linux-specific API for this purpose.

Index entries for this article
Kernel	Memory management/Out-of-memory handling
Kernel	OOM killer

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 0:22 UTC (Sat) by wahern (subscriber, #37304) [Link] (20 responses)

The fundamental design problem is overcommit. Everything else is just the interminable nightmare. Blame overcommit on fork, broken 1980s database software, or w'ever. That's almost worse as its just an admission that we traded a few years of pain for a lifetime of debilitation.

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 2:05 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (1 responses)

Believe me, I'm no fan of overcommit --- but in this specific instance, it's amazingly not overcommit's fault. The same delay in memory reclaim can occur in a strict commit system! It's about the timing of page reclaim, not about (which is the point of overcommit) how many pages get allocated in the first place and from what sources.

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 3:55 UTC (Sat) by wahern (subscriber, #37304) [Link]

I don't doubt the criticisms of the patch. But in my most recent foray digging into the OOM logic, my general sense is that some of these timing and priority issues are largely a consequence of the complexity added to forestall the OOM killer and minimize the harm. For example, right now I'm helping to deal with the OOM killer being invoked in situations where there's plenty of uncommitted memory, but because of heavy concurrent dirtying the reclaim code attempting/waiting on page flush gives up too soon. Without overcommit there would be no reason to give up--"gives up to too soon" is a characterization only meaningful in a best effort, no guarantee environment.

Overcommit seems to have resulted in kernel code that emphasizes mitigations rather than on providing consistent, predictable, and tunable behavior. There is no correct or even best heuristic for mitigating OOM. It's a constantly moving target. Any interface to those heuristic mechanisms and policies that look best today will look horrible tomorrow.

One may want to opt into overcommit and wrestle with such lose-lose scenarios, but designing *that* interface (opt-in) is much easier because you're already in a position of being able to contain the system-wide fallout--architectural, performance, code maintenance, etc. Likewise, one may want the ability to cancel ("give up") some operation, but an interface describing when to cancel (a finite set of criteria) is much easier to devise and use than one describing when not to cancel (a possibly infinite set of criteria).

So, yeah, I don't doubt that one sh*t sandwich can be objectively preferable to another sh*t sandwich, but the spectacle of sh*t sandwich review is tragic.

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 12:18 UTC (Sat) by knan (subscriber, #3940) [Link]

One man's design problem, another man's feature. Cleaning up after a 1.5TB rss process takes time, regardless of overcommit.

Expedited memory reclaim from killed processes

Posted Apr 14, 2019 20:15 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link] (16 responses)

Something I implemented in the past: Imagine a system with numerous virtualized (container-based) VPN servers which uses suricata for network traffic inspection and alert generation in case of suspicious activity. Suricata processes are typically huge (>3G isn't uncommon). Each virtual VPN server needs its own suricata process but they all use the same ruleset. As the memory used to store the processed rule information isn't ever written to after the corresponding data structures were created, it's possible to use a per-host suricata server process which loads the rules once. A VPN server suricata process is then created by contacting the host suricata server which forks a child and moves it into the container. All VPN server suricata processes thus share a single
copy of the memory used for rule data structures.

Without overcommit, that is, lazy memory/ swap allocation as the need arises, this wouldn't work (in this way).

"Refuste to try because the system might otherwise run out of memory in future" is not a feature. The system might as well not. The kernel doesn't know this. Or it might run out of memory for a different reason. The original UNIX fork worked by copying the forking core image to the swap space to be swapped in for execution at some later time. This was a resource management strategy for seriously constrained hardware, not a heavenly relevation of the one true way of implementing fork.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 13:56 UTC (Mon) by epa (subscriber, #39769) [Link] (9 responses)

That’s certainly a valid use case. Ideally, though, the overcommit would be opt-in, and the side effects would be paid only by the processes that had used it — so if a Suricata process went rogue and started writing to the shared data, causing lots of copy-on-write pages to be allocated, it could be whacked by an OOM killer.

Perhaps programming languages could have better support for marking a data structure read-only, which would then notify the kernel to mark the corresponding pages read-only. Then you could allocate the necessary structure and mark it read-only before forking.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 15:01 UTC (Mon) by farnz (subscriber, #17727) [Link] (8 responses)

I believe opting-in to overcommit is already possible, with the MAP_NORESERVE flag - which essentially says that the mapped range can be overcommitted, and defines behaviour if you write to it when there is insufficient commit available.

There's a bit of a chicken-and-egg problem here, though - heuristic overcommit exists because it's easier for system administrators to tell the OS to lie to applications that demand too much memory than it is for those self-same administrators to have the applications retooled to handle overcommit sensibly.

And even if you are retooling applications, it's often easier to simply turn on features like Kernel Same-page Merging to cope with duplication (e.g. in the Suricata ruleset in-memory form) than it is to handle al the fun that comes from opt-in overcommit.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:26 UTC (Thu) by thestinger (guest, #91827) [Link] (1 responses)

MAP_NORESERVE is a no-op with overcommit disabled or full overcommit enabled. It only has an impact on heuristic overcommit, by bypassing the immediate failure heuristic.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 7:26 UTC (Thu) by farnz (subscriber, #17727) [Link]

Ah - on other systems (Solaris, at least, and IRIX had the same functionality under a different name), which do not normally permit any overcommit, it allows you to specifically flag a memory range as "can overcommit". If application-controlled overcommit ever becomes a requirement on Linux, supporting the Solaris (and documented) semantics would be a necessary part.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:32 UTC (Thu) by thestinger (guest, #91827) [Link] (5 responses)

See https://www.kernel.org/doc/Documentation/vm/overcommit-ac... and the sources.

The linux-man-pages documentation is often inaccurate, as it is in this case. MAP_NORESERVE does not do what it describes at all:

> When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 6:40 UTC (Thu) by thestinger (guest, #91827) [Link]

To clarify, I'm quoting the inaccurate description of MAP_NORESERVE. The actual functionality is omitting the memory from heuristic overcommit, which has no impact in the non-overcommit memory accounting mode.

Mappings that aren't committed and cannot be committed without changing protections don't have an accounting cost (see the official documentation that I linked) so the way to reserve lots of address space is by mapping it as PROT_NONE.

To make memory that has used not be accounted again while keeping the address space, you clobber it with new PROT_NONE memory using mmap with MAP_FIXED. It may seem that you achieve the same thing with madvise MADV_DONTNEED + mprotect to PROT_NONE but that doesn't work since it doesn't actually go through it all to check if it can reduce the accounted memory (for good reason).

Man pages

Posted Apr 18, 2019 12:54 UTC (Thu) by corbet (editor, #1) [Link]

The man pages are actively maintained. I am sure that Michael would appreciate a patch fixing the error.

Expedited memory reclaim from killed processes

Posted Apr 18, 2019 15:24 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (2 responses)

JFTR: On Linux, applications can actually handle SISEGV,

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void do_brk(int unused)
{
    sbrk(128);
}

int main(int argc, char **argv)
{
    unsigned *p;

    signal(SIGSEGV, do_brk);

    p = sbrk(0);
    *p = atoi(argv[1]);
    printf("%u\n", *p);
    
    return 0;
}

If the signal handler is disabled, this program segfaults. Otherwise, the handler extends the heap and the faulting instruction then succeeds when being restarted. SIGSEGV is a synchronous signal, hence, this would be entirely sufficient to implement some sort of OOM-handling strategy in an application, eg, free some memory and retry or wait some time and retry.

Expedited memory reclaim from killed processes

Posted Apr 19, 2019 15:03 UTC (Fri) by lkundrak (subscriber, #43452) [Link]

This is, in fact, what the original Bourne shell infamously managed memory: https://www.in-ulm.de/~mascheck/bourne/segv.html

Expedited memory reclaim from killed processes

Posted Apr 25, 2019 14:30 UTC (Thu) by nix (subscriber, #2304) [Link]

JFTR: On Linux, applications can actually handle SISEGV,

I'd be surprised if there were any Unixes on which this was not true, given that SIGSEGV in particular was one of the original motivations for the existence of signal handling in the first place.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 15:54 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

That's a bad design. You shouldn't design software that depends on implicit assumptions.

A better designed software would store rules in a file and map it explicitly into the target processes. This way there's no problem with overcommit - the kernel would know that the data is meant to be immutable.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 17:03 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

> That's a bad design. You shouldn't design software that depends on implicit assumptions.

This statement means nothing (as it stands).

> A better designed software would store rules in a file and map it explicitly into the target processes. This way there's no problem
> with overcommit - the kernel would know that the data is meant to be immutable.

A much more invasive change to suricata (this is an open source project I'm not anyhow associated with) could have gotten rid of all the pointers in its internal data structures. Assuming this had been done and the code had also been changed to use a custom memory allocator instead of the libc one, one could have used a shared memory segment/ memory mapped file to implement the same kind of sharing. I'm perfectly aware of this. But this complication isn't really necessary with Linux as sharing-via-fork works just as well and is a lot easier to implement.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 17:10 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

For completeness¸ before someone pulls that out of his hat and triumphantly waves it through the air: Changing to a custom allocator had been sufficient as the parent could have put all of its data into a shared memory segment/ memory mapped file and children could have inherited the mapping via fork.

But that's still more complicated than just relying on the default behaviour based on knowing how the application will use the inherited memory.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 17:30 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

You could also, assuming it's backed by an mmaped file, just use MAP_FIXED to ensure that all the pointers match in every Suricata process; this works out best on 64-bit systems, as you need a big block of VA space available that ASLR et al won't claim.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 19:14 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

Considering that the amount of memory which will be needed isn't known in advance, one would need to use MAP_FIXED in a "known good" location, anyway. There are some more complication with this approach as well. And the Linux default policy of "real COW", ie do not copy anyting unless it has to be done and don't (try to) reserve anything unless it's demonstrably required handles this cause just fine.

Expedited memory reclaim from killed processes

Posted Apr 26, 2019 18:26 UTC (Fri) by roblucid (guest, #48964) [Link]

Why can't the Suricata rules be read from a RO map file, with a seperate means for updating & changing rulesets and then the equivalent of sighup to get the ruleset re-read and container processes re-reading stuff, perhaps a counter incremented in some ruleset RW meta data, set so a newer version available can be read in by containers as they restart?

The disk backed data files can be shared amongst 1000's of VMs, right?

Then the VM system can be sure it's safe to fork without committing much memory and the apparent need for over-commit vanishes. I admit I haven't tried it and as I used VMs for isolation and jails for data sharing, not the kind of efficiency hack but conceptually I don't see why software developed in a stricter world couldn't handle the case reasonably.

Sparse arrays, are perhaps a better case for over-commit but again I wonder about memory map file and/or smarter data structures wouldn't be feasible for the programs which actually deliberately require these features, rather than by accident due to a permissive environment.

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 18:51 UTC (Sat) by mjthayer (guest, #39183) [Link] (3 responses)

> "It would be great if we can identify urgent cases without userspace hints, so I'm open to suggestions that do not involve additional flags"

What about a system low memory watershed?

Expedited memory reclaim from killed processes

Posted Apr 14, 2019 7:21 UTC (Sun) by k3ninho (subscriber, #50375) [Link] (2 responses)

>What about a system low memory watershed?
The Linux kernel exists in a perpetual low-available-memory state. Inherently it's a caching strategy responsible for a lot of the speed in Linux. I don't know the statistics but several multiples of your machine's physical memory are allocated in the virtual memory addressing space by a design choice called memory over-commit.

But which bits are meaningful over-commit that you'd count towards a system low memory state and which are merely 'maybe useful'? This question is a cache invalidation problem for those 'maybe useful' bits -- a problem already solved elsewhere in the kernel, which provides a pattern we can follow. That pattern is the least-recently-used list so, while you can't predict what user or network interaction comes next, you keep track of access times and then have the bottom end of the list for old and unused items. Pick a level of granularity for parts of a process's memory space and track the least-recently-used bits, hoping to find a ratio of maybe-useful vs definitely-useful memory commit that you'd use to set the line at which the oom-killer gets invoked.

This isn't the whole picture -- the fork()+exec() paradigm can leave child processes sharing over-committed memory with their parents, which is only moved into dedicated address space for the child when it's changed -- a pattern called copy-on-write. We'd need to do more work to be certain that this memory is definitely-useful, for example it might be read-only state that each child needs from the parent, reading it often.

There's excellent write-up in lwn's history:
Taming the OOM killer -- https://lwn.net/Articles/317814/
Improving the OOM killer -- https://lwn.net/Articles/684945/

K3n.

Expedited memory reclaim from killed processes

Posted Apr 14, 2019 12:45 UTC (Sun) by nix (subscriber, #2304) [Link] (1 responses)

which is only moved into dedicated address space for the child when it's changed

The address space is always dedicated to the child after fork(), even before CoW. The *physical memory* is not.

Expedited memory reclaim from killed processes

Posted Apr 15, 2019 12:20 UTC (Mon) by k3ninho (subscriber, #50375) [Link]

oops. thanks.

K3n.