Hastening process cleanup with process_mrelease()

By Jonathan Corbet
July 26, 2021

One of the fundamental invariants of computing is that, regardless of how much memory is installed in a system, it is never enough. This is especially true of systems with tight performance constraints, where every page of memory is allocated and in use, making it difficult to find more when it is badly needed. One way to make more memory available is to kill one or more processes, freeing their resources for other users. But that often does not work as quickly or reliably as users would like. In an attempt to improve the situation, Suren Baghdasaryan has proposed the addition of a system call named process_mrelease().

Systems running mixed workloads, where some tasks are more important than others, are not uncommon. If the system is being run near its maximum capacity, the relatively unimportant tasks may end up using memory that is needed by the more important work, at which point it might be better if the unimportant processes went away. Such systems often run process managers that will kill off the low-priority processes in these situations; perhaps the most widespread example of this pattern is Android, which will kill background apps if the available memory is insufficient for whatever is running in the foreground. Cloud-computing systems will also kill low-priority, best-effort workloads if their memory is needed by more important work.

Killing a process should, in principle, make its memory immediately available for other users. In the real world, though, things are not so simple. The killed process is, itself, responsible for cleaning up and freeing its resources, a task that is carried out in kernel context. If, however, the killed process finds itself blocked in an uninterruptible sleep, that cleanup work could be delayed indefinitely. There are other factors that can slow down the freeing of memory, including how busy the relevant CPU is and whether that CPU is running in a slow, low-power state.

When this happens, the system has paid the cost of killing the process (which was presumably doing something useful) without receiving the benefits from that action. Unfortunately, those benefits tend to be needed urgently; the system would not be killing processes otherwise. Delays in process cleanup can have immediate and visible effects on the higher-priority workloads; these can include jerky response on a handset or a delay in the delivery of a cat video to an impatient viewer.

This problem was encountered years ago in the context of the system's out-of-memory (OOM) killer, which is the kernel's last-resort response when memory runs out. Back in 2015, the development of the OOM reaper addressed this problem by taking the memory cleanup work out of the dying process's hands and making it the responsibility of a separate kernel thread. That made OOM killing significantly more robust, with the ability to free memory quickly even if the chosen process is not able to exit immediately.

That work did not address one other unfortunate characteristic of the OOM killer, though: its opinion of what is the least important process on the system tends to differ from that of the system's users. Invoking the OOM killer may allow the system as a whole to continue functioning, but the user whose window-system server was just killed may be forgiven for not being fully enthusiastic in their celebration of that feat.

For this reason, systems developers have tended to take the business of killing processes to rob them of their memory into their own hands. An out-of-memory handler running in user space can take more proactive steps to prevent the system from going into the OOM state to begin with, and it probably has a better idea of which processes will cause the least pain should they encounter an untimely demise. The oomd daemon released by Facebook is one example of this kind of utility; there are others as well.

User-space OOM killers, though, are not in the same position as the kernel's OOM killer; they must rely on the kill() system call (or, more recently, pidfd_send_signal()) to implement the sharp end of their memory-freeing decisions. Killing a process that way does not bring the OOM reaper into play, so user-space daemons are back in the situation of having to wait for the targeted processes to release their own resources.

Baghdasaryan's answer to this problem is a new system call:

    int process_mrelease(int pidfd, unsigned int flags);

The pidfd argument is a pidfd identifying the process of interest; that process must already be exiting (presumably as the result of a previous kill() operation) when the call is made. The flags argument must be zero for now. This call will have the same effect as turning the OOM reaper loose on the indicated process, stripping away as much of its memory as possible.
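As a rough illustration of how a user-space killer might use the call, the sketch below sends SIGKILL through a pidfd and then asks the kernel to release the victim's memory. It assumes a Linux system and goes through syscall(), since the C library may not provide wrappers. The fallback syscall numbers (424, 434, and 448) are assumptions taken from the generic syscall table of later mainline kernels; the installed headers should be preferred where they define them. The helper names (sys_pidfd_open(), sys_pidfd_send_signal(), sys_process_mrelease(), and kill_and_release()) are invented for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallback numbers, used only if the system headers lack them. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Thin wrappers; each returns -1 and sets errno on failure. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(__NR_pidfd_open, pid, flags);
}

static int sys_pidfd_send_signal(int pidfd, int sig)
{
    /* NULL siginfo and zero flags: behaves like a plain kill(). */
    return (int)syscall(__NR_pidfd_send_signal, pidfd, sig, NULL, 0u);
}

static int sys_process_mrelease(int pidfd, unsigned int flags)
{
    return (int)syscall(__NR_process_mrelease, pidfd, flags);
}

/*
 * Hypothetical helper: kill the process identified by pid, then ask the
 * kernel to strip its memory right away rather than waiting for the
 * victim to clean up after itself.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int kill_and_release(pid_t pid)
{
    int pidfd = sys_pidfd_open(pid, 0);
    if (pidfd < 0)
        return -1;

    int ret = sys_pidfd_send_signal(pidfd, SIGKILL);
    if (ret == 0)
        ret = sys_process_mrelease(pidfd, 0); /* flags must be zero */

    int saved_errno = errno;   /* close() may clobber errno */
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

A caller would invoke kill_and_release() on its chosen victim and inspect errno on failure; on a kernel that lacks the system call, ENOSYS is the error to expect, and the victim will simply free its memory in the usual (possibly slow) way.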

One of the reasons behind the creation of a separate call for this work is to give the system a context in which to do it. The task of going through the process's address space and freeing up all that memory will be done by the process that calls process_mrelease(), which may or may not be the process that killed the target in the first place. The kernel can then do this work with the priority of the calling process, and with its CPU assignments, allowing the cleanup work to be contained where it will not interfere with the (remaining) system workload.
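Since the reaping work runs in the caller's context, a user-space daemon could, in principle, shape that context before making the call. The following is a minimal sketch under several assumptions: the function name release_confined() is hypothetical, the fallback syscall number 448 comes from the generic syscall table of later mainline kernels, and the sketch relies on the Linux behavior where sched_setaffinity() and setpriority() with an ID of zero apply to the calling thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed fallback number, used only if the system headers lack it. */
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/*
 * Hypothetical helper: pin the calling thread to a single CPU and set
 * its nice value, then reap the memory of the (already killed) process
 * behind pidfd. The address-space teardown is then charged to this
 * thread, with this thread's CPU placement and priority.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int release_confined(int pidfd, int cpu, int nice_value)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of zero means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) < 0)
        return -1;

    /* On Linux, nice values are per-thread; zero again means "caller". */
    if (setpriority(PRIO_PROCESS, 0, nice_value) < 0)
        return -1;

    /* The flags argument must be zero for now. */
    return (int)syscall(__NR_process_mrelease, pidfd, 0u);
}
```

A real daemon would more likely dedicate a long-lived thread to this job, configured once, rather than changing its own scheduling parameters before every call.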

An alternative that was discussed with an earlier attempt to solve this problem was to just unconditionally reap the memory of a process when it is killed, without requiring a separate system call to make that happen. In that case, though, the work would be done in the context of the process sending the signal, which might not be welcome. A process that kills a lot of other ones — a killall command, for example — could be significantly slowed if that policy were to be adopted. Adding a separate system call gives user space more control over when and how that work is done.

In the previous posting of this work, the main topic of discussion was the name of the system call itself — process_reap() at that time. That is a reasonably clear sign that the more significant issues have been addressed and that the work may be about ready to move forward. The number of callers of process_mrelease() is likely to be small, but it seems there will be some situations where it will be a useful tool to have.

Hastening process cleanup with process_mrelease()

Posted Jul 26, 2021 20:26 UTC (Mon) by josh (subscriber, #17465)

> An alternative that was discussed with an earlier attempt to solve this problem was to just unconditionally reap the memory of a process when it is killed, without requiring a separate system call to make that happen. In that case, though, the work would be done in the context of the process sending the signal, which might not be welcome.

This doesn't seem like it needs a completely separate syscall. pidfd_send_signal takes flags, and a "reap process memory immediately" flag seems like it would fit well there.

Posted Jul 26, 2021 23:19 UTC (Mon) by NYKevin (subscriber, #129325)

If you want process A to do the killing, and process B to do the reaping, then you need a new syscall regardless.

(Why have two? To improve throughput. A can spend all its CPU cycles on identifying useful things to kill, and B can spend its CPU cycles on actual reaping. You can then adjust the relative priorities of those two tasks independently of each other with the usual process-management techniques.)

Posted Jul 27, 2021 1:10 UTC (Tue) by josh (subscriber, #17465)

Interesting point! Technically process B (or thread B) could do the kill as well after A picked out a target, but it does make sense to have a high-priority process do the kill and a lower-priority (but still higher-priority-than-normal) process do the reclamation.

Posted Jul 30, 2021 14:03 UTC (Fri) by brauner (subscriber, #109349)

> This doesn't seem like it needs a completely separate syscall. pidfd_send_signal takes flags, and a "reap process memory immediately" flag seems like it would fit well there.

In one of the first iterations this was a flag to pidfd_send_signal(), but I really disliked it. I get why it feels appealing, but how memory is released has nothing to do with signaling imho. It's better suited as a separate API where it can also be extended in the future.

Posted Jul 27, 2021 11:16 UTC (Tue) by dancol (guest, #142293)

> The task of going through the process's address space and freeing up all that memory will be done by the process that calls process_mrelease(), which may or may not be the process that killed the target in the first place

As I've said before on LKML, I like this general pattern. Too often in kernel-land we make kernel threads that perform some action on *behalf* of user code, but without giving user code control over the number or characteristics of those kernel threads. (I'm looking at you, io_uring.) Userspace should be in control of the kernel threads doing work on its behalf, and the easiest and best way of giving userspace this control is to make userspace provide the threads to do the kernel work. After all: every userspace thread *is* a kernel thread too!

Posted Jul 27, 2021 14:33 UTC (Tue) by mario-campos (subscriber, #152845)

> After all: every userspace thread *is* a kernel thread too!

Does this mean that the only difference between a userspace thread and a kernel thread is the context (userspace, kernel) in which it is running?

Posted Jul 27, 2021 15:06 UTC (Tue) by Paf (subscriber, #91811)

Ehhhhh, I would usually use the term kernel thread to mean a thread that is dedicated to the kernel and does not (with rare exceptions) ever run in user space.

I get the point being made above, though - user threads can run in kernel context (obviously :)) and so they can do some of the work directly.