LWN: Comments on "Teaching the OOM killer about control groups"
https://lwn.net/Articles/761118/
This is a special feed containing comments posted to the individual LWN article titled "Teaching the OOM killer about control groups".

Teaching the OOM killer about control groups
by mstsxfx, Wed, 01 Aug 2018 13:21:24 +0000
https://lwn.net/Articles/761464/

> This patch set is not yet in -mm, but there does not appear to be any real opposition to it at this point.

Well, this is not entirely true. There were quite serious concerns about the proposed API allowing for weird corner cases, e.g. http://lkml.kernel.org/r/20180130085013.GP21609@dhcp22.suse.cz. The discussion circled around some repetitive arguments, so it is not easy to follow, unfortunately. The last version hasn't been reviewed AFAIK, but the basics are quite similar, so I am skeptical that this is mergeable anytime soon.

The biggest quarrel about this whole thing is, IMHO, a differing view of feature completeness.

Roman's original proposal targeted a very specific class of use cases (containers, as you mentioned) and was opt-in, so those uninterested could live with the original policy/heuristic, with potential extensions [1] to be done on top.

David was pushing hard for a more generic solution that would give userspace more power over the OOM selection policy. While this is a good thing in general, the primary problem is that it is extremely hard to get right. We have been discussing this for years without moving forward much, because opinions about what is really important vary a lot.

[1] group_oom makes a lot of sense regardless of the OOM victim selection policy, because some workloads are inherently indivisible.

Michal Hocko
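[As a rough illustration of how small the opt-in surface Michal describes is, here is a minimal editor's sketch of turning on whole-group OOM kills for a single container cgroup. The knob name (memory.oom_group), the cgroup-v2 path, and the default directory are assumptions taken from the patch discussion; the merged interface may end up looking different.]

    /* Editor's sketch, not part of the patch set: opt one cgroup into
     * whole-group OOM kills while the rest of the system keeps the
     * existing per-process heuristic.  The knob name (memory.oom_group)
     * and the cgroup path are assumptions from the discussion; error
     * handling is kept minimal. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const char *cg = argc > 1 ? argv[1] : "/sys/fs/cgroup/container1";
        char path[4096];
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.oom_group", cg);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        /* "1": if the OOM killer picks a victim inside this group, kill
         * every task in the group rather than a single process. */
        fputs("1\n", f);
        return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
    }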
OOM killer and cgroups. Better do it in user-space?
by nilsmeyer, Tue, 31 Jul 2018 21:04:15 +0000
https://lwn.net/Articles/761405/

It depends on the service, really. What I often see is a main process that fork()s into multiple children, and some of the child processes grow out of control. Although perhaps this is handled better within the cgroup?

Teaching the OOM killer about control groups
by vbabka, Mon, 30 Jul 2018 07:42:43 +0000
https://lwn.net/Articles/761250/

AFAIU this is not about a cgroup's memory limit, but about a system-wide OOM in a situation where there are cgroups. There may be per-cgroup limits, but it's not the case that the sum of those limits equals the system memory, because that would lead to an underutilized system. So instead there is overcommit, and the workloads can tolerate being killed at the level of a whole group.

Teaching the OOM killer about control groups
by dbe, Mon, 30 Jul 2018 05:50:35 +0000
https://lwn.net/Articles/761242/

Would that mean that a cgroup could ignore an OOM notification?

Teaching the OOM killer about control groups
by meyert, Sun, 29 Jul 2018 21:27:40 +0000
https://lwn.net/Articles/761223/

How does this relate to cgroups with the memory controller? Isn't the OOM killer also invoked when the cgroup's memory limit is hit? Couldn't "kill the whole cgroup" then also be implemented in the cgroup itself, by installing a cgroup OOM notification listener?
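[For reference, the cgroup-v1 memory controller does offer the kind of listener meyert asks about: userspace registers an eventfd against memory.oom_control through cgroup.event_control and is woken when the group hits its limit; with oom_kill_disable also set, userspace can handle the OOM itself, which touches dbe's question as well. Below is a minimal sketch of such a listener, assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory and a group named "mygroup"; both paths are illustrative.]

    /* Minimal cgroup-v1 OOM notification listener (editor's sketch).
     * Assumes the v1 memory controller at /sys/fs/cgroup/memory and an
     * existing group "mygroup"; error handling trimmed for brevity. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        const char *cg = "/sys/fs/cgroup/memory/mygroup"; /* illustrative */
        char buf[256];
        int efd, ofd, cfd;
        uint64_t events;

        efd = eventfd(0, 0);                       /* notification fd */

        snprintf(buf, sizeof(buf), "%s/memory.oom_control", cg);
        ofd = open(buf, O_RDONLY);

        snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
        cfd = open(buf, O_WRONLY);
        if (efd < 0 || ofd < 0 || cfd < 0) {
            perror("setup");
            return EXIT_FAILURE;
        }

        /* Register "<eventfd> <memory.oom_control fd>"; the kernel then
         * signals efd each time this cgroup runs into an OOM. */
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (write(cfd, buf, strlen(buf)) < 0) {
            perror("cgroup.event_control");
            return EXIT_FAILURE;
        }

        /* Block until an OOM event fires; at this point a userspace
         * policy could, for example, kill every pid in cgroup.procs. */
        if (read(efd, &events, sizeof(events)) == sizeof(events))
            printf("OOM events in %s: %llu\n", cg,
                   (unsigned long long)events);

        return EXIT_SUCCESS;
    }

[Whether doing the whole-group kill from such a listener is as robust as doing it in the kernel is, of course, exactly what the following comments debate.]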
Killing containers.
by rweikusat2, Sun, 29 Jul 2018 16:09:27 +0000
https://lwn.net/Articles/761213/

Obviously.

But if a control group happens to represent a container, the processes belonging to it will usually be cooperating more closely than unrelated processes also running on the same system. Hence, killing one process running inside a container ends up as a more or less drastically malfunctioning virtual server, while killing all of them would be more like a clean shutdown. I certainly prefer the latter to the former.

There's also another issue: assuming the OOM situation was caused by some software bug or malfunction, chances are that it will happen again fairly quickly, as the "monitoring infrastructure" will seek to reestablish the problematic state. Killing everything in the cgroup should prevent that from happening.

OOM killer and cgroups. Better do it in user-space?
by foom, Sun, 29 Jul 2018 15:15:59 +0000
https://lwn.net/Articles/761211/

I don't think killing all the processes in the cgroup makes anything appreciably easier. Since any of the processes could crash in some other way, you still need whatever monitoring mechanisms you had before.

OOM killer and cgroups. Better do it in user-space?
by darwish, Sat, 28 Jul 2018 23:40:17 +0000
https://lwn.net/Articles/761188/

Yeah, I really like the "kill the whole cgroup" option too (memory.oom_group).

It makes sense especially since systemd puts each "service" in its own cgroup, and killing the whole service sounds much more graceful than picking a single victim process from within it.

OOM killer and cgroups. Better do it in user-space?
by simcop2387, Sat, 28 Jul 2018 21:16:12 +0000
https://lwn.net/Articles/761186/

I think making it aware of the groups for this kind of scheduling isn't too much policy. Since cgroups can affect everything else the kernel does with them (scheduling, I/O, etc.), it makes sense to extend them to this. And adding the "kill the whole group" bit makes administration easier: have your container all in one cgroup and just let the OOM killer take the whole thing out, then your infrastructure restarts it. No need for a complicated heartbeat or some other, more failure-prone signal saying the container isn't functioning properly anymore. That said, anything beyond making the kernel aware of these structures does, I think, make sense to put into something in userspace. I'm not sure how Facebook's setup works, but I could see something that dynamically adjusts the OOM scores to try to keep certain things from being killed while encouraging others; that is going to be a much more complicated policy, since it will be system/service/etc. specific (see the sketch at the end of this thread).

OOM killer and cgroups. Better do it in user-space?
by darwish, Sat, 28 Jul 2018 19:09:32 +0000
https://lwn.net/Articles/761173/

Thanks a lot, Jon, for the great article :-)

Part of me feels that this is _too much policy_ leaking into the kernel code. I guess the Facebook approach of relegating this to user space makes more sense?

https://code.fb.com/production-engineering/open-sourcing-oomd-a-new-approach-to-handling-ooms/

Maybe something similar can be done in systemd (systemd-oomd?), with its standardized user-space configuration options, instead of all of these policy decisions and tunables.

Teaching the OOM killer about control groups
by abk77, Sat, 28 Jul 2018 13:51:04 +0000
https://lwn.net/Articles/761167/

Very nice explanation, thank you! The diagrams/pictures are very useful.
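[On the userspace score-steering idea raised by simcop2387: the existing mechanism such a daemon has to work with is /proc/<pid>/oom_score_adj, which biases the kernel's victim choice in the range -1000 to +1000. A minimal sketch follows; the PIDs and adjustment values are invented for illustration and are not necessarily how oomd itself behaves, since oomd makes its own kill decisions in userspace from pressure and memory statistics.]

    /* Editor's sketch of "dynamically adjust the OOM scores": bias the
     * kernel's victim selection by writing to /proc/<pid>/oom_score_adj.
     * The PIDs and values below are hypothetical; a real policy daemon
     * would derive them from its configuration and runtime statistics. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Valid range is -1000 (exempt from OOM kills) to +1000 (preferred
     * victim). */
    static int set_oom_score_adj(pid_t pid, int adj)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", adj);
        return fclose(f);
    }

    int main(void)
    {
        /* Hypothetical policy: shield the database, prefer the batch job. */
        if (set_oom_score_adj(1234, -900) || set_oom_score_adj(5678, 500))
            perror("oom_score_adj");
        return EXIT_SUCCESS;
    }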