Re: [patch] memcg: add oom killer delay
[Posted March 9, 2011 by corbet]
From: David Rientjes <rientjes-AT-google.com>
To: Andrew Morton <akpm-AT-linux-foundation.org>
Subject: Re: [patch] memcg: add oom killer delay
Date: Mon, 7 Mar 2011 17:33:44 -0800 (PST)
Message-ID: <alpine.DEB.2.00.1103071721330.25197@chino.kir.corp.google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-AT-jp.fujitsu.com>,
    Balbir Singh <balbir-AT-linux.vnet.ibm.com>,
    Daisuke Nishimura <nishimura-AT-mxp.nes.nec.co.jp>, linux-mm-AT-kvack.org
On Mon, 7 Mar 2011, Andrew Morton wrote:
> > It could be, if users assign the handler to a different memcg; otherwise,
> > it's guaranteed.
>
> Putting the handler into the same container would be rather daft.
>
> If userspace is going to elect to take over a kernel function then it
> should be able to perform that function reliably. We don't have hacks
> in the kernel to stop runaway SCHED_FIFO tasks, either. If the oom
> handler has put itself into a memcg and then has permitted that memcg
> to go oom then userspace is busted.
>
We have a container specifically for daemons like this and have struggled
for years to accurately predict how much memory it needs and what to do
when it is oom. The problem, in this case, is that when it's oom it's too
late: the memcg is livelocked, so no memory limits on the system have
a chance of being increased and nothing in an oom memcg is guaranteed to
ever make forward progress again.

That's why I keep bringing up the point that this patch is not a bugfix:
it's an extension of a feature (memory.oom_control) to allow userspace a
period of time to respond to memcgs reaching their hard limit before
killing something. For our container with vital system daemons, this is
absolutely mandatory if something consumes a large amount of memory and
needs to be restarted; we want the logic in userspace to determine what to
do without killing vital tasks or panicking. We want to use the oom
killer only as a last resort and that can effectively be done with this
patch and not with memory.oom_control (and I think this is why Kame acked
it).
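
For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of what a userspace oom handler built on memory.oom_control might look like. It assumes the cgroup v1 memory controller: the file names cgroup.event_control and memory.limit_in_bytes come from that interface, not from this message, and the helper names and the cgroup directory passed in are hypothetical. It illustrates the general approach only; it is not the code from the patch.

```python
import os


def registration_line(event_fd, control_fd):
    """cgroup v1 expects '<event_fd> <target_fd>' written to cgroup.event_control."""
    return f"{event_fd} {control_fd}"


def register_oom_notifier(cgroup_dir):
    """Ask the kernel to signal an eventfd when this memcg goes oom.

    Returns (event_fd, control_fd); both are kept open by the caller.
    os.eventfd is Linux-only and needs Python 3.10 or later.
    """
    event_fd = os.eventfd(0)
    control_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(registration_line(event_fd, control_fd))
    return event_fd, control_fd


def raise_limit(cgroup_dir, extra_bytes):
    """One possible response to an oom event: raise the hard limit."""
    path = os.path.join(cgroup_dir, "memory.limit_in_bytes")
    with open(path) as f:
        current = int(f.read().strip())
    new_limit = current + extra_bytes
    with open(path, "w") as f:
        f.write(str(new_limit))
    return new_limit


def handle_one_oom(cgroup_dir, extra_bytes):
    """Block until the memcg reports oom, then raise its limit once."""
    event_fd, control_fd = register_oom_notifier(cgroup_dir)
    try:
        os.read(event_fd, 8)  # blocks until the kernel signals the eventfd
        return raise_limit(cgroup_dir, extra_bytes)
    finally:
        os.close(control_fd)
        os.close(event_fd)
```

The weakness described above is visible here: if the process running handle_one_oom() lives in a memcg that is itself oom, or is otherwise unresponsive, nothing ever raises the limit.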
> My issue with this patch is that it extends the userspace API. This
> means we're committed to maintaining that interface *and its behaviour*
> for evermore. But the oom-killer and memcg are both areas of intense
> development and the former has a habit of getting ripped out and
> rewritten. Committing ourselves to maintaining an extension to the
> userspace interface is a big thing, especially as that extension is
> somewhat tied to internal implementation details and is most definitely
> tied to short-term inadequacies in userspace and in the kernel
> implementation.
>
The same could have been said for memory.oom_control to disable the oom
killer entirely, which now seems to be solidified as the only way to
influence oom killer behavior from the kernel, and now we're locked into
that limitation because we don't want dual interfaces. I think this patch
would have been received much better prior to memory.oom_control since it
allows for the same behavior with an infinite timeout. memory.oom_control
is not an option for us since we can't guarantee that any userspace daemon
at our scale will ever be responsive 100% of the time.

I don't think the idea of a userspace grace period when a memcg is oom is
that abstract, though. I think applications should have the opportunity
to free some of their own memory first when oom instead of abruptly
killing something and restarting it.

So, in the end, we may have to carry this patch internally forever, but I
think as memcg becomes more popular we'll have a higher demand for such a
grace period.
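
To make the comparison above concrete, here is a toy model of the three behaviours being contrasted: killing immediately, waiting forever (the oom killer disabled through memory.oom_control), and a finite grace period. The function and parameter names are hypothetical and this is not kernel code; it only encodes the argument as stated in this message.

```python
import math


def oom_outcome(delay_ms, time_to_free_ms):
    """Toy model of what happens when a memcg hits its hard limit.

    delay_ms: grace period before the kernel oom killer runs.
              0 means kill immediately; math.inf means never kill
              (the oom killer is disabled for the memcg).
    time_to_free_ms: how long userspace needs to resolve the condition;
              math.inf if the userspace handler never responds.
    """
    if time_to_free_ms < delay_ms:
        return "resolved by userspace"
    if math.isinf(delay_ms):
        return "stuck"  # nobody responds and the kernel never steps in
    return "kernel oom kill"


# An infinite delay reproduces the disabled-oom-killer behaviour,
# including its failure mode when the handler is unresponsive.
print(oom_outcome(math.inf, 1000))      # handler responds in time
print(oom_outcome(math.inf, math.inf))  # handler never responds
print(oom_outcome(5000, math.inf))      # finite delay: kernel is the last resort
```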