
Avoiding the OOM killer with mem_notify

By Jake Edge
January 30, 2008

Having applications that use up all the available memory can be a fairly painful experience. For Linux systems, it generally means a visit from the out-of-memory (OOM) killer, which will try to find processes to kill. As one would guess, coming up with rules governing which process to kill is challenging—someone, somewhere, will always be unhappy with a choice the OOM killer makes. Avoiding it altogether is the goal of the mem_notify patch.

When memory gets tight, it is quite possible that applications have memory allocated—often caches for better performance—that they could free. After all, it is generally better to lose some performance than to face the consequences of being chosen by the OOM killer. But, currently, there is no way for a process to know that the kernel is feeling memory pressure. The patch provides a way for interested programs to monitor /dev/mem_notify and be notified when memory starts to run low.

/dev/mem_notify is a character device that signals memory pressure by becoming readable. Interested programs can open the file and then use poll() or select() to monitor the file descriptor. Alternatively, signal-driven I/O can be enabled via the FASYNC flag and the system will deliver a SIGIO signal to the process when the device becomes readable. If it becomes readable, the process should free any memory that it can afford to give up. If enough memory is freed this way, the kernel will have no need to call in the OOM killer.
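For illustration, a mem_notify consumer might look something like the sketch below, which assumes only the interface described above; drop_caches() is a hypothetical stand-in for whatever cache-releasing logic the application has, and error handling is elided.

    /* Minimal sketch of a mem_notify consumer. */
    #include <poll.h>
    #include <fcntl.h>

    extern void drop_caches(void);  /* hypothetical: release what we can spare */

    int main(void)
    {
        struct pollfd pfd;

        pfd.fd = open("/dev/mem_notify", O_RDONLY);
        if (pfd.fd < 0)
            return 1;
        pfd.events = POLLIN;

        for (;;) {
            /* Block until the kernel signals memory pressure. */
            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                drop_caches();
        }
    }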

The crux of the patch is deciding when memory pressure is occurring. mem_notify modifies shrink_active_list() to look for the movement of anonymous pages to the inactive list, which is an indication that they will likely be swapped out soon. When that occurs, memory_pressure_notify() (with the pressure flag set to 1) will be called for that zone. When the number of free pages for the zone increases above a threshold—based on pages_high and lowmem_reserve for the zone—memory_pressure_notify() is called again, but with the pressure flag set to 0, effectively ending the memory pressure event for that zone.
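In rough outline, the policy looks like the following sketch; this is an illustration of the behavior just described, not the patch's actual code, and the mem_notify_status field and helper functions are invented names.

    /* Illustrative sketch only -- not the patch's code. */
    void memory_pressure_notify(struct zone *zone, int pressure)
    {
        if (pressure) {
            /* shrink_active_list() saw an anonymous page go inactive:
             * swapping is likely soon, so mark the zone and make
             * /dev/mem_notify readable for its waiters. */
            zone->mem_notify_status = 1;        /* invented field */
            wake_mem_notify_waiters(zone);      /* invented helper */
        } else {
            /* Free pages rose back above the threshold derived from
             * pages_high and lowmem_reserve: the event is over. */
            zone->mem_notify_status = 0;
        }
    }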

If there are numerous processes waiting for a memory pressure notification, it could be counterproductive to wake them all at once—the "thundering herd" problem. To combat this, the patch set adds the ability to wake fewer processes than are waiting on the poll event by adding the poll_wait_exclusive() function. poll_wait_exclusive() will in turn call add_wait_queue_exclusive(), so that a member of the wake_up() family can be used to limit the number of processes woken up. Previously, only poll_wait() was available; it uses add_wait_queue(), which does not provide this ability. Also, to reduce the frequency of processes waking up to reclaim memory, memory_pressure_notify() will only issue notifications once every five seconds.
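Conceptually, the difference is this (an illustrative fragment, not the patch's code; mem_notify_wqh and nr_wake are invented names):

    /* Queuing waiters exclusively lets the waker bound the wakeups. */
    static DECLARE_WAIT_QUEUE_HEAD(mem_notify_wqh);

    static void waiter_side(wait_queue_t *wait)
    {
        /* poll_wait_exclusive() does this instead of add_wait_queue() */
        add_wait_queue_exclusive(&mem_notify_wqh, wait);
    }

    static void waker_side(int nr_wake)
    {
        wake_up_nr(&mem_notify_wqh, nr_wake);  /* rouse at most nr_wake */
    }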

The /proc/zoneinfo output has been changed to include the mem_notify status. This can be used by a human for diagnostic purposes or by a program to check the current status of zones for memory pressure.

The embedded community has a lot of interest in seeing this feature get added to the kernel. Devices like phones and PDAs are often running close to their memory limits and the OOM killer is currently unavoidable when the user opens yet another application. With this patch in place, programs that use a lot of memory, but could get by with less, can be changed to free up their caches and the like when memory gets tight. As memory-hungry programs get changed, other users will benefit as well.

The patch, submitted by Kosaki Motohiro, has been through several iterations on linux-kernel. The work was originally started by Marcelo Tosatti, with the fifth version recently posted by Kosaki. Previous versions have been well received and with relatively few comments on this iteration, it would seem to be getting close to being merged.



Avoiding the OOM killer with mem_notify

Posted Jan 31, 2008 3:34 UTC (Thu) by salimma (subscriber, #34460) [Link] (29 responses)

Presumably the OOM killer looks at applications' changes in memory footprint when picking a
victim? Otherwise there'd be no incentive for an app to play nice and voluntarily relinquish
memory.

Avoiding the OOM killer with mem_notify

Posted Jan 31, 2008 9:09 UTC (Thu) by njs (subscriber, #40338) [Link] (27 responses)

It's possible to take this metaphor of processes fighting over memory too far.  I don't think
in practice app writers consider themselves to have "won" if they've managed to partially
crash the user's system but keep their own process running while they did it.  The goal is to
not invoke the capricious god OOM at all.

Interesting possibility enabled by this patch: userspace OOM killer.  You don't *have* to
reduce memory by freeing caches -- killing other processes is quite effective too :-).  And if
you have a relatively integrated environment like a phone UI, you may know perfectly well from
userspace that killing that java game is better than killing the windowing system, which is
better than killing the gsm daemon.  (Even on desktops, one knows that killing X will also
automatically kill all its clients -- so one should always start by killing those clients
first, because if that works, then you've managed to escape the OOM situation with strictly
less damage.)

Requesting 'real' memory

Posted Jan 31, 2008 11:08 UTC (Thu) by epa (subscriber, #39769) [Link] (25 responses)

The OOM killer is needed because the kernel has overallocated memory.  Surely for critical
processes there is a way to request 'hard' memory, where you can be sure that it really exists
either as RAM or swap space, and you can be certain you're not going to be arbitrarily killed
later for using the memory you requested.  The tradeoff is that a memory allocation request
can fail - but better to have malloc() return 0 where the app can handle it sensibly than to
have it pretend to work and then randomly kill your process later.

Can you turn off overallocation (and OOM killing) on a per-process basis?

Requesting 'real' memory

Posted Jan 31, 2008 11:43 UTC (Thu) by njs (subscriber, #40338) [Link] (10 responses)

>and you can be certain you're not going to be arbitrarily killed later for using the memory
you requested

Well, here's what makes designing the OOM-killer hard -- attempting to use memory that the
system doesn't have actually *doesn't* kill you, it just wakes up the OOM-killer and then it's
perfectly possible that the OOM-killer will go after someone else.  Imagine a scenario where
one app allocates 99% of the system's memory (and not with some virtually allocated
overallocation bs, like they actually touch the pages or whatever), and then stops.  Then you
try to run "ls", and it overflows that last 1% of memory and wakes up the OOM-killer.  The
OOM-killer tries to identify and then attack the runaway giant process, not ls, even though ls
was the one who tried to get more memory.

So it doesn't actually help much for you, personally, to make sure your memory is not
overallocated -- if anything it will hurt, since it increases memory pressure overall and also
makes your process a bigger target.  You can turn off overallocation globally, but that
doesn't necessarily help either, since in the scenario above it just means that ls (and every
other program you try to run) fails, while the runaway monster just sits there.

Google says you can disable the OOM killer on a per-process basis, though, which I hadn't
known: http://linux-mm.org/OOM_Killer
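(On kernels of this era the knob that page describes is /proc/<pid>/oom_adj, where writing -17 -- OOM_DISABLE -- exempts the process. A minimal sketch of using it, which needs appropriate privilege:)

    #include <stdio.h>

    /* Sketch: exempt the calling process from the OOM killer. */
    int main(void)
    {
        FILE *f = fopen("/proc/self/oom_adj", "w");
        if (!f)
            return 1;
        fputs("-17\n", f);   /* -17 == OOM_DISABLE */
        return fclose(f) ? 1 : 0;
    }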

Requesting 'real' memory

Posted Feb 1, 2008 20:21 UTC (Fri) by giraffedata (guest, #1954) [Link] (9 responses)

So it doesn't actually help much for you, personally, to make sure your memory is not overallocated

Right, that's as dodgy as expecting to avoid the OOM Killer pseudo-crash by having the kernel notify users that memory is tight.

What you want is a combination of the two: a process turns off overallocation for itself, and in exchange, is made immune to the OOM Killer.

That way, processes that need determinism can have it while processes that don't want to waste swap space can have that.

Linux doesn't have this. AIX does.

Requesting 'real' memory

Posted Feb 1, 2008 23:55 UTC (Fri) by njs (subscriber, #40338) [Link] (8 responses)

> What you want is a combination of the two: a process turns off overallocation for itself,
and in exchange, is made immune to the OOM Killer.

I don't see the connection between these.  Turning off overallocation just means that you get
a different error handling API.  It certainly doesn't stop you from running the system out of
memory.

Making a process immune from the OOM killer is clearly a root-level operation; all you have to
do to force allocation is to touch pages after you allocate them, obviously not a root level
sort of ability.

Requesting 'real' memory

Posted Feb 2, 2008 2:00 UTC (Sat) by giraffedata (guest, #1954) [Link] (7 responses)

Turning off overallocation just means that you get a different error handling API

How is getting killed by the OOM Killer an error handling API? Turning off overallocation means you get an error handling API where you had none before.

It certainly doesn't stop you from running the system out of memory.

Turning off overallocation for one process doesn't stop you from running the system out of memory; that's why the OOM Killer is still there. But he only kills other processes. The connection is that if Process X is not overallocating memory (swap space), then the OOM Killer is guaranteed to be able to relieve memory pressure without having to kill Process X. You can't say that about an overallocating process.

Think of it as two separate pools of swap space; one managed by simple allocation; the other with optimistic overallocation and an OOM Killer. A process decides which one works best for it.

all you have to do to force allocation is to touch pages after you allocate them,

No, that's not enough. In overallocating mode, the swap space does not get allocated until the kernel decides to steal a page frame. By then, the process is in no position to be able to deal with the fact that there's not enough swap space for him.

Requesting 'real' memory

Posted Feb 3, 2008 1:39 UTC (Sun) by njs (subscriber, #40338) [Link] (6 responses)

Ah, I see, that's not what overallocation means.  It has nothing to do with swap.  As far as
the memory manager is concerned, the total amount of memory available in the system is RAM +
swap -- if you have 1G ram and 2G swap, then you have 3G total memory.

This 3G total is distributed among processes.  If overallocation is turned off, then each time
a process calls malloc() (well, really mmap()/sbrk()/fork()/etc., but never mind), either some
pages from that 3G are shaved off and reserved for that process's use, or if there are not
enough pages remaining then the syscall fails.

If overallocation is turned on, then malloc() never actually allocates memory.  What it does
instead is set up some virtual pages in the process's address space, and then the first time
the process tries to write anywhere on each of those not-really-there pages, the process takes
a fault, the memory manager allocates one of those 3G of pages, sticks it in place of the fake
page, and then finally allows the write to continue.  The upside of this is that if a process
malloc()'s a big chunk of memory and then only uses part of it, it's as if they only
malloc()'ed exactly what they ended up needing.  The downside is that since the actual
allocation is now happening somewhere in the middle of a single user-space cpu instruction,
there's no way to signal an error back to the process if the allocation fails, and so in that
case the only thing you can do is wake up an OOM-killer.
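(The difference can be seen in a few lines of C: under overcommit the malloc() itself is cheap, and the commitment only becomes real as pages are first touched:)

    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        size_t sz = (size_t)1 << 30;        /* ask for 1G */
        char *p = malloc(sz);
        long psz = sysconf(_SC_PAGESIZE);

        if (!p)                  /* strict accounting can fail here... */
            return 1;
        for (size_t i = 0; i < sz; i += psz)
            p[i] = 1;            /* ...but with overcommit, real allocation
                                    happens here, at first touch, where no
                                    error can be returned -- hence the
                                    OOM killer */
        free(p);
        return 0;
    }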

> The connection is that if Process X is not overallocating memory (swap space), then the OOM
Killer is guaranteed to be able to relieve memory pressure without having to kill Process X.

Which means that this just isn't true.  Memory pressure is caused by actually-allocated pages,
and the only difference between a process using overallocation and one that isn't is that the
overallocating process may have some virtual pages set up to trigger allocation sometime in
the future.  Whether such pages exist has no bearing whatsoever on memory pressure *now*.  The
only way the OOM-killer can relieve memory pressure is to kill off processes that are using
memory, and Process X qualifies.

Requesting 'real' memory

Posted Feb 3, 2008 2:29 UTC (Sun) by giraffedata (guest, #1954) [Link] (3 responses)

Though you start out saying overallocation has nothing to do with swap, your second sentence shows that it is strongly related to swap, saying that swap space is half the equation in determining how much memory is available to allocate.

But what overallocation are you talking about? You describe it like some well-defined Linux function. Are you talking about something that Linux implements? AFAIK, Linux does not implement a per-process overallocation mode, and we were talking about what should be.

It's clear how the mode should work: The same way it does in AIX, which is what I described. "Turning off" overallocation has to mean that once you've allocated memory, you can use it and can't be killed for lack of memory. Otherwise, why bother having the mode?

And the way to do that is to allocate swap space to the process at the moment you allocate the virtual addresses to it. You could alternatively permanently allocate some real memory to the process, but that would be really wasteful (and we already have a means to do that: mlockall()).

BTW, it's not helpful to talk about memory not being actually allocated by malloc (brk/mmap/etc). malloc() does actually allocate virtual memory. Allocating swap space and allocating real memory are separate things that exist in support of using virtual memory which has been allocated. I also don't think "virtual" amounts to "fake." It's a different form of memory.

Memory pressure is caused by actually-allocated pages

I assume you mean pages of real memory. Filling up of real memory is the primary cause of memory pressure, but it's easy to relieve that pressure: just push data you aren't using out to swap space and free up the real memory. When swap space is full, that's when the pressure backs up into the real memory and the OOM Killer is needed. That's why I say swap space is the key to giving guaranteed-usable virtual memory to a process.

Requesting 'real' memory

Posted Feb 3, 2008 4:06 UTC (Sun) by njs (subscriber, #40338) [Link] (1 responses)

Obviously we're totally talking past each other, but I'll try one more time...

>Though you start out saying overallocation has nothing to do with swap, your second sentence
shows that it is strongly related to swap, saying that swap space is half the equation in
determining how much memory is available to allocate.

Well, sure, swap is, by any measure, an important part of a VM system, but that doesn't mean
it's "strongly related" to any other particular part of a VM system.  My point is that from
the point of view of overallocation, the difference between swap and physical RAM is just
irrelevant.

> But what overallocation are you talking about? You describe it like some well-defined Linux
function. Are you talking about something that Linux implements?

Yes.  I'm talking about "overallocation" or "overcommit", which in this context is a technical
term with a precise meaning.  When people are talking about it in this thread, they are
referring to a particular policy implemented by the Linux kernel and enabled by default.
Evidently you haven't encountered this particular design before, which is why I described it
in my previous comment...

>AFAIK, Linux does not implement a per-process overallocation mode, and we were talking about
what should be.

No, AFAIK it doesn't, but it does support a global overallocation/no-overallocation switch,
and it's obvious what it would mean to take that switch and make it process-granular.  Maybe
there's yet another policy that Linux should implement, but if you want to talk about that
then trying to redefine an existing term to do so will just confuse people.

>And the way to do that is to allocate swap space to the process at the moment you allocate
the virtual addresses to it.

Huh, so is this how traditional Unix works?  Is this tight coupling between memory allocation
policy and swap management policy the original source of that old advice to make your swap
space = 2xRAM?  I've long wondered where that "rule" came from.

I guess I've heard before that in traditional Unix all RAM pages are backed on disk somewhere,
either via the filesystem or via swap, but I hadn't thought through the consequences before.

I'm guessing this is the original source of confusion.  Linux doesn't work like that at all;
I'm not sure there's any modern OS that does.  In a system where RAM is always swap-backed,
having 1G RAM and 2G of swap means that all processes together can use 2G total; in Linux,
they can use 3G total, because if something is in RAM it doesn't need to be in swap, and
vice-versa.  (What happens in your scheme if someone is running without swap?  I bet there are
people in this thread who both disable overallocation and run without swap (hi Zooko!).)

"Allocated memory" in my comment really just means "anonymous transient data that the kernel
has committed to storing on the behalf of processes".  It can arrange for it to be stored in
RAM, or in swap, or whatever.  There is no such thing as "allocated swap" or "allocated RAM"
in Linux.  (Except via mlockall(), I guess, if you want to call it that, but I don't think
calling it that is conceptually useful -- it's more of a way to pin some "allocated memory"
into RAM.)
Does that make my previous comment make more sense?

Requesting 'real' memory

Posted Feb 4, 2008 9:43 UTC (Mon) by giraffedata (guest, #1954) [Link]

but it does support a global overallocation/no-overallocation switch, and it's obvious what it would mean to take that switch and make it process-granular.

That would be only slightly different from the scheme I described. The only difference is that the global switch lets you add a specified amount of real memory to the size of swap space in calculating the quota. If you could stop the kernel from locking up that amount of real memory for other things, you could have the OOM-proof process we're talking about, with less swap space.

I think the only reason I haven't seen it done that way is that swap space is too cheap to make it worthwhile to bring in the complexity of allocating the real memory. If I were to use the Linux global switch, I would just tell it to consider 0% of the real memory and throw some extra disk space at it, for that reason.

What happens in your scheme if someone is running without swap? I bet there are people in this thread who both disable overallocation and run without swap

They don't have that option. The price they pay to have zero swap space is that nothing is ever guaranteed to be free from being OOM-killed. Which is also the case for the Linux users today who disable overallocation and run without swap.

Requesting 'real' memory

Posted Feb 5, 2008 13:44 UTC (Tue) by filipjoelsson (guest, #2622) [Link]

> But what overallocation are you talking about? You describe it like some
> well-defined Linux function. Are you talking about something that Linux
> implements? AFAIK, Linux does not implement a per-process overallocation
> mode, and we were talking about what should be.

I think the overallocation he talks about is on the userspace level. When I'm programming an
application to read, store and analyze data on a laptop (a field application - the user wants
to have preliminary analysis right away), I can make a big fat allocation to use for cache. I
store the data in a database, which also uses a big fat cache. Until now, I have had to
either make both caches no larger than a quarter the size of the RAM (roughly), since swapping
out really defeats the point of the cache.

If I had bigger caches, I could just hope that the sum of the actually used memory would not
be larger than RAM (or else it'd start swapping out, or if I run without swap - trigger the
OOM). With this patch, I can safely overcommit - and when I get the notification, I can cut
down on the caches and survive. The analysis step of the app will be slower, but my user will
not have been near a crash - and the analysis would have been slower anyway because of
swapping. One difference is that he can run without swap in a much safer manner. Anyway,
this is not at all a case of a partial crash. You could argue that it is a case of sloppy
programming, but why should I reinvent memory managing? I'd much rather let the kernel do that
for me.

Oh, and lastly - on embedded platforms, you often have to do without swap. Swapping to flash
does not strike me as a good idea.

Requesting 'real' memory

Posted Feb 8, 2008 5:00 UTC (Fri) by goaty (guest, #17783) [Link] (1 responses)

Don't forget about stack! For heavily multi-threaded processes on Linux, stack is usually the
biggest user of virtual memory.

As an aside, I know some operating systems use a "guard page" stack implementation, where
writes into the first page off the bottom (top?) of the stack trigger a page fault, which
allocates another page worth of real memory, and also moves the "guard page". The benefit
being that virtual memory use is much closer to actual memory use, and turning off overcommit
is much more viable. The downside is that the ABI requires the compiler to generate code to
probe every page when it needs to allocate a stack frame larger than the page size. Which
ironically can end up using more real memory than Linux's virtual-memory-hungry approach.

Requesting 'real' memory

Posted Feb 8, 2008 21:23 UTC (Fri) by nix (subscriber, #2304) [Link]

Linux has used a guard-page stack implementation since forever. It's 
transparent to userspace: the compiler doesn't need to do a thing.

(Obviously when threading things get more complex.)

Requesting 'real' memory

Posted Feb 1, 2008 0:01 UTC (Fri) by zooko (guest, #2589) [Link] (13 responses)

You can set the vm.overcommit_memory sysctl to policy #2.  Unfortunately, it isn't entirely clear
that this will banish the OOM killer entirely, or if it will just make it very rare.

http://www.linuxinsight.com/proc_sys_vm_overcommit_memory...
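(A minimal sketch of flipping that switch programmatically -- it needs root; under policy 2 the commit limit is swap plus overcommit_ratio percent of RAM:)

    #include <stdio.h>

    /* Sketch: enable strict overcommit accounting (policy 2). */
    static int write_proc(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fputs(val, f);
        return fclose(f);
    }

    int main(void)
    {
        if (write_proc("/proc/sys/vm/overcommit_memory", "2\n"))
            return 1;
        /* Optionally count none of RAM toward the limit (swap only): */
        return write_proc("/proc/sys/vm/overcommit_ratio", "0\n") ? 1 : 0;
    }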

Requesting 'real' memory

Posted Feb 1, 2008 1:21 UTC (Fri) by zlynx (guest, #2285) [Link] (3 responses)

I ran my Linux laptop with strict overcommit enabled for a while.  Unfortunately, it does not
help.  Almost all desktop applications expect memory allocation to succeed.  From some of the
application errors I saw, developers seem to have become very lax about checking for NULL from
malloc.

C++ and Python applications did better, because they get an exception, and they have to do
*something* with it.

Requesting 'real' memory

Posted Feb 1, 2008 3:28 UTC (Fri) by zooko (guest, #2589) [Link] (2 responses)

Even if what you say is true, I would think that this would make the effects of memory
exhaustion more deterministic/reproducible/predictable.

C++ and Python apps, and also C apps that use malloc sparingly, would be less likely to crash
than others, I guess.

Perhaps this degree of predictability isn't enough to be useful.

Requesting 'real' memory

Posted Feb 1, 2008 17:56 UTC (Fri) by zlynx (guest, #2285) [Link] (1 responses)

I did not notice any extra predictability.  The effect was that the desktop programs crash
apparently randomly.  It was much like the OOM killer.  And just like the OOM killer, it was
generally the big stuff that blew up, like Evolution and Open Office.  I lost gnome-terminal a
few times.

The C++ and Python apps still crashed, they were simply more polite about it.

By the way, I don't read it that way, but your phrasing "Even if what you say is true" *could*
be offensive.  It seems to be saying that I wrote untruthfully.

Even if you don't see the same effect on your system, I did see it just the way I described it
on mine.

Requesting 'real' memory

Posted Feb 1, 2008 19:34 UTC (Fri) by giraffedata (guest, #1954) [Link]

Desktop applications aren't where I would expect to see deterministic memory allocation exploited. Allocation failures and crashes aren't such a big deal with these applications because if things fall apart, there's a user there to pick up the pieces. Overallocation and OOM Killer may well be the optimum memory management scheme for desktop systems.

Where it matters is business-critical automated servers. For those, application writers do spend time considering running out of memory -- at least they do in cases where an OOM killer doesn't make it all pointless anyway. They check the success of getting memory and do it at a time when there is some reasonable way to respond to not getting it.

And they shouldn't spend time worrying about freeing up swap space for other processes (i.e. mem_notify is no good). That resource management task belongs to the kernel and system administrator.

Requesting 'real' memory

Posted Feb 1, 2008 20:14 UTC (Fri) by giraffedata (guest, #1954) [Link] (8 responses)

You can set the vm.overcommit_memory sysctl to policy #2. Unfortunately, it isn't entirely clear that this will banish the OOM killer entirely, or if it will just make it very rare.

It's entirely clear to me that it banishes the OOM killer entirely. The only reason the OOM killer exists is that sometimes the processes use more virtual memory than there is swap space to put its contents. With Policy 2, virtual memory isn't created in the first place unless there is a place to put the contents.

Requesting 'real' memory

Posted Feb 1, 2008 20:47 UTC (Fri) by zooko (guest, #2589) [Link] (7 responses)

But doesn't the kernel itself dynamically allocate memory?  And when it does so, can't it
thereby use up memory so that some user process will be unable to use memory that it has
already malloc()'ed?  Or do I misunderstand?

Requesting 'real' memory

Posted Feb 1, 2008 21:26 UTC (Fri) by giraffedata (guest, #1954) [Link] (6 responses)

The kernel reserves at least one page frame for anonymous virtual memory (actually, it's a whole lot more than that, but in theory one frame is enough for all the processes to access all their virtual memory as long as there is adequate swap space).

So any kernel real memory allocation can fail, and the code is painstakingly written to allow it to handle that failure gracefully (more gracefully than killing an arbitrary process). It allocates memory ahead of time so as to avoid deadlocks and failures at a time that there is no graceful way to handle it.

Requesting 'real' memory

Posted Feb 1, 2008 21:50 UTC (Fri) by zooko (guest, #2589) [Link] (5 responses)

Right, but I wasn't asking about the kernel's memory allocation failing -- I was asking about
the kernel's virtual memory allocation succeeding by using memory that had already been
offered to a process as the result of malloc().

Oh -- perhaps I misunderstood and you were answering my question.  Are you saying that the
kernel will fail to dynamically allocate memory rather than allocate memory which has already
been promised to a process (when overcommit_memory == 2)?

Thanks,

Zooko

Requesting 'real' memory

Posted Feb 1, 2008 23:05 UTC (Fri) by giraffedata (guest, #1954) [Link]

The kernel doesn't use virtual memory at all (well, to be precise let's just say it doesn't use paged memory at all). The kernel's memory is resident from the moment it is allocated, it can't ever be swapped out, and the kernel uses no swap space.

Requesting 'real' memory

Posted Feb 5, 2008 23:05 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

the problem that you will have when you disable overallocating memory is that when your 200M
firefox process tries to spawn a 2k program (to handle some mime type) it first forks, and
will need 400M of ram, even though it will immediately exec the 2k program and never touch the
other 399.99M of ram.

with overallocation enabled this will work. with it disabled you have a strong probability of
running out of memory instead.

yes, it's more reproducible, but it's also a completely avoidable failure.

Requesting 'real' memory

Posted Feb 6, 2008 3:25 UTC (Wed) by giraffedata (guest, #1954) [Link] (2 responses)

I wonder why we still have fork. As innovative as it was, fork was immediately recognized, 30 years ago, as impractical. vfork took most of the pain away, but there is still this memory resource allocation problem, and some others, and fork gives us hardly any value. A fork-and-exec system call would fix all that.

Meanwhile, if you have the kind of system that can't tolerate even an improbable crash, and it has processes with 200M of anonymous virtual memory, putting up an extra 200M of swap space which will probably never be used is a pretty low price for the reliability of guaranteed allocation.
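(POSIX does specify a combined fork-and-exec, posix_spawn(); whether it actually avoids the commit doubling depends on whether the C library implements it with fork, vfork, or a dedicated system call. A minimal sketch:)

    #include <spawn.h>
    #include <sys/wait.h>

    extern char **environ;

    /* Sketch: spawn /bin/true without an explicit fork in the caller. */
    int main(void)
    {
        pid_t pid;
        char *argv[] = { "/bin/true", NULL };

        if (posix_spawn(&pid, "/bin/true", NULL, NULL, argv, environ) != 0)
            return 1;          /* failure leaves no half-forked child */
        waitpid(pid, NULL, 0);
        return 0;
    }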

Requesting 'real' memory

Posted Feb 6, 2008 5:26 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

many people would disagree with your position that vfork is better than fork. (the issue came
up on the lkml within the last week and was dismissed with something along the lines of 'vfork
would avoid this, but the last thing we want to do is to push more people to use vfork')

I agree that a fexec (fork-exec) or similar call would be nice to have, but it wouldn't do
much good for many years (until a significant amount of software actually used it)

as for your comment that one can just add swap space to avoid problems with strict memory allocation:

overcommit will work in every case where strict allocation works without giving out-of-memory
errors, and it will also work in many (but not all) cases where strict allocation would result
in out-of-memory errors.

if it's trivial to add swap space to avoid the OOM errors in strict allocation, that same swap
space can be added along with overcommit and the system will continue to work in even more
cases.

the only time strict allocation will result in a more stable system is when your resources are
fixed and your applications are fairly well behaved (and properly handle OOM conditions). even
then the scenario of one app allocating 99% of your ram, preventing you from running other
apps, is still a very possible situation. the only difference is that the timing of the OOM
error is more predictable (assuming that you can predict what software will be run when in the
first place)

Requesting 'real' memory

Posted Feb 7, 2008 0:35 UTC (Thu) by giraffedata (guest, #1954) [Link]

Many people would disagree with your position that vfork is better than fork

No, they wouldn't, because I was talking about the early history of fork and comparing the original fork with the original vfork. The original fork physically copied memory. The original vfork didn't, making it an unquestionable improvement for most forks. A third kind of fork, with copy-on-write, came later and obsoleted both. I didn't know until I looked it up just now that a distinct vfork still exists on modern systems.

the only time strict allocation will result in a more stable system is when your resources are fixed and your applications are fairly well behaved (and properly handle OOM conditions)

The most important characteristic of a system that benefits from strict allocation is that there be some meaningful distinction between a small failure and a catastrophic one. If all your memory allocations must succeed for your system to meet requirements, then it's not better to have a fork fail than to have some process randomly killed, and overallocation is better because it reduces the probability of failure.

But there are plenty of applications that do make that distinction. When a fork fails, such an application can reject one piece of work with a "try again later" and a hundred of those is more acceptable than one SIGKILL.

Avoiding the OOM killer with mem_notify

Posted Jan 31, 2008 15:12 UTC (Thu) by salimma (subscriber, #34460) [Link]

Nokia has something similar on their Linux-based Maemo platform -- run it without swap, start
a bunch of applications, and a lot of the built-in applications would enter a
reduced-memory-usage mode -- noticeable because it takes much longer to switch to them than it
normally would.

I wonder whether the apps currently just poll the system to find out how much memory is left,
or whether they have their own mechanism, though.

You forget about higher god!

Posted Jan 31, 2008 9:47 UTC (Thu) by khim (subscriber, #9252) [Link]

Applications are always under threat from the OOM-killer, but there is another, more powerful god: the end user! If an application does not play nice and forces other applications to be killed by the OOM-killer (one way or another), then eventually this information reaches the user and the application is either silenced forever or fixed. So the applications (or rather the application writers) have every incentive to play well with others...

memory congestion avoidance

Posted Jan 31, 2008 10:01 UTC (Thu) by sasha (guest, #16070) [Link]

This whole topic looks to me like the TCP congestion avoidance protocol.  Currently, we are
at the early start of its development -- we are going to get a notification that the congestion
exists.  However, it is not enough to have a notification system, because different
applications are going to play by different rules.

So, I'm looking forward to some clear rules for "memory congestion avoidance" written down as
a standard and implemented in most of the high-level languages, especially languages with
garbage collectors.

Avoiding the OOM killer with mem_notify

Posted Jan 31, 2008 17:06 UTC (Thu) by jzbiciak (guest, #5246) [Link] (2 responses)

How effective can this be, though, for many C programs? If I malloc a bunch of memory, perhaps as caches, and then am asked to free it, that doesn't magically release pages back to the OS. Now, if malloc uses mmap for some of the larger allocations, those can be released back to the OS by munmap. But, for the general sbrk managed heap, I have to free stuff near the end of the heap before I can ask for my brk to be lowered. There's no guarantee I can do that.

For this to be useful, whatever I malloc needs to have an additional level of indirection in user space, so I can move the objects I wish to keep and then compact the heap. Otherwise, simply freeing stuff up won't be enough.
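(One workaround along those lines: give the discardable cache its own anonymous mapping, so the pages can be handed straight back to the kernel. A sketch:)

    #include <sys/mman.h>
    #include <stddef.h>

    #define CACHE_SIZE (64UL * 1024 * 1024)

    static void *cache;

    /* Sketch: keep the cache out of the sbrk heap entirely. */
    int cache_init(void)
    {
        cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return cache == MAP_FAILED ? -1 : 0;
    }

    void cache_discard(void)
    {
        /* Drop the backing pages but keep the address range around
         * for refilling; munmap(cache, CACHE_SIZE) would drop both. */
        madvise(cache, CACHE_SIZE, MADV_DONTNEED);
    }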

It may be useful to compare/contrast this to the HURD's approach, which is simply to force user space to do its own VM management. There, the kernel and user-space dicker about physical pages only, and user space figures out how best to handle the burden when a given app wants more pages than the OS can give it. The answer could be garbage collection, discarding caches, swapping or whatever makes sense to a given application.

The main thing is that the app knows way ahead of time that real RAM is in short supply, and avoids getting into the overcommitted state entirely. And since the kernel isn't doing the swapping, it seems like you wouldn't get into situations where you need to free memory in order to have enough memory to write out pages and the like. Example: Imagine that to wake an app so it can free some pages, you have to bring it in from swap, but swap is too full to write any dirty anonymous pages out. If your policy is that each app self-swaps, this should never happen since the OS guarantees it'll have enough pages to do its work, and user space will just muddle along with what it's given. (In theory, it seems like a user space app could get by with just a few pages... a couple executable pages and a couple data pages.)

I'm guessing mem_notify will try to wake apps sufficiently far ahead that it can avoid those "need RAM to free RAM" situations in practice, but setting proper thresholds seems like it ought to be rather tricky.

Avoiding the OOM killer with mem_notify

Posted Feb 1, 2008 3:03 UTC (Fri) by vomlehn (guest, #45588) [Link] (1 responses)

> I'm guessing mem_notify will try to wake apps sufficiently far ahead that it can avoid those
"need RAM to free RAM" situations in practice, but setting proper thresholds seems like it
ought to be rather tricky

I don't see how the kernel can possibly know enough for it to notify applications far enough
ahead about the need to free memory; the memory allocation behavior of applications is just
too unpredictable.

An approach that seems better would be to notify the kernel that certain
pages in your application are being used to cache data. The kernel is then free to simply grab
them if it needs them. If your application decides it needs the data later, it uses a system
call to notify the kernel that the pages are no longer being used as a cache. If the kernel
didn't need the pages, they would still have their old data and the application could use them
directly.

On the other hand, if the kernel did have to grab the pages in the interim, the system call
used to grab the pages back would return an error. Your application would then know it needs
to remap the pages and regenerate the data. Of course, it's possible the pages can't be
remapped because memory is too low. The application would handle that as though the data
wasn't cached and it couldn't get the memory to read it. It already has to be able to do this,
so this doesn't add to the application's complexity.

The advantages of this approach are that the pages are immediately available to the kernel
without having to wake the process up. No need to figure out complex thresholds, no need to
allocate enough memory for the process to run, no delay in making the needed memory available.
You could even allow for priorities when telling the kernel the pages are being used for cache
so that the kernel would grab lower priority pages first.
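(A sketch of how such an interface might look from user space; these system calls are purely hypothetical -- nothing like them exists in the kernel:)

    #include <stddef.h>

    /* Hypothetical system calls implementing the scheme above. */
    int mark_as_cache(void *addr, size_t len, int priority);
    int reclaim_cache(void *addr, size_t len);

    void use_cached_data(void *addr, size_t len)
    {
        if (reclaim_cache(addr, len) == 0) {
            /* The kernel never took the pages: data is intact, use it. */
        } else {
            /* The pages were grabbed under pressure: remap and
             * regenerate, as if the data had never been cached. */
        }
    }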

I wish I had the time to code this and submit it because I think that mem_notify is an awful
botch that will cause unending pain as people add patch on patch to try to make it work. But
that's just my personal opinion...

Avoiding the OOM killer with mem_notify

Posted Feb 1, 2008 19:47 UTC (Fri) by zlynx (guest, #2285) [Link]

Applications can already see if they're missing memory pages by using mincore().
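(For example, a sketch that counts how many pages of a mapping are currently resident; addr must be page-aligned:)

    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdlib.h>

    /* Sketch: return the number of resident pages in [addr, addr+len). */
    long resident_pages(void *addr, size_t len)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t pages = (len + psz - 1) / psz;
        unsigned char *vec = malloc(pages);
        long count = 0;

        if (!vec || mincore(addr, len, vec) != 0) {
            free(vec);
            return -1;
        }
        for (size_t i = 0; i < pages; i++)
            count += vec[i] & 1;   /* low bit set == page is in RAM */
        free(vec);
        return count;
    }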

You may not have been carefully reading the mem_notify patch descriptions.

What it does is trigger on memory pages going into the inactive list.  This is what happens to
prepare memory pages that are good candidates for swapping.

Here is the Changelog from version 5 of the mem_notify patch, see the v3 changes:
Changelog
-------------------------------------------------
  v4 -> v5 (by KOSAKI Motohiro)
    o rebase to 2.6.24-rc8-mm1
    o change display order of /proc/zoneinfo
    o ignore very small zone
    o support fcntl(F_SETFL, FASYNC)
    o fix some trivial bugs.

  v3 -> v4 (by KOSAKI Motohiro)
    o rebase to 2.6.24-rc6-mm1
    o avoid wake up all.
    o add judgement point to __free_one_page().
    o add zone awareness.

  v2 -> v3 (by Marcelo Tosatti)
    o changes the notification point to happen whenever
      the VM moves an anonymous page to the inactive list.
    o implement notification rate limit.

  v1(oom notify) -> v2 (by Marcelo Tosatti)
    o name change
    o notify timing change from just swap thrashing to
      just before thrashing.
    o also works with swapless device.

Avoiding the OOM killer with mem_notify

Posted Feb 1, 2008 4:57 UTC (Fri) by ikm (guest, #493) [Link] (5 responses)

> When memory gets tight, it is quite possible that applications have memory allocated—often
caches for better performance—that they could free.

Not many programs actually have any adjustable caches. More often it would be some useless
unreclaimed junk (and of course I'm talking Java here). So this change should primarily
benefit them, I guess.

Avoiding the OOM killer with mem_notify

Posted Feb 1, 2008 13:27 UTC (Fri) by nix (subscriber, #2304) [Link]

I don't think I've ever written a substantial program that didn't have 
*some* sort of caching in it, to trade off space against time somewhere 
where `spend the time, every time' was undesirable. Often these caches are 
not expected to be terribly large, and have the exciting expiration policy 
`never', but it would be fairly trivial to respond to a mem_notify signal 
by just ditching the entire contents of all of those caches.

Avoiding the OOM killer with mem_notify

Posted Feb 1, 2008 21:36 UTC (Fri) by droundy (subscriber, #4559) [Link] (3 responses)

There are certain obscure programs like firefox and gimp that have very large caches which
could be dumped under pressure.

Avoiding swap IO with mem_notify

Posted Feb 2, 2008 17:49 UTC (Sat) by riel (subscriber, #3142) [Link] (2 responses)

The patch series is indeed designed primarily to increase system performance by avoiding the
IO penalty of swapping out (and back in) memory that contains data that is useless or can be
easily recalculated.

Decompressing (part of) a jpeg just has to be faster than swapping in something from disk,
simply because disk seek times are on the order of 10ms.

Avoiding the OOM killer is a secondary goal.  I am not sure why that is the headline of the
article...

Avoiding swap IO with mem_notify

Posted Feb 3, 2008 4:14 UTC (Sun) by njs (subscriber, #40338) [Link]

Oh!  This makes *much* more sense.  (Especially the otherwise unintelligible part of the
original article that talks about pages getting swapped out, which has nothing to do with
OOM.)

In fairness, though, the LKML patch announcement just talks about it being good to avoid the
OOM.

Avoiding swap IO with mem_notify

Posted Feb 3, 2008 21:02 UTC (Sun) by oak (guest, #2786) [Link]

The article also talks about embedded systems. Those use flash, which 
doesn't suffer from the seek problem like hard disks do.  On embedded 
systems memory usage is much more of a problem, though, and the kernel 
gets pretty slow too on devices without swap when memory gets really tight 
(all the kernel does is page read-only pages from disk to memory and then 
discard them again until it finally does an OOM-kill).

I thought the point of the patch is for user-space to be able to do the 
memory management in *manageable places* in code.   As mentioned earlier, 
a lot of user-space code[1] doesn't handle memory allocation failures. And 
even if it's supposed to, it can be hard to verify (test) that the 
failures are handled properly in *all* cases.  If user-space can get a 
pre-notification of a low-memory situation, it can free memory at a 
suitable place in the code so that further allocations will succeed (with 
higher probability).

That also allows doing something like what maemo does.  If the system gets 
notified about a kernel low-memory shortage, it kills processes which have 
notified it that they are in a "background-killable" state (saved their UI 
state, able to restore it, and not currently visible to the user). I think 
it also (currently) notifies applications about the low-memory condition 
through D-BUS. Applications visible to the user, or otherwise not 
background-killable, are then supposed to free their caches and/or disable 
features that could take a lot of additional memory.  If the caches are 
from the heap instead of memory-mapped, it's less likely to help, though, 
because of heap fragmentation and the extra work/time required.

[1] Glib and anything built on top of it, like Gtk, assume that if the 
process is still running, it got the memory; otherwise it's aborted.

Avoiding the OOM killer with mem_notify

Posted Feb 7, 2008 16:04 UTC (Thu) by ringerc (subscriber, #3071) [Link]

As a (fairly bad) application developer I'm not sure I understand how I can use this
effectively.

Say I'm notified that the kernel is running low on RAM. Furthermore, say I have a 256MB block
- a single allocation - and a bunch of small heap allocations before and after it in the heap.
Assuming I can afford to throw away the 256MB allocation I can delete() or free() it; however,
since I have more memory both higher and lower in the heap I don't see how the OS can reclaim
the RAM. I presume it relies on the huge allocation being a separate memory mapping (where
anonymous mmap() has been used by operator new or by malloc()) that can be unmapped as per
/proc/self/maps?

So ... what about if my cache is a tree of heap-allocated objects of variable sizes (say, a
tree of polymorphic subtype instances)? Free()ing these objects is unlikely to help, since
many pages will have other things allocated on them too, and in any case the OS has no way to
know if a given page is free. This is especially likely when using the libstdc++ new(), which
(I understand) internally pre-allocates chunks of memory and parcels them out without invoking
lower level memory management. My understanding was that in this case all the kernel can do is
swap out the inactive page(s). I presume that to benefit from mem_notify I'd need to modify
the application to perform single large allocations dedicated only to use for this particular
cache, presumably either managing it with a C++ allocator interface (with the STL/Boost/etc)
or manually manage allocations within the block?

We all know it's often cleaner to use a container or an allocator in C++ when you want to
manage lots of small allocations of a particular type (especially when each is a fixed size
but they happen at unpredictable times) ... but many people don't, and in fact don't
understand them at all. It doesn't help that the STL and core language provide absolutely no
flexible pool allocators etc; you have to go to Boost for those, and many OSS / Free Software
projects are strangely reluctant to add a dependency on Boost.

What might help would be if there was a tiny portable C/C++ library (suitable to be compiled
directly into apps) that provided entirely standard and highly portable routines for non-Linux
platforms, and on Linux provided mem-pressure-aware C++ allocators and C memory allocation
calls + notifier callback hooks. All with identical interfaces, of course, though the
non-Linux ones would never actually detect and respond to memory pressure (unless other
platform support was added). A set of C++ "cache allocators" for various usage patterns that
could monitor memory pressure and automatically notify interested parties then invalidate and
release themselves would be particularly cool.

--
Craig Ringer


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds