
Toward a smarter OOM killer

Posted Nov 4, 2009 17:02 UTC (Wed) by holstein (guest, #6122)
In reply to: Toward a smarter OOM killer by mjthayer
Parent article: Toward a smarter OOM killer

Yes, exactly.

For a server, that would let the sysadmin log the event, react to it, etc.

For a desktop, one can imagine a popup warning the user, perhaps with a list of memory-hog processes. That would let the user rely on Firefox's session support to restart its browsing session...

In many cases, restarting a single guilty process can be enough to prevent an OOM killing spree.



Toward a smarter OOM killer

Posted Nov 4, 2009 17:22 UTC (Wed) by mjthayer (guest, #39183) [Link] (37 responses)

One of my favourite grouses applies here though - at least on my system, I can be sure that a swap-to-death will start long before the OOM killer does, or before any notification would occur. For me, that would be something much more urgent to fix.

I think it might be doable by making the algorithm that chooses pages to swap out smarter, so that each process is guaranteed a certain amount of resident memory (depending on the number of other processes and logged-in users, and probably a few other factors), and a process that tries to hog memory ends up swapping out its own pages once the other running processes have dropped to their guaranteed minimum. If I ever have time, I will probably even try coding that up.

death by swap

Posted Nov 4, 2009 18:12 UTC (Wed) by jabby (guest, #2648) [Link] (36 responses)

Indeed. Many systems are installed with way too much swap space, such that swap thrashing begins killing performance before the OOM killer can be invoked. So, this is a plea to other people to

STOP USING the old "TWICE RAM" guideline!

That dates back to when 64MB of RAM was considered "beefy". I seriously had to provision a server last night with 16GB of physical memory and the customer wanted a 32GB swap partition!! Seriously?! If your system is still usable after you're 2 to 4 gigs into your swap, I'd be shocked.

death by swap

Posted Nov 4, 2009 18:21 UTC (Wed) by mjthayer (guest, #39183) [Link] (1 responses)

I think Ubuntu still do that by default :) Yes, the workaround for swapping to death is allocating little or no swap. I would hope, though, that my "algorithm" above would reduce the need for such workarounds, by making sure that all non-hogging processes can keep their working set in real memory.

Not sure that 32GB of swap would be appropriate even then though...

death by swap

Posted Nov 6, 2009 11:40 UTC (Fri) by patrick_g (subscriber, #44470) [Link]

>>> I think Ubuntu still do that by default :)

Yes Ubuntu still do that by default...despite many bug reports like mine.
Note that my bug report is old, very old (pre-Gutsy time) => https://bugs.launchpad.net/ubuntu/+source/partman-auto/+bug/134505
No reaction at all from the Ubuntu devs....very discouraging.

death by swap

Posted Nov 4, 2009 18:31 UTC (Wed) by clugstj (subscriber, #4020) [Link]

"If your system is still usable after you're 2 to 4 gigs into your swap, I'd be shocked."

It all depends on the workload.

death by swap

Posted Nov 4, 2009 18:36 UTC (Wed) by ballombe (subscriber, #9523) [Link] (4 responses)

If you use suspend-to-disk, you need a large amount of swap-space anyway.

death by swap

Posted Nov 5, 2009 6:54 UTC (Thu) by gmaxwell (guest, #30048) [Link] (3 responses)

Or using tmpfs for /tmp

death by swap

Posted Nov 5, 2009 11:11 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Which really should be the default. /var/tmp is for large temporary files.

death by swap

Posted Nov 5, 2009 18:25 UTC (Thu) by khc (guest, #45209) [Link] (1 responses)

How does tmpfs for /tmp help suspend to disk?

death by swap

Posted Nov 5, 2009 18:46 UTC (Thu) by nix (subscriber, #2304) [Link]

I think his point was that using tmpfs for /tmp needs a large amount of
swap :)

death by swap

Posted Nov 4, 2009 18:49 UTC (Wed) by knobunc (guest, #4678) [Link] (2 responses)

I agree with you... unless you want to hibernate. In which case you need at least the same amount of swap as RAM. But if you actually want to use that swap whilst running, you'd need a bit more. So now you are back to somewhere between 1x and 2x RAM.

death by swap

Posted Nov 4, 2009 19:59 UTC (Wed) by zlynx (guest, #2285) [Link] (1 responses)

Doesn't the user-space hibernate still exist? Didn't that allow hibernate to a file, with compression and everything?

I don't know, I haven't run a Linux laptop in almost a year now since an X.org bug killed my old laptop by overheating it.

death by swap

Posted Nov 5, 2009 18:22 UTC (Thu) by nix (subscriber, #2304) [Link]

tuxonice allows that too. But just because you *can* hibernate to a file
doesn't mean you *must*.

death by swap

Posted Nov 4, 2009 21:43 UTC (Wed) by drag (guest, #31333) [Link] (11 responses)

I don't know how the particular algorithm works, but I really doubt that the amount of swap space dictates how likely the Linux kernel is to swap, or how much swap it wants to use.

I expect the only time it'll want to use swap in a busy system is if the
active amount of used memory exceeds the amount of main memory.

death by swap

Posted Nov 4, 2009 22:54 UTC (Wed) by mjthayer (guest, #39183) [Link] (10 responses)

The problem with the amount of swap space is that the more there is, the longer it takes until the OOM killer is fired up. If you have 2GB of swap, then it will be delayed by at least the time it takes to write a gig or so of data to the disk, page by page with pauses for running the swapping thread in-between each page, and your system will be unusable until then.

The algorithm is roughly as follows.

* Assign each process a contingent of main memory, e.g. by dividing the total available by the number of users with active running processes, and giving each process an equal share of the contingent of the user running it.
* When a page of memory is to be evicted to the swap file, make sure either that it has not been accessed for a certain minimum length of time, or that the process owning it is over its contingent, or that it is owned by the process on whose behalf it is being swapped out. If not, search for another page to evict.

This should mean that if a process starts leaking memory badly or whatever, after a while it will just loop evicting its own pages and not trouble the other processes on the system. It should also mean that all not-too-large processes on the system should stay reasonably snappy, making it easier to find and kill the out-of-control process.
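
A purely illustrative user-space sketch of the eviction check described above; the structures, field names, and the 30-second idle threshold are invented for the example, and real kernel page reclaim is far more involved:

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical per-page and per-process bookkeeping for the sketch. */
    struct page_info {
        pid_t owner;            /* process owning the page */
        unsigned long age_secs; /* seconds since the page was last accessed */
    };

    struct proc_info {
        pid_t pid;
        unsigned long resident_kb;   /* current resident set size */
        unsigned long contingent_kb; /* guaranteed share of main memory */
    };

    #define MIN_IDLE_SECS 30 /* assumed "not accessed recently" threshold */

    /* Return true if this page may be evicted under the proposed policy:
     * it is old enough, or its owner is over its contingent, or it belongs
     * to the process on whose behalf we are reclaiming. */
    static bool may_evict(const struct page_info *pg,
                          const struct proc_info *owner,
                          pid_t reclaiming_for)
    {
        if (pg->age_secs >= MIN_IDLE_SECS)
            return true;
        if (owner->resident_kb > owner->contingent_kb)
            return true;
        if (pg->owner == reclaiming_for)
            return true;
        return false; /* otherwise, look for another victim page */
    }

    int main(void)
    {
        struct proc_info hog = { .pid = 1234, .resident_kb = 900000,
                                 .contingent_kb = 500000 };
        struct page_info pg = { .owner = 1234, .age_secs = 2 };

        /* A recently used page of an over-contingent process is still fair game. */
        printf("evict? %s\n", may_evict(&pg, &hog, 4321) ? "yes" : "no");
        return 0;
    }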

death by swap

Posted Nov 4, 2009 22:56 UTC (Wed) by mjthayer (guest, #39183) [Link]

Oh yes, and of course processes with a working set of less than their contingent will not hog the unused part when memory gets tight.

death by swap

Posted Nov 7, 2009 1:21 UTC (Sat) by giraffedata (guest, #1954) [Link] (8 responses)

If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted. You're also wasting the swap I/O it's doing. After a while, after other processes have had a chance to progress, you can swap them out and give the first process the memory it needs. If you can't do that because it's run amok and simply demands more memory than you can afford, that's when you kill that process.

Algorithms for this were popular in the 1970s for batch systems. Unix systems were born as interactive systems where the idea of not dispatching a process at all for ten seconds was less palatable than making the user kill some stuff or reboot, but with Unix now used for more diverse things, I'm surprised Linux has never been interested in long term scheduling to avoid page thrashing.
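
A rough user-space approximation of that long-term scheduling idea, assuming some external monitor has already identified the over-quota process (its pid is passed on the command line): stop it so its pages age out and can be reclaimed, let the other processes make progress, then resume it.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid-of-thrashing-process>\n", argv[0]);
            return 1;
        }
        pid_t victim = (pid_t)atoi(argv[1]);

        /* Stop the process entirely; with nothing touching them, its resident
         * pages will age out and be reclaimed ahead of the pages of running tasks. */
        if (kill(victim, SIGSTOP) != 0) {
            perror("SIGSTOP");
            return 1;
        }

        sleep(10); /* give the other processes a chance to make progress */

        /* Resume it once memory pressure has (hopefully) eased. */
        if (kill(victim, SIGCONT) != 0) {
            perror("SIGCONT");
            return 1;
        }
        return 0;
    }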

death by swap

Posted Nov 7, 2009 3:39 UTC (Sat) by tdwebste (guest, #18154) [Link]

On embedded devices I have constructed processing states with runit to control the running processes. This simple but effective long-term scheduling to avoid out-of-memory/swapping situations works well when you know in advance what processes will be running on the device.

death by swap

Posted Nov 7, 2009 10:01 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

if the program you are stealing the memory from needs it to make progress, you would be right.

but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.

and the poor programming practices that are common today make this even more true

death by swap

Posted Nov 7, 2009 17:06 UTC (Sat) by giraffedata (guest, #1954) [Link]

> but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.

I think you didn't follow the scenario. We're specifically talking about a page that is likely to be needed right away. It's a page that the normal page replacement policy would have left alone because it expected it to be needed soon -- primarily because it was accessed recently.

But the proposed policy would steal it anyway, because the process that is expected to need it is over its quota and the policy doesn't want to harm other processes that aren't.

What was known in the 70s was that at any one time, a program has a subset of memory it accesses a lot, which was dubbed its working set. We knew that if we couldn't keep a process' working set in memory, it was wasteful to run it at all. It would page thrash and make virtually no progress. Methods abounded for calculating the working set size, but the basic idea of keeping the working set in memory or nothing was constant.

death by swap

Posted Nov 9, 2009 9:02 UTC (Mon) by mjthayer (guest, #39183) [Link] (4 responses)

> If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted.
I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The second is that the process is under control, but its working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.

While we are on the subject, does anyone reading this know where RSS quotas are handled in the current kernel code? I was able to find the original patches enabling them, but the code seems to have changed out of recognition since then.

death by swap

Posted Nov 9, 2009 12:34 UTC (Mon) by hppnq (guest, #14462) [Link]

You may want to look at Documentation/cgroups/memory.txt. Otherwise, it seems there is no way to enforce RSS limits. Rik van Riel wrote a patch a few years ago but it seems to have been dropped.

Personally, I would hate to think that my system spends valuable resources managing runaway processes. ;-)
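
For what it's worth, a minimal sketch of capping a process with the memory cgroup controller, assuming a v1-style controller already mounted at /sys/fs/cgroup/memory (the mount point varies by system); the group name "leaky" and the 512MB limit are arbitrary:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Assumed mount point of the memory controller; adjust to your system. */
    #define CG "/sys/fs/cgroup/memory/leaky"

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(val, f) >= 0) ? 0 : -1;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        /* Create a group, cap it at 512 MB, and move the current process
         * into it; anything it forks inherits the limit. */
        if (mkdir(CG, 0755) != 0)
            perror("mkdir"); /* may already exist */
        write_str(CG "/memory.limit_in_bytes", "536870912");

        char pid[32];
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        return write_str(CG "/tasks", pid) ? 1 : 0;
    }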

** Encouragement encouragement encouragement **

Posted Nov 13, 2009 22:32 UTC (Fri) by efexis (guest, #26355) [Link] (1 responses)

I (for one) would be most interested in your work. The systems I manage are very binary in whether they behave or not, because I have configured them to behave (i.e. given how much memory is available, decide how much to give to database query caching etc., so everything just works). I try to keep the swap file around 384MB whether the system has 1G or 8G of RAM, because that's a nice size for swapping out stuff you don't need to keep in memory, but using disk as virtual RAM is just way too slow; I'd prefer processes be denied memory requests than have them granted at the cost of slowing the whole system down. But all in all, because everything is set up for the amount of memory available, the only time I get into OOM situations is when there is a runaway process (I manage systems hosting small numbers of database-driven websites; some of them may be developed on Windows systems and then moved to the Linux system, and most are written in PHP, which has a very low bar of entry, so the developers often do not have a clue when it comes to writing scalable code).

So, what I would want is something that assumes that most of the system is well behaved, but will quickly chop off anything that is not, and will stop the badly behaved stuff from dragging the well behaved stuff down with it. The well behaved stuff quite simply doesn't need managing; that's my job. The badly behaved stuff needs taking care of quickly, by something that your idea seems to reflect *perfectly* (it's not often you read someone's ideas and your brain flips "that's -exactly- what I need").

How would I find out if you do get the chance to hammer out the code that achieves this? Is there a non-LKML route to watch this (please don't say twitter :-p )?

** Encouragement encouragement encouragement **

Posted Nov 16, 2009 13:45 UTC (Mon) by mjthayer (guest, #39183) [Link]

Ahem, I haven't thought that far ahead yet :) So far it is one of a few small projects I have lined up for whenever I have time, but I was posting here in order to get some feedback from wiser minds than my own before I made a start.

death by swap

Posted Nov 16, 2009 13:50 UTC (Mon) by mjthayer (guest, #39183) [Link]

>> If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted.
>I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The second is that the process is under control, but its working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.
Actually case 2 could be handled to some extent by lowering the priority of a process that kept on swapping for too long.

death by swap

Posted Nov 4, 2009 23:49 UTC (Wed) by jond (subscriber, #37669) [Link] (2 responses)

On the other hand, if you set vm_overcommit to 2, you almost certainly want
lots of swap, especially in cheapo VMs. There's a whole raft of programs
that you cannot start with say, 256M RAM and little swap without overcommit.
Mutt and irssi are two that spring to mind. Lots of swap lets you
"overcommit" with the risk being you end up swapping rather than you end up
going on a process killing spree.

death by swap

Posted Nov 6, 2009 8:46 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

You can still set the vm.overcommit_ratio to something higher. I think in most cases a vm.overcommit_memory=2 and vm.overcommit_ratio=1000 is saner than using vm.overcommit_memory=0.
The only reason this couldn't be a sane default is that on systems with 32MB an overcommit_ratio of 1000% is still too small (but even so, if you have 32MB and no swap, you're probably still better off with this limit).
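
For reference, with vm.overcommit_memory=2 the commit limit is roughly SwapTotal + RAM * overcommit_ratio / 100 (the CommitLimit line in /proc/meminfo). A small sketch of the arithmetic, with made-up numbers matching the 32MB example:

    #include <stdio.h>

    /* CommitLimit under vm.overcommit_memory=2:
     *   CommitLimit = SwapTotal + RamTotal * overcommit_ratio / 100
     * (compare the CommitLimit line in /proc/meminfo). */
    static unsigned long commit_limit_kb(unsigned long ram_kb,
                                         unsigned long swap_kb,
                                         unsigned int ratio_percent)
    {
        return swap_kb + ram_kb * ratio_percent / 100;
    }

    int main(void)
    {
        /* 32 MB of RAM, no swap, overcommit_ratio = 1000 -> about 320 MB */
        printf("CommitLimit: %lu kB\n", commit_limit_kb(32 * 1024, 0, 1000));
        return 0;
    }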

death by swap

Posted Nov 18, 2009 16:12 UTC (Wed) by pimlottc (guest, #44833) [Link]

irssi can't start on a system with 256MB RAM and no swap? Seriously?

death by swap

Posted Nov 5, 2009 17:52 UTC (Thu) by sbergman27 (guest, #10767) [Link]

When I was running our 60-desktop XDMCP/NX server on 12GB of memory, performance was fine even though swap space usage was usually 8+ GB. We ran this way for months with no complaints. Just because you're using a lot of swap space doesn't mean you are paging excessively. Note that if a page is paged out and then brought back into memory, it stays written in swap to save writing it again if that page gets paged out again. You can't tell a whole lot about how much swap you are *really* using by looking at the swap used number. sysstat monitoring and sar -W are more useful than the swap used number for assessing swapping.

I do use the twice-RAM rule. I'd rather the system get slow than crash or have the OOM killer running loose on it.
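
A small sketch of measuring actual paging activity rather than swap occupancy, by sampling the kernel's cumulative pswpin/pswpout counters in /proc/vmstat (roughly what sar -W reports as a rate); the 5-second interval is arbitrary:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Read the cumulative pages-swapped-in/out counters from /proc/vmstat. */
    static void read_swap_counters(unsigned long *in, unsigned long *out)
    {
        char key[64];
        unsigned long val;
        FILE *f = fopen("/proc/vmstat", "r");

        *in = *out = 0;
        if (!f)
            return;
        while (fscanf(f, "%63s %lu", key, &val) == 2) {
            if (strcmp(key, "pswpin") == 0)
                *in = val;
            else if (strcmp(key, "pswpout") == 0)
                *out = val;
        }
        fclose(f);
    }

    int main(void)
    {
        unsigned long in1, out1, in2, out2;

        read_swap_counters(&in1, &out1);
        sleep(5);
        read_swap_counters(&in2, &out2);
        /* High rates here mean real paging; a large "swap used" figure alone
         * does not. */
        printf("pswpin/s: %.1f  pswpout/s: %.1f\n",
               (in2 - in1) / 5.0, (out2 - out1) / 5.0);
        return 0;
    }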

death by swap

Posted Nov 5, 2009 21:14 UTC (Thu) by anton (subscriber, #25547) [Link]

I have seen several cases where a process slowly consumed more and more memory, but apparently always had a small working set, so it eventually consumed all the swap space and the OOM killer killed it (sometimes it killed other processes first, though). The machine was so usable during this that I did not notice anything was amiss until some process was missing. IIRC one of these cases happened on a machine with 24GB RAM and 48GB swap; there it took several days until the swap space was exhausted.

death by swap

Posted Nov 6, 2009 6:35 UTC (Fri) by motk (subscriber, #51120) [Link] (1 responses)

CPU states: 73.2% idle, 16.4% user, 10.4% kernel, 0.0% iowait, 0.0% swap
Memory: 64G real, 2625M free, 62G swap in use, 40G swap free

You were saying? :)

death by swap

Posted Nov 6, 2009 8:43 UTC (Fri) by mjthayer (guest, #39183) [Link]

I take it you don't have any out-of-control processes allocating memory there, do you? I wasn't saying it doesn't work in the general case :)

death by swap

Posted Nov 18, 2009 16:47 UTC (Wed) by pimlottc (guest, #44833) [Link] (2 responses)

> STOP USING the old "TWICE RAM" guideline!
You know, I've been hearing this lately, but the problem is there seems to be no consensus on what the guideline should be. Some swear by no swap at all, while others say running without at least some is dangerous. No one seems to agree on what an appropriate amount is. Until there is a new accepted rule of thumb, everyone will keep using the old one, even if it's wrong.

death by swap

Posted Nov 18, 2009 18:37 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

there used to be a technical reason for twice ram.

nowadays it depends on your system and how you use it.

if you use the in-kernel suspend code, the act of suspending will write all your ram into the swap space, so swap must be > ram

if you don't use the in-kernel suspend code you need as much swap as you intend to use. How much swap you are willing to use depends very much on your use case. for most people a little bit of swap in use doesn't hurt much, and by freeing up additional ram it results in an overall faster system. for other people the unpredictable delays in applications due to the need to pull things from swap are unacceptable. In any case, having a lot of swap activity is pretty much unacceptable for anyone.

note that if you disable overcommit you need more swap, or allocations (like when a large program forks) will fail, so you need additional swap space > the max memory footprint of any process you intend to allow to fork (potentially multiples of this). With overcommit disabled I could see you needing swap significantly higher than 2x ram in some conditions.

my recommendation is that if you are using a normal hard drive (usually including the SSD drives that emulate normal hard drives), allocate a 2G swap partition and leave overcommit enabled (and that's probably a lot larger than you will ever use)

if you are using a system that doesn't have a normal hard drive (usually this sort of thing has no more than a few gig of flash as its drive) you probably don't want any swap, and definitely want to leave overcommit on.
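
A compact restatement of those rules of thumb, purely as an illustration; the numbers are examples, not a recommendation:

    #include <stdbool.h>
    #include <stdio.h>

    /* Encodes the rules of thumb above; figures are only examples. */
    static unsigned long swap_gb(unsigned long ram_gb, bool kernel_hibernate,
                                 bool flash_only)
    {
        if (flash_only)
            return 0;          /* small flash-only device: no swap */
        if (kernel_hibernate)
            return ram_gb + 1; /* in-kernel suspend: swap must exceed RAM */
        return 2;              /* ordinary disk, overcommit on: ~2 GB */
    }

    int main(void)
    {
        printf("16 GB server, no hibernate: %lu GB swap\n",
               swap_gb(16, false, false));
        printf("4 GB laptop with hibernate: %lu GB swap\n",
               swap_gb(4, true, false));
        return 0;
    }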

death by swap

Posted Nov 19, 2009 16:46 UTC (Thu) by nye (subscriber, #51576) [Link]

>my recommendation is that if you are using a normal hard drive (usually including the SSD drives that emulate normal hard drives), allocate a 2G swap partition and leave overcommit enabled (and that's probably a lot larger than you will ever use)

FWIW, I agree, except that I'd make it a file instead of a partition - it's just as fast, and it leaves some flexibility just in case.

I use a 2GB swapfile on machines ranging from 256MB to 8GB of RAM - it may be overkill but that much disk space costs next to nothing. I wouldn't want to set it higher, because if I'm really using swap to that extent, the machine's probably past the point of usability anyway.

death by swap

Posted Nov 19, 2009 13:29 UTC (Thu) by makomk (guest, #51493) [Link] (2 responses)

On the other hand, too small a swap partition can also cause "death by swap". Once the swap partition fills up, the only way to free up memory is to start evicting code from RAM - and that utterly kills the responsiveness of the system, as less and less RAM becomes available for the running apps' code. The system enters a kind of death spiral, where the more memory the application allocates, the more slowly it and every other application runs.

death by swap

Posted Nov 19, 2009 18:33 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

I don't understand how that is any different than thrashing the swap.

in both cases you have to read from disk to continue; the only difference is whether you are reading from the swap space or from the original binary (and since both probably require seeks, it's not even a case of random vs sequential disk access)

death by swap

Posted Dec 5, 2009 17:19 UTC (Sat) by misiu_mp (guest, #41936) [Link]

That was a description of thrashing.
I would presume that executables do not make up much of the used memory, so reusing their pages will probably not gain much.

Thrashing is what happens when processes get their pages continuously swapped in and out as the system schedules them to run. That's when everything grinds to a halt, because each context switch or memory access needs to swap out some memory in order to make room for other memory to be read in from swap or from the binary.
That can happen when the total working set (actively used memory) of the busy processes exceeds the amount of RAM, or, more realistically, when the swap (nearly) runs out so there is nowhere to evict unused pages to free up RAM - leaving space for only small chunks to run at a time.
Usually (in my desktop experience) soon after that the OOM killer starts kicking in, which causes the system to thrash even more (as the OOM killer has needs too), and it takes hours for it to be done.
When that happens I usually have no choice but to reboot, losing some data, so for me the OOM killer has been useless and overcommitment the root of all evil.

Using up swap alone does not affect performance much, if you don't access what's in the swap. If you continuously do that, that's thrashing.

Toward a smarter OOM killer

Posted Nov 5, 2009 13:45 UTC (Thu) by hppnq (guest, #14462) [Link]

What could work for you is to run a dummy process (allocating memory as you like) and have it killed first by the OOM killer (using Evgeniy Polyakov's patch), so it would 1) notify the administrator that the system has run into this problem, and 2) free up enough memory that something can actually be done about it.

Just as an exercise, of course. ;-)
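
A sketch of such a sacrificial "canary" process. This variant skips Evgeniy Polyakov's patch and simply biases the stock OOM killer toward itself via /proc/self/oom_adj (oom_score_adj on later kernels); the 256MB figure is arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CANARY_MB 256 /* amount of memory to hold hostage; pick to taste */

    int main(void)
    {
        /* Make this process the OOM killer's preferred victim using the
         * stock oom_adj knob (15 is the legacy maximum). */
        FILE *f = fopen("/proc/self/oom_adj", "w");
        if (f) {
            fputs("15\n", f);
            fclose(f);
        }

        /* Allocate and touch the memory so it is actually resident. */
        size_t len = (size_t)CANARY_MB * 1024 * 1024;
        char *buf = malloc(len);
        if (!buf)
            return 1;
        memset(buf, 0x55, len);

        /* Sit here; when the OOM killer fires, this process dies first,
         * freeing CANARY_MB and leaving a log entry behind. */
        pause();
        return 0;
    }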

