Respite from the OOM killer
[Posted September 28, 2004 by corbet]
Thomas Habets had an unfortunate experience recently. His Linux system ran
out of memory, and the dreaded "OOM killer" was loosed upon the system's
unsuspecting processes. One of its victims turned out to be his screen
locking program, leaving his session open to whoever might happen to walk
by. His response was
the oom_pardon patch,
which allows the system administrator to exempt certain processes from the
OOM killer's revenge. It turns out that SUSE has
a similar patch which allows administrators to
set the "OOM score" of specific processes, increasing or decreasing their
chances of being chosen for an untimely demise.
The OOM killer exists because the Linux kernel, by default, can commit to
supplying more memory than it can actually provide. Overcommitting memory
in this way allows the kernel to make fuller use of the system's resources,
because processes typically do not use all of the memory they claim. As an
example, consider the fork() system call, which nominally copies all of a
process's memory for the new child process. In fact, all it does is mark
the memory as "copy on write" and allow parent and child to share it.
Should either one change a page shared in this way, a true copy of that page is made. In
theory, the kernel could be called upon to copy all of the copy-on-write
memory in this way; in practice, that does not happen. If the kernel
reserved all of the necessary virtual memory (which includes swap space),
some of that space would certainly go unused. Rather than waste that space,
and refuse to run programs or satisfy memory allocations that, in practice, it
could have handled, the kernel overcommits itself and hopes for the best.
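The user-visible side of the fork() behavior described above can be sketched in a few lines of Python. Copy on write itself happens inside the kernel and cannot be observed directly from user space; what this sketch shows is only its consequence, namely that a write in the child never reaches the parent's view of the same memory. The function name fork_private_copy_demo is hypothetical and chosen for illustration.

```python
import os


def fork_private_copy_demo():
    """Fork, let the child overwrite a buffer, and report both views.

    Returns (parent_view, child_view) as bytes objects.
    """
    data = bytearray(b"parent")
    read_end, write_end = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: the write below touches pages shared with the parent,
        # so the kernel gives the child its own copy of those pages.
        try:
            os.close(read_end)
            data[:] = b"child!"
            os.write(write_end, bytes(data))
        finally:
            # Leave immediately, without running the parent's cleanup code.
            os._exit(0)

    # Parent: collect what the child saw, then reap the child.
    os.close(write_end)
    child_view = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    parent_view, child_view = fork_private_copy_demo()
    print("parent sees:", parent_view)
    print("child saw:  ", child_view)
```

Only the pages the child actually writes to are duplicated; everything else stays shared, which is why reserving memory for a full copy at fork() time would usually be wasted.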
When the best does not happen, the OOM killer comes into play; its job is
to kill processes and free up some memory. Getting it to kill the right
processes has been an ongoing challenge, however. One person's useless
memory hog is another's crucial application. Thus, over the years,
numerous efforts have been made to refine the OOM killer's heuristics, and
patches like "oom_pardon" have been created.
Not everybody agrees that this is a fruitful use of developer time.
Andries Brouwer came up with this analogy:
An aircraft company discovered that it was cheaper to fly its
planes with less fuel on board. The planes would be lighter and use
less fuel and money was saved. On rare occasions however the amount
of fuel was insufficient, and the plane would crash. This problem
was solved by the engineers of the company by the development of a
special OOF (out-of-fuel) mechanism. In emergency cases a passenger
was selected and thrown out of the plane. (When necessary, the
procedure was repeated.) A large body of theory was developed and
many publications were devoted to the problem of properly selecting
the victim to be ejected. Should the victim be chosen at random?
Or should one choose the heaviest person? Or the oldest? Should
passengers pay in order not to be ejected, so that the victim would
be the poorest on board? And if for example the heaviest person was
chosen, should there be a special exception in case that was the
pilot? Should first class passengers be exempted? Now that the OOF
mechanism existed, it would be activated every now and then, and
eject passengers even when there was no fuel shortage. The
engineers are still studying precisely how this malfunction is
caused.
Overcommitting memory and fearing the OOM killer are not necessary parts of
the Linux experience, however. Simply setting the sysctl parameter
vm/overcommit_memory to 2 turns off the overcommit
behavior and keeps the OOM killer forever at bay. Most modern systems
should have enough disk space to provide an ample swap file for most
situations. Rather than trying to keep pet processes from being killed
when overcommitted memory runs out, it might be easier just to avoid the
situation altogether.
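As a rough illustration of the setting discussed above, the following Python sketch reads the current overcommit mode and estimates how much memory the kernel will agree to hand out in mode 2. The /proc path corresponds to the vm/overcommit_memory sysctl. The meanings of modes 0 and 1, the vm/overcommit_ratio parameter, and its default of 50 are not taken from this article; they are assumptions drawn from the kernel's sysctl documentation. The helper names are hypothetical, and the limit formula ignores details such as huge pages.

```python
from pathlib import Path

# Mode meanings as described in the kernel's vm sysctl documentation
# (an assumption here; the article itself only discusses mode 2).
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(mode):
    """Return a short description of an overcommit_memory value."""
    return OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the current mode, or return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate mode-2 commit limit: swap plus a percentage of RAM."""
    return swap_kb + ram_kb * overcommit_ratio // 100


if __name__ == "__main__":
    mode = read_overcommit_mode()
    if mode is None:
        print("overcommit mode could not be read on this system")
    else:
        print(f"overcommit_memory = {mode}: {describe_overcommit(mode)}")
    # Hypothetical machine: 1 GB of RAM and 2 GB of swap.
    print("approximate commit limit (kB):",
          commit_limit_kb(1024 * 1024, 2 * 1024 * 1024))
```

Changing the mode requires root, for example with "sysctl -w vm.overcommit_memory=2". The commit-limit estimate shows why ample swap matters in this mode: with the assumed default ratio, only half of RAM counts toward the limit, and swap makes up the rest.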