Short topics in memory management
[Posted March 6, 2007 by corbet]
Memory management has been a relatively quiet topic over much of the life
of the 2.6.x kernels. Many of the worst problems have been solved and the
MM hackers have gone on to other things. That does not mean that there is
no more work to do, however; indeed, things might be about to heat up. A
few recent discussions illustrate the sort of pressures which may lead to a
renewed interest in memory management work in the near future.
Mel Gorman's fragmentation avoidance patches have been discussed here a few
times in the past. The core idea behind Mel's work is to identify pages
which can be easily moved or reclaimed and group them together. Movable
pages include those allocated to user space; moving them is just a matter
of changing the relevant page table entries. Reclaimable pages include
kernel caches which can be released should the need arise. Grouping
these pages together makes it easy for the kernel to free large blocks of
memory, which is useful for enabling high-order allocations or for vacating
regions of memory entirely.
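To make the grouping idea concrete, here is a small user-space simulation - emphatically not Mel's kernel code, which operates on blocks of pages inside the buddy allocator - of the list-based mechanism: each free page sits on a list matching its allocation type, and a request falls back to the other lists only when its own list is empty, since such fallbacks cause exactly the mixing the patches try to avoid.

    #include <stdio.h>

    enum alloc_type { UNMOVABLE, RECLAIMABLE, MOVABLE, NR_TYPES };

    struct page {
        int pfn;                     /* page frame number */
        struct page *next;           /* link in a per-type free list */
    };

    static struct page *free_lists[NR_TYPES];

    /* Return a page to the free list matching its allocation type. */
    static void free_page_typed(struct page *page, enum alloc_type type)
    {
        page->next = free_lists[type];
        free_lists[type] = page;
    }

    /* Allocate from the matching list; fall back to the other lists
     * only when the preferred list is empty. */
    static struct page *alloc_page_typed(enum alloc_type type)
    {
        for (int t = 0; t < NR_TYPES; t++) {
            int cand = (type + t) % NR_TYPES;
            struct page *page = free_lists[cand];

            if (page) {
                free_lists[cand] = page->next;
                return page;
            }
        }
        return NULL;                 /* out of memory */
    }

    int main(void)
    {
        static struct page pages[8];

        /* Seed the lists: pages 0-3 movable, pages 4-7 unmovable. */
        for (int i = 0; i < 8; i++) {
            pages[i].pfn = i;
            free_page_typed(&pages[i], i < 4 ? MOVABLE : UNMOVABLE);
        }

        struct page *p = alloc_page_typed(MOVABLE);
        if (p)
            printf("movable allocation satisfied from pfn %d\n", p->pfn);
        return 0;
    }

In Mel's patches the bookkeeping is done on multi-page blocks rather than on single pages, so a fallback steals a whole block at a time and the different allocation types stay segregated at a coarser granularity.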
In the past, reviewers of Mel's patches have disagreed over how they should
work. Some argue in favor of maintaining separate free lists for the
different types of allocations, while others feel that this sort of memory
partitioning is just what the kernel's zone system was created to do. So,
this time around, Mel has posted two sets of patches: a list-based grouping mechanism
and a new ZONE_MOVABLE
zone which is restricted to movable allocations.
The difference now is that the two patch sets are designed to
work together. By default, there is no movable zone, so the list-based
mechanism handles the full job of keeping alike allocations together. The
administrator can enable ZONE_MOVABLE at boot time with the
kernelcore= option, which specifies the amount of memory which is
not to be put into that zone (a sample boot line appears below). In
addition, Mel has posted some comprehensive information on how
performance is affected by these patches. In an unusual move, Mel has
included a set of videos showing just how memory allocations respond to
system stress with different allocation mechanisms in place; the image at
the right shows one frame from one of those videos. The demonstration is
convincing, but one is left with the uneasy hope that the creation of
multimedia demonstrations will not become necessary to get patches into the
kernel in the future.
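Returning to the ZONE_MOVABLE configuration: as a usage illustration (the kernel image name and the memory split here are invented), a boot entry enabling the movable zone might look something like this:

    # Hypothetical GRUB entry: 512MB of memory remains available for
    # ordinary kernel allocations; the remainder goes into ZONE_MOVABLE
    # and will serve only movable allocations.
    kernel /boot/vmlinuz-2.6.21-rc2-mm2 root=/dev/sda1 kernelcore=512M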
These patches have found their way into the -mm tree, though Andrew Morton
remains undecided on whether they are worthwhile. Among
other things, he is concerned about how they fit with other, related work,
especially memory hot-unplugging and per-container memory limits. While
patches addressing both areas have been posted, nothing is really at a
point where it is ready to be merged. This
discussion between Mel and Andrew is worth reading for those who are
interested in this topic.
The hot removal of memory can clearly be helped by Mel's work - memory
which is subject to removal can be restricted to movable and reclaimable
allocations, allowing it to be vacated if need be. Not everybody is
convinced that hot-unplugging is a useful feature, though. In particular,
Linus is opposed to the idea. The biggest
potential use for hot-unplugging is for virtualization; it allows a
hypervisor to move memory resources between guests as their needs change.
Linus points out that most virtualization systems already have
mechanisms which allow the addition and removal of individual pages from
guests; there is, he says, no need for any other support for changing a
guest's memory.
Another use for this technique is allowing systems to conserve power by
turning off banks of memory when they are not needed. Clearly, one must be
able to move all useful data out of a memory bank before powering it down.
Linus is even more dismissive of this idea:
The whole DRAM power story is a bedtime story for gullible
children. Don't fall for it. It's not realistic. The hardware
support for it DOES NOT EXIST today, and probably won't for several
years. And the real fix is elsewhere anyway...
More information on his objections is available here for those who are interested. In short,
Linus thinks it would make much more sense to look at turning off entire
NUMA nodes rather than individual memory banks. That notwithstanding, Mark
Gross has posted a patch enabling
memory power-down which includes some basic anti-fragmentation
techniques. Says Mark:
To be clear PM-memory will not be useful unless you have workloads
that can take advantage of it. The identified workloads are not
desktop workloads. However; there is a non-zero number of
interested users with applicable workloads that make pushing the
enabling patches out to the community worth while. These workloads
tend to be within network elements and servers where memory
utilization tracks traffic load.
It has also been suggested that resident set size limits (generally
associated with containers) can solve many of the same problems that the
anti-fragmentation work is aimed at. Rik van Riel was heard to complain in response that RSS limits
could aggravate the scalability problems currently being experienced by
the Linux memory management system. That drew questions from people like
Andrew, who were not really aware of those problems. Rik responded with a few relatively vague
examples; his ability to be specific is evidently restricted by agreements
with the customers experiencing the problems.
That led to a whole discussion on whether it makes any sense to try to
address memory management problems without test cases which demonstrate
those problems. Rik argues that fixing
test cases tends to break things in the real world. Andrew responds:
Somehow I don't believe that a person or organisation which is
incapable of preparing even a simple testcase will be capable of
fixing problems such as this without breaking things.
Rik has put together a page
describing some problem workloads in an attempt to push the discussion
forward.
One of Andrew's points is that trying to fix memory management problems
caused by specific workloads in the kernel will always be hard; the kernel
simply does not always have the information to know which pages will be
needed soon and which can be discarded. Perhaps, he says, the right answer
is to make it easier for user space to communicate its expected future
needs. To that end, he put together a pagecache management tool for
testing. It works as an LD_PRELOAD library which intercepts
file-related system calls, tracks application usage, and tells the kernel
to drop pages out of the cache after they have been used. The result is
that common operations (copying a kernel tree, for example) can be carried
out without forcing other useful data out of the page cache.
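A heavily simplified sketch of the same mechanism - this is not Andrew's actual tool, which tracks usage before deciding what to drop - might interpose close() and use posix_fadvise() to evict the file's cached pages:

    /* pagedrop.c: build with
     *     gcc -shared -fPIC -o pagedrop.so pagedrop.c -ldl
     * and run a command under it with
     *     LD_PRELOAD=./pagedrop.so cp -a linux-2.6 /tmp/
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <unistd.h>

    int close(int fd)
    {
        static int (*real_close)(int);

        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        /* Push out any dirty data, then tell the kernel that this
         * file's cached pages will not be needed again.  Both calls
         * fail harmlessly on sockets, pipes and the like. */
        fdatasync(fd);
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        return real_close(fd);
    }

POSIX_FADV_DONTNEED is purely advisory and only evicts clean pages, which is why the fdatasync() call precedes it.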
There were some skeptical responses to this posting. There was also
some interest and some discussion of how smarter, application-specific
policies could be incorporated into the tool. A backup tool's policy, for
example, might force the output file out of memory immediately, track the
pages it reads from other files, and evict them after use - but only those
which were not already resident in the page cache before the backup read
them - and so on. A rough sketch of that residency check appears below.
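This sketch is speculative - the helper names are invented, and the actual tool may work quite differently - but it shows how mincore() can record which pages of a file were resident before a read, with posix_fadvise() later evicting only the pages the backup itself faulted in:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Record which pages of fd's first len bytes are currently in the
     * page cache; returns one byte per page (bit 0 set if resident),
     * or NULL on failure. */
    static unsigned char *snapshot_residency(int fd, size_t len)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        size_t pages = (len + pagesz - 1) / pagesz;
        unsigned char *vec = malloc(pages);
        void *map;

        if (!vec)
            return NULL;
        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED || mincore(map, len, vec) != 0) {
            free(vec);
            vec = NULL;
        }
        if (map != MAP_FAILED)
            munmap(map, len);
        return vec;
    }

    /* After the backup has read the file: drop only those pages which
     * were not already cached before the read began. */
    static void drop_cold_pages(int fd, size_t len,
                                const unsigned char *was_resident)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        size_t pages = (len + pagesz - 1) / pagesz;

        for (size_t i = 0; i < pages; i++)
            if (!(was_resident[i] & 1))
                posix_fadvise(fd, (off_t)(i * pagesz), pagesz,
                              POSIX_FADV_DONTNEED);
    }

A backup loop would call snapshot_residency() on each input file before reading it and drop_cold_pages() once it is done, leaving previously-hot data in the cache.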
It remains to be seen whether anybody will run with this tool and try to
use it to solve real workload problems, but there is some potential there.
The kernel does not always know best.