Fragmentation avoidance
[Posted November 2, 2005 by corbet]
Mel Gorman's fragmentation avoidance patches were covered here
last February. This patch set
divides all memory allocations into three categories: "user reclaimable,"
"kernel reclaimable," and "kernel non-reclaimable." The idea to support
multi-page contiguous allocations by grouping reclaimable allocations
together. If no contiguous memory ranges are available, one can be created
by forcing out reclaimable pages. Since non-reclaimable pages have been
segregated into their own area, the chances of such a page blocking the
creation of a contiguous set of free pages are relatively small.
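As a rough illustration of the classification idea (the flag and type names
below are invented for this sketch, not taken from Mel's patch), an allocator
could map allocation flags onto one of three groups of free areas:

    enum alloc_type {
        ALLOC_USER_RCLM,    /* user pages: easily reclaimed or relocated */
        ALLOC_KERN_RCLM,    /* kernel caches that can be shrunk on demand */
        ALLOC_KERN_NORCLM,  /* pinned kernel data: cannot be moved */
    };

    #define HYP_GFP_USER_RCLM  0x1u   /* hypothetical flag bits */
    #define HYP_GFP_KERN_RCLM  0x2u

    static enum alloc_type classify_allocation(unsigned int gfp_flags)
    {
        if (gfp_flags & HYP_GFP_USER_RCLM)
            return ALLOC_USER_RCLM;
        if (gfp_flags & HYP_GFP_KERN_RCLM)
            return ALLOC_KERN_RCLM;
        return ALLOC_KERN_NORCLM;   /* when in doubt, assume pinned */
    }

Each type then draws from its own set of page blocks, so pinned pages do not
end up scattered among reclaimable ones.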
Mel recently posted version 19 of
the fragmentation avoidance patch and requested that it be included in
the -mm kernel. That request started a lengthy discussion on whether this
patch set is a good idea or not. There is, it seems, a fair amount of
uncertainty over whether this code belongs in the kernel.
There are a few reasons for wanting fragmentation avoidance, and the
arguments differ for each of them.
The first of these reasons is to increase the probability of high-order
(multi-page) allocations in the kernel. Nobody denies that Mel's patch
achieves that goal, but there are developers who claim that a better
approach is to simply eliminate any such allocations. In fact, most
multi-page allocations were dealt with some time ago. A few remain,
however, including the two-page kernel stacks still used by default on most
systems. When the kernel stack allocation fails, it blocks the creation of
a new process. The kernel may eventually move to single-page stacks in all
situations, but a few higher-order allocations will remain. It is not
always possible to break required memory into single-page chunks.
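For reference, the allocation behind the default kernel stack is an order-1
request, asking for two physically contiguous pages. The sketch below uses
the real alloc_pages() interface, though the wrapper function itself is
invented for illustration:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* An order-1 allocation asks for two physically contiguous pages -
     * what the traditional 8KB x86 kernel stack needs at every fork(). */
    static unsigned long alloc_two_page_stack(void)
    {
        struct page *pages = alloc_pages(GFP_KERNEL, 1);

        if (!pages)
            return 0;   /* no free contiguous pair: fork() will fail */
        return (unsigned long) page_address(pages);
    }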
The next reason, strongly related to the first, is huge pages. The huge
page mechanism is used to improve performance for certain applications on
large systems; there are few users currently, but that could change if huge
pages were easier to work with. Huge pages cannot be allocated for
applications in the absence of a large - and suitably aligned - region of
contiguous memory. In practice, they are very difficult to create on
systems which have been running for any period of time. Failure to
allocate a huge page is relatively benign; the application simply has to
get by with regular pages and take the performance hit. But, given that the
kernel provides a huge page mechanism at all, making it work more reliably
would be worthwhile.
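In 2.6 kernels of this era, an application gets huge pages by mapping a file
on a hugetlbfs mount. The sketch below assumes a 2MB huge page size and a
mount at /mnt/huge; both vary by architecture and configuration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_PAGE_SIZE  (2 * 1024 * 1024)   /* assumed 2MB pages */

    int main(void)
    {
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
        char *p;

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            /* the benign failure case: no contiguous, aligned region
             * was available; fall back to normal pages instead */
            perror("mmap");
            close(fd);
            return EXIT_FAILURE;
        }
        p[0] = 1;   /* touching the mapping faults in the huge page */
        munmap(p, HUGE_PAGE_SIZE);
        close(fd);
        return EXIT_SUCCESS;
    }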
The fragmentation avoidance patches can help with both high-order
allocations and huge pages. There is some debate over whether it is the
right solution to the problem, however. The often-discussed alternative
would be to create one or more new memory zones set aside for reclaimable
memory. This approach would make use of the zone system already built into
the kernel, thus avoiding the creation of a new layer. A zone-based system
might also avoid the perceived (though somewhat unproven) performance
impact of the fragmentation avoidance patches. Given that this impact is
said to be felt in that most crucial of workloads - kernel compiles - this
argument tends to resonate with the kernel developers.
The zone-based approach is not without problems, however. Memory zones,
currently, are static; as a result, somebody would have to decide how to
divide memory between the reclaimable and non-reclaimable zones. This
adjustment looks like it would be hard to get right in any sort of reliable
way. In the past, the zone system has also been the source of a number of
performance problems, mostly related to balancing of allocations between
the zones. Increasing the complexity of the zone system and adding more
zones could well bring those problems back.
There is another motivation for fragmentation avoidance which brings a
different set of constraints: support for hot-pluggable memory. This
feature is useful on high-availability systems, but it is also heavily used
in association with virtualization. A host running a number of virtualized
Linux instances can, by way of the hotplug mechanism, shift its memory
resources between those instances in response to the demands of each.
Before memory can be removed from a running system, its contents must be
moved elsewhere - at least, if one wants to still have a running system
afterward. The fragmentation avoidance patches can help by putting only
reclaimable allocations in the parts of memory which might be removed. As
long as all the pages in a region can be reclaimed, that region is
removable.
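The removability test is conceptually simple; here is a sketch, with
page_is_free() and page_is_reclaimable() as hypothetical helpers standing in
for the real checks against the kernel's page descriptors:

    static int region_removable(struct page *start, unsigned long nr_pages)
    {
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
            struct page *page = start + i;

            if (!page_is_free(page) && !page_is_reclaimable(page))
                return 0;   /* one pinned page blocks the whole region */
        }
        return 1;
    }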
A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to
support hot-pluggable memory be able to provide a 100% success rate. The
current code need not live up to that metric, but there needs to be a clear
path toward that goal. Otherwise, the kernel developers risk advertising a
feature which they may not ever be able to support in a reliable way. The
backers of fragmentation avoidance would like to merge the patches, solving
90% of the problem now and leaving the other 90% for later. Ingo, instead,
fears that second 90%, and wants to know how it
will get done.
Why can't the current patches offer 100% reliability if they only put
reclaimable memory in hot-pluggable regions? There are ways to lock down
pages which were once reclaimable; these include DMA operations and pages
explicitly locked by user space. There is also the issue of what happens
when the kernel runs out of non-reclaimable memory. Rather than fail a
non-reclaimable allocation attempt, the kernel will allocate a page from
the reclaimable region. This fallback is necessary to avoid inflicting
reliability problems on the rest of the kernel. But the presence of a
non-reclaimable page in a reclaimable region will prevent the system from
vacating that region.
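Expressed in terms of the earlier sketch (reusing its hypothetical allocation
types, and with take_from_area() as an invented helper that pops a free block
from one type's free lists), the troublesome fallback looks something like
this:

    static struct page *alloc_nonreclaimable(unsigned int order)
    {
        struct page *page;

        page = take_from_area(ALLOC_KERN_NORCLM, order);
        if (page)
            return page;

        /* Fall back rather than fail the kernel allocation.  The
         * page handed out here is pinned, so the reclaimable region
         * it sits in can never again be completely vacated. */
        return take_from_area(ALLOC_USER_RCLM, order);
    }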
This problem can be solved by getting rid of non-reclaimable allocations
altogether. And that can be done by changing how the kernel's address
space works. Currently, the kernel runs in a single, contiguous virtual
address space which is mapped directly onto physical memory - often using
large page table entries. (The vmalloc() region is a special exception, but
it is not an issue here.) If the kernel were,
instead, to use normal-sized pages like the rest of the system, its memory
would no longer need to be physically contiguous. Then, if a kernel page
gets in the way, it can simply be moved to a more convenient location.
Beyond the fact that this approach fundamentally changes the kernel's
memory model, there are a couple of little issues with it. There would be
a performance hit caused by increased translation lookaside buffer (TLB)
pressure, and an increase in the amount of memory needed to store the
kernel's page tables.
Certain kernel operations - DMA in particular - cannot tolerate physical
addresses which might change at arbitrary times. So there would have to be
a new API where drivers could request physically-nailed regions - and be
told by the kernel to give them up. In other words, breaking up the
kernel's address space opens a substantial barrel of worms. It is not the
sort of change which would be accepted in the absence of a fairly strong
motivation, and it is not clear that hot-pluggable memory is a sufficiently
compelling cause.
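No such interface exists; but, to make the idea concrete, one could imagine
something like the following, in which a driver pins a physically stable
region for DMA and supplies a callback by which the kernel can demand it
back. Every name here is hypothetical:

    struct nailed_region;   /* opaque handle for a pinned region */

    /*
     * Pin nr_pages of physically stable memory.  The kernel may later
     * invoke release() to ask the driver to quiesce its DMA and give
     * the region up so the underlying pages can be moved.
     */
    struct nailed_region *nail_pages(unsigned long nr_pages,
                                     int (*release)(struct nailed_region *r,
                                                    void *private),
                                     void *private);

    void unnail_pages(struct nailed_region *r);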
So no conclusions have been reached on the inclusion of the fragmentation
avoidance patches. In the short term, however, Andrew Morton's
controversy-avoidance mechanisms are likely to keep the patch out of the -mm
tree. But there are legitimate reasons for wanting this capability in
the kernel, and the issue is unlikely to go away. Unless somebody comes up
with a better solution, it could be hard to keep Mel's patch out forever.