The 2005 Kernel Summit included three sessions dedicated to virtual memory
issues. The first of those, led by Dave Hansen, addressed the issue of
memory fragmentation. The kernel allocates memory by pages; over time,
most of physical memory ends up broken into individual pages, and
allocating multiple, contiguous pages ("high-order allocations") becomes
difficult. In many parts of the kernel, much effort has been expended to
avoid the need for high-order allocations; as long as kernel code only
needs to allocate individual pages, chances are that its allocations will
There are some places, however, where memory fragmentation remains a
problem. One of those is memory hotplug; before memory can be removed from
the system, the entire affected block must be freed. If that memory has
been split into many pieces, some of which are not freeable, removing the
memory is impossible. Virtualization schemes (Xen in particular) also run
into this problem; they employ a sort of virtual memory hotplugging to
quickly move pages from one virtual machine to another.
This "memory ballooning" technique runs into another problem: each client
must maintain a struct page structure for every page which might,
at some point, be part of its virtual machine. The resulting overhead
(which is replicated across all virtual machines running on the host) can
be quite large. The sparse memory patches which have been circulating
recently help in this regard, but there is more to be done. Fragmentation
makes the problem worse, since a larger address space must be maintained to
get sufficiently large contiguous areas.
Memory migration techniques are one way of dealing with this problem: if
there is no contiguous area which is big enough, simply move some pages
around to create such an area. This is a virtual memory system, after all;
that sort of move should be possible. In practice, migration behaves much
like swapping, except that the trip to the disk and back can be shorted
out. The contents of a page can be directly "swapped" to a new location,
and the page tables updated accordingly.
The problem here is that some sorts of
memory cannot be moved so easily. Pages which have been locked down are,
well, locked down. Large ("hugetlb") pages are difficult to move; among
other things, moving a huge page requires finding another large,
contiguous, suitably-aligned area elsewhere. Finally, kernel data cannot,
in general, be moved. A user page can be moved by simply changing the page
table entries which point to it. Kernel memory, however, will have any
number of internal kernel pointers referring to it; it's effectively nailed
down until the relevant kernel code lets go of it.
The kernel memory is quite a problem. Over time, kernel data structures
end up being spread all over physical memory. Some kernel allocation
algorithms seem designed to create worst-case results in this regard.
While there are ways to recover some kernel allocations, many structures
(inodes and dentry cache entries in particular) are sticky, and there is no
way to find and relocate kernel data structures which are holding down a
All of this leads to Mel Gorman's fragmentation-avoidance patches (covered
here last February). These
patches segregate kernel allocations into specific parts of memory,
reserving most of physical memory for allocations which can be relocated or
recovered at need. Some developers fear that this approach, besides adding
complexity to the virtual memory subsystem, may lead to more zone balancing
problems. If the memory region set aside for kernel allocations is sized
improperly, the performance of the whole system will suffer. Experience
has shown that this sort of memory partitioning can be very hard to get
right. There is a real need for for some sort of fragmentation avoidance,
however, so these patches might yet go in.
Martin Bligh moved the discussion into the area of responding to memory
pressure and related topics. One problem he pointed out was a remaining
issue of balance: how much of memory should be devoted to kernel caches,
and how much of it should hold user data? That balancing, he notes, still
does not always come out right.
Page allocation latency is an issue: how long will it take a process which
has just incurred a page fault to get the required page and continue
executing? This time is partially determined by how page reclaim and
writeback is handled. If a
page must be written out to satisfy a page fault, should that writeback be
performed by the faulting process, or by a system process like
kswapd? The "direct reclaim" method (making the faulting process
perform the write) has the advantage of throttling thrashing processes,
reducing their impact on the rest of the system. It restrains the behavior
of memory-hog processes, and makes them pay more of the cost of their
resource requirements. It does, however, also
increase allocation latency.
In some cases, direct reclaim can have a strong impact on the performance
of memory-intensive processes; as a result, this technique is unpopular in
some areas. Linus suggested that the real problem is that we still have
not learned how to do direct reclaim right. Rather than tossing it out
altogether, we should put more effort into figuring out how to get the
right degree of throttling without ruining performance altogether.
Memory fragmentation issues, especially with kernel memory, came up again.
A number of kernel structures - dentries, inodes, and
address_space structures in particular - are widespread,
interdependent, and hard to get rid of. There are ways to ask the system
to reduce the numbers of such structures, but it's still hard to free full
pages (much less groups of pages) this way. Some time was spent discussing
whether pulling the file name out of the dentry structure (thus making it
smaller) would help the situation. Since that name would then have to be
allocated elsewhere, the consensus was that things might even get worse.
More work needs to go into dealing with failure; if page writeback starts
to fail, "everything goes to hell." The system starts to thrash, and the
dreaded out-of-memory killer might get involved.
Unfortunately, if writeback is
happening to a remote device (via iSCSI, say, or NFS), the chances of
this sort of failure increase. Any writeback method which involves networking will
require memory allocations to work; it is, thus, subject to failure when
memory is tight - just when it is most needed. Solving this problem will
require marking high priority sockets, and continuing to process network
traffic from those sockets (using pre-allocated memory pools) even when
memory is tight. (See this
Kernel Page article from last March for more details on this issue.)
Large page handling in Linux still needs work; the current mechanism
satisfies the current needs of a certain large database vendor, but does
not go much farther. The large page subsystem should support more than
anonymous memory; it should handle program text and data areas as well.
There needs to be swapping for large pages, even if the swapper is forced
to break such pages up. Finally, large page allocation should be automatic
and transparent. A certain large Linux vendor recently lost a sale to
Solaris, which won by virtue of its more advanced, transparent large page
Linus responded that mixing large pages and a sane virtual memory subsystem
is just not possible. He sees the commercial pressures driving large
pages, however. Solve the fragmentation problems that come with large
pages, says Linus, and he'll have no arguments with the rest.
Finally, Martin entered a plea for better instrumentation of the virtual
memory subsystem. Many problems come up at customer sites; they are hard
(or impossible) to reproduce in the lab. There needs to be information
available on where pages have been allocated, so that developers can begin
to track down problems when they are reported.
The final portion of the memory management discussion was led by Christoph
Lameter and Nick Piggin; their topic was scalability. Christoph's page
table scalability patches (see this article from last December)
were discussed; he seems to have gotten a little tired of maintaining and
revising those patches, and would like to know when he might see them
merged. There are still developers who do not like this patch; Hugh
Dickins described it as "a hack for a special case" driven by one vendor.
Linus, however, sees it as a relatively simple solution, and sees no reason
why it should not be merged.
Nick Piggin's lockless page cache patches were briefly discussed. These
patches are relatively new, relatively complex, and relatively scary. Not
too many people have taken a detailed look at them yet, and there was no
real discussion of them.
to post comments)