
Kernel Summit 2005: Virtual memory topics

From LWN's 2005 Kernel Summit coverage.
The 2005 Kernel Summit included three sessions dedicated to virtual memory issues. The first of those, led by Dave Hansen, addressed the issue of memory fragmentation. The kernel allocates memory by pages; over time, most of physical memory ends up broken into individual pages, and allocating multiple, contiguous pages ("high-order allocations") becomes difficult. In many parts of the kernel, much effort has been expended to avoid the need for high-order allocations; as long as kernel code only needs to allocate individual pages, chances are that its allocations will succeed.
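
As a rough illustration (kernel-style C, not code from the summit itself), the page allocator hands out blocks in power-of-two "orders"; anything above order 0 is a high-order allocation, and its chances of success drop as memory fragments:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Illustrative only: order-0 versus high-order allocations. */
    static void allocation_order_demo(void)
    {
            /* A single page (order 0, 4KB on most architectures); this
             * almost always succeeds. */
            struct page *one = alloc_pages(GFP_KERNEL, 0);

            /* Sixteen physically contiguous pages (order 4, 64KB).  On a
             * fragmented system this can fail even when plenty of
             * individual pages are free - which is why kernel code tries
             * hard to avoid high-order requests. */
            struct page *chunk = alloc_pages(GFP_KERNEL, 4);

            if (chunk)
                    __free_pages(chunk, 4);
            if (one)
                    __free_pages(one, 0);
    }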

There are some places, however, where memory fragmentation remains a problem. One of those is memory hotplug; before memory can be removed from the system, the entire affected block must be freed. If that memory has been split into many pieces, some of which are not freeable, removing the memory is impossible. Virtualization schemes (Xen in particular) also run into this problem; they employ a sort of virtual memory hotplugging to quickly move pages from one virtual machine to another.

This "memory ballooning" technique runs into another problem: each client must maintain a struct page structure for every page which might, at some point, be part of its virtual machine. The resulting overhead (which is replicated across all virtual machines running on the host) can be quite large. The sparse memory patches which have been circulating recently help in this regard, but there is more to be done. Fragmentation makes the problem worse, since a larger address space must be maintained to get sufficiently large contiguous areas.

Memory migration techniques are one way of dealing with this problem: if there is no contiguous area which is big enough, simply move some pages around to create such an area. This is a virtual memory system, after all; that sort of move should be possible. In practice, migration behaves much like swapping, except that the trip to the disk and back can be shorted out. The contents of a page can be directly "swapped" to a new location, and the page tables updated accordingly.
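
In outline, that "swap without the disk" operation looks something like the sketch below; the helper names are hypothetical, and this is not the kernel's actual migration code, just the sequence of steps described above:

    /* Pseudocode sketch of page migration; helper names are invented. */
    static int migrate_one_page(struct page *old)
    {
            struct page *new = alloc_page(GFP_HIGHUSER);
            if (!new)
                    return -ENOMEM;

            lock_and_unmap_all_ptes(old);      /* as swapout would do     */
            copy_page_contents(new, old);      /* instead of disk I/O     */
            repoint_ptes_and_page_cache(old, new);
            unlock_and_free(old);              /* old frame is now free   */
            return 0;
    }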

The problem here is that some sorts of memory cannot be moved so easily. Pages which have been locked down are, well, locked down. Large ("hugetlb") pages are difficult to move; among other things, moving a huge page requires finding another large, contiguous, suitably-aligned area elsewhere. Finally, kernel data cannot, in general, be moved. A user page can be moved by simply changing the page table entries which point to it. Kernel memory, however, will have any number of internal kernel pointers referring to it; it's effectively nailed down until the relevant kernel code lets go of it.

Kernel memory is quite a problem. Over time, kernel data structures end up being spread all over physical memory. Some kernel allocation algorithms seem designed to create worst-case results in this regard. While there are ways to recover some kernel allocations, many structures (inodes and dentry cache entries in particular) are sticky, and there is no way to find and relocate kernel data structures which are holding down a specific page.

All of this leads to Mel Gorman's fragmentation-avoidance patches (covered here last February). These patches segregate kernel allocations into specific parts of memory, reserving most of physical memory for allocations which can be relocated or recovered at need. Some developers fear that this approach, besides adding complexity to the virtual memory subsystem, may lead to more zone balancing problems. If the memory region set aside for kernel allocations is sized improperly, the performance of the whole system will suffer. Experience has shown that this sort of memory partitioning can be very hard to get right. There is a real need for some sort of fragmentation avoidance, however, so these patches might yet go in.
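
The core idea can be sketched as steering each allocation into a region of memory holding allocations of the same kind, so that hard-to-move kernel data does not end up peppered across all of physical memory. The names below are purely illustrative; they are not the names used in the actual patches:

    /* Illustrative sketch only - not the real fragmentation-avoidance code. */
    enum alloc_kind {
            KIND_UNMOVABLE,    /* most kernel allocations                 */
            KIND_RECLAIMABLE,  /* caches that can be shrunk (dcache, ...) */
            KIND_MOVABLE,      /* user pages that can be migrated         */
    };

    static struct page *alloc_grouped(unsigned int order, enum alloc_kind kind)
    {
            /* Try a free block already dedicated to this kind of allocation. */
            struct page *page = take_from_region(kind, order);
            if (page)
                    return page;
            /* Otherwise claim a fresh large block and dedicate it to this kind. */
            return claim_new_region(kind, order);
    }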

Martin Bligh moved the discussion into the area of responding to memory pressure and related topics. One problem he pointed out was a remaining issue of balance: how much of memory should be devoted to kernel caches, and how much of it should hold user data? That balancing, he noted, still does not always come out right.

Page allocation latency is an issue: how long will it take a process which has just incurred a page fault to get the required page and continue executing? This time is partially determined by how page reclaim and writeback is handled. If a page must be written out to satisfy a page fault, should that writeback be performed by the faulting process, or by a system process like kswapd? The "direct reclaim" method (making the faulting process perform the write) has the advantage of throttling thrashing processes, reducing their impact on the rest of the system. It restrains the behavior of memory-hog processes, and makes them pay more of the cost of their resource requirements. It does, however, also increase allocation latency.
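
Schematically, the trade-off lives in the allocator's slow path. The sketch below uses hypothetical helper names and is not the real __alloc_pages() code; it only shows where the policy choice sits:

    /* Simplified sketch of the direct-reclaim policy question. */
    static struct page *allocate_with_reclaim(unsigned int order)
    {
            struct page *page = try_free_lists(order);
            if (page)
                    return page;

            wake_up_kswapd();               /* background reclaim: low latency */

            if (can_direct_reclaim()) {
                    /* Direct reclaim: the faulting process scans, and may
                     * write dirty pages out, itself.  This throttles
                     * memory hogs but adds latency to the fault. */
                    reclaim_some_pages(order);
                    page = try_free_lists(order);
            }
            return page;
    }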

In some cases, direct reclaim can have a strong impact on the performance of memory-intensive processes; as a result, this technique is unpopular in some areas. Linus suggested that the real problem is that we still have not learned how to do direct reclaim right. Rather than tossing it out altogether, we should put more effort into figuring out how to get the right degree of throttling without ruining performance altogether.

Memory fragmentation issues, especially with kernel memory, came up again. A number of kernel structures - dentries, inodes, and address_space structures in particular - are widespread, interdependent, and hard to get rid of. There are ways to ask the system to reduce the numbers of such structures, but it's still hard to free full pages (much less groups of pages) this way. Some time was spent discussing whether pulling the file name out of the dentry structure (thus making it smaller) would help the situation. Since that name would then have to be allocated elsewhere, the consensus was that things might even get worse.
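
Part of the problem is that these objects are packed many to a page. The kernel asks such caches to shrink through callbacks along the lines of the sketch below (the names are illustrative and the real interface differs in detail); even a successful shrink frees a whole page only when every object packed into that page happens to be freeable:

    /* Rough sketch of a cache shrinker callback; not the real API. */
    static int shrink_my_cache(int nr_to_scan)
    {
            while (nr_to_scan--) {
                    struct my_object *obj = pick_unused_object();
                    if (!obj)
                            break;
                    free_object(obj);  /* its page is freed only if now empty */
            }
            return count_remaining_objects();
    }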

More work needs to go into dealing with failure; if page writeback starts to fail, "everything goes to hell." The system starts to thrash, and the dreaded out-of-memory killer might get involved. Unfortunately, if writeback is happening to a remote device (via iSCSI, say, or NFS), the chances of this sort of failure increase. Any writeback method which involves networking will require memory allocations to work; it is, thus, subject to failure when memory is tight - just when it is most needed. Solving this problem will require marking high priority sockets, and continuing to process network traffic from those sockets (using pre-allocated memory pools) even when memory is tight. (See this Kernel Page article from last March for more details on this issue.)
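
The pre-allocated pool part of that plan already has a natural tool in the kernel's mempool interface: a minimum number of objects is set aside up front, and allocations fall back to that reserve when the normal allocators fail. A minimal sketch (the cache and reserve size are placeholders, and interface details have varied across kernel versions):

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    #define MY_RESERVE 16

    static struct kmem_cache *my_cache;  /* created elsewhere with kmem_cache_create() */
    static mempool_t *my_pool;

    static int __init reserve_init(void)
    {
            /* Keep at least MY_RESERVE objects available at all times. */
            my_pool = mempool_create(MY_RESERVE, mempool_alloc_slab,
                                     mempool_free_slab, my_cache);
            return my_pool ? 0 : -ENOMEM;
    }

    /* mempool_alloc() dips into the reserve when the slab allocator cannot
     * satisfy the request - exactly the property a memory-tight network
     * writeback path needs. */
    static void *get_buffer(void)
    {
            return mempool_alloc(my_pool, GFP_ATOMIC);
    }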

Large page handling in Linux still needs work; the current mechanism satisfies the current needs of a certain large database vendor, but does not go much farther. The large page subsystem should support more than anonymous memory; it should handle program text and data areas as well. There needs to be swapping for large pages, even if the swapper is forced to break such pages up. Finally, large page allocation should be automatic and transparent. A certain large Linux vendor recently lost a sale to Solaris, which won by virtue of its more advanced, transparent large page implementation.
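
"Transparent" is the key word: at the time, using large pages required explicit setup from the application, typically a hugetlbfs mount or SHM_HUGETLB shared memory. The sketch below shows the hugetlbfs route; the mount point and the 2MB page size are assumptions that vary by system:

    /* Assumes:  mount -t hugetlbfs none /mnt/huge  (and reserved huge pages) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_PAGE_SIZE (2 * 1024 * 1024)   /* 2MB on x86; varies by arch */

    int main(void)
    {
            int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            void *p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");   /* fails if no huge pages are reserved */
                    close(fd);
                    return 1;
            }

            ((char *)p)[0] = 1;       /* touch the mapping */
            munmap(p, HUGE_PAGE_SIZE);
            close(fd);
            unlink("/mnt/huge/example");
            return 0;
    }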

Linus responded that mixing large pages and a sane virtual memory subsystem is just not possible. He sees the commercial pressures driving large pages, however. Solve the fragmentation problems that come with large pages, says Linus, and he'll have no arguments with the rest.

Finally, Martin entered a plea for better instrumentation of the virtual memory subsystem. Many problems come up at customer sites; they are hard (or impossible) to reproduce in the lab. There needs to be information available on where pages have been allocated, so that developers can begin to track down problems when they are reported.

The final portion of the memory management discussion was led by Christoph Lameter and Nick Piggin; their topic was scalability. Christoph's page table scalability patches (see this article from last December) were discussed; he seems to have gotten a little tired of maintaining and revising those patches, and would like to know when he might see them merged. There are still developers who do not like this work; Hugh Dickins described it as "a hack for a special case" driven by one vendor. Linus, however, sees it as a relatively simple solution, and sees no reason why it should not be merged.

Nick Piggin's lockless page cache patches were briefly discussed. These patches are relatively new, relatively complex, and relatively scary. Not too many people have taken a detailed look at them yet, and there was no real discussion of them.



Virtual memory topics

Posted Jul 20, 2005 6:10 UTC (Wed) by xoddam (subscriber, #2322)

A number of kernel structures - dentries, inodes, and address_space structures in particular - are widespread, interdependent, and hard to get rid of. There are ways to ask the system to reduce the numbers of such structures, but it's still hard to free full pages (much less groups of pages) this way.

The kernel data which contributes most to memory fragmentation is precisely these long-lived, heap-allocated objects. Rather than adding a fixed-size 'zone' for all kernel heap memory, would it not make sense to allocate such long-lived structures in a virtual address space (thus relocatable and resizable, if not swappable) rather than in the kernel's direct-mapped space? Such a space could be treated like user memory and require copy_*_user(), or (at the cost of some complexity) the kernel mapping could acquire a new virtual region.
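
For comparison, the kernel's existing vmalloc() interface already does part of this: it returns memory that is only virtually contiguous, with the underlying physical pages gathered from wherever they happen to be free. It does not make the allocation relocatable afterwards, but it shows the mapping machinery exists. A trivial sketch:

    #include <linux/vmalloc.h>

    /* vmalloc() maps scattered physical pages into a contiguous kernel
     * virtual range, so no high-order allocation is needed.  It does not,
     * by itself, make the objects movable later on. */
    static void *alloc_big_table(unsigned long bytes)
    {
            return vmalloc(bytes);
    }

    static void free_big_table(void *table)
    {
            vfree(table);
    }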

*Particular* objects -- the inodes of swap devices, some address_spaces -- must be available at all times on pain of deadlock, but there's no reason to privilege all objects of the same type with the ability to pin memory for the lifetime of the kernel!

Re: allocate some long-lived kernel structures in a virtual address space

Posted Jul 22, 2005 1:30 UTC (Fri) by sweikart (guest, #4276)

This seems appropriate for data structures that cache disks (dentry, inodes), since the extra overhead for the cached data is a small percentage of the cost for a fetch from disk.

-scott

Kernel Summit 2005: Virtual memory topics

Posted Jul 31, 2005 21:15 UTC (Sun) by daniel (subscriber, #3181)

In some cases, direct reclaim can have a strong impact on the performance of memory-intensive processes; as a result, this technique is unpopular in some areas. Linus suggested that the real problem is that we still have not learned how to do direct reclaim right. Rather than tossing it out altogether, we should put more effort into figuring out how to get the right degree of throttling without ruining performance altogether.

Linus has already been wrong once in the last three years; this would make it twice ;-)

So-called direct reclaim - i.e., letting any process that fails to get the memory it needs magically transform itself into a memory scanner - was a bad idea from day one, and the incriminating evidence has slowly built up. For me, the glaringly obvious deficiency is the tendency to end up with a thundering herd of memory scanners that interact with each other in no sane way.

This problem reminds me of how the VM was before rmap: we would walk process page tables one mm at a time and unmap a physical page just from one table at a time, and hope that we would manage to catch all shared mappings before some process remaps the page, undoing all our hard work. This approach did not scale: the tendency to livelock increased with the size of memory, and memory, like the universe, always expands. Rmap (and a simpler, more obviously correct scanner) fixed this.

Now the "direct reclaim" idea (I prefer the name "thundering herd of scanners") is starting to crack under the strain of steadily multiplying tasks. It is the same effect, really. It is a random algorithm that hopes to achieve its intended result by side effects of the user space load. It is usually a bad idea to hope that the usage pattern will always cooperate.

The right thing to do is to block a task that cannot get memory, or gets too much. This throttles the task and furthermore makes it easy to apply a true throttling policy to the list of blocked processes if we want to (we probably do). Scanning should be done only by dedicated, non-blocking scanner tasks. We might want to have as many as one scanner per cpu, but it makes absolutely zero sense to have more.
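
In outline - hypothetical names, a sketch rather than anybody's actual patch - it might look like this:

    #include <linux/kthread.h>
    #include <linux/wait.h>

    /* Tasks that cannot get memory block here; dedicated per-cpu scanner
     * threads reclaim pages and wake them as memory becomes available. */
    static DECLARE_WAIT_QUEUE_HEAD(memory_waiters);

    static struct page *throttled_alloc(unsigned int order)
    {
            struct page *page;

            while (!(page = try_free_lists(order))) {
                    /* Block; no scanning happens in this task's context. */
                    wait_event(memory_waiters, memory_is_available(order));
            }
            return page;
    }

    static int scanner_thread(void *unused)      /* one per cpu */
    {
            while (!kthread_should_stop()) {
                    reclaim_some_pages();        /* may queue async writeout */
                    wake_up(&memory_waiters);
                    wait_for_memory_pressure();
            }
            return 0;
    }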

Scanner tasks will initiate writeout as now, but since we will no longer co-opt a user task to drive the actual io, it needs to be driven by another mechanism, a work queue, say. Completing the aio work to make it really, truly non-blocking (even in the case of filesystem metadata read-before-write) would mean that writeout can be driven directly by a scanning task as a non-blocking state machine, and everything would get that much nicer. But that is just an optimization, the key thing is to get the scanning right.

Implementing this is left as an exercise for the interested reader.

Regards,

Daniel

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds