
Kernel Summit: Virtual Memory

This article is part of LWN's 2004 Kernel Summit coverage.
No kernel summit is complete without a lengthy virtual memory session, perhaps because the VM problem will never be solved to everybody's satisfaction.

The first topic was non-uniform memory access (NUMA) systems. At this point, the biggest impediment to better NUMA support would appear to be the kernel's ability to handle discontiguous memory. As NUMA systems get larger, and with features like hotpluggable memory, the physical memory address space is becoming increasingly fragmented and sparse. The current discontiguous memory support in the kernel is not up to the task.

The solution would appear to be the nonlinear memory patch proposed by Daniel Phillips back in 2002 (see LWN's coverage). Following the classic rule that almost every problem in computer science can be solved by adding another layer of indirection, this patch creates a new class of "logical" addresses. Most of the kernel can work with logical addresses, which present a nice, contiguous memory space. The kernel page tables map these addresses into the real physical space, which is rather more complex.
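The indirection can be illustrated with a small model. This is only a sketch of the idea, not the patch itself: the section size, the table name `section_to_phys`, and the physical layout are all assumptions made for the example, and a flat lookup table stands in for whatever mapping the real patch uses.

```python
# Toy model of "logical" addresses laid over a sparse physical space.
# Everything here is hypothetical and chosen for illustration only.

SECTION_SHIFT = 24                    # assumed 16MB sections
SECTION_SIZE = 1 << SECTION_SHIFT

# Index = logical section number, value = physical base address.
# The physical banks are separated by large holes; the logical view is not.
section_to_phys = [0x0000_0000, 0x4000_0000, 0x4100_0000]


def logical_to_phys(logical):
    """Translate a contiguous logical address to its physical address."""
    section = logical >> SECTION_SHIFT
    offset = logical & (SECTION_SIZE - 1)
    return section_to_phys[section] + offset


def phys_to_logical(phys):
    """Reverse translation; fails for addresses that fall in a hole."""
    for section, base in enumerate(section_to_phys):
        if base <= phys < base + SECTION_SIZE:
            return (section << SECTION_SHIFT) + (phys - base)
    raise ValueError("physical address falls in a hole")
```

Code that works only with logical addresses sees three back-to-back sections; only the two translation functions need to know where the banks really live.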

It was asked whether the current virtual memory area (VMA) structures could be used to represent discontiguous memory. They are, after all, already set up for the task of managing sparse address spaces. The answer was that VMAs, being intended for the management of virtual spaces, are too complex and bring a lot of unneeded baggage to the discontiguous memory problem. Nonetheless, there may yet be an attempt to split apart the VMA structure and use it for this task.

NUMA also brings problems in finding the physical location of devices and communicating that to user space. A fair amount of discussion went into the design of some sort of textual specification which could be communicated by sysfs. In the end, however, Linus suggested that it was best to simply represent the system hierarchy directly in sysfs. Administrators can then simply browse the space with a file manager, and much of the problem is solved.

CPU scheduling issues may be calming down in general, but they still come up on NUMA systems. The scheduling domains mechanism is still not being used to its full potential on these systems. The domain setup code needs to be reworked, ripping out much duplicated code while still allowing the architecture code to tweak things as needed for the actual hardware at hand.

Iwamoto Toshihiro led a discussion on support for hotpluggable memory. Hotplug is interesting to people working in the high-availability space; if a particular bank of memory is beginning to show errors, it can be ripped out and replaced on the fly. Supporting this capability is not easy, however.

Actually, adding memory is easy to support. The new pages are simply added to the free list and promptly filled with GNOME or KDE theme data. Removing memory is a little harder, as something really should be done with the contents of that memory before the administrator is given the go-ahead to yank it out. That means finding a new home for everything contained in the outgoing memory.

User-space pages are relatively easy to relocate. They are accessed by way of page tables; if they can be moved, and the page tables pointed to the new locations, user space will never even notice the change. Kernel space is different, however; the kernel is full of direct pointers to its own memory, and it is not paged the way user space is. There is currently no realistic way to evacuate kernel structures from doomed memory.
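A toy model may make clear why user pages are the easy case. The class and method names below are invented for this sketch and correspond to nothing in the kernel; the point is only that the user's view goes through the page table, so changing one table entry is enough to hide the move.

```python
class ToyMemory:
    """Hypothetical model: physical frames plus one process's page table."""

    def __init__(self, nframes):
        self.frames = [None] * nframes   # contents of each physical frame
        self.page_table = {}             # virtual page number -> frame number

    def map_page(self, vpage, frame, data):
        self.frames[frame] = data
        self.page_table[vpage] = frame

    def read(self, vpage):
        # User space only ever names the virtual page.
        return self.frames[self.page_table[vpage]]

    def migrate(self, vpage, new_frame):
        """Copy the page to a new frame and repoint the page table."""
        old_frame = self.page_table[vpage]
        self.frames[new_frame] = self.frames[old_frame]
        self.page_table[vpage] = new_frame
        self.frames[old_frame] = None    # the old frame is now evacuated
```

A kernel structure reached through a raw pointer has no such table entry to update, which is the difficulty described above.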

The solution in the short term is to avoid allocating kernel structures in hot-removable memory. One idea being considered is to create a new memory zone, ZONE_HOTREMOVABLE, which would be managed separately. Adding this zone would make it easy to keep kernel allocations away, but it brings problems of its own. Memory zones have always been accompanied by balancing problems, so there is great reluctance to add new ones. This is especially true for the 64-bit architectures, which, for all practical purposes, have managed to move all of memory back into a single zone.
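The trade-off can be seen in a simplified allocator model. This is a sketch under stated assumptions: the zone names, the `movable` flag, and the fallback order are all made up for illustration and do not describe the proposed implementation.

```python
from collections import deque


class ToyZones:
    """Hypothetical two-zone allocator: 'normal' and 'hotremovable'."""

    def __init__(self, normal_frames, removable_frames):
        self.free = {
            "normal": deque(normal_frames),
            "hotremovable": deque(removable_frames),
        }

    def alloc(self, movable):
        # Movable (user) pages may live anywhere and prefer removable memory;
        # kernel allocations are confined to the normal zone.
        allowed = ["hotremovable", "normal"] if movable else ["normal"]
        for zone in allowed:
            if self.free[zone]:
                return zone, self.free[zone].popleft()
        raise MemoryError("no free frame in an allowed zone")
```

In this model a kernel allocation can fail while the hot-removable zone still has free frames, which is one small instance of the balancing problems that make developers reluctant to add zones.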

Linus's point of view is that the real purpose for hot-removable memory is to swap out a failing bank. In such cases, a fresh memory stick will be inserted to replace the old one. If the hardware would handle memory remapping, pages would not have to be relocated at all and the software would not have to jump through so many hoops to make things work. So he suggests that the right thing is to push the hardware manufacturers to add memory remapping to their products.

Meanwhile, there was also interest in how best to deal with dirty user pages in memory which is to be removed. The natural thing to do is to simply force them out to their backing store; they can then be unmapped and there is no need to find a new home for them (until they are faulted back in). It turns out, however, that forcing this I/O can take a very long time - even if problems like network filesystems are not part of the picture. In general, copying the dirty page to another location in RAM seems to yield better performance.

The problem there is that the page may be part of some filesystem's I/O buffers, and the move may be difficult to do. So Linus has suggested enhancing the filesystem API to include a "move this page" method. That would allow each filesystem to respond to a move request in whatever way made the most sense.
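What such a method might look like can be sketched in a few lines. No interface was specified at the session, so every name here (`move_page`, `ToyFilesystem`, `evacuate`) is hypothetical; the sketch shows only the shape of the idea, with writeback as the fallback when a filesystem cannot move a page.

```python
class ToyFilesystem:
    """Hypothetical base class: by default a filesystem cannot move pages."""

    def move_page(self, page, new_frame):
        return False


class BufferTrackingFs(ToyFilesystem):
    """A filesystem that knows where its I/O buffers live and can repoint them."""

    def __init__(self):
        self.buffers = {}            # page id -> frame holding its buffer

    def move_page(self, page, new_frame):
        if page not in self.buffers:
            return False
        self.buffers[page] = new_frame
        return True


def evacuate(fs, page, new_frame):
    """Ask the filesystem to move the page; otherwise force it to backing store."""
    if fs.move_page(page, new_frame):
        return "moved"
    return "written back"
```

Each filesystem decides for itself how to honor the request, and the caller keeps the slower writeback path for the ones that cannot.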

Rusty Russell talked a bit about CPU hotplugging. There was not a whole lot to say; the CPU patches have been in the mainline for a while now. There is some lingering concern that the code has seen little testing; hotplug CPU setups are still relatively rare. One way of addressing that could be changing the SMP code to use the hotplug subsystem; when the system boots, all but the boot CPU would be brought up via hotplug events. The software suspend code, too, could perhaps make some use of the hotplug subsystem.

Hugh Dickins led the last part of the discussion, which was meant to cover any remaining virtual memory topics. The main topic was page clustering.

The physical page size used by the VM subsystem is determined by the hardware; some processors are more flexible than others in this regard. In the end, Linux ends up using 4K pages on most architectures. The hardware page size need not be the same as the internal allocation size used by the kernel; in fact, there are certain advantages to be gained by clustering physical pages into larger units internally. These include:

  • A reduction in the size of the memory map. This data structure (an array of struct page structures) is used by the kernel to track the physical memory in the system. On the x86 architecture, struct page takes up 32 bytes; as a result, a 1GB system loses 8MB to the system memory map. It is a relatively small tax (and smaller than it was in 2.4), but losing that memory is still irritating.

  • Kernel algorithms which handle memory would have fewer pages to deal with, and would, thus, have less work to do.

  • On some architectures, the translation lookaside buffer (TLB, the hardware cache which remembers virtual-to-physical address mappings) can deal with larger pages. On such systems, the TLB could cover more of the current working set, and performance would improve.
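The memory map figure in the first item is simple arithmetic, worked through below. The 4K page size and 32-byte struct page come from the text; the 16K cluster size is an assumption picked only to show the effect of clustering.

```python
KB = 1 << 10
MB = 1 << 20
GB = 1 << 30


def mem_map_bytes(ram_bytes, page_bytes, struct_page_bytes=32):
    """Size of the memory map: one struct page per page of RAM."""
    return (ram_bytes // page_bytes) * struct_page_bytes


# 1GB of RAM in 4K pages is 262,144 pages, each costing 32 bytes.
unclustered = mem_map_bytes(1 * GB, 4 * KB)

# With a hypothetical 16K internal page, there are a quarter as many entries.
clustered = mem_map_bytes(1 * GB, 16 * KB)

print(unclustered // MB, "MB without clustering")
print(clustered // MB, "MB with 16K clusters")
```

The saving scales directly with the cluster size, which is why the gain looked larger when low memory was the limiting factor.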

Back in the 2.4 days, the benefits of the page clustering patch seemed compelling. Low memory was a limiting factor in the kernel's scalability, and it seemed that things would only get worse. Perspectives have changed, however. The performance increases to be had from page clustering seem small, while the patch remains big, intrusive (it touches over 200 files), and scary. There are also some interesting user-space compatibility issues to be dealt with. Certain Linux applications expect to be able to create memory mappings with a 4K resolution; the VM hackers would have to either figure out a way to preserve that capability, or accept breaking some applications.

While no final decisions have been made, it would appear that page clustering is not the hot topic it once was. Without users clamoring for it, page clustering may not find its way into the 2.7 kernel.


Kernel Summit: Virtual Memory

Posted Jul 20, 2004 9:01 UTC (Tue) by jmshh (guest, #8257) [Link]

Is there any talk about the little brother of hotplug memory: ECC memory scrubbing? Most memory failures seem to be random bit flips. System lifetime could be extended seriously by fixing those before the next flip happens at the same address.

Kernel Summit: Virtual Memory

Posted Jul 21, 2004 23:38 UTC (Wed) by brouhaha (subscriber, #1698) [Link]

ECC memory scrubbing is best done by the hardware. The AMD 760MP and 760MPX chip sets do this, as does the memory controller built into the Opteron and Athlon64. I'm not sure about scrubbing on Intel-based hardware (though the first system I ever used with ECC memory scrubbing was an Intel iAPX-432 system in the early 1980s).

If the hardware does not have scrubbing support, it can be done by a user level process with no special kernel support needed, but it will thrash the data cache. You should be able to scrub slowly enough that the overhead is minimal, while still maintaining the benefit of scrubbing.

Kernel Summit: Virtual Memory

Posted Jul 23, 2004 12:44 UTC (Fri) by chip (subscriber, #8258) [Link]

Scrubbing isn't as interesting to me as monitoring. The ecc module is little-known and seldom used, sadly.

Kernel Summit: Virtual Memory

Posted Jul 20, 2004 12:15 UTC (Tue) by lolando (subscriber, #7139) [Link]

> Rusty Russell talked a bit about CPU hotplugging. There was not a whole lot
> to say; the CPU patches have been in the mainline for a while now. There is
> some lingering concern that the code has seen little testing; hotplug CPU
> setups are still relatively rare. One way of addressing that could be
> changing the SMP code to use the hotplug subsystem; when the system boots,
> all but the boot CPU would be brought up via hotplug events. The software
> suspend code, too, could perhaps make some use of the hotplug subsystem.

Maybe power management, too? Oooh, oooh, my laptop battery won't last during this whole plane trip... Quick, let's switch off the second CPU! I'm only doing offline IMAP anyway (damn these planes for not having broadband wifi).

(Never mind that if I can afford a dual-processor laptop I should also be able to afford a larger battery)

CPU hot plug will matter for more laptops in future

Posted Jul 29, 2004 11:37 UTC (Thu) by ringerc (subscriber, #3071) [Link]

I wouldn't be too surprised to eventually see the day where the "celeron"
laptops of the world were the only ones with only one core. In this sort
of situation, I can imagine it being rather nice to be able to painlessly
"hot-swap" CPUs for power management purposes - preferably BEFORE the day
comes where the lack is a major problem.

Memory hotplugging vs. nonlinear memory

Posted Jul 21, 2004 9:00 UTC (Wed) by alonz (subscriber, #815) [Link]

Has no-one proposed using the nonlinear memory extension to also address memory hotplugging? After all, this will enable pages to be relocated in physical memory while still keeping their "logical" addresses (which are what the kernel would see), so the whole problem of remapping kernel memory (except, perhaps, anything used for DMA) should become much simpler...

Kernel Summit: Virtual Memory

Posted Jul 30, 2004 15:29 UTC (Fri) by metalwheaties (guest, #2136) [Link]

It is actually true that nearly EVERY system already has hardware remapping of memory on precisely the right granularity required for memory hot removal. It's the same mechanism the BIOS/boot firmware uses to map the DRAM contiguously in the first place. The memory controller must be told at boot time how much DRAM is associated with each chip select and to what physical address this chip select should respond. Changing these values while the memory controller is in use is possible - at least for the memory controllers I am familiar with.

Unfortunately, some CPU architectures have traditionally been closed with regard to these registers and their usage. Now AMD has opened up with Opteron. Only the Intel chipset world remains closed. If the Intel-compatible "chip set" manufacturers would open up their dirty little secrets and tell people how to program these registers without the double-secret-probation handshake they have with the small group of elite BIOS vendors, these problems would be easily solved by the smart people in the Linux community.

By the way, none of the other CPU architectures have EVER suffered from this secrecy problem, AFAIK.

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds