No kernel summit is complete without a lengthy virtual memory session,
perhaps because the VM problem will never be solved to everybody's
satisfaction.
The first topic was non-uniform memory access (NUMA) systems. At this
point, the biggest impediment to better NUMA support would appear to be the
kernel's limited ability to handle discontiguous memory. As NUMA systems get
larger, and with features like hotpluggable memory, the physical memory
address space is becoming increasingly fragmented and sparse. The
current discontiguous memory support in the kernel is not up to the task.
The solution would appear to be the nonlinear memory patch proposed by
Daniel Phillips back in 2002 (see LWN's coverage). Following the classic
rule that almost every problem in computer science can be solved by adding
another layer of indirection, this patch creates a new class of "logical"
addresses. Most of the kernel can work with logical addresses, which
present a nice, contiguous memory space. The kernel page tables map these
addresses into the real physical space, which is rather more complex.
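
The indirection described above can be illustrated with a small model. This is only a sketch of the general idea, not the code from the patch: the section size, the `NonlinearMap` class, and the bank addresses are all assumptions chosen for illustration.

```python
# Hypothetical model: memory is split into fixed-size sections, and a small
# table translates between contiguous logical sections and sparse physical ones.
SECTION_SHIFT = 24                    # assumed 16MB sections, illustrative only
SECTION_SIZE = 1 << SECTION_SHIFT


class NonlinearMap:
    def __init__(self, physical_bases):
        # physical_bases[i] is the section-aligned physical start of logical section i.
        self.logical_to_phys = list(physical_bases)
        self.phys_to_logical = {
            base >> SECTION_SHIFT: index
            for index, base in enumerate(self.logical_to_phys)
        }

    def to_physical(self, logical):
        section, offset = divmod(logical, SECTION_SIZE)
        return self.logical_to_phys[section] + offset

    def to_logical(self, physical):
        section = self.phys_to_logical[physical >> SECTION_SHIFT]
        return (section << SECTION_SHIFT) | (physical & (SECTION_SIZE - 1))


# Three populated banks scattered through a sparse physical address space.
nmap = NonlinearMap([0x0000_0000, 0x4000_0000, 0x9000_0000])
```

Code working with logical addresses sees three adjacent sections; only the two translation functions need to know where the banks really live.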
It was asked whether the current virtual memory area (VMA) structures could
be used to represent discontiguous memory. They are, after all, already
set up for the task of managing sparse address spaces. The answer was that
VMAs, being intended for the management of virtual
spaces, are too complex and bring a lot of unneeded baggage to the
discontiguous memory problem. Nonetheless, there may yet be an attempt to
split apart the VMA structure and use it for this task.
NUMA also brings problems in finding the physical location of devices and
communicating that to user space. A fair amount of discussion went into
the design of some sort of textual specification which could be
exported through sysfs. In the end, however, Linus suggested that it was
best to simply represent the system hierarchy directly in sysfs.
Administrators can then simply browse the space with a file manager, and
much of the problem is solved.
CPU scheduling issues may be calming down in general, but they still come
up on NUMA systems. The scheduling domains mechanism is still not being
used to its full potential on these systems. The domain setup code needs
to be reworked, ripping out much duplicated code while still allowing the
architecture code to tweak things as needed for the actual hardware at
hand.
Iwamoto Toshihiro led a discussion on support for hotpluggable memory.
Hotplug is interesting to people working in the high-availability space; if
a particular bank of memory is beginning to show errors, it can be ripped
out and replaced on the fly. Supporting this capability is not easy,
however.
Actually, adding memory is easy to support. The new pages are
simply added to the free list and promptly filled with GNOME or KDE theme
data. Removing memory is a little harder, as something really should be
done with the contents of that memory before the administrator is given the
go-ahead to yank it out. That means finding a new home for everything
contained in the outgoing memory.
User-space pages are relatively easy to relocate. They are accessed by way
of page tables; if they can be moved, and the page tables pointed to the
new locations, user space will never even notice the change. Kernel space
is different, however; the kernel is full of direct pointers to its own
memory, and it is not paged the way user space is. There is currently no
realistic way to evacuate kernel structures from doomed memory.
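
A toy model shows why user pages are the easy case. It is a sketch under stated assumptions: the `evacuate` function and its data structures are hypothetical and stand in for the real page tables and free lists.

```python
def evacuate(frames, page_table, free_frames, doomed):
    """Move every mapped page out of the doomed frames.

    frames:      list of frame contents, indexed by frame number
    page_table:  dict mapping virtual page number -> frame number
    free_frames: set of currently free frame numbers
    doomed:      set of frame numbers about to be removed
    """
    usable = sorted(free_frames - doomed)
    for vpn, old in sorted(page_table.items()):
        if old not in doomed:
            continue
        if not usable:
            raise MemoryError("no free frame outside the doomed range")
        new = usable.pop(0)
        free_frames.discard(new)
        frames[new] = frames[old]     # copy the page contents
        frames[old] = None
        page_table[vpn] = new         # repoint the mapping
    # Doomed frames must never be handed out again.
    free_frames.difference_update(doomed)


frames = ["a", "b", "c", None, None, None]
page_table = {0: 0, 1: 1, 2: 2}
free_frames = {3, 4, 5}
evacuate(frames, page_table, free_frames, doomed={1, 2, 3})
```

Anything that reads memory through `page_table` sees the same data before and after the move. A kernel structure referenced by a raw frame number would be left dangling, which is the problem described above.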
The solution in the short term is to avoid allocating kernel structures in hot-removable
memory. One idea being considered is to create a new memory zone,
ZONE_HOTREMOVABLE, which would be managed separately. Adding this
zone would make it easy to keep kernel allocations away, but
it brings problems of its own. Memory zones have always been accompanied
by balancing problems, so there is great reluctance to add new ones.
This is especially true for the 64-bit architectures, which, for all
practical purposes, have managed to move all of memory back into a single
zone.
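
The trade-off can be seen in a minimal sketch. The `ZonedAllocator` class and its `movable` flag are assumptions made for illustration; only the zone names come from the discussion and the existing kernel.

```python
class ZonedAllocator:
    """Toy allocator that counts free pages per zone."""

    def __init__(self, normal_pages, hotremovable_pages):
        self.free = {
            "ZONE_NORMAL": normal_pages,
            "ZONE_HOTREMOVABLE": hotremovable_pages,
        }

    def alloc(self, movable):
        # Movable (user) pages try the hot-removable zone first; kernel
        # pages may only ever come from ZONE_NORMAL.
        if movable:
            zonelist = ["ZONE_HOTREMOVABLE", "ZONE_NORMAL"]
        else:
            zonelist = ["ZONE_NORMAL"]
        for zone in zonelist:
            if self.free[zone] > 0:
                self.free[zone] -= 1
                return zone
        return None                   # allocation failure


allocator = ZonedAllocator(normal_pages=1, hotremovable_pages=2)
first = allocator.alloc(movable=False)    # served from ZONE_NORMAL
second = allocator.alloc(movable=False)   # fails despite two free pages elsewhere
```

The second kernel allocation fails while the hot-removable zone still has free pages, which is a small instance of the balancing problem that makes new zones unpopular.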
Linus's point of view is that the real purpose for hot-removable memory is
to swap out a failing bank. In such cases, a fresh memory stick will be
inserted to replace the old one. If the hardware would handle memory
remapping, pages would not have to be relocated at all and
the software would not have to jump through so many hoops to
make things work. So he suggests that the right thing is to push the
hardware manufacturers to add memory remapping to their products.
Meanwhile, there was also interest in how best to deal with dirty user
pages in memory which is to be removed. The natural thing to do is to
simply force them out to their backing store; they can then be unmapped and
there is no need to find a new home for them (until they are faulted back
in). It turns out, however, that forcing this I/O can take a very long
time - even if problems like network filesystems are not part of the
picture. In general, copying the dirty page to another location in RAM
seems to yield better performance.
The problem there is that the page may be part of some filesystem's I/O
buffers, and the move may be difficult to do. So Linus has suggested
enhancing the filesystem API to include a "move this page" method. That
would allow each filesystem to respond to a move request in whatever way
made the most sense.
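
One way to picture such a method is as a per-filesystem hook with a simple default. The sketch below is purely illustrative: the class names, the `move_page` signature, and the `io_in_flight` flag are assumptions, not an existing kernel interface.

```python
class Filesystem:
    def move_page(self, page, new_frame):
        """Default: no filesystem-private state, so just repoint the page."""
        page["frame"] = new_frame
        return True


class BufferedFilesystem(Filesystem):
    def move_page(self, page, new_frame):
        # Refuse while I/O is in flight against the page's buffers; the
        # caller can retry later or fall back to writing the page out.
        if page.get("io_in_flight"):
            return False
        return super().move_page(page, new_frame)


def relocate(page, new_frame):
    """Ask the owning filesystem to move the page; report whether it did."""
    return page["fs"].move_page(page, new_frame)


busy = {"fs": BufferedFilesystem(), "frame": 7, "io_in_flight": True}
idle = {"fs": BufferedFilesystem(), "frame": 8}
moved_busy = relocate(busy, 20)
moved_idle = relocate(idle, 21)
```

The hot-remove code only calls `relocate`; each filesystem decides for itself whether a move is safe at that moment.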
Rusty Russell talked a bit about CPU hotplugging. There was not a whole
lot to say; the CPU patches have been in the mainline for a while now.
There is some lingering concern that the code has seen little testing;
hotplug CPU setups are still relatively rare. One way of addressing that
concern would be to change the SMP code to use the hotplug subsystem; when the
system boots, all but the boot CPU would be brought up via hotplug events.
The software suspend code, too, could perhaps make some use of the hotplug
subsystem.
Hugh Dickins led the last part of the discussion, which was meant to cover
any remaining virtual memory topics. The main topic was page clustering.
The physical page size used by the VM subsystem is determined by the
hardware; some processors are more flexible than others in this regard. In
the end, Linux ends up using 4K pages on most architectures. The hardware
page size need not be the same as the internal allocation size used by the
kernel; in fact, there are certain advantages to be gained by clustering
physical pages into larger units internally. These include:
- A reduction in the size of the memory map. This data structure (an
array of struct page structures) is used by the kernel to
track the physical memory in the system. On the x86 architecture,
struct page takes up 32 bytes; as a result, a 1GB system
loses 8MB to the system memory map. It is a relatively small tax (and
smaller than it was in 2.4), but losing that memory is still
irritating.
- Kernel algorithms which handle memory would have fewer pages to deal
with, and would, thus, have less work to do.
- On some architectures, the translation lookaside buffer (TLB, which
remembers virtual-to-physical address mappings) can deal with larger
pages. On such systems, the TLB could cover more of the current
working set, and performance would improve.
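
The memory map figure in the first item is easy to check, and the same arithmetic shows what clustering would save. The function name and the four-page cluster size below are assumptions for illustration; the 32-byte structure size and 4K page size come from the text.

```python
MB = 1 << 20
GB = 1 << 30


def mem_map_bytes(ram_bytes, page_size, struct_page_size=32):
    """Size of the memory map: one struct page per tracked page."""
    return (ram_bytes // page_size) * struct_page_size


plain = mem_map_bytes(GB, 4096)            # one entry per 4K hardware page
clustered = mem_map_bytes(GB, 4 * 4096)    # assumed clusters of four pages

print(plain // MB, "MB without clustering")
print(clustered // MB, "MB with four-page clusters")
```

With 4K pages, a 1GB system has 262,144 pages, and 262,144 entries of 32 bytes come to 8MB. Grouping pages four at a time would cut that to 2MB.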
Back in the 2.4 days, the benefits of the page clustering patch seemed
compelling. Low memory was a limiting factor in the kernel's scalability,
and it seemed that things would only get worse. Perspectives have changed,
however. The performance increases to be had from page clustering seem
small, while the patch remains big, intrusive (it touches over 200 files), and scary.
There are also some interesting user-space compatibility issues to be dealt
with. Certain Linux applications expect to be able to create memory
mappings with a 4K resolution; the VM hackers would have to either figure
out a way to preserve that capability, or accept breaking some
applications.
While no final decisions have been made, it would appear that page
clustering is not the hot topic it once was. Without users clamoring for
it, page clustering may not find its way into the 2.7 kernel.