At LinuxCon
Japan, Yasuaki Ishimatsu of Fujitsu talked about the status of memory
hotplug, with a focus on what still needs to be done to fully support both
hot adding and hot removing memory. If a memory device is broken in a
laptop or desktop, you can just replace that memory, but for servers, especially
ones that need to stay running, it is more difficult. In addition,
having a way to add and remove memory would allow for dynamic
reconfiguration on systems where the hardware has been partitioned into two
or more virtual machines.
The focus of the memory hotplug work is for both scenarios: broken memory
hardware and dynamic reconfiguration. Memory hotplug will be supported in
KVM, Ishimatsu said. It is currently supported by several operating
systems, but Linux does not completely support it yet. Fixing that is the
focus this work.
There are two phases to memory hotplug: physically adding or removing
memory (hot add or hot remove) and logically changing the amount of memory
available to the system (onlining or offlining memory). Both phases have
to be completed before Linux can use any new memory, and taking the memory
offline (so that Linux is no longer using it) is required before it can be
removed.
The memory management subsystem manages physical memory by using two
structures, he said. The page tables hold a direct mapping for virtual to
physical addresses. The virtual memory map manages page structures. In
order to offline memory, any data needs to be moved out of the memory and
those data structures need to be updated. Likewise, when adding memory,
new page table and
virtual memory map entries must be added.
Pages are managed in zones and, when using the sparse memory model that is
needed for memory hotplug systems, zones are broken up into sections that are
128M in size. Sections can be switched from online to offline and vice
versa using the /sys/devices/system/memory/memoryX/state file. By
echoing offline or online into that file, the pages in
that section have their state changed to unusable or usable respectively.
In the 3.2 kernel, hot adding memory and onlining it were fully
supported. Offlining memory was supported with limitations, and hot
removing it was not supported at all. Work started in July 2012 to remove
the offline limitations and to add support for hot remove, Ishimatsu said.
The work for hot remove has been merged for the 3.9 kernel. It will invalidate
page table and virtual memory map entries that correspond to the memory
being removed. But, since the memory must be taken offline before it is
removed, the limitations on memory offline still make it impossible to
remove arbitrary memory hardware from the system.
When memory that is to be offlined has data in it, that data is migrated to
other memory in the system. But the only pages that are migratable this
way are the page cache and anonymous pages, which are known as "movable"
pages. If the memory contains non-movable memory, which Ishimatsu called
"kernel memory",
the section cannot be offlined.
There are two ways to handle that problem that are being considered. The
first is to support moving kernel memory when offlining pages
that contain it. The advantages to that are that all memory can be
offlined and there is no additional performance impact for NUMA systems
since there are no restrictions on the types of allocations that can be
made. On the
downside, though, the kernel physical to virtual address relationship will
need to change completely. The other alternative is to make all of a
node's memory movable, which would reuse the existing movable memory
feature, but means that only page cache and anonymous pages can be stored
there, which will impact the performance of that NUMA node.
Ishimatsu said that he prefers the first solution personally, but, as a
first step they are implementing the second: creating a node that consists
only of movable memory. Linux has the idea of a movable zone
(i.e. ZONE_MOVABLE), but zones of that type are not created
automatically. If a node consists only of movable memory, all of it can be
migrated elsewhere so that the node can be taken offline.
A new boot option, movablecore=acpi, is under development that will use
the memory affinity structure in the ACPI static resource affinity table
(SRAT) to choose which nodes will
be constructed of movable memory. The existing use for
movablecore allows setting aside a certain amount of memory that
will be
movable in the system, but it spreads it evenly across all of the nodes
rather than concentrating it only on the nodes of interest. The "hotpluggable"
bit for a node in the SRAT will be used to choose the target nodes in the
new mode.
Using the online_movable flag to the sysfs memory state
file (rather than just online) allows an administrator to tell the
system to make that memory movable. Without that, the onlined memory is
treated as ZONE_NORMAL, so it may contain kernel memory and thus
not be able to be offlined. The online_movable feature was merged
for 3.8. That reduces the limitations on taking memory offline, but there
is still work to do.
Beyond adding the movablecore=acpi boot option (and possibly a
vm.hotadd_memory_treat_as_movable sysctl), there are some other
plans. Finding a way to put the page tables and virtual memory map into the
hot-added memory is something Ishimatsu would like to see, because it would
help performance on that node, but would not allow that memory to be
offlined unless those data structures can be moved. He is thinking about
solutions for
that. Migrating vmalloc() data to other nodes when offlining a
node is another feature under consideration.
Eventually, being able to migrate any kernel memory out of a node is
something he would like to see, but solutions to that problem are still somewhat
elusive. He encouraged those in attendance to participate in the
discussions and to help find solutions for these problems.
[I would like to thank the Linux Foundation for travel assistance to Tokyo
for LinuxCon Japan.]
(
Log in to post comments)