
Plans for hot adding and removing memory

By Jake Edge
June 12, 2013
LinuxCon Japan 2013

At LinuxCon Japan, Yasuaki Ishimatsu of Fujitsu talked about the status of memory hotplug, with a focus on what still needs to be done to fully support both hot adding and hot removing memory. If a memory device is broken in a laptop or desktop, you can just replace that memory, but for servers, especially ones that need to stay running, it is more difficult. In addition, having a way to add and remove memory would allow for dynamic reconfiguration on systems where the hardware has been partitioned into two or more virtual machines.

The memory hotplug work is aimed at both scenarios: broken memory hardware and dynamic reconfiguration. Memory hotplug will also be supported in KVM, Ishimatsu said. Several operating systems already support it, but Linux does not yet support it completely; fixing that is the focus of this work.

There are two phases to memory hotplug: physically adding or removing memory (hot add or hot remove) and logically changing the amount of memory available to the system (onlining or offlining memory). Both phases have to be completed before Linux can use any new memory, and taking the memory offline (so that Linux is no longer using it) is required before it can be removed.

The memory management subsystem manages physical memory using two structures, he said. The page tables hold the direct mapping from virtual to physical addresses. The virtual memory map holds the page structures that describe each physical page. In order to offline memory, any data needs to be moved out of that memory and those two structures need to be updated. Likewise, when adding memory, new page table and virtual memory map entries must be added.
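
To see why the virtual memory map matters here, it helps to know that, under the sparse vmemmap model, the map is simply a virtually contiguous array of page structures. The following is a lightly simplified version of the conversion macros in the kernel's include/asm-generic/memory_model.h (the exact definitions depend on the configuration):

    /* Simplified sketch of the sparse-vmemmap memory model: the virtual
     * memory map is a virtually contiguous array of struct page, so
     * converting a page frame number (pfn) to its struct page is plain
     * pointer arithmetic.  Hot add must populate the slice of this
     * array (and the direct-mapping page tables) that covers the new
     * memory; hot remove must invalidate it. */
    #define __pfn_to_page(pfn)   (vmemmap + (pfn))
    #define __page_to_pfn(page)  (unsigned long)((page) - vmemmap)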

Pages are managed in zones and, when using the sparse memory model that is needed for memory hotplug, zones are broken up into sections that are 128MB in size. Sections can be switched between online and offline using the /sys/devices/system/memory/memoryX/state file: echoing offline or online into that file marks the pages in that section as unusable or usable, respectively.
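
For illustration, here is a minimal sketch of driving that interface from a program rather than with echo; the memstate name, and the section numbers used with it below, are hypothetical examples:

    /*
     * Minimal sketch: change the state of one memory section via sysfs.
     * Usage: memstate <section> <online|online_movable|offline>
     * Must run as root; the kernel will refuse the request (EBUSY) if
     * the section's pages cannot be migrated away.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        char path[128];
        FILE *f;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <section> <state>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/memory%s/state", argv[1]);
        f = fopen(path, "w");
        if (!f) {
            fprintf(stderr, "%s: %s\n", path, strerror(errno));
            return 1;
        }
        /* the kernel performs the transition when the write completes */
        if (fputs(argv[2], f) == EOF || fclose(f) == EOF) {
            fprintf(stderr, "%s failed: %s\n", argv[2], strerror(errno));
            return 1;
        }
        return 0;
    }

Running memstate 32 offline would then ask the kernel to offline section 32, just as the echo does.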

In the 3.2 kernel, hot adding memory and onlining it were fully supported. Offlining memory was supported with limitations, and hot removing it was not supported at all. Work started in July 2012 to remove the offline limitations and to add support for hot remove, Ishimatsu said.

The work for hot remove has been merged for the 3.9 kernel. It will invalidate page table and virtual memory map entries that correspond to the memory being removed. But, since the memory must be taken offline before it is removed, the limitations on memory offline still make it impossible to remove arbitrary memory hardware from the system.

When memory that is to be offlined has data in it, that data is migrated to other memory in the system. But the only pages that can be migrated this way are page cache and anonymous pages, which are known as "movable" pages. If a section contains non-movable pages, which Ishimatsu called "kernel memory", it cannot be offlined.
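
The sysfs memory directories also expose a removable attribute that reports whether a section currently holds only movable or free pages. Here is a small sketch of checking it before attempting an offline; note that the attribute is a point-in-time hint, not a guarantee that a later offline will succeed:

    /* Sketch: ask whether a section's pages currently look movable. */
    #include <stdio.h>

    int main(void)
    {
        /* "memory32" is a hypothetical section number */
        FILE *f = fopen("/sys/devices/system/memory/memory32/removable", "r");
        int removable = 0;

        if (!f) {
            perror("open removable");
            return 1;
        }
        if (fscanf(f, "%d", &removable) != 1)
            removable = 0;
        fclose(f);
        printf("section 32 %s\n", removable ?
               "looks offlinable" : "holds non-movable (kernel) memory");
        return 0;
    }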

Two ways of handling that problem are being considered. The first is to support moving kernel memory when offlining pages that contain it. The advantages are that all memory can be offlined and that there is no additional performance impact on NUMA systems, since there are no restrictions on the types of allocations that can be made. On the downside, though, the kernel's physical-to-virtual address relationship would need to change completely. The alternative is to make all of a node's memory movable. That would reuse the existing movable-memory feature, but it means that only page cache and anonymous pages can be stored on the node, which will hurt the performance of that NUMA node.

Ishimatsu said that he personally prefers the first solution, but, as a first step, they are implementing the second: creating a node that consists only of movable memory. Linux already has the idea of a movable zone (ZONE_MOVABLE), but zones of that type are not created automatically. If a node consists only of movable memory, all of its contents can be migrated elsewhere so that the node can be taken offline.

A new boot option, movablecore=acpi, is under development; it will use the memory affinity structure in the ACPI static resource affinity table (SRAT) to choose which nodes will be made up of movable memory. The existing movablecore option allows setting aside a certain amount of movable memory, but that memory is spread evenly across all of the nodes rather than being concentrated on the nodes of interest. In the new mode, the "hotpluggable" bit for each node in the SRAT will be used to choose the target nodes.
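
As an illustration, the two modes might look like this on a kernel command line; the movablecore=acpi form is the proposal from the talk, not a shipped interface, so its exact syntax could still change:

    # Existing behavior: reserve 4GB of movable memory, spread
    # evenly across all NUMA nodes.
    linux /boot/vmlinuz root=/dev/sda1 movablecore=4G

    # Proposed behavior: make every node whose SRAT entry has the
    # hotpluggable bit set consist entirely of movable memory.
    linux /boot/vmlinuz root=/dev/sda1 movablecore=acpi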

Writing online_movable (rather than just online) to the sysfs memory state file allows an administrator to tell the system to make that memory movable. Without it, the onlined memory is treated as ZONE_NORMAL, so it may come to contain kernel memory and thus become impossible to offline. The online_movable feature was merged for 3.8. That reduces the limitations on taking memory offline, but there is still work to do.
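
In terms of the hypothetical memstate sketch above, the difference looks like this:

    ./memstate 32 online            # ZONE_NORMAL: may end up pinned by kernel memory
    ./memstate 32 online_movable    # ZONE_MOVABLE: can be offlined again later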

Beyond adding the movablecore=acpi boot option (and possibly a vm.hotadd_memory_treat_as_movable sysctl), there are some other plans. Finding a way to put the page tables and virtual memory map into the hot-added memory is something Ishimatsu would like to see, because it would help performance on that node, but would not allow that memory to be offlined unless those data structures can be moved. He is thinking about solutions for that. Migrating vmalloc() data to other nodes when offlining a node is another feature under consideration.
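
If the vm.hotadd_memory_treat_as_movable sysctl materializes, it might be used along these lines; since it was only a plan at the time of the talk, both the name and the semantics should be treated as tentative:

    # /etc/sysctl.conf -- proposed (not yet merged) knob: treat all
    # hot-added memory as movable, without requiring an explicit
    # online_movable for each section.
    vm.hotadd_memory_treat_as_movable = 1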

Eventually, being able to migrate any kernel memory out of a node is something he would like to see, but solutions to that problem are still somewhat elusive. He encouraged those in attendance to participate in the discussions and to help find solutions for these problems.

[I would like to thank the Linux Foundation for travel assistance to Tokyo for LinuxCon Japan.]



Plans for hot adding and removing memory

Posted Jun 17, 2013 11:49 UTC (Mon) by nix (subscriber, #2304)

Another potential use for hotpluggable memory, which has been pointed out here now and then, is to speed up boot for machines with lots of ECC RAM (generally servers, but in my ideal world every machine would use ECC RAM exclusively, and in the end increasing RAM volumes may make it essential). Right now, the more ECC RAM you've got, the slower the machine boots, because it has to initialize all that RAM at bootup, which is constrained by CPU->RAM speeds, which remain relatively low and are scaling more slowly than RAM size. The 24GB of ECC RAM in my relatively small server takes about 30s to initialize, and I've heard of machines that take 15 minutes.

What if the BIOS could be told to initialize only, say, 1GB of it, and then hand the rest off to the OS, which could initialize the remaining memory in the background after booting and hotplug it in when it's ready? The boot time suddenly stops getting worse and worse, and the cost (a machine that starts with relatively little memory and gains more as it goes) is harmless, because by definition you could never fill the RAM faster than it is being initialized anyway, since that initialization is constrained by the CPU->RAM interconnect speed.

But obviously before anyone could implement this scheme memory hotplug needs to work!

Plans for hot adding and removing memory

Posted Jun 17, 2013 15:38 UTC (Mon) by etienne (subscriber, #25256)

> initialize all that RAM at bootup, which is constrained by CPU->RAM speeds
> 24GB of ECC RAM ... takes about 30s to initialize

http://en.wikipedia.org/wiki/List_of_device_bandwidths#Dy...
says that any PC3-* bandwidth is over 10 GBytes/s, so 24GB should take less than 2.4 seconds at maximum CPU->RAM speed.
Your CPU is probably busy doing something else, and the PC lacks a good DMA engine to handle this kind of low-level task.

Plans for hot adding and removing memory

Posted Jun 18, 2013 12:48 UTC (Tue) by nix (subscriber, #2304)

This is DDR3-1066: 8GiB/s. So, yes, it should take only a couple of seconds, if it were physically possible to write to RAM that fast. It isn't: just the CAS switches between all those cells would take longer than that. (Sure, perhaps this should be done by the memory controller, in parallel, since 'initialize to a known state' shouldn't actually require umpty GB of data to be transferred as long as all the RAM agreed on what that known state was. But that requires substantial hardware changes and standards changes to agree on the known state, much harder than a mere firmware change, in theory anyway...)

Plans for hot adding and removing memory

Posted Jun 17, 2013 19:01 UTC (Mon) by dlang (✭ supporter ✭, #313)

you are mixing up the time needed to initialize the RAM and the time needed to do tests on the RAM.

Every BIOS I've seen has a setting that will let you bypass the detailed memory tests at boot time.

Very few people boot their systems frequently enough to bother changing this setting, though.

Plans for hot adding and removing memory

Posted Jun 17, 2013 20:34 UTC (Mon) by etienne (subscriber, #25256)

Because he has ECC RAM, it needs to be written once to initialise the ECC bits.
Well, theoretically no memory should be read before being written, but in practice something may do it (obviously only in another OS; that would never happen on Linux).
In Linux, if you get an ECC error, you directly assume the RAM is faulty, not that some area was read before being written.

Plans for hot adding and removing memory

Posted Jun 17, 2013 20:52 UTC (Mon) by dlang (✭ supporter ✭, #313)

right, but even on the servers with 128GB of RAM that I have, this only takes a few seconds (and I suspect that a noticeable chunk of that time is spent updating the display to report progress :-)

nowhere near the several-minute figures mentioned above.

A full memory check on these 128GB systems does take a few minutes.

Plans for hot adding and removing memory

Posted Jun 18, 2013 12:44 UTC (Tue) by nix (subscriber, #2304)

It's not doing a memory check; that takes ages and I turned it off.

But it is quite true that BIOSes are so opaque (and so badly written) that it could very well be spending its time doing something else, probably terribly inefficiently!

But, still, the 'need to write everything' and the ever-shrinking ratio of memory bandwidth to RAM volume *are* eventually going to have the effects I suggest above, even if they don't now. So it's good that Linux already has the machinery necessary to fix it, if the BIOSes would please catch up. (Actually, looking at the article more closely, Linux has had this since 3.2 or thereabouts, since all this scheme needs is plugging, not unplugging.)
