Preserving guest memory across kexec
The use case in question, Gowans began, is a live update of a hypervisor using the kernel's kexec functionality. To carry this out, the state of all running virtual machines is serialized to persistent storage, then kexec is used to boot into the updated hypervisor. After that, the virtual machines can all be restarted. The desire is to preserve the state of guest memory over the reboot, which means this memory cannot be managed by the host kernel in the traditional way; instead, the kernel should stay away from that memory and let user space manage its allocation to virtual machines. He and his colleagues have been looking at "sidecar virtual machines" as a way to implement this functionality.
Most guest memory, Gowans said, should not be touched by the new kernel, meaning that the kernel will only manage a small part of the memory given to guest systems. The userfaultfd() system call would be used to manage the rest; this requires changes, since userfaultfd() currently only works with anonymous memory. Future requirements will include keeping I/O memory-management unit (IOMMU) mappings in sync, keeping DMA operations running while the update happens, and improving the speed of kexec by passing more state to the new kernel.
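For reference, a userfaultfd()-based pager follows a register-and-resolve pattern: a memory range is registered with the kernel, after which page faults in that range are delivered to user space for resolution. Below is a minimal sketch of the existing anonymous-memory flow (error handling elided; a real pager would run the fault loop in its own thread and would copy in preserved guest pages rather than arbitrary data):

```c
/* Minimal sketch of user-space paging with userfaultfd(). This is
 * the existing anonymous-memory flow; the proposal discussed above
 * would extend this style of paging to kernel-untouched guest
 * memory. Error handling is elided for brevity. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void handle_one_fault(int uffd, long page_size, void *src_page)
{
	struct uffd_msg msg;

	/* Block until the kernel reports a fault in a registered range. */
	read(uffd, &msg, sizeof(msg));
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return;

	/* Resolve the fault by copying a page into place; a VMM would
	 * supply the preserved guest page contents here. */
	struct uffdio_copy copy = {
		.dst = msg.arg.pagefault.address & ~(page_size - 1),
		.src = (unsigned long)src_page,
		.len = page_size,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);
}

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);

	/* Create the userfaultfd and negotiate the API version. */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Register an anonymous region for missing-page events. */
	size_t region_size = 16 * page_size;
	char *area = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = region_size },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* In a real pager, this loop runs in a separate thread while
	 * other threads touch the registered region. */
	void *src_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	handle_one_fault(uffd, page_size, src_page);

	return 0;
}
```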
John Hubbard asked if memory managed in this way needs to have associated page structures; the answer was that they are not needed.
A few implementation options were presented. The first was a full filesystem, implemented in the kernel, that is used to manage allocations of reserved ranges of memory; the kernel would reconstruct this filesystem after a kexec. The PKRAM mechanism, which preserves RAM contents over a kexec, would probably be used for this purpose; the PKRAM patches were posted last year, but have not been merged. How to handle other types of memory, such as PCI memory-mapped I/O (MMIO) registers, is another open question.
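There is already a coarse way to make the kernel stay away from a range of physical memory: the memmap= boot parameter, which marks a region as reserved. A hypothetical example follows (the addresses are made up, and how the proposed filesystem would actually reserve and rediscover its memory is precisely what remains to be designed):

```
# Keep the kernel away from 512GB of RAM starting at physical
# offset 64GB, leaving it to be managed from user space. Note
# that the '$' often needs escaping in bootloader configuration
# files.
memmap=512G$64G
```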
The next implementation option was a FUSE-based filesystem; mapping of guest memory to page-frame numbers could then be handled from user space. A special control process could handle many of the details, and this solution would support mapping to PCI MMIO spaces.
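For a sense of what the user-space side might look like, this option could start from an ordinary libfuse skeleton like the hypothetical one below, exposing one file per guest-memory region; what FUSE cannot express today, and what would have to be added, is the binding of those files to specific page-frame numbers:

```c
/* Skeleton of a FUSE filesystem (libfuse 3) exposing one made-up
 * file per guest-memory region. FUSE currently has no way to bind
 * a file's pages to chosen page-frame numbers; that is the
 * extension this option would require. */
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static int gm_getattr(const char *path, struct stat *st,
		      struct fuse_file_info *fi)
{
	(void)fi;
	memset(st, 0, sizeof(*st));
	if (strcmp(path, "/") == 0) {
		st->st_mode = S_IFDIR | 0755;
		return 0;
	}
	if (strcmp(path, "/guest0") == 0) {	/* hypothetical region */
		st->st_mode = S_IFREG | 0600;
		st->st_size = 1UL << 30;	/* made-up region size */
		return 0;
	}
	return -ENOENT;
}

static const struct fuse_operations gm_ops = {
	.getattr = gm_getattr,
	/* open/read/write handlers, and the new page-frame-mapping
	 * hooks, would go here. */
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &gm_ops, NULL);
}
```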
Finally, this feature could be implemented using a raw memory device, something along the lines of /dev/mem. The control process could use ioctl() calls to create and revoke mappings to pages in the guest process. User space would be charged with keeping mappings in place across the kexec. There is evidently an implementation of this option running now.
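The closest existing analogue is mmap() on /dev/mem itself; a rough sketch of what re-establishing a mapping to a preserved physical range could look like appears below. The addresses are hypothetical, and CONFIG_STRICT_DEVMEM normally forbids mapping ordinary RAM this way, which is one reason a dedicated device with ioctl()-based control is attractive:

```c
/* Sketch of the raw-memory-device idea using /dev/mem as a
 * stand-in. The physical base and size are made up; a real
 * implementation would use a dedicated device whose ioctl()s
 * create and revoke per-guest mappings. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_PHYS	(64ULL << 30)	/* hypothetical reserved base */
#define GUEST_MEM_SIZE	(1ULL << 30)	/* hypothetical region size */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0)
		return 1;

	/* Map the preserved guest memory; after a kexec, user space
	 * would recreate exactly this mapping to find the old
	 * contents still in place. */
	void *guest = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, GUEST_MEM_PHYS);
	if (guest == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* ... hand the mapping to the VMM / KVM memory slots ... */

	munmap(guest, GUEST_MEM_SIZE);
	close(fd);
	return 0;
}
```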
Jan Kara observed that there are a number of other things that need to be restored after a kexec, including open files and more. This task resembles Checkpoint/Restore In Userspace (CRIU), which already exists. The response was that this solution does not try to recreate everything automatically; instead, hypervisor processes will be responsible for opening files again after the kexec. Woodhouse compared it to a live migration to the same host. Gowans said that guests won't notice this happening; they will be paused and serialized, and their previous state pushed back into KVM by the new hypervisor.
Returning to the implementation options, Gowans said that the full-filesystem approach offers the best latency and introspection, but it's not clear how MMIO regions can be handled. The FUSE approach gives full control to user space and solves the MMIO problem. The raw-memory version is the most flexible, but it requires reconstructing everything after the kexec, and is the least transparent to introspection.
Next steps include figuring out how to handle IOMMU mappings, then picking an approach to pursue. The preferred approach looks like the FUSE version, so the plan is to put together an RFC patch implementing it and to have a polished version by the KVM Forum in September.
Dan Williams said that the FUSE and raw-memory options look like the least scary ones. PKRAM, though, does look scary; he asked about the status of those patches. David Hildenbrand answered that the last posting of that work "didn't inspire joy".
The attendees were tired and the session wound down fairly quickly. The final question had to do with the existence of other use cases for this functionality. Hildenbrand suggested that databases could be a candidate. Specifically, huge, in-memory databases can take hours to boot and load up all of the data; a mechanism like this could possibly accelerate the process.
| Index entries for this article | |
|---|---|
| Kernel | Kexec |
| Kernel | userfaultfd() |
| Kernel | Virtualization |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022 |
Posted May 20, 2022 16:32 UTC (Fri) by developer122 (guest, #152928)
The solution tends to look a lot like a filesystem. When a computer is rebooted, contents on disk are still there and are rediscovered by the new instance of the filesystem driver. So, extend that metaphor to memory, where the "files" and "directories" are old processes and memory mappings. Page tables <-> filesystem trees.
Of course, there are some details, like making sure the previous version's mappings can be updated to reflect a new "on-disk format", and that calls to now-nonexistent system calls can be handled appropriately. Also, some RAM must be available for the new kernel to set itself up before walking the old page and process tables.
The beauty of a microkernel is that the userspace drivers probably still remember the state of hardware devices, even after the microkernel itself is rebooted.
Posted May 21, 2022 12:37 UTC (Sat) by mb (subscriber, #50428)

If you don't want to access the disk, then the only additional thing needed would be a ramdisk that survives reboots, where the checkpoints are stored. Such a ramdisk sounds way easier to implement than reinventing checkpoint/restore just for the kexec case.
Posted May 21, 2022 15:05 UTC (Sat) by developer122 (guest, #152928)
Dunno that it would be possible on current hardware though, with existing interrupt controllers and memory paging.
Posted May 22, 2022 7:37 UTC (Sun) by Wol (subscriber, #4433)
I've stood next to a system where my friend said, "if I pull out this board, no-one will notice except the engineers monitoring the system..."; he was pointing at the CPU. And the OS was just another replaceable component, same as...

Cheers,
Wol
Posted May 26, 2022 2:54 UTC (Thu) by developer122 (guest, #152928)
At the same time, the only reason that was possible was because of virtualization. All the "OSs" were running in logical partitions aka hardware VMs. If you lost the hypervisor then the system went down, and the base hypervisor firmware was impossible to replace (often because it's in ROM).
Posted May 20, 2022 20:53 UTC (Fri) by gpiccoli (subscriber, #109098)
It reminded me about this work from Pasha Tatashin: https://lpc.events/event/11/contributions/1078/

I couldn't find patches though...