Preserving guest memory across kexec
The use case in question, Gowans began, is a live update of a hypervisor using the kernel's kexec functionality. To carry this out, the state of all running virtual machines is serialized to persistent storage, then kexec is used to boot into the updated hypervisor. After that, the virtual machines can all be restarted. The desire is to preserve the state of guest memory over the reboot, which means this memory cannot be managed by the host kernel in the traditional way; instead, the kernel should stay away from that memory and let user space manage its allocation to virtual machines. He and his colleagues have been looking at "sidecar virtual machines" as a way to implement this functionality.
Most guest memory, Gowans said, should not be touched by the new kernel, meaning that the kernel will only manage a small part of the memory given to guest systems. The userfaultfd() system call would be used to manage the rest; this requires changes, since userfaultfd() currently only works with anonymous memory. Future requirements will include keeping I/O memory-management unit (IOMMU) mappings in sync, keeping DMA operations running while the update happens, and improving the speed of kexec by passing more state to the new kernel.
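For reference, a userfaultfd()-based pager follows a register-and-resolve pattern: a memory range is registered with the kernel, after which page faults in that range are delivered to user space for resolution. Below is a minimal sketch of the existing anonymous-memory flow (error handling elided; a real pager would run the fault loop in its own thread and would copy in preserved guest pages rather than arbitrary data):

```c
/* Minimal sketch of user-space paging with userfaultfd(). This is
 * the existing anonymous-memory flow; the proposal discussed above
 * would extend this style of paging to kernel-untouched guest
 * memory. Error handling is elided for brevity. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void handle_one_fault(int uffd, long page_size, void *src_page)
{
	struct uffd_msg msg;

	/* Block until the kernel reports a fault in a registered range. */
	read(uffd, &msg, sizeof(msg));
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return;

	/* Resolve the fault by copying a page into place; a VMM would
	 * supply the preserved guest page contents here. */
	struct uffdio_copy copy = {
		.dst = msg.arg.pagefault.address & ~(page_size - 1),
		.src = (unsigned long)src_page,
		.len = page_size,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);
}

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);

	/* Create the userfaultfd and negotiate the API version. */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Register an anonymous region for missing-page events. */
	size_t region_size = 16 * page_size;
	char *area = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = region_size },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* In a real pager, this loop runs in a separate thread while
	 * other threads touch the registered region. */
	void *src_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	handle_one_fault(uffd, page_size, src_page);

	return 0;
}
```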
John Hubbard asked if memory managed in this way needs to have associated page structures; the answer was that they are not needed.
A few implementation options were presented. The first was a full filesystem, implemented in the kernel, that is used to manage allocations of reserved ranges of memory; the kernel would reconstruct this filesystem after a kexec. The PKRAM mechanism, which preserves RAM contents over a kexec, would probably be used for this purpose; the PKRAM patches were posted last year, but have not been merged. How to handle other types of memory, such as PCI memory-mapped I/O (MMIO) registers, is another open question.
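There is already a coarse way to make the kernel stay away from a range of physical memory: the memmap= boot parameter, which marks a region as reserved. A hypothetical example follows (the addresses are made up, and how the proposed filesystem would actually reserve and rediscover its memory is precisely what remains to be designed):

```
# Keep the kernel away from 512GB of RAM starting at physical
# offset 64GB, leaving it to be managed from user space. Note
# that the '$' often needs escaping in bootloader configuration
# files.
memmap=512G$64G
```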
The next implementation option was a FUSE-based filesystem; mapping of guest memory to page-frame numbers could then be handled from user space. A special control process could handle many of the details, and this solution would support mapping to PCI MMIO spaces.
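For a sense of what the user-space side might look like, this option could start from an ordinary libfuse skeleton like the hypothetical one below, exposing one file per guest-memory region; what FUSE cannot express today, and what would have to be added, is the binding of those files to specific page-frame numbers:

```c
/* Skeleton of a FUSE filesystem (libfuse 3) exposing one made-up
 * file per guest-memory region. FUSE currently has no way to bind
 * a file's pages to chosen page-frame numbers; that is the
 * extension this option would require. */
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static int gm_getattr(const char *path, struct stat *st,
		      struct fuse_file_info *fi)
{
	(void)fi;
	memset(st, 0, sizeof(*st));
	if (strcmp(path, "/") == 0) {
		st->st_mode = S_IFDIR | 0755;
		return 0;
	}
	if (strcmp(path, "/guest0") == 0) {	/* hypothetical region */
		st->st_mode = S_IFREG | 0600;
		st->st_size = 1UL << 30;	/* made-up region size */
		return 0;
	}
	return -ENOENT;
}

static const struct fuse_operations gm_ops = {
	.getattr = gm_getattr,
	/* open/read/write handlers, and the new page-frame-mapping
	 * hooks, would go here. */
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &gm_ops, NULL);
}
```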
Finally, this feature could be implemented using a raw memory device, something along the lines of /dev/mem. The control process could use ioctl() calls to create and revoke mappings to pages in the guest process. User space would be charged with keeping mappings in place across the kexec. There is evidently an implementation of this option running now.
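The closest existing analogue is mmap() on /dev/mem itself; a rough sketch of what re-establishing a mapping to a preserved physical range could look like appears below. The addresses are hypothetical, and CONFIG_STRICT_DEVMEM normally forbids mapping ordinary RAM this way, which is one reason a dedicated device with ioctl()-based control is attractive:

```c
/* Sketch of the raw-memory-device idea using /dev/mem as a
 * stand-in. The physical base and size are made up; a real
 * implementation would use a dedicated device whose ioctl()s
 * create and revoke per-guest mappings. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_PHYS	(64ULL << 30)	/* hypothetical reserved base */
#define GUEST_MEM_SIZE	(1ULL << 30)	/* hypothetical region size */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0)
		return 1;

	/* Map the preserved guest memory; after a kexec, user space
	 * would recreate exactly this mapping to find the old
	 * contents still in place. */
	void *guest = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, GUEST_MEM_PHYS);
	if (guest == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* ... hand the mapping to the VMM / KVM memory slots ... */

	munmap(guest, GUEST_MEM_SIZE);
	close(fd);
	return 0;
}
```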
Jan Kara observed that there are a number of other things that need to be restored after a kexec, including open files and more. This task resembles Checkpoint/Restore In Userspace (CRIU), which already exists. The response was that this solution does not try to recreate everything automatically; instead, hypervisor processes will be responsible for opening files again after the kexec. Woodhouse compared it to a live migration to the same host. Gowans said that guests won't notice this happening; they will be paused and serialized, and their previous state pushed back into KVM by the new hypervisor.
Returning to the implementation options, Gowans said that the full-filesystem approach offers the best latency and introspection, but it's not clear how MMIO regions can be handled. The FUSE approach gives full control to user space and solves the MMIO problem. The raw-memory version is the most flexible, but it requires reconstructing everything after the kexec, and is the least transparent to introspection.
Next steps include figuring out how to handle IOMMU mappings, then picking an approach to pursue. The preferred approach looks like the FUSE version, so the plan is to put together an RFC patch implementing it and to have a polished version by the KVM Forum in September.
Dan Williams said that the FUSE and raw-memory options look like the least scary ones. PKRAM, though, does look scary; he asked about the status of those patches. David Hildenbrand answered that the last posting of that work "didn't inspire joy".
The attendees were tired and the session wound down fairly quickly. The final question had to do with the existence of other use cases for this functionality. Hildenbrand suggested that databases could be a candidate. Specifically, huge, in-memory databases can take hours to boot and load up all of the data; a mechanism like this could possibly accelerate the process.
| Index entries for this article | |
|---|---|
| Kernel | Kexec |
| Kernel | userfaultfd() |
| Kernel | Virtualization |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022 |
Posted May 20, 2022 16:32 UTC (Fri) by developer122 (guest, #152928)
The solution tends to look a lot like a filesystem. When a computer is rebooted, contents on disk are still there and are rediscovered by the new instance of the filesystem driver. So, extend that metaphor to memory, where the "files" and "directories" are old processes and memory mappings. Page tables <-> filesystem trees.
Of course, there are some details, like making sure the previous version's mappings can be updated to reflect a new "on-disk format", and that calls to now-nonexistent system calls can be handled appropriately. Also, some RAM must be available for the new kernel to set itself up before walking the old page and process tables.
The beauty of a microkernel is that the userspace drivers probably still remember the state of hardware devices, even after the microkernel itself is rebooted.
Posted May 21, 2022 12:37 UTC (Sat) by mb (subscriber, #50428)

If you don't want to access the disk, then the only additional thing needed would be a ramdisk that survives reboots, where the checkpoints are stored. Such a ramdisk sounds way easier to implement than reinventing checkpoint/restore just for the kexec case.
Posted May 21, 2022 15:05 UTC (Sat) by developer122 (guest, #152928)
Dunno that it would be possible on current hardware though, with existing interrupt controllers and memory paging.
Posted May 22, 2022 7:37 UTC (Sun) by Wol (subscriber, #4433)
I've stood next to a system where my friend said, "if I pull out this board, no-one will notice except the engineers monitoring the system..."; he was pointing at the CPU. And the OS was just another replaceable component, same as...

Cheers,
Wol
Posted May 26, 2022 2:54 UTC (Thu) by developer122 (guest, #152928)
At the same time, the only reason that was possible was because of virtualization. All the "OSs" were running in logical partitions aka hardware VMs. If you lost the hypervisor then the system went down, and the base hypervisor firmware was impossible to replace (often because it's in ROM).
Posted May 20, 2022 20:53 UTC (Fri) by gpiccoli (subscriber, #109098)
It reminded me about this work from Pasha Tatashin: https://lpc.events/event/11/contributions/1078/

I couldn't find patches though...