Live migration of virtual machines over CXL
Traditional migration, Stancevic began, is a four-step process. As much of the virtual machine as possible is pre-copied to its new home while that machine continues to run, after which the virtual machine is quiesced — stopped so that its memory no longer changes. A second copy operation is done to update anything that may have changed during the pre-copy phase; then, finally, the moved machine is relaunched in its new home. Stancevic's goal is to create a "nil-migration" scheme that takes much of the work — and the need to quiesce the target machine — out of the picture.
Specifically, this scenario is meant to work in situations where both
physical hosts have access to the same pool of CXL shared memory. In such
a setting, migrate_pages()
can be used to move the virtual machine's pages to the shared-memory pool
without disturbing the operation of the machine itself; at worst, its
memory accesses slow down slightly. Once the memory
migration is complete, the virtual machine can be quickly handed over to
the new host, which also has access to that memory; the machine should be
able to begin executing there almost immediately. The goal is to make
virtual-machine migration as fast as task switching on a single host — an
action that could happen transparently between that machine's normal time
slices. Eventually, the new host could migrate the virtual machine's
memory into its own directly attached RAM.
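As a concrete, if simplified, illustration of that first step, the sketch below uses libnuma's wrapper around the migrate_pages() system call to push a running guest's pages onto a NUMA node assumed to be backed by the CXL shared-memory pool. It is an illustration rather than code from the talk; the target node number is just supplied by the operator here.

```c
/*
 * A minimal sketch, not code from the talk: move a running process's
 * pages onto a NUMA node assumed to be backed by the CXL shared pool.
 *
 * Build with: gcc -o push-to-cxl push-to-cxl.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <target-node>\n", argv[0]);
		return 1;
	}

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	int pid = atoi(argv[1]);
	int node = atoi(argv[2]);

	/* Source mask: any node the guest's pages might currently be on. */
	struct bitmask *from = numa_allocate_nodemask();
	copy_bitmask_to_bitmask(numa_all_nodes_ptr, from);

	/* Destination mask: the (assumed) CXL-backed node. */
	struct bitmask *to = numa_allocate_nodemask();
	numa_bitmask_setbit(to, node);

	/*
	 * Ask the kernel to move the pages; the process keeps running the
	 * whole time, which is the point of the nil-migration idea.  The
	 * return value is the number of pages that could not be moved.
	 */
	int left = numa_migrate_pages(pid, from, to);
	if (left < 0)
		perror("numa_migrate_pages");
	else if (left > 0)
		printf("%d pages could not be moved\n", left);

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return left ? 1 : 0;
}
```

On the other side of the move, the destination host could run the mirror-image operation, pulling the pages from the CXL node into its own directly attached RAM as described above.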
This work, he said, is still in an early stage. It does, however, have a mailing list and a web site at nil-migration.org.
David Hildenbrand asked about pass-through devices — devices on the host computer that are made directly available to a virtual machine. Those, he said, cannot be migrated through CXL memory, or in any other way, for that matter. Stancevic agreed that such configurations simply would not work. Dan Williams asked whether migration through CXL memory was really necessary or if, instead, virtual machines could just live in CXL memory all the time. In Stancevic's use case, CXL shared-memory pools are only used for virtual-machine migration, but other configurations are possible.
Another audience member asked whether it would be possible to do a pre-copy phase over the network first, and only use CXL for any remaining pages just before the move takes place. Stancevic answered that it could work, but would defeat the purpose of keeping the virtual machine running at all times.
Yet another attendee pointed out that CXL memory may not be mapped at the same location on each physical host, and wondered how the nil-migration scheme handles that. The answer is that, so far, this scheme has only been tested with QEMU-emulated hardware (CXL 3.0 hardware won't be available for a while yet), and it is easy to make the mappings match in that environment. This will be a problem when real hardware arrives, though, and a solution has not yet been worked out.
The final question, from Williams, was whether the nil-migration system
would need new APIs to identify the available CXL devices. Stancevic
answered that having CXL resources show up as NUMA nodes is
the best solution, but that it would be good to have some metadata show up
in sysfs to help with figuring out the paths between the hosts.
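No such metadata exists yet, but the NUMA-node view is already visible from user space: CXL memory expanders typically surface as memory-only (CPU-less) nodes. The short sketch below — an assumption about how one might probe this today, not an interface discussed in the session — walks the standard sysfs node directory and flags nodes with an empty cpulist.

```c
/*
 * A small illustration (assumed, not from the session): list NUMA nodes
 * and flag the CPU-less ones, which is how memory-only (e.g. CXL-backed)
 * nodes usually appear.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	DIR *dir = opendir("/sys/devices/system/node");
	if (!dir) {
		perror("opendir");
		return 1;
	}

	struct dirent *de;
	while ((de = readdir(dir)) != NULL) {
		unsigned int node;
		if (sscanf(de->d_name, "node%u", &node) != 1)
			continue;

		char path[128];
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%u/cpulist", node);

		FILE *f = fopen(path, "r");
		if (!f)
			continue;

		char cpulist[256] = "";
		if (!fgets(cpulist, sizeof(cpulist), f))
			cpulist[0] = '\0';
		fclose(f);
		cpulist[strcspn(cpulist, "\n")] = '\0';

		/* An empty cpulist marks a memory-only node. */
		printf("node%u: %s\n", node,
		       cpulist[0] ? cpulist : "memory-only (possibly CXL)");
	}

	closedir(dir);
	return 0;
}
```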
| Index entries for this article | |
|---|---|
| Kernel | Compute Express Link (CXL) |
| Kernel | Virtualization |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2023 |
Posted May 15, 2023 17:47 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (20 responses)
So, CXL has been possible for a long time. It’s going to have huge performance limitations, but also some real use cases. My question is, if anyone knows, why now? What forces are driving it into being now and not 10 or 15 years ago?
Posted May 15, 2023 18:09 UTC (Mon)
by jbowen (subscriber, #113501)
[Link]
Posted May 15, 2023 18:47 UTC (Mon)
by MattBBaker (guest, #28651)
[Link]
I've just been assuming it's been patent related. The Cray T3E came out in 1995 and with 20 year patents that means that anything patented in that machine would expire in 2015. The Gen-Z consortium went public in 2016. I don't have any proof of this, but the coincidences really line up nicely.
Posted May 15, 2023 19:18 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
But the deeper idea goes beyond cloud. The real point of a technology like this is to make reliability a fungible asset that you can buy and sell. If you want more nines of uptime, you can buy them, at the cost of either more expensive hardware or reduced performance (or both). In practice, you can offload that cost to your cloud provider and let them deal with the economies of scale (or you can deal with it yourself, if you really want to).
CXL is just one way of doing it, of course. There's also the option of explicitly tracking and accounting for disruptions, as k8s does.[1] This is cheaper and arguably a superior model overall (you figure out how many nines you want, your cloud provider figures out how to do that and how much they want to charge you for it, money changes hands, the end). The thing about VMs is, from the cloud provider's perspective, a VM is a black box. If we shut down a customer's VM, we have to reboot it, and that's (potentially, for some workloads) much more noticeable than "one of our k8s pods was briefly unavailable during a maintenance event" (because k8s explicitly contemplates that as a possibility, and users are encouraged to design around it). So we really would prefer to avoid disrupting the VM if at all possible, and the only way to do that is some sort of live migration technology. CXL is then about being able to tell the consumer "we slowed down your VM" instead of "we paused your VM."
Disclaimer: My employer (Google) is a large cloud provider, and some of my work is related to GCP's backend (but GCP's backend is very large and I only work on a tiny piece of it).
[1]: https://kubernetes.io/docs/concepts/workloads/pods/disrup...
Posted May 16, 2023 15:24 UTC (Tue)
by Paf (subscriber, #91811)
[Link]
Posted May 17, 2023 5:34 UTC (Wed)
by nilsmeyer (guest, #122604)
[Link] (2 responses)
Posted May 17, 2023 6:57 UTC (Wed)
by atnot (guest, #124910)
[Link]
Posted May 18, 2023 23:57 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted May 15, 2023 20:48 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (11 responses)
Part of what's going on is that sizing memory to CPU correctly has turned out to be a really tough challenge - you can't just overprovision memory by "enough", and accept waste, because the cost (not just capital cost, but power as well) is too high. And it turns out that you can implement the shared coherency protocols in such a way that if there's no sharing, the cost of coherency is zero - you only pay the coherency cost when two or more nodes on the CXL network are trying to access the same memory node.
This is attractive to big providers - instead of having to provision some hosts with 8 GiB RAM per CPU core to allow for workloads that need that much RAM, and others with 512 MiB per CPU core for "normal" workloads, you can provision all hosts with 512 MiB per CPU core, and have CXL-attached RAM that allows you to add extra RAM per CPU core for RAM intensive workloads. In turn, this means that all your hosts are "normal" size hosts, and the CXL-attached RAM is given to a host that's running a RAM intensive workload, which can access it at an affordable penalty, and in theory reduces your total costs, because you don't have your spare high-RAM hosts wasting RAM running normal workloads.
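To put rough numbers on that, using the figures from the comment (the 64-core host is an assumption for illustration only):

$$
64 \times 512\,\text{MiB} = 32\,\text{GiB local},
\qquad
64 \times 8\,\text{GiB} = 512\,\text{GiB needed},
\qquad
512 - 32 = 480\,\text{GiB from the CXL pool}
$$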
Posted May 16, 2023 15:30 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (10 responses)
It's fascinating to me that that the latency cost of this (because even if you're not paying inter-node coherency related costs, the RAM is still physically distant) is acceptable for those high RAM apps, but I suppose even RAM 'over there' is a lot faster than any (realistic, cheap, available - sorry pmem) storage technology. That and my bias is towards compute jobs, for which memory latency is an absolute killer. Most memory intensive jobs are probably more akin to databases (or just literally are databases) handling transactions that may well be user facing, and so your latency window is way higher. (Relative to a compute job trying to get to the next step of computation)
Posted May 17, 2023 14:41 UTC (Wed)
by willy (subscriber, #9762)
[Link] (9 responses)
Posted May 17, 2023 15:42 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (4 responses)
Okay, chips are still getting more powerful, there's ways round this (at least, as an outsider, that's how it appears to me). But the days of speeding the clock up are over. Unless, of course, you can simultaneously shrink the size and increase the efficiency of the heat sink ...
Cheers,
Wol
Posted May 17, 2023 19:36 UTC (Wed)
by willy (subscriber, #9762)
[Link] (3 responses)
CPUs have been operating on _local_ data for about thirty years (once logic became faster than DRAM). This is why there are now dozens to hundreds of instructions in flight; they're mostly waiting on data to arrive.
Posted May 17, 2023 23:07 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
In other words, as I said, the time it takes an electrical signal (travelling at near light speed) to traverse a chip from one side to the other and back gives you a maximum clock speed of about 5GHz - roughly the maximum clock speed available today.
The stuff you mention is all the tricks to squeeze more FLOPS or whatever out of the same clock speed, which is what chip manufacturers have been doing for maybe 40 years - which is why cpu power has (mostly) been growing faster than clock speed, if we ignore the MegaHurtz wars, for the last 40 years if not more.
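For reference, the back-of-the-envelope arithmetic behind that 5GHz figure, assuming a roughly 3cm die and propagation close to c (real on-chip signals are appreciably slower than that):

$$
t_{\text{round trip}} \approx \frac{2 \times 0.03\,\text{m}}{3 \times 10^{8}\,\text{m/s}} = 200\,\text{ps}
\qquad\Rightarrow\qquad
f_{\max} = \frac{1}{200\,\text{ps}} = 5\,\text{GHz}
$$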
Cheers,
Wol
Posted May 17, 2023 23:24 UTC (Wed)
by willy (subscriber, #9762)
[Link] (1 responses)
As a point of reference, cores on an Intel Haswell are connected by a ring bus. Each core contains a slice of the L3 cache. Intel had the choice between making the stops on the ring bus equidistant (2 clocks per hop) or alternately 1 and 3 clocks apart. They chose the latter. Note that the ring bus clocks are not the same as the ALU clock speed.
You really need to update your mental model of how a CPU works. We're a long way from the 6502. Try https://en.wikichip.org/wiki/intel/microarchitectures/sky...
Posted May 18, 2023 6:32 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Posted May 18, 2023 21:07 UTC (Thu)
by zev (subscriber, #88455)
[Link] (2 responses)
Genuine question (I've got no particular horse in this race): is there some critical difference between off-chip switches and on-chip ones? Since by my understanding there are already plenty of the latter between CPU cores and DRAM...
Posted May 18, 2023 21:13 UTC (Thu)
by willy (subscriber, #9762)
[Link] (1 responses)
A switch adds noticeable latency -- tens to hundreds of nanoseconds -- to a DRAM access. Not to mention that there can be congestion at the switch adding yet more latency. And CXL switches are not simple animals; they have _firmware_.
Posted May 18, 2023 22:17 UTC (Thu)
by zev (subscriber, #88455)
[Link]
Or more generally just the on-chip networks (of whatever topology) implementing the cache hierarchy/coherence and such on big multi-core chips. Though I suppose simply being on-chip just allows them to operate with a much smaller latency penalty?
> And CXL switches are not simple animals; they have _firmware_.
If discussion of D-DIMMs I've seen (https://www.devever.net/~hl/omi) is to be believed, DRAM itself might too in the not-too-distant future...
Posted May 22, 2023 18:56 UTC (Mon)
by Paf (subscriber, #91811)
[Link]
Posted May 17, 2023 8:01 UTC (Wed)
by joib (subscriber, #8541)
[Link]
I think it's more due to A) as the number of cores and amount of memory per socket has grown, fewer and fewer people need these really large shared-memory machines, thus less return on investment in R&D for coherency fabrics (and chip design costs increasing exponentially sure doesn't help either), and B) MPI has been around for 30 years, so there has been lots of time for application writers to rewrite their apps (or start new ones from scratch) to use MPI and be usable on distributed-memory clusters, which nowadays are practically 100% of available HPC resources.
As for CXL, note it's not really about creating symmetric coherent memory fabrics like you'd want for connecting together a lot of CPUs. CXL is (or at least was when I looked into it a while ago) based on a master-slave model. It's more for things like providing coherency between main memory and an accelerator like a GPU (such things have been done before, like NVLink and CAPI; ostensibly CXL is a vendor-independent standard, though in practice AFAIU largely driven by Intel). I think people are also looking into using CXL for attaching chunks of memory, in order to give some flexibility in allocating memory between hosts (though surely the latency would be terrible compared to "normal" directly attached memory).
Posted May 16, 2023 13:20 UTC (Tue)
by tlamp (subscriber, #108540)
[Link]
Hmm, why would that defeat the purpose? One probably wouldn't map this 1:1 to pre- and post-copies, but it would rather be one (or more) copy stage(s) with a dirty-bitmap setup, and then call `migrate_pages()` on the remaining pages and do the handover. Sure, in the worst case (high page churn by the guest) the result might be the same, but for guests that aren't producing such patterns one would get away with a much smaller CXL-shared memory pool.
It would work somewhat like combining QEMU's existing pre- and post-copy migration, which is already possible today: copy over most of the memory, then hand the VM over to the target host earlier, and the remaining RAM is page-faulted into the destination from the source host over time. Using CXL and shared memory for that last phase should be more performant and add no extra risk (CXL is probably even less likely to fail than an Ethernet network).
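In outline, such a hybrid scheme might look like the sketch below. The helper names are hypothetical placeholders — nothing here is QEMU or kernel API — it only shows the control flow being described.

```c
/*
 * Outline only, with hypothetical placeholder functions: network
 * pre-copy with dirty tracking, then migrate just the residue into the
 * CXL pool and hand over.
 */
#include <stddef.h>
#include <stdio.h>

#define MAX_PRECOPY_ROUNDS 8
#define DIRTY_THRESHOLD    1024	/* pages; an assumed tuning knob */
#define CXL_NODE           2	/* assumed CXL-backed NUMA node */

/* Placeholders standing in for hypervisor machinery. */
static void   precopy_dirty_pages_over_network(void) { }
static size_t dirty_pages_remaining(void) { return 0; }
static void   migrate_remaining_pages_to_cxl_node(int node) { (void)node; }
static void   hand_over_vcpus_to_destination(void) { }

int main(void)
{
	/* Phase 1: conventional iterative pre-copy over the network,
	 * shrinking the dirty set while the guest keeps running. */
	for (int round = 0; round < MAX_PRECOPY_ROUNDS; round++) {
		precopy_dirty_pages_over_network();
		if (dirty_pages_remaining() <= DIRTY_THRESHOLD)
			break;
	}

	/* Phase 2: instead of pausing the guest for a final copy, push
	 * only the still-dirty pages into CXL shared memory. */
	migrate_remaining_pages_to_cxl_node(CXL_NODE);

	/* Phase 3: the destination host, which maps the same CXL pool,
	 * resumes the vCPUs almost immediately. */
	hand_over_vcpus_to_destination();

	printf("handover complete\n");
	return 0;
}
```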
Posted May 18, 2023 3:15 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link]
Could CXL be used as a really really fast network for any application needing a high-performance interconnect?