Live migration of virtual machines over CXL
Traditional migration, Stancevic began, is a four-step process. As much of the virtual machine as possible is pre-copied to its new home while that machine continues to run, after which the virtual machine is quiesced — stopped so that its memory no longer changes. A second copy operation is done to update anything that may have changed during the pre-copy phase; then, finally, the moved machine is relaunched in its new home. Stancevic's goal is to create a "nil-migration" scheme that takes much of the work — and the need to quiesce the target machine — out of the picture.
Specifically, this scenario is meant to work in situations where both
physical hosts have access to the same pool of CXL shared memory. In such
a setting, migrate_pages()
can be used to move the virtual machine's pages to the shared-memory pool
without disturbing the operation of the machine itself; at worst, its
memory accesses slow down slightly. Once the memory
migration is complete, the virtual machine can be quickly handed over to
the new host, which also has access to that memory; the machine should be
able to begin executing there almost immediately. The goal is to make
virtual-machine migration as fast as task switching on a single host — an
action that could happen transparently between that machine's normal time
slices. Eventually, the new host could migrate the virtual machine's
memory into its own directly attached RAM.
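As a concrete, if simplified, illustration of that first step, the sketch below uses libnuma's wrapper around the migrate_pages() system call to push a running guest's pages onto a NUMA node assumed to be backed by the CXL shared-memory pool. It is an illustration rather than code from the talk; the target node number is just supplied by the operator here.

```c
/*
 * A minimal sketch, not code from the talk: move a running process's
 * pages onto a NUMA node assumed to be backed by the CXL shared pool.
 *
 * Build with: gcc -o push-to-cxl push-to-cxl.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <target-node>\n", argv[0]);
		return 1;
	}

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	int pid = atoi(argv[1]);
	int node = atoi(argv[2]);

	/* Source mask: any node the guest's pages might currently be on. */
	struct bitmask *from = numa_allocate_nodemask();
	copy_bitmask_to_bitmask(numa_all_nodes_ptr, from);

	/* Destination mask: the (assumed) CXL-backed node. */
	struct bitmask *to = numa_allocate_nodemask();
	numa_bitmask_setbit(to, node);

	/*
	 * Ask the kernel to move the pages; the process keeps running the
	 * whole time, which is the point of the nil-migration idea.  The
	 * return value is the number of pages that could not be moved.
	 */
	int left = numa_migrate_pages(pid, from, to);
	if (left < 0)
		perror("numa_migrate_pages");
	else if (left > 0)
		printf("%d pages could not be moved\n", left);

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return left ? 1 : 0;
}
```

On the other side of the move, the destination host could run the mirror-image operation, pulling the pages from the CXL node into its own directly attached RAM as described above.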
This work, he said, is still in an early stage. It does, however, have a mailing list and a web site at nil-migration.org.
David Hildenbrand asked about pass-through devices — devices on the host computer that are made directly available to a virtual machine. Those, he said, cannot be migrated through CXL memory, or in any other way, for that matter. Stancevic agreed that such configurations simply would not work. Dan Williams asked whether migration through CXL memory was really necessary or if, instead, virtual machines could just live in CXL memory all the time. In Stancevic's use case, CXL shared-memory pools are only used for virtual-machine migration, but other configurations are possible.
Another audience member asked whether it would be possible to do a pre-copy phase over the network first, and only use CXL for any remaining pages just before the move takes place. Stancevic answered that it could work, but would defeat the purpose of keeping the virtual machine running at all times.
Yet another attendee pointed out that CXL memory may not be mapped at the same location on each physical host, and wondered how the nil-migration scheme handles that. The answer is that, so far, this scheme has only been tested with QEMU-emulated hardware (CXL 3.0 hardware won't be available for a while yet), and it is easy to make the mappings match in that environment. This will be a problem when real hardware arrives, though, and a solution has not yet been worked out.
The final question, from Williams, was whether the nil-migration system
would need new APIs to identify the available CXL devices. Stancevic
answered that having CXL resources show up as NUMA nodes is
the best solution, but that it would be good to have some metadata show up
in sysfs to help with figuring out the paths between the hosts.
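No such metadata exists yet, but the NUMA-node view is already visible from user space: CXL memory expanders typically surface as memory-only (CPU-less) nodes. The short sketch below — an assumption about how one might probe this today, not an interface discussed in the session — walks the standard sysfs node directory and flags nodes with an empty cpulist.

```c
/*
 * A small illustration (assumed, not from the session): list NUMA nodes
 * and flag the CPU-less ones, which is how memory-only (e.g. CXL-backed)
 * nodes usually appear.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	DIR *dir = opendir("/sys/devices/system/node");
	if (!dir) {
		perror("opendir");
		return 1;
	}

	struct dirent *de;
	while ((de = readdir(dir)) != NULL) {
		unsigned int node;
		if (sscanf(de->d_name, "node%u", &node) != 1)
			continue;

		char path[128];
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%u/cpulist", node);

		FILE *f = fopen(path, "r");
		if (!f)
			continue;

		char cpulist[256] = "";
		if (!fgets(cpulist, sizeof(cpulist), f))
			cpulist[0] = '\0';
		fclose(f);
		cpulist[strcspn(cpulist, "\n")] = '\0';

		/* An empty cpulist marks a memory-only node. */
		printf("node%u: %s\n", node,
		       cpulist[0] ? cpulist : "memory-only (possibly CXL)");
	}

	closedir(dir);
	return 0;
}
```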
| Index entries for this article | |
|---|---|
| Kernel | Compute Express Link (CXL) |
| Kernel | Virtualization |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2023 |
Posted May 15, 2023 17:47 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (20 responses)
So, CXL has been possible for a long time. It’s going to have huge performance limitations, but also some real use cases. My question is, if anyone knows, why now? What forces are driving it into being now and not 10 or 15 years ago?
Posted May 15, 2023 18:09 UTC (Mon)
by jbowen (subscriber, #113501)
[Link]
Posted May 15, 2023 18:47 UTC (Mon)
by MattBBaker (guest, #28651)
[Link]
I've just been assuming it's been patent related. The Cray T3E came out in 1995 and with 20 year patents that means that anything patented in that machine would expire in 2015. The Gen-Z consortium went public in 2016. I don't have any proof of this, but the coincidences really line up nicely.
Posted May 15, 2023 19:18 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
But the deeper idea goes beyond cloud. The real point of a technology like this is to make reliability a fungible asset that you can buy and sell. If you want more nines of uptime, you can buy them, at the cost of either more expensive hardware or reduced performance (or both). In practice, you can offload that cost to your cloud provider and let them deal with the economies of scale (or you can deal with it yourself, if you really want to).
CXL is just one way of doing it, of course. There's also the option of explicitly tracking and accounting for disruptions, as k8s does.[1] This is cheaper and arguably a superior model overall (you figure out how many nines you want, your cloud provider figures out how to do that and how much they want to charge you for it, money changes hands, the end). The thing about VMs is, from the cloud provider's perspective, a VM is a black box. If we shut down a customer's VM, we have to reboot it, and that's (potentially, for some workloads) much more noticeable than "one of our k8s pods was briefly unavailable during a maintenance event" (because k8s explicitly contemplates that as a possibility, and users are encouraged to design around it). So we really would prefer to avoid disrupting the VM if at all possible, and the only way to do that is some sort of live migration technology. CXL is then about being able to tell the consumer "we slowed down your VM" instead of "we paused your VM."
Disclaimer: My employer (Google) is a large cloud provider, and some of my work is related to GCP's backend (but GCP's backend is very large and I only work on a tiny piece of it).
[1]: https://kubernetes.io/docs/concepts/workloads/pods/disrup...
Posted May 16, 2023 15:24 UTC (Tue)
by Paf (subscriber, #91811)
[Link]
Posted May 17, 2023 5:34 UTC (Wed)
by nilsmeyer (guest, #122604)
[Link] (2 responses)
Posted May 17, 2023 6:57 UTC (Wed)
by atnot (guest, #124910)
[Link]
Posted May 18, 2023 23:57 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted May 15, 2023 20:48 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (11 responses)
Part of what's going on is that sizing memory to CPU correctly has turned out to be a really tough challenge - you can't just overprovision memory by "enough", and accept waste, because the cost (not just capital cost, but power as well) is too high. And it turns out that you can implement the shared coherency protocols in such a way that if there's no sharing, the cost of coherency is zero - you only pay the coherency cost when two or more nodes on the CXL network are trying to access the same memory node.
This is attractive to big providers - instead of having to provision some hosts with 8 GiB RAM per CPU core to allow for workloads that need that much RAM, and others with 512 MiB per CPU core for "normal" workloads, you can provision all hosts with 512 MiB per CPU core, and have CXL-attached RAM that allows you to add extra RAM per CPU core for RAM intensive workloads. In turn, this means that all your hosts are "normal" size hosts, and the CXL-attached RAM is given to a host that's running a RAM intensive workload, which can access it at an affordable penalty, and in theory reduces your total costs, because you don't have your spare high-RAM hosts wasting RAM running normal workloads.
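To put rough numbers on that, using the figures from the comment (the 64-core host is an assumption for illustration only):

$$
64 \times 512\,\text{MiB} = 32\,\text{GiB local},
\qquad
64 \times 8\,\text{GiB} = 512\,\text{GiB needed},
\qquad
512 - 32 = 480\,\text{GiB from the CXL pool}
$$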
Posted May 16, 2023 15:30 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (10 responses)
It's fascinating to me that that the latency cost of this (because even if you're not paying inter-node coherency related costs, the RAM is still physically distant) is acceptable for those high RAM apps, but I suppose even RAM 'over there' is a lot faster than any (realistic, cheap, available - sorry pmem) storage technology. That and my bias is towards compute jobs, for which memory latency is an absolute killer. Most memory intensive jobs are probably more akin to databases (or just literally are databases) handling transactions that may well be user facing, and so your latency window is way higher. (Relative to a compute job trying to get to the next step of computation)
Posted May 17, 2023 14:41 UTC (Wed)
by willy (subscriber, #9762)
[Link] (9 responses)
Posted May 17, 2023 15:42 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (4 responses)
Okay, chips are still getting more powerful, there's ways round this (at least, as an outsider, that's how it appears to me). But the days of speeding the clock up are over. Unless, of course, you can simultaneously shrink the size and increase the efficiency of the heat sink ...
Cheers,
Wol
Posted May 17, 2023 19:36 UTC (Wed)
by willy (subscriber, #9762)
[Link] (3 responses)
CPUs have been operating on _local_ data for about thirty years (once logic became faster than DRAM). This is why there are now dozens to hundreds of instructions in flight; they're mostly waiting on data to arrive.
Posted May 17, 2023 23:07 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
In other words, as I said, the time it takes an electrical signal (travelling at near light speed) to traverse a chip from one side to the other and back gives you a maximum clock speed of about 5GHz - roughly the maximum clock speed available today.
The stuff you mention is all the tricks to squeeze more FLOPS or whatever out of the same clock speed, which is what chip manufacturers have been doing for maybe 40 years - which is why cpu power has (mostly) been growing faster than clock speed, if we ignore the MegaHurtz wars, for the last 40 years if not more.
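For reference, the back-of-the-envelope arithmetic behind that 5GHz figure, assuming a roughly 3cm die and propagation close to c (real on-chip signals are appreciably slower than that):

$$
t_{\text{round trip}} \approx \frac{2 \times 0.03\,\text{m}}{3 \times 10^{8}\,\text{m/s}} = 200\,\text{ps}
\qquad\Rightarrow\qquad
f_{\max} = \frac{1}{200\,\text{ps}} = 5\,\text{GHz}
$$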
Cheers,
Wol
Posted May 17, 2023 23:24 UTC (Wed)
by willy (subscriber, #9762)
[Link] (1 responses)
As a point of reference, cores on an Intel Haswell are connected by a ring bus. Each core contains a slice of the L3 cache. Intel had the choice between making the stops on the ring bus equidistant (2 clocks per hop) or alternately 1 and 3 clocks apart. They chose the latter. Note that the ring bus clocks are not the same as the ALU clock speed.
You really need to update your mental model of how a CPU works. We're a long way from the 6502. Try https://en.wikichip.org/wiki/intel/microarchitectures/sky...
Posted May 18, 2023 6:32 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Posted May 18, 2023 21:07 UTC (Thu)
by zev (subscriber, #88455)
[Link] (2 responses)
Genuine question (I've got no particular horse in this race): is there some critical difference between off-chip switches and on-chip ones? Since by my understanding there are already plenty of the latter between CPU cores and DRAM...
Posted May 18, 2023 21:13 UTC (Thu)
by willy (subscriber, #9762)
[Link] (1 responses)
A switch adds noticeable latency -- tens to hundreds of nanoseconds -- to a DRAM access. Not to mention that there can be congestion at the switch adding yet more latency. And CXL switches are not simple animals; they have _firmware_.
Posted May 18, 2023 22:17 UTC (Thu)
by zev (subscriber, #88455)
[Link]
Or more generally just the on-chip networks (of whatever topology) implementing the cache hierarchy/coherence and such on big multi-core chips. Though I suppose simply being on-chip just allows them to operate with a much smaller latency penalty?
> And CXL switches are not simple animals; they have _firmware_.
If discussion of D-DIMMs I've seen (https://www.devever.net/~hl/omi) is to be believed, DRAM itself might too in the not-too-distant future...
Posted May 22, 2023 18:56 UTC (Mon)
by Paf (subscriber, #91811)
[Link]
Posted May 17, 2023 8:01 UTC (Wed)
by joib (subscriber, #8541)
[Link]
I think it's more due to A) as the number of cores and amount of memory per socket has grown, fewer and fewer people need these really large shared-memory machines, thus less return on investment in R&D for coherency fabrics (and chip design costs increasing exponentially sure doesn't help either), and B) MPI has been around for 30 years, so there has been lots of time for application writers to rewrite their apps (or start new ones from scratch) to use MPI and be usable on distributed-memory clusters, which nowadays are practically 100% of available HPC resources.
As for CXL, note it's not really about creating symmetric coherent memory fabrics like you'd want for connecting together a lot of CPUs. CXL is (or at least was when I looked into it a while ago) based on a master-slave model. It's more for things like providing coherency between main memory and an accelerator like a GPU (such things have been done before, like NVLink and CAPI; ostensibly CXL is a vendor-independent standard, though in practice AFAIU largely driven by Intel). I think people are also looking into using CXL for attaching chunks of memory, in order to give some flexibility in allocating memory between hosts (though surely the latency would be terrible compared to "normal" directly attached memory).
Posted May 16, 2023 13:20 UTC (Tue)
by tlamp (subscriber, #108540)
[Link]
Hmm, why would that defeat the purpose? One probably wouldn't map this 1:1 to pre- and post-copies, but it would rather be one (or more) copy stage(s) with a dirty-bitmap setup, and then call `migrate_pages()` on the remaining pages and do the handover. Sure, in the worst case (high page churn by the guest) the result might be the same, but for guests that aren't producing such patterns one would get away with a much smaller CXL-shared memory pool.
It would work somewhat like combining QEMU's existing pre- and post-copy migration, which is already possible today: copy over most of the memory, then hand the VM over to the target host earlier, and the remaining RAM is page-faulted into the destination from the source host over time. Using CXL and shared memory for that last phase should be more performant and add no extra risk (CXL is probably even less likely to fail than an Ethernet network).
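In outline, such a hybrid scheme might look like the sketch below. The helper names are hypothetical placeholders — nothing here is QEMU or kernel API — it only shows the control flow being described.

```c
/*
 * Outline only, with hypothetical placeholder functions: network
 * pre-copy with dirty tracking, then migrate just the residue into the
 * CXL pool and hand over.
 */
#include <stddef.h>
#include <stdio.h>

#define MAX_PRECOPY_ROUNDS 8
#define DIRTY_THRESHOLD    1024	/* pages; an assumed tuning knob */
#define CXL_NODE           2	/* assumed CXL-backed NUMA node */

/* Placeholders standing in for hypervisor machinery. */
static void   precopy_dirty_pages_over_network(void) { }
static size_t dirty_pages_remaining(void) { return 0; }
static void   migrate_remaining_pages_to_cxl_node(int node) { (void)node; }
static void   hand_over_vcpus_to_destination(void) { }

int main(void)
{
	/* Phase 1: conventional iterative pre-copy over the network,
	 * shrinking the dirty set while the guest keeps running. */
	for (int round = 0; round < MAX_PRECOPY_ROUNDS; round++) {
		precopy_dirty_pages_over_network();
		if (dirty_pages_remaining() <= DIRTY_THRESHOLD)
			break;
	}

	/* Phase 2: instead of pausing the guest for a final copy, push
	 * only the still-dirty pages into CXL shared memory. */
	migrate_remaining_pages_to_cxl_node(CXL_NODE);

	/* Phase 3: the destination host, which maps the same CXL pool,
	 * resumes the vCPUs almost immediately. */
	hand_over_vcpus_to_destination();

	printf("handover complete\n");
	return 0;
}
```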
Posted May 18, 2023 3:15 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link]
Could CXL be used as a really really fast network for any application needing a high-performance interconnect?