
Guest-first memory for KVM

By Jonathan Corbet
November 2, 2023
One of the core objectives of any confidential-computing implementation is to protect a guest system's memory from access by actors outside of the guest itself. The host computer and hypervisor are part of the group that is to be excluded from such access; indeed, they are often seen as threats in their own right. Hardware vendors have added features like memory encryption to make memory inaccessible to the host, but such features can be difficult to use and are not available on all CPUs, so there is ongoing interest in software-only solutions that can improve confidentiality. The guest-first memory patch set, posted by Sean Christopherson and containing work by several developers, looks poised to bring some software-based protection to an upcoming kernel release.

Protecting memory from the host in the absence of encryption tends to rely on address-space isolation — arranging things so that the host has no path by which to access a guest's memory. The protection in this case is less complete — an overtly hostile host kernel can undo it — but it can be effective against many host-side exploits. Back in 2020, the KVM protected memory work created a new hypercall with which a guest could request that the host unmap a range of memory in use by that guest; that would render the host system (at both the kernel and user-space levels) unable to access that memory. That work ran into a number of problems, though, and never found its way into the mainline.

The guest-first-memory work takes a similar approach, but it moves the control to the host and reduces the available protection. Specifically, it adds a new KVM ioctl() command, called KVM_CREATE_GUEST_MEMFD, that takes a size in bytes as a parameter and returns a new file descriptor. The operation is similar to memfd_create(), in that the returned descriptor refers to an anonymous file, with the requested size, that lives entirely in memory. The differences are that this memfd is tied to the virtual machine for which it was created, and it cannot be mapped into user space on the host (or into any other virtual machine). This memory can be mapped into the guest's "physical" address space, though, with a variant on the usual KVM memory-management operations.
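
For the curious, host-side use of the new interface might look something like the sketch below. It is written against the guest_memfd UAPI as it appears in recent revisions of the series (KVM_CREATE_GUEST_MEMFD plus the extended KVM_SET_USER_MEMORY_REGION2 operation); structure and flag names could still change before the work lands, so treat it as illustrative rather than authoritative.

    /*
     * Sketch: create guest-first memory and bind it into a guest's
     * "physical" address space.  Names follow the UAPI proposed in
     * the series and may change before this work is merged.
     */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int map_guest_first_memory(int vm_fd, uint64_t size, uint64_t gpa)
    {
            /* An anonymous, in-memory file of the requested size that is
             * tied to this VM and cannot be mmap()ed on the host. */
            struct kvm_create_guest_memfd gmem = { .size = size };
            int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

            if (gmem_fd < 0)
                    return gmem_fd;

            /* Bind the memfd to a range of guest-physical addresses with
             * the extended memory-region operation; note that no host
             * user-space mapping is supplied for this memory. */
            struct kvm_userspace_memory_region2 region = {
                    .slot = 0,
                    .flags = KVM_MEM_GUEST_MEMFD,
                    .guest_phys_addr = gpa,
                    .memory_size = size,
                    .guest_memfd = gmem_fd,
                    .guest_memfd_offset = 0,
            };
            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
    }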

With this operation, the hypervisor can allocate memory resources for a guest without being able to access that memory itself. That protects the guest from having its memory contents disclosed or modified, either by accident or by malicious behavior on the part of a (possibly compromised) hypervisor. Unlike some previous attempts (including KVM protected memory), this operation does not take the affected memory out of the host kernel's direct memory map. Thus, while a guest using this memory is protected from user-space threats on the host, it could still be attacked by a compromised kernel. The bar to a successful attack has been raised significantly, but the protection is not total.

There are a number of advantages to using guest-first memory, according to the patch description. Currently, KVM does not allow guests to have a higher level of access to memory than the hypervisor does; if memory is to be mapped writable in the guest, it must be both mapped and writable in the hypervisor as well, even if the hypervisor has no need to be able to write that memory. Guest-first memory, by dispensing with the hypervisor mapping entirely, clearly gets around that problem.

Guest-first memory can also be useful in the presence of hardware-based memory encryption. Encrypted memory is already protected from access by the hypervisor; should the hypervisor attempt such access anyway, the CPU will generate a trap, which is likely to lead to the hypervisor's demise. If that memory is not mapped into the hypervisor to begin with, though, it cannot be touched by accident. Unmappable memory can also be useful for the development and debugging of hypervisors meant to work with hardware-based confidential-computing features, even on hardware lacking those features.

Longer term, this feature may also be useful for the management of dedicated memory pools; a guest memfd could be set up on the pool without the need for access from the host. It could, perhaps, allow memory for guest systems to be managed (on the host) without using struct page at all, reducing overhead on the host and increasing the isolation of that memory. Also with an eye on the longer term, this patch series creates a more general concept of a "protected virtual machine" that is intended to be a container for confidential-computing mechanisms within KVM.

Meanwhile, though, guest-first memory has the downside that it cannot be migrated, meaning that host-side memory-management processes (such as compaction) will have to work around it. This limitation was seen as a significant problem when KVM protected memory was under discussion, but it has not been addressed in this series and will not be, "at least not in the foreseeable future".

Even so, Paolo Bonzini (the KVM maintainer) has let it be known that he plans to apply this series after the 6.7 merge window with the idea of getting it into linux-next and, later, pushing it upstream for the 6.8 kernel release. He also said that he intends to apply the series to a future RHEL kernel, meaning that guest-first memory will show up in an RHEL release at some point in the (presumably not too distant) future. That is still unlikely to happen, though, before guest-first memory has landed in the mainline and the API has settled down.

Some settling may be required; this is a 35-part patch series adding nearly 3,000 lines of code, so it would not be surprising if, even after 13 revisions, there were some adjustments needed. Still, it looks like progress is being made on a multi-year effort to increase the amount of address-space isolation afforded to guest systems. With luck, users of shared cloud systems (of whom there are a lot) will all benefit from this sort of hardening.

Index entries for this article
Kernel: Confidential computing
Kernel: Memory management/Address-space isolation
Kernel: Releases/6.8



Guest-first memory for KVM

Posted Nov 2, 2023 17:15 UTC (Thu) by pbonzini (subscriber, #60935)

> The differences are that this memfd is tied to the virtual machine for which it was created, and it cannot be mapped into user space on the host (or into any other virtual machine). This memory can be mapped into the guest's "physical" address space, though, with a variant on the usual KVM memory-management operations.

Just to complicate things further, guestmemfd _files_ are tied to the VM, but there are plans to allow moving the _inode_ from one VM to another (with only one of the files being usable at a given time). This is needed for in-place upgrade of the userspace process that runs the VM.

Also, protected KVM will probably want to mmap() parts of the memory into the host virtual address space. However, most of the memory will be private to the VM and will cause a SIGSEGV if accessed from the host.

Guest-first memory for KVM

Posted Nov 3, 2023 12:15 UTC (Fri) by spacefrogg (subscriber, #119608) (17 responses)

I have never understood what actual attack scenario these kinds of measures are meant to defend against.

When I, as a tenant, want to run code in a rented guest, I still have to trust the host system. There is simply no way to deliver the secret key into the guest system while making sure the host is unable to spy on it or manipulate it. And the guest will need a secret key, or there is no encrypted guest memory to begin with.

Is this about plausible deniability for the hoster against law enforcement? That also only goes so far; several jurisdictions require that a hoster provide lawful-interception interfaces.

Guest-first memory for KVM

Posted Nov 3, 2023 12:25 UTC (Fri) by danpb (subscriber, #4831) (14 responses)

> When I, as a tenant, want to run code in a rented guest, I still have to trust the host system. There is simply no way to deliver the secret key into the guest system while making sure the host is unable to spy on it or manipulate it. And the guest will need a secret key, or there is no encrypted guest memory to begin with.

Confidential computing provides a way to boot the guest with encrypted memory, validate the integrity of that guest, and then securely release secrets to the guest without the host having any ability to spy on them. This greatly reduces the amount of trust that needs to be placed in the host and/or the cloud in general.

Guest-first memory for KVM

Posted Nov 3, 2023 13:41 UTC (Fri) by bluca (subscriber, #118303) (5 responses)

It will also be possible to encrypt a VM against the public key of the TPM of a specific instance, which the cloud vendor won't have access to, so that even the provisioning process can be protected.

Guest-first memory for KVM

Posted Nov 3, 2023 14:10 UTC (Fri) by spacefrogg (subscriber, #119608) (4 responses)

That's complete nonsense; I am sorry for being so direct. A tenant has no way of verifying that the hoster actually provided the public key of the TPM; for that, he would need physical access to the host. He still needs to trust the hoster completely. The easiest counter-argument is that qemu provides a soft-TPM for testing purposes. Tell me how the guest (or tenant) knows it is running against a hardware TPM rather than that.

Guest-first memory for KVM

Posted Nov 3, 2023 14:20 UTC (Fri) by farnz (subscriber, #17727) (3 responses)

Fundamentally, because the soft TPM will not have the expected public key.

The hoster does not provide the public key to the tenant; the TPM vendor provides the public key to all and sundry, usually via a CA type system that verifies that the public key you're being offered is one from the TPM vendor (not the hoster). The TPM vendor also guarantees that the TPM, if installed correctly, will produce certain hashes of data that goes between the TPM and CPU during operation; if the hoster changes the hardware so that different data goes between the TPM and CPU, those hashes change.

From this, we can build a setup where the hoster does not need to be trusted, but the TPM vendor and CPU vendor do. The CPU vendor guarantees that there will be certain exchanges between the TPM and the CPU, such that if it's a real TPM and not a soft TPM, the hashes the TPM will sign and disclose can be predicted. As the tenant, you can take those signed hashes, validate that the TPM's public key is signed by the TPM vendor's CA (so isn't compromised by the hoster), and then send a secret, encrypted using the TPM's public key, if, and only if, the hashes match what you expect from the platform.
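
Sketched as code, the check described above might look like the following; every type and helper here is a hypothetical stand-in for a real TPM/attestation library, included only to make the flow concrete:

    #include <stdbool.h>

    /* Hypothetical opaque types and library calls (not a real API). */
    struct tpm_quote;       /* signed hashes ("quote") from the TPM */
    struct tpm_cert;        /* the TPM's public key and certificate */
    struct pcr_values;      /* the hashes the tenant expects to see */

    bool cert_chains_to_tpm_vendor_ca(const struct tpm_cert *cert);
    bool quote_signed_by(const struct tpm_quote *q, const struct tpm_cert *c);
    bool hashes_match(const struct tpm_quote *q, const struct pcr_values *want);
    void send_secret_wrapped_to(const struct tpm_cert *cert);

    static bool maybe_release_secret(const struct tpm_quote *quote,
                                     const struct tpm_cert *cert,
                                     const struct pcr_values *expected)
    {
            /* The TPM's key must chain to the TPM vendor's CA, not to
             * anything the hoster controls. */
            if (!cert_chains_to_tpm_vendor_ca(cert))
                    return false;
            /* The hashes must really have been signed by that TPM. */
            if (!quote_signed_by(quote, cert))
                    return false;
            /* The measured state must match what the tenant expects. */
            if (!hashes_match(quote, expected))
                    return false;
            /* Only now encrypt the secret to the TPM's public key, so
             * that only the TPM's private key can unwrap it. */
            send_secret_wrapped_to(cert);
            return true;
    }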

If you want the gory details, look up how Intel TDX and AMD SEV work; they're not perfect (implementations are buggy), but the mathematics of the verification side work out, and reduce the trust domain from "hoster + CPU maker + TPM maker" to "CPU maker + TPM maker".

Guest-first memory for KVM

Posted Nov 5, 2023 4:03 UTC (Sun) by ssmith32 (subscriber, #72404) (2 responses)

Unless you're expecting a live human person to validate all these things, a malicious host can just re-write the memory of the guest OS and alter what it expects. Or just read the public key and expected hashes out of the guest's memory.

An analogy: when running malware that attempted to check whether it was running in a VM, I would just open it up in a disassembler (perhaps after going through some unpacking steps) and patch the check for a VM to return false. There's no reason a host can't do this to a guest. There's some kind of integrity check on the code, you say? Well, patch that out too.

The reality is you *have* to trust the host at some point.

What this helps with is not a fully Byzantine, malicious-failure scenario but, rather, a host that is trusted at startup and cannot be trusted to remain so afterward.

That is, it helps with a host that may eventually be compromised. As far as helping with a fully malicious host, this does absolutely nothing.

Guest-first memory for KVM

Posted Nov 5, 2023 5:30 UTC (Sun) by mjg59 (subscriber, #23239)

Attestation to the guest would fail for the reasons you describe. That's why this generally involves remote attestation - the guest proves to something else that it's in a trustworthy state. So you have some infrastructure that *you* own that can verify the attestations from all your guest instances running in the cloud, which allows you to prove at least a subset of the state of that guest. Once you move onto something like SEV the attestation is coming from the CPU itself, at which point you have (in theory) entirely decoupled yourself from the cloud provider. And, well, if you can't trust the thing that's executing your instructions, it feels pretty difficult to say that anything is trustworthy (which is, of course, an entirely valid position!)

Guest-first memory for KVM

Posted Nov 5, 2023 10:20 UTC (Sun) by farnz (subscriber, #17727)

Note that a CPU that's designed for this also prevents the host from reading or writing the memory of the guest OS without the guest's co-operation (memory encryption). You still have to trust the CPU and the TPM, but that's better than having to trust the entire hosting platform.

And yes, the attestation has to be to something outside the guest; but the idea is that I can stand up a microserver in my office, which I can trust because I have full control of it, and then not have to trust the hosting company.

Guest-first memory for KVM

Posted Nov 3, 2023 14:15 UTC (Fri) by spacefrogg (subscriber, #119608) (7 responses)

Tell me how, please, I want to believe! Yet nobody has been able to show me how to securely deliver the initial secrets. As it stands, you are merely citing a deus ex machina. I'd like to see it just once.

Telling me that confidential computing works by confidential booting tells me nothing, especially when I have to suspect that confidential booting needs confidential computing itself in order to be set up securely, in which case we have come full circle in our argument.

Guest-first memory for KVM

Posted Nov 3, 2023 23:57 UTC (Fri) by stefanha (subscriber, #55072) (6 responses)

Look into remote attestation. The guest cannot establish trust itself (for the reasons you described), but it's possible to establish trust remotely using an entity not under the hoster's control.

I don't know the details, but my understanding is that remote attestation works because only the TPM/CPU vendor has the private key underpinning the attestation report. Therefore the hoster cannot forge the attestation report.

This allows a remote entity to conclude that the environment hasn't been tampered with. Together with the CPU protections (memory encryption, virtualization instruction set for confidential computing, etc) that stop the hypervisor or host kernel from accessing the guest, it becomes possible to trust that the host does not have access to the guest...until design or implementation flaws break the whole thing again.

Guest-first memory for KVM

Posted Nov 5, 2023 4:07 UTC (Sun) by ssmith32 (subscriber, #72404) (5 responses)

Given that every interaction the guest has with any input is mediated by the host, and a fully malicious host has the ability to rewrite any code or memory in the guest, remote attestation of any type is just another hoop a malicious host has to jump through to trick the guest.

Guest-first memory for KVM

Posted Nov 5, 2023 5:39 UTC (Sun) by mjg59 (subscriber, #23239) (3 responses)

There are a few layers to this. Yes, a hostile hypervisor can either modify the state of a guest or misconfigure security state such that another piece of hardware can do so. But think of the attack surface here: there's a lot more to a cloud environment than just the hypervisor. The entire control plane is full of things that have security implications (you say "boot this image", and the control plane actually boots a backdoored version instead). And there are many more people with the ability to commit code to a public cloud provider's control plane than there are who can modify the hypervisor. Measured boot gets you to the point where you don't have to trust the control plane any more, and that's already a large reduction in attack surface.

But sure, what if the hypervisor *is* hostile? That's what things like AMD's SEV and Intel's TDX are meant to achieve. The CPU knows whether it's executing in the context of the hypervisor or a guest. So, when a customer requests it, the hypervisor asks the CPU to generate a memory encryption key, and the CPU gives an opaque encrypted version of this key to the hypervisor. Whenever control is passed to the guest, the hypervisor gives the CPU that key, and the memory controller transparently encrypts or decrypts all memory accesses. Any event that passes control back to the hypervisor will trigger a context switch that the CPU is aware of, and so it stops doing that. The hypervisor has no way to view guest memory, and no way to modify it in any sort of predictable way. And since in this case the initial remote attestation comes from the CPU, it can prove that the guest is in this state and the hypervisor has no way to interfere with that.

Obviously we're still trusting the CPU, so if, when you say "a fully malicious host", you're taking into account the possibility that the CPU vendor backdoored the chip before shipping it, then yes, that could still be an issue. But before you were trusting your CPU vendor *and* your cloud vendor, and now you only need to trust your CPU vendor, so it's still an improvement.

Guest-first memory for KVM

Posted Nov 5, 2023 19:49 UTC (Sun) by DemiMarie (subscriber, #164188)

You still need to trust that the hosting entity is not physically tampering with the CPU, as neither SEV-SNP nor TDX include physical tamper protection. There is even a glitch attack that allowed completely compromising SEV-SNP by taking over the AMD Secure Processor.

Guest-first memory for KVM

Posted Nov 6, 2023 16:07 UTC (Mon) by taladar (subscriber, #68407) (1 response)

Sounds to me like all you would need is a stolen key at the TPM vendor to break that whole system permanently, since there is no way to get from a compromised key back to a secure state, so the TPM likely has no way to update the key.

Guest-first memory for KVM

Posted Nov 6, 2023 17:17 UTC (Mon) by mjg59 (subscriber, #23239)

The TPM vendor never sees any private keys generated on the TPM; they just provide certificates for them (same as how CA vendors never see client private keys). Those certificates can be revoked, and they are stored in NVRAM, so they can also (in theory) be updated.

Guest-first memory for KVM

Posted Nov 6, 2023 1:26 UTC (Mon) by stefanha (subscriber, #55072)

> a fully malicious host has the ability to rewrite any code or memory in the guest

This is incorrect. See the "CPU protections (memory encryption, virtualization instruction set for confidential computing, etc) that stop the hypervisor or host kernel from accessing the guest" that I mentioned. mjg59's paragraph about AMD SEV and Intel TDX describes them in more detail.

That said, confidential computing isn't a silver bullet so I think you're right that it's another hoop that a malicious host needs to jump through. There will still be security holes in implementations and sometimes in the specific design for a given system.

Guest-first memory for KVM

Posted Nov 3, 2023 22:19 UTC (Fri) by neggles (subscriber, #153254) (1 response)

It's about protecting the guest from a *compromised* host; i.e., in the event of another guest using a hypervisor-escape exploit, other guests would still be safe.

Guest-first memory for KVM

Posted Nov 5, 2023 4:08 UTC (Sun) by ssmith32 (subscriber, #72404)

This, exactly. Compromised after a trusted boot.

But a fully malicious host? Yeah, this isn't going to do anything.

Guest-first memory for KVM

Posted Nov 3, 2023 13:41 UTC (Fri) by bluca (subscriber, #118303)

This is great, especially for being able to implement these workflows without letting the CPU manufacturers run the show.

Guest-first memory for KVM

Posted Dec 25, 2023 6:16 UTC (Mon) by wenqian (guest, #162897)

“That work ran into a number of problems, though, and never found its way into the mainline.”
Has memfd_secret() not been merged into the kernel? Why do I find that it exists in v6.7-rc6? Refer to https://git.kernel.org/pub/scm/linux/kernel/git/stable/li...


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds