
Private memory for KVM guests

By Jonathan Corbet
April 7, 2022
Cloud computing is a wonderful thing; it allows efficient use of computing systems and makes virtual machines instantly available at the click of a mouse or API call. But cloud computing can also be problematic; the security of virtual machines is dependent on the security of the host system. In most deployed systems, a host computer can dig through its guests' memory at will; users running guest systems have to just hope that doesn't happen. There are a number of solutions to that problem under development, including this KVM guest-private memory patch set by Chao Peng and others, but some open questions remain.

A KVM-based hypervisor runs as a user-space process on the host system. To provide a guest with memory, the hypervisor allocates that memory on the host, then uses various KVM ioctl() calls to map it into the guest's "physical" address space. But the hypervisor retains its mapping to the memory as well, with no constraints on how the memory can be accessed. Sometimes that access is necessary for communication between the guest and the hypervisor, but the guest would likely want to keep much of that memory to itself.
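
To make the contrast with the private-memory approach concrete, here is a minimal sketch of that conventional flow, using only the existing KVM API; the map_guest_ram() helper and the slot number are illustrative, and the vm_fd is assumed to have come from KVM_CREATE_VM:

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    /* Allocate ordinary host memory and map it into the guest's "physical"
       address space; the hypervisor keeps the ram pointer and can read or
       write the guest's memory at any time. */
    static int map_guest_ram(int vm_fd, uint64_t gpa, size_t size)
    {
        void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (ram == MAP_FAILED)
            return -1;

        struct kvm_userspace_memory_region region = {
            .slot            = 0,    /* slot number chosen arbitrarily */
            .flags           = 0,
            .guest_phys_addr = gpa,
            .memory_size     = size,
            .userspace_addr  = (uint64_t)(uintptr_t)ram,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }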

Providing private memory

The proposed solution to this problem makes use of the kernel's memfd mechanism. The hypervisor can set up a private memory area for its guest by calling memfd_create() with the new MFD_INACCESSIBLE flag. That creates a special type of memfd that the creator can do little with; attempts to read from or write to it will fail, as will attempts to map it into memory. The creator can, though, use fallocate() to allocate (inaccessible) pages to this memfd. If the MEMFD_SECRET flag is also used at creation time, then the host's direct mapping for the affected pages will be removed, meaning that the host kernel, too, will have no mapping for that memory, making it difficult to access even if the host kernel is compromised.
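
As a rough, user-space sketch of that creation flow (the MFD_INACCESSIBLE flag comes from the patch set and is not in mainline headers, so the fallback definition below is only a placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MFD_INACCESSIBLE
    #define MFD_INACCESSIBLE 0x0008U    /* placeholder value; the patch set may differ */
    #endif

    int main(void)
    {
        size_t size = 1UL << 30;    /* 1GB of guest-private memory */

        /* Create the inaccessible memfd; the creator cannot read, write,
           or mmap() it. */
        int fd = memfd_create("guest-private", MFD_INACCESSIBLE);
        if (fd < 0) {
            perror("memfd_create");
            return EXIT_FAILURE;
        }

        /* Allocate backing pages for the guest; they remain inaccessible
           to this process. */
        if (fallocate(fd, 0, 0, size) < 0) {
            perror("fallocate");
            close(fd);
            return EXIT_FAILURE;
        }

        /* At this point the fd would be handed to KVM to provide the
           guest's private memory. */
        close(fd);
        return EXIT_SUCCESS;
    }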

The one other thing that can be done with this memfd is to pass it to KVM, which can map it into the guest's "physical" address space. The guest will then have full access to this memory, even though the host (which set it up) does not. Enabling this functionality requires setting up callbacks in both directions between KVM and the backing store (probably shmem) that actually provides the memory. The first set of operations is provided on the KVM side:

    struct memfile_notifier_ops {
	void (*invalidate)(struct memfile_notifier *notifier,
			   pgoff_t start, pgoff_t end);
	void (*fallocate)(struct memfile_notifier *notifier,
			  pgoff_t start, pgoff_t end);
    };

The fallocate() callback will be invoked whenever memory is allocated within this range, which happens when the fallocate() system call is used on the host side. It's worth noting that Dave Chinner objected to this name, so this callback is likely to end up being called something like notify_populate() instead. The invalidate() callback, instead, is used to indicate that a range of pages has been removed and can no longer be accessed by the guest.
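
To give a sense of how these notifications might be driven, here is a hedged, kernel-style sketch of the backing store fanning them out to registered notifiers; the notifier-list structure, its locking, and the helper names are illustrative rather than taken from the patch set:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    /* Illustrative notifier registration; struct memfile_notifier_ops is the
       structure shown above, and the patch set's actual layout may differ. */
    struct memfile_notifier {
        struct list_head list;
        const struct memfile_notifier_ops *ops;
    };

    struct memfile_notifier_list {
        struct list_head head;
        spinlock_t lock;
    };

    /* Called by the backing store after fallocate() has populated pages
       in [start, end). */
    static void notify_populate(struct memfile_notifier_list *nl,
                                pgoff_t start, pgoff_t end)
    {
        struct memfile_notifier *n;

        spin_lock(&nl->lock);
        list_for_each_entry(n, &nl->head, list)
            n->ops->fallocate(n, start, end);
        spin_unlock(&nl->lock);
    }

    /* Called before pages in [start, end) are truncated or hole-punched,
       so that KVM can tear down any guest mappings for them. */
    static void notify_invalidate(struct memfile_notifier_list *nl,
                                  pgoff_t start, pgoff_t end)
    {
        struct memfile_notifier *n;

        spin_lock(&nl->lock);
        list_for_each_entry(n, &nl->head, list)
            n->ops->invalidate(n, start, end);
        spin_unlock(&nl->lock);
    }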

The other callbacks are supplied by the backing-store implementation to provide KVM with access to the memory in this otherwise inaccessible memfd:

    struct memfile_pfn_ops {
	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
	void (*put_unlock_pfn)(unsigned long pfn);
    };

KVM will call get_lock_pfn() to obtain the host page-frame number(s) for one or more pages. When KVM unmaps pages, it calls put_unlock_pfn() to inform the backing store that those pages are no longer being used.
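
A hedged sketch of how the KVM side might use this pair is shown below; everything other than the two callbacks (and struct memfile_pfn_ops itself) is hypothetical stand-in code, not the actual KVM MMU path:

    /* Map one private page into the guest when it faults. */
    static long private_page_map(struct inode *inode,
                                 const struct memfile_pfn_ops *ops,
                                 pgoff_t offset)
    {
        int order = 0;
        long pfn = ops->get_lock_pfn(inode, offset, &order);

        if (pfn < 0)
            return pfn;

        /* Here KVM would install pfn (covering 1 << order pages) into the
           guest's second-stage page tables; that machinery is omitted. */
        return pfn;
    }

    /* Called when KVM later unmaps the page, for example in response to an
       invalidate() notification; the backing store learns that the page is
       no longer in use. */
    static void private_page_unmap(const struct memfile_pfn_ops *ops,
                                   unsigned long pfn)
    {
        ops->put_unlock_pfn(pfn);
    }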

This mechanism, along with the requisite plumbing in KVM, is enough to provide private memory to a guest. The hypervisor will allocate that memory for the guest, but will not be able to access it in any way.

Conversion

Quentin Perret raised a relevant question: what happens when the guest wants to share some of its private memory with the host? This happens reasonably frequently (to set up I/O buffers, for example), so most solutions in this area provide a mechanism for the "conversion" of memory pages between the private and shared states. Perret asked how that was meant to be handled with this mechanism.

The answer, as provided by Sean Christopherson (and clarified by Paolo Bonzini), is that the guest indicates the desire to convert pages with a hypercall. The KVM_RUN ioctl() then returns to the user-space hypervisor process with a KVM_EXIT_MEMORY_ERROR status; if the hypervisor concurs with the request, it responds by unmapping the relevant range of the inaccessible memfd. That, too, is done with fallocate(), using its "hole-punch" functionality. The hypervisor can then map ordinary memory into the newly created hole, resulting in a range that is accessible to both sides.
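
In code, the hypervisor's side of such a conversion might look roughly like the sketch below; the convert_to_shared() helper is illustrative, and the step of re-registering the now-shared range with KVM is omitted because that part of the interface was still under discussion:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Convert a range of the guest's private memory to shared. This is
       destructive by design: the hole punch discards whatever the guest
       had stored in the range. */
    static void *convert_to_shared(int private_fd, off_t offset, size_t size)
    {
        /* Drop the private pages backing [offset, offset + size). */
        if (fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, size) < 0)
            return NULL;

        /* Provide ordinary, host-accessible memory for the same range; the
           hypervisor would then register it with KVM for the corresponding
           guest-physical addresses (that step is omitted here). */
        void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        return shared == MAP_FAILED ? NULL : shared;
    }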

An important thing to note is that sharing pages back to the host is, by design, a destructive operation; the hole-punch operation will cause the data that was stored there to go away. As Christopherson described, this behavior matches what is done by a number of hardware implementations; pages must be shared with the host before being filled with the data to be shared. Perret, who is working on a similar mechanism for Android ("protected KVM" or pKVM), would rather have an in-place conversion mechanism available; without that, he said, this solution "might not suit pKVM all that well". He gave a list of reasons why that would be useful, including:

One goal of pKVM is to migrate some things away from the Arm Trustzone environment (e.g. DRM and the likes) and into protected VMs instead. This will give Linux a fighting chance to defend itself against these things -- they currently have access to _all_ memory. And transitioning pages between Linux and Trustzone (donations and shares) is fast and non-destructive, so we really do not want pKVM to regress by requiring the hypervisor to memcpy things.

Christopherson questioned the need for non-destructive conversions, suggesting that reworking pKVM to handle destructive conversions "doesn't seem too onerous". Andy Lutomirski was also unclear on the benefits of that capability, and worried about the difficulty of implementing it correctly:

If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation. Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done. The other direction is a mess: all existing uses of the page need to be torn down. If the page has been recently used for DMA, this includes IOMMU entries.

Perret reiterated his feeling that in-place conversion would perform better, but admitted that he (like all other participants in the discussion) has not yet collected the numbers to prove that one way or the other. He also doesn't have the details of in-place conversion worked out, though he proposed an outline for how it could work.

As of this writing, the conversation is ongoing with no clear resolution in sight. The developers involved all have an interest in creating a mechanism that will work for all use cases; there is little interest in adding several private-memory implementations. But they all also want to get the best performance they can while avoiding excess complexity. Reconciling objectives like these is at the core of systems programming (and, for that matter, most other types of programming) and is something that the kernel community is usually reasonably good at, at least when all of the interested parties are participating in the discussion. The developers have begun to talk, so, with luck, a workable solution can be expected to emerge from this conversation, but it may take a while yet.

Private memory for KVM guests

Posted Apr 8, 2022 0:45 UTC (Fri) by bartoc (guest, #124262) [Link] (7 responses)

Even here you're still trusting the host not to be malicious right? As it could just .... not do all this stuff and tell you (the guest) that it did.

Seems useful to defend against bugs though.

Private memory for KVM guests

Posted Apr 8, 2022 3:17 UTC (Fri) by ras (subscriber, #33059) [Link]

I don't know, but my guess is "yes". It reminds me of a police force's assurances that it will police its own wrongdoing.

I was hoping for a discussion on AMD's Secure Encrypted Virtualization.

Private memory for KVM guests

Posted Apr 8, 2022 6:23 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (4 responses)

There are two parts in this:

1) for pKVM (the Arm one) yes, you are. In that case the idea is just that the vendor can see the hypervisor's code and trusts it to protect the precious DRM'd movies

2) for TDX (Intel) and SEV-SNP (AMD) the guest gets an attestation signed by the processor vendor, and can check that the contents of the memory are as intended and that any debug mode is not enabled. In that case the hypervisor can mess with encrypted memory but that will result in either the host or the guest crashing. That's fine because these confidential virtualization mechanisms don't promise availability (the host can just decide not to run the guest at all).

For devices (block, network) the solution is to encrypt all the things and keep keys for the encrypted disk in the initramfs (which is included in the above-mentioned attestation).

Private memory for KVM guests

Posted Apr 10, 2022 17:42 UTC (Sun) by ssmith32 (subscriber, #72404) [Link] (1 responses)

For (2), doesn't it still assume that some guest instructions are allowed by the host to run directly on the processor?

What prevents the host from loading guest programs into an entirely virtualized CPU?

Other than the complexity involved in emulating an entire set of CPU behaviours..

Private memory for KVM guests

Posted Apr 10, 2022 19:35 UTC (Sun) by excors (subscriber, #95769) [Link]

I think the simplified version is: The host can't emulate the TDX/etc hardware, because that contains a private key known only to Intel/etc. The host could trick its guest into accepting a fake TDX attestation by cleverly patching the guest's verification code, but that doesn't matter because it's meant for *remote* attestation: the guest sends the signed attestation report over the network to some already-trusted machine, which verifies it against Intel's public key and checks that the TDX hardware says the hypervisor has properly enabled memory encryption etc, before sending sensitive information (e.g. encryption keys for the guest's disks) back to the now-trusted guest.

Private memory for KVM guests

Posted Apr 11, 2022 7:11 UTC (Mon) by LtWorf (subscriber, #124958) [Link] (1 responses)

But for (2), the hypervisor could just replace the instructions that perform the check with something else that places the wanted value into the wanted register, no?

Private memory for KVM guests

Posted Apr 11, 2022 8:06 UTC (Mon) by farnz (subscriber, #17727) [Link]

No, it can't, because you're doing a remote attestation protocol.

The protocol described for TPM-based remote attestation in this PDF is the sort of thing that's going on, with the TPM role being taken by the processor. The key to stopping the hypervisor from playing silly games is that you don't directly send back the attested value; instead, you use the processor's security enclave as the equivalent of a client certificate in a TLS-type protocol.

Because the processor mixes in various details of the guest to the final secret it generates, you know that the guest hasn't been modified - modification would result in a guest certificate that didn't match what you were expecting - and thus you're protected against the hypervisor replacing instructions. In the event that the hypervisor does replace instructions, you end up sending the remote server the "wrong" certificate, and it can identify that you're not a trusted code version - either a known code version that's got security bugs, or an unknown locally modified version of code.

This all, of course, assumes that Intel and AMD have implemented the security parts of their processor correctly, such that attacks like the Spectre family don't work against a secure VM.

Private memory for KVM guests

Posted Apr 9, 2022 2:20 UTC (Sat) by jhoblitt (subscriber, #77733) [Link]

The same issue applies to memfd in general, as there is no protection against a malicious kernel. Does it make things more difficult for an attacker? Smart people seem to think so. As a user of a public cloud, I would not consider this to mitigate any risk from the hypervisor. I would consider hardware support for encrypted memory a risk reduction. However, that would only be for the case where you assume the hypervisor was not compromised before the guest booted. I don't see any way to provide concrete security guarantees against a malicious hypervisor that could be trapping on any special security-enclave instructions.

