The Linux SVSM project

January 30, 2023

This article was contributed by Carlos Bilbao

If legacy networks are like individual homes with a few doors where a handful of people have the key, then cloud-based environments are like apartment complexes that offer both higher density and greater flexibility, but which include more key holders and potential entry points. The importance of protecting virtual machines (VMs) running in these environments — from both the host and other tenants — has become increasingly clear. The Linux Secure VM Service Module (SVSM) is a new, Rust-based, open-source project that aims to help preserve the confidentiality and integrity of VMs on AMD hardware.

The resource sharing that makes multi-tenant cloud environments so efficient can also expose the memory, caches, and registers of VMs to unauthorized access. As a response, confidential computing seeks to preserve the confidentiality and integrity of VMs from other VMs as well as from the host-machine owners. This is of particular concern for cloud providers that must meet their clients' stringent security requirements in order to sell their services. Availability is not usually part of the security goals because untrusted providers (potential attackers in these threat models) usually have direct physical access to the hosts themselves.

When performing sensitive operations on an untrusted cloud infrastructure, many resources, including the host BIOS, hypervisor, device drivers, virtual machine manager (VMM), and other VMs, cannot be fully trusted. With such a reduced trusted computing base (TCB), the root of trust usually falls to dedicated hardware components that are separate from the CPU and the rest of the hardware. The SVSM acts as an intermediary between the guest hypervisor and the firmware of these components on AMD processors. Within the context of operating systems, a "service module" can be defined as a separate entity whose main goal is to perform operations on behalf of the kernel. Since the kernel itself does not need to be able to perform such operations anymore, its ability to do so can be limited by the hardware, stopping a potential abuse from attackers.

In particular, Linux SVSM offers services to interact with the AMD Secure Processor (ASP), which is a key component of AMD's Secure Encrypted Virtualization (SEV) technology. The "Zen 3" architecture introduced with third-generation AMD EPYC processors uses the ASP to protect both the memory and register states of secured guests; the services Linux SVSM provides take advantage of these hardware capabilities. Linux SVSM provides secure services in accordance with the SVSM specification to help minimize the attack surface on guest machines. Its release was announced on the linux-coco confidential-computing mailing list, where the community is actively discussing development-related topics. Linux SVSM is an effort in the direction of virtualized confidential computing. Understanding this requires an introduction to the most recent SEV features.

SNP features used by Linux SVSM

The AMD Secure Nested Paging (SNP) feature is one of the confidential-computing extensions introduced with the "Zen 3" microarchitecture. Linux SVSM makes extensive use of two SNP features: the Reverse Map Table (RMP) and the Virtual Machine Privilege Levels (VMPLs); it also makes use of a special area known as the Virtual Machine Saving Area (VMSA). The VM state, which is a complete snapshot of the running guest's CPU registers, is saved in the VMSA whenever the VM exits back to the hypervisor.

SNP provides memory-integrity protection using a DRAM-loaded, per-host RMP. The RMP contains an entry for every physical page on the system and keeps track of the ownership and permissions of each so as to (for example) trigger a page fault when a third party attempts to write where it should not. The RMP thus acts as an extra step in the page-table walking sequence. Some of the RMP use cases include preventing data corruption, data aliasing, and page-remapping attacks. The RMP holds the mapping for each physical page and its corresponding guest page; therefore, only one guest page can be mapped per physical page. Further, an attacker may attempt to change the physical page mapped to a guest page behind the guest's back; the RMP will thwart such attacks.

Before using a page, the guest must first validate its RMP mapping (the RMP entries include a valid bit that is checked by hardware in the last step of the nested page walk). This is usually done with the PVALIDATE instruction during initial boot, as part of the kernel's page-table preparation. The hypervisor is responsible for managing the RMP in cooperation with the SVSM, and hardware checks have been implemented to ensure that the hypervisor does not misuse the RMP.
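The ownership, mapping, and validation checks described above can be illustrated with a small software model. This is only a sketch: the real RMP is a hardware-checked table with more state per entry, and every name here (Rmp, RmpEntry, RmpFault, the guest identifiers) is hypothetical and taken from neither the SVSM code nor the AMD documentation.

```rust
use std::collections::HashMap;

/// Who owns a system physical page in this toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    Hypervisor,
    Guest(u32), // hypothetical guest identifier
}

/// A simplified RMP entry; real entries carry more state.
#[derive(Debug, Clone, Copy)]
pub struct RmpEntry {
    pub owner: Owner,
    pub guest_page: u64, // the single guest page mapped to this physical page
    pub validated: bool, // set by the guest (PVALIDATE on real hardware)
}

#[derive(Debug, PartialEq, Eq)]
pub enum RmpFault {
    NoEntry,
    WrongOwner,
    Remapped,
    NotValidated,
}

/// Toy RMP keyed by system physical page number.
#[derive(Default)]
pub struct Rmp {
    entries: HashMap<u64, RmpEntry>,
}

impl Rmp {
    /// Hypervisor side: assign a physical page to a guest page.
    /// Any (re)assignment clears the validated bit in this model.
    pub fn assign(&mut self, phys_page: u64, guest: u32, guest_page: u64) {
        let entry = RmpEntry { owner: Owner::Guest(guest), guest_page, validated: false };
        self.entries.insert(phys_page, entry);
    }

    /// Guest side: validate a page that the guest owns.
    pub fn validate(&mut self, phys_page: u64, guest: u32) -> Result<(), RmpFault> {
        let entry = self.entries.get_mut(&phys_page).ok_or(RmpFault::NoEntry)?;
        if entry.owner != Owner::Guest(guest) {
            return Err(RmpFault::WrongOwner);
        }
        entry.validated = true;
        Ok(())
    }

    /// The check modelled as the last step of the nested page walk.
    pub fn check_guest_access(&self, phys_page: u64, guest: u32, guest_page: u64) -> Result<(), RmpFault> {
        let entry = self.entries.get(&phys_page).ok_or(RmpFault::NoEntry)?;
        if entry.owner != Owner::Guest(guest) {
            return Err(RmpFault::WrongOwner);
        }
        if entry.guest_page != guest_page {
            return Err(RmpFault::Remapped);
        }
        if !entry.validated {
            return Err(RmpFault::NotValidated);
        }
        Ok(())
    }

    /// Host writes to guest-owned pages are refused.
    pub fn check_host_write(&self, phys_page: u64) -> Result<(), RmpFault> {
        match self.entries.get(&phys_page) {
            Some(entry) if entry.owner != Owner::Hypervisor => Err(RmpFault::WrongOwner),
            _ => Ok(()),
        }
    }
}
```

In this model, a page assigned with assign() is refused by check_guest_access() until the owning guest calls validate(). Because a reassignment clears the validated bit, a page remapped behind the guest's back is caught on the next access.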

SNP also introduces the concept of Virtual Machine Privilege Levels (VMPLs), which range from zero to three, for enhanced hardware security control within VMs; VMPL0 is the highest level of privilege and VMPL3 the lowest, resembling x86 protection rings. VMPLs increase access-control granularity and can trigger exits from the VM when a virtual CPU (vCPU) attempts to access a resource that it should not. A new page that is assigned to, and validated by, a guest receives all permissions at VMPL0. The guest can later use the RMPADJUST instruction to change the permissions for numerically higher (that is, less-privileged) levels. For example, a guest running at VMPL1 can remove the execute permission for that page from vCPUs running at VMPL2 or VMPL3. Again, this type of operation normally occurs during boot. The VMSA of each guest contains its VMPL, which cannot be modified after launch unless the SVSM directly modifies the VMSA.
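A rough way to picture the VMPL rules is as a per-page array of permission masks, one per level. The sketch below models an RMPADJUST-like operation under three assumptions that are not spelled out in the text above: a fresh page has no permissions below VMPL0, a caller may only adjust numerically higher levels, and a caller cannot grant more than it holds itself. The bit layout and all names are hypothetical.

```rust
/// Permission bits for one VMPL on one page (layout is illustrative only).
pub const READ: u8 = 1 << 0;
pub const WRITE: u8 = 1 << 1;
pub const EXEC: u8 = 1 << 2;
pub const ALL: u8 = READ | WRITE | EXEC;

#[derive(Debug, PartialEq, Eq)]
pub enum AdjustError {
    InvalidVmpl,
    InsufficientPrivilege,
    ExceedsCaller,
}

/// Per-page permissions: index 0 is VMPL0 (most privileged), index 3 is VMPL3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PagePerms {
    masks: [u8; 4],
}

impl PagePerms {
    /// A freshly validated page: everything at VMPL0, nothing elsewhere (assumed).
    pub fn new_validated() -> Self {
        PagePerms { masks: [ALL, 0, 0, 0] }
    }

    /// True if `vmpl` holds every bit in `access`.
    pub fn allows(&self, vmpl: usize, access: u8) -> bool {
        vmpl < 4 && (self.masks[vmpl] & access) == access
    }

    /// Model of an RMPADJUST-style update issued by `caller` for `target`.
    pub fn adjust(&mut self, caller: usize, target: usize, new_mask: u8) -> Result<(), AdjustError> {
        if caller > 3 || target > 3 {
            return Err(AdjustError::InvalidVmpl);
        }
        // Only numerically higher (less-privileged) levels may be adjusted.
        if target <= caller {
            return Err(AdjustError::InsufficientPrivilege);
        }
        // The caller cannot hand out permissions it does not hold.
        if (new_mask & !self.masks[caller]) != 0 {
            return Err(AdjustError::ExceedsCaller);
        }
        self.masks[target] = new_mask;
        Ok(())
    }
}
```

The example from the text maps onto this model as follows: after VMPL0 grants ALL to VMPL1 and VMPL2, a call to adjust(1, 2, READ | WRITE) removes the execute permission for VMPL2, while adjust(2, 1, ALL) is refused.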

Linux SVSM makes use of these (and other) new SNP features. It runs at VMPL0 while the guest OS runs at VMPL1, meaning that the SVSM will perform all guest operations that require VMPL0 on behalf of the OS. The SVSM could also provide other services (e.g. potentially live migration) in a secure manner. All requests from the guest use an API defined in the SVSM specification and must follow protocol specifications for each service type. Relying on Linux SVSM to handle certain operations drastically hardens the TCB because the sensitive work is offloaded from large programs (such as the Linux kernel) that have many attack vectors to the smaller SVSM. Further, multiple subsystems (such as kernel randomization) that are now targets due to the expansion of cloud virtualization will not require the same levels of auditing because they become unprivileged.

The Linux SVSM execution flow

Linux SVSM is not an operating system; rather, it is a standalone binary loaded by the hypervisor. The SVSM benefits from the strong static guarantees of the Rust language, from both a security and memory perspective and for safe synchronization. The Linux SVSM logic comprises both its internal setup and VM guest request handling. Analyzing the Linux SVSM execution flow is an effective way to get a better understanding. This flow consists of the following four phases:

Jump to Rust. The SVSM is the first guest code executed by the hypervisor after a VM is launched. The boot process starts at VMPL0 within the bootstrap processor (BSP). A small amount of assembly code performs basic initialization before quickly jumping to higher-level, standalone Rust code. Even in Rust, though, some operations need to be executed from within unsafe blocks (e.g. writing to MSRs or dereferencing pointers). Linux SVSM relies on the x86_64 Rust crate for most of its page handling.

Kernel components initialization. SVSM, running on the BSP, performs some checks to verify that the provided memory addresses are correct and that it is indeed running at VMPL0 with proper SEV features. The SVSM also comes with serial output support and its own dynamic memory allocator (a slab allocator for allocations up to 2KB and a buddy scheme for allocations greater than that). All of these components are initialized and other OS housekeeping occurs as well.

Launch of APs and OVMF. When running the guest under SMP, the BSP initializes the auxiliary processors (APs), preparing a VMSA for each of them. Upon start, the APs jump to the SVSM request loop. The BSP locates the Open Virtual Machine Firmware (OVMF) BIOS, prepares its VMSA to run at VMPL1, and then asks the hypervisor to use the new VMSA to run the OVMF code. OVMF eventually starts the execution of the guest Linux kernel, which also runs at VMPL1. The SVSM is contained in the guest's address space, but it is not accessible by the guest. Whenever the guest OS needs to perform a privileged VMPL operation (such as validating its pages), it will communicate with the SVSM following one of the predefined protocols. The initialization process is now complete, and the SVSM is out of the picture while the guest kernel runs, at least until that kernel makes a service request.

Request loop. Once everything is up and running, the process of handling requests within the SVSM begins. When the guest needs to execute something at VMPL0 (such as updating the RMP with a page validation) or to request other services from the SVSM (like virtual TPM operations), it follows the SVSM API and requests the hypervisor to run the VMPL0 VMSA that is associated with the SVSM, triggering a context switch. At that point, the hypervisor resumes the SVSM by issuing a VMRUN instruction via the VMPL0 VMSA of the SVSM. The request is processed; upon completion, the SVSM instructs the hypervisor to resume the guest VMPL1 VMSA.
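The request-handling step can be pictured as a decode-and-dispatch routine. The sketch below is a simplified model, not the Linux SVSM implementation: the packing of a protocol number and call identifier into one 64-bit word, the constants PROTO_CORE and CALL_VALIDATE, and the Svsm type are all assumptions made for illustration, and the real calling convention is defined by the SVSM specification.

```rust
/// Hypothetical identifiers; the real values come from the SVSM specification.
pub const PROTO_CORE: u32 = 0;
pub const CALL_VALIDATE: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum Request {
    ValidatePage { guest_page: u64 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum Response {
    Success,
    UnsupportedProtocol,
    UnsupportedCall,
}

/// Decode a request word: protocol in the upper 32 bits, call ID in the
/// lower 32 bits (an assumed encoding for this sketch).
pub fn decode(word: u64, arg: u64) -> Result<Request, Response> {
    let protocol = (word >> 32) as u32;
    let call = word as u32;
    match (protocol, call) {
        (PROTO_CORE, CALL_VALIDATE) => Ok(Request::ValidatePage { guest_page: arg }),
        (PROTO_CORE, _) => Err(Response::UnsupportedCall),
        _ => Err(Response::UnsupportedProtocol),
    }
}

/// Toy SVSM state: just remembers which guest pages it validated.
#[derive(Default)]
pub struct Svsm {
    validated: Vec<u64>,
}

impl Svsm {
    /// Handle one request on behalf of the less-privileged guest.
    pub fn handle(&mut self, word: u64, arg: u64) -> Response {
        match decode(word, arg) {
            Ok(Request::ValidatePage { guest_page }) => {
                // Real code would perform the VMPL0-only operation here.
                self.validated.push(guest_page);
                Response::Success
            }
            Err(status) => status,
        }
    }

    /// Serve every pending request, then hand control back to the guest.
    pub fn run(&mut self, pending: &[(u64, u64)]) -> Vec<Response> {
        pending.iter().map(|&(word, arg)| self.handle(word, arg)).collect()
    }

    pub fn validated_pages(&self) -> &[u64] {
        &self.validated
    }
}
```

A guest-side caller in this model would build the word as ((PROTO_CORE as u64) << 32) | CALL_VALIDATE as u64 and pass the page number as the argument. Unknown protocols and calls come back as error responses rather than panics, which keeps the request loop running.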

Throughout this process, the SVSM executes with the SEV "Restricted Injection" feature active. This feature disables virtual interrupt queuing and limits the event-injection interface to just the #HV ("hypervisor injection") exception. The SVSM runs with interrupts disabled and does not expect any event injection, which would result in the SVSM double-faulting and terminating. This mode of operation is enforced to further reduce the security exposure within the SVSM and simplifies the handling of interruptions.
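Returning to the allocator mentioned in the kernel-components phase, the split between a slab allocator for requests up to 2KB and a buddy scheme for larger ones amounts to a simple size-based choice. The sketch below shows only that choice, not an allocator. The 4KB page size, the power-of-two size classes, the 32-byte minimum class, and all names are assumptions that do not come from the SVSM source.

```rust
pub const PAGE_SIZE: usize = 4096; // assumed page size
pub const SLAB_MAX: usize = 2048;  // the 2KB threshold from the text
pub const SLAB_MIN: usize = 32;    // hypothetical smallest slab class

#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    /// Served from a slab of fixed-size objects.
    Slab { object_size: usize },
    /// Served by the buddy allocator as 2^order contiguous pages.
    Buddy { order: u32 },
}

/// Pick the backend and size class for a request of `size` bytes.
pub fn choose_backend(size: usize) -> Option<Backend> {
    if size == 0 {
        return None;
    }
    if size <= SLAB_MAX {
        // Round up to the next power-of-two class, with a floor.
        let object_size = size.next_power_of_two().max(SLAB_MIN);
        Some(Backend::Slab { object_size })
    } else {
        // Round up to whole pages, then to a power-of-two page count.
        let pages = size / PAGE_SIZE + usize::from(size % PAGE_SIZE != 0);
        let order = pages.next_power_of_two().trailing_zeros();
        Some(Backend::Buddy { order })
    }
}
```

Under these assumptions, a 100-byte request lands in the 128-byte slab class, while a request for 4097 bytes needs two pages and is served at buddy order 1.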

What's next?

Linux SVSM requires updated versions of the host and guest KVM, QEMU, and OVMF subsystems. These modifications are currently either under development or making their way upstream. As of this writing, the SVSM repository includes initialization scripts that clone repositories with needed changes to ease the process for developers. The current focus is on Linux support; however, the SVSM specification itself is OS-independent.

Linux SVSM is an open-source project under active development. As such, it is accepting public contributions. Support for the ability to run under different x86 privilege levels is currently being developed. Once the SVSM is able to offload all the security operations, we will be able to provide additional services, such as live VM migration. The SVSM privilege-separation model also permits the existence of a virtual Trusted Platform Module (virtual TPM). You can find recent discussions regarding design possibilities for a potential vTPM on the linux-coco mailing list. The Linux SVSM may also benefit from finer security granularity, documentation, community participation, etc. There are many open development fronts and opportunities to be part of the process and learn Rust from a systems perspective along the way. We welcome all contributions to the project.

Index entries for this article
Kernel	Architectures/x86
Kernel	Confidential computing
GuestArticles	Bilbao, Carlos

The Linux SVSM project

Posted Jan 30, 2023 18:31 UTC (Mon) by tux3 (subscriber, #101245) [Link] (2 responses)

A problem with SEV-SNP is that it can be hard to actually provision hardware that supports it.

Google's GCP has a checkbox that magically turns on SEV for you, but doesn't let you talk to the hardware to get an attestation, defeating the point.

AWS wants people to use their custom Nitro and graviton hardware. There are rumors you can use SNP if you order the whole bare-metal AMD machine instead of a VM slice, though that sets you back about $10 an hour.

And then there's Azure which tries to ship TDX and SNP (and SGX). But... it's Azure. It has bugs. It has Azure Active Directory. It features the frequently broken az cli tool, and earned a few jokes about its uptime ("Azure 360"). But hey, in fairness, there are still worse fates than being stuck on Azure. At least it's not IBM CLOUD® or ORACLE. I hear some people even like Azure.

The Linux SVSM project

Posted Jan 30, 2023 22:08 UTC (Mon) by unixbhaskar (guest, #44758) [Link]

Bloody good. Precisely explained the fallacy about all these "cloudy things". Thanks.

The Linux SVSM project

Posted Jan 30, 2023 22:30 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> AWS wants people to use their custom Nitro and graviton hardware. There are rumors you can use SNP if you order the whole bare-metal AMD machine instead of a VM slice, though that sets you back about $10 an hour.

I can confirm that you can do that if you get a bare-metal instance.

The Linux SVSM project

Posted Jan 30, 2023 21:37 UTC (Mon) by Siosm (subscriber, #86882) [Link]

Great introduction, thanks!

The Linux SVSM project

Posted Jan 31, 2023 0:37 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

This might be the most acronym dense post I've read on LWN so far :)

The Linux SVSM project

Posted Jan 31, 2023 23:02 UTC (Tue) by intgr (subscriber, #39733) [Link] (9 responses)

As I understand, the threat model includes that the hypervisor or its kernel may be compromised.

It still doesn't make sense to me, how all of this prevents the hypervisor from attacking the SVSM itself, or swapping it out for a compromised version. Or swapping out some other part of the guest before starting it.

The Linux SVSM project

Posted Feb 1, 2023 17:19 UTC (Wed) by tux3 (subscriber, #101245) [Link] (8 responses)

The keystone is the attestation. You let an attacker access and tamper with everything above the security chip. The attacker can tamper with the host kernel, or run a compromised version of the VM instead of the real one. But then the security chip will hash everything above, and if it has been modified the signature that comes out of that won't be the one you expect.

So both the VM and the host can be tampered with, but it is tamper-evident. As the customer/developer/operator, you will refuse to provision your secrets onto the compromised VM. Your end-users will also check the signature remotely and should refuse to send any requests or user data to the VM. You can check remotely because the CPU vendor tells you which security chips are genuine and which are not, so an attacker can't very well replace the chip with a counterfeit either.

Only if you extract the real private signature key from a real chip, or compromise the CPU vendor, or are the CPU vendor, or find a vulnerability in the whole machinery, can you forge signatures and impersonate the real VM.

The Linux SVSM project

Posted Feb 1, 2023 17:40 UTC (Wed) by Zildj1an (guest, #152565) [Link]

That's a really good answer.

The Linux SVSM project

Posted Feb 2, 2023 12:51 UTC (Thu) by jgg (subscriber, #55211) [Link] (6 responses)

The security chip isn't the main point.

All these confidential compute solutions include an entire other "hypervisor" software layer that sits below the hypervisor linux in a more privileged CPU mode. (eg ARM calls it the Realm manager, AMD has the "AMD Secure Processor", I forget what Intel calls it)

This software layer is responsible for partitioning the machine and co-ordinating with the attestation mechanism to allow the machine partitions to be measured as they boot, including the lowest level hypervisor software in the measurement. In many ways it reminds me of Xen.

Once it partitions the machine then the hypervisor Linux is unable to reach into other partitions.

It is a weird dance where the hypervisor linux largely controls the machine, but when it wants to create a partition it, for instance, gathers up a bunch of its own memory and donates it to the low level hypervisor which makes the memory into a partition and removes it from the hypervisor Linux.

So the whole security design relies on the low level hypervisor being secure against a compromised hypervisor linux, and then the attestation mechanism allowing the VM itself to prove cryptographically that it is running under the control of a secure low level hypervisor.

As a customer of this stuff in a cloud you'd want to be saying things like:
- Show me the source for the low level hypervisor so it can be audited
- Prove to me the source you are showing me matches the measurement (ie reproducible builds)
- That your attestation keys are actually secure
- That the CPU HW you selected doesn't have bugs that let the measurement be forged

The general concept is that the CPU vendors are saying 'trust us more than your cloud provider' and they promise if you get a measured boot into a VM with an Intel/AMD/ARM signature on the low level hypervisor then their CPU HW will protect the VM from the cloud operator, whoever it is. It puts a lot of burden on the VM side to have a policy for what trust means. eg Do you want to trust an AWS CC instance if the low level hypervisor is also signed by AWS?

The Linux SVSM project

Posted Feb 2, 2023 15:26 UTC (Thu) by farnz (subscriber, #17727) [Link] (5 responses)

> The general concept is that the CPU vendors are saying 'trust us more than your cloud provider' and they promise if you get a measured boot into a VM with an Intel/AMD/ARM signature on the low level hypervisor then their CPU HW will protect the VM from the cloud operator, whoever it is. It puts a lot of burden on the VM side to have a policy for what trust means. eg Do you want to trust an AWS CC instance if the low level hypervisor is also signed by AWS?

Bear in mind as you think about this that you already have to trust the CPU vendor; their silicon could be backdoored to detect your workload and compromise it. The offer they're making is "you can reduce your trusted base from us and the cloud provider to just us"; this reduction in threat surface is of value, as long as they can ensure that their low-level hypervisor can be trusted at all.

The Linux SVSM project

Posted Feb 7, 2023 10:16 UTC (Tue) by JanC_ (guest, #34940) [Link] (4 responses)

It’s a lot easier (and a huge lot cheaper) for the CPU manufacturer to backdoor the software that runs on the "security processor" than to add a special backdoor targeting you to the actual hardware design though…

The Linux SVSM project

Posted Feb 7, 2023 11:49 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

But the security processor is part of the hardware design - if I can backdoor the software on the security processor, then I can write software that puts you into a VM it controls, and backdoors you that way. After all, nothing stops the CPU "booting" by running a hypervisor that leaves the security processor in total control, exposing what appears to be a "bare metal" interface, but in fact indirecting everything through the security processor's control.

The Linux SVSM project

Posted Feb 7, 2023 14:28 UTC (Tue) by JanC_ (guest, #34940) [Link] (2 responses)

And that's why jgg wants to be able to verify what code runs on it…

The Linux SVSM project

Posted Feb 7, 2023 14:33 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

Right, but the processor he verifies could itself have a security processor that's hidden, and that runs the verified code in a hypervisor allowing the CPU manufacturer to backdoor it. There is simply no affordable route, short of trusting the CPU manufacturer, to verify that the hardware they have described to you is the hardware that's running in your system. The best you can do is to destroy a random sample of CPUs, reverse-engineering them with an electron microscope, to confirm that there's nothing hidden - and this is both expensive, and also depends on the CPU manufacturer including backdoored CPUs in the set you destroy.

And even then, you have to trust that the electron microscope is not backdoored, and that the reverse engineers are honest…

The Linux SVSM project

Posted Mar 11, 2023 1:57 UTC (Sat) by ghane (guest, #1805) [Link]

> And even then, you have to trust that the electron microscope is not backdoored, and that the reverse engineers are honest…

... also that farnz hasn't backdoored the problem statement.

... and that corbet hasn't backdoored the comment which told us exactly what to do to check.

My head hurts. :-)