Rethinking the Linux cloud stack for confidential VMs
There is an inherent limit to the privacy of the public cloud. While Linux can isolate virtual machines (VMs) from each other, nothing in the system's memory is ultimately out of reach for the host cloud provider. To accommodate the most privacy-conscious clients, confidential computing protects the memory of guests, even from hypervisors. But the Linux cloud stack needs to be rethought in order to host confidential VMs, juggling two goals that are often at odds: performance and security.
Isolation is one of the most effective ways to secure the system by containing the impact of buggy or compromised software components. That's good news for the cloud, which is built around virtualization — a design that fundamentally isolates resources within virtual machines. This is achieved through a combination of hardware-assisted virtualization, system-level orchestration (like KVM, the hypervisor integrated into the kernel), and higher-level user-space encapsulation.
On the hardware side, mechanisms such as per-architecture privilege levels (e.g., rings 0-3 in x86_64 or Exception Levels on ARM) and the I/O Memory Management Unit (IOMMU) provide isolation. Hypervisors extend this by handling the execution context of VMs to enforce separation even on shared physical resources. At the user-space level, control groups limit the resources (CPU, memory, I/O) available to processes, while namespaces isolate different aspects of the system, such as the process tree, network stack, mount points, MAC addresses, etc. Confidential computing adds a new layer of isolation, protecting guests even from potentially compromised hosts.
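To make the user-space layer described above concrete, here is a minimal C sketch (assuming root privileges, with error handling abbreviated) that uses unshare(2) to drop a child process into fresh UTS and PID namespaces; cgroup limits would be configured separately through the cgroup filesystem:

```c
/*
 * Minimal sketch: isolate a child in new UTS and PID namespaces.
 * Requires CAP_SYS_ADMIN (run as root); error handling abbreviated.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* New UTS namespace: hostname changes stay invisible to the host.
	 * New PID namespace: our next child becomes PID 1 of its own tree. */
	if (unshare(CLONE_NEWUTS | CLONE_NEWPID) == -1) {
		perror("unshare");
		return EXIT_FAILURE;
	}

	pid_t pid = fork();	/* first child lands in the new PID namespace */
	if (pid == -1) {
		perror("fork");
		return EXIT_FAILURE;
	}
	if (pid == 0) {
		sethostname("sandbox", 7); /* visible only in this namespace */
		printf("child sees itself as PID %d\n", getpid()); /* prints 1 */
		_exit(EXIT_SUCCESS);
	}
	waitpid(pid, NULL, 0);
	return EXIT_SUCCESS;
}
```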
In parallel to the work on security, there is a constant effort to improve the performance of Linux in the cloud — both in raw throughput and in user experience (typically measured by quality-of-service metrics like low I/O tail latency). Knowing there is room to improve, cloud providers increasingly turn to I/O passthrough to speed up Linux: bypassing the host kernel (and sometimes the guest kernel) to expose physical devices directly to guest VMs. This can be done with user-space libraries like the Data Plane Development Kit (DPDK), which bypasses the guest kernel, or hardware-assisted features such as virtio Data Path Acceleration (vDPA), which allow paravirtualized drivers to send packets straight to the smartNIC hardware.
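For a sense of what kernel bypass looks like in practice, below is a stripped-down sketch of a DPDK receive loop, loosely following the shape of DPDK's basic-forwarding sample. The single port, the single RX/TX queue pair, and the pool sizing are simplifying assumptions, not a complete application:

```c
/*
 * Sketch of a DPDK-style kernel-bypass receive loop, loosely following
 * DPDK's basic-forwarding sample: one port, one RX/TX queue pair, and
 * packet buffers in user-space hugepage memory shared with the NIC.
 */
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define NUM_MBUFS    8191
#define MBUF_CACHE   250
#define BURST_SIZE   32

int main(int argc, char **argv)
{
	if (rte_eal_init(argc, argv) < 0)
		return EXIT_FAILURE;

	struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
			NUM_MBUFS, MBUF_CACHE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
			rte_socket_id());
	if (pool == NULL)
		return EXIT_FAILURE;

	uint16_t port = 0;			/* first DPDK-bound port */
	struct rte_eth_conf conf = {0};
	rte_eth_dev_configure(port, 1, 1, &conf);
	rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
			rte_eth_dev_socket_id(port), NULL, pool);
	rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE,
			rte_eth_dev_socket_id(port), NULL);
	rte_eth_dev_start(port);

	struct rte_mbuf *bufs[BURST_SIZE];
	for (;;) {
		/* Poll the NIC directly: no interrupts, no system calls,
		 * and no host-kernel involvement on the data path. */
		uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
		for (uint16_t i = 0; i < nb; i++)
			rte_pktmbuf_free(bufs[i]); /* real code: process */
	}
	return EXIT_SUCCESS;
}
```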
But hardware offloading exemplifies a fundamental friction in virtualization, where security and performance often pull in opposite directions. While it is true that offloading provides a faster path for network traffic, it has some downsides, such as limiting visibility and auditing, increasing reliance on hardware and firmware, and circumventing OS-based security checks of flows and data. The uncomfortable reality is that it's tricky for Linux to provide fast access to resources while concurrently enforcing the strict separation required to secure workloads. As it happens, the strongest isolation isn't the most performant.
A potential solution to this tension is extending confidential computing to the devices themselves by making them part of the VM's circle of trust. Hardware technologies like AMD's SEV Trusted I/O (SEV-TIO) allow a confidential VM to cryptographically verify (and attest to) a device's identity and configuration. Once trust is established, the guest can interact with the device and share secrets by allowing direct memory access (DMA) to its private memory, which is encrypted with its confidential-VM key. This avoids bounce buffers, the temporary copies through shared memory that are needed whenever a device (a GPU training an AI model, say) must access plaintext data, and which significantly slow down I/O operations.
The TEE Device Interface Security Protocol (TDISP), an industry standard published by the PCI-SIG, defines how a confidential VM and a device establish mutual trust, secure their communications, and manage interface attachment and detachment. A common way to implement TDISP is with a device that supports single-root I/O virtualization (SR-IOV), a PCIe feature that lets one physical device expose multiple virtual functions.
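As a small illustration of the SR-IOV building block, the sketch below asks the kernel to create virtual functions by writing to the standard sriov_numvfs sysfs attribute; the PCI address is a placeholder for whatever SR-IOV-capable device is present on the system:

```c
/*
 * Sketch: create SR-IOV virtual functions through sysfs. The PCI
 * address below is a placeholder; substitute the bus/device/function
 * of an SR-IOV-capable NIC. Requires root.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *attr =
		"/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";

	FILE *f = fopen(attr, "w");
	if (f == NULL) {
		perror("fopen");
		return EXIT_FAILURE;
	}
	/* Each VF enumerates as its own PCI device that can be handed to
	 * a guest (e.g. through VFIO) and, under TDISP, can serve as a
	 * separate TEE device interface. */
	fprintf(f, "4\n");
	fclose(f);
	return EXIT_SUCCESS;
}
```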
In those setups, the host driver manages the physical device, and each virtual function assigned to a guest VM acts as a separate TEE device interface. Unfortunately, TDISP requires changes across the entire stack: the device's firmware and hardware, the host CPU, and the hypervisor. TDISP also faces headwinds because not all vendors are on board; notably, NVIDIA, one of the biggest players in the GPU arena, sells GPUs with its own non-TDISP architecture.
Secure Boot
Beyond devices, many other parts of the Linux cloud stack must change to accommodate confidential computing, starting right at boot. To understand how, we need to look at Secure Boot. A typical sequence is shown in the area outlined in red in the figure below. First, the firmware verifies the shim pre-bootloader using a cryptographic key embedded in the firmware's non-volatile memory by the OEM, along with a database of valid signatures (DB) and a revocation list (DBX) used to reject known-bad binaries (such as a compromised first-stage bootloader) and revoked certificates. Once verified, shim is loaded into system memory and execution jumps to it.
Shim then does a similar check on the next step, the bootloader (usually GRUB), using a key provided by the Linux distribution. Finally, the bootloader verifies and loads the kernel inside the guest VM. The guest kernel can read the values of the Platform Configuration Registers (PCRs) stored in a virtual Trusted Platform Module (TPM) that the hypervisor provides (e.g. using swtpm) to get the digests of all previously executed components and verify that they match known-good values.
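On kernels that export per-bank PCR files under sysfs (added in Linux 5.12), a guest can read those digests directly; the sketch below prints the SHA-256 bank for the PCRs that conventionally cover firmware and boot components, assuming the virtual TPM appears as tpm0:

```c
/*
 * Sketch: read TPM 2.0 PCR digests from inside a guest via sysfs.
 * Assumes a kernel with per-PCR sysfs export (5.12+) and that the
 * (virtual) TPM is tpm0; tools like tpm2_pcrread use the TPM command
 * interface instead.
 */
#include <stdio.h>

int main(void)
{
	/* PCRs 0-7 conventionally hold firmware, bootloader, and
	 * configuration measurements. */
	for (int pcr = 0; pcr <= 7; pcr++) {
		char path[80];
		snprintf(path, sizeof(path),
			 "/sys/class/tpm/tpm0/pcr-sha256/%d", pcr);

		FILE *f = fopen(path, "r");
		if (f == NULL)
			continue;	/* bank or PCR not exported here */

		char digest[80];	/* 64 hex chars plus newline */
		if (fgets(digest, sizeof(digest), f) != NULL)
			printf("PCR[%d] = %s", pcr, digest);
		fclose(f);
	}
	return 0;
}
```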
Extra steps need to take place during boot to set up for confidential computing. In the figure above, a secure VM service module (SVSM) on the left becomes the first component to execute, verifying the firmware itself while running in a special privileged hardware mode (VMPL0 on AMD; other platforms have analogous trust levels). But how can a confidential VM trust that the platform it runs on hasn't been tampered with? In traditional Secure Boot, the chain of trust relies on a virtual TPM (vTPM) provided by the host. However, the hypervisor itself is now untrusted, so the guest cannot rely on a TPM controlled by it. Instead, the SVSM, or another trusted component isolated from the host, must provide a vTPM that supplies measurements for remote attestation. This allows the guest OS to verify the integrity of the platform and decide whether it is safe to run.
The details of remote attestation can vary depending on the model followed; the most well-known is the Remote ATtestation procedureS (RATS) architecture. In this model, three actors play a role:
- Attester: Dedicated hardware like AMD's Platform Security Processor (PSP) that generates evidence about its current state (e.g., firmware version) by signing measurements with a private key stored within it.
- Verifier: A remote entity that evaluates the evidence's integrity and trustworthiness. To do so, it consults an endorser to validate that the signing key and reported measurements (digests) are legitimate. The verifier can also be configured to enforce appraisal policies — for example, rejecting systems with outdated firmware versions from receiving secrets.
- Endorser: A trusted third party, typically the hardware vendor, that provides certificates confirming the signing key belongs to genuine cryptographic hardware. The endorser also supplies reference measurement values used by the verifier for validation.
The final product is an attestation result prepared by the verifier, confirming that the measured platform components match expected good values. A Linux confidential VM can use this report — including a vTPM quote with the current PCR values signed by a vTPM private key and a nonce supplied by the guest (to prevent replay attacks) — to decide whether to continue booting.
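As a toy model of these roles, the sketch below plays attester and verifier in a single process, assuming OpenSSL 3.x: the attester signs a measurement digest concatenated with the verifier's nonce, and the verifier checks the signature over the same bytes. The locally generated key is a stand-in; on real hardware the private key never leaves the PSP or vTPM, and the endorser's certificate chain (omitted here) is what ties the public key to genuine silicon:

```c
/*
 * Toy model of the RATS roles in one process (assumes OpenSSL 3.x).
 * The locally generated P-256 key stands in for a hardware attestation
 * key; in reality the private half never leaves the PSP/vTPM and an
 * endorser certificate chain vouches for the public half.
 */
#include <openssl/evp.h>
#include <openssl/rand.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	EVP_PKEY *key = EVP_EC_gen("P-256");	/* stand-in attestation key */

	/* Verifier's freshness challenge, defeating replayed evidence. */
	unsigned char nonce[16];
	RAND_bytes(nonce, sizeof(nonce));

	/* Evidence: a platform measurement (all-zero stand-in for a PCR
	 * digest) concatenated with the nonce. */
	unsigned char evidence[32 + sizeof(nonce)] = {0};
	memcpy(evidence + 32, nonce, sizeof(nonce));

	/* Attester: sign the evidence. */
	unsigned char sig[256];
	size_t siglen = sizeof(sig);
	EVP_MD_CTX *sctx = EVP_MD_CTX_new();
	EVP_DigestSignInit(sctx, NULL, EVP_sha256(), NULL, key);
	EVP_DigestSign(sctx, sig, &siglen, evidence, sizeof(evidence));
	EVP_MD_CTX_free(sctx);

	/* Verifier: accept only a valid signature over digest + nonce. */
	EVP_MD_CTX *vctx = EVP_MD_CTX_new();
	EVP_DigestVerifyInit(vctx, NULL, EVP_sha256(), NULL, key);
	int ok = EVP_DigestVerify(vctx, sig, siglen,
				  evidence, sizeof(evidence));
	EVP_MD_CTX_free(vctx);
	EVP_PKEY_free(key);

	printf("evidence %s\n", ok == 1 ? "verified" : "REJECTED");
	return ok == 1 ? 0 : 1;
}
```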
Secure Boot helps prevent malicious code from executing early in the boot sequence, but it can also increase boot time by a few seconds. Adding confidential computing to the equation slows things down even more. For most Linux users, the slight delay of Secure Boot is negligible and well worth the security benefits. But in cloud environments, even a few extra seconds of guest boot time can be consequential; small delays quickly add up at fleet scale. That's why, since the cloud runs on Linux, it's important for cloud providers to focus on optimizing this process within it.
To complicate things even more, there are different flavors of confidential computing. For example, instead of using an SVSM, Microsoft's Linux Virtualization-Based Security (LVBS) opts for a paravisor, as shown in the figure below. In LVBS, the paravisor is a small Linux kernel that runs in a privileged virtual trust level (VTL) after the bootloader. This design has the advantage of being vendor-neutral, but also has drawbacks, such as a significantly larger attack surface than the SVSM. Even though there are many ways to implement confidential VMs in Linux, we still lack a clear, shared understanding of the trade-offs between them.
Once the confidential VM is booted, two major sources of runtime overhead are DRAM encryption and decryption, and hardware enforcement of memory-access permissions. That said, because this happens inline within the memory controller, the delay is usually small, though the impact varies with the workload, particularly for cache-sensitive applications.
A separate, more significant performance hit comes from the process of accepting memory pages. Before a confidential VM can access DRAM, each page must be explicitly accepted by the guest. This step binds the guest physical address (gPA) of the page to a system physical address (sPA), preventing remapping — that is, once validated, the hardware enforces this mapping, and any attempt by the hypervisor to remap the gPA to a different sPA via nested page tables will trigger a page fault (#PF). The validation process is slow and requires the guest kernel to spend virtual-CPU cycles issuing hypercalls and causing VMEXITs, since it cannot directly execute privileged instructions like PVALIDATE on AMD processors. Only components running in special hardware modes — such as the SVSM at VMPL0 — can issue them directly. To avoid this cost at runtime, the SVSM (or whatever component is used) should pre-accept all memory pages early during the boot process.
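To show what acceptance looks like at the lowest level, here is a sketch loosely modeled on the pvalidate() helper in the kernel's arch/x86/include/asm/sev.h. It only executes successfully from VMPL0 (that is, from the SVSM or an equivalent component), and accept_all_memory() is a hypothetical illustration of the pre-acceptance strategy just described, not actual kernel code:

```c
/*
 * Guest-kernel-context sketch, loosely modeled on the pvalidate()
 * helper in arch/x86/include/asm/sev.h; not a runnable user-space
 * program. PVALIDATE (opcode F2 0F 01 FF) takes the virtual address
 * in rAX, the page size in rCX, and the validate flag in rDX.
 */
#include <stdbool.h>

#define RMP_PG_SIZE_4K 0

static inline int pvalidate(unsigned long vaddr, bool rmp_psize,
			    bool validate)
{
	int rc;

	asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF"
		     : "=a"(rc)
		     : "a"(vaddr), "c"(rmp_psize), "d"(validate)
		     : "memory", "cc");
	return rc;	/* 0 on success; the real helper also checks CF */
}

/*
 * Hypothetical pre-acceptance loop: validating all pages up front
 * trades a longer boot for the absence of acceptance-related
 * hypercalls and VMEXITs at runtime.
 */
static void accept_all_memory(unsigned long start, unsigned long end)
{
	for (unsigned long va = start; va < end; va += 4096)
		pvalidate(va, RMP_PG_SIZE_4K, true);
}
```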
Scaling
Fleet scalability — meaning how many guest VMs can be created — is also impacted by confidential computing. The most significant hardware limitations come from architectural constraints: for example, the number of available address-space identifiers (ASIDs). Each confidential VM requires a unique ASID in order to be tagged and isolated; without a unique ASID, the hardware cannot differentiate between encrypted memory regions belonging to different VMs. The maximum number of ASIDs that Linux can use is typically capped by the BIOS and limited to a few hundred. That might seem enough, but modern multicore processors can have hundreds of cores, each hosting one or even two virtual CPUs with simultaneous multithreading. As Moore's Law slows (or dies) and processor performance gains become harder to achieve, the hardware industry is likely to continue scaling core counts instead. Thus, without scalable support in Linux for confidential VMs, the cloud risks underutilizing cores.
A possible solution to the hardware scalability problems would be hybrid systems, where Linux could run both confidential and conventional VMs side by side. Today, kernel-configuration options enforce an all-or-nothing approach — either the system hosts only encrypted VMs or it hosts no encrypted VMs. Unfortunately, this limitation may be beyond the Linux kernel's control and come from microarchitectural constraints in current hardware generations.
In confidential VMs, swap memory needs to be encrypted to preserve the confidentiality of data even when moved to disk. Likewise, when the VMs communicate over the network — particularly through host-managed NICs — they must establish secure end-to-end sessions to maintain data integrity and confidentiality across untrusted host networks. Given the added overhead of these security measures, it's possible that the main workloads of confidential computing won't be traditional, low-latency cloud applications like client-server systems, but rather high-performance computing or scientific workloads. While these batch-oriented applications may still experience some performance impact, they generally have a higher tolerance for latency — not because they are inherently less sensitive to it, but because they lack realtime human interaction (e.g., there are no users sitting in front of a browser waiting for a reply).
Live migration is another important aspect of the cloud, allowing VMs to move between hosts (such as during maintenance in specific regions of the fleet) with minimal impact on the VMs — ideally without noticeable disruption, since IP addresses can be preserved using virtual LAN technologies like VXLAN. However, after migration, the attestation process must be repeated on the destination node. While pre-attesting a standby destination node can help reduce overhead, unexpected emergencies in the fleet may force the VM to migrate again shortly after arrival. Worse still, because the guest VM no longer implicitly trusts the host, it must also verify that its memory and execution context were correctly preserved during migration, and that any changes were properly tracked throughout. To facilitate all of this, a migration agent running in a separate confidential VM can help coordinate and secure live migration.
In conclusion
Hardware offloading has always implied a tradeoff in virtualization: it improves I/O performance but weakens security. Thanks to confidential computing, Linux can now achieve the former without sacrificing the latter. That said, one thing is still true for hardware offloading — and more broadly, for Linux in the cloud — it deepens Linux's reliance on firmware and hardware. In that sense, trust doesn't grow or shrink, it simply shifts. In this case, it shifts toward OEMs (hardware and device manufacturers).
But what happens if (or when) an attacker exploits vulnerabilities or backdoors in hardware or firmware? Unlike software, hardware is difficult to verify, leaving open the risk of hidden compromises that can undermine the entire security model. Open architectures like RISC-V may offer a solution with hardware designs that can be inspected and audited. This speaks to the security value of transparency and openness — ultimately the only way to eliminate the need to trust third parties.
Cloud providers are already expected to respect user privacy, but confidential computing turns that promise into more than just a leap of faith taken in someone else's computer. That shift puts the guest Linux kernel in an awkward spot. Cooperation with the host can be genuinely useful — say, synchronizing schedulers to make the most of NUMA layouts, or avoiding guest deadlocks. But the host is also, unavoidably, untrusted.
This means that Linux finds itself trying to work with something it's supposed to be protected from. As a consequence, a lot has to change in the Linux cloud stack to truly accommodate cloud confidential computing. Is this a worthwhile investment for the overall kernel community? As the foundation of the modern public cloud, Linux is in a good position to explore the potential of confidential VMs.
Misleading promises

Posted Jul 26, 2025 13:29 UTC (Sat)
by hailfinger (subscriber, #76962)
[Link] (3 responses)

A more realistic description would be: "To soothe the qualms of privacy-conscious clients without deep technical knowledge, confidential computing claims to protect the memory of guests, even from hypervisors." As an alternative, "[...] confidential computing protects [...], but no complete implementation of it exists as of now."

I have talked to a few tech companies advertising confidential computing, some of them acting as providers, others selling hardware/software for confidential computing. Some of them state it's only about raising the bar for attackers. Others hope to combine confidential computing with strict physical access control at a data center to achieve protection roughly equivalent to a server room protected by a cheap door at your local office. The ludicrous end of the spectrum will claim absolute protection against the operator of the infrastructure.

Questions to ask if you want to know whether a confidential computing service/product has the security level you desire:

1. If your product needs any special features of the CPU, does your CPU vendor have a public statement that these special features provide exactly the properties you claim? (Background: I have asked multiple CPU vendors if they support the claims made by trusted computing vendors about their CPUs, and the usual answer was a resounding "no".)
2. Does your product offer protection against attackers with physical access? If not, why do you claim protection against a cloud provider or data center operator?
3. Does your product offer protection against microarchitectural attacks? If yes, how do you implement that if the operator of the hardware has full logical control over the hardware?
4. Are you willing to back up your claims by accepting unlimited liability for all the damages incurred by a customer due to a breach of your claimed protection?
5. (Optional, mostly useful in a government context) If your product offers complete protection even against attackers with physical access, would you advocate outsourcing government computing tasks to a data center running your product, operated by an enemy intelligence service on their soil?
Defense in depth

Posted Jul 28, 2025 17:41 UTC (Mon)
by DemiMarie (subscriber, #164188)
[Link]

Confidential computing can still be valuable as defense in depth, provided that:

1. You can validate that the attestation report came from one of the CPUs in your physically secure data center.
2. You are using confidential computing to make accidental data leaks less likely.
Commoditize your complement

Posted Jul 31, 2025 12:18 UTC (Thu)
by yaap (subscriber, #71398)
[Link]

The cloud money stack includes cloud operators and chip makers (and some others, but the rest is lower margin). Both sides will try to commoditize each other.

The cloud companies are a step ahead, with their internal chips like Graviton for AWS, which puts pressure on the CPU vendors. What can the CPU vendors do to commoditize the cloud vendors? They must enable commodity, low-margin cloud vendors.

For this, they need to provide software (open source) and turn-key solutions (see NVIDIA AI racks) to enable the core functions. But then there is the problem of trust: the big customers are ready to trust AWS, Google, and Azure. They would be ready, too, to trust Intel, AMD, and NVIDIA. But they may not trust ACME Cloud Computing, HQ'd in Barbados. Enter confidential computing.

That's the target IMHO, and it's fair to say it's not reached yet.

All you can ever hope to accomplish security-wise is to raise the bar for an attacker. What you want is to make that avenue of attack too costly, so they give up and go bother somebody else.
Misleading promises

Posted Aug 6, 2025 2:14 UTC (Wed)
by vonbrand (subscriber, #4458)
[Link]
The guest decides?

Posted Jul 26, 2025 23:36 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (1 responses)
I'm not an expert on this by any means, but are you sure this is how it works? My general understanding is that the Guest OS cannot trust its own decision-making because if the platform is compromised, the platform can tamper with the guest. This means that the "do I trust the platform" decision can never be safely made by the guest.
What I thought is done instead is that the hardware is designed to attest the state of the platform. The guest just takes these signed measurements and passes them on to some outside service that you really trust. This service then makes the decision on whether the platform can be trusted, and on a positive decision, sends the guest keys to decrypt its secrets so that it can continue.
The guest decides?

Posted Jul 27, 2025 0:17 UTC (Sun)
by Zildj1an (subscriber, #152565)
[Link]
There is a Work Around to accelerate the TEE IO performance

Posted Aug 5, 2025 17:29 UTC (Tue)
by Liam_Ma (subscriber, #153162)
[Link]
1. Private Hugepages (Default): These function as standard hugepages within the TD guest, remaining encrypted and protected by the TDX module. They are used for all guest-internal memory allocations, ensuring confidentiality and integrity.
2. Shared Hugepages (New Type): This new hugepage type is specifically designed for I/O buffers (e.g., DPDK mempools).
3. User-Space MMIO VMEXIT Support: added to the TD guest kernel.
This architecture dramatically reduces the frequency of `vmexits`, leading to a substantial increase in I/O throughput and a reduction in latency. With the initial Intel TDX support having been merged in kernel version 6.16, we believe now is the opportune time to upstream our enhancements.
Please refer to https://dpdksummit2024.sched.com/event/1iAtD/boosting-the...
Smokes and mirrors

Posted Aug 23, 2025 1:59 UTC (Sat)
by Curan (subscriber, #66186)
[Link] (1 responses)
> The guest kernel can read the values of the Platform Configuration Registers (PCRs) stored in a virtual Trusted Platform Modules (TPM) that the hypervisor provides (e.g. using swtpm) to get the digests of all previously executed components and verify that they match known-good values.
fully admits to me that a guest can't trust a cloud environment. And how could it be different? Whoever controls the hardware and/or (at least) the software stack below you can pretend pretty much anything to the next level up. There is no guarantee for me that the hardware (or any part in between) is not lying to me. And there never will be.
If you need to trust your data or computations, you need to run your own systems. Anything else seems like self-delusion.
Smokes and mirrors

Posted Aug 23, 2025 4:08 UTC (Sat)
by himi (subscriber, #340)
[Link]
Security is the most glaring loss - security is impossible without control, running your stack on a cloud service forces you to trust that those controlling that service aren't lying to you about what they're doing. And *control* of the service doesn't simply mean legal ownership - if your cloud service provider suffers a security compromise, some unknown and probably malicious third party may be controlling the service you're relying on.
But the same applies to everything else in your stack. Your app has to trust the RDBMS when it returned success on COMMIT; the database service has to trust the kernel when it said that data hit persistent storage; the kernel has to trust the storage driver; the storage driver has to trust whatever storage system it's talking to - as soon as you get to the component you have no control over, you're reliant on your cloud service provider being honest (and not screwing up).
There are plenty of reasons to run things in the cloud - it really is more flexible than having responsibility from the bare metal up, and (broadly speaking) simpler; it's even cheaper sometimes. But you need to understand what you're giving up in exchange for that flexibility.
The only potential value I can see of confidential computing would be where organisations want to use cloud services, but they can't trust the existing security boundaries provided by the cloud service. Problem being, if it's relying on hardware which is outside your organisation's control then nothing's actually changed - you still have to trust whoever controls the cloud service, and the hardware they run their service on. If confidentiality is a critical requirement of your system, running it on hardware you don't control simply isn't an option.
That said, Amazon/Google/Microsoft/etc *are* very trustworthy in a way - you can absolutely trust them to do whatever they can get away with to make money; any actual services they provide in the process are side-effects, which can be trusted exactly as far as they decide they're useful for making money. And that's not a good foundation for building security- or confidentiality-critical systems; it's not a good foundation for building *any* critical systems.