
5.10 Merge window, part 1

5.10 Merge window, part 1

Posted Oct 17, 2020 4:53 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
In reply to: 5.10 Merge window, part 1 by roc
Parent article: 5.10 Merge window, part 1

I've heard from Amazon engineers at the last re:Invent that they reasonably believe that nested virtualization is unsafe on the current Intel CPUs. Their _current_ (with emphasis on "current") Graviton2 ARM family also doesn't support it.



5.10 Merge window, part 1

Posted Oct 17, 2020 11:40 UTC (Sat) by mss (subscriber, #138799) [Link] (6 responses)

> that they reasonably believe that nested virtualization is unsafe on the current Intel CPUs.

Ohh, that's interesting.

Do you know whether they meant the current KVM nVMX implementation (which is tricky to get right with issues getting fixed all the time) or the VMX support itself in the CPU (as the expression "current CPUs" in your comment would suggest)?

5.10 Merge window, part 1

Posted Oct 17, 2020 19:54 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

No idea. Back when I was working at Amazon, there were several huge hardware-level scares that required mass reboots of client VMs (both for legacy Xen and newer KVM-based VMs).

5.10 Merge window, part 1

Posted Oct 18, 2020 16:05 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (4 responses)

Google and Oracle have been offering nVMX for years and have contributed lots of changes. nVMX is definitely ready for production use and nSVM is getting there (though there is one processor erratum that complicates things).

What follows is my guess on what things really look like. First, Amazon plays complicated games with /dev/mem and memremap for EC2 in order to save the price of "struct page" for guest memory (that's 1.5% so nothing to sneeze at), and that makes nested virtualization slower. Second, their kernel is probably based on older versions of Linux and thus it lacks a lot of the improvements made to nested virtualization lately. And finally, Amazon sells bare metal instances at a higher price so they have no interest in covering virtualization workloads.
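The 1.5% figure can be checked with a quick calculation. This is a minimal sketch, assuming a 64-byte "struct page" and 4 KiB base pages (typical for x86-64, but neither number is stated in the comment); the 256 GiB host size and the helper name are purely illustrative.

```python
# Assumed values, not taken from the comment above:
STRUCT_PAGE_BYTES = 64   # typical sizeof(struct page) on x86-64
PAGE_BYTES = 4096        # 4 KiB base page size
GIB = 1 << 30


def struct_page_overhead(memory_bytes):
    """Bytes of struct page metadata needed to describe memory_bytes of RAM."""
    pages = memory_bytes // PAGE_BYTES
    return pages * STRUCT_PAGE_BYTES


# Fraction of memory consumed by the metadata itself.
fraction = STRUCT_PAGE_BYTES / PAGE_BYTES

# Hypothetical 256 GiB host, chosen only to make the numbers concrete.
overhead = struct_page_overhead(256 * GIB)

print(f"{fraction:.2%} of memory, {overhead // GIB} GiB on a 256 GiB host")
```

Under those assumptions the overhead is 1/64 of memory, which is where the roughly 1.5% comes from.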

5.10 Merge window, part 1

Posted Oct 18, 2020 21:35 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

VMX seems so complicated to me that I expect there to be lots of hardware bugs in it accessible from the hypervisor. That's not a big deal if you trust the hypervisor, which you mostly do at the root, but it's deeply problematic for nested virtualization.

*Maybe* so much bug hunting has been done by people with variously-coloured hats, and so many bugs fixed, that this risk has been reduced to an acceptably low level. But if this has been done then I would expect some of those bugs to have been published, and I haven't seen that, not like we have for other attack surfaces.

5.10 Merge window, part 1

Posted Oct 18, 2020 22:02 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

It is not *that* complicated actually, once you get familiar with it. Also, large parts of the state are not passed through to the processor by the root hypervisor, so the configuration for the nested guest ends up being not very different from that for the non-nested case.

5.10 Merge window, part 1

Posted Oct 19, 2020 4:44 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

> though there is one processor erratum that complicates things

What's the nSVM-related erratum on AMD?

5.10 Merge window, part 1

Posted Oct 19, 2020 9:25 UTC (Mon) by pbonzini (subscriber, #60935) [Link]

VMLOAD/VMRUN/VMSAVE instructions check their operand against a range of restricted addresses (such as the SMM TSeg) and generate an exception if the operand is within that range. When nesting is on, the address should be checked after it has gone through nested page tables, but instead it is checked as is.

If the nested hypervisor is unlucky enough to place its VMCB at an address that the processor rejects, it will fail to enter the nested guest. There are various possible workarounds though (the simplest is to reduce the amount of memory below 4GB in the nested hypervisor to 1GB, because usually SMM TSeg is somewhere between 0x40000000 and 0xC0000000).

Apparently it's been there since the first SVM processors but we only noticed last year and it took a few months to find the root cause.
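The workaround described above can be illustrated with a rough model. This is only a sketch under stated assumptions: the restricted window is the "usual" 0x40000000 to 0xC0000000 range quoted in the comment (the real TSeg location is platform-specific), guest RAM that does not fit below 4GB is assumed to be mapped starting at the 4GB boundary, and both helper functions are hypothetical names that do not correspond to any KVM or QEMU interface.

```python
GIB = 1 << 30
FOUR_GIB = 4 * GIB

# Window where SMM TSeg usually lives, as quoted in the comment.
# The actual restricted range on a given machine is an unknown here.
TSEG_WINDOW = (0x40000000, 0xC0000000)


def guest_ram_ranges(total_bytes, low_mem_bytes):
    """Hypothetical guest-physical layout: low_mem_bytes of RAM below 4GB,
    the remainder mapped starting at the 4GB boundary."""
    low = min(total_bytes, low_mem_bytes)
    ranges = [(0, low)]
    if total_bytes > low:
        ranges.append((FOUR_GIB, FOUR_GIB + (total_bytes - low)))
    return ranges


def layout_avoids_window(ranges, window=TSEG_WINDOW):
    """True if no guest RAM range overlaps the window, so an untranslated
    VMCB address taken from that RAM cannot fall inside it."""
    lo, hi = window
    return all(end <= lo or start >= hi for start, end in ranges)


# 8 GiB nested hypervisor with 3 GiB below 4GB: a VMCB could land in the window.
risky = guest_ram_ranges(8 * GIB, 3 * GIB)
# Same guest with only 1 GiB below 4GB: all RAM sits outside the window.
safe = guest_ram_ranges(8 * GIB, 1 * GIB)

print(layout_avoids_window(risky), layout_avoids_window(safe))
```

The model is deliberately conservative: it treats every address in the window as potentially rejected, whereas the processor only rejects the actual restricted ranges.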


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds