
Direct host system calls from KVM

By Jonathan Corbet
July 29, 2022
As a general rule, virtualization mechanisms are designed to provide strong isolation between a host and the guest systems that it runs. The guests are not trusted, and their ability to access or influence anything outside of their virtual machines must be tightly controlled. So a patch series allowing guests to execute arbitrary system calls in the host context might be expected to be the cause of significantly elevated eyebrows across the net. Andrei Vagin has posted such a series with the expected results.

The use case for Vagin's work is gVisor, a container-management platform with a focus on security. Like a full virtualization system, gVisor runs containers within a virtual machine (using KVM), but the purpose is not to fully isolate those containers from the system. Instead, KVM is used to provide address-space isolation for processes within containers, but the resulting virtual machines do not run a normal operating-system kernel. Instead, they run a special gVisor kernel that handles system calls made by the contained processes, making security decisions as it goes.

That kernel works in an interesting way; it maps itself into each virtual machine's address space to match its layout on the host, then switches between the two as needed. The function to go to the virtual-machine side is called, perhaps inevitably, bluepill(). The execution environment is essentially the same on either side, with the same memory layout, but the guest side is constrained by the boundaries placed on the virtual machine.

Many of the application's system calls can be executed by gVisor within the virtual machine, but some of them must be handled in the less-constrained context of the host. It certainly works for gVisor to simply perform a virtual-machine exit to have the controlling process on the host side execute the call, then return the result back into the virtual machine, but exits are slow. Performing a lot of exits can badly hurt the performance of the workload overall; since part of the purpose of a system like gVisor is to provide better performance than pure virtualization, that is seen as undesirable.

The proposed solution is to provide a new hypercall (KVM_HC_HOST_SYSCALL) that the guest kernel can use to run a system call directly on the host. It takes two parameters: the system-call number and a pt_regs structure containing the parameters for that system call. After executing the call in the host context (without actually exiting from the virtual machine), this hypercall will return the result back to the caller. This interface only works if the guest knows enough about the host's memory layout to provide sensible system-call parameters; in the gVisor case, where the memory layout is the same on both sides, no special attention is required.
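To make the shape of that interface concrete, here is a minimal user-space sketch in C. It is not code from the patch series: struct sketch_regs and sketch_host_syscall() are hypothetical names, the struct is a cut-down stand-in for pt_regs that holds only the six x86-64 system-call argument registers, and the "host side" is modeled with the ordinary syscall() wrapper from libc rather than anything inside KVM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the pt_regs argument: only the six
 * registers that carry system-call arguments on x86-64.
 */
struct sketch_regs {
    unsigned long rdi, rsi, rdx, r10, r8, r9;
};

/*
 * Hypothetical model of the host side of the hypercall: take a
 * system-call number and a register block, run the call, and hand
 * the result back. Errors come back as a negative errno value, as
 * they would from the kernel, rather than as -1 plus errno.
 */
long sketch_host_syscall(long nr, const struct sketch_regs *regs)
{
    assert(regs != NULL);

    long ret = syscall(nr, regs->rdi, regs->rsi, regs->rdx,
                       regs->r10, regs->r8, regs->r9);
    return ret == -1 ? -errno : ret;
}
```

Calling sketch_host_syscall(SYS_getpid, &regs) with a zeroed register block returns the caller's process ID. Note that nothing in the sketch translates the arguments: a buffer address placed in rsi is handed to the system call unchanged, which only makes sense when both sides share the same memory layout.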

Internally, this functionality works by way of a new helper called do_ksyscall_64(), which can invoke any system call from within the kernel. Given that invoking system calls in this way is generally frowned upon, this functionality seems sure to be a lightning rod for criticism and, indeed, Thomas Gleixner duly complained: "this exposes a magic kernel syscall interface to random driver writers. Seriously no". While he acknowledged that the series overall is "a clever idea", he made it clear that exposing system calls in this way was not going to fly.

Meanwhile, the ability to invoke host-side system calls directly from a KVM guest pokes a major hole in the isolation between virtual machines and the host. Indeed, the cover letter describes it as "a backdoor for regular virtual machines". Thus, as one would expect, the direct system-call feature is disabled by default; processes that want to use it must enable it explicitly when creating a virtual machine. Most hypervisors, it is to be expected, will not do that.

The kernels running deep within companies like Google often contain significant changes that are not found in the upstream code; this patch set gives a hint of what one of those changes looks like:

In the Google kernel, we have a kvm-like subsystem designed especially for gVisor. This change is the first step of integrating it into the KVM code base and making it available to all Linux users.

That led Sean Christopherson to ask about what the following steps would be. "It's practically impossible to review this series without first understanding the bigger picture". Merging this first step could be a mistake if the following steps turn out not to be acceptable; at that point, the kernel community could find itself supporting a partial feature that is not actually being used. As it turns out, Vagin said, this is the only feature that is needed. gVisor works on top of KVM now, he said; the current patch series just improves its performance.

Christopherson also asked about alternatives, noting that "making arbitrary syscalls from within KVM is mildly terrifying". Vagin provided a few, starting with the current scheme where a virtual-machine exit is used to (slowly) handle each system call. Another approach is to run all of gVisor on the host side, exiting from the virtual machine for every system call. Executing a system call in this mode takes about 2.1µs; the direct system-call mechanism reduces that to about 1.0µs. Or gVisor could use BPF to handle the system calls; that provides similar performance, Vagin said, but would require some questionable changes, like providing BPF programs with the ability to invoke arbitrary system calls. Yet another possibility is to use the once-proposed process_vm_exec() system call, but that can perform poorly in some situations.
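As a rough illustration of what those two figures mean for a workload, the arithmetic can be sketched in C. The function names are hypothetical, and the only inputs taken from the discussion are the 2.1µs and 1.0µs per-call costs (expressed as 2100ns and 1000ns); the number of calls is an arbitrary example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total time saved, in nanoseconds, when `calls` system calls each
 * cost direct_ns instead of exit_ns. Returns 0 if there is no saving.
 */
uint64_t sketch_saved_ns(uint64_t calls, uint64_t exit_ns, uint64_t direct_ns)
{
    if (direct_ns >= exit_ns)
        return 0;
    return calls * (exit_ns - direct_ns);
}

/*
 * Per-call reduction as a whole-number percentage of the exit-based
 * cost (integer division, so the result is rounded down).
 */
uint64_t sketch_reduction_percent(uint64_t exit_ns, uint64_t direct_ns)
{
    assert(exit_ns > 0);
    if (direct_ns >= exit_ns)
        return 0;
    return (exit_ns - direct_ns) * 100 / exit_ns;
}
```

With the quoted costs, sketch_reduction_percent(2100, 1000) gives 52, and one million host-side system calls would save 1,100,000,000ns, or about 1.1 seconds of system-call overhead.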

KVM maintainer Paolo Bonzini said that his largest objection is the lack of address translation between the guest and the host. In its current form, this mechanism depends on the memory layout being the same on both sides; otherwise any addresses in an argument to a system call would not make sense on the host side. As a result, the new mechanism is highly specialized for gVisor and seems unlikely to be more widely useful. It is not clear that everybody sees that specialization as a disadvantage, though.

All told, gVisor in this mode represents an interesting shift in the security boundary between a host and the containers it runs. Much of the security depends on code that is within the virtual machine, with the host side trusting that code at a fairly deep level. It is a different view of how virtualization with KVM is meant to work, but it seems that the result works well — within Google at least. Whether this mechanism will make it into the mainline remains an open question, though. Making holes in the wall between host and guest is not something to be done lightly, so the developers involved are likely to want to be sure that no better alternatives exist.

Index entries for this article
Kernel: Virtualization/KVM



Direct host system calls from KVM

Posted Jul 29, 2022 18:35 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

Interesting; for a while I've wondered if there are possible gains in using virtualization hardware to harden regular process/namespace/container isolation, maybe also useful for things like PTI.

One could imagine a system where each systemd slice (substitute your preferred cgroup managing tool here) is totally isolated not just from the kernel address space but other ring-3 processes. The mainframe crowd's probably had this sort of thing for decades, mind...

Direct host system calls from KVM

Posted Jul 29, 2022 21:40 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (1 responses)

Before heavy containerization and namespaces became available on everything and its dog, there were a lot of projects using essentially low-level KVM just to package an individual process in a similar way. I think Firecracker would be one of the best-known surviving examples of that approach these days.

Direct host system calls from KVM

Posted Jul 30, 2022 6:50 UTC (Sat) by ma4ris5 (guest, #151140) [Link]

For general use, I like the multi-layer security approach, something like:
- Use liburing mechanism (shared ring buffer between host and guest) to avoid context switches. Preferred: avoid pinned memory pitfalls.
- KVM host kernel has security checks enabled, and small attack surface (against KVM guest Kernel) via liburing.
- KVM guest kernel has the same via liburing.
- Keep power consumption low (no unnecessary Kernel side CPU thread busy looping by default).

Direct host system calls from KVM

Posted Aug 2, 2022 20:03 UTC (Tue) by developer122 (guest, #152928) [Link]

This reminds me a LOT of QEMU's user-mode emulation, where it runs a single binary (not an OS) and translates syscalls between CPU architectures. It's a real shame that good documentation on this mode seems few and far between.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds