LWN: Comments on "Updates in container isolation" https://lwn.net/Articles/754433/ This is a special feed containing comments posted to the individual LWN article titled "Updates in container isolation". en-us Thu, 04 Sep 2025 20:01:27 +0000 Thu, 04 Sep 2025 20:01:27 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Updates in container isolation https://lwn.net/Articles/756216/ https://lwn.net/Articles/756216/ ZhuYanhai <div class="FormattedComment"> The Sentry runs in guest ring 0 on the KVM platform, so guest code doesn't trigger a vmexit for its syscalls unless the Sentry itself needs some additional syscalls.<br> <p> On the ptrace platform the Sentry runs in ring 3, but that platform is designed for development and debugging purposes only.<br> <p> </div> Fri, 01 Jun 2018 08:47:51 +0000 Updates in container isolation https://lwn.net/Articles/755015/ https://lwn.net/Articles/755015/ bergwolf <div class="FormattedComment"> <font class="QuotedText">&gt; He cited the CVE-2017-1002101 vulnerability, which allows hostile container images to take over a host through specially crafted symbolic links. [Zhengyu He] Like native containers, hypervisors like Kata Containers also allow the guest to mount filesystems across the container boundary, so they are vulnerable to such an attack.</font><br> <p> CVE-2017-1002101 exploits symlinks, and Kata Containers is vulnerable because of 9p, which is also true of gVisor. 
A simple demonstration:<br> [macbeth@/]$ls -l /tmp/foo/bar<br> lrwxrwxrwx 1 bergwolf bergwolf 9 May 21 11:15 /tmp/foo/bar -&gt; ../../../<br> [macbeth@/]$docker run -it --runtime gvisor -v /tmp/foo/bar/etc:/vol busybox<br> / # ls -l /vol |head<br> total 258<br> drwxr-xr-x 4 root root 35 Nov 6 2017 X11<br> drwxr-xr-x 3 root root 20 Nov 6 2017 acpi<br> -rw-r--r-- 1 root root 3028 Oct 17 2017 adduser.conf<br> drwxr-xr-x 2 root root 4096 May 2 11:00 alternatives<br> drwxr-xr-x 3 root root 21 Nov 6 2017 apm<br> drwxr-xr-x 3 root root 59 Feb 22 09:32 apparmor<br> drwxr-xr-x 9 root root 272 Feb 22 09:32 apparmor.d<br> drwxr-xr-x 3 root root 45 Feb 22 09:32 apport<br> drwxr-xr-x 6 root root 197 Mar 31 18:01 apt<br> <p> To fix CVE-2017-1002101, both Kata Containers and gVisor should be more careful about what they map into the 9pfs share exposed to the guest.<br> <p> </div> Mon, 21 May 2018 03:26:40 +0000 Updates in container isolation https://lwn.net/Articles/755007/ https://lwn.net/Articles/755007/ bergwolf <div class="FormattedComment"> <font class="QuotedText">&gt; If it's ring 3 somehow, then that would be an additional security layer for gVisor, but worse for performance.</font><br> <p> It's ring 3 and each syscall has to vmexit. Bad news for syscall-intensive applications.<br> </div> Mon, 21 May 2018 01:39:24 +0000 Updates in container isolation https://lwn.net/Articles/755006/ https://lwn.net/Articles/755006/ bergwolf <div class="FormattedComment"> Things like procfs, sysfs and various ioctls are the tricky parts of the Linux kernel API and are hard to stay compatible with.<br> </div> Mon, 21 May 2018 01:35:43 +0000 Updates in container isolation https://lwn.net/Articles/754959/ https://lwn.net/Articles/754959/ prattmic <div class="FormattedComment"> The sandboxed application processes run in guest ring 3. 
The gVisor kernel runs in guest ring 0 (and host ring 3).<br> </div> Sun, 20 May 2018 06:31:53 +0000 Updates in container isolation https://lwn.net/Articles/754815/ https://lwn.net/Articles/754815/ roc <div class="FormattedComment"> Because of this, I think the article obscures the security advantages of gVisor compared to something like Kata Containers. I don't understand the comparison between the number of lines of code in the Linux kernel and gVisor. Are they talking about the guest kernel size? But the guest kernel size in a VM container isn't an issue for security because it's inside the sandbox. Are they talking about the host kernel size making it hard to audit the security of KVM itself? But that would also affect gVisor-KVM, which is going to be the preferred deployment configuration ... assuming your host is running on bare metal or nested virtualization is available.<br> <p> I think the security argument for gVisor-KVM is that if you have a KVM escape that escapes into the hypervisor's user-level, not the host kernel itself, then you're still in a very restrictive sandbox around the gVisor kernel. Whereas with Kata you'd be in QEMU which probably needs a much less restricted sandbox.<br> <p> Although one interesting question is, do gVisor-KVM guest processes run at ring 0 or ring 3? If it's ring 3 somehow, then that would be an additional security layer for gVisor, but worse for performance.<br> <p> I can see advantages for gVisor in terms of memory and storage usage, because the guest can share the host file system rather than mounting its own on a virtual block device.<br> </div> Thu, 17 May 2018 23:34:39 +0000 Updates in container isolation https://lwn.net/Articles/754783/ https://lwn.net/Articles/754783/ prattmic <div class="FormattedComment"> Regarding /proc, gVisor does have a procfs implementation. 
Like system calls, procfs files are handled internally and we don't cover the entire surface, but do have the core functionality required by most applications (/proc/PID/maps, /proc/cpuinfo, parts of /proc/PID/status, etc).<br> <p> This doc (<a href="https://github.com/google/gvisor/blob/master/pkg/sentry/fs/proc/README.md">https://github.com/google/gvisor/blob/master/pkg/sentry/f...</a>) tries to explain which parts of procfs are implemented, though it is not 100% complete.<br> </div> Thu, 17 May 2018 16:38:39 +0000 Updates in container isolation https://lwn.net/Articles/754730/ https://lwn.net/Articles/754730/ anarcat <div class="FormattedComment"> Right. I probably got this one backwards, apologies.<br> <p> Still, the way Xen is designed just feels a little backwards to me as the first layer is not actually the hypervisor itself, but a (compatible) kernel that talks with the hypervisor. And yes, that *does* provide an *extra* layer of security at the cost of performance. But Xen's design also means that the privileged supervisor domain you need (dom0 in the case of Xen) is now part of the attack surface as well, and I seem to recall that being used as an attack vector in the past, but I could be mistaken there. I think this is where my analogy came from, but I must admit I cannot substantiate this any further and I am forced to recognize that the attack surfaces are comparable with other hypervisors like gVisor, at least conceptually.<br> </div> Thu, 17 May 2018 14:00:17 +0000 Updates in container isolation https://lwn.net/Articles/754712/ https://lwn.net/Articles/754712/ thinxer <div class="FormattedComment"> You don't actually worry about the kernel running inside the sandbox. 
You worry about the sandbox only, which usually has simpler interfaces than the kernel and thus a reduced attack surface.<br> </div> Thu, 17 May 2018 02:23:14 +0000 Updates in container isolation https://lwn.net/Articles/754710/ https://lwn.net/Articles/754710/ anarcat <div class="FormattedComment"> The point is that instead of having just the kernel to worry about (which is still running inside the container), you now also have the hypervisor (in this case Xen) to worry about as well.<br> <p> <p> </div> Thu, 17 May 2018 01:36:35 +0000 Updates in container isolation https://lwn.net/Articles/754707/ https://lwn.net/Articles/754707/ roc <div class="FormattedComment"> <font class="QuotedText">&gt; Kata Containers relies on a kernel running inside the container, which actually expands the attack surface instead of reducing it.</font><br> <p> This doesn't sound right. The kernel running inside the container is inside the sandbox (the virtual machine interface implemented by the hypervisor), therefore it cannot add to the attack surface.<br> </div> Thu, 17 May 2018 01:22:33 +0000