From: ebiederm-AT-xmission.com (Eric W. Biederman)
To: Ted Ts'o <tytso-AT-mit.edu>
Subject: Re: Detecting if you are running in a container
Date: Mon, 10 Oct 2011 23:42:36 -0700
Cc: Matt Helsley <matthltc-AT-us.ibm.com>, Lennart Poettering <mzxreary-AT-0pointer.de>, Kay Sievers <kay.sievers-AT-vrfy.org>, linux-kernel-AT-vger.kernel.org, harald-AT-redhat.com, david-AT-fubar.dk, greg-AT-kroah.com, Linux Containers <containers-AT-lists.osdl.org>, Linux Containers <lxc-devel-AT-lists.sourceforge.net>, "Serge E. Hallyn" <serge-AT-hallyn.com>, Daniel Lezcano <daniel.lezcano-AT-free.fr>, Paul Menage <paul-AT-paulmenage.org>

Ted Ts'o <tytso-AT-mit.edu> writes:

> On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote:
>> Yes, it does detract from the unique advantages of using a container.
>> However, I think the value here is not the efficiency of the initial
>> system configuration but the fact that it gives users a better place to
>> start.
>>
>> Right now we're effectively asking users to start with non-working
>> and/or unfamiliar systems and repair them until they work.
>
> If things are not working with containers, I would submit to you that
> we're doing something wrong(tm).

That is what this discussion is about.  What we are doing wrong(tm).
Mostly it is about the bits that have not yet been namespacified but
need to be.

I am totally in favor of not starting the entire world.  But just as I
find it convenient to loopback mount an ISO image to see what is on a
disk image, it would be handy to be able to just download a distro
image and play with it, without doing anything special.

We can pare things down further for the people who are running 1000
copies of apache, but not requiring detailed distro surgery before
starting up the binaries on a livecd sounds handy.

> Things should just work, except that
> processes in one container can't use more than their fair share (as
> dictated by policy) of memory, CPU, networking, and I/O bandwidth.

You have to be careful with the limiters.  The fundamental reason why
containers are more efficient than hardware virtualization is that with
containers we can overcommit resources, especially memory.  I keep
seeing implementations of resource limiters that want to do things in a
heavy-handed way that breaks resource overcommit.

> Something which is baked in my world view of containers (which I
> suspect is not shared by other people who are interested in using
> containers) is that given that kernel is shared, trying to use
> containers to provide better security isolation between mutually
> suspicious users is hopeless.  That is, it's pretty much impossible to
> prevent a user from finding one or more zero day local privilege
> escalation bugs that will allow a user to break root.  And at that
> point, they will be able to penetrate the kernel, and from there,
> break security of other processes.

You don't even have to get to security problems to have that concern.
There are enough crazy timing and side channel attacks.

I don't know what concern you have security wise, but the problem that
wants to be solved with user namespaces is something you hit much
earlier than when you worry about sharing a kernel between mutually
distrusting users.  Right now root inside a container is root outside
of a container, just like in a chroot jail.  Where this becomes a
problem is that people change things like
/proc/sys/kernel/print-fatal-signals expecting it to be a setting local
to their sandbox when in fact it is the global setting, and things
start behaving weirdly for other users.  Running sysctl -a during
bootup has that problem in spades.

With user namespaces what we get is that the global root user is not
the container root user, and we have been working our way through the
permission checks in the kernel to ensure we get them right in the
context of the user namespace.  This trivially means that the things
that we allow the global root user to do in /proc/ and /sys/ and the
like simply won't be allowed as a container root user.  Which makes
doing something stupid and affecting other people much more difficult.

What the user namespace also allows is an escape hatch from the bonds
of suid.
Right now anything that could confuse an existing app that is suid root
we have to allow only to root, or risk adding a security hole.  With
user namespaces we can relax that check and allow it for container root
users as well as global root users.  When we are brave enough and
certain enough of our code we can allow non-root users to create their
own user namespaces.

There is a third use for containers, where for some reason we have uid
assignment overlap.  Perhaps one distro assigns uid 22 to sshd and
another assigns it to the nobody user.  Or perhaps there are two
departments that have done the silly thing of assigning overlapping
uids to their users, and we want to access filesystems created by both
departments at the same time without a chance of confusion and
conflict.

With my sysadmin hat on I would not want to touch two untrusting groups
of users on the same machine, because of the probability that there is
at least one security hole that can be found and exploited to allow
privilege escalation.

With my kernel developer hat on I can't just surrender to the idea that
there will in fact be a privilege escalation bug that is easy to
exploit.  The code has to be built and designed so that privilege
escalation is difficult.  Otherwise we might as well assume that if you
visit a website a stealthy worm has taken over your computer.

It is my hope at the end of the day that user namespaces will be one
more line of defense in messing up and slowing down the evil omniscient
worms that seem to unerringly go for every privilege exploit there is.

Eric

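The uid overlap case in the message (uid 22 meaning sshd in one image
and nobody in another) can be pictured as a per-container translation
from uids inside the namespace to distinct uids outside it.  What
follows is only a rough sketch of that idea, not the kernel's
implementation.  The three-column "inside outside length" format is
borrowed from the /proc/<pid>/uid_map files of later kernels, which
postdate this message; the helper names parse_uid_map and to_outer_uid
and the 100000/200000 ranges are made up for illustration.

```python
from typing import List, Optional, Tuple

# One entry per mapped range:
# (first uid inside the namespace, first uid outside, range length).
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse lines of the form 'inside outside length'."""
    entries: UidMap = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_outer_uid(uid_map: UidMap, inner_uid: int) -> Optional[int]:
    """Translate a uid seen inside the namespace to the uid outside it.

    Returns None when the uid falls in no mapped range.
    """
    for inside, outside, length in uid_map:
        if inside <= inner_uid < inside + length:
            return outside + (inner_uid - inside)
    return None


# Hypothetical mappings for two containers (assumed values).
container_a = parse_uid_map("0 100000 65536\n")
container_b = parse_uid_map("0 200000 65536\n")

print(to_outer_uid(container_a, 0))   # 100000: container root is not global root
print(to_outer_uid(container_a, 22))  # 100022
print(to_outer_uid(container_b, 22))  # 200022: same inner uid, no collision
```

Under these assumed mappings, uid 0 in either container lands on an
unprivileged uid outside, and uid 22 in the two containers lands on two
different outside uids, so files created by both can coexist without
confusion.
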
Copyright © 2011, Eklektix, Inc.