From: ebiederm-AT-xmission.com (Eric W. Biederman)
To: Lennart Poettering <mzxreary-AT-0pointer.de>
Subject: Detecting if you are running in a container
Date: Mon, 10 Oct 2011 13:59:10 -0700
Cc: Matt Helsley <matthltc-AT-us.ibm.com>,
    Kay Sievers <kay.sievers-AT-vrfy.org>,
    linux-kernel-AT-vger.kernel.org, harald-AT-redhat.com, david-AT-fubar.dk,
    greg-AT-kroah.com, Linux Containers <containers-AT-lists.osdl.org>,
    Linux Containers <lxc-devel-AT-lists.sourceforge.net>,
    "Serge E. Hallyn" <serge-AT-hallyn.com>,
    Daniel Lezcano <daniel.lezcano-AT-free.fr>,
    Paul Menage <paul-AT-paulmenage.org>

Cc's and subject updated so hopefully we get the correct people
on this discussion to make progress.
Lennart Poettering <firstname.lastname@example.org> writes:
> To make a standard distribution run nicely in a Linux container you
> usually have to make quite a number of modifications to it and disable
> certain things from the boot process. Ideally however, one could simply
> boot the same image on a real machine and in a container and would just
> do the right thing, fully stateless. And for that you need to be able to
> detect containers, and currently you can't.
I agree getting to the point where we can run a standard distribution
unmodified in a container sounds like a reasonable goal.
> Quite a few kernel subsystems are
> currently not virtualized, for example SELinux, VTs, most of sysfs, most
> of /proc/sys, audit, udev or file systems (by which I mean that for a
> container you probably don't want to fsck the root fs, and so on), and
> containers tend to be much more lightweight than real systems.
That is an interesting viewpoint on what is not complete. But as a
listing of the tasks that distribution startup needs to do differently in
a container the list seems more or less reasonable.
There are two questions:
- How in the general case do we detect if we are running in a container.
- How do we make reasonable tests during bootup to see if it makes sense
to perform certain actions.
For the general detection if we are running in a linux container I can
see two reasonable possibilities.
- Put a file in / that lets you know by convention that you are in a
linux container. I am inclined to do this because this is something
we can support on all kernels old and new.
- Allow modification to the output of uname(2). The uts namespace
already covers uname(2) and uname is the standard method to
communicate to userspace the vagaries of the OS level environment
they are running in.
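To make the two possibilities concrete, here is a rough sketch of what a userspace check might look like. Nothing here is an agreed convention: the marker file name and the token searched for in the uname(2) fields are made up for illustration, since the proposal above names neither.

```python
import os

# Hypothetical marker name; the proposal only says "a file in /".
MARKER_NAME = "container-marker"


def marker_present(root="/", marker_name=MARKER_NAME):
    """Return True if the conventional marker file exists under root."""
    return os.path.exists(os.path.join(root, marker_name))


def uname_hint(token="container", fields=None):
    """Return True if any uname(2) field contains the given token.

    The token is an assumption; no such convention exists today.
    Pass fields explicitly to test without calling uname(2).
    """
    if fields is None:
        fields = tuple(os.uname())
    return any(token in field for field in fields)


def in_container(root="/"):
    """Combine both checks; either one is taken as sufficient."""
    return marker_present(root) or uname_hint()
```

The marker file check has the advantage noted above: it needs no kernel support at all, so it behaves the same on old and new kernels.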
My list of things that still have work left to do looks like:
- cgroups. It is not safe to create new hierarchies with groups
that are in existing hierarchies. So cgroups don't work.
- user namespace. We are very close to having something workable
on this one, but until we do all of the users inside and outside
of a container are the same, and pass the same permission checks.
As a result we have to drop most of root's privileges, and we have
to be a bit careful about which binaries that can gain privileges
(think suid root) are in the container filesystem.
- Reboot. I know Daniel was working on something not long ago
but I am not certain where he wound up.
- device namespaces. We periodically think about having a separate
set of devices and to support things like losetup in a container
that seems necessary. Most of the time getting all of the way
to device namespaces seems unnecessary.
As for tests on what to start up:
- udev. All of the kernel interfaces for udev should be supported in
current kernels. However I believe udev is useless because container
start drops CAP_MKNOD so we can't do evil things. So I would
recommend basing the startup of udev on presence of CAP_MKNOD.
- VTs. Ptys should be well supported at this point. For the rest
they are physical hardware that a container should not be playing with
so I would decide which gettys to start up based on which device nodes
are present in /dev.
- sysctls (aka /proc/sys). That is a tricky one. Until the user namespace
is fleshed out a little more sysctls are going to be a problem,
because root can write to most of them. My gut feel says you probably
want to base the decision to poke at sysctls on CAP_SYS_ADMIN. At least
that test will become true when the user namespaces are rolled out, and at
that point you will want to set all of the sysctls you have permission
to set.
- audit. My memory is very fuzzy on this one. The issue in question is
should we start auditd? I believe the audit calls actually fail in a
container so we should be able to trigger starting auditd on whether audit
works at all. If we can't do it that way certainly the work should be
put in so that it can be done that way.
- fsck. A rw filesystem check like you mentioned earlier seems like a
reasonable place to be. I know the OpenVZ folks were talking about
putting containers in their own block devices for their next round of
supporting containers. At which point a filesystem check on container
startup might not be a bad idea at all.
- cgroups hierarchies. I don't know at which point in the system
startup we care. The appropriate solution would seem to be to try
it and if the operation fails figure it isn't supported.
- selinux. It really should be in the same category. You should be
able to attempt to load a policy and have it fail in a way that
indicates that selinux is not currently supported. I don't know if
we can make that work right until we get the user namespace into
a usable shape.
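As an illustration of the capability and device node tests suggested in the list above, a startup script could do something along these lines. This is only a sketch: it reads the effective capability mask from the CapEff line of /proc/<pid>/status, the capability numbers are the ones from <linux/capability.h>, and the startup_plan() helper and its list of tty names are invented for the example.

```python
import os

# Capability bit numbers from <linux/capability.h>.
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27


def effective_caps(status_text):
    """Parse the CapEff bitmask out of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if capability number cap is set in mask."""
    return bool(mask & (1 << cap))


def startup_plan(status_text, dev_dir="/dev",
                 tty_names=("tty1", "tty2", "console")):
    """Decide what to start from capabilities and present device nodes.

    Hypothetical helper: the keys and the tty list are illustrative only.
    """
    mask = effective_caps(status_text)
    return {
        # udev is only useful if we can create device nodes.
        "start_udev": has_cap(mask, CAP_MKNOD),
        # Only poke at sysctls if we hold CAP_SYS_ADMIN.
        "apply_sysctls": has_cap(mask, CAP_SYS_ADMIN),
        # Start a getty only where the device node actually exists.
        "gettys": [name for name in tty_names
                   if os.path.exists(os.path.join(dev_dir, name))],
    }
```

On a live system the input would be the contents of /proc/self/status, read by the process that makes the startup decisions.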
In general things in a container should work or the kernel feature
should fail in a way that indicates that the feature is not supported.
That currently works well for the networking stack, and with the
pending usability of the user namespace it should work just about
everywhere else as well. For things that don't fit that model we
need to fix the kernel.
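The "try it and see how it fails" model could look roughly like the following in a startup script. This is a sketch under assumptions: the probe() helper is invented for illustration, and the set of errno values treated as "not supported here" is a guess at what a reasonable policy might be, not something the kernel guarantees for every interface.

```python
import errno

# Assumption: these errors mean "feature not available in this environment".
UNSUPPORTED = {
    errno.ENOSYS,
    errno.EPERM,
    errno.EACCES,
    errno.EOPNOTSUPP,
    errno.ENOENT,
    errno.EROFS,
}


def probe(operation):
    """Attempt operation() and classify the outcome.

    Returns "ok" if it succeeds and "unsupported" if it fails with one
    of the errno values above. Any other OSError is re-raised, since it
    points at a real problem rather than a missing feature.
    """
    try:
        operation()
    except OSError as exc:
        if exc.errno in UNSUPPORTED:
            return "unsupported"
        raise
    return "ok"
```

A caller would wrap the actual attempt (loading a policy, mounting a cgroup hierarchy, writing a sysctl) in a function, pass it to probe(), and skip the corresponding service when the result is "unsupported".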
So while I agree a check to see if something is a container seems
reasonable, I do not agree that the pid namespace is the place to put
that information. I see no natural way to put that information in the
pid namespace.
I further think there are a lot of reasonable checks for whether a
kernel feature is supported in the current environment that I would
rather pursue over hacks based on the fact that we are in a container.