execns()
[Posted July 11, 2006 by corbet]
The developers behind a whole range of virtualization and containerization
projects are continuing to work on ways to get the isolation features they
need into the mainline kernel. Much of that work is centered around the
elimination of global namespaces and additions to the unshare()
system call so that interested processes can retreat into their own,
private namespaces. For example, on mainline Linux systems today, the
process ID namespace is global - a given process ID identifies the same
process for every other process on the system. The container developers
would like to move away from a global PID namespace so that containers can
present their own process IDs to the processes trapped inside. Many other
kernel namespaces are receiving the same sort of treatment.
Cedric Le Goater has posted a
patch set which takes this work forward in an interesting way by
de-globalizing another namespace and adding a different interface for
creating new namespaces.
The new namespace type added by the patch is the "user" namespace - the
system's view of user ID values. For the most part, the kernel just uses
user IDs for the enforcement of permissions; it does not really care if one
set of processes interprets user ID values differently than another. So,
if processes within one container cannot see resources
(processes, SYSV IPC, filesystems) belonging to another container, there is
little opportunity for processes to interfere with each other, even if they
are running with the same numeric user ID value. That user ID can map to
two entirely different accounts in the different containers, and the
isolation provided by those containers will keep them separate.
The one little exception is the user_struct structure maintained
in kernel/user.c. This structure exists to allow the kernel to
enforce per-user resource limits; to that end, one is allocated for each
user ID currently active on the system. The function responsible for
looking up one of these structures (find_user()) implements a
global user ID namespace, so processes sharing a user ID number in
different containers will affect each other's resource limits.
Cedric's patch fixes this problem by creating a new namespace type for user
IDs, allowing resource limits to be isolated within containers. The
implementation of this namespace is simple, but allowing processes to move
into a new user namespace with unshare(), as it turns out, is
not. When a process gets around to calling unshare(), it may have
a long list of resources which are reflected in the user_struct
structure. Disconnecting from the old structure will require the system to
somehow disassociate the process's current resource usage from that
structure and add it to the new one instead. This process is detailed
and error-prone; even if it works once, keeping it maintained and
functional into the future could be a challenge.
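The difference between a global and a per-namespace lookup, and the bookkeeping an unshare() would have to do, can be illustrated with a small user-space model. This is a toy sketch, not kernel code: the names toy_user, toy_user_ns, toy_find_user(), toy_charge_process() and toy_move_charges() are all invented for illustration, and the model tracks a single counter where the real user_struct tracks several.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_USERS 16

/* Hypothetical stand-in for user_struct: one entry per active user ID. */
struct toy_user {
    unsigned int uid;
    int in_use;
    int processes;              /* the one resource counted in this model */
};

/* One table per namespace. Using a single shared table for every
 * container models the global lookup described above. */
struct toy_user_ns {
    struct toy_user users[TOY_MAX_USERS];
};

/* Find the entry for uid in ns, creating it if needed (NULL if full). */
struct toy_user *toy_find_user(struct toy_user_ns *ns, unsigned int uid)
{
    struct toy_user *free_slot = NULL;

    assert(ns != NULL);
    for (size_t i = 0; i < TOY_MAX_USERS; i++) {
        struct toy_user *u = &ns->users[i];
        if (u->in_use && u->uid == uid)
            return u;
        if (!u->in_use && free_slot == NULL)
            free_slot = u;
    }
    if (free_slot != NULL) {
        free_slot->uid = uid;
        free_slot->in_use = 1;
        free_slot->processes = 0;
    }
    return free_slot;
}

/* Charge one process to uid; returns the new count, or -1 on failure. */
int toy_charge_process(struct toy_user_ns *ns, unsigned int uid)
{
    struct toy_user *u = toy_find_user(ns, uid);
    return u ? ++u->processes : -1;
}

/* The step an unshare() would need: move n existing charges for uid
 * from the old namespace's entry to the new one. Returns 0 or -1. */
int toy_move_charges(struct toy_user_ns *from, struct toy_user_ns *to,
                     unsigned int uid, int n)
{
    struct toy_user *src = toy_find_user(from, uid);
    struct toy_user *dst = toy_find_user(to, uid);

    if (src == NULL || dst == NULL || n < 0 || src->processes < n)
        return -1;
    src->processes -= n;
    dst->processes += n;
    return 0;
}
```

With one shared table, two containers charging user ID 1000 increment the same counter; with one table per container, each count starts from zero. The toy_move_charges() step is trivial here only because the model has a single counter and no concurrency; a real implementation would have to do this for every accounted resource while the process keeps running.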
The same challenge applies to SYSV IPC namespaces. A process which holds
references to a SYSV semaphore, for example, must have those references
taken away, any undo information handled properly, and so on.
Rather than try to fix up unshare() to handle all of these issues,
Cedric has taken a different approach: only allow a process to disconnect
from namespaces when all of its references to those namespaces are being
shut down anyway. That time is when the process calls a form of
exec() to run a new program. So Cedric has created a new form of
the execve() call:
int execns(int unshare_flags, char *filename, char **argv, char **envp);
This call will function like execve(), in that it will cause the
process to run the program found in filename with the given
arguments and environment. The new unshare_flags argument,
however, allows the caller to specify a set of namespaces to be unshared at
the same time. As a result, the new program starts fresh with its new
namespaces and no dangling references into the older ones. To help ensure
that things happen this way, execns() closes all open
files, regardless of whether they are marked "close on exec."
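The caller-visible behavior can be approximated in user space with existing calls. The sketch below is an assumption-laden illustration, not the patch's implementation: execns_sketch() and run_with_execns_sketch() are invented names, the unsharing happens before the exec rather than inside it (which is exactly the ordering the patch is trying to avoid within the kernel), and only flags already accepted by the running kernel's unshare() will work. The descriptor cap SKETCH_MAX_FD is also an arbitrary simplification.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Arbitrary upper bound on descriptors to close (an assumption). */
#define SKETCH_MAX_FD 65536

/* Rough user-space approximation of the proposed call: unshare the
 * requested namespaces, close every open descriptor (including stdio,
 * regardless of close-on-exec), then run the new program.
 * Returns -1 on failure; does not return on success. */
int execns_sketch(int unshare_flags, char *filename, char **argv, char **envp)
{
    assert(filename != NULL);

    if (unshare_flags != 0 && unshare(unshare_flags) == -1)
        return -1;

    long limit = sysconf(_SC_OPEN_MAX);
    if (limit < 0 || limit > SKETCH_MAX_FD)
        limit = SKETCH_MAX_FD;
    for (int fd = 0; fd < (int)limit; fd++)
        close(fd);              /* errors on unused descriptors are ignored */

    execve(filename, argv, envp);
    return -1;                  /* reached only if execve() failed */
}

/* Usage helper: run the program in a child, as a container launcher
 * might, and return its exit status (or -1 on failure). */
int run_with_execns_sketch(int unshare_flags, char *filename,
                           char **argv, char **envp)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execns_sketch(unshare_flags, filename, argv, envp);
        _exit(127);             /* unshare or exec failed in the child */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A launcher would pass the namespace flags it wants in unshare_flags; passing 0 makes the helper behave like a plain execve() that first closes every descriptor. Because standard input, output and error are closed as well, the new program has to open whatever files it needs itself.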
Moving namespace creation into exec() would seem to make some
sense. The creation of namespaces is a rare act, done as part of the
establishment of a new container; it's not something that running processes
just occasionally decide to do. The execns() call will allow a
container's init-like process to start with a clean slate while,
with luck, simplifying the unsharing logic within the kernel.