A new approach to user namespaces
There is another cost to containers that is not necessarily visible to users: their implementation in the kernel is, in many ways, far more complex than that of virtualization. In a system supporting containers, any globally-visible resource must be wrapped in a namespace layer that provides each container with its own view. There are many such resources on a Linux system: process IDs, filesystems, and network interfaces, for example. Even the system name and time can differ from one container to the next. As a result, despite years of effort, Linux still lacks proper container support, while virtualization has been a solid feature for a long time.
Work continues on the Linux containers implementation; the latest piece is a new set of user namespace patches from Eric Biederman. The "user namespace" can be thought of as the encapsulation of user/group IDs and associated privilege; it allows the owner of a container to run as the root user within that container while isolating the rest of the system from the in-container users. The implementation of a proper user namespace has always been a hard problem for a number of reasons. Kernel code has been written with the understanding that a given process has one, globally-recognized user ID; what happens when processes can have multiple user IDs in different contexts? It is not hard to imagine developers making mistakes with user IDs - mistakes leading to serious security holes in the host system. Fears of this kind of mistake, along with the general difficulty of the problem, have kept a full user namespace out of the kernel so far.
Eric's patch takes a bit of a different approach to the problem in the hope of making user namespaces invisible to most kernel developers while catching errors when they are made. To that end, the patch set creates a new type to represent "kernel UIDs" - kuid_t; there is also a kgid_t type. The kernel UID is meant to describe a process's identity on the base system, regardless of any user IDs it may adopt within a container; it is the value used for most privilege checks. A process's kernel IDs may (or may not) be equal to the IDs it sees in user space; there is no way for that process to know. The kernel IDs exist solely to identify a process's credentials within the kernel itself.
In the patch, kuid_t is a typedef for a single-field structure; its main reason for existence is to be type-incompatible with the simple integer user and group IDs used throughout the kernel. Container-specific IDs, instead, retain the old integer (uid_t and gid_t) types. As a result, any attempt to mix kernel IDs with per-container IDs should yield an error from the compiler. That should eliminate a whole class of potential errors from kernel code that deals with user and group ID values.
The kuid_t type, being an opaque cookie to in-kernel users, needs a set of helper functions. Comparisons can be done with functions like uid_eq(), for example; similar functions exist for most arithmetic tests. For many purposes, that's all that is needed. Regardless of its namespace, a process's credentials are stored in the global kuid_t space, so most ID tests just do the right thing.
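The effect of the wrapper type and its comparison helpers can be sketched in ordinary user-space C. What follows is an illustration only, not code from the patch set: the field name val, the helper bodies, the uid_gt() and uid_lt() names, and the caller_owns() function are all assumptions made for the example.

```c
#include <assert.h>      /* static_assert */
#include <stdbool.h>
#include <sys/types.h>   /* uid_t */

/* Single-field wrapper: the compiler treats it as unrelated to uid_t. */
typedef struct { uid_t val; } kuid_t;

/* The wrapper should cost nothing in storage on common ABIs. */
static_assert(sizeof(kuid_t) == sizeof(uid_t), "kuid_t adds no storage");

/* Comparison helpers in the spirit of uid_eq(); the bodies are assumed. */
static inline bool uid_eq(kuid_t a, kuid_t b) { return a.val == b.val; }
static inline bool uid_gt(kuid_t a, kuid_t b) { return a.val > b.val; }
static inline bool uid_lt(kuid_t a, kuid_t b) { return a.val < b.val; }

/* Hypothetical ownership check that stays entirely in the kernel ID space. */
static inline bool caller_owns(kuid_t caller, kuid_t owner)
{
    return uid_eq(caller, owner);
    /* return caller == owner;       rejected: structs cannot be compared with == */
    /* return uid_eq(caller, 1000);  rejected: a plain integer is not a kuid_t   */
}
```

The two commented-out lines are the kind of mistake the type is meant to catch: each one fails to compile rather than silently comparing IDs from different spaces.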
There are times, though, where it is necessary to convert between a kernel ID and a user or group ID as seen in a specific namespace. Perhaps the simplest example is a system call like getuid(); it should return the namespace-specific ID, not the kernel ID. There is a set of functions provided to perform these conversions:
kuid_t make_kuid(struct user_namespace *from, uid_t uid);
uid_t from_kuid(struct user_namespace *to, kuid_t kuid);
uid_t from_kuid_munged(struct user_namespace *to, kuid_t kuid);
bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid);
(A similar set of functions exists for group IDs, of course.) Conversion between a kernel ID and the namespace-specific equivalent requires the use of a mapping stored within the namespace, so the namespace of interest must be supplied when calling these functions. For code that is executing in process context, a call to current_user_ns() is sufficient to get that namespace pointer. make_kuid() will turn a namespace-specific UID into a kernel ID, while from_kuid() maps a kernel ID into a specific namespace. If no mapping exists for the given kernel ID, from_kuid() will return -1; for cases where a valid ID must be returned, a call to from_kuid_munged() will return a special "unknown user" value instead. To simply test whether it is possible to map a specific kernel ID into a namespace, kuid_has_mapping() is available.
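The behavior of these four functions can be modeled in user space with a small table of ID ranges. This is a sketch under stated assumptions rather than the patch's implementation: every demo_ name, the extent structure, the four-entry limit, and the choice of 65534 as the "unknown user" value are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct { uid_t val; } kuid_t;

#define DEMO_INVALID_UID ((uid_t)-1)     /* "no mapping" result, -1 as in the text */
#define DEMO_UNKNOWN_UID ((uid_t)65534)  /* assumed "unknown user" value */
#define DEMO_MAX_EXTENTS 4               /* arbitrary limit for the sketch */

/* Namespace IDs [first, first+count) map to kernel IDs starting at lower_first. */
struct demo_extent { uid_t first; uid_t lower_first; uid_t count; };
struct demo_user_ns { size_t nr; struct demo_extent ext[DEMO_MAX_EXTENTS]; };

/* Namespace-specific UID -> kernel ID. */
static inline kuid_t demo_make_kuid(const struct demo_user_ns *from, uid_t uid)
{
    assert(from->nr <= DEMO_MAX_EXTENTS);
    for (size_t i = 0; i < from->nr; i++) {
        const struct demo_extent *e = &from->ext[i];
        if (uid >= e->first && uid - e->first < e->count)
            return (kuid_t){ e->lower_first + (uid - e->first) };
    }
    return (kuid_t){ DEMO_INVALID_UID };
}

/* Kernel ID -> namespace-specific UID, or -1 when no mapping exists. */
static inline uid_t demo_from_kuid(const struct demo_user_ns *to, kuid_t kuid)
{
    assert(to->nr <= DEMO_MAX_EXTENTS);
    for (size_t i = 0; i < to->nr; i++) {
        const struct demo_extent *e = &to->ext[i];
        if (kuid.val >= e->lower_first && kuid.val - e->lower_first < e->count)
            return e->first + (kuid.val - e->lower_first);
    }
    return DEMO_INVALID_UID;
}

/* Like demo_from_kuid(), but always returns a usable ID. */
static inline uid_t demo_from_kuid_munged(const struct demo_user_ns *to, kuid_t kuid)
{
    uid_t uid = demo_from_kuid(to, kuid);
    return uid == DEMO_INVALID_UID ? DEMO_UNKNOWN_UID : uid;
}

/* True when the kernel ID is visible in the given namespace. */
static inline bool demo_kuid_has_mapping(const struct demo_user_ns *ns, kuid_t kuid)
{
    return demo_from_kuid(ns, kuid) != DEMO_INVALID_UID;
}
```

With a namespace whose single extent maps IDs 0 through 999 onto kernel IDs starting at 100000, demo_make_kuid() turns UID 0 into kernel ID 100000, and demo_from_kuid() reverses the conversion; a kernel ID outside that range has no mapping in the namespace.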
Actually setting up the mapping is a task that must be performed by the administrator, who will likely set aside a range of IDs for use within a specific container. The patch series adds a couple of files under /proc/pid/ for this purpose; setting up the mapping is just a matter of writing one or more tuples of three numbers to /proc/pid/uid_map:
first-ns-id first-target-id count
The first-ns-id is the first valid user ID in the given process's namespace. It is likely to be zero - the root ID is valid and harmless within user namespaces. That first ID will be mapped to first-target-id as it is defined in the parent namespace, and count is the number of consecutive IDs that exist in the mapping. If multiple levels of namespaces are involved, there may be multiple layers of mapping, but those layers are flattened by the mapping code, which only remembers the mapping directly to and from kernel IDs.
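As an illustration of the tuple format, an administrative tool might build and write a mapping line along the following lines. The helper names format_map_line() and write_uid_map() are hypothetical, as are the example values; only the three-number format and the uid_map file come from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* pid_t */

/* Build one "first-ns-id first-target-id count" line; 0 on success, -1 if it does not fit. */
static inline int format_map_line(char *buf, size_t size, unsigned first_ns_id,
                                  unsigned first_target_id, unsigned count)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "%u %u %u\n", first_ns_id, first_target_id, count);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write a single tuple to /proc/<pid>/uid_map; 0 on success, -1 on any error. */
static inline int write_uid_map(pid_t pid, unsigned first_ns_id,
                                unsigned first_target_id, unsigned count)
{
    char path[64], line[64];

    if (format_map_line(line, sizeof(line), first_ns_id, first_target_id, count) < 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/%ld/uid_map", (long)pid);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;   /* no such process, or insufficient privilege */
    int ok = fwrite(line, 1, strlen(line), f) == strlen(line);
    if (fclose(f) != 0)   /* the buffered data reaches the kernel here */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call like write_uid_map(pid, 0, 100000, 65536) would, under these assumptions, ask that IDs 0 through 65535 inside the namespace be mapped onto IDs 100000 through 165535 in the parent.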
Establishing mappings clearly needs to be a privileged operation, so the process performing this operation must have the CAP_SETUID capability available. A process running as root within its own namespace may well have that capability, even though it has no access outside of its namespace. So processes in a user namespace can set up their own sub-namespaces with arbitrary mappings - but those mappings can only access user and group IDs that exist in the parent namespace. As an additional restriction, the namespace ID mapping can only be set once; once it has been established for a given namespace, it is immutable.
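Two of these rules, that a mapping may only reach IDs existing in the parent and that it can be set only once, can be captured in a toy model. The structure, the demo_set_map() function, and the specific error codes returned are assumptions made for this sketch; the capability check is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* A contiguous range of IDs that exist in the parent namespace. */
struct demo_range { unsigned first; unsigned count; };

struct demo_ns {
    bool map_set;                  /* becomes true after the first successful write */
    struct demo_range parent_ids;  /* what the parent namespace can offer */
    unsigned first_ns_id, first_target_id, count;
};

/* Install a mapping; 0 on success, a negative errno-style value otherwise. */
static inline int demo_set_map(struct demo_ns *ns, unsigned first_ns_id,
                               unsigned first_target_id, unsigned count)
{
    assert(ns != NULL);
    if (ns->map_set)
        return -EPERM;             /* the mapping is write-once */
    if (count == 0)
        return -EINVAL;

    /* The whole target range must fall inside the parent's range. */
    if (first_target_id < ns->parent_ids.first)
        return -EPERM;
    unsigned offset = first_target_id - ns->parent_ids.first;
    if (offset >= ns->parent_ids.count || count > ns->parent_ids.count - offset)
        return -EPERM;

    ns->first_ns_id = first_ns_id;
    ns->first_target_id = first_target_id;
    ns->count = count;
    ns->map_set = true;
    return 0;
}
```

In this model, a second call to demo_set_map() on the same namespace fails no matter what arguments it is given, mirroring the immutability described above.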
These mapping functions find their uses in a number of places in the core kernel. Any system call that deals in user or group ID numbers must include the conversions to and from the kernel ID space. The ext* family of filesystems allows the specification of a UID that can use reserved blocks, so the filesystem code must make sure it's working with kernel IDs when it does its comparisons. So the patch is, like much of the namespace work, mildly intrusive, but Eric has clearly worked to minimize its impact. In particular, he has taken care to ensure that the runtime overhead of the ID-mapping code is zero if user namespaces are not configured into the kernel, and as close as possible to zero when the feature is used.
The user namespace feature, he says, now has a number of benefits to offer:
It allows for applications with no capabilities to use multiple uids and to implement privilege separation. I certainly see user namespaces like this as having the potential to make linux systems more secure.
As of this writing, there have been few comments from reviewers, so it is not yet clear whether other developers agree with this assessment or not. The nature of the namespace work means that it can be difficult to get it accepted into the mainline, where developers tend to be concerned about the overhead and relatively uninterested in the benefits. But, with years of persistence, much of this work has gotten in anyway; chances are that user namespaces, in some form, will eventually find their way in as well.
