LWN.net

Divorcing namespaces from processes

By Jonathan Corbet
March 3, 2010

For the last few years, the development community interested in implementing containers has been working to add a variety of namespaces to the kernel. Each namespace wraps around a specific global kernel resource (such as the network environment, the list of running processes, or the filesystem tree), allowing different containers to have different views of that resource. Namespaces are tightly tied to process trees; they are created with new processes through the use of special flags to the clone() system call. Once created, a namespace is only visible to the newly created process and any children thereof, and it only lives as long as those processes do. That works for many situations, but there are others where it would be nice to have longer-lived namespaces which are more readily accessible.
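
As a rough illustration of that process-bound model, the following sketch creates a child in a fresh network namespace by passing CLONE_NEWNET to glibc's clone() wrapper. The names run_in_new_netns() and child_main() are hypothetical helpers invented for this example, and the call normally needs CAP_SYS_ADMIN, so an unprivileged run is expected to fail with errno set.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Hypothetical placeholder: this body runs inside the new namespace. */
static int child_main(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Start a child in a new network namespace and wait for it.
 * Returns the child's exit status, or -1 with errno set on failure.
 */
int run_in_new_netns(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    int status, saved_errno;
    pid_t pid;

    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    pid = clone(child_main, stack + CHILD_STACK_SIZE,
                CLONE_NEWNET | SIGCHLD, NULL);
    if (pid == -1) {
        saved_errno = errno;
        free(stack);
        errno = saved_errno;
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1) {
        saved_errno = errno;
        free(stack);
        errno = saved_errno;
        return -1;
    }
    free(stack);

    if (!WIFEXITED(status)) {
        errno = ECHILD;
        return -1;
    }
    return WEXITSTATUS(status);
}
```

Once child_main() returns and the child exits, nothing refers to the new namespace any longer and it goes away, which is the limitation described above.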

To that end, Eric Biederman has proposed the creation of a pair of new system calls. The first is the rather tersely named nsfd():

    int nsfd(pid_t pid, unsigned long nstype);

This system call will find the namespace of the given nstype which is in effect for the process identified by pid; the return value will be a file descriptor which identifies, and holds a reference to, that namespace. The calling process must be able to use ptrace() on pid for the call to succeed. In the current patch, only network namespaces are supported.

Simply holding the file descriptor open will cause the target namespace to continue to exist, even if all processes within it exit. The namespace can be made more visible by creating a bind mount on top of it with a command like:

    mount --bind /proc/self/fd/N /somewhere
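
The same bind mount can be requested from a program with mount(2) and the MS_BIND flag. This is a minimal sketch: pin_namespace_fd() is a hypothetical helper name, the target is assumed to be an existing regular file, and the caller needs the privilege to mount.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Bind-mount the object behind an open descriptor onto `target`,
 * mirroring "mount --bind /proc/self/fd/N /somewhere".
 * Returns 0 on success, -1 with errno set on failure.
 */
int pin_namespace_fd(int fd, const char *target)
{
    char source[64];

    if (fd < 0 || target == NULL) {
        errno = EINVAL;
        return -1;
    }

    snprintf(source, sizeof(source), "/proc/self/fd/%d", fd);

    /* For a bind mount the filesystem type and data arguments are ignored. */
    return mount(source, target, NULL, MS_BIND, NULL);
}
```

After a successful call the descriptor itself could be closed, since the mount then holds the namespace in place of the open file.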

The other piece of the puzzle is setns():

    int setns(unsigned long nstype, int fd);

This system call will make the namespace indicated by fd into the current namespace for the calling process. This solves the problem of being able to enter another container's namespace without the somewhat strange semantics of the once-proposed hijack() system call.
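
For a concrete picture of entering a namespace, here is a hedged sketch built on the setns() that later mainline kernels and glibc provide. Note that the merged call takes its arguments in the opposite order from the prototype above: setns(int fd, int nstype). The helper enter_netns() is a hypothetical name, and the call needs CAP_SYS_ADMIN to succeed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Join the network namespace named by `ns_path`, for example a
 * bind-mounted namespace file. Uses the mainline setns(fd, nstype),
 * not the argument order shown in the proposal.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enter_netns(const char *ns_path)
{
    int fd, rc, saved_errno;

    fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* CLONE_NEWNET makes the call fail unless fd is a network namespace. */
    rc = setns(fd, CLONE_NEWNET);

    /* The descriptor is no longer needed once the switch was attempted. */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}
```

With a namespace bind-mounted as in the earlier command, a call such as enter_netns("/somewhere") would be the corresponding way in.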

These new system calls are in an early, proof-of-concept stage, so they are likely to evolve considerably between now and the targeted 2.6.35 merge.



Divorcing namespaces from processes

Posted Mar 4, 2010 8:58 UTC (Thu) by ebiederm (subscriber, #35028)

I would like to point out the special (and, I envision, common) case of calling nsfd() with a pid of 0. In that case it returns a file descriptor referring to a namespace of the current process, and no privileges are required.

Divorcing namespaces from processes

Posted Mar 4, 2010 14:24 UTC (Thu) by nix (subscriber, #2304)

Nice! Everything is represented by an fd again :)

(one teeny problem: I see a lot of possibilities for typos in the name 'nsfd', simply because it's one transposition from 'nfsd'... ns_fd() would fix this.)

Divorcing namespaces from processes

Posted Mar 6, 2010 18:10 UTC (Sat) by bfields (subscriber, #19510)

Yeah, I mistype "nfsd" as "nsfd" about 10 times a day already....

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds