Divorcing namespaces from processes
To that end, Eric Biederman has proposed the creation of a pair of new system calls. The first is the rather tersely named nsfd():
int nsfd(pid_t pid, unsigned long nstype);
This system call finds the namespace of the given nstype that is in effect for the process identified by pid. The return value is a file descriptor that identifies that namespace and holds a reference to it. The calling process must be able to use ptrace() on pid for the call to succeed; in the current patch, only network namespaces are supported.
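Since nsfd() exists only as a proposal, there is no wrapper to call it through. As a rough analogue, the following sketch assumes a kernel that exposes each process's namespaces as files under /proc/<pid>/ns/, which is not part of the patch described here. The helper names ns_path() and open_ns_fd() are hypothetical, chosen only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: build "/proc/<pid>/ns/<name>" in buf.
 * Returns 0 on success, or -1 if name contains a slash or the
 * resulting path does not fit in len bytes. */
int ns_path(pid_t pid, const char *name, char *buf, size_t len)
{
    assert(name != NULL && buf != NULL);
    if (strchr(name, '/') != NULL)
        return -1;
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical stand-in for nsfd(): open a descriptor that refers to
 * the named namespace (for example "net") of process pid.
 * Assumes the kernel provides /proc/<pid>/ns/ entries.
 * Returns the descriptor, or -1 with errno set. */
int open_ns_fd(pid_t pid, const char *name)
{
    char path[64];
    if (ns_path(pid, name, path, sizeof(path)) != 0) {
        errno = EINVAL;
        return -1;
    }
    return open(path, O_RDONLY | O_CLOEXEC);
}
```

A caller would keep the returned descriptor open for as long as the namespace needs to stay alive, and close() it afterward.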
Simply holding the file descriptor open will cause the target namespace to continue to exist, even if all processes within it exit. The namespace can be made more visible by bind-mounting the descriptor onto a path in the filesystem with a command like:
mount --bind /proc/self/fd/N /somewhere
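The same bind mount can be made from within the program that holds the descriptor. The sketch below uses mount(2) with MS_BIND; the helper names fd_proc_path() and pin_namespace() are hypothetical, and the call needs the privilege to mount, so it is an outline of the idea rather than a tested recipe for namespace descriptors.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical helper: build "/proc/self/fd/<fd>" in buf.
 * Returns the path length, or -1 if it does not fit in len bytes. */
int fd_proc_path(int fd, char *buf, size_t len)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "/proc/self/fd/%d", fd);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: bind-mount the object behind nsfd onto target,
 * the equivalent of "mount --bind /proc/self/fd/N target".
 * The filesystem type and data arguments are ignored for MS_BIND.
 * Returns 0 on success, or -1 with errno set by mount(). */
int pin_namespace(int nsfd, const char *target)
{
    char src[32];
    if (fd_proc_path(nsfd, src, sizeof(src)) < 0)
        return -1;
    return mount(src, target, NULL, MS_BIND, NULL);
}
```

Once the mount is in place, the namespace stays reachable through the target path even after the descriptor is closed.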
The other piece of the puzzle is setns():
int setns(unsigned long nstype, int fd);
This system call will make the namespace indicated by fd into the current namespace for the calling process. That provides a way to enter another container's namespace without the somewhat strange semantics of the once-proposed hijack() system call.
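For a sense of how a caller might use this, here is a minimal sketch written against the setns() wrapper found in current C libraries. That wrapper takes its arguments as (fd, nstype), the reverse of the prototype proposed above, so the sketch reflects an assumption about the reader's system rather than this patch. The function name enter_netns() is hypothetical, and the call normally requires CAP_SYS_ADMIN.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: switch the caller into the network namespace
 * that fd refers to. Note the (fd, nstype) argument order of the libc
 * wrapper, which differs from the proposed prototype.
 * Returns 0 on success, or -1 after reporting the error. */
int enter_netns(int fd)
{
    if (setns(fd, CLONE_NEWNET) != 0) {
        perror("setns");
        return -1;
    }
    return 0;
}
```

Passing CLONE_NEWNET asks the kernel to verify that the descriptor really refers to a network namespace before switching.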
These new system calls are in an early, proof-of-concept stage, so they are likely to evolve considerably between now and the targeted 2.6.35 merge.
