By Michael Kerrisk
January 8, 2013
A namespace wraps a global system resource in an abstraction that makes
it appear to the processes within the namespace that they have their own
isolated instance of the resource. Namespaces are used for a
variety of purposes, with the most notable being the implementation of containers, a technique for lightweight
virtualization. This is the second part in a series of articles that looks
in some detail at namespaces and the namespaces API. The
first article in this series provided an
overview of namespaces. This article looks at
the namespaces API in some detail and shows the API in
action in a number of example programs.
The namespace API consists of three system
calls—clone(), unshare(), and
setns()—and a number of /proc files. In this
article, we'll look at all of these system calls and some of the
/proc files. In order to specify a namespace type on which to
operate, the three system calls make use of the CLONE_NEW*
constants listed in the previous article:
CLONE_NEWIPC,
CLONE_NEWNS,
CLONE_NEWNET,
CLONE_NEWPID,
CLONE_NEWUSER, and
CLONE_NEWUTS.
Creating a child in a new namespace: clone()
One way of creating a namespace is via the use of clone(),
a system call that creates a new process. For our purposes,
clone() has the following prototype:
int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);
Essentially, clone() is a more general version of the
traditional UNIX fork() system call whose functionality can be
controlled via the flags argument. In all, there are more than
twenty different CLONE_* flags that control various aspects of the
operation of clone(), including whether the parent and child
process share resources such as virtual memory, open file descriptors, and
signal dispositions. If one of the CLONE_NEW* bits is specified
in the call, then a new namespace of the corresponding type is created, and
the new process is made a member of that namespace; multiple
CLONE_NEW* bits can be specified in flags.
Our example program (demo_uts_namespace.c) uses
clone() with the CLONE_NEWUTS flag to create a UTS
namespace. As we saw last week, UTS namespaces isolate two system
identifiers—the hostname and the NIS domain name—that are set
using the sethostname() and setdomainname() system calls
and returned by the uname() system call. You can find the full
source of the program here. Below, we'll
focus on just some of the key pieces of the program (and for brevity, we'll
omit the error checking code that is present in the full version of the
program).
The example program takes one command-line argument. When run, it
creates a child that executes in a new UTS namespace. Inside that
namespace, the child changes the hostname to the string given as the
program's command-line argument.
The first significant piece of the main program is the clone()
call that creates the child process:
child_pid = clone(childFunc,
child_stack + STACK_SIZE, /* Points to start of
downwardly growing stack */
CLONE_NEWUTS | SIGCHLD, argv[1]);
printf("PID of child created by clone() is %ld\n", (long) child_pid);
The new child will begin execution in the user-defined function
childFunc(); that function will receive the final clone()
argument (argv[1]) as its argument. Since CLONE_NEWUTS
is specified as part of the flags argument, the child will execute
in a newly created UTS namespace.
The main program then sleeps for a moment. This is a (crude) way of
giving the child time to change the hostname in its UTS namespace. The
program then uses uname() to retrieve the host name in the
parent's UTS namespace, and displays that hostname:
sleep(1); /* Give child time to change its hostname */
uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);
Meanwhile, the childFunc() function executed by the child
created by clone() first changes the hostname to the value
supplied in its argument, and then retrieves and displays the modified
hostname:
sethostname(arg, strlen(arg);
uname(&uts);
printf("uts.nodename in child: %s\n", uts.nodename);
Before terminating, the child sleeps for a while. This has the effect
of keeping the child's UTS namespace open, and gives us a chance to conduct
some of the experiments that we show later.
Running the program demonstrates that the parent and child processes
have independent UTS namespaces:
$ su # Need privilege to create a UTS namespace
Password:
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child: bizarro
uts.nodename in parent: antero
As with most other namespaces (user namespaces are the exception),
creating a UTS namespace requires privilege (specifically,
CAP_SYS_ADMIN). This is necessary to avoid scenarios where
set-user-ID applications could be fooled into doing the wrong thing because the
system has an unexpected hostname.
Another possibility is that a set-user-ID application might be using the
hostname as part of the name of a lock file. If an unprivileged user could
run the application in a UTS namespace with an arbitrary hostname, this
would open the application to various attacks. Most simply, this would
nullify the effect of the lock file, triggering misbehavior in instances of
the application that run in different UTS namespaces. Alternatively, a
malicious user could run a set-user-ID application in a UTS namespace with
a hostname that causes creation of the lock file to overwrite an important
file. (Hostname strings can contain arbitrary characters, including
slashes.)
The /proc/PID/ns files
Each process has a /proc/PID/ns directory that
contains one file for each type of namespace. Starting in Linux 3.8, each
of these files is a special symbolic link that provides a kind of handle
for performing certain operations on the associated namespace for the
process.
$ ls -l /proc/$$/ns # $$ is replaced by shell's PID
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan 8 04:12 uts -> uts:[4026531838]
One use of these symbolic links is to discover whether two processes
are in the same namespace. The kernel does some magic to ensure that if two
processes are in the same namespace, then the inode numbers reported for
the corresponding symbolic links in /proc/PID/ns
will be the same. The inode numbers can be obtained using the stat()
system call (in the st_ino field of the returned structure).
However, the kernel also constructs each of the
/proc/PID/ns symbolic links so that it points to a
name consisting of a string that identifies the namespace type, followed by
the inode number. We can examine this name using either the
ls -l or the readlink command.
Let's return to the shell session above where we ran the
demo_uts_namespaces program. Looking at the
/proc/PID/ns symbolic links for the
parent and child process provides an alternative method of checking whether
the two processes are in the same or different UTS namespaces:
^Z # Stop parent and child
[1]+ Stopped ./demo_uts_namespaces bizarro
# jobs -l # Show PID of parent process
[1]+ 27513 Stopped ./demo_uts_namespaces bizarro
# readlink /proc/27513/ns/uts # Show parent UTS namespace
uts:[4026531838]
# readlink /proc/27514/ns/uts # Show child UTS namespace
uts:[4026532338]
As can be seen, the content of the
/proc/PID/ns/uts symbolic links differs,
indicating that the two processes are in different UTS namespaces.
The /proc/PID/ns symbolic links also serve
other purposes. If we open one of these files, then the namespace will
continue to exist as long as the file descriptor remains open, even if all
processes in the namespace terminate. The same effect can also be obtained
by bind mounting one of the symbolic links to another location in the file
system:
# touch ~/uts # Create mount point
# mount --bind /proc/27514/ns/uts ~/uts
Before Linux 3.8, the files in /proc/PID/ns
were hard links rather than special symbolic links of the form described
above. In addition, only the ipc, net, and uts
files were present.
Joining an existing namespace: setns()
Keeping a namespace open when it contains no processes is of course
only useful if we intend to later add processes to it. That is the task of
the setns()
system call, which allows the calling process to join an existing
namespace:
int setns(int fd, int nstype);
More precisely, setns() disassociates the calling process from
one instance of a particular namespace type and reassociates the process
with another instance of the same namespace type.
The
fd argument specifies the namespace to join; it is a file
descriptor that refers to one of the symbolic links in a
/proc/PID/ns directory. That file descriptor can
be obtained either by opening one of those symbolic links directly or by
opening a file that was bind mounted to one of the links.
The nstype argument allows the caller to check the type of
namespace that fd refers to. If this argument is specified as
zero, no check is performed. This can be useful if the caller already
knows the namespace type, or does not care about the type. The example
program that we discuss in a moment (ns_exec.c) falls into the
latter category: it is designed to work with any namespace type.
Specifying nstype instead as one of the CLONE_NEW*
constants causes the kernel to verify that fd is a file descriptor
for the corresponding namespace type. This can be useful if, for example,
the caller was passed the file descriptor via a UNIX domain socket and
needs to verify what type of namespace it refers to.
Using setns() and execve() (or one of the other
exec() functions) allows us to construct a simple but useful tool:
a program that joins a specified namespace and then executes a command in
that namespace.
Our program (ns_exec.c, whose full source can be found here) takes two or more command-line
arguments. The first argument is the pathname of a
/proc/PID/ns/* symbolic link (or a file that is
bind mounted to one of those symbolic links). The remaining arguments are
the name of a program to be executed inside the namespace that corresponds
to that symbolic link and optional command-line arguments to be given to
that program. The key steps in the program are the following:
fd = open(argv[1], O_RDONLY); /* Get descriptor for namespace */
setns(fd, 0); /* Join that namespace */
execvp(argv[2], &argv[2]); /* Execute a command in namespace */
An interesting program to execute inside a namespace is, of course, a
shell. We can use the bind mount for the UTS namespace that we created
earlier in conjunction with the ns_exec program to execute a shell
in the new UTS namespace created by our invocation of
demo_uts_namespaces:
# ./ns_exec ~/uts /bin/bash # ~/uts is bound to /proc/27514/ns/uts
My PID is: 28788
We can then verify that the shell is in the same UTS namespace as the
child process created by demo_uts_namespaces, both by inspecting
the hostname and by comparing the inode numbers of the
/proc/PID/ns/uts files:
# hostname
bizarro
# readlink /proc/27514/ns/uts
uts:[4026532338]
# readlink /proc/$$/ns/uts # $$ is replaced by shell's PID
uts:[4026532338]
In earlier kernel versions, it was not possible to use setns()
to join mount, PID, and user namespaces, but, starting with Linux 3.8,
setns() now supports joining all namespace types.
Leaving a namespace: unshare()
The final system call in the namespaces API is unshare():
int unshare(int flags);
The unshare() system call provides functionality similar to
clone(), but operates on the calling process: it creates the new
namespaces specified by the CLONE_NEW* bits in its flags
argument and makes the caller a member of the namespaces. (As with
clone(), unshare() provides functionality beyond working
with namespaces that we'll ignore here.) The main purpose of
unshare() is to isolate namespace (and other) side effects without
having to create a new process or thread (as is done by clone()).
Leaving aside the other effects of the clone() system call,
a call of the form:
clone(..., CLONE_NEWXXX, ....);
is roughly equivalent, in namespace terms, to the sequence:
if (fork() == 0)
unshare(CLONE_NEWXXX); /* Executed in the child process */
One use of the unshare() system call is in the implementation
of the unshare command, which allows the user to execute a command
in a separate namespace from the shell. The general form of this command
is:
unshare [options] program [arguments]
The options are command-line flags that specify the namespaces
to unshare before executing program with the specified
arguments.
The key steps in the implementation of the unshare command are
straightforward:
/* Code to initialize 'flags' according to command-line options
omitted */
unshare(flags);
/* Now execute 'program' with 'arguments'; 'optind' is the index
of the next command-line argument after options */
execvp(argv[optind], &argv[optind]);
A simple implementation of the unshare command
(unshare.c) can be found here.
In the following shell session, we use our unshare.c program to
execute a shell in a separate mount namespace. As we noted in last week's
article, mount namespaces isolate the set of filesystem mount points seen
by a group of processes, allowing processes in different mount namespaces
to have different views of the filesystem hierarchy.
# echo $$ # Show PID of shell
8490
# cat /proc/8490/mounts | grep mq # Show one of the mounts in namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
# readlink /proc/8490/ns/mnt # Show mount namespace ID
mnt:[4026531840]
# ./unshare -m /bin/bash # Start new shell in separate mount namespace
# readlink /proc/$$/ns/mnt # Show mount namespace ID
mnt:[4026532325]
Comparing the output of the two readlink commands shows that
the two shells are in separate mount namespaces. Altering the set of mount
points in one of the namespaces and checking whether that change is visible
in the other namespace provides another way of demonstrating that the two
programs are in separate namespaces:
# umount /dev/mqueue # Remove a mount point in this shell
# cat /proc/$$/mounts | grep mq # Verify that mount point is gone
# cat /proc/8490/mounts | grep mq # Is it still present in the other namespace?
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
As can be seen from the output of the last two commands, the
/dev/mqueue mount point has disappeared in one mount namespace, but
continues to exist in the other.
Concluding remarks
In this article we've looked at the fundamental pieces of the namespace
API and how they are employed together. In the follow-on articles, we'll look
in more depth at some other namespaces, in particular, the PID and user namespaces;
user namespaces open up a range of new possibilities for applications to use kernel
interfaces that were formerly restricted to privileged applications.
(2013-01-15: updated the concluding remarks to reflect the fact that there will be more than one following article.)
Comments (6 posted)
Brief items
...with so much drama, he now praises the iPad.
—
Google's
text-to-speech engine, when fed any of a number of input strings
containing "-ed with."
A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he
—
Davidlohr Bueso
Comments (4 posted)
For those watching systemd, version 197 has a number of interesting new
features, including
a
new mechanism for stable network interface naming, an integrated
bootchart implementation, optimized readahead behavior on Btrfs, and more.
"
systemd will no longer detect and recognize specific
distributions. All distribution-specific #ifdeffery has been removed,
systemd is now fully generic and distribution-agnostic. Effectively, not
too much is lost as a lot of the code is still accessible via explicit
configure switches. However, support for some distribution specific legacy
configuration file formats has been dropped. We recommend distributions to
simply adopt the configuration files everybody else uses now and convert
the old configuration from packaging scripts."
Full Story (comments: 44)
Firefox 18 has been released. See the
release
notes for the details. There are
Firefox for
Android release notes also.
Full Story (comments: 48)
The python.org wiki was compromised and the attacker gained shell access to
the "moin" user account. "
Some time later, the attacker deleted all
files owned by the "moin" user, including all instance data for both the
Python and Jython wikis. The attack also had full access to all MoinMoin
user data on all wikis. In light of this, the Python Software Foundation
encourages all wiki users to change their password on other sites if the
same one is in use elsewhere. We apologize for the inconvenience and will post
further news as we bring the new and improved wiki.python.org online."
This is the second high-profile compromise of a Moin-based wiki reported recently; anybody running such a site should be sure they are current with their security patches.
Full Story (comments: 2)
Richard Biener has posted another report on the status of GCC 4.8.0 development. Noting that the code had "stabilized itself over the holidays," stage 3 of the development cycle is over, so "GCC trunk is now in release branch mode, thus only regression fixes and documentation changes are allowed now."
Full Story (comments: none)
Version 3.6.3 of GNU Radio has been released. This release adds "major new
capabilities and many bug fixes, while maintaining strict source
compatibility with user code already written for the 3.6 API." Enhancements include asynchronous message passing, new blocks for interacting with the operating system's networking stack, and the ability to write signal processing blocks in Python.
Full Story (comments: none)
Newsletters and articles
Comments (none posted)
Groklaw reports on an invitation from the United States Patent and Trademark Office (USPTO) for software developers to join two roundtable discussions aimed at "enhancing" the quality of software patents. Both events are in February: one in New York City and one in Silicon Valley. As Groklaw points out, the events are space-limited and proprietary software vendors are sure to attend. "Large companies with patent portfolios they treasure and don't want to lose can't represent the interests of individual developers or the FOSS community, those most seriously damaged by toxic software patents." (Thanks to Davide Del Vento)
Comments (36 posted)
Taryn Fox has written a blog entry examining GNOME 3's global application menu, including a few outstanding problems, and a proposal for how they could be addressed. "Ideally, the App Menu will contain all of a given app's functionality. This is the assumption new GNOME apps (like Documents) are building on, and the one certain existing apps (like Empathy) are adopting."
Comments (none posted)
At his blog, Mozilla's Luke Crouch explores whether or not HTML applications deployed on "web runtime" environments offer a better experience than those running in a traditional browser. The answer is evidently no: "If you're making an HTML5 app, consider - do you want to make a native desktop application? Why or why not? Then consider if the same reasoning is true for the native mobile application."
Comments (none posted)
Page editor: Nathan Willis
Next page: Announcements>>