November 19, 2007
This article was contributed by Pavel Emelyanov and Kir Kolyshkin
One of the new features in the upcoming 2.6.24 kernel will be
PID namespace support, developed by the OpenVZ team with the help of IBM.
The PID namespace allows for creating sets of tasks, with each such set looking
like a standalone machine with respect to process IDs. In other words,
tasks in different namespaces can have the same IDs.
This feature is the major prerequisite for the migration of containers between
hosts: with a namespace, one may move a container to another host while keeping
its PID values. This is a requirement, since a task is not expected to change
its PID. Without this feature, migration would very likely fail, because
processes with the same IDs may already exist on the destination node, which
would cause conflicts when addressing tasks by their IDs.
PID namespaces are hierarchical; once a new PID namespace is created,
all the tasks in the current PID namespace will see the tasks (i.e. will
be able to address them with their PIDs) in this new namespace. However,
tasks from the new namespace will not see the ones from the current one.
This means that each task now has more than one PID -- one for each namespace
in which it is visible.
User-space API
To create a new namespace, one should just call the clone(2)
system call with the CLONE_NEWPID flag set.
After this, it is useful to change the root directory and mount
a new procfs instance on /proc so that common utilities
like ps work.
Note that since the parent knows the PID of its child, it may
wait() in the usual way for it to exit.
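The sequence described above can be sketched in user space as follows. This is a minimal illustration, not code from the kernel patches: the names child_main() and run_in_new_pid_ns() are made up for the example, the chroot and procfs mount steps are left out, and the call is assumed to fail (typically with EPERM) when the caller lacks the privileges needed for CLONE_NEWPID.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)	/* child stack size, chosen arbitrarily */

/* Runs as the first task of the new namespace. */
static int child_main(void *arg)
{
	(void)arg;
	/* Exit status 0 if this task sees itself as PID 1, 1 otherwise. */
	return getpid() == 1 ? 0 : 1;
}

/*
 * Hypothetical helper: create a child in a new PID namespace and wait
 * for it. Returns the child's exit status (0 means it saw itself as
 * PID 1), or -1 if the namespace could not be created.
 */
int run_in_new_pid_ns(void)
{
	char *stack = malloc(STACK_SIZE);
	pid_t pid;
	int status;

	if (!stack)
		return -1;

	/* The stack grows down, so pass the top of the allocation. */
	pid = clone(child_main, stack + STACK_SIZE,
		    CLONE_NEWPID | SIGCHLD, NULL);
	if (pid == -1) {
		free(stack);
		return -1;	/* e.g. EPERM without sufficient privileges */
	}

	/* The parent knows the child's PID in its own namespace. */
	if (waitpid(pid, &status, 0) == -1) {
		free(stack);
		return -1;
	}
	free(stack);

	/* With options == 0, waitpid() only reports termination. */
	assert(WIFEXITED(status) || WIFSIGNALED(status));
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the value returned by clone() in the parent is the child's PID as seen from the parent's namespace, while the child itself observes 1.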
The first task in a new namespace will have a PID of 1. Thus, it
will be this namespace's init and child reaper, so all the orphaned
tasks will be re-parented to it. Unlike on a standalone machine, this "init"
can die, and in this case, the whole namespace will be terminated.
Since we now have isolated sets of tasks, procfs should show only
the set of PIDs that is visible to a particular task. To achieve
this goal, procfs should be mounted multiple times -- once
for each namespace. After this, the PIDs shown in a mounted instance
will be those from the namespace which created that mount.
For example, a user may create a new proc_2 directory,
spawn a PID namespace, and mount a procfs instance on that directory. After this,
the user will be able to see the PIDs as they appear inside this new namespace.
There will be a PID 1, which is the namespace's init,
and the other PIDs may coincide with some PIDs from the current namespace
while referring to different tasks.
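One way to inspect what a given procfs instance exposes is to list its all-numeric directory entries, each of which names a PID visible in the namespace that created the mount. The sketch below assumes a procfs instance is already mounted at the directory passed in (for instance the proc_2 directory from the example above); count_pid_entries() is a name invented for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the entries under procdir whose names
 * consist only of digits, i.e. the per-PID directories of a procfs
 * mount. Returns -1 if the directory cannot be opened.
 */
int count_pid_entries(const char *procdir)
{
	DIR *dir;
	struct dirent *de;
	int count = 0;

	assert(procdir != NULL);

	dir = opendir(procdir);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		const char *p = de->d_name;

		if (*p == '\0')
			continue;
		/* Skip over leading digits; a PID entry has nothing else. */
		while (isdigit((unsigned char)*p))
			p++;
		if (*p == '\0')
			count++;
	}

	closedir(dir);
	return count;
}
```

Calling this on /proc and on a procfs instance mounted from inside a new namespace would be expected to give different counts, since each mount shows only the tasks of its own namespace.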
No other changes in the user API are necessary. Tasks still have the ability to
get their PIDs, PGIDs, etc. with the known system calls. They can also
work with sessions and groups. Tasks may create threads and work with futexes.
Internal API
All the PIDs that a task may have are described in the struct pid.
This structure contains the ID value, the list of tasks having this ID,
the reference counter and the hashed list node to be stored in the
hash table for a faster search.
A few more words about the lists of tasks. Basically a task has three PIDs:
the process ID (PID), the process group ID (PGID), and the
session ID (SID). The PGID and the SID may be shared between the tasks,
for example, when two or more tasks belong to the same group, so each
group ID addresses more than one task.
With the PID namespaces this structure becomes elastic. Now, each PID
may have several values, with each one being valid in one namespace. That is,
a task may have PID of 1024 in one namespace, and 256 in another. So, the
former struct pid changes.
Here is what struct pid looked like before the introduction of
PID namespaces:
struct pid {
	atomic_t count;				/* reference counter */
	int nr;					/* the pid value */
	struct hlist_node pid_chain;		/* hash chain */
	struct hlist_head tasks[PIDTYPE_MAX];	/* lists of tasks */
	struct rcu_head rcu;			/* RCU helper */
};
And this is how it looks now:
struct upid {
	int nr;				/* moved from struct pid */
	struct pid_namespace *ns;	/* the namespace this value
					 * is visible in
					 */
	struct hlist_node pid_chain;	/* moved from struct pid */
};

struct pid {
	atomic_t count;
	struct hlist_head tasks[PIDTYPE_MAX];
	struct rcu_head rcu;
	int level;			/* the number of upids */
	struct upid numbers[0];
};
As you can see, the struct upid now represents the PID
value -- it is stored in the hash and has the PID value.
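To make the layout more concrete, here is a small user-space model of how one ID per namespace level might be filled in when a task is created. It is only a sketch under stated assumptions: the model_* names are invented, the per-namespace counter stands in for the kernel's real ID allocator, and the model assumes that index 0 of numbers[] belongs to the initial namespace, so a task at depth `level` carries `level + 1` entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_ns {
	int level;			/* 0 for the initial namespace */
	int last_nr;			/* last ID handed out here */
	struct model_ns *parent;	/* NULL for the initial namespace */
};

struct model_upid {
	int nr;				/* the ID value */
	struct model_ns *ns;		/* namespace it is valid in */
};

struct model_pid {
	int level;			/* depth of the task's namespace */
	struct model_upid numbers[];	/* level + 1 entries */
};

/*
 * Allocate a model_pid for a task created in ns, assigning one ID in
 * ns itself and one in each of its ancestors.
 */
struct model_pid *model_alloc_pid(struct model_ns *ns)
{
	struct model_pid *pid;
	int i;

	assert(ns != NULL);

	pid = malloc(sizeof(*pid) +
		     (ns->level + 1) * sizeof(pid->numbers[0]));
	if (!pid)
		return NULL;

	pid->level = ns->level;
	/* Walk from the task's own namespace up to the initial one. */
	for (i = ns->level; i >= 0; i--) {
		pid->numbers[i].nr = ++ns->last_nr;
		pid->numbers[i].ns = ns;
		ns = ns->parent;
	}
	return pid;
}
```

For instance, if the initial namespace has already handed out ID 1023 and a fresh child namespace has handed out none, the first task created in the child gets 1024 in the initial namespace and 1 in its own.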
To convert the struct pid to the PID or vice versa one may
use a set of helpers like task_pid_nr(), pid_nr_ns(),
find_task_by_vpid(), etc.
All these calls carry some information in their names:
..._nr()
- These operate with the so called "global" PIDs.
Global PIDs are the numbers that are unique in the whole system, just
like the old PIDs were. E.g.
pid_nr(pid) will tell you the
global PID of the given struct pid. These are only useful
when the PID value is not going to leave the kernel. For example, some code
needs to save the PID and then find the task by it. However, in this
case saving a direct pointer to the struct pid is
preferable, as global PIDs are going to be used in kernel logs only.
..._vnr()
- These helpers work with the "virtual" PID, i.e.
with the ID as seen by a process. For example,
task_pid_vnr(tsk)
will tell you the PID of a task, as this task sees it (with
sys_getpid()). Note that this value will most likely be
useless if you're working in another namespace, so these helpers are used when
working with the current task, since every task always sees its own virtual PID.
..._nr_ns()
- These work with the PIDs as seen from the specified
namespace. If you want to get some task's PID (for example, to report it to
user space and find this task later), you may call
task_pid_nr_ns(tsk, current->nsproxy->pid_ns) to get
the number, and then find the task using
find_task_by_pid_ns(pid, current->nsproxy->pid_ns).
These are used in system calls, when the PID comes from user
space. In this case one task may address another which exists in
another namespace.
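The difference between the three families can be illustrated with a simplified user-space model. This is not the kernel implementation: the model_* names and the fixed MODEL_MAX_LEVEL bound are assumptions made for the example, the array is assumed to be indexed by namespace depth with index 0 being the initial namespace, and the "virtual" lookup follows the description above (the ID in the task's own namespace).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4	/* hypothetical bound, for illustration */

struct model_ns {
	int level;		/* 0 for the initial namespace */
};

struct model_upid {
	int nr;
	const struct model_ns *ns;
};

struct model_pid {
	int level;		/* depth of the task's own namespace */
	struct model_upid numbers[MODEL_MAX_LEVEL];
};

/* "..._nr_ns()" flavor: the ID as seen from ns, or 0 if not visible. */
int model_pid_nr_ns(const struct model_pid *pid, const struct model_ns *ns)
{
	if (pid && ns && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}

/* "..._nr()" flavor: the global ID, valid in the initial namespace. */
int model_pid_nr(const struct model_pid *pid)
{
	return pid ? pid->numbers[0].nr : 0;
}

/* "..._vnr()" flavor: the ID as the task itself sees it. */
int model_pid_vnr(const struct model_pid *pid)
{
	if (!pid)
		return 0;
	assert(pid->level >= 0 && pid->level < MODEL_MAX_LEVEL);
	return pid->numbers[pid->level].nr;
}
```

With a task that has ID 1024 in the initial namespace and 256 in a child namespace, the global lookup yields 1024, the virtual lookup yields 256, and a lookup from an unrelated namespace at the same depth yields 0 because the task is not visible there.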
Conclusion
The interface as described here has been merged for the 2.6.24 kernel
release. It has, however, been marked as "experimental" to prevent its
wide deployment by distributors while some remaining issues are worked
out. Few, if any, changes to this API are expected between now and when
the "experimental" tag is removed in a later kernel release.