Preparing for user-space checkpoint/restore
CRIU
Cyrill Gorcunov has been working to fill in some of the gaps with a preparatory patch set for user-space checkpointing/restore with the "CRIU" tool set. There are a number of small additions to the kernel ABI to be found here:
- A new children entry in a thread's /proc directory
provides a list of that thread's immediate children. This information
allows a user-space checkpoint utility to find those child processes
without needing to walk through the entire process tree.
- /proc/pid/stat is extended to provide the bounds of
the process's argument and environment arrays, along with the exit
code. That allows this information to be reliably captured at
checkpoint time.
- A number of new prctl() options allow the argument and environment arrays to be restored in a way matching what was there at checkpoint time. The desired end result is that ps shows the same information about a process after a checkpoint/restore cycle as it did before.
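As a rough illustration of how a checkpoint utility might consume the children entry, here is a minimal C sketch. It assumes the file lives at /proc/<pid>/task/<tid>/children and holds a whitespace-separated list of PIDs; the helper names parse_children() and read_children() are hypothetical and are not taken from the CRIU code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Parse a whitespace-separated list of PIDs from text into out[].
 * Returns the number of PIDs stored (at most max). */
size_t parse_children(const char *text, pid_t *out, size_t max)
{
    size_t n = 0;
    const char *p = text;

    while (n < max) {
        char *end;
        long v = strtol(p, &end, 10);   /* skips leading whitespace */

        if (end == p)                   /* no more numbers */
            break;
        out[n++] = (pid_t)v;
        p = end;
    }
    return n;
}

/* Read the immediate children of thread tid in process pid.
 * Returns the number of children found, or -1 if the file cannot be
 * opened (for example, on a kernel built without this feature).
 * Assumption: the list fits in the fixed-size buffer; a real tool
 * would read in a loop instead. */
long read_children(pid_t pid, pid_t tid, pid_t *out, size_t max)
{
    char path[64];
    char buf[4096];
    size_t len;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/children",
             (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';
    return (long)parse_children(buf, out, max);
}
```

A caller would typically start from the root of the process tree, call read_children() for each thread, and recurse into the PIDs that come back.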
Perhaps the most significant new feature, though, is the addition of a new system call:
long kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);
Checkpoint/restore is meant to work as well on a tree of processes as on a single process. One challenge in the way of meeting that goal is that some of those processes may share resources - files, say, or, perhaps, a whole lot more. Replicating that sharing at restore time is relatively easy; the clone() system call provides a nice set of flags controlling the sharing of resources. The harder part is knowing, at checkpoint time, whether that sharing is taking place.
One way for user space to determine whether, for example, two processes are sharing the same open file would be to query the kernel for the address of the associated struct file and see if they are the same in both processes. That kind of functionality sets off alarms among those concerned about security, though; learning where data structures live in kernel space is often an important precondition to an attack. There was talk for a while of "obfuscating" the pointers - through an exclusive-OR with a random value, for example - but the risk was still seen as being too high. So the compromise is kcmp(), which simply answers the question of whether resources found in two processes are the same or not.
kcmp() takes two process ID parameters, indicating the processes of interest; both processes must be in the same PID namespace as the calling process. The type parameter tells the kernel the specific item that is being compared:
- KCMP_FILE: determines whether a file descriptor idx1
in the first process is the same as another descriptor (idx2)
in the second process.
- KCMP_FILES: compares the file descriptor arrays to see
whether the processes share all files.
- KCMP_FS: compares fs_struct structures (which hold
the current umask, working directory, namespace root, etc.).
- KCMP_IO: compares the I/O context, used mainly for block I/O
scheduling.
- KCMP_SIGHAND: compares the two processes' signal handler
arrays.
- KCMP_SYSVSEM: compares the list of undo operations associated
with SYSV semaphores.
- KCMP_VM: compares each process's address space.
The return value from kcmp() is zero if the two items are equal, one if the first item is "less" than the second, or two if the first is "greater" than the second. The ordered comparison may seem a little strange, especially when one looks at the implementation and sees that the pointers are obfuscated before comparison within the kernel. The result is, thus, an ordering that (by design) does not match the ordering of the relevant data structures in kernel space. It turns out that even a reshuffled (but consistent) "ordering" is useful for optimizing comparisons in user space when large numbers of open files are present.
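To make the return values concrete, the following C sketch shows how user space might call kcmp() and exploit the ordering. It assumes a kernel built with kcmp() support and headers that provide <linux/kcmp.h> and SYS_kcmp; since the C library is not assumed to offer a wrapper, the call goes through syscall(). The helpers same_open_file() and count_distinct_files() are hypothetical names used only for illustration.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static long kcmp(pid_t pid1, pid_t pid2, int type,
                 unsigned long idx1, unsigned long idx2)
{
    return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}

/* Returns 1 if fd1 in pid1 and fd2 in pid2 refer to the same open
 * file, 0 if they do not, and -1 on error (errno is set). */
int same_open_file(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
    long r = kcmp(pid1, pid2, KCMP_FILE, fd1, fd2);

    if (r < 0)
        return -1;
    return r == 0;
}

/* qsort() comparator over descriptors of the calling process: maps
 * the kcmp() results 1 ("less") and 2 ("greater") to -1 and 1.
 * Assumption: errors are not handled here and compare as equal. */
static int cmp_fd(const void *a, const void *b)
{
    pid_t self = getpid();
    long r = kcmp(self, self, KCMP_FILE,
                  *(const int *)a, *(const int *)b);

    return r == 1 ? -1 : r == 2 ? 1 : 0;
}

/* Sort the descriptors by the kernel's ordering, then count the
 * distinct open files with one pass over adjacent pairs. */
size_t count_distinct_files(int *fds, size_t n)
{
    size_t i, distinct = n ? 1 : 0;

    qsort(fds, n, sizeof(*fds), cmp_fd);
    for (i = 1; i < n; i++)
        if (cmp_fd(&fds[i - 1], &fds[i]) != 0)
            distinct++;
    return distinct;
}
```

Sorting first means that duplicates end up next to each other, so grouping n descriptors takes on the order of n log n calls rather than comparing every pair; that is the kind of optimization a consistent ordering makes possible.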
This patch set has been through a few cycles of review and seems to have addressed most of the concerns raised by reviewers. It may just find its way in through the next merge window. Meanwhile, people who want to see how the user-space side works can find the relevant code at criu.org.
DMTCP
CRIU is not the only user-space checkpoint/restore implementation out there; the DMTCP (Distributed MultiThreaded CheckPointing) project has been active since about the time of the 2.6.9 kernel. DMTCP differs somewhat from CRIU, though; in particular, it is able to checkpoint groups of processes connected by sockets - even across different machines - and it requires no changes to the kernel at all. These features come with a couple of limitations, though.
Checkpoint/restore with DMTCP requires that the target process(es) be started with a special script; it is not possible to checkpoint arbitrary processes on the system. That script uses the LD_PRELOAD mechanism to place wrappers around a number of libc and (especially) system call implementations. As a result, DMTCP has no need to ask the kernel whether two processes are sharing a specific resource; it has been watching the relevant system calls and knows how the processes were created. The disadvantage to this approach - beyond having to run checkpointable processes in a special environment - is that, as can be seen in the table of supported applications, not all programs can be checkpointed.
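The wrapper technique can be sketched in a few lines of C. This is not DMTCP's code; it is a minimal illustration, under simplifying assumptions, of how an interposed dup() could record which descriptors share an open file so that the kernel never needs to be asked. The table layout and the names file_group, group_for() and tracked_same_file() are hypothetical. The sketch forwards to the raw system call to stay short; real wrappers more commonly look up the next definition with dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_TRACKED_FDS 1024

/* file_group[fd] == 0 means "not seen yet"; equal non-zero values
 * mean the descriptors are known to share one open file.
 * Assumption: single-threaded use, and close() is not tracked. */
static int file_group[MAX_TRACKED_FDS];
static int next_group = 1;

/* Return the group of fd, assigning a fresh one on first sight.
 * Descriptors outside the table are reported as unknown (0). */
static int group_for(int fd)
{
    if (fd < 0 || fd >= MAX_TRACKED_FDS)
        return 0;
    if (file_group[fd] == 0)
        file_group[fd] = next_group++;
    return file_group[fd];
}

/* Interposed dup(): when this object is preloaded, the dynamic
 * linker resolves the application's dup() calls here first. */
int dup(int oldfd)
{
    int newfd = (int)syscall(SYS_dup, oldfd);

    /* Record that the new descriptor shares oldfd's open file. */
    if (newfd >= 0 && newfd < MAX_TRACKED_FDS)
        file_group[newfd] = group_for(oldfd);
    return newfd;
}

/* Answer the sharing question from the recorded history alone. */
int tracked_same_file(int fd1, int fd2)
{
    int g1 = group_for(fd1);
    int g2 = group_for(fd2);

    return g1 != 0 && g1 == g2;
}
```

Built as a shared object (for example, gcc -shared -fPIC -o libtrack.so track.c, with both file names hypothetical) and loaded with LD_PRELOAD=./libtrack.so, the wrapper sees every dup() call the application makes. A complete tool would need to wrap close(), dup2(), fork(), and many other calls as well.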
The recent 1.2.4 release improves support, though, to the point that the programs most users care about should be checkpointable. The system has been integrated with Open MPI and is able to respond to MPI-generated checkpoint and restore requests. DMTCP is available with the openSUSE, Debian Testing, and Ubuntu distributions. DMTCP may offer something good enough today for many users, who may not need to wait for one of the other projects to be ready sometime in the future.
