Preparing for user-space checkpoint/restore

By Jonathan Corbet
January 31, 2012

Adding checkpoint/restore functionality to Linux has been an ongoing topic of discussion and development for some years now. After the poor reception given to the in-kernel C/R implementation at the end of 2010, that particular project seems to have faded into the background. Instead, most of the interest seems to be in solutions that operate mostly in user space. Depending on the approach taken, most or all of the support needed to implement this functionality in user space already exists. But a complete solution is not yet there.

CRIU

Cyrill Gorcunov has been working to fill in some of the gaps with a preparatory patch set for user-space checkpointing/restore with the "CRIU" tool set. There are a number of small additions to the kernel ABI to be found here:

A new children entry in a thread's /proc directory provides a list of that thread's immediate children. This information allows a user-space checkpoint utility to find those child processes without needing to walk through the entire process tree.
/proc/pid/stat is extended to provide the bounds of the process's argument and environment arrays, along with the exit code. That allows this information to be reliably captured at checkpoint time.
A number of new prctl() options allow the argument and environment arrays to be restored in a way matching what was there at checkpoint time. The desired end result is that ps shows the same information about a process after a checkpoint/restore cycle as it did before.
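
As a rough illustration of the first item, a checkpoint tool might read the list of children as sketched below. The sketch assumes that the file shows up as /proc/<pid>/task/<tid>/children and that it holds PIDs separated by spaces; the function names are invented for this example and do not come from the CRIU code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Parse a whitespace-separated list of PIDs from text.
 * Stores at most max entries in out[] and returns how many were stored.
 */
size_t parse_children(const char *text, pid_t *out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        char *end;
        long value = strtol(text, &end, 10);

        if (end == text)        /* nothing left to convert */
            break;
        out[n++] = (pid_t)value;
        text = end;
    }
    return n;
}

/*
 * Read the immediate children of thread tid in process pid.
 * Returns the number of PIDs stored, or -1 if the file cannot be opened
 * (for example, on a kernel without this /proc entry).
 * The fixed-size buffer is a simplification: a very long list would be
 * truncated, possibly in the middle of a number.
 */
long read_children(pid_t pid, pid_t tid, pid_t *out, size_t max)
{
    char path[64];
    char buf[4096];
    size_t len;
    FILE *f;

    /* Assumed location of the new entry. */
    snprintf(path, sizeof(path), "/proc/%ld/task/%ld/children",
             (long)pid, (long)tid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';
    return (long)parse_children(buf, out, max);
}
```

Calling read_children() on each PID it returns walks just the subtree of interest, rather than scanning every entry in /proc.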

Perhaps the most significant new feature, though, is the addition of a new system call:

    long kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);

Checkpoint/restore is meant to work as well on a tree of processes as on a single process. One challenge in the way of meeting that goal is that some of those processes may share resources - files, say, or, perhaps, a whole lot more. Replicating that sharing at restore time is relatively easy; the clone() system call provides a nice set of flags controlling the sharing of resources. The harder part is knowing, at checkpoint time, whether that sharing is taking place.
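
To make the "easy" half concrete, here is a minimal sketch of how a restore tool might turn what it learned at checkpoint time into clone() flags. The struct and function names are hypothetical; only the CLONE_* constants come from the kernel headers.

```c
#include <linux/sched.h>    /* CLONE_* flag values */
#include <stdbool.h>

/* Hypothetical record of what a task shared with its parent at checkpoint time. */
struct sharing {
    bool files;     /* file descriptor table */
    bool fs;        /* umask, working directory, root */
    bool io;        /* I/O context */
    bool sighand;   /* signal handler table */
    bool sysvsem;   /* SYSV semaphore undo list */
    bool vm;        /* address space */
};

/*
 * Build the clone() flags that recreate the recorded sharing.
 * Note that the kernel rejects CLONE_SIGHAND without CLONE_VM, so a real
 * tool would have to validate the combination; this sketch does not.
 */
unsigned long clone_flags_for(const struct sharing *s)
{
    unsigned long flags = 0;

    if (s->files)
        flags |= CLONE_FILES;
    if (s->fs)
        flags |= CLONE_FS;
    if (s->io)
        flags |= CLONE_IO;
    if (s->sighand)
        flags |= CLONE_SIGHAND;
    if (s->sysvsem)
        flags |= CLONE_SYSVSEM;
    if (s->vm)
        flags |= CLONE_VM;
    return flags;
}
```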

One way for user space to determine whether, for example, two processes are sharing the same open file would be to query the kernel for the address of the associated struct file and see if they are the same in both processes. That kind of functionality sets off alarms among those concerned about security, though; learning where data structures live in kernel space is often an important precondition to an attack. There was talk for a while of "obfuscating" the pointers - through an exclusive-OR with a random value, for example - but the risk was still seen as being too high. So the compromise is kcmp(), which simply answers the question of whether resources found in two processes are the same or not.

kcmp() takes two process ID parameters, indicating the processes of interest; both processes must be in the same PID namespace as the calling process. The type parameter tells the kernel the specific item that is being compared:

KCMP_FILE: determines whether a file descriptor idx1 in the first process is the same as another descriptor (idx2) in the second process.
KCMP_FILES: compares the file descriptor arrays to see whether the processes share all files.
KCMP_FS: compares fs_struct structures (which hold the current umask, working directory, namespace root, etc.).
KCMP_IO: compares the I/O context, used mainly for block I/O scheduling.
KCMP_SIGHAND: compares the two processes' signal handler arrays.
KCMP_SYSVSEM: compares the list of undo operations associated with SYSV semaphores.
KCMP_VM: compares each process's address space.
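
A call from user space might look something like the sketch below. It assumes that the C library provides no kcmp() wrapper, so the call goes through syscall(), and that the kernel headers supply SYS_kcmp and <linux/kcmp.h>; the helper names are invented for this example.

```c
#include <linux/kcmp.h>     /* KCMP_FILE and the other comparison types */
#include <sys/syscall.h>    /* SYS_kcmp */
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper with the same parameters as the system call. */
static long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                     unsigned long idx1, unsigned long idx2)
{
    return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}

/*
 * Does fd1 in pid1 refer to the same open file as fd2 in pid2?
 * Returns 1 if so, 0 if not, and -1 if the call failed (errno is set),
 * for example because the kernel lacks kcmp() or permission is denied.
 */
int same_open_file(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
    long r = sys_kcmp(pid1, pid2, KCMP_FILE,
                      (unsigned long)fd1, (unsigned long)fd2);

    if (r < 0)
        return -1;
    return r == 0;
}
```

Within a single process, a descriptor and its dup() should compare as the same file, while the two ends of a pipe should not.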

The return value from kcmp() is zero if the two items are equal, one if the first item is "less" than the second, or two if the first is "greater" than the second. The ordered comparison may seem a little strange, especially when one looks at the implementation and sees that the pointers are obfuscated before comparison within the kernel. The result is, thus, an ordering that (by design) does not match the ordering of the relevant data structures in kernel space. It turns out that even a reshuffled (but consistent) "ordering" is useful for optimizing comparisons in user space when large numbers of open files are present.
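
One way to exploit that ordering is to sort a process's descriptors with kcmp() as the comparison function, so that descriptors referring to the same open file end up next to each other after O(n log n) comparisons instead of O(n^2). The following is only a sketch of the idea, not code from the CRIU tools; the function names are invented, and it compares descriptors within the calling process to stay short.

```c
#include <linux/kcmp.h>     /* KCMP_FILE */
#include <stdlib.h>         /* qsort */
#include <sys/syscall.h>    /* SYS_kcmp */
#include <unistd.h>

/* Set when any comparison fails; qsort() offers no way to report errors. */
static int compare_failed;

/* Translate kcmp()'s 0/1/2 into the negative/zero/positive form qsort() wants. */
int kcmp_to_order(long r)
{
    switch (r) {
    case 1:
        return -1;  /* first item is "less" */
    case 2:
        return 1;   /* first item is "greater" */
    default:
        return 0;   /* equal; callers must filter out failures first */
    }
}

/* qsort() comparison function over file descriptors of the calling process. */
static int compare_fds(const void *a, const void *b)
{
    int fd1 = *(const int *)a;
    int fd2 = *(const int *)b;
    long r = syscall(SYS_kcmp, getpid(), getpid(), KCMP_FILE,
                     (unsigned long)fd1, (unsigned long)fd2);

    if (r < 0 || r > 2) {
        compare_failed = 1;
        return 0;
    }
    return kcmp_to_order(r);
}

/*
 * Sort fds[] so that descriptors sharing an open file become adjacent.
 * Returns 0 on success, or -1 if any kcmp() call failed, in which case
 * the resulting order is meaningless.
 */
int sort_fds_by_open_file(int *fds, size_t n)
{
    compare_failed = 0;
    qsort(fds, n, sizeof(fds[0]), compare_fds);
    return compare_failed ? -1 : 0;
}
```

After a successful sort, one linear pass comparing each descriptor with its neighbor is enough to group the duplicates.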

This patch set has been through a few cycles of review and seems to have addressed most of the concerns raised by reviewers. It may just find its way in through the next merge window. Meanwhile, people who want to see how the user-space side works can find the relevant code at criu.org.

DMTCP

CRIU is not the only user-space checkpoint/restore implementation out there; the DMTCP (Distributed MultiThreaded CheckPointing) project has been busy since about 2.6.9. DMTCP differs somewhat from CRIU, though; in particular, it is able to checkpoint groups of processes connected by sockets - even across different machines - and it requires no changes to the kernel at all. These features come with a couple of limitations, though.

Checkpoint/restore with DMTCP requires that the target process(es) be started with a special script; it is not possible to checkpoint arbitrary processes on the system. That script uses the LD_PRELOAD mechanism to place wrappers around a number of libc and (especially) system call implementations. As a result, DMTCP has no need to ask the kernel whether two processes are sharing a specific resource; it has been watching the relevant system calls and knows how the processes were created. The disadvantage to this approach - beyond having to run checkpointable processes in a special environment - is that, as can be seen in the table of supported applications, not all programs can be checkpointed.
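
The general technique can be sketched in a few lines. This is not DMTCP's code: it is a toy wrapper around dup() that remembers which descriptor each duplicate came from, and every name other than dup() itself is invented. Real wrappers usually find the underlying libc function with dlsym(RTLD_NEXT, ...); to stay short, this one issues the raw system call instead.

```c
#include <sys/syscall.h>    /* SYS_dup */
#include <unistd.h>

#define MAX_TRACKED_FD 1024

/* alias_of[fd] is the descriptor that fd was duplicated from, or -1. */
static int alias_of[MAX_TRACKED_FD];
static int table_ready;

static void init_table(void)
{
    for (int i = 0; i < MAX_TRACKED_FD; i++)
        alias_of[i] = -1;
    table_ready = 1;
}

/*
 * Same name and signature as libc's dup(). When this object is loaded
 * with LD_PRELOAD, calls to dup() in the program resolve here first.
 * Simplifications: not thread-safe, and close() is not tracked.
 */
int dup(int oldfd)
{
    int newfd;

    if (!table_ready)
        init_table();
    newfd = (int)syscall(SYS_dup, oldfd);   /* sets errno on failure */
    if (newfd >= 0 && newfd < MAX_TRACKED_FD)
        alias_of[newfd] = oldfd;
    return newfd;
}

/* Query the record: which descriptor was fd duplicated from, or -1 if unknown? */
int fd_duplicated_from(int fd)
{
    if (!table_ready || fd < 0 || fd >= MAX_TRACKED_FD)
        return -1;
    return alias_of[fd];
}
```

Built as a shared object (for example with gcc -shared -fPIC) and named in LD_PRELOAD, a library like this sees every dup() call the program makes, which is how a wrapper-based tool can know about shared descriptors without asking the kernel.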

The recent 1.2.4 release improves support, though, to the point that everything a wide range of users care about should be checkpointable. The system has been integrated with Open MPI and is able to respond to MPI-generated checkpoint and restore requests. DMTCP is available with the openSUSE, Debian Testing, and Ubuntu distributions. DMTCP may offer something good enough today for many users, who may not need to wait for one of the other projects to be ready sometime in the future.

Preparing for user-space checkpoint/restore

Posted Feb 2, 2012 4:57 UTC (Thu) by thedevil (guest, #32913)

As for programs that play well with DMTCP, I am interested in HOL Light:

http://www.cl.cam.ac.uk/~jrh13/hol-light/

Does anyone know the answer, or do I have to just try it? :-P

Preparing for user-space checkpoint/restore

Posted Feb 9, 2012 2:51 UTC (Thu) by karya (guest, #71446)

I am writing on behalf of the DMTCP team. If you run into any issues with checkpointing HOL Light, please let us know and we will be happy to work with you in fixing them.

Preparing for user-space checkpoint/restore

Posted Feb 2, 2012 12:13 UTC (Thu) by misiu_mp (guest, #41936)

What is Checkpoint/Restore?
It's definitely not trivial what it does, and the article doesn't make it easy to figure out.

Preparing for user-space checkpoint/restore

Posted Feb 2, 2012 12:35 UTC (Thu) by gidoca (subscriber, #62438)

This article has a brief explanation.

Preparing for user-space checkpoint/restore

Posted Feb 2, 2012 13:00 UTC (Thu) by misiu_mp (guest, #41936)

Thanks. I will cite it here for convenience.
Description of the feature:
"Checkpoint/restart allows the state of a set of processes to be saved to persistent storage, then restarted at some future time, possibly on a different system."
Use cases:
"It has a number of potential uses, including fault-tolerant systems, debugging (it's a sort of "super core dump"), fast application startup, testing, and as a kind of "generic time machine." That last one allows for the important use case of checkpointing a game, then restoring it after a move which proves to be a mistake. Checkpoint/restart can also be used as a sort of application-level suspend feature; it can function as a kind of "smart swap" which can move an application entirely out of memory when the need arises. There is also the interesting prospect of saving a desktop session on a USB key, then restarting it on an entirely different system in a different location."

kcmp vs strcmp convention

Posted Feb 2, 2012 15:16 UTC (Thu) by jnareb (subscriber, #46500)

Why not return -1 if "less" and 1 if "greater", just like strcmp (and what e.g. qsort expects from a comparison function), instead of 1 and 2?

kcmp vs strcmp convention

Posted Feb 2, 2012 16:05 UTC (Thu) by cesarb (subscriber, #6266)

This is a system call.

In a system call, values between -1 and -4095 (inclusive) are reserved for errors, with values from errno.h. In particular, -1 means EPERM on x86.

Preparing for user-space checkpoint/restore

Posted Feb 8, 2012 10:04 UTC (Wed) by ebirdie (guest, #512)

I have been loosely following the development of the checkpoint/restore feature, but the article doesn't say whether the implementation builds on the idea described in this previous article:

LWN.net: Checkpoint/restart (mostly) in user space
https://lwn.net/Articles/452184/

Just a minor interesting info bit.

Relationship with Android?

Posted Feb 9, 2012 13:36 UTC (Thu) by renox (guest, #23785)

If memory serves, all applications in Android must be able to be stopped at any time (to preserve battery), so is there a relationship between this and Android?

Relationship with Android?

Posted Feb 9, 2012 14:37 UTC (Thu) by mfedyk (guest, #55303)

No.

In Android, apps are expected to store state themselves so that when started again they will continue with that state.

As you can imagine, app implementation of this is spotty.

Relationship with Android?

Posted Feb 9, 2012 18:48 UTC (Thu) by raven667 (subscriber, #5198)

Well, a robust checkpoint and restore feature could be used to implement the behavior Android desires without requiring special, spotty, support from the app.

Relationship with Android?

Posted Feb 9, 2012 19:27 UTC (Thu) by mjg59 (subscriber, #23239)

Full checkpoint/restore is much more resource consuming than having the application be able to serialise its important state and resynthesise the rest later. You probably wouldn't want to encourage the former at the expense of the latter.

Preparing for user-space checkpoint/restore

Posted Jul 23, 2012 3:09 UTC (Mon) by bergwolf (guest, #55931)

I'm somewhat confused: what do HPC people use when they checkpoint/restart? I've been told many times that HPC applications do checkpoint/restart very often. But how?

Preparing for user-space checkpoint/restore

Posted Jul 23, 2012 7:17 UTC (Mon) by dlang (guest, #313)

The HPC applications periodically store their state so that they can kill the app, move the state file to another machine, and start the app again (picking up where it left off).

Doing this inside an app is fairly easy as long as there is no problem re-doing work since the last checkpoint, or you can send the app a signal meaning "stop working and save a checkpoint now".

Doing this at the OS level, so that you can do this with arbitrary apps without the app (or other systems the app is communicating with) even knowing that it has taken place, is very hard. It's this problem that you are seeing worked on.