Checkpoint/restart: it's complicated
The responses to Oren's patch will not have been surprising to anybody who has been following the discussion. Kernel developers are nervous about the broad range of core code which is changed by this patch. They don't like the idea of spreading serialization hooks around the kernel which, the authors' claims to the contrary notwithstanding, look like they could be a significant maintenance burden over time. It is clear that kernel checkpoint/restart can never handle all processes; kernel developers wonder where the real-world limits are and how useful the capability will be in the end. The idea of moving checkpointed processes between kernel versions by rewriting the checkpoint image with a user-space tool causes kernel hackers to shiver. And so on; none of these worries are new.
Tejun Heo raised all these issues and more. He also called out an interesting alternative checkpoint/restart implementation called DMTCP, which solves the problem entirely in user space. With DMTCP in mind, Tejun concluded:
As one might imagine, this post was followed by an extended conversation between the in-kernel checkpoint/restart developers and the DMTCP developers, who had previously not put in an appearance on the kernel mailing lists. It seems that the two projects were each surprised to learn of the other's existence.
The idea behind DMTCP is to checkpoint a distributed set of processes without any special support from the kernel. Doing so requires support from the processes themselves; a checkpointing tool is injected into their address spaces using the LD_PRELOAD mechanism. DMTCP is able to checkpoint (and, importantly, restart) a wide variety of programs, including those running in the Python or Perl interpreters and those using GNU Screen. DMTCP is also used to support the universal reversible debugger project. It is, in other words, a capable tool with real-world uses.
Kernel developers naturally like the idea of eliminating a bunch of in-kernel complexity and solving a problem in user space, where things are always simpler. The only problem is that, in this case, it's not necessarily simpler. There is a surprising amount that DMTCP can do with the available interfaces, but there are also some real obstacles. Quite a bit of information about a process's history is not readily available from user space, but that history is often needed for checkpoint/restart; consider tracking whether two file descriptors are shared as the result of a fork() call or not. To keep the requisite information around, DMTCP must place wrappers around a number of system calls. Those wrappers interpose significant new functionality and may change semantics in unpredictable ways.
Pipes are hard for DMTCP to handle, so the pipe() wrapper has to turn them into full Unix-domain sockets. There is also an interesting dance required to get those sockets into the proper state at restart time. The handling of signals - not always straightforward even in the simplest of applications - is made more complicated by DMTCP, which also must reserve one signal (SIGUSR2 by default) for its own uses. The system call wrappers try to hide that signal handler from the application; there is also the little problem that signals which are pending at checkpoint time may be lost. Checkpointing will interrupt system calls, leading to unexpected EINTR returns; the wrappers try to compensate by automatically redoing the call when this happens. A second VDSO page must be introduced into a restarted process because it's not possible to control where the kernel places that page. There's a "virtual PID" layer which tries to fool restarted processes into thinking that they are still running with the same process ID they had when they were checkpointed.
There is an interesting plan for restarting programs which have a connection to an X server: they will wrap Xlib (not a small interface) and use those wrappers to obtain the state of the window(s) maintained by the application. That state can then be recreated at restart time before reconnecting the application with the server. Meanwhile, applications talking to an xterm are forced to reinitialize themselves at restart time by sending two SIGWINCH signals to them. And so on.
Given all of that, it is not surprising that the kernel checkpoint/restart developers see their approach as being a simpler, more robust, and more general solution to the problem. To them, DMTCP looks like a shaky attempt to reimplement a great deal of kernel functionality in user space. Matt Helsley summarized it this way:
In contrast, kernel-based cr is rather straight forward when you bother to read the patches. It doesn't require using combinations of obscure userspace interfaces to intercept and emulate those very same interfaces. It doesn't add a scattered set of new ABIs.
Seasoned LWN readers will be shocked to learn that few minds appear to have been changed by this discussion. Most developers seem to agree that some sort of checkpoint/restart functionality would be a useful addition to Linux, but they differ on how it should be done. Some see a kernel-side implementation as the only way to get even close to a full solution to the problem and as the simplest and most maintainable option. Others think that the user-space approach makes more sense, and that, if necessary, a small number of system calls can be added to simplify the implementation. It has the look of the sort of standoff that can keep a project like this out of the kernel indefinitely.
That said, something interesting may happen here. One thing that became reasonably clear in the discussion is that a complete, performant, and robust checkpoint/restart implementation will almost certainly require components in both kernel and user space. And it seems that the developers behind the two implementations will be getting together to talk about the problem in a less public setting. With luck, determination, and enough beer, they might just figure out a way to solve the problem using the best parts of both approaches. That would be a worthy outcome by any measure.
Index entries for this article:
Kernel: Checkpointing
Kernel: DMTCP
Posted Nov 11, 2010 11:18 UTC (Thu) by ebirdie (guest, #512)
Posted Nov 15, 2010 16:59 UTC (Mon) by orenl (guest, #21050)
Linux-CR builds on the experience garnered through the previous out-of-mainstream projects such as OpenVZ and Zap. In developing it we aim to improve the design and extend the range of supported features. Not less important, Linux-CR aims for inclusion in mainline.
(OpenVZ developers contributed to our effort in the past and we hope for more :)
Posted Nov 16, 2010 3:18 UTC (Tue) by mfedyk (guest, #55303)
Posted Nov 11, 2010 15:20 UTC (Thu) by sean.hunter (guest, #7920)
At my workplace there is a similarly sarcastic saying "Imagine my huge surprise". It has been said so often now that it has its own well-known abbreviation (IMHFS).
I can't imagine what the "F" stands for.
Posted Nov 11, 2010 17:34 UTC (Thu) by Quazatron (guest, #4368)
Posted Nov 12, 2010 15:36 UTC (Fri) by jschrod (subscriber, #1646)
Posted Nov 15, 2010 19:56 UTC (Mon) by knobunc (guest, #4678)
Posted Nov 15, 2010 21:06 UTC (Mon) by Trelane (subscriber, #56877)
Not *quite* the original, I'd wager....
Posted Nov 12, 2010 17:32 UTC (Fri) by Np237 (guest, #69585)
In this situation, pure userland checkpointing is a joke. It will never be able to store the state of network communications and of the various funny things that scientific code developers can imagine.
Virtualization is not a solution either. The performance impact is too high, and you'd have to plug in the hosts anyway for things like RDMA support.
This is where in-kernel checkpointing comes in handy. It's not sufficient either: you need to have support in the MPI library as well, so that all processes put themselves in a consistent state to be checkpointed together.
I used to be a SuperUX sysadmin; I doubt many of you here have even heard the name, but seamless checkpoint/restart in the whole stack gives very impressive results, and in the end saves a lot of CPU cycles. There are not many systems where you can change a kernel setting and reboot, at 6PM on a Friday evening, all without the users noticing and without any doubt about causing problems during the weekend. HPC on Linux is still very, very far from this.
Posted Nov 12, 2010 19:55 UTC (Fri) by daglwn (guest, #65432)
Not for much longer. CR does not scale. Given that DoD wants an exascale computer by 2018 with millions of cores and the associated MTBF, there's no way a CR system could possibly keep up. At best it would completely saturate the network. We're going to have to get a lot smarter about resiliency.
Posted Nov 12, 2010 20:15 UTC (Fri) by Np237 (guest, #69585)
I expect an exascale computer to come with an exascale filesystem. If the CR is correctly implemented, it means exabytes of data being written to it simultaneously, with no other bottleneck than a barrier to halt all processes in a consistent state. In the end it's the same amount of data that the computation would write when it ends to store the result. If the cluster cannot handle that, it's just an entry in the Top500 dick size contest, not something to do serious work.
The network speed should definitely not be a problem. It should be able to transfer all your node memory's contents in less than a second. Otherwise it will just not be able to run computations that exchange a lot of data.
Posted Nov 12, 2010 20:17 UTC (Fri) by Np237 (guest, #69585)
Posted Nov 12, 2010 22:36 UTC (Fri) by daglwn (guest, #65432)
Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432)
Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050)
There are serious challenges to make large HPC environments resilient, and C/R functionality is an essential component in getting there.
The main scalability bottleneck IMHO is that most distributed-C/R approaches require a barrier point for all participating processes across the cluster (the alternative - asynchronous C/R with message logging - has many issues).
However, write-back of data to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you can write the checkpoints to "memory servers", which is faster than disks. Such servers need a lot of memory, but don't care about CPUs and cores. You could speed things up using RDMA. For both methods, it's useful also to keep some redundancy so data can be recovered should a node (or more) disappear.
Posted Nov 12, 2010 20:17 UTC (Fri) by dlang (guest, #313)
the HPC software does its own checkpointing anyway, so that if the system crashes it can pick up at a reasonable point and not lose all the work that was done.
Posted Nov 12, 2010 20:22 UTC (Fri) by Np237 (guest, #69585)
For a dedicated cluster, it makes sense to adapt your code, but on general-purpose clusters you have hundreds of different codes running; it is much less expensive to implement system C/R if possible, instead of porting all the codes to be resilient.
Posted Nov 12, 2010 20:39 UTC (Fri) by dlang (guest, #313)
for example,
saving tcp connection info does you no good unless the other end of the tcp connection gets checkpointed at the same instant.
saving pending disk writes does no good if the file you are writing to is off on some other system and will contain writes after the checkpoint
system level c/r is useful for planned outages, but when you are in HPC environments, you have enough nodes that this is really not good enough, you will have unplanned outages, and unless your c/r can back out all these other side effects, it's not going to be able to be used for these outages.
Posted Nov 12, 2010 21:52 UTC (Fri) by Np237 (guest, #69585)
Most of the things you describe are already handled by BLCR, although still in an imperfect way.
Posted Nov 13, 2010 6:37 UTC (Sat) by dlang (guest, #313)
doing checkpointing of the apps on any one system isn't good enough.
you need to checkpoint the app and everything that it talks to on _every_ system at once (and make sure that you do it at the same instant so that there's no chance of data being in flight between systems to make it inconsistent)
Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (guest, #69585)
Yes, it's complicated. But with the help of the MPI library you can close all connections (since all inter-node connections are supposed to go through MPI) in a synchronized way. This is what BLCR + OpenMPI already do.
Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050)
1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.
2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.
The former is usually done combining a C/R mechanism with a C/R-aware implementation of e.g. MPI. The latter is more tricky if one would like to do seamless live migration.
Linux-CR supports both.
Posted Nov 15, 2010 17:55 UTC (Mon) by orenl (guest, #21050)
Posted Nov 15, 2010 20:28 UTC (Mon) by mhelsley (guest, #11324)
linux-cr does the former using its objhash so that we don't checkpoint shared state more than once. It avoids doing disk IO by, whenever possible, not bundling file/directory contents into the checkpoint image (generic_file_checkpoint()). Instead it relies on userspace to do IO-bandwidth-friendly optimizations like using filesystem snapshots.
That said, file "contents" are necessary for anon_inode-based interfaces such as eventfd, epoll, signalfd, and timerfd because those can't be "backed up" and restored like normal files.
Posted Nov 20, 2010 13:45 UTC (Sat) by slashdot (guest, #22014)
You could still put the tool in the kernel repository, but as a userspace tool like perf.
This would allow other uses of such an API, like "saving" a specific file descriptor.
Basically, what you need to add is:
Posted Nov 25, 2010 3:20 UTC (Thu) by karya (guest, #71446)
We should first add that we have the highest respect for Linux C/R, and are continuing to talk with them and exchange information.
1) In response to Np237, DMTCP does directly handle distributed processes in a pure userland fashion. For example, it can checkpoint OpenMPI as if it were just one more black box distributed computation. It handles the network communication through a strategy of draining sockets. In one phase, each host sends a special cookie through each socket. In the next phase, each host reads from all sockets until seeing the cookie. The data drained from the network can be re-inserted upon restart or resume from checkpoint. We do indeed use barriers, as mentioned by orenl. If anyone encounters a bug in using DMTCP for distributed processes, please do let us know so that we can fix it.
2) Np237 also points out:
There also exists this http://wiki.openvz.org/Checkpointing_and_live_migration, a working implementation, although it is for a container. One just wonders: isn't it worth looking at this as well?
One of the primary users for checkpoint/restart is high-performance computing.
Enhancing kernel API for user-space CR?
1. A way to access process-specific data of other processes, by splitting the notion of current fd/signal/mm/etc. table from the notion of the one accessed by system calls
2. A way to query all kernel state from userspace (with separate calls for each state type)
3. A way to recreate single kernel objects
4. Maybe a way to lock subsystems so that a consistent view can be saved
> Most of the things you describe are already handled by BLCR, although still in an imperfect way.
This is a good point, and we agree.