Checkpoint/restart: it's complicated
The responses to Oren's patch will not have been surprising to anybody who has been following the discussion. Kernel developers are nervous about the broad range of core code which is changed by this patch. They don't like the idea of spreading serialization hooks around the kernel which, the authors' claims to the contrary notwithstanding, look like they could be a significant maintenance burden over time. It is clear that kernel checkpoint/restart can never handle all processes; kernel developers wonder where the real-world limits are and how useful the capability will be in the end. The idea of moving checkpointed processes between kernel versions by rewriting the checkpoint image with a user-space tool causes kernel hackers to shiver. And so on; none of these worries are new.
Tejun Heo raised all these issues and more. He also called out an interesting alternative checkpoint/restart implementation called DMTCP, which solves the problem entirely in user space. With DMTCP in mind, Tejun concluded:
As one might imagine, this post was followed by an extended conversation between the in-kernel checkpoint/restart developers and the DMTCP developers, who had previously not put in an appearance on the kernel mailing lists. It seems that the two projects were each surprised to learn of the other's existence.
The idea behind DMTCP is to checkpoint a distributed set of processes without any special support from the kernel. Doing so requires support from the processes themselves; a checkpointing tool is injected into their address spaces using the LD_PRELOAD mechanism. DMTCP is able to checkpoint (and, importantly, restart) a wide variety of programs, including those running in the Python or Perl interpreters and those using GNU Screen. DMTCP is also used to support the universal reversible debugger project. It is, in other words, a capable tool with real-world uses.
Kernel developers naturally like the idea of eliminating a bunch of in-kernel complexity and solving a problem in user space, where things are always simpler. The only problem is that, in this case, it's not necessarily simpler. There is a surprising amount that DMTCP can do with the available interfaces, but there are also some real obstacles. Quite a bit of information about a process's history is not readily available from user space, but that history is often needed for checkpoint/restart; consider tracking whether two file descriptors are shared as the result of a fork() call or not. To keep the requisite information around, DMTCP must place wrappers around a number of system calls. Those wrappers interpose significant new functionality and may change semantics in unpredictable ways.
Pipes are hard for DMTCP to handle, so the pipe() wrapper has to turn them into full Unix-domain sockets. There is also an interesting dance required to get those sockets into the proper state at restart time. The handling of signals - not always straightforward even in the simplest of applications - is made more complicated by DMTCP, which also must reserve one signal (SIGUSR2 by default) for its own uses. The system call wrappers try to hide that signal handler from the application; there is also the little problem that signals which are pending at checkpoint time may be lost. Checkpointing will interrupt system calls, leading to unexpected EINTR returns; the wrappers try to compensate by automatically redoing the call when this happens. A second VDSO page must be introduced into a restarted process because it's not possible to control where the kernel places that page. There's a "virtual PID" layer which tries to fool restarted processes into thinking that they are still running with the same process ID they had when they were checkpointed.
There is an interesting plan for restarting programs which have a connection to an X server: they will wrap Xlib (not a small interface) and use those wrappers to obtain the state of the window(s) maintained by the application. That state can then be recreated at restart time before reconnecting the application with the server. Meanwhile, applications talking to an xterm are forced to reinitialize themselves at restart time by sending two SIGWINCH signals to them. And so on.
Given all of that, it is not surprising that the kernel checkpoint/restart developers see their approach as being a simpler, more robust, and more general solution to the problem. To them, DMTCP looks like a shaky attempt to reimplement a great deal of kernel functionality in user space. Matt Helsley summarized it this way:
In contrast, kernel-based cr is rather straight forward when you bother to read the patches. It doesn't require using combinations of obscure userspace interfaces to intercept and emulate those very same interfaces. It doesn't add a scattered set of new ABIs.
Seasoned LWN readers will be shocked to learn that few minds appear to have been changed by this discussion. Most developers seem to agree that some sort of checkpoint/restart functionality would be a useful addition to Linux, but they differ on how it should be done. Some see a kernel-side implementation as the only way to get even close to a full solution to the problem and as the simplest and most maintainable option. Others think that the user-space approach makes more sense, and that, if necessary, a small number of system calls can be added to simplify the implementation. It has the look of the sort of standoff that can keep a project like this out of the kernel indefinitely.
That said, something interesting may happen here. One thing that became reasonably clear in the discussion is that a complete, performant, and robust checkpoint/restart implementation will almost certainly require components in both kernel and user space. And it seems that the developers behind the two implementations will be getting together to talk about the problem in a less public setting. With luck, determination, and enough beer, they might just figure out a way to solve the problem using the best parts of both approaches. That would be a worthy outcome by any measure.
Index entries for this article:
Kernel: Checkpointing
Kernel: DMTCP
Posted Nov 11, 2010 11:18 UTC (Thu) by ebirdie (guest, #512)
Posted Nov 15, 2010 16:59 UTC (Mon) by orenl (guest, #21050)
Linux-CR builds on the experience garnered through the previous out-of-mainstream projects such as OpenVZ and Zap. In developing it we aim to improve the design and extend the range of supported features. Not less important, Linux-CR aims for inclusion in mainline.
(OpenVZ developers contributed to our effort in the past and we hope for more :)
Posted Nov 16, 2010 3:18 UTC (Tue) by mfedyk (guest, #55303)
Posted Nov 11, 2010 15:20 UTC (Thu) by sean.hunter (guest, #7920)
At my workplace there is a similarly sarcastic saying "Imagine my huge surprise". It has been said so often now that it has its own well-known abbreviation (IMHFS).
I can't imagine what the "F" stands for.
Posted Nov 11, 2010 17:34 UTC (Thu) by Quazatron (guest, #4368)
Posted Nov 12, 2010 15:36 UTC (Fri) by jschrod (subscriber, #1646)
Posted Nov 15, 2010 19:56 UTC (Mon) by knobunc (guest, #4678)
Posted Nov 15, 2010 21:06 UTC (Mon) by Trelane (subscriber, #56877)
Not *quite* the original, I'd wager....
Posted Nov 12, 2010 17:32 UTC (Fri) by Np237 (guest, #69585)
In this situation, pure userland checkpointing is a joke. It will never be able to store the state of network communications and of the various funny things that scientific code developers can imagine.
Virtualization is not a solution either. The performance impact is too high, and you'd have to plug in the hosts anyway for things like RDMA support.
This is where in-kernel checkpointing comes in handy. It's not sufficient either: you need to have support in the MPI library as well, so that all processes put themselves in a consistent state to be checkpointed together.
I used to be a SuperUX sysadmin; I doubt many of you here have even heard the name, but seamless checkpoint/restart in the whole stack gives very impressive results, and in the end saves a lot of CPU cycles. There are not many systems where you can change a kernel setting and reboot, at 6PM on a Friday evening, all without the users noticing and without any doubt about causing problems during the weekend. HPC on Linux is still very, very far from this.
Posted Nov 12, 2010 19:55 UTC (Fri) by daglwn (guest, #65432)
Not for much longer. CR does not scale. Given that DoD wants an exascale computer by 2018 with millions of cores and the associated MTBF, there's no way a CR system could possibly keep up. At best it would completely saturate the network. We're going to have to get a lot smarter about resiliency.
Posted Nov 12, 2010 20:15 UTC (Fri) by Np237 (guest, #69585)
I expect an exascale computer to come with an exascale filesystem. If the CR is correctly implemented, it means exabytes of data being written to it simultaneously, with no other bottleneck than a barrier to halt all processes in a consistent state. In the end it's the same amount of data that the computation would write when it ends to store the result. If the cluster cannot handle that, it's just an entry in the Top500 dick size contest, not something to do serious work.
The network speed should definitely not be a problem. It should be able to transfer all your node memory's contents in less than a second. Otherwise it will just not be able to run computations that exchange a lot of data.
Posted Nov 12, 2010 20:17 UTC (Fri) by Np237 (guest, #69585)
Posted Nov 12, 2010 22:36 UTC (Fri) by daglwn (guest, #65432)
Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432)
Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050)
There are serious challenges to make large HPC environments resilient, and C/R functionality is an essential component in getting there.
The main scalability bottleneck IMHO is that most distributed-C/R approaches require a barrier point for all participating processes across the cluster (the alternative - asynchronous C/R with message logging - has many issues).
However, write-back of data to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you can write the checkpoints to "memory servers", which is faster than disks. Such servers need a lot of memory, but don't care about CPUs and cores. You could speed things up using RDMA. For both methods, it's useful also to keep some redundancy so data can be recovered should a node (or more) disappear.
Posted Nov 12, 2010 20:17 UTC (Fri) by dlang (guest, #313)
the HPC software does its own checkpointing anyway, so that if the system crashes it can pick up at a reasonable point and not lose all the work that was done.
Posted Nov 12, 2010 20:22 UTC (Fri) by Np237 (guest, #69585)
For a dedicated cluster, it makes sense to adapt your code, but on general-purpose clusters you have hundreds of different codes running; it is much less expensive to implement system C/R if possible, instead of porting all the codes to be resilient.
Posted Nov 12, 2010 20:39 UTC (Fri) by dlang (guest, #313)
for example,
saving tcp connection info does you no good unless the other end of the tcp connection gets checkpointed at the same instant.
saving pending disk writes does no good if the file you are writing to is off on some other system and will contain writes after the checkpoint
system level c/r is useful for planned outages, but when you are in HPC environments, you have enough nodes that this is really not good enough, you will have unplanned outages, and unless your c/r can back out all these other side effects, it's not going to be able to be used for these outages.
Posted Nov 12, 2010 21:52 UTC (Fri) by Np237 (guest, #69585)
Most of the things you describe are already handled by BLCR, although still in an imperfect way.
Posted Nov 13, 2010 6:37 UTC (Sat) by dlang (guest, #313)
doing checkpointing of the apps on any one system isn't good enough.
you need to checkpoint the app and everything that it talks to on _every_ system at once (and make sure that you do it at the same instant so that there's no chance of data being in flight between systems to make it inconsistent)
Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (guest, #69585)
Yes, it's complicated. But with the help of the MPI library you can close all connections (since all inter-node connections are supposed to go through MPI) in a synchronized way. This is what BLCR + OpenMPI already do.
Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050)
1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.
2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.
The former is usually done combining a C/R mechanism with a C/R-aware implementation of e.g. MPI. The latter is more tricky if one would like to do seamless live migration.
Linux-CR supports both.
Posted Nov 15, 2010 17:55 UTC (Mon) by orenl (guest, #21050)
Posted Nov 15, 2010 20:28 UTC (Mon) by mhelsley (guest, #11324)
linux-cr does the former using its objhash so that we don't checkpoint shared state more than once. It avoids doing disk IO by, whenever possible, not bundling file/directory contents into the checkpoint image (generic_file_checkpoint()). Instead it relies on userspace to do IO-bandwidth-friendly optimizations like using filesystem snapshots.
That said, file "contents" are necessary for anon_inode-based interfaces such as eventfd, epoll, signalfd, and timerfd because those can't be "backed up" and restored like normal files.
Posted Nov 20, 2010 13:45 UTC (Sat) by slashdot (guest, #22014)
You could still put the tool in the kernel repository, but as a userspace tool like perf.
This would allow other uses of such an API, like "saving" a specific file descriptor.
Basically, what you need to add is:
Posted Nov 25, 2010 3:20 UTC (Thu) by karya (guest, #71446)
We should first add that we have the highest respect for Linux C/R, and are continuing to talk with them and exchange information.
1) In response to Np237, DMTCP does directly handle distributed processes in a pure userland fashion. For example, it can checkpoint OpenMPI as if it were just one more black box distributed computation. It handles the network communication through a strategy of draining sockets. In one phase, each host sends a special cookie through each socket. In the next phase, each host reads from all sockets until seeing the cookie. The data drained from the network can be re-inserted upon restart or resume from checkpoint. We do indeed use barriers, as mentioned by orenl. If anyone encounters a bug in using DMTCP for distributed processes, please do let us know so that we can fix it.
2) Np237 also points out:
There also exists this http://wiki.openvz.org/Checkpointing_and_live_migration, a working implementation, although it is for a container. One just wonders: isn't it worth looking at this as well?
One of the primary users for checkpoint/restart is high-performance computing.
Enhancing kernel API for user-space CR?
1. A way to access process-specific data of other processes, by splitting the notion of current fd/signal/mm/etc. table from the notion of the one accessed by system calls
2. A way to query all kernel state from userspace (with separate calls for each state type)
3. A way to recreate single kernel objects
4. Maybe a way to lock subsystems so that a consistent view can be saved
> Most of the things you describe are already handled by BLCR, although still in an imperfect way.
This is a good point, and we agree.