Checkpoint/restart: it's complicated

Posted Nov 12, 2010 19:55 UTC (Fri) by daglwn (guest, #65432)
In reply to: Checkpoint/restart: it's complicated by Np237
Parent article: Checkpoint/restart: it's complicated

One of the primary users for checkpoint/restart is high-performance computing.

Not for much longer. CR does not scale. Given that DoD wants an exascale computer by 2018 with millions of cores and the associated MTBF, there's no way a CR system could possibly keep up. At best it would completely saturate the network.

We're going to have to get a lot smarter about resiliency.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:15 UTC (Fri) by Np237 (subscriber, #69585) [Link]

You don’t need to checkpoint every minute; that would already saturate the filesystem on current clusters. But you can do it every now and then, and if your filesystem cannot store all the RAM of your compute nodes in a short time, it is already not up to the task.

I expect an exascale computer to come with an exascale filesystem. If the CR is correctly implemented, it means exabytes of data being written to it simultaneously, with no other bottleneck than a barrier to halt all processes in a consistent state. In the end it’s the same amount of data that the computation would write when it ends to store the result. If the cluster cannot handle that, it’s just an entry in the Top500 dick size contest, not something to do serious work.

The network speed should definitely not be a problem. It should be able to transfer all your node memory’s contents in less than a second. Otherwise it will just not be able to run computations that exchange a lot of data.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:17 UTC (Fri) by Np237 (subscriber, #69585) [Link]

§2: I meant petabytes, of course. It’s hard enough without needing petabytes.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:36 UTC (Fri) by daglwn (guest, #65432) [Link]

You are seriously underestimating the complexity of these systems. We're going to have node failures on the order of hours, at best. We really need something much better than CR. No matter how CR is implemented, it won't scale.
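
To put rough numbers on this disagreement, one common first-order model is Young's approximation for the checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system MTBF. The sketch below is only an illustration: the function names are hypothetical, the assumption that node failures are independent (so system MTBF is node MTBF divided by node count) is not from the thread, and the model ignores restart time and only holds when delta is much smaller than M.

```python
import math


def system_mtbf(node_mtbf_s, n_nodes):
    """System MTBF, ASSUMING independent node failures at a constant rate."""
    return node_mtbf_s / n_nodes


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s, mtbf_s):
    """Approximate fraction of run time lost to checkpointing and rework."""
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    # Cost of writing one checkpoint per interval, plus the expected rework:
    # on average half an interval of work is lost at each failure.
    return checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy to follow.
    print(young_interval(50.0, 10_000.0))
    print(overhead_fraction(50.0, 10_000.0))
```

With a checkpoint cost of 50 s and a system MTBF of 10,000 s, this model gives an interval of 1,000 s and roughly 10% overhead. The overhead grows as the MTBF shrinks relative to the checkpoint cost, which is the crux of the scaling argument.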

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432) [Link]

All IMHO, of course. I am far from an OS, filesystem or CR expert, but I am aware of the trends.

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050) [Link]

There are serious challenges to make large HPC environments resilient, and C/R functionality is an essential component in getting there.

The main scalability bottleneck IMHO is that most distributed C/R approaches require a barrier point for all participating processes across the cluster (the alternative - asynchronous C/R with message logging - has many issues).

However, write-back of data to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you can write the checkpoints to "memory servers", which are faster than disks. Such servers need a lot of memory, but don't need many CPUs or cores. You could speed things up using RDMA. For both methods, it's also useful to keep some redundancy so that data can be recovered should a node (or more) disappear.
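
One simple way to get the redundancy mentioned above is XOR parity across a group of nodes, as in RAID-5. The sketch below is an assumption about how that could look, not something described in the thread: `xor_parity` and `recover` are hypothetical helpers, checkpoints are equal-length byte strings, and only a single lost node per group can be rebuilt.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the one missing block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # Hypothetical local checkpoints from three nodes.
    checkpoints = [b"abcd", b"wxyz", b"1234"]
    parity = xor_parity(checkpoints)  # stored on a fourth node
    # Node 1 disappears; rebuild its checkpoint from the others.
    rebuilt = recover([checkpoints[0], checkpoints[2]], parity)
    print(rebuilt == checkpoints[1])
```

The parity block costs one extra checkpoint's worth of storage per group, and the group size trades that storage against how many simultaneous node losses are likely.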

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:17 UTC (Fri) by dlang (subscriber, #313) [Link]

much of the HPC software is custom written, and implements checkpointing on its own, so adding c/r at the OS level doesn't actually buy you that much.

the HPC software does its own checkpointing anyway so that if the system crashes it can pick up at a reasonable point and not lose all the work that was done.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:22 UTC (Fri) by Np237 (subscriber, #69585) [Link]

It really depends on the software. You can’t easily add C/R to a large piece of code that was written in the 70s and evolved organically since that time.

For a dedicated cluster, it makes sense to adapt your code, but on general-purpose clusters you have hundreds of different codes running; it is much less expensive to implement system C/R if possible, instead of porting all the codes to be resilient.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:39 UTC (Fri) by dlang (subscriber, #313) [Link]

the problem is that just doing c/r on the process isn't good enough, it also has to do c/r on every other thing that the process talks to.

for example,

saving tcp connection info does you no good unless the other end of the tcp connection gets checkpointed at the same instant.

saving pending disk writes does no good if the file you are writing to is off on some other system and will contain writes after the checkpoint

system level c/r is useful for planned outages, but when you are in HPC environments, you have enough nodes that this is really not good enough, you will have unplanned outages, and unless your c/r can back out all these other side effects, it's not going to be able to be used for these outages.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 21:52 UTC (Fri) by Np237 (subscriber, #69585) [Link]

This is precisely why you need the help of the MPI library and the resource manager: so that all processes related to a given job can be handled at the same time.

Most of the things you describe are already handled by BLCR, although still in an imperfect way.

Checkpoint/restart: it's complicated

Posted Nov 13, 2010 6:37 UTC (Sat) by dlang (subscriber, #313) [Link]

you miss my point.

doing checkpointing of the apps on any one system isn't good enough.

you need to checkpoint the app and everything that it talks to on _every_ system at once (and make sure that you do it at the same instant so that there's no chance of data being in flight between systems to make it inconsistent)

Checkpoint/restart: it's complicated

Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (subscriber, #69585) [Link]

No, you are missing my point. When I wrote “all processes related to a given job”, I really meant all processes, on all cluster nodes.

Yes, it’s complicated. But with the help of the MPI library you can close all connections (since all inter-node connections are supposed to go through MPI) in a synchronized way. This is what BLCR + OpenMPI already do.
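
The idea of quiescing every channel before taking the snapshot can be sketched in a few lines. This is a toy model, not how BLCR or Open MPI implement it: Python threads stand in for MPI ranks, queues stand in for connections, and `run_job` is a hypothetical name. Each worker stops sending, waits at a barrier until everyone has stopped, drains its inbox, and only then records its state, so nothing is in flight at checkpoint time.

```python
import queue
import threading


def run_job(n_workers=4, rounds=50, tokens_each=10):
    """Pass tokens around a ring, then take one coordinated checkpoint."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    barrier = threading.Barrier(n_workers)
    snapshot = [None] * n_workers

    def worker(rank):
        tokens = tokens_each
        for _ in range(rounds):
            # Normal operation: send one token to the next rank if we have any.
            if tokens > 0:
                tokens -= 1
                inboxes[(rank + 1) % n_workers].put(1)
            # Receive whatever has already arrived, without blocking.
            try:
                tokens += inboxes[rank].get_nowait()
            except queue.Empty:
                pass
        # Checkpoint step 1: stop sending and wait until every rank has stopped.
        barrier.wait()
        # Step 2: nobody is sending any more, so draining empties the channel.
        while True:
            try:
                tokens += inboxes[rank].get_nowait()
            except queue.Empty:
                break
        # Step 3: record local state, then wait so that all ranks resume together.
        snapshot[rank] = tokens
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshot


if __name__ == "__main__":
    snap = run_job()
    # Every token is accounted for in some rank's saved state.
    print(sum(snap) == 4 * 10)
```

The barrier is also where the scalability cost discussed elsewhere in this thread shows up: every rank waits for the slowest one before any state is saved.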

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050) [Link]

There are two types of checkpoints here:

1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.

2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.

The former is usually done combining a C/R mechanism with a C/R-aware implementation of e.g. MPI. The latter is more tricky if one would like to do seamless live migration.

Linux-CR supports both.

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:55 UTC (Mon) by orenl (guest, #21050) [Link]

The Blue Waters people think that integrated checkpoint-restart is a worthy goal (see computing system / software configuration there). Linux-CR is being proposed for it.

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 20:28 UTC (Mon) by mhelsley (guest, #11324) [Link]

Which is why you need a c/r implementation that aggressively avoids checkpointing shared data multiple times and avoids unnecessary IO as much as possible.

linux-cr does the former using its objhash so that we don't checkpoint shared state more than once. It avoids doing disk IO by, whenever possible, not bundling file/directory contents into the checkpoint image (generic_file_checkpoint()). Instead it relies on userspace to do IO-bandwidth friendly optimizations like using filesystem snapshots.

That said, file "contents" are necessary for anon_inode-based interfaces such as eventfd, epoll, signalfd, and timerfd because those can't be "backed up" and restored like normal files.
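
The "checkpoint shared state only once" idea can be illustrated with a small identity-keyed table. This is a loose Python analogy, not linux-cr's actual objhash code: the `ObjHash` class, `lookup_or_add`, and the image record layout are all hypothetical.

```python
class ObjHash:
    """Map each shared object to a small integer tag, keyed by identity."""

    def __init__(self):
        self._tags = {}   # id(obj) -> tag
        self._keep = []   # hold references so id() values cannot be reused

    def lookup_or_add(self, obj):
        """Return (tag, is_new); is_new is True only on the first sighting."""
        key = id(obj)
        if key in self._tags:
            return self._tags[key], False
        tag = len(self._tags) + 1
        self._tags[key] = tag
        self._keep.append(obj)
        return tag, True


def checkpoint(tasks):
    """Write each shared object once; later tasks refer to it by tag only."""
    objhash = ObjHash()
    image = []
    for name, shared in tasks:
        tag, is_new = objhash.lookup_or_add(shared)
        if is_new:
            image.append(("object", tag, dict(shared)))
        image.append(("task", name, tag))
    return image


if __name__ == "__main__":
    files = {"fd0": "/tmp/a"}  # hypothetical table shared by two tasks
    for record in checkpoint([("t1", files), ("t2", files), ("t3", {"fd0": "/tmp/b"})]):
        print(record)
```

Here the table shared by `t1` and `t2` is written to the image once, and both tasks point at the same tag, so a restart could rebuild the sharing instead of creating two copies.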

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds