Checkpoint/restart: it's complicated

Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (guest, #69585)
In reply to: Checkpoint/restart: it's complicated by dlang
Parent article: Checkpoint/restart: it's complicated

No, you are missing my point. When I wrote "all processes related to a given job", I really meant all processes, on all cluster nodes.

Yes, it's complicated. But with the help of the MPI library you can close all connections in a synchronized way, since all inter-node connections are supposed to go through MPI. This is what BLCR + OpenMPI already do.
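To make the idea concrete, here is a toy sketch of a coordinated checkpoint. It is only an illustration under loud assumptions: threads stand in for MPI ranks, a `threading.Barrier` stands in for the synchronization the MPI library would provide, and the names `run_job` and `worker` are made up. It does not show how BLCR or OpenMPI are actually implemented.

```python
import threading


def run_job(n_ranks=4, steps=6, checkpoint_at=3):
    """Simulate n_ranks workers that take one coordinated checkpoint.

    Hypothetical stand-in: threads play the role of MPI ranks, and the
    barrier plays the role of the MPI-level synchronization.
    """
    barrier = threading.Barrier(n_ranks)
    lock = threading.Lock()
    checkpoint = {}  # rank -> (step, state) saved at the checkpoint
    final = {}       # rank -> state at the end of the job

    def worker(rank):
        state = 0
        for step in range(steps):
            state += rank + 1  # stand-in for real computation
            if step + 1 == checkpoint_at:
                # 1. Quiesce: every rank stops at the same point, so no
                #    message is in flight between ranks.
                barrier.wait()
                # 2. Each rank saves its own local state.
                with lock:
                    checkpoint[rank] = (step + 1, state)
                # 3. Nobody resumes until every rank has saved.
                barrier.wait()
        with lock:
            final[rank] = state

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoint, final


if __name__ == "__main__":
    ckpt, final = run_job()
    print("checkpoint:", sorted(ckpt.items()))
    print("final:     ", sorted(final.items()))
```

The two barriers are the point of the sketch: the first ensures all ranks have stopped communicating before anyone saves, and the second ensures the saved states all belong to the same step, so the whole job could be restarted from them consistently.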

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050)

There are two types of checkpoints here:

1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.

2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.

The former is usually done by combining a C/R mechanism with a C/R-aware implementation of e.g. MPI. The latter is trickier if one would like to do seamless live migration.

Linux-CR supports both.