Checkpoint/restart: it's complicated

Posted Nov 12, 2010 21:52 UTC (Fri) by Np237 (guest, #69585)
In reply to: Checkpoint/restart: it's complicated by dlang
Parent article: Checkpoint/restart: it's complicated

This is precisely why you need the help of the MPI library and the resource manager: so that all processes related to a given job can be handled at the same time.

Most of the things you describe are already handled by BLCR, although still in an imperfect way.

Checkpoint/restart: it's complicated

Posted Nov 13, 2010 6:37 UTC (Sat) by dlang (guest, #313)

you miss my point.

checkpointing the apps on any one system isn't good enough.

you need to checkpoint the app and everything that it talks to on _every_ system at once (and make sure that you do it at the same instant, so that no data can be in flight between systems and make the checkpoint inconsistent)
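To make the in-flight problem concrete, here is a toy sketch in Python. The `Node` class and its methods are invented for illustration and stand in for processes on two machines; nothing here is a real C/R API. Each node snapshots only its own local state, so anything still on the wire is missing from the saved images.

```python
from collections import deque

class Node:
    """Hypothetical toy process: owns some units, plus an inbox of in-flight units."""
    def __init__(self, units):
        self.units = units
        self.inbox = deque()

    def send(self, other, n):
        self.units -= n
        other.inbox.append(n)   # "in flight" until other.receive() runs

    def receive(self):
        while self.inbox:
            self.units += self.inbox.popleft()

    def snapshot(self):
        # Saves local process state only, not the contents of the network.
        return self.units

a, b = Node(10), Node(0)
a.send(b, 3)                          # 3 units are now on the wire
saved = (a.snapshot(), b.snapshot())  # per-node checkpoints, taken too early
b.receive()

live_total = a.units + b.units        # the running job still has everything
restored_total = sum(saved)           # a restart from `saved` has lost the 3 units
```

Restarting from `saved` would bring back a job that is missing the units that were in transit, which is the inconsistency described above.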

Checkpoint/restart: it's complicated

Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (guest, #69585)

No, you are missing my point. When I wrote "all processes related to a given job", I really meant all processes, on all cluster nodes.

Yes, it's complicated. But with the help of the MPI library you can close all connections in a synchronized way, since all inter-node connections are supposed to go through MPI. This is what BLCR + OpenMPI already do.
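As a rough illustration of the idea (not the actual BLCR or Open MPI protocol, whose details are not given here), the sketch below pauses every rank, drains whatever is still in flight, and only then snapshots. The `Rank` class and `coordinated_checkpoint` function are hypothetical names made up for this example.

```python
from collections import deque

class Rank:
    """Toy stand-in for one MPI rank; not a real MPI API."""
    def __init__(self, units):
        self.units = units
        self.inbox = deque()
        self.paused = False

    def send(self, other, n):
        if self.paused:
            raise RuntimeError("sends are blocked during a checkpoint")
        self.units -= n
        other.inbox.append(n)

    def drain(self):
        # Deliver every message that is still queued for this rank.
        while self.inbox:
            self.units += self.inbox.popleft()

def coordinated_checkpoint(ranks):
    for r in ranks:                   # 1. stop new traffic on every rank
        r.paused = True
    for r in ranks:                   # 2. deliver whatever is still in flight
        r.drain()
    saved = [r.units for r in ranks]  # 3. channels are empty: take the snapshot
    for r in ranks:                   # 4. let the job continue
        r.paused = False
    return saved

ranks = [Rank(10), Rank(0), Rank(5)]
ranks[0].send(ranks[1], 3)
ranks[2].send(ranks[0], 1)
saved = coordinated_checkpoint(ranks)
```

Because every channel is empty when the snapshot is taken, the saved states account for all 15 units, and the set of images can be restarted together.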

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050)

There are two types of checkpoints here:

1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.

2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an overheating node.

The former is usually done by combining a C/R mechanism with a C/R-aware implementation of, e.g., MPI. The latter is trickier if one wants seamless live migration.

Linux-CR supports both.
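Reduced to its bare bones, the second case looks something like the Python sketch below: save the state on one node, ship the image, and resume elsewhere. The `run`, `checkpoint` and `restart` functions are hypothetical and deal only with application-level state; a real C/R mechanism also has to capture and restore kernel-side state such as open files, sockets and process IDs, which this toy ignores entirely.

```python
import json

def run(state, steps):
    """Advance a toy computation: a running sum of squares."""
    for _ in range(steps):
        state["i"] += 1
        state["total"] += state["i"] ** 2
    return state

def checkpoint(state):
    # Serialize to an image that could be copied to another node.
    return json.dumps(state)

def restart(image):
    # Rebuild the state from the image on the destination node.
    return json.loads(image)

# Uninterrupted reference run of 10 steps.
reference = run({"i": 0, "total": 0}, 10)

# Same job, vacated from "node A" after 4 steps and resumed on "node B".
image = checkpoint(run({"i": 0, "total": 0}, 4))
migrated = run(restart(image), 6)
```

The migrated run ends in the same state as the uninterrupted one. Making that hold for arbitrary processes, without the application cooperating and without a visible pause, is the hard part.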