Checkpoint/restart: it's complicated
Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (guest, #69585) In reply to: Checkpoint/restart: it's complicated by dlang
Parent article: Checkpoint/restart: it's complicated
Yes, it's complicated. But with the help of the MPI library you can close all connections in a synchronized way, since all inter-node connections are supposed to go through MPI. This is what BLCR + OpenMPI already do.
Posted Nov 15, 2010 17:51 UTC (Mon)
by orenl (guest, #21050)
1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault tolerance.
2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.
The former is usually done by combining a C/R mechanism with a C/R-aware implementation of, e.g., MPI. The latter is trickier if one wants seamless live migration.
Linux-CR supports both.
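
To make the coordinated case (1) concrete, here is a minimal toy sketch in Python. It is not the BLCR, OpenMPI, or Linux-CR interface: threads stand in for MPI ranks, a threading.Barrier stands in for the MPI-level synchronization, and the names run_job, resume, and the "connected" flag are all made up for illustration. The point is only the ordering: every rank closes its connections, all ranks agree they are quiet, each saves its state, and only then does anyone resume.

```python
import threading


def run_job(num_ranks=3, steps=4, checkpoint_at=2):
    """Run toy 'ranks' that pause together, snapshot their state, then go on."""
    barrier = threading.Barrier(num_ranks)
    checkpoint = {}   # rank -> saved state (stand-in for a checkpoint image)
    final = {}        # rank -> state at the end of the job
    lock = threading.Lock()

    def rank_main(rank):
        state = {"step": 0, "value": rank}
        connected = True  # hypothetical stand-in for an open inter-node connection
        for step in range(steps):
            if step == checkpoint_at:
                connected = False        # 1. quiesce: close connections
                barrier.wait()           # 2. wait until every rank is quiet
                assert not connected     # nothing is in flight while saving
                with lock:
                    checkpoint[rank] = dict(state)  # 3. save local state
                barrier.wait()           # 4. wait until every snapshot is done
                connected = True         # 5. reconnect and resume
            state["step"] = step + 1
            state["value"] += 1
        with lock:
            final[rank] = state

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoint, final


def resume(saved, steps):
    """Continue one rank's work from its saved state (the 'restart' half)."""
    state = dict(saved)
    for step in range(state["step"], steps):
        state["step"] = step + 1
        state["value"] += 1
    return state
```

Calling run_job() and then resume(checkpoint[r], 4) for each rank r should give the same result as the uninterrupted run, which is the property a coordinated checkpoint is meant to provide. Case (2), live migration, is not modeled here.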