
Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (guest, #21050)
In reply to: Checkpoint/restart: it's complicated by Np237
Parent article: Checkpoint/restart: it's complicated

There are two types of checkpoints here:

1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault tolerance.

2) Checkpoint of processes on one (or more) nodes, followed by a restart on a different node (or set of nodes). This is useful for load balancing, but also for maintenance, e.g. vacating an overheating node.

The former is usually done by combining a C/R mechanism with a C/R-aware implementation of, e.g., MPI. The latter is trickier if one wants seamless live migration.

Linux-CR supports both.
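
To make the two cases concrete, here is a toy sketch in Python. It is only an illustration of the idea, not Linux-CR's interface: the Worker class and the functions coordinated_checkpoint, restart and migrate are hypothetical names, the "processes" are plain objects, and the checkpoint image is just JSON. A real C/R mechanism also has to capture memory, open files, sockets and so on, which is where the difficulty lies.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Worker:
    """Hypothetical stand-in for one process of a parallel job."""
    rank: int
    node: str
    step: int = 0
    total: int = 0

    def run(self, steps):
        # Advance the computation by a number of steps.
        for _ in range(steps):
            self.step += 1
            self.total += self.rank * self.step


def coordinated_checkpoint(workers):
    """Case 1: save all workers together, at one consistent point."""
    # All workers must be stopped at the same step (the "barrier"),
    # otherwise the saved image would not describe a consistent job state.
    if len({w.step for w in workers}) != 1:
        raise RuntimeError("workers are not quiesced at the same step")
    return json.dumps([asdict(w) for w in workers])


def restart(image):
    """Rebuild the whole job from a coordinated checkpoint image."""
    return [Worker(**state) for state in json.loads(image)]


def migrate(worker, new_node):
    """Case 2: checkpoint one worker and restart it on another node."""
    state = json.loads(json.dumps(asdict(worker)))  # checkpoint, then restore
    state["node"] = new_node
    return Worker(**state)


if __name__ == "__main__":
    job = [Worker(rank=r, node=f"node{r}") for r in range(3)]
    for w in job:
        w.run(5)
    image = coordinated_checkpoint(job)   # fault tolerance
    for w in job:
        w.run(2)                          # work lost if the job fails now
    job = restart(image)                  # every worker is back at step 5
    job[1] = migrate(job[1], "node7")     # vacate node1, e.g. overheating
    print([(w.rank, w.node, w.step) for w in job])
```

Note that the sketch stops the worker before moving it; seamless live migration has to keep the process running (and its connections alive) while the state is transferred, which this toy does not attempt.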


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds