the HPC software does its own checkpointing anyway, so that if the system crashes it can pick up at a reasonable point and not lose all the work that was done.
Checkpoint/restart: it's complicated
Posted Nov 12, 2010 20:22 UTC (Fri) by Np237 (subscriber, #69585)
For a dedicated cluster, it makes sense to adapt your code, but on general-purpose clusters you have hundreds of different codes running; it is much less expensive to implement system C/R if possible, instead of porting all the codes to be resilient.
Posted Nov 12, 2010 20:39 UTC (Fri) by dlang (✭ supporter ✭, #313)
saving TCP connection info does you no good unless the other end of the TCP connection gets checkpointed at the same instant.
saving pending disk writes does no good if the file you are writing to is off on some other system and will contain writes made after the checkpoint.
system-level c/r is useful for planned outages, but in HPC environments you have enough nodes that this is really not good enough: you will have unplanned outages, and unless your c/r can back out all these other side effects, it can't be used for those outages.
Posted Nov 12, 2010 21:52 UTC (Fri) by Np237 (subscriber, #69585)
Most of the things you describe are already handled by BLCR, although still in an imperfect way.
Posted Nov 13, 2010 6:37 UTC (Sat) by dlang (✭ supporter ✭, #313)
doing checkpointing of the apps on any one system isn't good enough.
you need to checkpoint the app and everything that it talks to on _every_ system at once (and make sure that you do it at the same instant, so that there's no chance of data being in flight between systems to make it inconsistent)
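To make the in-flight data problem concrete, here is a toy sketch in Python. It is purely illustrative: the two "nodes", the `wire` queue and the `run` function are made-up names, and nothing here corresponds to a real C/R tool. Node A sends 10 units to node B; if B is checkpointed while the message is still on the wire, the two snapshots together no longer add up.

```python
from collections import deque

def run(snapshot_b_before_delivery):
    """Two toy nodes pass a 10-unit message; return the total seen in the snapshots."""
    a, b = 50, 50            # hypothetical per-node state, 100 units in total
    wire = deque()           # messages sent but not yet delivered

    a -= 10                  # node A sends 10 units...
    wire.append(10)          # ...which are now in flight
    snap_a = a               # node A is checkpointed after the send

    if snapshot_b_before_delivery:
        snap_b = b           # node B is checkpointed while the message is in flight
        b += wire.popleft()  # delivery happens after the snapshot and is not recorded
    else:
        b += wire.popleft()  # the wire is drained first
        snap_b = b           # node B is checkpointed with nothing in flight

    return snap_a + snap_b

uncoordinated_total = run(True)
drained_total = run(False)
print(uncoordinated_total, drained_total)
```

Restarting from the first pair of snapshots would silently drop the 10 units that were in flight; the second pair is consistent only because nothing was on the wire when B was saved.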
Posted Nov 13, 2010 8:44 UTC (Sat) by Np237 (subscriber, #69585)
Yes, it's complicated. But with the help of the MPI library you can close all connections (since all inter-node connections are supposed to go through MPI) in a synchronized way. This is what BLCR + OpenMPI already do.
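The general idea of quiescing in a synchronized way can be sketched roughly as follows. This is a toy model using Python threads and queues, not the BLCR or Open MPI API; the `worker` function, the `inbox` queues and the two-barrier protocol are assumptions made for illustration only. Every rank stops sending, waits at a barrier, drains whatever is still pending, and only then saves its state.

```python
import queue
import threading

N = 3                                        # number of toy "ranks"
inbox = [queue.Queue() for _ in range(N)]    # stand-in for inter-node connections
barrier = threading.Barrier(N)
checkpoint = [None] * N                      # stand-in for per-rank checkpoint files

def worker(rank):
    state = 0
    # Work phase: each rank sends one message to its neighbour.
    inbox[(rank + 1) % N].put(rank + 1)
    # Quiesce: nobody sends past this point, so nothing new enters the queues.
    barrier.wait()
    # Drain: fold every message still pending for this rank into its state.
    while True:
        try:
            state += inbox[rank].get_nowait()
        except queue.Empty:
            break
    # Wait until every rank has drained, then save the state.
    barrier.wait()
    checkpoint[rank] = state

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(checkpoint)
```

Because every message was either never sent or already delivered when the states were saved, the set of checkpoints has no data in flight to lose.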
Posted Nov 15, 2010 17:51 UTC (Mon) by orenl (subscriber, #21050)
1) Coordinated checkpoint of all participating processes (across the cluster) so that the entire job can be restarted later from this checkpoint. This is useful for fault-tolerance.
2) Checkpoint of processes on one (or more) nodes and then restart on a different node (or set of nodes). This is useful for load-balancing, but also for maintenance, e.g. by vacating an over-heating node.
The former is usually done by combining a C/R mechanism with a C/R-aware implementation of e.g. MPI. The latter is trickier if one would like to do seamless live migration.
Linux-CR supports both.
Posted Nov 15, 2010 17:55 UTC (Mon) by orenl (subscriber, #21050)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds