Checkpoint/restart: it's complicated

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:36 UTC (Fri) by daglwn (guest, #65432)
In reply to: Checkpoint/restart: it's complicated by Np237
Parent article: Checkpoint/restart: it's complicated

You are seriously underestimating the complexity of these systems. We're going to see a node failure every few hours, at best. We really need something much better than CR. No matter how CR is implemented, it won't scale.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432)

All IMHO, of course. I am far from an OS, filesystem or CR expert, but I am aware of the trends.

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050)

There are serious challenges in making large HPC environments resilient, and C/R functionality is an essential component in getting there.

The main scalability bottleneck IMHO is that most distributed C/R approaches require a barrier point for all participating processes across the cluster (the alternative, asynchronous C/R with message logging, has many issues of its own).
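To make the barrier cost concrete, here is a minimal single-machine sketch, with threads standing in for cluster processes. The names `run_workers`, `checkpoint_every` and the toy per-worker state are assumptions made up for illustration; a real distributed C/R system would synchronize over the network and save far more than one integer per process.

```python
import threading

def run_workers(num_workers, steps, checkpoint_every):
    """Toy coordinated checkpointing: all workers meet at a barrier first."""
    barrier = threading.Barrier(num_workers)
    state = [0] * num_workers   # hypothetical per-worker application state
    checkpoints = []            # each entry is one globally consistent snapshot

    def worker(rank):
        for step in range(1, steps + 1):
            state[rank] += rank + 1   # stand-in for real computation
            if step % checkpoint_every == 0:
                # Nobody may proceed until every worker has arrived;
                # the slowest worker sets the pace for the whole job.
                arrival = barrier.wait()
                if arrival == 0:
                    # Exactly one worker records the snapshot.
                    checkpoints.append(list(state))
                # Second barrier: keep everyone paused until it is saved.
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Every checkpoint costs two full synchronizations, so the time lost grows with the slowest participant rather than the average one, which is roughly why this step tends to hurt as the process count rises.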

However, write-back of data to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you can write the checkpoints to "memory servers", which are faster than disks. Such servers need a lot of memory, but don't need much in the way of CPUs and cores. You could speed things up using RDMA. For both methods, it's also useful to keep some redundancy so data can be recovered should a node (or more) disappear.
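One simple way to get that kind of redundancy is XOR parity across the per-node checkpoints, RAID-5 style. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, checkpoints are treated as equal-length byte strings, and it tolerates the loss of exactly one node per parity group. Surviving several simultaneous failures would need a stronger erasure code.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(checkpoints):
    """XOR all checkpoint blobs in a group into a single parity blob."""
    size = len(checkpoints[0])
    if any(len(c) != size for c in checkpoints):
        raise ValueError("checkpoints must be padded to equal length")
    return reduce(xor_bytes, checkpoints)

def recover(surviving, parity):
    """Rebuild the one missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)

# Usage: three nodes checkpoint locally; the parity blob lives on a fourth.
node_checkpoints = [b"node0", b"node1", b"node2"]
parity = make_parity(node_checkpoints)
# Node 1 disappears; its checkpoint is rebuilt from the other two plus parity.
rebuilt = recover([node_checkpoints[0], node_checkpoints[2]], parity)
```

The parity blob is the size of a single checkpoint, so the extra storage per group is one checkpoint's worth regardless of how many nodes are in the group.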
