Checkpoint/restart: it's complicated
Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432)
Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050)
There are serious challenges in making large HPC environments resilient, and C/R functionality is an essential component of getting there.
The main scalability bottleneck IMHO is that most distributed C/R approaches require a barrier point for all participating processes across the cluster (the alternative - asynchronous C/R with message logging - has many issues of its own).
However, writing the data back to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you could write the checkpoints to "memory servers", which is faster than disks. Such servers need a lot of memory, but don't care about CPUs and cores. You could speed things up further using RDMA. With either method, it is also useful to keep some redundancy so data can be recovered should a node (or more) disappear.
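The redundancy idea can be sketched as simple replication across memory servers. In this toy model each dict stands in for one server's RAM; the server layout, function names, and replica count are illustrative assumptions, not any real system's API. A node's checkpoint goes to its home server and is replicated to the next one, so any single server can fail without losing data.

```python
NUM_SERVERS = 4
servers = [{} for _ in range(NUM_SERVERS)]  # each dict = one memory server

def store_checkpoint(node_id, blob, replicas=2):
    """Place `replicas` copies of the checkpoint on consecutive servers."""
    for i in range(replicas):
        servers[(node_id + i) % NUM_SERVERS][node_id] = blob

def recover_checkpoint(node_id, failed=()):
    """Fetch the checkpoint from any surviving server holding a copy."""
    for idx, srv in enumerate(servers):
        if idx not in failed and node_id in srv:
            return srv[node_id]
    raise KeyError(f"checkpoint for node {node_id} lost")

store_checkpoint(0, b"state-of-node-0")
# Even with node 0's home server down, the replica on server 1 survives:
print(recover_checkpoint(0, failed={0}))  # b'state-of-node-0'
```

Real systems would use erasure coding rather than full replication to cut the memory overhead, but the recovery property is the same.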
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds