User: Password:
Subscribe / Log in / New account

Checkpoint/restart: it's complicated

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:15 UTC (Fri) by Np237 (subscriber, #69585)
In reply to: Checkpoint/restart: it's complicated by daglwn
Parent article: Checkpoint/restart: it's complicated

You don’t need to checkpoint every minute. It would already saturate the filesystem on current clusters. But you can do that every now and then, and if your filesystem cannot handle storing all the RAM of your compute nodes in a small time, it means it is already not up to the task.

I expect an exascale computer to come with an exascale filesystem. If the CR is correctly implemented, it means exabytes of data being written to it simultaneously, with no other bottleneck than a barrier to halt all processes in a consistent state. In the end it’s the same amount of data that the computation would write when it ends to store the result. If the cluster cannot handle that, it’s just an entry in the Top500 dick size contest, not something to do serious work.

The network speed should definitely not be a problem. It should be able to transfer all your node memory’s contents in less than a second. Otherwise it will just not be able to run computations that exchange a lot of data.

(Log in to post comments)

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 20:17 UTC (Fri) by Np237 (subscriber, #69585) [Link]

§2: I meant petabytes, of course. It’s hard enough without needing petabytes.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:36 UTC (Fri) by daglwn (guest, #65432) [Link]

Your are seriously underestimating the complexity of these systems. We're going to have node failures on the order of hours, at best. We really need something much better than CR. No matter how CR is implemented, it won't scale.

Checkpoint/restart: it's complicated

Posted Nov 12, 2010 22:40 UTC (Fri) by daglwn (guest, #65432) [Link]

All IMHO, of course. I am far from an OS, filesystem or CR expert, but I am aware of the trends.

Checkpoint/restart: it's complicated

Posted Nov 15, 2010 17:44 UTC (Mon) by orenl (guest, #21050) [Link]

There are serious challenges to make large HPC environments resilient, and C/R functionality is an essential component in getting there.

The main scalability bottleneck IMHO is that most distribute-C/R approaches require a barrier point for all participating processes across the cluster (the alternative - asynchronous C/R with message logging - has many issues).

However, write-back of data to disk is not necessarily the biggest concern. First, you could write the checkpoints to _local_ disks (rather than flood the NAS). Second, you can write the checkpoints to "memory servers" which is faster than disks. Such servers need a lot of memory, but don't care about CPUs and core. You could speed things up using RDMA. For both methods, it's useful also to keep some redundancy so data can be recovered should a node (or more) disappear.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds