Preparing for user-space checkpoint/restore

Posted Jul 23, 2012 3:09 UTC (Mon) by bergwolf (subscriber, #55931)
Parent article: Preparing for user-space checkpoint/restore

I'm somewhat confused: what do HPC people use when they checkpoint/restart? I've been told many times that HPC applications checkpoint and restart very often. But how?



Preparing for user-space checkpoint/restore

Posted Jul 23, 2012 7:17 UTC (Mon) by dlang (✭ supporter ✭, #313)

HPC applications periodically save their own state to a file. That lets you kill the app, move the state file to another machine, and start the app again, picking up where it left off.

Doing this inside an app is fairly easy, as long as either re-doing the work since the last checkpoint is not a problem, or you can send the app a signal meaning "stop working and save a checkpoint now".
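
To make the in-app approach concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not something taken from a real HPC code: the JSON state file, the use of SIGUSR1 as the "save now" signal, and the names run, save_checkpoint, load_checkpoint, request_checkpoint and install_handler are all hypothetical. The "work" is a trivial running sum standing in for a real computation.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler; polled by the work loop (hypothetical design).
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: ask the work loop to save its state and stop."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler():
    """Assumption: SIGUSR1 means 'stop working and save a checkpoint now'."""
    signal.signal(signal.SIGUSR1, request_checkpoint)


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint,
    so a crash in the middle of a save never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(path, total_steps, checkpoint_every=1000):
    """Resume from the checkpoint at `path` and work until done or signalled.

    Returns (state, finished). Work done after the last periodic checkpoint
    is simply redone if the process is killed without warning.
    """
    global checkpoint_requested
    state = load_checkpoint(path)
    while state["next_step"] < total_steps:
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        if checkpoint_requested:
            save_checkpoint(path, state)
            checkpoint_requested = False
            return state, False
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state, True
```

In this sketch, a program would call install_handler() once and then run("state.json", n). To move the job, send the process SIGUSR1, copy state.json to the other machine, and call run() there with the same arguments. The state file is the only thing that has to travel, which is why this is so much simpler than checkpointing an arbitrary process from outside.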

Doing this at the OS level, so that it works with arbitrary apps without the app (or the other systems it is communicating with) even knowing that it has taken place, is very hard. That is the problem you are seeing worked on.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds