HPC applications periodically save their state so that you can kill the app, move the state file to another machine, and start the app again, picking up where it left off.
Doing this inside an app is fairly easy, as long as either redoing the work done since the last checkpoint is not a problem, or you can send the app a signal that means "stop working and save a checkpoint now".
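To make the in-app case concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the CheckpointedCounter class, the JSON state file, and the install_handler helper are assumptions, not anything from a real HPC code. It saves state every so often, and it also saves and exits at the next step boundary when asked to stop.

```python
import json
import os
import signal
import tempfile


class CheckpointedCounter:
    """Toy job (hypothetical): sums 0..total_steps-1 and can resume from a file."""

    def __init__(self, path, total_steps, checkpoint_every=100):
        self.path = path
        self.total_steps = total_steps
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.acc = 0
        self.stop_requested = False

    def load(self):
        # Resume from the last checkpoint if one exists, else start from zero.
        if os.path.exists(self.path):
            with open(self.path) as f:
                state = json.load(f)
            self.step = state["step"]
            self.acc = state["acc"]

    def save(self):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": self.step, "acc": self.acc}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def request_stop(self, signum=None, frame=None):
        # Signal handlers should do as little as possible: just set a flag.
        self.stop_requested = True

    def run(self):
        # Returns True if the job finished, False if it stopped early.
        while self.step < self.total_steps:
            self.acc += self.step
            self.step += 1
            if self.stop_requested:
                self.save()
                return False
            if self.step % self.checkpoint_every == 0:
                self.save()
        self.save()
        return True


def install_handler(job, signum):
    # Hypothetical helper: "stop working and save a checkpoint now".
    signal.signal(signum, job.request_stop)
```

On a Unix box you would call something like install_handler(job, signal.SIGUSR1) before job.run(), then copy the state file to the other machine and call load() followed by run() there. If the app dies without getting the signal, it just redoes at most checkpoint_every steps after the restart.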
Doing this at the OS level, so that it works with arbitrary apps and neither the app nor the other systems it is communicating with ever know that it has happened, is very hard. That is the problem you are seeing worked on.