I think we HPC folks need more than this, and the project with the head start in
this area is BLCR (Berkeley Lab Checkpoint/Restart), a hybrid kernel/user-space
solution:
http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
O/S support alone isn't enough for this; you need support in the MPI stacks too,
and BLCR is already supported by Open MPI. You also want support in the queueing
systems, and Torque (derived from OpenPBS) now has initial BLCR support.
There's a nice presentation on BLCR from GlobusWorld earlier this year:
http://www.globusworld.org/E.Roman-BLCROverview080515.pdf...