In this situation, pure userland checkpointing is a joke. It will never be able to store the state of network communications and of the various funny things that scientific code developers can imagine.
Virtualization is not a solution either. The performance impact is too high, and youd have to plug in the hosts anyway for things like RDMA support.
This is where in-kernel checkpointing comes handy. Its not sufficient either: you need to have support in the MPI library as well, so that all processes put themselves in a consistent state to be checkpointed together.
I used to be a SuperUX sysadmin; I doubt many of you here have even heard the name, but seamless checkpoint/restart in the whole stack gives very impressive results, and in the end saves a lot of CPU cycles. There are not many systems where you can change a kernel setting and reboot, at 6PM on a Friday evening, all without the users noticing and without any doubt about causing problems during the weekend. HPC on Linux is still very, very far from this.
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds