As two of the DMTCP developers, we thought we would add some small comments.
We should first say that we have the highest respect for the Linux C/R project, and that we continue to talk with its developers and exchange information.
1) In response to Np237, DMTCP does directly handle distributed processes in a pure userland fashion. For example, it can checkpoint OpenMPI as if it were just one more black box distributed computation. It handles the network communication through a strategy of draining sockets. In one phase, each host sends a special cookie through each socket. In the next phase, each host reads from all sockets until seeing the cookie. The data drained from the network can be re-inserted upon restart or resume from checkpoint. We do indeed use barriers, as mentioned by orenl. If anyone encounters a bug in using DMTCP for distributed processes, please do let us know so that we can fix it.
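To make the two-phase drain concrete, here is a minimal sketch in Python. It is not DMTCP's code: the cookie value, the function names, and the use of a local socketpair in place of real network connections are all assumptions made for illustration. The sketch also re-inserts the drained bytes by having the original sender write them again on resume, which is one possible approach and not necessarily what DMTCP does internally.

```python
import socket

# Hypothetical marker; the sketch assumes it never occurs in application data.
COOKIE = b"\x00DRAIN-COOKIE\x00"


def send_cookie(sock):
    """Phase 1: mark the end of in-flight data on this socket."""
    sock.sendall(COOKIE)


def drain_until_cookie(sock):
    """Phase 2: read everything the peer had in flight, up to its cookie."""
    buf = b""
    while not buf.endswith(COOKIE):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before its cookie arrived")
        buf += chunk
    return buf[:-len(COOKIE)]  # keep the drained data, drop the cookie


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf


def demo():
    # A local socketpair stands in for a connection between two hosts.
    a, b = socket.socketpair()

    # Application data that is still in flight when the checkpoint starts.
    a.sendall(b"hello from a")
    b.sendall(b"hello from b")

    # Phase 1: every endpoint sends the cookie through every socket.
    for s in (a, b):
        send_cookie(s)

    # Phase 2: every endpoint reads until it sees the cookie.
    drained_at_b = drain_until_cookie(b)  # bytes that a had sent
    drained_at_a = drain_until_cookie(a)  # bytes that b had sent

    # ... the checkpoint image would be written here, with the network empty ...

    # Resume: put the drained bytes back into the connection (assumed
    # approach: the original sender simply writes them again).
    a.sendall(drained_at_b)
    b.sendall(drained_at_a)

    # The application now sees its data as if nothing had happened.
    got_at_b = recv_exact(b, len(drained_at_b))
    got_at_a = recv_exact(a, len(drained_at_a))

    a.close()
    b.close()
    return drained_at_a, drained_at_b, got_at_a, got_at_b


if __name__ == "__main__":
    print(demo())
```

In a real distributed run the barrier between the two phases matters: no endpoint should start draining until every endpoint has stopped sending application data and written its cookie.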
2) Np237 also points out:
> Most of the things you describe are already handled by BLCR, although still in an imperfect way.
This is a good point, and we agree.