"Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability."
Was that exactly Ric's point -- that the applications had to checkpoint themselves? Or did he just say that being able to checkpoint applications was necessary? I ask because there's a big difference. Expecting all applications that might be run in these environments to explicitly checkpoint themselves just isn't practical. Look at how many non-HPC applications use BLCR for example.
The alternative is to enable "external" checkpointing. Checkpoints that don't require rewriting the application, or ld preloads, etc. There is already an effort underway to push this to mainline: