A rough restart for checkpoints

By Jonathan Corbet
May 5, 2010
Back in February, the checkpoint/restart patch set was brought to the kernel mailing list with a request for inclusion in the -mm tree. That was immediately prior to the 2.6.34 merge window, so there were limited amounts of developer attention available for review. At that time, Andrew Morton suggested:

I'd suggest waiting until very shortly after 2.6.34-rc1 then please send all the patches onto the list and let's get to work.

The checkpoint/restart developers did post the patches in March, to relatively little response. Shortly before the 2.6.35 merge window, they reposted the whole thing as a 100-patch series. Unsurprisingly, there have been some complaints about the massive mailing, but there is another, less fortunate outcome: the patches are not being looked at.

That, too, is unsurprising. The amount of developer time available for patch review is insufficient in the best of times, and it gets worse as the merge window approaches. Even the most seasoned reviewer is going to be a bit intimidated by a 100-patch series which pokes its fingers into almost every part of the core kernel. Most of them will decide that they have more important things to do elsewhere.

So, once again, checkpoint/restart is likely to be put on hold until after the next merge window. After that, if it comes back in more manageable pieces, the developers might truly get to work.



A rough restart for checkpoints

Posted May 5, 2010 21:16 UTC (Wed) by mhelsley (subscriber, #11324) [Link]

"Instead, the checkpoint/restart developers waited until shortly before the 2.6.35 merge window".

That's quite an exaggeration.

The date on 2.6.34-rc1 is Mon Mar 8 10:45:44 2010 -0800.
The date on the v20 post was Wed, 17 Mar 2010 11:48:11 -0400:

http://lwn.net/Articles/379013/

Oops

Posted May 5, 2010 21:41 UTC (Wed) by corbet (editor, #1) [Link]

Not an exaggeration, just a screwup on my part; I'd forgotten the previous posting. My mistake; I've edited the text to reflect, I hope, something closer to reality.

The good news is that the edition isn't set to be published for a while yet. So probably nobody but you saw that version of the article.

Oops

Posted May 5, 2010 22:02 UTC (Wed) by mhelsley (subscriber, #11324) [Link]

Thank you for the correction.

A rough restart for checkpoints

Posted May 6, 2010 14:47 UTC (Thu) by rwmj (subscriber, #5474) [Link]

Isn't it better to change the processes to make it possible to restart them? This is generally a good idea anyway, either for long-running scientific jobs, or user apps that have to run on embedded devices with limited memory. Even major web browsers can be gracefully killed and restarted these days; every app should be like that.

A rough restart for checkpoints

Posted May 6, 2010 17:37 UTC (Thu) by spotter (subscriber, #12199) [Link]

A web browser can reload a web page, but are you guaranteed that you'll be looking at the same web page on restart?

A rough restart for checkpoints

Posted May 6, 2010 22:10 UTC (Thu) by mhelsley (subscriber, #11324) [Link]

In some ways changing the process to implement checkpoint/restart is potentially better because the application is in the best position to minimize the quantity of checkpointed information.

However, it's unlikely that many existing applications will be changed to do this. The cost is typically greater code complexity, and the expected return, especially for non-embedded desktop apps, may not justify the effort. Worse, each application is likely to present a very different [user] interface.

Lastly, restart à la Firefox and the like does not address checkpointing and restarting a collection of applications, e.g. Firefox and OpenOffice together. That suggests a common framework -- something that is notoriously difficult to mandate. You'd probably see developers "start from scratch" -- perhaps much like Android -- instead of porting code to such a framework.

All of that makes the application-specific checkpoint/restart solution much more difficult if not impossible to use on containers.

Checkpoint/restart support within the kernel avoids these problems.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds