LWN.net Logo

Kernel-based checkpoint and restart

By Jonathan Corbet
August 11, 2008
Your editor, who has carefully hidden several years of experience in Fortran-based scientific programming from this readership, encountered checkpoint and restart facilities a long time ago. In those days, programs which would run for days of hard-won CPU time on an unimaginably fast CDC or Cray mainframe would occasionally checkpoint themselves, minimizing the amount of compute time lost when (not if) the system went down at an inopportune time. It was a sort of insurance policy, with the premiums being paid in the form of regular checkpoint calls.

Central processor time is no longer in such short supply, but there is still interest in the ability to checkpoint a running application and restore its state at some future time. One obvious application of this capability is to restore the application on a different machine; in this way, running applications can be moved from one host to another. If the "application" is an entire container full of tasks, you now have the ability to shift those containers around without the contained tasks even being aware of what is going on. That, in turn, can provide for load balancing, or just the ability to move containers off a machine which is being taken down.

Linux does not have this capability now. Anybody who thinks about adding it must certainly find the prospect daunting; applications have a lot of state hidden throughout the system. This state includes open files (and positions within the files), network sockets and pipes connected to remote peers, signal states, outstanding timers, special-purpose file descriptors (for epoll_wait(), for example), ptrace() status, CPU affinities, SYSV semaphores, futexes, SELinux state, and much more. Any failure to save and properly restore all of that state will result in a broken process. It is no wonder that Linux does not do checkpoint and restart; most rational developers would be driven away by the complexities involved in making it work in an even remotely robust manner.

But, then, there was a time when rational programmers would not have attempted the creation of Linux in the first place. So it should not be surprising to see that developers are working on the checkpoint and restart problem. The latest attempt can be seen in this patch set posted by Dave Hansen (but originally written by Oren Laadan). It is far from being ready for prime-time use, but it does show the sort of approach which is being taken.

For some time, the prevailing wisdom was that checkpoint and restart should be pushed as much into user space as possible. A user-space process could handle the marshaling of process state and writing it to a file; the kernel would only get involved when it was strictly necessary. It turns out, though, that this involvement is required fairly often, requiring the addition of "lots of new, little kernel interfaces" to make everything work. So, at a meeting at OLS, the checkpoint/restart developers decided to take a different approach and move the work into the kernel. The result is the creation of just two new system calls:

    int checkpoint(pid_t pid, int fd, unsigned long flags);
    int restart(int crid, int fd, unsigned long flags);

A call to checkpoint() will write an image of the current process to the given fd. The pid argument identifies the init process for the current process's container; it is saved to the image but not otherwise used in the current patch. If the operation succeeds, the return value will be a unique (until the system reboots) "checkpoint image identifier". restart() reverses the process; crid is the image identifier, which is not currently used. The flags argument is currently unused in both system calls. These interfaces seem likely to change; future enhancements to the interface are likely to include capabilities like checkpointing other processes and groups of processes.

The CAP_SYS_ADMIN capability is currently required for both checkpoint() and restart(). That is somewhat unfortunate, in that it would be nice if ordinary, unprivileged processes were able to checkpoint and restart themselves. There are some real security implications which must be kept in mind, though, especially when one considers the sort of damage that could result from an attempt to restart a carefully-manipulated checkpoint image. Making restart() secure for unprivileged use will not be a job for the faint of heart.

At this stage of development, the patch does not even attempt to solve the entire problem. It is able to save the current state of virtual memory (but only in the absence of non-private, shared mappings), current processor state, and the contents of the task structure. That is enough to checkpoint and restart a "hello, world" program, but not a whole lot more. But that is a reasonable place to start. Given the complexity of the problem, proceeding in careful baby steps seems like the right way to go. So we're probably not going to have a working checkpoint facility in the kernel in the near future, but, with luck and patience, we'll eventually have something that works.


(Log in to post comments)

Out-of-tree project has been doing checkpointing and restore for some time

Posted Aug 14, 2008 4:17 UTC (Thu) by dowdle (subscriber, #659) [Link]

One word...errr name...  OpenVZ.

As you probably already know, in 2005 OpenVZ was spun off as a FOSS project from SWsoft's
commercial Virtuozzo product and as a result a free, complete implementation of containers has
been available for Linux for over three years.  Then a little over a year ago OpenVZ added
checkpointing with restore and live migration features.  OpenVZ is definitely a well tested
and widely deployed out-of-tree patch for the Linux kernel... although strangely enough, none
of the "enterprise" distributions have picked it up yet.

Linux-VServer is also an out-of-tree, FOSS containers project of high quality and wide
deployment... that has been around for some time... but it doesn't include checkpointing.
Linux-VServer does have some unique and interesting features that OpenVZ doesn't but I'll not
get into those. The Linux-VServer developers have decided, for reasons I'll not get into, to
stay an out-of-tree patch and are not directly involved in the effort to move containers into
the mainline kernel... as far as I can tell.

The OpenVZ developers are (probably) the top contributors of new code going into the mainline
kernel relating to containers (starting with 2.6.24).  It is my understanding that all of
their additions to mainline have been written completely from scratch and do not borrow from
the OpenVZ code at all.  Given the nature of mainline kernel development... small, slow,
incremental steps that are judged by all of the stake holders... I imagine it was impossible
to reuse the existing, fully implemented OpenVZ patch.  OpenVZ is a largish patch which
contains significant changes to many kernel subsystems to make them container aware and since
it is feature complete and mature... it is more work to break it back down into code suitable
for a mainline... than it is to write completely new code.

I do wonder how long it is going to take before a complete implementation of containers
appears in the mainline kernel.  A year ago, conventional wisdom said about one year.  A year
has passed and although significant progress has been made, it sure seems a long, long way
off.  I've heard informed people revise their guess to be three to five years.  That sounds
kind of sad because other forms of virtualization are ready now and are growing in popularity
and deployments.  I have to wonder if starting over from scratch and adapting to the slow,
methodical staging method required by the mainline kernel developers is going to have an
impact on the viability of containers in the long run.  One hope is that starting from scratch
will potentially lead to a much cleaner / more maintainable, better implementation of
containers... but will it?  I'm not sure.

I understand that it is not possible to simply plop the huge patch that is OpenVZ into the
mainline kernel... and there are (always) those who have serious complaints about designs not
their own.  The mainline kernel developers would surly question why SWsoft (merged with
Parallels and renamed to Parallels, Inc.) developed Virtuozzo / OpenVZ outside of mainline to
begin with and say that it is Parallels' own fault that the current process is going to
progress as it is.

Where was I going to go with all of this?  Oh yeah...  anyhoo...

It reminds me of the tragic story of some file systems that have yet to make it into
mainline... and probably countless other projects I'm less familiar with.  It would be nice if
there was a way to somehow automate the slicing up and incrementalization (a new word I just
made up) of a large existing codebase so it could become acceptable to reasonable people like
the mainline kernel developers.

Jon, any comments into this sort of situation?  Is starting from scratch and having a lot of
other developers involved, each working on the particular feature(s) that most interests
them... perhaps without as strong of an overall vision that OpenVZ and Linux-VServer had
toward a complete containers solution... could it possibly result in as good or an even better
container implementation than we already have with OpenVZ and Linux-VServer?  Or are the odds
in our favor or not?

It is a little comical to see you report on the baby steps that are being taken... when the
two complete and mature solutions have existed for a long time... but I am definitely glad to
see you cover container related kernel development and strongly encourage you to keep it up.

this is one process at a time

Posted Aug 15, 2008 6:59 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

for everyone pointing at virutaliztion, hibernation, etc, this allows you to just checkpoint
and restore just a single process. you don't have to save the entire system

if this can be made to work reliably on one system it can group (with the help of namespaces)
the ability to take the image and resume it on a different system.

this is one process at a time

Posted Aug 16, 2008 3:48 UTC (Sat) by kolyshkin (subscriber, #34342) [Link]

Actually there's no way to checkpoint a single process, just because it has lots of relations
with other processes. Starting from parent-child relationship (a process is at least someone's
child and the parent do not expect a child to "disappear" suddenly), down to inter-process
communication such as inter-process pipes etc., a process just can't be torn apart from its
neighbourhood and be checkpointed.

That is why containers are a prerequisite to checkpointing. A container is a self-sustained
process group not tied to any other processes, and thus it can be checkpointed.

It's interesting that some people think a container is just a thing needed for checkpointing,
while others think of checkpointing as just yet another feature of containers.

Kernel-based checkpoint and restart

Posted Aug 14, 2008 12:31 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link]

I think us HPC folks need more than this and the project with the head start in
this area is BLCR (Berkeley Lab Checkpoint/Restart), a hybrid kernel/user space
solution .

http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

You need more than O/S support for this, you need support in the MPI stacks too 
and BLCR is already supported by OpenMPI.  You also want support in the queueing
systems, and Torque (derived from OpenPBS) now has initial BLCR support.

There's a nice presentation on BLCR from GlobusWorld earlier this year:

http://www.globusworld.org/E.Roman-BLCROverview080515.pdf...

Existing alternatives

Posted Aug 14, 2008 13:50 UTC (Thu) by pjm (subscriber, #2080) [Link]

A related facility is of course suspend to disk, which is already available and working.

cryopid: user-space checkpoint and restart

Posted Aug 15, 2008 1:57 UTC (Fri) by lipak (guest, #43911) [Link]


There is a user-space checkpoint and restart project
that does not seem to have got much notice.

	http://cryopid.berlios.de/

A quote from its home page about its features:

  - Can run as an ordinary user! (no root privileges needed)
  - Works on both 2.4 and 2.6.
  - Works on x86 and AMD64.
  - Can start & stop a process multiple times
  - Can migrate processes between machines and between
    kernel versions (tested between 2.4 to 2.6 and 2.6 to 2.4).


cryopid: user-space checkpoint and restart

Posted Aug 15, 2008 18:38 UTC (Fri) by rise (guest, #5045) [Link]

Ckpt (http://pages.cs.wisc.edu/~zandy/ckpt/) was another user-space solution that worked quite
well for me until glibc & kernel changes bit-rotted it.  Victor Zandy revised it a few times
to keep it alive, but it was originally a PhD research project and he moved on to other
things.  It went down the LD_PRELOAD route that cryopid seems to be avoiding and I suspect
that might have made it more fragile.

Kernel-based checkpoint and restart

Posted Aug 16, 2008 3:54 UTC (Sat) by kolyshkin (subscriber, #34342) [Link]

I would like to add that Andrey Mirkin has presented a paper on checkpointing and live
migration during the last Linux Symposium. The paper, available at
http://ols.fedoraproject.org/OLS/Reprints-2008/mirkin-rep... (~200K PDF), goes into some
details on how CPT is implemented in OpenVZ. It also lists a few other existing checkpointing
implementations -- some were already mentioned in the above comments.

Kernel-based checkpoint and restart

Posted Aug 16, 2008 11:37 UTC (Sat) by Brenner (subscriber, #28232) [Link]

Another checkpointing solution from the HPC world is Kerrighed.

Kerrighed (http://www.kerrighed.org, GPL patches available for Linux 2.6.20) supports
transparent process checkpointing, and much more as a Single System Image operating system for
clusters.


Kernel-based checkpoint and restart

Posted Aug 21, 2008 12:17 UTC (Thu) by ketilmalde (guest, #18719) [Link]

I guess both OpenSSI and openMosix bears mention, too. We used to run a small oM cluster, and
for many tasks, it was a really easy way to get a lot of parallelism - often simply using
"make -j", letting "make" spawn parallel processes, and oM distribute jobs across the cluster.

Kernel-based checkpoint and restart

Posted Aug 21, 2008 12:21 UTC (Thu) by stevelord (guest, #53493) [Link]

Bit late to the game here, but here is an early implementation I did back in 1993, the running
process was saved as a new executable which you could just fire up again. Yes, I had seen
inside the Cray implementation, and trust me, this bears no resemblance to it, and you would
run screaming for the hills if you saw it!

http://ftp.metalab.unc.edu/pub/historic-linux/ftp-archive...



Kernel-based checkpoint and restart

Posted Aug 22, 2008 20:04 UTC (Fri) by nix (subscriber, #2304) [Link]

It's unexec()! Yay!

(You're an Emacs user, I trust. That or a Lisp hacker. Nobody else ever 
treats memory images as something *executable* :) )

Kernel-based checkpoint and restart

Posted Sep 17, 2008 2:08 UTC (Wed) by roelofs (guest, #2599) [Link]

Nobody else ever treats memory images as something *executable* :)

Wot's this, then? That's the very definition of a DOS .COM file! I wrote more than a couple of those m'self...

Greg

Kernel-based checkpoint and restart

Posted Sep 17, 2008 7:25 UTC (Wed) by nix (subscriber, #2304) [Link]

They're not generally produced by snapshotting a running process (the same
one that's doing the snapshotting) and dumping it out to disk. (Also,
unexec() can produce things like ELF files, and as far as I recall it does
it without using the OS's core-dumping mechanism.)

Kernel-based checkpoint and restart

Posted Oct 20, 2008 18:52 UTC (Mon) by yohahn (subscriber, #4107) [Link]

The condor project has had a userspace checkpointing capability available in Linux since 1995ish. It is limited in that it cannot checkpoint threads that have a kernel component to them, or Multiple processes (no fork() :( )
Condor currently decided they won't checkpoint dynamic libraries, but that may change in the future.

http://www.cs.wisc.edu/condor

In particular:
http://www.cs.wisc.edu/condor/manual/v7.1/4_2Condor_s_Che...

If you have questions about it, you can email condor-admin@cs.wisc.edu.

Kernel-based checkpoint and restart

Posted Dec 18, 2008 0:20 UTC (Thu) by davidegolf (guest, #55649) [Link]

I work for Bull Worldwide Information Systems on the GCOS 8 virtual operating system. The virtual version was originally developed by Honeywell in the early 1980s. I wrote the original specification for the virtual program checkpoint - restart facility. We use a kernel based approach and wanted to include references to communication stacks and database buffer pools whose structures were maintained in secure, shared memory.

Early in the design phase I concluded that allowing a user to access the content of the checkpoint image would make security hopeless. Therefore, in support of our virtual checkpoint, we built a checkpoint database. Unprivileged users could ask the kernel to produce a checkpoint image to be created on the database. The user could refer to the last checkpoint image, by default. They could also refer to prior checkpoint images produced by their userid by date-time stamps.

When the kernel is requested to perform a restart, the primary checks are to assure that the userid matches, the requesting process is appropriate, and that the software installed at the time of the checkpoint matches the software currently installed. Since the user is NEVER allowed direct access to the checkpoint image content, there is no concern that the image has been manipulated. This prevents unnecessary errors and security breaches.

As an aside, we defined callbacks to the software maintaining data structures in shared memory areas. At checkpoint time, all data files were closed. They were reopened after the checkpoint or restart. This caused all pointers to shared memory areas to be reestablished. This also allowed dynamic entities, like dynamic shared buffer pools, to be rebuilt.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds