A Checkpoint/restart update

By Jonathan Corbet
February 24, 2010

It has been exactly one year since LWN last checked up on the checkpoint/restart patch set. This code has just been reposted with a request for inclusion into the -mm tree, so it seems like an opportune time to restart our coverage of it. A lot of progress has been made on this front over the last year, but checkpoint/restart remains a difficult task which can probably never be implemented completely.

"Checkpointing" refers to the act of saving the state of a group of processes to a file, with the intent of restarting those processes at some future time. For many years, checkpointing has been used to save the state of long-running calculations to avoid losing work should the system fail. More recently, it has become a desired part of the virtualization toolkit, enabling the live migration of processes between physical hosts. The checkpoint/restart developers also see other potential advantages, such as the ability to quickly launch a set of processes on demand from a checkpoint image.

This patch set addresses checkpoint/restart in the containers context. In the context of full virtualization, checkpointing is relatively easy; the system just needs to save the entire memory image associated with the virtual machine and a bit of associated data. The "containers" model of virtualization tends to be messier in almost every way, and checkpointing is no exception. There is no memory image to be saved in one big chunk; instead, the kernel must track down every bit of state associated with the checkpointed processes and save it independently. When it works, it can be faster and more efficient than full virtual machine checkpointing; the checkpoint image will be much smaller. But getting it to work is a challenge. The complexity of this task can be seen in the current checkpoint/restart tree, which, despite being far from a complete solution of the problem, is a 27,000-line diff from 2.6.33-rc8.

Checkpointing

To checkpoint a group of processes, the following new system call is used:

    int checkpoint(pid_t pid, int fd, unsigned long flags, int logfd);

The pid parameter identifies the top-level process to be checkpointed; all children of that process will also be included in the checkpoint image, which will be written to the file indicated by fd. There is currently only one possible flag value, CHECKPOINT_SUBTREE, which turns off the normal requirement that an entire container be checkpointed as a whole. Checkpointing just a subtree is a bit riskier than checkpointing a full container because it is harder to ensure that all needed resources have been saved. The logfd parameter is file descriptor open for writing; the kernel will write relevant logging information there. There are vast numbers of possible ways for a checkpoint to fail; the log file is intended to help users figure out what is happening when a checkpoint refuses to work. If logging is not desired, logfd can be -1.

The set of processes to be checkpointed should be frozen prior to the call to checkpoint(). One exception to that rule is a process running in checkpoint() itself; this exception allows processes to checkpoint themselves.

Internally, the checkpointing process is implemented as a two-phase operation:

The kernel traverses the tree of processes and "collects" every object which is to be a part of the checkpoint image. Essentially, "collecting" means building a hash table with an entry for every process, every open file, every virtual memory area, every open socket, etc. which must be saved. Scanning the tree in this way helps the kernel to abort the checkpoint process early if something which cannot be checkpointed is encountered. Just as importantly, the collecting process also lets the system track objects which have multiple references and ensure that they are only written to the image file once.
The second pass then iterates over the collected objects and causes each to be serialized and written to the image file.

Once this is done, the checkpoint is finished. The just-checkpointed processes can either go on with their business or be killed, depending on the reason for the checkpoint.

These two phases are reflected in the changes made to the lower levels of the system. For example, the none-too-svelte file_operations structure gains two new operations:

    int collect(struct ckpt_ctx *ctx, struct file *filp);
    int checkpoint(struct ckpt_ctx *ctx, struct file *filp);

The collect() operation should identify every object which must be saved, eventually passing each to ckpt_obj_collect() (or one of several higher-level interfaces) for tracking. Later, a call to checkpoint() is made to request that the given filp be serialized for saving to the checkpoint image. Similar methods have been added to a number of other structure types, including vm_operations_struct and proto_ops.

The serialization process requires copying data from kernel data structures into a series of special structures intended to be written to the image file. So, for example, a file descriptor finds its way from struct fdtable into one of these:

    struct ckpt_hdr_file_desc {
	struct ckpt_hdr h;
	__s32 fd_objref;
	__s32 fd_descriptor;
	__u32 fd_close_on_exec;
    } __attribute__((aligned(8)));

Doing this copy requires a 75-line function which grabs the requisite information and very carefully checks that everything can be checkpointed successfully. In this case, the presence of locks on the file or an owner (to be notified with SIGIO) will cause the checkpoint to fail. In the absence of such roadblocks, the completed structure is handed to the checkpoint code for saving to the image file.

This serialization process is one of the scarier parts of the whole checkpoint/restart concept. Any changes to struct fdtable will almost certainly break this serialization, quite possibly in ways which will not be detected until some user runs into a problem. Even if a VFS developer cared about checkpointing, they might not think to look at the code in checkpoint/files.c to see if anything might require changing. Similar dependencies are created for every other kernel data structure which must be checkpointed. The whole setup looks like it could be a little fragile; keeping it working is almost certain to require significant ongoing maintenance.

Restarting

On the restart side, the application performing the restart is first expected to create a set of processes to be animated with the checkpointed information. That creation will be done with the much-reviewed "extended clone()" system call, which, in this iteration, looks like:

    int eclone(u32 flags_low, struct clone_args *cargs, int cargs_size,
	       pid_t *pids);

With eclone(), the processes can be created with specific pids and with an extended set of flags.

Once the process hierarchy exists, the restart() system call can be used:

    int restart(pid_t pid, int fd, unsigned long flags, int logfd);

The checkpoint image found at fd will be restored into the process hierarchy starting at pid. Once again, logfd can be used to gain information on how the process went. There are a number of flags defined: RESTART_TASKSELF (a single task is being restarted on top of the process calling restart()), RESTART_FROZEN (causes the restarted processes to be left frozen at the end), RESTART_GHOST (appears to be a debugging feature), RESTART_KEEP_LSM (restore security labels too), and RESTART_CONN_RESET (force the closing of open sockets). On a successful return from restart(), the process hierarchy should be ready to go.

Once again, restart requires support at the lower levels of the kernel. So our long-suffering file_operations structure gains another function:

    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *);

This function (along with its analogs elsewhere in the kernel) is charged with reanimating the given object from the checkpoint file.

Security

It is not hard to imagine that these new system calls could have any of a number of security-related consequences, so it is surprising to see that, in the current implementation, both checkpoint() and restart() are unprivileged operations. This decision was made deliberately, with the idea of forcing the developers to think about security issues from the outset.

The biggest potential problem with checkpoint() is probably information disclosure. To avoid this problem, checkpoint() is only able to checkpoint processes which the caller would be able to call ptrace() on. So there should be no way for a hostile user to gain information from a checkpoint image which would not be available anyway.

The restart side is a little more frightening; it allows the caller to load vast amounts of potentially arbitrary data into kernel data structures. This risk is, one hopes, mitigated by causing all operations to be done in the context of the calling process. If the caller cannot open a file directly, that file cannot be opened via a corrupted checkpoint image either. Doing things this way will break certain use cases, such as checkpointing a setuid program which has since dropped its privileges, but there is probably no way to make that case work securely for unprivileged users.

For an added challenge, the checkpoint/restart developers have also implemented the checkpointing of security labels on objects. By default, these labels will not be used during the restart process, but the RESTART_KEEP_LSM flag can change that. Again, the labels are created in the context of the calling process, so the active security module should prevent the attachment of labels which compromise the security of the system.

Even with these measures in place, one still has to wonder about the security of the process as a whole. The kernel is populating a wide array of data structures from input which may be under the control of a hostile user; it is not hard to imagine that, somewhere in tens of thousands of lines of checkpoint/restart code, an important check has not been made. Perhaps as a result of this concern, the patch set adds a sysctl knob which can be set to disallow unprivileged checkpoint/restart operations.

Where things stand

According to the most recent patch posting:

This one is able to checkpoint/restart screen and vnc sessions, and live-migrate network servers between hosts. It also adds support for x86-64 (in addition to x86-32, s390x and powerpc).

So the patch set appears to be sufficiently functional to be minimally useful. There are, however, a lot of things which can stil prevent the creation of a successful checkpoint; they are summarized on this page. Problem areas include private filesystem mounts, network sockets in some states, open-but-unlinked files, use of any of the file event notification interfaces, open files on network or FUSE filesystems, use of netlink, ptrace(), asynchronous I/O, and more. There are patches in the works for some of these problems; others look hard.

As of this writing, there has been no response to the developers' request for inclusion in the -mm kernel. In the past, there have been concerns about how much work would be required to finish the job. Over the last year, much of that work is done, but checkpoint/restart looks like a job which is never truly finished. It's mostly a matter of whether what has been done so far appears to be good enough for real work, and whether the maintenance cost of this code is deemed to be worth paying.

Index entries for this article
Kernel	Checkpointing
Kernel	Containers

A Checkpoint/restart update

Posted Feb 24, 2010 22:58 UTC (Wed) by mhelsley (guest, #11324) [Link] (4 responses)

Even if a VFS developer cared about checkpointing, they might not think to look at the code in checkpoint/files.c to see if anything might require changing. Similar dependencies are created for every other kernel data structure which must be checkpointed. The whole setup looks like it could be a little fragile; keeping it working is almost certain to require significant ongoing maintenance.

Code placement is definitely something to fix. Recently I've been working on moving checkpoint/* contents to more appropriate places in the kernel sources for precisely this reason.

A Checkpoint/restart update

Posted Feb 25, 2010 11:36 UTC (Thu) by k3ninho (subscriber, #50375) [Link] (2 responses)

Can you not have a build-time reader which examines the kernel's interface definitions in the source tree, allowing you to learn the kernel's own definitions of the data structure without having to maintain more than a list of the files/functions to scan?

K3n.

A Checkpoint/restart update

Posted Feb 25, 2010 15:47 UTC (Thu) by hallyn (subscriber, #22558) [Link]

Matt wasn't talking (I don't think) about the checkpoint image format,
which (IIUC) is what you would be addressing with your suggestion.

I personally think the main maintainability concern is that updates to
object creation/destruction/updates code not require maintainers to know
to look at other random places like checkpoint/file.c to update
corresponding checkpoint and restart code. That is why we have made it
a point to re-use existing (or create) helpers like cred_setresuid()
in the checkpoint and restart paths, so that updates to the core helpers
will automatically update checkpoint/restart code as well.

In addition to this, Matt is working on moving everything (or nearly
everything) that is under checkpoint/ into the right files in the core
code, i.e. checkpoint/file.c helpers likely belong in fs/namei.c,
fs/open.c,
fs/namespace.c etc.

Now, what you're talking about with auto-generation of headers has also
been discussed, and specifically suggested by Andrew Morgan
(i.e. https://lists.linux-foundation.org/pipermail/containers/2...
June/018289.html ). But I think it's still an open question whether
that will just obfuscate what is really going on, and whether it is
addressing a real problem. If it turns out to be a real problem, then
we're certainly open to it!

A Checkpoint/restart update

Posted Feb 25, 2010 22:29 UTC (Thu) by mhelsley (guest, #11324) [Link]

Hi K3n,

That actually gets us suprisingly little. Yes, there's a good amount of code that looks like:

save_field = struct_field;

but that's not most of what we do. Locking, error handling, and reference counting account for much of the rest of it. We don't assume that the freezer protects kernel data structures for checkpoint/restart. The freezer is there to ensure the checkpointed tasks themselves aren't doing anything - inside or outside the kernel. That doesn't prevent non-frozen tasks from using things inside the kernel (shared struct files for example). Even if we limited ourselves to freezing "whole containers" this would be a problem. So we need to take the appropriate locks in the right order, hold references while we checkpoint, and avoid sleeping sometimes.

More extreme solutions using lists could be made to work but the code, memory, and performance of that approach is actually much worse. We could stop all other cpus on the machine while we checkpoint. Then we couldn't use most of the existing kernel code though because the functions we want to reused wouldn't be allowed to grab locks, sleep, or reliably allocate memory. That means even more new code and more fragile code too. We'd also need to know beforehand exactly how much memory is needed for the checkpoint image. Finally, it would all have to fit in kernel memory. Given that userspace tasks can use large amounts of memory that's not very practical.

A Checkpoint/restart update

Posted Feb 26, 2010 16:37 UTC (Fri) by mhelsley (guest, #11324) [Link]

As a follow up, here's what I have posted for moving a good portion of the code to the right places:

http://permalink.gmane.org/gmane.linux.kernel.containers/... (Introduction)
http://permalink.gmane.org/gmane.linux.kernel.containers/... (First patch)

Security of restore

Posted Feb 25, 2010 11:30 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

Shouldn't the restoring be done entirely in user space? That would deal with many of the security questions raised in the article.

Security of restore

Posted Feb 25, 2010 15:39 UTC (Thu) by hallyn (subscriber, #22558) [Link] (1 responses)

We've gone down that road before - see for instance http://git.sr71.net/?
p=cryodev.git;a=summary. There are a boatload of things you cannot do from
userspace. In particular, specifying identifiers for resources like
sysvipc.

'We' (meaning a much larger community than what has been the core
contributing team) decided to finally dive fully into the in-kernel approach
at the 2008 containers mini-summit - meeting notes at
http://wiki.openvz.org/Containers/Mini-summit_2008_notes.

Security of restore

Posted Feb 25, 2010 22:56 UTC (Thu) by mhelsley (guest, #11324) [Link]

What's more, some parts of restart are done in userspace when it allows us to keep the kernel code much simpler. The process tree is recreated in userspace for example. As another example we also plan on using execve() to ensure that checkpoint images containing 32 and 64-bit processes are restarted correctly.

The in-kernel alternative for the process tree tended to involve creating kernel threads and then stripping them of their "special" status as kernel threads. Doing it in userspace with clone() (or eclone()) will be much easier to review and maintain.

The in-kernel alternative for the second example is to have sys_restart() do the 32/64-bit mm switches. That switch had to be carefully orchestrated for execve() where the kernel can throw away lots of state. We can't throw away as much state in sys_restart() so orchestrating that switch is even harder there. Plus it adds to the maintenance burden of arch maintainers. By using execve() we'll avoid all of that.

Easing maintenance has been a concern of ours for a long time and we're open to new suggestions from maintainers for making it even easier.

Cryptography to the rescue!

Posted Feb 25, 2010 16:23 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

Doing things this way will break certain use cases, such as checkpointing a setuid program which has since dropped its privileges, but there is probably no way to make that case work securely for unprivileged users.

The problem with restoring a setuid program is that users might be able to modify the serialized state. Why not use a MAC to authenticate the saved state? Administrators would need to provide a secret key not visible to ordinary users, of course, but that would be trivial to provide via a sysctl.

On the other hand, it seems like it'd be possible to implement this authenticated-checkpoint functionality from userspace by asking a privileged process to do the checkpointing and restoration on behalf of an ordinary user.

Cryptography to the rescue!

Posted Feb 25, 2010 16:33 UTC (Thu) by hallyn (subscriber, #22558) [Link]

Indeed we definately intend to exploit the TPM, and have it sign valid
checkpoint images.

As for using MAC, you can certainly set up an assured pipeline using
SELinux policy to make sure that noone can modify a checkpoint image,
and that /bin/restart runs in a domain which can only read valid
checkpoint images. Hmm, well, I suppose /bin/restart_wrapper would
only be able to open validated checkpoint images, then pass those in
to /bin/restart (restart itself will need to open all the files used
by the restarted program).

Finally, note that an unprivileged user can neither checkpoint nor
restart a setuid program. It can't checkpoint it because it will fail
the ptrace access checks, and can't restart it because sys_restart() will
try to do cred_setresuid() to an effective userid of 0 and fail (or open
a resource which the unprivileged user cannot access, and fail).