By Michael Kerrisk
November 20, 2012
Checkpoint/restore refers to the ability to snapshot the state of an
application (which may consist of multiple processes) and then later
restore the application to a running state, possibly on a different
(virtual) system. Pavel Emelyanov's talk at LinuxCon Europe 2012 provided
an overview of the current status of the checkpoint/restore in user space
(CRIU) system that has been in development
for a couple of years now.
Uses of checkpoint/restore
There are various uses for checkpoint/restore functionality. For
example, Pavel's employer, Parallels, uses it for live migration, which
allows a running application to be moved between host machines without loss
of service. Parallels also uses it for so-called rebootless kernel updates,
whereby applications on a machine are checkpointed to persistent storage
while the kernel is updated and rebooted, after which the applications
are restored; the applications then continue to run, unaware that the
kernel has changed and the system has been restarted.
Another potential use of checkpoint/restore is to speed start-up of
applications that have a long initialization time. An application can be
started and checkpointed to persistent storage after the initialization is
completed. Later, the application can be quickly (re-)started from the
checkpointed snapshot. (This is analogous to the dump-emacs
feature that is used to speed up start times for emacs by creating a
preinitialized binary.)
Checkpoint/restore also has uses in high-performance computing. One
such use is for load balancing, which is essentially another application of
live migration. Another use is incremental snapshotting,
whereby an application's state is periodically checkpointed to persistent
storage, so that, in the event of an unplanned system outage, the
application can be restarted from a recent checkpoint rather than losing
days of calculation.
"You might ask, is it possible to already do all of these things
on Linux right now? The answer is that it's almost possible." Pavel
spent the remainder of the talk describing how the CRIU implementation
works, how close the implementation is to completion, and what work remains
to be done. He began with some history of the checkpoint/restore project.
History of checkpoint/restore
The origins of the CRIU implementation go back to work that started in
2005 as part of the OpenVZ
project. The project provided a set of out-of-mainline patches to the
Linux kernel that supported a kernel-space implementation of
checkpoint/restore.
In 2008, when the first efforts were made to upstream the
checkpoint/restore functionality, the OpenVZ project communicated with a
number of other parties who were interested in the feature. At the
time, it seemed natural to employ an in-kernel implementation of
checkpoint/restore. A few years' work resulted in a set of more than 100 patches that
implemented almost all of the same functionality as OpenVZ's kernel-based
checkpoint/restore mechanism.
However, concerns from the upstream
kernel developers eventually led to the rejection of the kernel-based
approach. One concern related to the sheer scale of the patches and the
complexity they would add to the kernel: the patches amounted to tens of
thousands of lines and touched a very wide range of subsystems in the
kernel. There were also concerns about the difficulties of implementing
backward compatibility for checkpoint/restore, so that an application could
be checkpointed on one kernel version and then successfully restored on a
later kernel version.
Over the course of about a year, the OpenVZ project then turned its
efforts to developing an implementation of checkpoint/restore that was done
mainly in user space, with help from the kernel where it was needed. In
January 2012, that effort was repaid when Linus Torvalds merged
a first set of CRIU-related patches into the mainline kernel, albeit with an
amusingly skeptical covering note from Andrew Morton:
A note on this: this is a project by various mad Russians to perform
checkpoint/restore mainly from userspace, with various oddball helper code
added into the kernel where the need is demonstrated.
So rather than some large central lump of code, what we have is
little bits and pieces popping up in various places which either
expose something new or which permit something which is normally
kernel-private to be modified.
Since then, two versions of the corresponding user-space tools have
been released: CRIU v0.1 in July, and CRIU v0.2, which added support for Linux Containers (LXC), in
September.
Goal and concept
The ultimate goal of the CRIU project is to allow the entire state of
an application to be dumped (checkpointed) and then later restored. This
is a complex task, for several reasons. First of all, there are many
pieces of process state that must be saved, for example, information about
virtual memory mappings, open files, credentials, timers, process ID,
parent process ID, and so on. Furthermore, an application may consist of
multiple processes that share some resources. The CRIU facility must allow
all of these processes to be checkpointed and restored to the same state.
For each piece of state that the kernel records about a process, CRIU
needs two pieces of support from the kernel. The first piece is a mechanism
to interrogate the kernel about the value of the state, in preparation for
dumping the state during a checkpoint. The second piece is a mechanism to
pass that state back to the kernel when the process is restored. Pavel
illustrated this point using the example of open files. A process may open
an arbitrary set of files. Each open() call results in the
creation of a file descriptor that is a handle to some internal kernel
state describing the open file. In order to dump that state, CRIU needs a
mechanism to ask the kernel which files are opened by that process. To
restore the application, CRIU then re-opens those files using the same
descriptor numbers.
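
As a rough illustration of the restore side (a sketch, not CRIU's actual
code), an open file can be re-created at a particular descriptor number
with open() followed by dup2(); the pathname, flags, offset, and target
descriptor below stand in for values that a dump would have recorded:

    /* Sketch: re-open a file at a specific descriptor number during a
       restore.  The pathname, flags, offset, and target descriptor are
       placeholders for values recorded at checkpoint time. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int reopen_at(const char *path, int flags, off_t offset, int target_fd)
    {
        int fd = open(path, flags);
        if (fd == -1) {
            perror("open");
            return -1;
        }

        /* Restore the file offset recorded in the dump. */
        if (lseek(fd, offset, SEEK_SET) == -1) {
            perror("lseek");
            close(fd);
            return -1;
        }

        /* Move the descriptor to the number the process had before the dump. */
        if (fd != target_fd) {
            if (dup2(fd, target_fd) == -1) {
                perror("dup2");
                close(fd);
                return -1;
            }
            close(fd);
        }
        return target_fd;
    }

    int main(void)
    {
        /* Hypothetical values that a dump might have recorded. */
        if (reopen_at("/etc/hostname", O_RDONLY, 0, 7) == -1)
            exit(EXIT_FAILURE);
        printf("file re-opened on descriptor 7\n");
        return 0;
    }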
The CRIU system makes use of various kernel APIs for retrieving and
restoring process state, including files in the /proc file system,
netlink sockets, and system calls. Files in /proc can be used to
retrieve a wide range of information about processes and their
interrelationships. Netlink sockets are used both to retrieve and to
restore various pieces of state information.
System calls provide a mechanism to both retrieve and restore various
pieces of state. System calls can be subdivided into two categories.
First, there are system calls that operate only on the process that calls
them. For example, getitimer() can be used to retrieve only
the caller's interval timer value. System calls in this category
can't easily be used to retrieve or restore the state of arbitrary
processes. However, later in his talk, Pavel described a technique that the
CRIU project came up with to employ these calls. The other category of
system calls can operate on arbitrary processes. The system calls
that set process scheduling attributes are an example:
sched_getscheduler() and sched_getparam() can be used to
retrieve the scheduling attributes of an arbitrary process and
sched_setscheduler() can be used to set the attributes of an
arbitrary process.
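
A minimal sketch of the second category, reading (and, in principle,
later re-applying) the scheduling attributes of an arbitrary process
identified by its PID:

    /* Sketch: retrieve another process's scheduling policy and priority;
       the same values could later be pushed back with sched_setscheduler(). */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        pid_t pid = (pid_t) atoi(argv[1]);

        /* "Checkpoint": ask the kernel for the target's policy and priority. */
        int policy = sched_getscheduler(pid);
        struct sched_param sp;
        if (policy == -1 || sched_getparam(pid, &sp) == -1) {
            perror("sched_get*");
            exit(EXIT_FAILURE);
        }
        printf("pid %ld: policy=%d priority=%d\n",
               (long) pid, policy, sp.sched_priority);

        /* "Restore": sched_setscheduler(pid, policy, &sp) would re-apply them. */
        return 0;
    }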
CRIU requires kernel support for retrieving each piece of process
state. In some cases, the necessary support already existed. However, in
other cases, there is no kernel API that can be used to interrogate the
kernel about the state; for each such case, the CRIU project must add a
suitable kernel API. Pavel used the example of memory-mapped files to
illustrate this point. The /proc/PID/maps file provides the
pathnames of the files that a process has mapped. However, the file
pathname is not a reliable identifier for the mapped file. For example,
after the mapping was created, filesystem mount points may have been
rearranged or the pathname may have been unlinked. Therefore, in order to
obtain complete and accurate information about mappings, the CRIU
developers added a new kernel API:
/proc/PID/map_files.
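
The entries in that directory are named after the start and end addresses
of each file-backed mapping and act as handles to the mapped files
themselves; a minimal sketch of walking them (access is restricted, per
the CAP_SYS_ADMIN requirement mentioned later):

    /* Sketch: list /proc/PID/map_files; each entry is a link named
       "start-end" that resolves to (and can be opened as) the mapped file,
       even if the file's original pathname has gone away. */
    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        char dirpath[PATH_MAX];
        snprintf(dirpath, sizeof(dirpath), "/proc/%s/map_files", argv[1]);

        DIR *dir = opendir(dirpath);
        if (dir == NULL) {
            perror("opendir");
            exit(EXIT_FAILURE);
        }

        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;

            char linkpath[PATH_MAX + NAME_MAX + 2], target[PATH_MAX];
            snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, de->d_name);

            ssize_t n = readlink(linkpath, target, sizeof(target) - 1);
            if (n == -1)
                continue;
            target[n] = '\0';

            /* d_name is the mapping's address range; target is the file. */
            printf("%s -> %s\n", de->d_name, target);
        }
        closedir(dir);
        return 0;
    }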
The situation when restoring process state is often a lot simpler: in
many cases the same API that was used to create the state in the first
place can be used to re-create the state during a restore. However, in
some cases, restoring process state is not so simple. For example,
getpid() can be used to retrieve a process's PID, but there is no
corresponding API to set a process's PID during a restore (the
fork() system call does not allow the caller to specify the PID of
the child process). To address this problem, the CRIU developers added an API that could be used to control the
PID that was chosen by the next fork() call. (In response to a
question at the end of the talk, Pavel noted that in cases where the new
kernel features added to support CRIU have security implications, access to
those features has been restricted by a requirement that the user must have
the CAP_SYS_ADMIN capability.)
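
A minimal sketch of how such an interface can be used, assuming the
/proc/sys/kernel/ns_last_pid file: writing N arranges for the next PID
allocated in the namespace to be N+1. CAP_SYS_ADMIN is required, and the
scheme is racy if other tasks in the namespace fork in the meantime.

    /* Sketch (assuming the /proc/sys/kernel/ns_last_pid interface): make
       the next fork() in this PID namespace return a specific PID. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t desired = 12345;          /* hypothetical PID recorded in a dump */

        FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "w");
        if (f == NULL) {
            perror("fopen");
            exit(EXIT_FAILURE);
        }
        /* The kernel hands out PIDs sequentially, so write desired - 1. */
        fprintf(f, "%ld", (long) (desired - 1));
        if (fclose(f) == EOF) {
            perror("write ns_last_pid");
            exit(EXIT_FAILURE);
        }

        pid_t child = fork();
        if (child == 0) {
            printf("child running with PID %ld\n", (long) getpid());
            _exit(EXIT_SUCCESS);
        }
        printf("fork() returned %ld (wanted %ld)\n",
               (long) child, (long) desired);
        wait(NULL);
        return 0;
    }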
Kernel impact and new kernel features
The CRIU project has largely achieved its goal, Pavel said. Instead of
having a large mass of code inside the kernel that does checkpoint/restore,
there are instead many small extensions to the kernel that allow
checkpoint/restore to be done in user space. By now, just over 100
CRIU-related patches have been merged upstream or are sitting in "-next"
trees. Those patches added nine new features to the kernel, of which only
one was specific to checkpoint/restore; all of the others have
turned out to also have uses outside checkpoint/restore.
Approximately 15 further patches are currently being discussed on the
mailing lists; in most cases, the principles have been agreed on by the
stakeholders, but details are being resolved. These "in flight" patches
provide two additional kernel features.
Pavel detailed a few of the more interesting new features added to the
kernel for the CRIU project. One of these was parasite code injection, which was added by
Tejun Heo, "not specifically within the CRIU project, but with the
same intention". Using this feature, a process can be made to
execute an arbitrary piece of code. The CRIU framework employs parasite
code injection to use those system calls mentioned earlier that operate
only on the caller's state; this obviated the need to add a range of new
APIs to retrieve and restore various pieces of state of arbitrary
processes. Examples of system calls used to obtain process state via
injected parasite code are getitimer() (to retrieve interval
timers) and sigaction() (to retrieve signal dispositions).
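
Run in the normal way, those calls report only on the process that makes
them, which is exactly why code has to be injected into the target to run
them on its behalf. A trivial sketch of the calls themselves, here
operating on the current process outside any injection:

    /* These calls only describe the calling process's own state, which is
       why CRIU must inject code into the target process to invoke them. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct itimerval itv;
        if (getitimer(ITIMER_REAL, &itv) == 0)
            printf("ITIMER_REAL: %ld.%06ld seconds remaining\n",
                   (long) itv.it_value.tv_sec, (long) itv.it_value.tv_usec);

        struct sigaction sa;
        if (sigaction(SIGTERM, NULL, &sa) == 0)
            printf("SIGTERM disposition: %s\n",
                   sa.sa_handler == SIG_DFL ? "default" :
                   sa.sa_handler == SIG_IGN ? "ignored" : "handler installed");
        return 0;
    }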
The kcmp() system call was
added as part of the CRIU project. It allows the comparison of various
kernel objects used by two processes. Using this system call, CRIU can
build a full picture of what resources two processes share inside the
kernel. Returning to the example of open files gives some idea of how
kcmp() is useful.
Information about an open file is available via /proc/PID/fd
and the files in /proc/PID/fdinfo. Together, these files reveal
the file descriptor number, pathname, file offset, and open file flags for
each file that a process has opened. This is almost enough information to
be able to re-open the file during a restore. However, one notable piece of
information is missing: sharing of open files. Sometimes, two open file
descriptors refer to the same file structure. That can happen, for
example, after a call to fork(), since the child inherits copies
of all of its parent's file descriptors. As a consequence of this type of
sharing, the file descriptors share file offset and open file flags.
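
Detecting that kind of sharing is where kcmp() comes in. A minimal sketch,
assuming kernel headers that provide SYS_kcmp and KCMP_FILE and that the
caller is permitted to ptrace both processes:

    /* Sketch: ask the kernel whether fd1 in pid1 and fd2 in pid2 refer to
       the same open file description.  There is no glibc wrapper, so the
       raw system call is used. */
    #define _GNU_SOURCE
    #include <linux/kcmp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int kcmp(pid_t pid1, pid_t pid2, int type,
                    unsigned long idx1, unsigned long idx2)
    {
        return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
    }

    int main(int argc, char *argv[])
    {
        if (argc != 5) {
            fprintf(stderr, "Usage: %s <pid1> <pid2> <fd1> <fd2>\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        pid_t pid1 = atoi(argv[1]), pid2 = atoi(argv[2]);
        unsigned long fd1 = strtoul(argv[3], NULL, 0);
        unsigned long fd2 = strtoul(argv[4], NULL, 0);

        /* 0 means both descriptors point at one file structure in the kernel. */
        int ret = kcmp(pid1, pid2, KCMP_FILE, fd1, fd2);
        if (ret == -1) {
            perror("kcmp");
            exit(EXIT_FAILURE);
        }
        printf("fd %lu of %d and fd %lu of %d %s the same open file\n",
               fd1, (int) pid1, fd2, (int) pid2,
               ret == 0 ? "share" : "do not share");
        return 0;
    }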
This sort of sharing of open file descriptions can't be restored via
simple calls to open(). Instead, CRIU makes use of the
kcmp() system call to discover instances of file sharing when
performing the checkpoint, and then uses a combination of open()
and file descriptor passing via UNIX domain sockets to re-create the
necessary sharing during the restore. (However, this is far from the full
story for open files, since there are many other attributes associated with
specific kinds of open files that CRIU must handle. For example, inotify
file descriptors, sockets, pseudo-terminals, and pipes all require
additional work within CRIU.)
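
A minimal sketch of the descriptor-passing half of that trick (standard
SCM_RIGHTS usage, not CRIU's code): the received descriptor refers to the
same open file description as the one that was sent, so offset and status
flags are shared just as they would be after fork().

    /* Sketch: pass a file descriptor across a UNIX domain socket pair so
       that two descriptors end up sharing one open file description. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static void send_fd(int sock, int fd)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd, sizeof(int));

        if (sendmsg(sock, &msg, 0) == -1) { perror("sendmsg"); exit(EXIT_FAILURE); }
    }

    static int recv_fd(int sock)
    {
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };

        if (recvmsg(sock, &msg, 0) == -1) { perror("recvmsg"); exit(EXIT_FAILURE); }

        int fd;
        memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
        return fd;
    }

    int main(void)
    {
        int sp[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) == -1) {
            perror("socketpair");
            exit(EXIT_FAILURE);
        }

        int fd = open("/etc/hostname", O_RDONLY);
        if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

        send_fd(sp[0], fd);
        int fd2 = recv_fd(sp[1]);

        /* A seek via one descriptor is visible through the other, showing
           that they share a single open file description. */
        lseek(fd, 3, SEEK_SET);
        printf("offset seen via passed fd: %ld\n", (long) lseek(fd2, 0, SEEK_CUR));
        return 0;
    }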
Another notable feature added to the kernel for CRIU is
sock_diag. This is a netlink-based subsystem that can be used to
obtain information about sockets. sock_diag is an example of how
a CRIU-inspired addition to the kernel has also benefited other projects.
Nowadays, the ss command, which displays information about sockets
on the system, also makes use of sock_diag. Previously,
ss used /proc files to obtain the information it
displayed. The advantage of employing sock_diag is that, by
comparison with the corresponding /proc files, it is much easier
to extend the interface to provide new information without breaking
existing applications. In addition, sock_diag provides some
information that was not available with the older interfaces. In
particular, before the advent of sock_diag, ss did not
have a way of discovering the connections between pairs of UNIX domain
sockets on a system.
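
As a rough idea of what the interface looks like, here is a minimal
sketch of a sock_diag dump request for IPv4 TCP sockets over netlink;
attribute parsing and most error handling are omitted, and the request
and reply structures come from linux/sock_diag.h and linux/inet_diag.h.

    /* Sketch: ask sock_diag to dump all IPv4 TCP sockets, in the style
       used by CRIU and by newer versions of ss. */
    #include <arpa/inet.h>
    #include <linux/inet_diag.h>
    #include <linux/netlink.h>
    #include <linux/sock_diag.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int nl = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
        if (nl == -1) { perror("socket"); exit(EXIT_FAILURE); }

        /* Request: dump every IPv4 TCP socket, regardless of state. */
        struct {
            struct nlmsghdr nlh;
            struct inet_diag_req_v2 req;
        } msg = {
            .nlh = {
                .nlmsg_len   = sizeof(msg),
                .nlmsg_type  = SOCK_DIAG_BY_FAMILY,
                .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
            },
            .req = {
                .sdiag_family   = AF_INET,
                .sdiag_protocol = IPPROTO_TCP,
                .idiag_states   = ~0U,        /* all TCP states */
            },
        };

        if (send(nl, &msg, sizeof(msg), 0) == -1) {
            perror("send");
            exit(EXIT_FAILURE);
        }

        char buf[8192] __attribute__((aligned(4)));
        for (;;) {
            ssize_t len = recv(nl, buf, sizeof(buf), 0);
            if (len <= 0)
                break;

            for (struct nlmsghdr *h = (struct nlmsghdr *) buf;
                 NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
                if (h->nlmsg_type == NLMSG_DONE || h->nlmsg_type == NLMSG_ERROR)
                    goto out;

                struct inet_diag_msg *d = NLMSG_DATA(h);
                printf("inode %u  state %u  sport %u\n",
                       d->idiag_inode, (unsigned) d->idiag_state,
                       (unsigned) ntohs(d->id.idiag_sport));
            }
        }
    out:
        close(nl);
        return 0;
    }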
Pavel briefly mentioned a few other kernel features added as part of
the CRIU work. TCP repair mode allows CRIU
to checkpoint and restore an active TCP connection, transparently to the
peer application. Virtualization of network device indices allows virtual
network devices to be restored in a network namespace; it also had the
side-benefit of a small improvement in the speed of network routing. As
noted earlier, the /proc/PID/map_files file was added for CRIU.
CRIU has also implemented a technique for
peeking at the data in a socket queue, so that the contents of a socket
input queue can be dumped. Finally, CRIU added a number of options to the
getsockopt() system call, so that various options that were
formerly only settable via setsockopt() are now also retrievable.
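
Of these, TCP repair mode is perhaps the easiest to sketch from the
restore side: with the TCP_REPAIR socket option enabled (which requires
CAP_NET_ADMIN), sequence numbers can be written into the socket and
connect() fills in the connection state without any packets being
exchanged. The peer address and sequence numbers below are hypothetical
stand-ins for values a dump would contain.

    /* Sketch of the restore side of TCP repair mode (needs CAP_NET_ADMIN). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Fallbacks for libc headers that predate these definitions. */
    #ifndef TCP_REPAIR
    #define TCP_REPAIR       19
    #endif
    #ifndef TCP_REPAIR_QUEUE
    #define TCP_REPAIR_QUEUE 20
    #endif
    #ifndef TCP_QUEUE_SEQ
    #define TCP_QUEUE_SEQ    21
    #endif
    #ifndef TCP_RECV_QUEUE
    #define TCP_RECV_QUEUE   1
    #define TCP_SEND_QUEUE   2
    #endif

    int main(void)
    {
        int sk = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1, queue, seq;

        if (sk == -1) { perror("socket"); exit(EXIT_FAILURE); }

        /* Enter repair mode before rebuilding the connection state. */
        if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)) == -1) {
            perror("setsockopt TCP_REPAIR");
            exit(EXIT_FAILURE);
        }

        /* Restore the sequence numbers recorded at checkpoint time. */
        queue = TCP_SEND_QUEUE;  seq = 1000000;          /* hypothetical */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));
        setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
        queue = TCP_RECV_QUEUE;  seq = 2000000;          /* hypothetical */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));
        setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));

        /* In repair mode, connect() only fills in kernel state; no SYN is
           sent and the peer is never contacted. */
        struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(80) };
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);
        if (connect(sk, (struct sockaddr *) &peer, sizeof(peer)) == -1) {
            perror("connect");
            exit(EXIT_FAILURE);
        }

        /* Clearing TCP_REPAIR at this point would make the socket live;
           closing while still in repair mode tears it down silently. */
        close(sk);
        return 0;
    }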
Current status
Pavel then summarized the current state of the CRIU implementation,
looking at what is supported by the mainline 3.6 kernel. CRIU currently
supports (only) the x86-64 architecture. Asked at the end of the talk how
much work would be required to port CRIU to a new architecture, Pavel
estimated that the work should not be large. The main tasks are to
implement code that dumps architecture-specific state (mainly registers)
and reimplement a small piece of code that is currently written in x86
assembler.
Arbitrary process trees are supported: it is possible to dump a process
and all of its descendants. CRIU supports multithreaded applications,
memory mappings of all kinds, terminals, process groups, and sessions.
Open files are supported, including shared open files, as
described above. Established TCP connections are supported, as are UNIX
domain sockets.
The CRIU user-space tools also support various kinds of non-POSIX
files, including inotify, epoll, and signalfd file descriptors, but the
required kernel-side support is not yet available. Patches for that
support are currently queued, and Pavel hopes that they will be merged for
kernel 3.8.
Testing
The CRIU project tests its work in a variety of ways. First, there is
the ZDTM (zero-down-time-migration) test suite. This test suite consists
of a large number of small tests. Each test program sets up a test before
a checkpoint, and then reports on the state of the tested feature after a
restore. Every new feature merged into the CRIU project adds a test to
this suite.
In addition, from time to time, the CRIU developers take some real
software and test whether it survives a checkpoint/restore. Among the
programs that they have successfully checkpointed and restored are the Apache
web server, MySQL, a parallel compilation of the kernel, tar, gzip, an SSH
daemon with connections, nginx, VNC with XScreenSaver and a client
connection, MongoDB, and tcpdump.
Plans for the near future
The CRIU developers have a number of plans for the near future. (The
CRIU wiki has a TODO list.) First
among these is to complete the coverage of resources supported by CRIU.
For example, CRIU does not currently support POSIX timers. The problem
here is that the kernel doesn't currently provide an API to detect whether
a process is using POSIX timers. Thus, if an application using POSIX
timers is checkpointed and restored, the timers will be lost. There
are some other similar examples. Fixing these sorts of problems will
require adding suitable APIs to the kernel to expose the required state
information.
Another outstanding task is to integrate the user-space crtools into
LXC and OpenVZ to permit live migration of containers. Pavel noted that
OpenVZ already supports live migration, but with its own out-of-tree kernel
modules.
The CRIU developers plan to improve the automation of live migration.
The issue here is that CRIU deals only with process state. However, there
are other pieces of state in a container. One such piece of state is the
filesystem. Currently, when checkpointing and restoring an application, it
is necessary to ensure that the filesystem state has not changed in the
interim (e.g., no files that are open in the checkpointed application have
been deleted). Some scripting using rsync to automate copying files
from the source system to the destination system could be implemented to
ease the task of live migration.
One further piece of planned work is to improve the handling of
user-space memory. Currently, around 90% of the time required to
checkpoint an application is taken up by reading user-space memory. For
many use cases, this is not a problem. However, for live migration and
incremental snapshotting, improvements are possible. For example, when
performing live migration, the whole application must first be frozen, and
then the entire memory is copied out to the destination system, after which
the application is restarted on the destination system. Copying out a huge
amount of memory may require several seconds; during that time the
application is unavailable. This situation could be alleviated by allowing
the application to continue to run at the same time as memory is copied to
the destination system, then freezing the application and asking the kernel
which pages of memory have changed since the checkpoint operation began.
Most likely, only a small amount of memory will have changed; those
modified pages can then be copied to the destination system. This could
result in a considerable shortening of the interval during which the
application is unavailable. The CRIU developers plan to talk with the
memory-management developers about how to add support for this
optimization.
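
As an illustration of the kind of query that is needed, here is a sketch
that assumes the "soft-dirty" tracking interface (/proc/PID/clear_refs
and bit 55 of /proc/PID/pagemap); that interface was not part of the
mainline kernel at the time of the talk, so it stands in for whatever the
memory-management discussions eventually produce.

    /* Sketch of dirty-page tracking via the assumed soft-dirty interface:
       clear the per-page "soft-dirty" bits, let the application run, then
       report which pages in a range have been written to since. */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SOFT_DIRTY_BIT (1ULL << 55)

    int main(int argc, char *argv[])
    {
        if (argc != 4) {
            fprintf(stderr, "Usage: %s <pid> <start-addr> <num-pages>\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        long pagesize = sysconf(_SC_PAGESIZE);
        uintmax_t start = strtoumax(argv[2], NULL, 0);
        long npages = atol(argv[3]);

        char path[64];

        /* Step 1 (when the copy starts): writing "4" to clear_refs clears
           the soft-dirty bit on every page of the target process. */
        snprintf(path, sizeof(path), "/proc/%s/clear_refs", argv[1]);
        FILE *cr = fopen(path, "w");
        if (cr == NULL || fputs("4", cr) == EOF || fclose(cr) == EOF) {
            perror("clear_refs");
            exit(EXIT_FAILURE);
        }

        /* ... the application keeps running while memory is copied ... */

        /* Step 2 (after freezing): read pagemap and report which pages
           have been written to since the bits were cleared. */
        snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
        int pm = open(path, O_RDONLY);
        if (pm == -1) { perror("pagemap"); exit(EXIT_FAILURE); }

        for (long i = 0; i < npages; i++) {
            uint64_t entry;
            off_t off = (start / pagesize + i) * sizeof(entry);
            if (pread(pm, &entry, sizeof(entry), off) != sizeof(entry))
                break;
            if (entry & SOFT_DIRTY_BIT)
                printf("page at 0x%jx was modified\n",
                       start + (uintmax_t) i * pagesize);
        }
        close(pm);
        return 0;
    }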
Concluding remarks
Although many groups are interested in having checkpoint/restore
functionality, an implementation that works with the mainline kernel has
taken a long time in coming. When one looks into the details and realizes
how complex the task is, it is perhaps unsurprising that it has taken so
long. Along the way, one major effort to solve the
problem—checkpoint/restore in kernel space—was considered and
rejected. However, there are some promising signs that the mad Russians
led by Pavel may be on the verge of success with their alternative approach
of a user-space implementation.