OpenVZ's live checkpointing
[Posted April 26, 2006 by corbet]
The OpenVZ project is a GPL-licensed
subset of SWSoft's proprietary Virtuozzo offering. With OpenVZ, a Linux
system can implement multiple "virtual environments", each of which
appears, to the processes running within it, to be a separate, standalone
system. Virtual environments can have their own IP addresses and be
subjected to specific resource limits. They are, in other words, an
implementation of the container concept, one of several for Linux. In
recent times the various virtualization and container projects have shown a
higher level of interest in getting at least some of their code merged into
the mainline kernel, and OpenVZ is no exception. So the OpenVZ developers
have been maintaining a higher profile on the kernel mailing lists.
The latest news from OpenVZ is this announcement of a new
release with a major feature addition: live checkpointing and migration of
virtual environments. An environment (being a container full of Linux
processes) can be checkpointed to a file,
allowing it to be restarted at some later time. But it is also possible to
checkpoint a running virtual environment and move it to another system,
with no interruption in service. This feature, clearly meant to be
competitive with Xen's live migration capabilities, enables run-time load
balancing across systems.
The OpenVZ patch, weighing in at 2.2MB, is not for the faint of heart; it
makes the price to be paid for these features quite clear. Much of what is
contained within the patch has been discussed here before; for example, it
contains the PID virtualization
patches, and every bit of code within the kernel must be aware of
whether it is working with "real" or "virtual" process IDs. A number of
other kernel interfaces must be changed to support OpenVZ's virtualization
features; among other things, many device drivers and filesystems require
tweaks.
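The real-versus-virtual PID distinction can be illustrated with a small sketch. This is a toy model, not OpenVZ code: the VirtualEnvironment class, its method names, the environment names, and the PID values are all invented for illustration. It shows only the basic idea that each environment hands out its own process IDs and translates them to and from the host's real IDs.

```python
class VirtualEnvironment:
    """Toy per-environment PID table (hypothetical; not taken from OpenVZ)."""

    def __init__(self, name):
        self.name = name
        self._next_vpid = 1   # each environment starts numbering at 1
        self._v2r = {}        # virtual PID -> real (host) PID
        self._r2v = {}        # real (host) PID -> virtual PID

    def attach(self, real_pid):
        """Register a host process and return the PID it sees inside the VE."""
        vpid = self._next_vpid
        self._next_vpid += 1
        self._v2r[vpid] = real_pid
        self._r2v[real_pid] = vpid
        return vpid

    def to_real(self, vpid):
        return self._v2r[vpid]

    def to_virtual(self, real_pid):
        return self._r2v[real_pid]


# Two environments can both contain a "PID 1" backed by different host PIDs.
ve_a = VirtualEnvironment("ve101")
ve_b = VirtualEnvironment("ve102")
vpid_a = ve_a.attach(real_pid=4711)
vpid_b = ve_b.attach(real_pid=4825)
```

Any code path which takes a PID from, or reports one to, a process inside an environment has to perform a translation of this sort, which is why the patches touch so much of the kernel.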
As might be expected, the checkpointing code is on the long and complicated
side. The checkpoint process starts by putting the target process(es) on
hold, in a manner similar to what the software suspend code does. Then it
comes down to a long series of routines which serialize and write out
every data structure and bit of memory associated with a virtual
environment. The obvious things are saved: process memory, open files,
etc. But the code must also save the full state of each TCP socket
(including the backlog of sk_buff structures waiting to be
processed), connection tracking information, signal handling status, SYSV
IPC information, file descriptors obtained via Unix-domain sockets,
asynchronous I/O operations, memory mappings, filesystem namespaces, data
in tmpfs files, tty settings, file locks, epoll() file
descriptors, accounting information, and more.
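The overall shape of that sequence (hold the processes, run one dump routine per kind of state, then let things go again) might be sketched as follows. Everything here is an assumption made for illustration: the dictionary-based "processes", the dumper table, and the use of JSON as an image format bear no relation to the actual OpenVZ implementation, which works on kernel data structures.

```python
import json


def checkpoint(processes, dumpers):
    """Freeze every process, dump each kind of state, then thaw them again."""
    for proc in processes:
        proc["frozen"] = True          # stand-in for putting the task on hold
    try:
        image = {
            "processes": [
                {kind: dump(proc) for kind, dump in dumpers.items()}
                for proc in processes
            ]
        }
        return json.dumps(image, sort_keys=True)
    finally:
        for proc in processes:
            proc["frozen"] = False     # resume even if a dump routine fails


# One (hypothetical) dump routine per kind of state; the real list is far longer.
dumpers = {
    "pid": lambda p: p["pid"],
    "memory": lambda p: list(p["memory"]),
    "open_files": lambda p: sorted(p["open_files"]),
}

procs = [
    {"pid": 1, "memory": [0, 1, 2], "open_files": ["/etc/passwd"], "frozen": False},
]
image = checkpoint(procs, dumpers)
restored = json.loads(image)
```

Each entry in the dumper table corresponds to one of the categories listed above; forgetting one means that the restored environment silently loses that piece of state.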
For each of the objects to be saved, an in-file version of the kernel data
structure must be created. Each dump routine then serializes one or more
data structures into the proper format for writing to the checkpoint file.
It all apparently works, but it has the look of a highly brittle system -
almost any change to the kernel's data structures seems guaranteed to break
the checkpoint and restore code. Even if the checkpoint and restore code
were merged into the mainline, getting kernel developers to understand (and
care about) that code would be a challenge. Keeping it working must be an
ongoing hassle, whether or not the code is in the mainline tree.
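The brittleness concern can be made concrete with a small sketch of an "in-file" record. The record layout, magic number, and field names below are hypothetical and chosen only for this example; the actual OpenVZ image format is not described here. The point is simply that the dump and restore sides must agree exactly on the layout, so adding a single field to the structure being saved invalidates the existing restore code.

```python
import struct

# Hypothetical on-disk record for one open file:
# magic (u32), version (u16), fd (u16), flags (u32), offset (u64), little-endian.
FILE_REC_V1 = struct.Struct("<IHHIQ")
MAGIC = 0x43505431   # arbitrary marker invented for this sketch
VERSION = 1


def dump_file(fd, flags, offset):
    """Serialize one open-file description into its in-file form."""
    return FILE_REC_V1.pack(MAGIC, VERSION, fd, flags, offset)


def restore_file(blob):
    """Parse a record, refusing anything that does not match the known layout."""
    if len(blob) != FILE_REC_V1.size:
        raise ValueError("record size mismatch")
    magic, version, fd, flags, offset = FILE_REC_V1.unpack(blob)
    if magic != MAGIC or version != VERSION:
        raise ValueError("unknown record format")
    return {"fd": fd, "flags": flags, "offset": offset}


record = dump_file(fd=3, flags=0o2, offset=4096)
state = restore_file(record)

# Suppose a later kernel adds one field to the structure being saved.
# The layout changes, and the old restore code can no longer read the image.
FILE_REC_V2 = struct.Struct("<IHHIQI")
newer = FILE_REC_V2.pack(MAGIC, 2, 3, 0o2, 4096, 7)
try:
    restore_file(newer)
    rejected = False
except ValueError:
    rejected = True
```

A real implementation would need a record like this for every object type in the list above, each of which has to be revisited whenever the corresponding kernel structure changes.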
None of the above should be interpreted to say that OpenVZ's features are
not worth the cost. Virtual environments, checkpointing, and live
migration are powerful and useful features. But the virtualization of
everything within the kernel will lead to a higher level of internal
complexity and higher maintenance costs. The decision process which draws
the line determining which features are merged and which are not will be
interesting to watch.