By Jonathan Corbet
August 9, 2011
The 3.1 kernel will include a number of enhancements to the
ptrace() system call by Tejun Heo. These improvements are meant
to make reliable debugging of programs easier, but Tejun, it seems, is not
one to be satisfied with mundane objectives like that. So he has
posted an example program showing how the new
features can be used to solve a difficult problem faced by
checkpoint/restart implementations: capturing and restoring the state of
network connections. The code is in an early stage of development; it's
audacious and scary, but it may show how interesting things can be done.
The traditional ptrace() API calls for a tracing program to attach
to a target process with the PTRACE_ATTACH command; that command
puts the target into a traced state and stops it in its tracks.
PTRACE_ATTACH has never been perfect; it changes the target's
signal handling and can never be entirely transparent to the target. So
Tejun supplemented it with a new PTRACE_SEIZE command;
PTRACE_SEIZE attaches to the target but does not stop it or change
its signal handling in any way. Stopping a seized process is done with
PTRACE_INTERRUPT which, again, does not send any signals or make
any signal handling changes. The result is a mechanism which enables the
manipulation of processes in a more transparent, less disruptive way.
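As a rough illustration of the seize-then-interrupt sequence described above, here is a minimal sketch in C. The function names seize_and_stop() and demo_seize() are invented for this example, and the code assumes a later kernel and C library where PTRACE_SEIZE can be used without any development flag; it is not taken from Tejun's program.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach to `pid` with PTRACE_SEIZE (which does not
 * stop the target), then stop it explicitly with PTRACE_INTERRUPT.
 * Returns 0 with the target stopped, or -1 with errno set.
 */
int seize_and_stop(pid_t pid)
{
    int status;

    assert(pid > 0);
    if (ptrace(PTRACE_SEIZE, pid, NULL, NULL) == -1)
        return -1;
    if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL) == -1)
        return -1;
    /* The stop is reported to the tracer through waitpid(). */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSTOPPED(status) ? 0 : -1;
}

/*
 * Hypothetical demonstration: fork a child that just sleeps, seize and
 * stop it, then detach and clean up. Returns the seize_and_stop() result.
 */
int demo_seize(void)
{
    pid_t child = fork();
    int rc, saved_errno;

    if (child == -1)
        return -1;
    if (child == 0) {
        for (;;)
            pause();        /* the child waits until it is killed */
    }

    rc = seize_and_stop(child);
    saved_errno = errno;
    if (rc == 0)
        ptrace(PTRACE_DETACH, child, NULL, NULL);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    errno = saved_errno;
    return rc;
}
```

Calling demo_seize() should return zero on a system that allows a process to trace its own children; a sandbox that forbids ptrace() will make it fail with EPERM instead.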
All of this seems useful, but it does not necessarily seem like part of a
checkpoint/restart implementation. But it can help in an important way.
One of the problems associated with saving the state of a process is that
not all of that state is visible from user space. Getting around this
limitation has tended to involve doing checkpointing from within the kernel
or the addition of new interfaces to expose the required information;
neither approach is seen as ideal. But, in many cases, the required
information can be had by running in the context of the targeted process;
that is where an approach based on ptrace() can have a role to
play.
Tejun took on the task of saving and restoring the state of an open TCP
connection for his example implementation. The process starts by using
ptrace() to seize and stop the target thread(s); then it's just a
matter of running some code in that process's context to get the requisite
information. To do so, Tejun's example program digs around in the target's
address space for a nice bit of memory which has execute permission; the
contents of that memory are saved and replaced by his "parasite" code. A
bit of register manipulation allows the target process to be restarted in
the injected code, which does the needed information gathering. Once
that's done, the original code and registers are restored, and the target
process is as it was before all this happened.
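The search for executable memory can be sketched as follows. This example assumes that the search is done by reading /proc/PID/maps, which is one plausible approach and not necessarily the one the example program takes; the name find_exec_region() is made up for the illustration. The injection itself (saving the old contents, writing the parasite, and redirecting the registers) is architecture-specific and is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: scan /proc/<pid>/maps for the first mapping with
 * execute permission. On success, stores its bounds in *start and *end
 * and returns 0; returns -1 if none is found or the file cannot be read.
 */
int find_exec_region(pid_t pid, unsigned long *start, unsigned long *end)
{
    char path[64], line[512], perms[5];
    FILE *maps;
    int found = -1;

    assert(start != NULL && end != NULL);
    snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
    maps = fopen(path, "r");
    if (maps == NULL)
        return -1;

    while (fgets(line, sizeof(line), maps) != NULL) {
        unsigned long lo, hi;

        /* Each line starts with "start-end perms", e.g. "...-... r-xp". */
        if (sscanf(line, "%lx-%lx %4s", &lo, &hi, perms) != 3)
            continue;
        if (perms[2] == 'x') {
            *start = lo;
            *end = hi;
            found = 0;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

A tracer would call this with the target's PID after stopping it; calling it with getpid() is an easy way to try it out, since any running program has at least one executable mapping.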
The "parasite" code starts by gathering the basic information about open
connections: IP addresses, ports, etc. The state of the receive side of
each connection is saved by (1) copying any buffered incoming data
using the MSG_PEEK option to recvmsg(), and
(2) getting the sequence number to be read next with a new
SIOCGINSEQ ioctl() command. On the transmit side, the
sequence number of each queued outgoing packet, along with the packet data
itself, must be captured with another pair of new ioctl()
commands. With that done, the checkpointing of the network connection is
complete.
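The MSG_PEEK step is the one piece of this that works with existing interfaces, so it is easy to sketch. The code below is an illustration only: peek_buffered() and demo_peek() are invented names, a Unix-domain socket pair stands in for a TCP connection, and the new ioctl() commands from Tejun's patches are left out because they cannot be assumed to exist on any given system.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to `len` bytes of data already buffered on
 * `fd` into `buf` without removing it from the socket's receive queue.
 */
ssize_t peek_buffered(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    assert(buf != NULL);
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    /* MSG_DONTWAIT keeps the call from blocking on an empty queue. */
    return recvmsg(fd, &msg, MSG_PEEK | MSG_DONTWAIT);
}

/*
 * Hypothetical demonstration: data that has been peeked at is still
 * there for the next ordinary read. Returns 0 if that holds.
 */
int demo_peek(void)
{
    int sv[2], ok = -1;
    char peeked[16], consumed[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (send(sv[0], "hello", 5, 0) == 5) {
        ssize_t n1 = peek_buffered(sv[1], peeked, sizeof(peeked));
        ssize_t n2 = recv(sv[1], consumed, sizeof(consumed), 0);

        if (n1 == 5 && n2 == 5 && memcmp(peeked, consumed, 5) == 0)
            ok = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Because the peek leaves the queue untouched, the target process never notices that its buffered data has been copied.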
Restarting the connection - possibly in a different process on a different
machine entirely - is a bit tricky; the kernel's idea of the connection
must be made to match the situation at checkpoint time without perturbing
or confusing the other side. That requires the restart code to pretend to
be the other side of the connection for as long as it takes to get things
in sync. The kernel already provides most of the machinery needed for this
task: outgoing packets can be intercepted with the "nf_queue" mechanism,
and a raw socket can be used to inject new packets that appear to be coming
from the remote side.
So, at restart time, things start by simply opening a new socket to the
remote end. Another new ioctl() command (SIOCSOUTSEQ) is
used to set the sequence number before connecting to make it match the
number found at checkpoint time. Once the connection process starts, the
outgoing SYN packet will be intercepted - the remote side will certainly
not be prepared to deal with it - and a SYN/ACK reply will be injected
locally. The outgoing ACK must also be intercepted and dropped on the
floor, of course. Once that is done, the kernel thinks it has an open
connection, with sequence numbers matching the pre-checkpoint connection,
to the remote side.
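The sequence-number bookkeeping for the forged SYN/ACK follows from ordinary TCP rules: a SYN consumes one sequence number, so the reply acknowledges the SYN's number plus one, and the forged initial sequence number must be one less than the first byte the local stack should expect from the "remote" side. The sketch below shows only that arithmetic; the structure and function names are hypothetical, and how the restart code actually picks its values is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two fields that matter in the reply. */
struct fake_synack {
    uint32_t seq;   /* sequence number placed in the forged SYN/ACK */
    uint32_t ack;   /* acknowledgment number placed in the forged SYN/ACK */
};

/*
 * syn_seq:     sequence number seen in the intercepted outgoing SYN
 * remote_next: sequence number of the first byte the local stack should
 *              expect from the remote side once the handshake is done
 *
 * All arithmetic is on uint32_t, so it wraps modulo 2^32 as TCP requires.
 */
struct fake_synack make_synack(uint32_t syn_seq, uint32_t remote_next)
{
    struct fake_synack reply;

    reply.seq = remote_next - 1;    /* the SYN itself uses one number */
    reply.ack = syn_seq + 1;        /* acknowledge the local SYN */
    return reply;
}

/* Small self-check covering the ordinary case and the wraparound case. */
void synack_selfcheck(void)
{
    struct fake_synack r = make_synack(1000u, 5001u);

    assert(r.seq == 5000u && r.ack == 1001u);
    r = make_synack(0xffffffffu, 0u);
    assert(r.seq == 0xffffffffu && r.ack == 0u);
}
```

The wraparound case matters in practice: a long-lived connection can be checkpointed at any point in the 32-bit sequence space.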
After that, it's a matter of restoring the incoming data that had been
found queued in the kernel at checkpoint time; that is done by injecting
new packets containing that data and intercepting the resulting ACKs from
the network stack. Outgoing data, in contrast, can be restored with a series
of simple send() calls, but there is one little twist. Packets in
the outgoing queue may have already been transmitted and received by the
remote side. Retransmitting those packets is not a problem, as long as the
size of those packets remains the same. If, instead, the system uses different
offsets as it divides the outgoing data into packets, it can create
confusion at the remote end. To keep that from happening, Tejun added one
more ioctl() (SIOCFORCEOUTBD) to force the packets to
match those created before the checkpoint operation began.
Once the transmit queue is restored, the connection is back to its original
state. At this point, the interception of outgoing packets can stop.
All of this seems somewhat complex and fragile, but Tejun states that it
"actually works rather reliably." That said, there are a lot
of details that have been ignored; it is, after all, a proof-of-concept
implementation. It's not meant to be a complete solution to the problem of
checkpointing and restarting network connections; the idea is to show that
the problem can, indeed, be solved. If the user-space checkpoint/restart
work proceeds, it may well adopt some variant of
this approach at some point. In the meantime, though, what we have is a
fun hack showing what can be done with the new ptrace() commands.
Those wanting more details on how it works can find them in the
README file found in the example code repository.