
TCP connection hijacking and parasites - as a good thing

By Jonathan Corbet
August 9, 2011
The 3.1 kernel will include a number of enhancements to the ptrace() system call by Tejun Heo. These improvements are meant to make reliable debugging of programs easier, but Tejun, it seems, is not one to be satisfied with mundane objectives like that. So he has posted an example program showing how the new features can be used to solve a difficult problem faced by checkpoint/restart implementations: capturing and restoring the state of network connections. The code is in an early stage of development; it's audacious and scary, but it may show how interesting things can be done.

The traditional ptrace() API calls for a tracing program to attach to a target process with the PTRACE_ATTACH command; that command puts the target into a traced state and stops it in its tracks. PTRACE_ATTACH has never been perfect; it changes the target's signal handling and can never be entirely transparent to the target. So Tejun supplemented it with a new PTRACE_SEIZE command; PTRACE_SEIZE attaches to the target but does not stop it or change its signal handling in any way. Stopping a seized process is done with PTRACE_INTERRUPT which, again, does not send any signals or make any signal handling changes. The result is a mechanism which enables the manipulation of processes in a more transparent, less disruptive way.

All of this seems useful, but it does not necessarily seem like part of a checkpoint/restart implementation. But it can help in an important way. One of the problems associated with saving the state of a process is that not all of that state is visible from user space. Getting around this limitation has tended to involve doing checkpointing from within the kernel or the addition of new interfaces to expose the required information; neither approach is seen as ideal. But, in many cases, the required information can be had by running in the context of the targeted process; that is where an approach based on ptrace() can have a role to play.

Tejun took on the task of saving and restoring the state of an open TCP connection for his example implementation. The process starts by using ptrace() to seize and stop the target thread(s); then it's just a matter of running some code in that process's context to get the requisite information. To do so, Tejun's example program digs around in the target's address space for a nice bit of memory which has execute permission; the contents of that memory are saved and replaced by his "parasite" code. A bit of register manipulation allows the target process to be restarted in the injected code, which does the needed information gathering. Once that's done, the original code and registers are restored, and the target process is as it was before all this happened.

The "parasite" code starts by gathering the basic information about open connections: IP addresses, ports, etc. The state of the receive side of each connection is saved by (1) copying any buffered incoming data using the MSG_PEEK option to recvmsg(), and (2) getting the sequence number to be read next with a new SIOCGINSEQ ioctl() command. On the transmit side, the sequence number of each queued outgoing packet - along with the packet data itself - must be captured with another pair of new ioctl() commands. With that done, the checkpointing of the network connection is complete.

Restarting the connection - possibly in a different process on a different machine entirely - is a bit tricky; the kernel's idea of the connection must be made to match the situation at checkpoint time without perturbing or confusing the other side. That requires the restart code to pretend to be the other side of the connection for as long as it takes to get things in sync. The kernel already provides most of the machinery needed for this task: outgoing packets can be intercepted with the "nf_queue" mechanism, and a raw socket can be used to inject new packets that appear to be coming from the remote side.

So, at restart time, things start by simply opening a new socket to the remote end. Another new ioctl() command (SIOCSOUTSEQ) is used to set the sequence number before connecting to make it match the number found at checkpoint time. Once the connection process starts, the outgoing SYN packet will be intercepted - the remote side will certainly not be prepared to deal with it - and a SYN/ACK reply will be injected locally. The outgoing ACK must also be intercepted and dropped on the floor, of course. Once that is done, the kernel thinks it has an open connection, with sequence numbers matching the pre-checkpoint connection, to the remote side.

After that, it's a matter of restoring the incoming data that had been found queued in the kernel at checkpoint time; that is done by injecting new packets containing that data and intercepting the resulting ACKs from the network stack. Outgoing data, instead, can be replaced with a series of simple send() calls, but there is one little twist. Packets in the outgoing queue may have already been transmitted and received by the remote side. Retransmitting those packets is not a problem, as long as the size of those packets remains the same. If, instead, the system uses different offsets as it divides the outgoing data into packets, it can create confusion at the remote end. To keep that from happening, Tejun added one more ioctl() (SIOCFORCEOUTBD) to force the packets to match those created before the checkpoint operation began.

Once the transmit queue is restored, the connection is back to its original state. At this point, the interception of outgoing packets can stop.

All of this seems somewhat complex and fragile, but Tejun states that it "actually works rather reliably." That said, there are a lot of details that have been ignored; it is, after all, a proof-of-concept implementation. It's not meant to be a complete solution to the problem of checkpointing and restarting network connections; the idea is to show that the problem can, indeed, be solved. If the user-space checkpoint/restart work proceeds, it may well adopt some variant of this approach at some point. In the meantime, though, what we have is a fun hack showing what can be done with the new ptrace() commands. Those wanting more details on how it works can find them in the README file found in the example code repository.



TCP connection hijacking and parasites - as a good thing

Posted Aug 11, 2011 13:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Is it just me, or does this whole "patch a stopped process and then muck with TCP forging a SYN/ACK sequence" procedure look kinda Rube Goldbergesque?

What are you going to do with SIP connections? Or with SCTP connections?

This code should properly be in the kernel, possibly integrated with conntrack module of iptables (which seems to be the ideal place for it).

TCP connection hijacking and parasites - as a good thing

Posted Aug 11, 2011 17:16 UTC (Thu) by bronson (subscriber, #4806) [Link]

It's a good prototype. Glue random parts together and see if it flies. If it does, THEN you do the engineering to turn it into a product. If not, no big deal.

TCP connection hijacking and parasites - as a good thing

Posted Aug 11, 2011 21:29 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

I'm very interested in checkpointing, to the extent that I've implemented a userspace TCP/IP stack using http://sourceforge.net/projects/cipsuite/ with support for rapid resuming.

However, the proposed solution is what I'd call an example of how NOT to do checkpointing. I've read its code and I'd say that it conclusively proves that there should be kernel-level support for it.

Actually, it should not even be that hard! We already have /proc/pid/fd directory with the list of open handles. So we just need to add, say, /proc/pid/fd-pickle directory with the list of files, containing handles' information. So TCP connections would store their endpoints, sequence numbers, the sets of TCP flags, and probably IPSec state in these files.

TCP connection hijacking and parasites - as a good thing

Posted Aug 11, 2011 22:27 UTC (Thu) by bronson (subscriber, #4806) [Link]

cipsuite looks impressive! Wish I'd had it back in my embedded days.

Agreed, the article's technique is not the best way of doing it. /proc/pid/fd-pickle seems like it would be somewhat high maintenance and prone to racing... Is it possible to extract the fd info and other kernel state after the process is frozen?

(asking as someone who has never actually checkpointed a process...)

TCP connection hijacking and parasites - as a good thing

Posted Aug 11, 2011 22:52 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

It shouldn't be hard, actually I'm thinking of implementing it myself.

Race conditions would be a problem, but:
1) Checkpoint/restart is inherently racy. Network packets might get lost, connections can time out during migration, etc.
2) It can be mitigated somewhat by providing kernel-level support for freezing.

TCP connection hijacking and parasites - as a good thing

Posted Aug 12, 2011 19:58 UTC (Fri) by gswoods (subscriber, #37) [Link]

The question I have about this is, doesn't the other end of the connection also keep state? Won't it drop the connection after some period of time?

Checkpointing is, presumably, something that is going to be done when it is expected that the checkpointed process(es) will be suspended for a relatively long period of time. What happens when those packets are reinjected after the remote side has already torn down the connection?

TCP connection hijacking and parasites - as a good thing

Posted Aug 12, 2011 21:13 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

actually, one of the most valuable use cases for checkpoint/restart is where you only expect to be down for a few seconds because you want to migrate a process from one system to another one.

TCP connection hijacking and parasites - as a good thing

Posted Aug 12, 2011 23:08 UTC (Fri) by martinfick (subscriber, #4455) [Link]

I am not sure it's the most valuable, but it's most likely the only really usable one given the types of issues you bring up.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 0:46 UTC (Sat) by Lennie (subscriber, #49641) [Link]

I wonder if checkpoint/restart would at some point in the future be useful to do regular checkpoints to a shared filesystem.

So when a machine crashes it can (automatically) be started on another machine.

It might not be useful for all workloads, but maybe for some?

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 1:41 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

the problem is that if the system crashes, you don't have a checkpoint at the time of the crash. at best you have a checkpoint that's fairly recent but is prior to the crash.

If you take that older checkpoint and start it on another machine, you have time-warped the application, and there may be other components that you are talking to that haven't done a similar time warp.

a perfect example of this:

I was evaluating a commercial product that goes out and changes the root password on your system on a frequent basis. If you need the root password you make a request to this system. When I asked the vendor how they did High Availability and Disaster Recovery for the system (i.e. what happens if the hardware it's running on bursts into flames), their answer was that they used vmware and vmware would auto-migrate the service to another server.

When I pointed out that vmware could not be 100% up to date and the old instance may have changed passwords on systems and the new instance won't know about it, their answer turned into "just trust us"

not all apps are subject to problems like this from time warps, but the more it communicates with the outside world, the more likely it is to have a problem.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 4:49 UTC (Sat) by raven667 (subscriber, #5198) [Link]

The vendor here is probably more right than you give credit for. Either you have a case where the vm live migrates to another machine and there is no lost data or time warp, or the hardware crashes and the only important thing is what is committed to disk, which does not introduce any new or special problem. The vm would be booted on another piece of hardware; no time warp, only crash recovery, journal replay, etc.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 5:00 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

no, if the password gets changed on a remote machine and the time warp happens, the password is now known by no system.

even if they were smart enough to not change the remote password until they stored it (and then what happens if they can't change the password? they did _not_ keep a record of old passwords), I've seen too many cases where a dying system scribbles on the disk to be comfortable with the idea of betting that such a thing does not happen for access to my servers.

the company abandoned the product (shut it down, didn't even sell it off) about 9 months later, so I don't think I was the only person asking such hard questions of them.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 22:37 UTC (Sat) by raven667 (subscriber, #5198) [Link]

I think you misunderstood my point; I wasn't clear enough. I do not disagree that, in a system such as that, data loss from a rolled-back transaction record of a password change during crash recovery would be very problematic. I only disagree that the introduction of a vm environment meaningfully changes that risk one way or another.

I am surprised that they couldn't give you a better answer though, I can think of three or four ways to mitigate that risk off the top of my head. Maybe their sales force was undertrained or their engineers were chuckleheads.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 22:46 UTC (Sat) by raven667 (subscriber, #5198) [Link]

One other thing.. Unless you are taking snapshots of a running vm and reverting back to them I do not think you can have the "time warp" you describe. Failing to commit changes to permanent storage is a different issue.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 22:55 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

the thing is, that's exactly what vmware does when it migrates machines, it takes snapshots of the VM periodically, and then if a system dies, boots the latest snapshot on another system.

this works quite well if you aren't sensitive to time warps, and everything is in one geographic location so that you can have the snapshots stored on shared disks.

if you are sensitive to time warps, and you need to allow for your recovery to be in a different geographic location (so that there is a significant lag in the storage changes getting replicated to the new datacenter) then the approach of just saying "vmware solves our HA/DR issues, we don't need to care about them at an application design level" is a very bad approach to take.

TCP connection hijacking and parasites - as a good thing

Posted Aug 15, 2011 20:50 UTC (Mon) by raven667 (subscriber, #5198) [Link]

I think you are confused and talking about two different things. If a system dies, then the VMs which were running on it crash and are booted on other systems in the cluster; there are no snapshots or time warps involved. This has nothing to do with live migration of machines, which does not in any way make machines more highly available except that one can manually live migrate off of one host in a cluster to do maintenance on it. Live migration does not cause time warps; at best the system is unavailable on the network for a few moments while the switches relearn what port its MAC address is coming from.

Of course, saying that VM solves our HA/DR problems like so much secret sauce is BS without making the application resilient to the same kind of problems that can happen to bare metal such as unexpected crashes and some way to handle DR which is very different than HA.

Actually VMware does have a feature now called Fault Tolerance where a VM is run in lockstep on two different nodes in a cluster. They are kept running the same instruction for instruction AFAIK so there would be no time warp during a failover event, there are some product demos on youtube showing them playing a video file over a remote desktop and it not skipping a frame while failing over. That could work as an HA solution but would do nothing for DR.

TCP connection hijacking and parasites - as a good thing

Posted Aug 16, 2011 0:08 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

I fully agree that for live migration when nothing is wrong, virtual machines (and containers with checkpoint/restore) can be very nice.

In this case the vendor was claiming that this also solved crash problems because vmware would just restart the application on the other server exactly where it left off when the first server crashed.

I doubt that the vmware Fault Tolerance would work for any system that used random numbers as I don't see how they could find all the places that generated or used the random numbers and make both machines have the exact same results.

TCP connection hijacking and parasites - as a good thing

Posted Aug 16, 2011 20:36 UTC (Tue) by robbe (subscriber, #16131) [Link]

> In this case the vendor was claiming that this also solved crash problems because vmware would just restart the application on the other server exactly where it left off when the first server crashed.

VMware's HA feature just restarts the VM affected by a server crash on another ESX server. From the perspective of the VM it looks like a spontaneous reboot. This guarantees minimum downtime, but does not make the crash invisible to the application.

> I doubt that the vmware Fault Tolerance would work for any system that used random numbers [...]

Remember that all the HW excepting the CPU is emulated. Fault Tolerance works by replicating the external events (e.g. an incoming network packet) seen by the original VM at the same point in time on the backup VM. Usual sources of entropy will be equal on both machines.

Maybe someone could deliberately construct a program that runs differently in the original and the backup VM. The only thing I can think of right now is measuring time via a covert channel (the normal means like RDTSC are covered, of course) -- but this is too vague still to make into an "exploit".

TCP connection hijacking and parasites - as a good thing

Posted Aug 16, 2011 20:52 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

given that the entropy sources include nanosecond timing, and if the CPU supports it, thermal noise on the CPU, I _really_ doubt that the resulting random numbers would be the same on the two systems.

TCP connection hijacking and parasites - as a good thing

Posted Aug 16, 2011 22:12 UTC (Tue) by raven667 (subscriber, #5198) [Link]

I think you underestimate the state of the art. Considering how often modern systems use random numbers I can't imagine this case not being handled, by disabling any hardware mechanism that could introduce non-determinism at the very least.

Looking at the VMware FT docs the hypervisor for the primary VM very thoroughly records anything that could change state and does not pass it through to the primary until it is transmitted and receipt acknowledged by the hypervisor for the secondary VM. Features such as SMP or hardware MMU are disabled as their state can't be recorded and could introduce non-determinism. Each event is injected into the secondary at the same execution point. That certainly has to work with nanosecond timing, from the point of view from inside the secondary VM. According to wall clock time the secondary will always be lagging behind, the demos show lag in the millisecond range, but because events are recorded it can be brought up to current during a failover event, so no state should be lost.

If you are interested you may want to do a little research on the topic, on your own. When I get this set up in my test environment I'll definitely run through creating ssh keys and whatnot to validate my understanding that this does work.

TCP connection hijacking and parasites - as a good thing

Posted Aug 17, 2011 9:25 UTC (Wed) by Lennie (subscriber, #49641) [Link]

Does that mean you would expect it to work, even if I use something like http://www.issihosts.com/haveged/ as one of the sources of randomness?

TCP connection hijacking and parasites - as a good thing

Posted Aug 17, 2011 13:24 UTC (Wed) by raven667 (subscriber, #5198) [Link]

That's what I would expect. I'd love to get my test environment straightened out and then I could determine one way or another.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 22:48 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

the 'correct' answer (and what competing products do) is to have multiple servers, and have the data replicated to all the servers.

it's fairly common (although complex) at the application level to implement updates of small amounts of information across a geographically distributed cluster of machines (look up two-phase commit). but if you rely on the entire VM state being synced, there is just too much unnecessary data involved to keep things synced in real time.

TCP connection hijacking and parasites - as a good thing

Posted Aug 15, 2011 21:00 UTC (Mon) by raven667 (subscriber, #5198) [Link]

One thing I thought of when mentally designing a system as you described is that you would need both a complete history and a large window of near-future passwords. If you restore a host from last year, you need to be able to log into it; and if you restore last week's backup of your password management system, it needs to know what the current passwords would be, otherwise you have no DR, just HA.

TCP connection hijacking and parasites - as a good thing

Posted Aug 16, 2011 0:09 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

yes, you do need to have a record of the historic passwords as well.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 10:11 UTC (Sat) by Lennie (subscriber, #49641) [Link]

So I guess the workload where it is easiest to work with is: long running calculations that have mostly read-only data on the network.

TCP connection hijacking and parasites - as a good thing

Posted Aug 13, 2011 20:04 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Additionally, it's not unusual to checkpoint the state of both endpoints simultaneously.

We're doing exactly that. It's kinda awe-inspiring to see the whole cluster with all the interconnections to 'time-warp' into a previous state.

TCP connection hijacking and parasites - as a good thing

Posted Aug 18, 2011 10:30 UTC (Thu) by etienne (subscriber, #25256) [Link]

I do not understand how you can hide the originating IP address of the process you are restarting.
Can you only checkpoint/restart on a machine with the same IP address (so the machine you are restarting cannot be present on the network at the checkpoint time because it would be duplicate IP)?

TCP connection hijacking and parasites - as a good thing

Posted Aug 18, 2011 18:19 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

you create a separate IP address for the service group that you want to migrate, then you move that IP address along with the services.

TCP connection hijacking and parasites - as a good thing

Posted Aug 19, 2011 2:04 UTC (Fri) by kevinm (guest, #69913) [Link]

> (1) copying any buffered incoming data using the MSG_PEEK option to recvmsg(), and (2) getting the sequence number to be read next with a new SIOCGINSEQ ioctl() command.

This seems fundamentally racy. What happens if a new segment arrives in the kernel in between (1) and (2)? Instead of a new ioctl(), it should be a new `MSG_PEEK_SEQ` flag to recvmsg() that supplies a control message containing the sequence number, with the iovec pointing to the peeked data.

TCP connection hijacking and parasites - as a good thing

Posted Aug 19, 2011 17:24 UTC (Fri) by swebb (guest, #78608) [Link]

> Tejun's example program digs around in the target's address space for a nice bit of memory which has execute permission; the contents of that memory are saved and replaced by his "parasite" code.

I covered this technique and its limitations in my Defcon 19 presentation. I created a project called libhijack that allows injection of arbitrary code into new memory mappings. I have a feeling libhijack will get much more powerful with PTRACE_SEIZE.

I'm curious why Linux developers don't implement a DTrace clone. PTrace should die a horrible death.

TCP connection hijacking and parasites - as a good thing

Posted Aug 23, 2011 12:15 UTC (Tue) by i3839 (guest, #31386) [Link]

Why would libhijack become more powerful with PTRACE_SEIZE?
As far as I can tell, it only makes ptracing more transparent,
not more powerful.

This example doesn't do anything that couldn't have been done
with normal ptrace, as far as I can tell.

And the whole approach is total madness. Why not just steal the
connection by passing the socket fd to the new target and closing
it in the original task? For that you only need to inject a couple
of system calls, with less disruptive data injections. No need to
muck around in TCP states, netfilter and all that other madness.

TCP connection hijacking and parasites - as a good thing

Posted Aug 23, 2011 15:00 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

you can only pass the socket FD to a process on the same system.

this approach can move the TCP connection to a different system.

TCP connection hijacking and parasites - as a good thing

Posted Aug 23, 2011 21:39 UTC (Tue) by i3839 (guest, #31386) [Link]

I think that the current code only handles local processes too,
at least that was my impression after reading the code, especially
main.c. You're right that this approach could make remote moves
possible.

But damn, it's ugly. I'd say, add an explicit connection moving
API instead of this mess.

TCP connection hijacking and parasites - as a good thing

Posted Aug 20, 2011 7:36 UTC (Sat) by slashdot (guest, #22014) [Link]

Restoring the socket state by injecting packets is quite brilliant; it avoids the risk that the resume feature could be used to get the TCP stack into a state which is otherwise impossible (and which could cause a kernel crash or exploit).

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds