TCP connection hijacking and parasites - as a good thing

Posted Aug 12, 2011 19:58 UTC (Fri) by gswoods (subscriber, #37)
Parent article: TCP connection hijacking and parasites - as a good thing

The question I have about this is, doesn't the other end of the connection also keep state? Won't it drop the connection after some period of time?

Checkpointing is, presumably, something that is going to be done when it is expected that the checkpointed process(es) will be suspended for a relatively long period of time. What happens when those packets are reinjected after the remote side has already torn down the connection?



Posted Aug 12, 2011 21:13 UTC (Fri) by dlang (subscriber, #313)

actually, one of the most valuable use cases for checkpoint/restart is where you only expect to be down for a few seconds because you want to migrate a process from one system to another one.

Posted Aug 12, 2011 23:08 UTC (Fri) by martinfick (subscriber, #4455)

I am not sure it's the most valuable, but it is most likely the only really usable one, given the types of issues you bring up.

Posted Aug 13, 2011 0:46 UTC (Sat) by Lennie (guest, #49641)

I wonder if checkpoint/restart would at some point in the future be useful to do regular checkpoints to a shared filesystem.

So when a machine crashes, the process can (automatically) be restarted on another machine.

It might not be useful for all workloads, but maybe some?

Posted Aug 13, 2011 1:41 UTC (Sat) by dlang (subscriber, #313)

the problem is that if the system crashes, you don't have a checkpoint at the time of the crash. at best you have a checkpoint that's fairly recent but is prior to the crash.

If you take that older checkpoint and start it on another machine, you have time-warped the application, and there may be other components that you are talking to that haven't done a similar time warp.

a perfect example of this:

I was evaluating a commercial product that goes out and changes the root password on your system on a frequent basis. If you need the root password you make a request to this system. When I asked the vendor how they did High Availability and Disaster Recovery for the system (i.e. what happens if the hardware it's running on bursts into flames), their answer was that they used vmware and vmware would auto-migrate the service to another server.

When I pointed out that vmware could not be 100% up to date and the old instance may have changed passwords on systems and the new instance won't know about it, their answer turned into "just trust us"

not all apps are subject to problems like this from time warps, but the more it communicates with the outside world, the more likely it is to have a problem.

Posted Aug 13, 2011 4:49 UTC (Sat) by raven667 (subscriber, #5198)

The vendor here is probably more right than you give credit for. Either the vm live migrates to another machine and there is no lost data or time warp, or the hardware crashes and the only important thing is what is committed to disk, which does not introduce any new or special problem. The vm would be booted on another piece of hardware: no time warp, only crash recovery, journal replay, etc.

Posted Aug 13, 2011 5:00 UTC (Sat) by dlang (subscriber, #313)

no, if the password gets changed on a remote machine and the time warp happens, the password is now known by no system.

even if they were smart enough to not change the remote password until they stored it (and then what happens if they can't change the password? they did _not_ keep a record of old passwords), I've seen too many cases where a dying system scribbles on the disk to be comfortable with the idea of betting that such a thing does not happen for access to my servers.
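
For illustration only, here is a rough Python sketch of the "store it before you change it, and keep the old ones" ordering discussed above. It is not how the product in question worked. The `PasswordVault` class and the `set_remote_password` callback are made-up names, and the in-memory dictionaries stand in for storage that would have to be durable and replicated in any real system.

```python
import secrets


class PasswordVault:
    """Hypothetical vault; plain dicts stand in for durable, replicated storage."""

    def __init__(self):
        self.history = {}  # host -> confirmed passwords, oldest first
        self.pending = {}  # host -> password recorded but not yet confirmed

    def rotate(self, host, set_remote_password):
        new_pw = secrets.token_urlsafe(16)
        # Step 1: record the candidate before touching the remote host.
        self.pending[host] = new_pw
        # Step 2: change the remote password; the callback may raise.
        try:
            set_remote_password(host, new_pw)
        except Exception:
            # Outcome unknown: keep the pending entry so that both the new
            # and the old password remain known to the vault.
            return False
        # Step 3: confirm, keeping the older passwords instead of dropping them.
        self.history.setdefault(host, []).append(new_pw)
        del self.pending[host]
        return True

    def candidates(self, host):
        """Passwords worth trying: pending first, then confirmed, newest first."""
        result = list(reversed(self.history.get(host, [])))
        if host in self.pending:
            result.insert(0, self.pending[host])
        return result
```

With this ordering, a crash between steps 1 and 3 leaves the vault knowing every password the remote host could possibly have. It does nothing about a dying system scribbling on the disk, which is a separate problem.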

the company abandoned the product (shut it down, didn't even sell it off) about 9 months later, so I don't think I was the only person asking such hard questions of them.

Posted Aug 13, 2011 22:37 UTC (Sat) by raven667 (subscriber, #5198)

I think you misunderstood my point; I wasn't clear enough. I do not disagree that, in a system like that, losing the record of a password change to a rolled-back transaction during crash recovery would be very problematic. I only disagree that the introduction of a vm environment meaningfully changes that risk one way or another.

I am surprised that they couldn't give you a better answer though, I can think of three or four ways to mitigate that risk off the top of my head. Maybe their sales force was undertrained or their engineers were chuckleheads.

Posted Aug 13, 2011 22:46 UTC (Sat) by raven667 (subscriber, #5198)

One other thing.. Unless you are taking snapshots of a running vm and reverting back to them I do not think you can have the "time warp" you describe. Failing to commit changes to permanent storage is a different issue.

Posted Aug 13, 2011 22:55 UTC (Sat) by dlang (subscriber, #313)

the thing is, that's exactly what vmware does when it migrates machines, it takes snapshots of the VM periodically, and then if a system dies, boots the latest snapshot on another system.

this works quite well if you aren't sensitive to time warps, and everything is in one geographic location so that you can have the snapshots stored on shared disks.

if you are sensitive to time warps, and you need to allow for your recovery to be in a different geographic location (so that there is a significant lag in the storage changes getting replicated to the new datacenter) then the approach of just saying "vmware solves our HA/DR issues, we don't need to care about them at an application design level" is a very bad approach to take.

Posted Aug 15, 2011 20:50 UTC (Mon) by raven667 (subscriber, #5198)

I think you are confused and talking about two different things. If a system dies, then the VMs which were running on it crash and are booted on other systems in the cluster; there are no snapshots or time warps involved. This has nothing to do with live migration of machines, which does not in any way make machines more highly available, except that one can manually live migrate VMs off of one host in a cluster to do maintenance on it. Live migration does not cause time warps; at worst the system is unavailable on the network for a few moments while the switches relearn what port its MAC address is coming from.

Of course, saying that VM solves our HA/DR problems like so much secret sauce is BS without making the application resilient to the same kind of problems that can happen to bare metal such as unexpected crashes and some way to handle DR which is very different than HA.

Actually VMware does have a feature now called Fault Tolerance where a VM is run in lockstep on two different nodes in a cluster. They are kept running the same, instruction for instruction AFAIK, so there would be no time warp during a failover event. There are some product demos on youtube showing them playing a video file over a remote desktop and it not skipping a frame while failing over. That could work as an HA solution but would do nothing for DR.

Posted Aug 16, 2011 0:08 UTC (Tue) by dlang (subscriber, #313)

I fully agree that for live migration when nothing is wrong, virtual machines (and containers with checkpoint/restore) can be very nice.

In this case the vendor was claiming that this also solved crash problems because vmware would just restart the application on the other server exactly where it left off when the first server crashed.

I doubt that the vmware Fault Tolerance would work for any system that used random numbers as I don't see how they could find all the places that generated or used the random numbers and make both machines have the exact same results.

Posted Aug 16, 2011 20:36 UTC (Tue) by robbe (subscriber, #16131)

> In this case the vendor was claiming that this also solved crash problems because vmware would just restart the application on the other server exactly where it left off when the first server crashed.

VMware's HA feature just restarts the VM affected by a server crash on another ESX server. From the perspective of the VM it looks like a spontaneous reboot. This guarantees minimum downtime, but does not make the crash invisible to the application.

> I doubt that the vmware Fault Tolerance would work for any system that used random numbers [...]

Remember that all the HW excepting the CPU is emulated. Fault Tolerance works by replicating the external events (e.g. an incoming network packet) seen by the original VM at the same point in time on the backup VM. Usual sources of entropy will be equal to both machines.

Maybe someone could deliberately construct a program that runs differently in the original and the backup VM. The only thing I can think of right now is measuring time via a covert channel (the normal means like RDTSC are covered, of course) -- but this is too vague still to make into an "exploit".

Posted Aug 16, 2011 20:52 UTC (Tue) by dlang (subscriber, #313)

given that the entropy sources include nanosecond timing, and if the CPU supports it, thermal noise on the CPU, I _really_ doubt that the resulting random numbers would be the same on the two systems.

Posted Aug 16, 2011 22:12 UTC (Tue) by raven667 (subscriber, #5198)

I think you underestimate the state of the art. Considering how often modern systems use random numbers I can't imagine this case not being handled, by disabling any hardware mechanism that could introduce non-determinism at the very least.

Looking at the VMware FT docs, the hypervisor for the primary VM very thoroughly records anything that could change state and does not pass it through to the primary until it has been transmitted to, and its receipt acknowledged by, the hypervisor for the secondary VM. Features such as SMP or hardware MMU are disabled as their state can't be recorded and could introduce non-determinism. Each event is injected into the secondary at the same execution point. That certainly has to work with nanosecond timing, from the point of view inside the secondary VM. According to wall clock time the secondary will always be lagging behind (the demos show lag in the millisecond range), but because events are recorded it can be brought up to current during a failover event, so no state should be lost.
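
To make the record-and-replay idea concrete, here is a toy Python sketch. It is in no way VMware's implementation; the `Machine`, `run_primary` and `replay` names are invented for the example, and the "execution point" is just a step counter. The only point is that if every non-deterministic input (including entropy) is logged together with the point at which it arrived, a second copy fed the same log ends up in the same state.

```python
import random


class Machine:
    """Toy deterministic 'VM': its state depends only on the events it is fed."""

    MOD = 1_000_003

    def __init__(self):
        self.steps = 0  # stand-in for an instruction counter
        self.state = 0

    def run(self, n):
        # Purely deterministic computation between external events.
        for _ in range(n):
            self.steps += 1
            self.state = (self.state * 31 + 7) % self.MOD

    def inject(self, value):
        # An external event (network packet, timer read, entropy sample).
        self.state = (self.state + value) % self.MOD


def run_primary(event_source, total_steps):
    """Run the primary, logging (execution point, value) for each event."""
    machine, log = Machine(), []
    for step in range(total_steps):
        value = event_source(step)  # non-deterministic on real hardware
        if value is not None:
            log.append((machine.steps, value))
            machine.inject(value)
        machine.run(1)
    return machine, log


def replay(log, total_steps):
    """Run a secondary, injecting each logged event at the same point."""
    machine = Machine()
    pending = list(log)
    while machine.steps < total_steps:
        while pending and pending[0][0] == machine.steps:
            machine.inject(pending.pop(0)[1])
        machine.run(1)
    return machine


if __name__ == "__main__":
    rng = random.Random()  # unseeded: differs from run to run
    primary, log = run_primary(
        lambda step: rng.randrange(1000) if step % 3 == 0 else None, 50
    )
    secondary = replay(log, 50)
    print(primary.state == secondary.state)
```

The secondary never consults its own source of randomness; it only ever sees what the primary saw, which is the property the FT description above relies on.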

If you are interested you may want to do a little research on the topic on your own. When I get this set up in my test environment I'll definitely run through creating ssh keys and whatnot to validate my understanding that this does work.

Posted Aug 17, 2011 9:25 UTC (Wed) by Lennie (guest, #49641)

Does that mean you would expect it to work, even if I use something like http://www.issihosts.com/haveged/ as one of the sources of randomness?

Posted Aug 17, 2011 13:24 UTC (Wed) by raven667 (subscriber, #5198)

That's what I would expect. I'd love to get my test environment straightened out and then I could determine one way or another.

Posted Aug 13, 2011 22:48 UTC (Sat) by dlang (subscriber, #313)

the 'correct' answer (and what competing products do) is to have multiple servers, and have the data replicated to all the servers.

it's fairly common (although complex) at the application level to implement updates of small amounts of information across a geographically distributed cluster of machines (look up two-phase commit). but if you rely on the entire VM state being synced, there is just too much unnecessary data involved to keep things synced in real time.
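
For anyone who hasn't run into two-phase commit, here is a bare-bones Python sketch of the idea, under heavy simplifying assumptions: the `Replica` class and `two_phase_commit` function are invented names, everything lives in one process, and there is no handling of timeouts or of a coordinator that dies between the phases, which is where the real complexity lives.

```python
class Replica:
    """Hypothetical replica holding a small key/value store."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}
        self.staged = None

    def prepare(self, key, value):
        # Phase 1: promise to apply the update, or refuse.
        if not self.healthy:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        # Phase 2 (success): make the staged update visible.
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        # Phase 2 (failure): drop whatever was staged.
        self.staged = None


def two_phase_commit(replicas, key, value):
    """Apply the update on every replica or on none of them."""
    votes = [replica.prepare(key, value) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False
```

Only the small record being updated crosses the wire, rather than the whole VM state, which is why this is practical between datacenters.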

Posted Aug 15, 2011 21:00 UTC (Mon) by raven667 (subscriber, #5198)

One thing I thought of when mentally designing a system like the one you described is that you would need both a complete history and a large window of near-future passwords. If you restore a host from last year, you need to be able to log into it, and if you restore last week's backup of your password management system, it needs to know what the current passwords would be. Otherwise you have no DR, just HA.
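
One possible way to get both the history and the near-future window, sketched in Python purely as an assumption (nobody in this thread described doing it this way): derive each password from a master key, the host name and a rotation epoch number, so that any copy of the system holding the key can recompute past, current and upcoming passwords. The `password_for` and `window` helpers are hypothetical names, and protecting the master key is left entirely out of scope.

```python
import base64
import hashlib
import hmac


def password_for(master_key: bytes, host: str, epoch: int) -> str:
    """Derive the password for one host and one rotation epoch."""
    message = f"{host}:{epoch}".encode()
    digest = hmac.new(master_key, message, hashlib.sha256).digest()
    # Truncate the URL-safe base64 form to a manageable password length.
    return base64.urlsafe_b64encode(digest)[:20].decode()


def window(master_key: bytes, host: str, current_epoch: int, back: int, ahead: int):
    """Passwords for epochs from `back` before to `ahead` after the current one."""
    first = max(0, current_epoch - back)
    last = current_epoch + ahead
    return {
        epoch: password_for(master_key, host, epoch)
        for epoch in range(first, last + 1)
    }
```

A password manager restored from last week's backup can then try the epochs ahead of the one it remembers, and a host restored from last year can be matched against the epochs behind it.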

Posted Aug 16, 2011 0:09 UTC (Tue) by dlang (subscriber, #313)

yes, you do need to have a record of the historic passwords as well.

Posted Aug 13, 2011 10:11 UTC (Sat) by Lennie (guest, #49641)

So I guess the workload where it is easiest to work with is: long running calculations that have mostly read-only data on the network.

Posted Aug 13, 2011 20:04 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)

Additionally, it's not unusual to checkpoint the state of both endpoints simultaneously.

We're doing exactly that. It's kinda awe-inspiring to see the whole cluster, with all the interconnections, 'time-warp' into a previous state.

Posted Aug 18, 2011 10:30 UTC (Thu) by etienne (guest, #25256)

I do not understand how you can hide the originating IP address of the process you are restarting.
Can you only checkpoint/restart on a machine with the same IP address (so the machine you are restarting on cannot be present on the network at checkpoint time, because it would be a duplicate IP)?

Posted Aug 18, 2011 18:19 UTC (Thu) by dlang (subscriber, #313)

you create a separate IP address for the service group that you want to migrate, then you move that IP address along with the services.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds