TCP connection hijacking and parasites - as a good thing
Posted Aug 12, 2011 23:08 UTC (Fri) by martinfick (subscriber, #4455)
Posted Aug 13, 2011 0:46 UTC (Sat) by Lennie (subscriber, #49641)
So when a machine crashes it can (automatically) be started on another machine.
It might not be useful for all workloads, but maybe for some?
Posted Aug 13, 2011 1:41 UTC (Sat) by dlang (✭ supporter ✭, #313)
If you take that older checkpoint and start it on another machine, you have time-warped the application, and there may be other components that you are talking to that haven't done a similar time warp.
a perfect example of this:
I was evaluating a commercial product that goes out and changes the root password on your system on a frequent basis. If you need the root password you make a request to this system. When I asked the vendor how they did High Availability and Disaster Recovery for the system (i.e. what happens if the hardware it's running on bursts into flames), their answer was that they used vmware and vmware would auto-migrate the service to another server.
When I pointed out that vmware could not be 100% up to date, so the old instance might have changed passwords on systems that the new instance won't know about, their answer turned into "just trust us".
not all apps are subject to time-warp problems like this, but the more an app communicates with the outside world, the more likely it is to have a problem.
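To make that failure mode concrete, here is a minimal Python sketch (PasswordManager and remote_host are hypothetical names invented for illustration) of a side effect escaping to the outside world between the last checkpoint and a crash:

    # Toy illustration of the time-warp hazard: state that left the VM
    # after the last checkpoint is silently forgotten on restore.
    import copy
    import secrets

    remote_host = {"root_password": "initial"}   # state in the outside world

    class PasswordManager:
        def __init__(self):
            self.known_password = "initial"

        def rotate(self):
            new = secrets.token_hex(8)
            remote_host["root_password"] = new   # side effect leaves the VM...
            self.known_password = new            # ...then local state catches up

    mgr = PasswordManager()
    checkpoint = copy.deepcopy(mgr)   # stand-in for a VM snapshot

    mgr.rotate()                      # rotation happens after the snapshot
    # -- crash here; failover restores the snapshot --
    mgr = checkpoint

    # The outside world moved on, the restored instance did not:
    print(remote_host["root_password"] == mgr.known_password)  # False

The restored instance now holds a password the remote host no longer accepts, which is exactly the lockout scenario described above.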
Posted Aug 13, 2011 4:49 UTC (Sat) by raven667 (subscriber, #5198)
Posted Aug 13, 2011 5:00 UTC (Sat) by dlang (✭ supporter ✭, #313)
even if they were smart enough to not change the remote password until they stored it (and then what happens if they can't change the password? they did _not_ keep a record of old passwords), I've seen too many cases where a dying system scribbles on the disk to be comfortable betting that such a thing won't happen for access to my servers.
the company abandoned the product (shut it down, didn't even sell it off) about 9 months later, so I don't think I was the only person asking such hard questions of them.
Posted Aug 13, 2011 22:37 UTC (Sat) by raven667 (subscriber, #5198)
I am surprised that they couldn't give you a better answer, though; I can think of three or four ways to mitigate that risk off the top of my head. Maybe their sales force was undertrained or their engineers were chuckleheads.
Posted Aug 13, 2011 22:46 UTC (Sat) by raven667 (subscriber, #5198)
Posted Aug 13, 2011 22:55 UTC (Sat) by dlang (✭ supporter ✭, #313)
this works quite well if you aren't sensitive to time warps, and everything is in one geographic location so that you can have the snapshots stored on shared disks.
if you are sensitive to time warps, and your recovery needs to be in a different geographic location (so that there is a significant lag in the storage changes getting replicated to the new datacenter), then just saying "vmware solves our HA/DR issues, we don't need to care about them at an application design level" is a very bad approach to take.
Posted Aug 15, 2011 20:50 UTC (Mon) by raven667 (subscriber, #5198)
Of course, saying that a VM solves your HA/DR problems, like so much secret sauce, is BS unless you also make the application resilient to the same kinds of problems that can happen on bare metal, such as unexpected crashes, and have some way to handle DR, which is very different from HA.
Actually VMware does have a feature now called Fault Tolerance where a VM is run in lockstep on two different nodes in a cluster. They are kept running instruction for instruction, AFAIK, so there would be no time warp during a failover event. There are some product demos on youtube showing them playing a video file over a remote desktop without skipping a frame while failing over. That could work as an HA solution but would do nothing for DR.
Posted Aug 16, 2011 0:08 UTC (Tue) by dlang (✭ supporter ✭, #313)
In this case the vendor was claiming that this also solved crash problems because vmware would just restart the application on the other server exactly where it left off when the first server crashed.
I doubt that vmware Fault Tolerance would work for any system that uses random numbers, as I don't see how they could find all the places that generate or consume random numbers and make both machines produce exactly the same results.
Posted Aug 16, 2011 20:36 UTC (Tue) by robbe (guest, #16131)
VMware's HA feature just restarts the VM affected by a server crash on another ESX server. From the perspective of the VM it looks like a spontaneous reboot. This guarantees minimum downtime, but does not make the crash invisible to the application.
> I doubt that the vmware Fault Tolerance would work for any system that used random numbers [...]
Remember that all the HW except the CPU is emulated. Fault Tolerance works by replicating the external events (e.g. an incoming network packet) seen by the original VM at the same point in time on the backup VM. The usual sources of entropy will be identical on both machines.
Maybe someone could deliberately construct a program that runs differently in the original and the backup VM. The only thing I can think of right now is measuring time via a covert channel (the normal means like RDTSC are covered, of course) -- but this is still too vague to make into an "exploit".
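As a rough sketch of that record-and-replay idea (an illustration of the general technique, not VMware's actual protocol), the primary logs every nondeterministic input and the backup consumes the log instead of generating fresh values:

    # Illustrative only: entropy is drawn once on the primary, logged,
    # and replayed verbatim on the backup at the same execution point.
    import os

    log = []  # event log shipped from primary to backup

    def primary_read_entropy(n):
        value = os.urandom(n)   # real nondeterminism happens exactly once...
        log.append(value)       # ...and is recorded for the backup
        return value

    def backup_read_entropy(n):
        return log.pop(0)       # replayed, never regenerated

    key_primary = primary_read_entropy(16)
    key_backup = backup_read_entropy(16)
    assert key_primary == key_backup  # both VMs see the same "random" bytes

Under a scheme like this, even key generation would presumably come out identical on both sides, since the entropy inputs are themselves replicated events.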
Posted Aug 16, 2011 20:52 UTC (Tue) by dlang (✭ supporter ✭, #313)
Posted Aug 16, 2011 22:12 UTC (Tue) by raven667 (subscriber, #5198)
Looking at the VMware FT docs, the hypervisor for the primary VM thoroughly records anything that could change state and does not pass an event through to the primary until it has been transmitted to, and its receipt acknowledged by, the hypervisor for the secondary VM. Features such as SMP or the hardware MMU are disabled, as their state can't be recorded and could introduce non-determinism. Each event is injected into the secondary at the same execution point, which has to work with nanosecond precision from the point of view of the secondary VM. In wall-clock time the secondary will always be lagging behind (the demos show lag in the millisecond range), but because events are recorded it can be brought up to date during a failover event, so no state should be lost.
If you are interested you may want to do a little research on the topic on your own. When I get this set up in my test environment I'll definitely run through creating ssh keys and whatnot to validate my understanding that this does work.
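One detail worth pulling out is the output-commit rule: externally visible output is held back until the secondary has acknowledged the log entries it depends on. A toy Python illustration (my reading of the docs; the names and sequencing here are assumptions, not VMware code):

    # Toy model: packets queue up until the secondary's hypervisor has
    # acknowledged every log entry the packet depends on.
    from collections import deque

    pending = deque()   # (last_log_seq_needed, packet), in order
    acked_seq = 0       # highest log sequence acknowledged by the secondary

    def queue_output(packet, log_seq):
        pending.append((log_seq, packet))

    def secondary_acks(seq):
        global acked_seq
        acked_seq = seq
        released = []
        while pending and pending[0][0] <= acked_seq:
            released.append(pending.popleft()[1])
        return released   # only now do packets actually leave the primary

    queue_output(b"reply-1", log_seq=3)
    queue_output(b"reply-2", log_seq=7)
    print(secondary_acks(5))   # [b'reply-1']; reply-2 is still held back

This is what would prevent the outside world from ever seeing state the secondary could not reproduce after a failover.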
Posted Aug 17, 2011 9:25 UTC (Wed) by Lennie (subscriber, #49641)
Posted Aug 17, 2011 13:24 UTC (Wed) by raven667 (subscriber, #5198)
Posted Aug 13, 2011 22:48 UTC (Sat) by dlang (✭ supporter ✭, #313)
it's fairly common (although complex) at the application level to implement updates of small amounts of information across a geographically distributed cluster of machines (look up two-phase commit; a sketch follows below). but if you rely on the entire VM state being synced, there is just too much unnecessary data involved to keep things synced in real time.
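For reference, a bare-bones two-phase commit sketch (in-process toy objects standing in for nodes; a real implementation adds durable logging, timeouts, and coordinator recovery):

    # Phase 1: every participant votes on a staged change.
    # Phase 2: commit everywhere on a unanimous yes, otherwise abort.
    class Participant:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy
            self.value, self.staged = None, None

        def prepare(self, value):   # phase 1: stage the change and vote
            self.staged = value
            return self.healthy

        def commit(self):           # phase 2a: make the staged change real
            self.value, self.staged = self.staged, None

        def abort(self):            # phase 2b: discard the staged change
            self.staged = None

    def two_phase_commit(participants, value):
        if all(p.prepare(value) for p in participants):  # unanimous yes?
            for p in participants:
                p.commit()
            return True
        for p in participants:
            p.abort()
        return False

    nodes = [Participant("dc1"), Participant("dc2"), Participant("dc3")]
    print(two_phase_commit(nodes, "new-password"))  # True: all replicas agree

Note how only the small piece of information being agreed on crosses the network, not the whole machine state.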
Posted Aug 15, 2011 21:00 UTC (Mon) by raven667 (subscriber, #5198)
Posted Aug 16, 2011 0:09 UTC (Tue) by dlang (✭ supporter ✭, #313)
Posted Aug 13, 2011 10:11 UTC (Sat) by Lennie (subscriber, #49641)
Posted Aug 13, 2011 20:04 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
We're doing exactly that. It's kinda awe-inspiring to see the whole cluster, with all its interconnections, 'time-warp' into a previous state.
Posted Aug 18, 2011 10:30 UTC (Thu) by etienne (subscriber, #25256)
Posted Aug 18, 2011 18:19 UTC (Thu) by dlang (✭ supporter ✭, #313)