Quotes of the week

After this work, Left 4 Dead 2 is running at 315 FPS on Linux. That the Linux version runs faster than the Windows version (270.6) seems a little counter-intuitive, given the greater amount of time we have spent on the Windows version. However, it does speak to the underlying efficiency of the kernel and OpenGL. Interestingly, in the process of working with hardware vendors we also sped up the OpenGL implementation on Windows. Left 4 Dead 2 is now running at 303.4 FPS with that configuration.
The Valve Linux team

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.
Posted Aug 10, 2012 0:06 UTC (Fri) by dvdeug (subscriber, #10998) [Link]

Surely it would be better to regularly kill processes and watch them fall instead of randomly killing processes in an active system. Surely the result of such a failure could be something like a user not getting a package or not getting a program, where you may not hear of the problem, and if you do, it'll be hard to trace back to the source.

Posted Aug 10, 2012 2:06 UTC (Fri) by dlang (subscriber, #313) [Link]

First off, they are talking about failures much more significant than killing random processes; they are including network/system/power/building failures as well.

Secondly, in theory, planning for every possible failure and setting up explicit handling for that failure is the best approach. In practice, people have blind spots, and something will go wrong that they didn't think of. It gets even worse when you start talking about combinations of failures.

As a result, the practice of randomly killing devices/systems/processes actually makes you far more resilient in the long run.

It's rough to get started with this: you need to make a real effort to make your system handle all the normal outages that you can think of, and you need management that agrees with this and is willing to accept the occasional outage that results when you find a new problem.

But when you have deliberately taken something down, it's usually far easier to bring it up again than when the same failure happens for real.

Plus you have your logs about what was deliberately "failed", and this can greatly cut down on your troubleshooting time if there is an outage.
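
To make the logging point concrete, here is a minimal sketch in Python of what such a deliberate-failure step might look like. Everything in it is an assumption made for illustration: the name fail_one, the take_down callback, and the target names are hypothetical and do not describe any particular tool.

```python
import logging
import random

log = logging.getLogger("fault-injection")


def fail_one(targets, take_down, rng=random):
    """Take down one randomly chosen target and return its name.

    `targets` and `take_down` are hypothetical stand-ins: a collection of
    names and a callable that performs the actual outage for one name.
    The choice is logged before take_down runs, so the record exists even
    if the failure takes the injector down with it.
    """
    victim = rng.choice(list(targets))
    log.warning("deliberately failing %s", victim)
    take_down(victim)
    return victim
```

Calling fail_one(["web-1", "db-2"], stop_service), where stop_service is whatever actually causes the outage in your environment, leaves a log line naming the victim, which is the record you would check first if something breaks unexpectedly afterwards.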

Posted Aug 15, 2012 3:12 UTC (Wed) by Trelane (subscriber, #56877) [Link]

They seem to be pretty consistently failing to support Linux...

