
# Strong correlation?


Posted Jun 15, 2007 19:08 UTC (Fri) by vaurora (guest, #38407)
In reply to: Strong correlation? by joern
Parent article: KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Perhaps my wording was influenced by my personal disk failure rate of 100% within one week of the first SMART error - but I stand by it. :)

The story is slightly more complex than 7% -> 15-30% annual failure rate. The disks in the CMU study averaged a yearly failure rate of 3%, varying from 0.5% to 13.5% (after throwing out a 7-year-old batch of disks with a failure rate of 24%). The failure rate of the Google disks varied from 1.7% to 8.6%, depending on the age of the disks. I can't find the average in the paper, but eyeballing it and doing the math gives me 6.34% overall. So we can call it a 3-7% average.

More importantly, the failure rate of a disk with no errors is lower than the overall average of 3-7% a year. Figures 6 and 7 in the Google paper show the different failure probabilities for disks with and without scan errors. A disk less than 6 months old with no scan errors has only a 2% probability of failure, while a disk with one or more scan errors has a 33% failure probability. Beginning on page 8 of the Google paper, the authors break down the consequences of scan errors based on time since last error, age of disk, and number of errors. For example, a single scan error on a disk older than 2 years results in a nearly 40% probability of failure in the next 6 months. Take a closer look at those graphs; there's more data than I could summarize in the article.

Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.
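To put the 2% vs. 33% figures above in comparable terms, here is a minimal Python sketch that computes the relative risk and the odds ratio between the two groups. The function and constant names are made up for illustration; only the two probabilities come from the comment above.

```python
# Failure probabilities quoted above for disks less than 6 months old,
# without and with scan errors (as cited from the Google paper).
P_NO_SCAN_ERRORS = 0.02
P_SCAN_ERRORS = 0.33


def relative_risk(p_exposed: float, p_baseline: float) -> float:
    """How many times more likely failure is in the exposed group."""
    return p_exposed / p_baseline


def odds(p: float) -> float:
    """Convert a probability into odds, i.e. p against (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_exposed: float, p_baseline: float) -> float:
    """Ratio of the odds of failure in the two groups."""
    return odds(p_exposed) / odds(p_baseline)


rr = relative_risk(P_SCAN_ERRORS, P_NO_SCAN_ERRORS)
orr = odds_ratio(P_SCAN_ERRORS, P_NO_SCAN_ERRORS)
print(f"relative risk: {rr:.1f}x, odds ratio: {orr:.1f}")
```

Neither number settles whether the correlation counts as "strong"; they only show the size of the jump relative to the baseline.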

Strong correlation?

Posted Jun 15, 2007 19:26 UTC (Fri) by joern (subscriber, #22392)

> Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.

I've done that today. Admittedly, having shell access to four separate servers in different locations is uncommon for normal users.

In the end it is a matter of definition where a weak correlation ends and a strong correlation starts. I wouldn't speak of a strong correlation if I'd lose money when betting on the correlated event. So for me it would have to be x% -> 50+%.
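The betting criterion can be sketched as the expected value of an even-money bet that the disk fails. This assumes a 1:1 payout, which the comment does not state explicitly; the function name is hypothetical.

```python
def even_money_bet_value(p_event: float, stake: float = 1.0) -> float:
    """Expected winnings of betting `stake` at 1:1 that the event happens.

    Win `stake` with probability p_event, lose it otherwise.
    """
    return p_event * stake - (1.0 - p_event) * stake


# A 33% failure probability loses money on average at even odds;
# the bet only breaks even at 50%.
for p in (0.33, 0.50, 0.60):
    print(f"p = {p:.2f}: expected value per unit staked = {even_money_bet_value(p):+.2f}")
```

Under that assumption the expected value is 2p - 1 per unit staked, which is where the 50% threshold comes from.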

Strong correlation?

Posted Jun 15, 2007 21:26 UTC (Fri) by giraffedata (subscriber, #1954)

In an operation of the size these papers talk about, gut feelings about "strong" and "weak" correlation and the pain of data loss aren't even significant. It's pure numbers. Somebody somewhere has decided how much a data loss costs; probability, repair costs, and interest rates fill out the equation.

Sometimes the cost of data loss is really simple. I had a telephone company customer years ago who said an unreadable tape cost him exactly \$16,000. The tapes contained billing records of calls; without the record, the company simply couldn't bill for the call. Another, arguing against backing up his product source code, showed the cost of hiring engineers to rewrite a megabyte of code from scratch.

In the Google situation, I believe single drive data loss is virtually cost-free. That's because of all that replication and backup. In that situation, the cost of the failure is just the cost of service interruption (or degradation) and drive replacement. And since such interruptions and replacements happen regularly, the only question is whether it's cheaper to replace a drive earlier and thereby suffer the interruption later.

Anyway, my point is that with all the different ways disk drives are used, I'm sure there are plenty where replacing the drive when its expected failure rate jumps to 30% is wise and plenty where doing so at 90% is unwise.
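One way to make that equation concrete is a deliberately simplified expected-cost comparison in Python. It is a sketch under stated assumptions: a single decision period, no interest rates, no service-interruption cost, and dollar figures that are purely illustrative (the \$16,000 is borrowed from the tape anecdote above, the \$200 drive price is invented). The function names are hypothetical.

```python
def should_replace_early(p_fail: float,
                         replacement_cost: float,
                         data_loss_cost: float) -> bool:
    """Replace now if the expected cost of waiting exceeds the cost of acting.

    Waiting costs nothing if the drive survives the period (assumption);
    if it fails, you pay for the lost data plus the replacement anyway.
    """
    expected_cost_of_waiting = p_fail * (replacement_cost + data_loss_cost)
    return expected_cost_of_waiting > replacement_cost


def break_even_probability(replacement_cost: float,
                           data_loss_cost: float) -> float:
    """Failure probability at which replacing early and waiting cost the same."""
    return replacement_cost / (replacement_cost + data_loss_cost)


# Expensive data loss (illustrative $16,000) and a cheap drive (invented $200):
# replacing at a 30% failure probability is easily justified.
print(should_replace_early(0.30, replacement_cost=200, data_loss_cost=16_000))

# Data loss is free thanks to replication: even 90% does not justify it.
print(should_replace_early(0.90, replacement_cost=200, data_loss_cost=0))
```

In this model the break-even probability rises to 100% when data loss costs nothing, which matches the point that the same failure probability can call for opposite decisions in different deployments.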

Strong correlation?

Posted Jun 16, 2007 2:16 UTC (Sat) by vaurora (guest, #38407)

This is an excellent point - the utility of failure probability data depends on the use case. Google in general has all data replicated a minimum of three times (see the GoogleFS paper), and as a result, in most situations it is not cost-effective in practice to replace a drive before it actually fails. For any sort of professional operation with regular backups and/or replication, this data is not particularly useful except as input into how many thousands of new hard drives to order next month. But for an individual user without automated backup systems, it can provide a valuable hint on the utility of conducting that long-delayed manual backup within the next few hours.
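A rough sketch of why replication changes the picture: if each replica sits on a different drive and the drives fail independently (a strong assumption, and one that ignores re-replication after the first failure), data is lost only when every replica fails. The function name is hypothetical.

```python
def p_all_replicas_fail(p_single: float, replicas: int = 3) -> float:
    """Probability that every replica is lost, assuming independent failures."""
    return p_single ** replicas


# Using the 33% figure for a young disk with scan errors as a pessimistic
# per-drive failure probability (illustrative only).
p = 0.33
print(f"single copy:  {p_all_replicas_fail(p, 1):.3f}")
print(f"three copies: {p_all_replicas_fail(p, 3):.3f}")
```

Even with every drive in the pessimistic 33% category, three independent copies bring the chance of losing the data to under 4% in this model, while the individual user with one copy carries the full 33%.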