Posted Jun 15, 2007 19:08 UTC (Fri) by vaurora
In reply to: Strong correlation?
Parent article: KHB: Real-world disk failure rates: surprises, surprises, and more surprises
Perhaps my wording was influenced by my personal disk failure rate of 100% within one week of the first SMART error - but I stand by it. :)
The story is slightly more complex than 7% -> 15-30% annual failure rate. The disks in the CMU study averaged a yearly failure rate of 3%, varying from 0.5% to 13.5% (after throwing out a 7-year-old batch of disks with a failure rate of 24%). The failure rate of the Google disks varied from 1.7% to 8.6%, depending on the age of the disks. I can't find the average in the paper, but eyeballing it and doing the math gives me 6.34% overall. So we can call it 3-7% average.
More importantly, the failure rate of a disk with no errors is lower than the overall average of 3-7% a year. Figures 6 and 7 in the Google paper show the different failure probabilities for disks with and without scan errors. A disk less than 6 months old with no scan errors has only a 2% probability of failure, while a disk with one or more scan errors has a 33% failure probability. Beginning on page 8 of the Google paper, the authors break down the consequences of scan errors based on time since last error, age of disk, and number of errors. For example, a single scan error on a disk older than 2 years results in a nearly 40% probability of failure in the next 6 months. Take a closer look at those graphs; there's more data than I could summarize in the article.
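To put the 2% versus 33% figures for young disks in perspective, here is a rough sketch of the arithmetic in Python. The function and constant names are my own, purely for illustration; the only inputs are the two percentages quoted above, and it says nothing about how the Google authors computed their numbers.

```python
# Failure probabilities quoted above for disks under 6 months old,
# kept as whole percentages so the arithmetic stays exact.
NO_SCAN_ERRORS_PCT = 2     # no scan errors
WITH_SCAN_ERRORS_PCT = 33  # one or more scan errors


def relative_risk(exposed_pct, baseline_pct):
    """How many times more likely failure is, compared to the baseline."""
    return exposed_pct / baseline_pct


def absolute_increase(exposed_pct, baseline_pct):
    """Increase in failure probability, in percentage points."""
    return exposed_pct - baseline_pct


if __name__ == "__main__":
    rr = relative_risk(WITH_SCAN_ERRORS_PCT, NO_SCAN_ERRORS_PCT)
    inc = absolute_increase(WITH_SCAN_ERRORS_PCT, NO_SCAN_ERRORS_PCT)
    print(f"relative risk: {rr}x, absolute increase: {inc} points")
```

In other words, a scan error on a young disk corresponds to roughly a 16.5-fold jump in failure probability, or 31 percentage points.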
Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.