KHB: Real-world disk failure rates: surprises, surprises, and more surprises

June 12, 2007

This article was contributed by Valerie Henson

At this year's USENIX Conference on File and Storage Technologies (FAST), we were treated to two papers studying failure rates in disk populations numbering over 100,000. These kinds of data sets are hard to get - first you have to have 100,000 disks, then you have to record failure-related data faithfully for years on end, and then you have to release the data in a form that doesn't get anyone sued. The storage community has salivated over this kind of real-world data for years, and now we have not one, but two (!) long-term studies of disk failure rates. The conference hall was packed during these two presentations. When the talks were done, we stumbled out into the hallway, dazed and excited by the many surprising results. Heat is negatively correlated with failure! Failures show short- AND long-term correlation! SMART errors do mean the drive is more likely to fail, but a third of drives die with no warning at all! The size of the data sets, the quality of the analysis, and the non-intuitive results win these two papers a place on the Kernel Hacker's Bookshelf.

The first paper (and winner of Best Paper) was Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, by Bianca Schroeder and Garth Gibson. They reviewed failure data from a collection of 100,000 disks over a period of up to 5 years. The disks were deployed in a variety of HPC clusters and at an Internet service provider. Disk failure was defined as the disk being replaced, and the date of replacement was used as the date of failure, since determining exactly when a disk failed was not possible.

Their first major result was that the real-world annualized failure rate (the average percentage of disks failing per year) was much higher than the manufacturers' estimates - an average of 3% vs. the estimated 0.5-0.9%. Disk manufacturers obviously can't test disks for a year before shipping them, so they stress-test disks in high-temperature, high-vibration, high-workload environments and use data from previous models to estimate MTTF. Only one set of disks had a real-world failure rate less than the estimated failure rate, and one set of disks had a 13.5% annualized failure rate!
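
A back-of-the-envelope conversion makes the gap concrete. The sketch below just applies the standard approximation AFR ≈ hours-per-year / MTTF to datasheet-style MTTF figures; the only numbers taken from the papers are the field rates quoted above.

    # Convert a datasheet MTTF (hours) into the nominal annualized failure
    # rate (AFR) it implies.  For small rates, AFR ~= hours-per-year / MTTF.
    HOURS_PER_YEAR = 24 * 365

    def nominal_afr(mttf_hours):
        """Approximate annualized failure rate implied by an MTTF, as a fraction."""
        return HOURS_PER_YEAR / mttf_hours

    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} hours  ->  nominal AFR {nominal_afr(mttf):.2%}")

    # Prints roughly 0.88% and 0.58% - the 0.5-0.9% datasheet range above,
    # versus the ~3% average (up to 13.5%) replacement rate seen in the field.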

More surprisingly, they found no correlation between failure rate and disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set was composed entirely of SATA drives, which are commonly regarded as less reliable than SCSI or Fibre Channel.

In another surprise, they debunked the "bathtub model" of disk failure rates. In this theory, disks experience a higher "infant mortality" initial rate of failure, then settle down for a few years of low failure rate, and then begin to wear out and fail. The graph of the probability vs. time looks like a bathtub, flat in the middle and sloping up at the ends. Instead, the real-world failure rate began low and steadily increased over the years. Disks don't have a sweet spot of low failure rate.

Failures within a batch of disks were strongly correlated over both short and long time periods. If one disk in a batch had failed, the probability of another failure in that batch stayed elevated for at least 2 years afterward - if one disk in your batch has just died, another is more likely to follow. That's scary news for RAID arrays built from disks of the same batch. A recent paper from the 2006 Storage Security and Survivability Workshop, Using Device Diversity to Protect Data against Batch-Correlated Disk Failures, by Jehan-François Pâris and Darrell D. E. Long, calculated the increase in RAID reliability from mixing batches of disks. Using more than one kind of disk increases costs, but by combining the data from these two papers, RAID users can calculate the value of the extra reliability and make the most economical decision.
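
The Pâris/Long analysis is more careful than anything that fits here, but a back-of-the-envelope sketch shows why batch correlation matters for a RAID-5 array. Every figure below (array size, rebuild time, the assumed 5x correlation factor) is a hypothetical placeholder, not a number from any of the papers; only the 3% AFR comes from the CMU average.

    # Rough model: data loss in RAID-5 requires a first failure followed by a
    # second failure among the surviving disks during the rebuild window.
    # All parameters below are illustrative placeholders.
    N_DISKS = 8            # disks in the array
    AFR = 0.03             # ~3% annualized failure rate (CMU study average)
    REBUILD_HOURS = 24     # assumed time to rebuild onto a spare
    HOURS_PER_YEAR = 24 * 365

    def p_array_loss(correlation_factor=1.0):
        """Approximate probability of losing the array's data in one year."""
        p_first = 1 - (1 - AFR) ** N_DISKS
        p_second = (N_DISKS - 1) * AFR * (REBUILD_HOURS / HOURS_PER_YEAR)
        return p_first * p_second * correlation_factor

    print(f"independent failures:              {p_array_loss(1.0):.5%}")
    print(f"same batch, assumed 5x correlated: {p_array_loss(5.0):.5%}")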

The second paper, Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, reports on disk failure rates at Google. They used a Google tool for recording system health parameters, along with many other staples of Google software (MapReduce, Bigtable, etc.), to collect and analyze the data. They focused on SMART statistics - the monitoring built into many modern disk drives, which records statistics such as scan errors and reallocated blocks.

The first result agrees with the first paper: the annualized failure rate was much higher than estimated, between 1.7% and 8.6%. They next looked for a correlation between failure rate and drive utilization (as estimated by the amount of data read from or written to the drive). They found a much weaker correlation between higher utilization and failure rate than expected, with low-utilization disks often having higher failure rates than medium-utilization disks and, in the case of the 3-year-old vintage of disks, even higher than the high-utilization group.

Now for the most surprising result. In Google's population of cheap ATA disks, high temperature was negatively correlated with failure! In the authors' words:

In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

This correlation held true over a temperature range of 17-55°C. Only in the 3-year-old disk population was there a correlation between high temperatures and failure rates. My completely unsupported and untested hypothesis is that drive manufacturers stress-test their drives in high-temperature environments to simulate longer wear. Perhaps they have unwittingly designed drives that work better in their high-temperature test environment at the expense of a more typical low-temperature field environment.

Finally, they looked at the SMART data gathered from the drives. Overall, any kind of SMART error correlated strongly with disk failure. A scan error is recorded when the drive finds a problem while reading the entire disk in the background. Within 8 months of the first scan error, about 30% of drives would fail completely. A reallocation error occurs when a block can't be written and is reassigned to another location on the disk; about 15% of the affected drives failed within 8 months of a reallocation error. On the other hand, 36% of the drives that failed had given no warning whatsoever, either from SMART errors or from exceptionally high temperatures.

For Google's purposes, the predictive power of SMART is of limited utility. Replacing every disk that had a SMART error would end up replacing good disks that will run for years to come about 70% of the time. For Google, this isn't cost-effective, since all their data is replicated several times. But for an individual user for whom losing their disk is a disaster, replacing the disk at the first sign of a SMART error makes eminent sense. I have personally had two laptop drives start spitting SMART errors in time to get my data off the disk before it died completely.
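
For readers who want to act on that advice, the SMART counters the Google paper analyzed can be read on a Linux machine with the smartmontools package. Below is a minimal sketch, assuming smartctl is installed, the drive is /dev/sda, and the script runs as root; attribute names and raw-value formats vary by vendor, so treat it as a starting point rather than a complete health check.

    # Minimal SMART check using smartmontools' smartctl.
    # Looks only for the reallocation-related counters discussed above.
    import subprocess

    DEVICE = "/dev/sda"   # adjust for your system
    WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

    health = subprocess.run(["smartctl", "-H", DEVICE],
                            capture_output=True, text=True).stdout
    print(health.strip())

    attrs = subprocess.run(["smartctl", "-A", DEVICE],
                           capture_output=True, text=True).stdout
    for line in attrs.splitlines():
        if any(name in line for name in WATCHED):
            fields = line.split()
            name, raw = fields[1], fields[-1]
            if raw.isdigit() and int(raw) > 0:
                print(f"warning: {name} = {raw} - "
                      "consider backing up and replacing this drive")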

Overall, these are two exciting papers with long-awaited real-world failure data on large disk populations. We should expect to see more publications analyzing these data sets in the years to come.

Valerie Henson is a Linux file systems consultant specializing in file system check and repair.



KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 2:36 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

The second disk failure research paper, Failure Trends in a Large Disk Drive Population, wants a login and password to download the PDF. Any way I might get it without having to create a USENIX login account?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 3:28 UTC (Thu) by rasjidw (guest, #15913) [Link]

Try http://labs.google.com/papers/disk_failures.html.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 3:55 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Funny that the Google Labs paper is one I downloaded several months ago (its date is February 2007). In fact, didn't LWN.net publish an article about this around that time?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 4:16 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Thank you for the article - it gives some insight into how hard disk manufacturers can make claims of X hours MTTF and the like, and into how to make general estimates of disk drive life spans.

Veering slightly off-topic, I have become an exponent for Seagate hard drives - any company who proudly advertises "5 year warranty" on the packaging of their mass-market consumer IDE and SATA drives (whilst most other major manufacturers only give 1-3 years) gets my business. It's not necessarily about whether I'd really have to make a warranty claim (WDC honored such a claim for me back in 2002 with minimal fuss), but rather that they have such a high level of confidence in their craftsmanship as to even advertise such a warranty.

Sorry if this violates any rules for "plugging" a particular brand. But, I'm reminded of what the salesperson at Fry's told me about why Seagate warranties their drives so well: They re-engineered the spindle bearings and motor assembly - the critical points of the disk drive which the salesperson said were most often the cause of total drive failure. Any other ideas/comments?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 7:03 UTC (Thu) by bronson (subscriber, #4806) [Link]

Er, you listen to salespeople at Fry's??

I've only ever managed clusters of disks numbering in the few tens. Overall I find that hard disks are surprisingly reliable. I think most disks fall out of rotation because of a lack of capacity, not because they break. (I still remember how thrilled I was when they fit 20 GB of data into a 3.5" package... today I can fit 50 of those drives into a single 3.5" package!)

One lesson that I've learned is that manufacturer loyalty is pretty much meaningless. I had an early batch of Quantum 15G drives that were so stone-cold reliable I'm sure they would still be working today. However, Quantum Fireballs would reliably die after two years. I remember Maxtor producing utter crap in the past but I have a set of their 60GB drives still spinning. Seagate used to be fairly mediocre and now they're top notch across the board. When deciding what drive to buy, go to Storage Review and read about the individual models; brand is meaningless.

I also find that motherboard failures (i.e. CPU or memory socket corrosion, weak power supply, etc.) tend to destroy the drive as well. I read the Google paper back in Feb and I don't remember them taking this into account... It would be nice to know what portion of failures were drive-only, and what portion were chaotic damage that just happened to include the drive as well.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 15, 2007 0:51 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

> Er, you listen to salespeople at Fry's??

I listen to them, but I don't always believe everything they say. But, again, my Seagate discussion was more tuned to the idea that their warranty (which they proudly advertise in big print on the sides of the box) is much longer than those of most other name-brand hard disk manufacturers (whose warranties are often hidden in the fine print). I agree, brand loyalty doesn't mean much these days, and besides, I used to think WDC (Western Digital) drives were the finest-quality consumer hard disks. Nowadays I don't think WDC drives are nearly as good as they were 5-10 years ago, and I've never had any good experiences with Maxtor. Times change, and so do hard drive manufacturers' quality control and standards.

Maybe in 5-10 years I'll think the same about Maxtor, WDC, or Fujitsu as I do now about Seagate...

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 15, 2007 4:13 UTC (Fri) by njs (guest, #40338) [Link]

It's a funny thing about us humans, if you put us in a data vacuum, we find some way to fill it, whether it makes sense or not. Most of us have no useful statistics at all on what hard drive brands or models will be reliable; heck, as the article points out, the people *making* the drives don't even know this.

But this ignorance makes us so *uncomfortable* that everyone finds some random fact to base their decisions on, like an anecdote about that time they bought a *** brand drive and it died after a week so they never buy *** anymore, or how they heard the new *** brand drives use a fancier production process, or the warranty labels on the side of the box.

Warranties aren't a measure of how proud some engineers somewhere are, they're a measure of some sales/accounting decision about how the cost of providing that warranty will compare to the extra sales they get by putting it on the box. (5 years ago we were using what, 40 GB drives? If that died today, assuming you even still had it online, would you figure out how to ship it back for a new 40 GB under the warranty, or just pick up 400 GB at the local mall? Whenever it's the latter, the warranty costs Seagate nothing, and Seagate knows how many people fall into each camp.)

Hard drives are a commodity. Any given model has some greater-than-zero failure rate; people who care about their data make backups and the failure rate doesn't matter, people who don't care about their data worry and fret over exactly what the best lottery ticket to buy is. Me, I figure hard drives are all close enough in speed I'm never going to notice, but I have the thing sitting right next to me all day long, so I buy drives by checking Silent PC Review's recommended list, and picking the top-rated drive I can find on sale.

Hard Disk Drive Warranties

Posted Jun 15, 2007 5:06 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

Even assuming that hard drive warranties are written by the sales/accounting department, don't you suppose that they looked at return rates of their products in order to make that warranty period?

WDC used to pledge a 3-year warranty. Now it's 1-year (again, assuming their consumer drives--IIRC their "Raptor" series of true-SCSI drives gets a longer warranty). Whether it was the sales/marketing folks at WDC, or it was the engineers, either way, around 3-4 years ago they decided that the warranty claim rate wasn't good enough to justify maintaining the 3-year warranty, so they reduced it to 1-year.

Certainly the folks over at Seagate were wise enough to perform the same cost vs. benefit analysis of pledging such a long warranty, regardless of whether it was the engineering team or the sales/marketing folks. But, with Seagate's substantially longer warranty, I can only assume that their cost vs. benefit analysis demonstrated either of two things: (1) their drives were high-enough quality such that the return rates were low and they could warranty their drives for 5 years whilst remaining profitable, or (2) They could absorb the cost of replacing defective drives under warranty at will for the indicated warranty period given a failure rate no better or worse than the commodity average.

I just don't see (2) above happening without Seagate making drives of such sorry quality and cheap manufacturing costs that they can justify the long warranty (analogy: I sell you a television for $150 which cost me $20 to build, and it has a 20% annual failure rate, so I can justify warrantying it for 5 years and still make a profit of $50), and I don't see them making drives of such unusually high quality that their manufacturing costs (and retail prices) spiral upwards. Their drives are competitively priced with WDC, Maxtor, and Fujitsu.

I don't mean to argue; but rather, I wanted to share my experiences and perhaps invoke a mildly-stimulating discussion. I totally agree that doing some basic consumer research on hard drive quality and features (you mentioned Silent PC) is a good idea for anyone wanting to invest in spinning platter data storage. :-)

Strong correlation?

Posted Jun 15, 2007 10:08 UTC (Fri) by joern (subscriber, #22392) [Link]

> Overall, any kind of SMART error correlated strongly with disk failure.

The likelihood of a disk failure going up from an annual 7% to 15-30% in eight months is hardly what I call a strong correlation. 80-90% would have been a strong correlation. 30% leads me to the same conclusion Google had: keep the disk, but also keep a backup somewhere.

And this conclusion is no different for 0.5%, 7% or 30%. If anything changes at all, the number of backups might.

Strong correlation?

Posted Jun 15, 2007 19:08 UTC (Fri) by vaurora (guest, #38407) [Link]

Perhaps my wording was influenced by my personal disk failure rate of 100% within one week of the first SMART error - but I stand by it. :)

The story is slightly more complex than 7% -> 15-30% annual failure rate. The disks in the CMU study averaged a yearly failure rate of 3%, varying from 0.5% to 13.5% (after throwing out a 7-year-old batch of disks with a failure rate of 24%). The failure rate of the Google disks varied from 1.7% to 8.6%, depending on the age of the disks. I can't find the average in the paper, but eyeballing it and doing the math gives me 6.34% overall. So we can call it 3-7% average.

More importantly, the failure rate of a disk with no errors is lower than the overall average of 3-7% a year. Figures 6 and 7 in the Google paper show the different failure probabilities for disks with and without scan errors. A disk less than 6 months old with no scan errors has only a 2% probability of failure, while a disk with one or more scan errors has a 33% failure probability. Beginning on page 8 of the Google paper, the authors break down the consequences of scan errors based on time since last error, age of disk, and number of errors. For example, a single scan error on a disk older than 2 years results in a nearly 40% probability of failure in the next 6 months. Take a closer look at those graphs; there's more data than I could summarize in the article.

Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.
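
To make that concrete: taking just the two young-drive numbers above from the Google paper's figures, and tacking on some made-up dollar values for a single user's situation (neither paper gives costs), the arithmetic looks like this:

    # Relative risk from the Google paper's young-drive figures, plus a toy
    # expected-loss comparison.  The dollar values are made-up placeholders.
    P_FAIL_NO_ERROR = 0.02     # < 6 months old, no scan errors
    P_FAIL_WITH_ERROR = 0.33   # one or more scan errors

    print(f"relative risk after a scan error: "
          f"{P_FAIL_WITH_ERROR / P_FAIL_NO_ERROR:.1f}x")

    VALUE_OF_DATA = 2000.0     # assumed cost of losing an unbacked-up disk
    NEW_DRIVE = 100.0          # assumed cost of replacing the drive early
    expected_loss = P_FAIL_WITH_ERROR * VALUE_OF_DATA
    print(f"expected loss from keeping it: ${expected_loss:.0f} "
          f"vs. ${NEW_DRIVE:.0f} to replace it now")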

Strong correlation?

Posted Jun 15, 2007 19:26 UTC (Fri) by joern (subscriber, #22392) [Link]

> Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.

I've done that today. Admittedly, having shell access to four separate servers in different locations is uncommon for normal users.

In the end it is a matter of definition where a weak correlation ends and a strong correlation starts. I wouldn't speak of a strong correlation if I'd lose money when betting on the correlated event. So for me it would have to be x% -> 50+%.

Strong correlation?

Posted Jun 15, 2007 21:26 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

In an operation of the size these papers talk about, gut feelings about "strong" and "weak" correlation and the pain of data loss aren't even significant. It's pure numbers. Somebody somewhere has decided how much a data loss costs, and probability, repair costs, and interest rates fill out the equation.

Sometimes the cost of data loss is really simple. I had a telephone company customer years ago who said an unreadable tape cost him exactly $16,000. The tapes contained billing records of calls; without the record, the company simply couldn't bill for the call. Another, arguing against backing up his product source code, showed the cost of hiring engineers to rewrite a megabyte of code from scratch.

In the Google situation, I believe single drive data loss is virtually cost-free. That's because of all that replication and backup. In that situation, the cost of the failure is just the cost of service interruption (or degradation) and drive replacement. And since such interruptions and replacements happen regularly, the only question is whether it's cheaper to replace a drive earlier and thereby suffer the interruption later.

Anyway, my point is that with all the different ways disk drives are used, I'm sure there are plenty where replacing the drive when its expected failure rate jumps to 30% is wise and plenty where doing so at 90% is unwise.
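
Written out, that "pure numbers" view is a one-line expected-cost comparison. Every figure in the sketch below is a made-up placeholder (neither paper gives cost data), but it shows how a 30% failure probability can favor keeping a drive while 90% favors replacing it, exactly as the comment suggests.

    # Toy expected-cost comparison for one drive.  All costs are placeholders.
    DRIVE_COST = 100.0             # replacement drive
    PLANNED_INTERRUPTION = 5.0     # cost of a scheduled swap
    UNPLANNED_INTERRUPTION = 50.0  # cost of an unexpected outage
    DATA_LOSS_COST = 0.0           # ~zero when data is replicated

    def cost_keep(p_fail):
        """Expected cost of running the drive until it actually fails."""
        return p_fail * (DRIVE_COST + UNPLANNED_INTERRUPTION + DATA_LOSS_COST)

    def cost_replace_now():
        return DRIVE_COST + PLANNED_INTERRUPTION

    for p in (0.30, 0.90):
        keep, swap = cost_keep(p), cost_replace_now()
        choice = "replace now" if swap < keep else "keep running"
        print(f"p(fail)={p:.0%}: keep=${keep:.0f}, replace=${swap:.0f} -> {choice}")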

Strong correlation?

Posted Jun 16, 2007 2:16 UTC (Sat) by vaurora (guest, #38407) [Link]

This is an excellent point - the utility of failure probability data depends on the use case. Google in general has all data replicated a minimum of three times (see the GoogleFS paper), and as a result, in most situations it is not cost-effective to replace a drive before it actually fails. For any sort of professional operation with regular backups and/or replication, this data is not particularly useful except as input into how many thousands of new hard drives to order next month. But for an individual user without automated backup systems, it can provide a valuable hint on the utility of conducting that long-delayed manual backup within the next few hours.

KHB: Real-world disk failure rates: temperature

Posted Jun 16, 2007 2:41 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

About those temperature correlations: temperature of what, how do they know, and why did temperatures vary?

KHB: Real-world disk failure rates: temperature

Posted Jun 16, 2007 10:44 UTC (Sat) by vaurora (guest, #38407) [Link]

Part of the SMART protocol includes reading temperatures from sensors inside the disk. The recorded temperatures are those of the inside of the drive enclosure. The temperatures vary according to the external temperature, the air flow around and through the drive, and the waste heat generated by the drive itself.

KHB: Real-world disk failure rates: temperature

Posted Jun 18, 2007 3:52 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

That puts the question of temperature/failure correlation in a whole different light. The correlation people expect has to do with the idea that if you put a disk drive in a hot environment, it will die sooner than if you put it in a cool environment.

But my guess is that the outside temperature and the air flow around the drives don't vary much across Google's sample, so any temperature difference inside the drive is due to the design of the disk drive. IOW, the correlation shows that the designs that run cooler are also the ones that fail more.

Considering that the engineers do design for and test for failure rates, i.e. the failure rate is the independent variable, I would not expect a drive that runs hotter to fail more. Engineers would have designed it to run that hot.

But I might be convinced that in their struggle to make a drive consume less power, and thus run cooler, the engineers sacrificed longevity. (I don't know enough about disk drive design to know how such a tradeoff would be made, but I'm sure there's a way). That could explain the negative correlation.

KHB: Real-world disk failure rates: temperature

Posted Jun 20, 2007 2:09 UTC (Wed) by mhelsley (subscriber, #11324) [Link]

Considering the energy needed to manufacture the hard drives and their lower lifetimes I wonder if it truly results in net energy savings.

KHB: Real-world disk failure rates: temperature

Posted Jun 20, 2007 16:40 UTC (Wed) by giraffedata (subscriber, #1954) [Link]

> Considering the energy needed to manufacture the hard drives and their lower lifetimes I wonder if it truly results in net energy savings.

That's an insightful question, but as the true goal is to save resources in general, and not just energy (or fossil fuels and clean and clear air), the right question is actually a little simpler: does the money a data center saves on electricity make up for the cost of more frequent replacements? The amount of energy used in manufacturing a drive is reflected in its price.

Of course, we're only speculating at this point that there is any correlation between energy efficiency and failure rate. The one really useful result of these two studies is that manufacturers don't know what their drive lifetimes are, so users have been making buying decisions based on wrong numbers.
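
As a rough illustration of the electricity-vs-replacement question above: every figure below is a made-up placeholder (neither paper reports power draw or prices), but it shows that the two sides of the trade-off can easily land in the same ballpark, so the answer really does hinge on the actual numbers.

    # Toy comparison: power+cooling savings of a cooler drive vs. the cost of
    # replacing it more often.  All figures are illustrative placeholders.
    WATTS_SAVED = 4         # assumed power difference between two designs
    KWH_PRICE = 0.10        # assumed $/kWh
    COOLING_FACTOR = 2.0    # assumed extra cost to remove the heat again
    DRIVE_PRICE = 150.0     # assumed replacement cost

    def yearly_power_savings():
        kwh = WATTS_SAVED / 1000 * 24 * 365
        return kwh * KWH_PRICE * COOLING_FACTOR

    def extra_replacement_cost(afr_cool, afr_warm):
        """Extra expected replacement spend per drive-year if the cooler
        design fails more often."""
        return (afr_cool - afr_warm) * DRIVE_PRICE

    print(f"power+cooling saved per drive-year: ${yearly_power_savings():.2f}")
    print(f"extra replacements per drive-year:  "
          f"${extra_replacement_cost(0.06, 0.03):.2f}")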

KHB: Real-world disk failure rates: temperature

Posted Jun 19, 2007 23:21 UTC (Tue) by peterc (subscriber, #1605) [Link]

Note also that those temperature sensors are not calibrated. I wouldn't be surprised if they have a design variability of 10%!

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 22, 2007 3:27 UTC (Fri) by NedLudd (guest, #37615) [Link]

Thank you Valerie for the wonderful write-up. I really enjoy reading your articles!!

--brian

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 22, 2007 11:05 UTC (Fri) by jengelh (subscriber, #33263) [Link]

>they found no correlation between failure rate and disk type - SCSI, SATA, or fiber channel. The most reliable disk set was composed of only SATA drives, which are commonly regarded to be less reliable than SCSI or fibre channel.

No more SCSI myths heh.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 27, 2007 15:13 UTC (Wed) by dvanzandt (guest, #45962) [Link]

I'm not going to give up my SCSI myths too quickly.<g>

I would like to know more about the rotational speed, number of platters, total capacity, I/O parameters, and usage (db vs. file share vs. near-line storage, etc.) of the various drives first.

In addition, it would be very interesting to see which drive "types" died more often due to component and logic failure as opposed to "shedding" media.

Having said all that, I'm happier about specing some SATA rigs for my less affluent customers.

Were there any SAS drives in this mix? I am trying to download the study, but the server keeps timing out. If the answer is in it, "never mind."
