
KHB: Real-world disk failure rates: surprises, surprises, and more surprises

June 12, 2007

This article was contributed by Valerie Aurora

At this year's USENIX Conference on File and Storage Technologies (FAST), we were treated to two papers studying failure rates in disk populations numbering over 100,000. These kinds of data sets are hard to get - first you have to have 100,000 disks, then you have to record failure-related data faithfully for years on end, and then you have to release the data in a form that doesn't get anyone sued. The storage community has salivated over this kind of real-world data for years, and now we have not one, but two (!) long-term studies of disk failure rates. The conference hall was packed during these two presentations. When the talks were done, we stumbled out into the hallway, dazed and excited by the many surprising results. Heat is negatively correlated with failure! Failures show short- AND long-term correlation! SMART errors do mean the drive is more likely to fail, but a third of drives die with no warning at all! The size of the data sets, the quality of the analysis, and the non-intuitive results win these two papers a place on the Kernel Hacker's Bookshelf.

The first paper (and winner of the Best Paper award) was Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, by Bianca Schroeder and Garth Gibson. They reviewed failure data from a collection of 100,000 disks over a period of up to 5 years. The disks were part of a variety of HPC clusters and an Internet service provider's systems. A disk failure was defined as the disk being replaced, and the date of replacement was used as the date of failure, since determining exactly when a disk actually failed was not possible.

Their first major result was that the real-world annualized failure rate (the average percentage of disks failing per year) was much higher than the manufacturers' estimate - an average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers obviously can't test disks for a year before shipping them, so they stress-test disks in high-temperature, high-vibration, high-workload environments and use data from previous models to estimate MTTF. Only one set of disks had a real-world failure rate lower than the estimated rate, and one set of disks had a 13.5% annualized failure rate!
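
To see where those estimates come from, the conversion between a quoted MTTF and an annualized failure rate is simple arithmetic. Here is a minimal sketch (in Python) assuming the constant-failure-rate model that underlies the datasheet numbers, rather than anything measured in the papers:

    import math

    HOURS_PER_YEAR = 24 * 365  # 8760

    def afr_from_mttf(mttf_hours):
        # Annualized failure rate implied by a quoted MTTF, assuming a constant
        # (exponential) failure rate -- the assumption behind the datasheet
        # math, which the real-world data above contradicts.
        return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

    def mttf_from_afr(afr):
        # The MTTF a vendor would have to quote to match an observed AFR.
        return -HOURS_PER_YEAR / math.log(1.0 - afr)

    print(f"MTTF 1,000,000 h -> AFR {afr_from_mttf(1_000_000):.2%}")
    print(f"Observed AFR 3% -> implied MTTF {mttf_from_afr(0.03):,.0f} h")

An MTTF of 1,000,000 hours works out to an AFR of roughly 0.9%, right where the manufacturers' estimates sit; an observed 3% AFR would correspond to an MTTF of under 300,000 hours.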

More surprisingly, they found no correlation between failure rate and disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set was composed entirely of SATA drives, which are commonly regarded as less reliable than SCSI or Fibre Channel.

In another surprise, they debunked the "bathtub model" of disk failure rates. In this model, disks experience a higher initial "infant mortality" failure rate, settle down for a few years of low failure rate, and then begin to wear out and fail. The graph of failure probability vs. time looks like a bathtub: flat in the middle and sloping up at the ends. Instead, the real-world failure rate started low and steadily increased over the years. Disks don't have a sweet spot of low failure rate.
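
Testing a population against the bathtub model mostly comes down to bucketing failures by drive age and dividing by the drive-years of service in each bucket. The sketch below uses invented counts purely to show the calculation; the numbers are not from either paper:

    # Hypothetical failure counts and drive-years of service per year of age.
    failures    = {0: 15, 1: 20, 2: 28, 3: 40, 4: 55}    # invented numbers
    drive_years = {0: 1000, 1: 980, 2: 950, 3: 900, 4: 850}

    for age in sorted(drive_years):
        afr = failures[age] / drive_years[age]
        print(f"year {age}: {failures[age]} failures / "
              f"{drive_years[age]} drive-years -> AFR {afr:.1%}")

    # A bathtub-shaped population would show a high AFR in year 0, a dip in
    # the middle years, and a rise at the end; the Schroeder/Gibson data
    # instead rises steadily from the start.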

Failures within a batch of disks were strongly correlated over both short and long time periods. If a disk in a batch had failed, there was a significant probability of a second failure up to at least 2 years later; if one disk in your batch has just gone, you are more likely to lose another disk from the same batch. That is scary news for RAID arrays built from disks of the same batch. A recent paper from the 2006 Storage Security and Survivability Workshop, Using Device Diversity to Protect Data against Batch-Correlated Disk Failures, by Jehan-François Pâris and Darrell D. E. Long, calculated the increase in RAID reliability from mixing batches of disks. Using more than one kind of disk increases costs, but by combining the data from these two papers, RAID users can calculate the value of the extra reliability and make the most economical decision.
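
To get a feel for why batch correlation worries RAID operators, consider the chance of losing a second disk while the array rebuilds. The following sketch is a toy model; the array size, annualized failure rate, rebuild time, and the correlation multiplier applied after a first failure are all assumptions chosen for illustration, not figures from either paper:

    import math

    def p_second_failure(surviving_disks, afr, rebuild_hours, correlation=1.0):
        # Probability that at least one surviving disk fails during the rebuild
        # window, treating each disk as exponential with the given AFR scaled
        # by a correlation factor (1.0 means independent failures).
        hourly_rate = -math.log(1.0 - afr) / (24 * 365) * correlation
        p_one = 1.0 - math.exp(-hourly_rate * rebuild_hours)
        return 1.0 - (1.0 - p_one) ** surviving_disks

    # Hypothetical 8-disk RAID-5 set, 3% AFR, 24-hour rebuild.
    print(f"independent disks:       {p_second_failure(7, 0.03, 24):.3%}")
    print(f"5x rate after a failure: {p_second_failure(7, 0.03, 24, 5.0):.3%}")

Even a modest multiplier raises the rebuild-window risk proportionally; diluting that multiplier is exactly what mixing batches (or device types) buys you.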

The second paper, Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso, reports on disk failure rates at Google. They used a Google tool for recording system health parameters, along with many other staples of Google software (MapReduce, Bigtable, etc.), to collect and analyze the data. They focused on SMART statistics - the monitoring built into many modern disk drives, which records statistics such as scan errors and relocated blocks.

The first result agrees with the first paper: the annualized failure rate was much higher than estimated, between 1.7% and 8.6%. They next looked for a correlation between failure rate and drive utilization (as estimated by the amount of data read from or written to the drive). They found a much weaker correlation between higher utilization and failure rate than expected; low-utilization disks often had higher failure rates than medium-utilization disks and, in the case of the 3-year-old vintage of disks, higher than the high-utilization group.

Now for the most surprising result. In Google's population of cheap ATA disks, high temperature was negatively correlated with failure! In the authors' words:

In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

This correlation held true over a temperature range of 17-55°C. Only in the 3-year-old disk population was there a correlation between high temperatures and failure rates. My completely unsupported and untested hypothesis is that drive manufacturers stress-test their drives in high-temperature environments to simulate longer wear. Perhaps they have unwittingly designed drives that work better in their high-temperature test environment at the expense of the more typical low-temperature field environment.

Finally, they looked at the SMART data gathered from the drives. Overall, any kind of SMART error correlated strongly with disk failure. A scan error is recorded when the drive, reading the entire disk surface as it checks data in the background, hits a read error. Within 8 months of the first scan error, about 30% of drives failed completely. A reallocation error occurs when a block can't be written and is reassigned to another location on the disk; about 15% of affected drives failed within 8 months of a reallocation error. On the other hand, 36% of the drives that failed gave no warning whatsoever, either from SMART errors or from exceptionally high temperatures.

For Google's purposes, the predictive power of SMART is of limited utility. Replacing every disk that reported a SMART error would mean that about 70% of the replaced disks were good ones that would have run for years to come. For Google this isn't cost-effective, since all of its data is replicated several times. But for an individual user, for whom losing a disk is a disaster, replacing the disk at the first sign of a SMART error makes eminent sense. I have personally had two laptop drives start spitting SMART errors in time to get my data off before they died completely.
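
For anyone wanting to follow that advice, smartmontools' smartctl can dump a drive's SMART attributes, and a few lines of script can watch the ones most like those the Google paper flags. The sketch below is full of assumptions: the attribute names are the usual smartmontools labels for ATA drives and vary by vendor, the parsing of the -A output is deliberately simplistic, and the command normally needs root:

    import subprocess

    # Attributes of the kind the Google paper found predictive of failure
    # (reallocations and pending/uncorrectable sectors). Names vary by vendor.
    WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable"}

    def smart_warnings(device="/dev/sda"):
        # Run "smartctl -A" and return watched attributes whose raw value is
        # non-zero. Typically requires root privileges.
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        warnings = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in WATCH and fields[9] != "0":
                warnings[fields[1]] = fields[9]
        return warnings

    if __name__ == "__main__":
        bad = smart_warnings()
        print("back up now:" if bad else "no watched errors:", bad)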

Overall, these are two exciting papers with long-awaited real-world failure data on large disk populations. We should expect to see more publications analyzing these data sets in the years to come.

Valerie Henson is a Linux file systems consultant specializing in file system check and repair.



KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 2:36 UTC (Thu) by pr1268 (guest, #24648) [Link] (2 responses)

The second disk failure research paper, Failure Trends in a Large Disk Drive Population, wants a login and password to download the PDF. Any way I might get it without having to create a USENIX login account?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 3:28 UTC (Thu) by rasjidw (guest, #15913) [Link] (1 responses)

Try http://labs.google.com/papers/disk_failures.html.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 3:55 UTC (Thu) by pr1268 (guest, #24648) [Link]

Funny that the Google Labs paper is one I downloaded several months ago (its date is February 2007). In fact, didn't LWN.net publish an article about this around that time?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 4:16 UTC (Thu) by pr1268 (guest, #24648) [Link] (6 responses)

Thank you for the article - this gives some insight into how hard disk manufacturers can make claims of X hours MTTF and the like, and into how to make general estimates of the life span of disk drives.

Veering slightly off-topic, I have become a proponent of Seagate hard drives - any company that proudly advertises a "5 year warranty" on the packaging of their mass-market consumer IDE and SATA drives (whilst most other major manufacturers only give 1-3 years) gets my business. It's not necessarily about whether I'd really have to make a warranty claim (WDC honored such a claim for me back in 2002 with minimal fuss), but rather that they have such a high level of confidence in their craftsmanship as to even advertise such a warranty.

Sorry if this violates any rules for "plugging" a particular brand. But, I'm reminded of what the salesperson at Fry's told me about why Seagate warranties their drives so well: They re-engineered the spindle bearings and motor assembly - the critical points of the disk drive which the salesperson said were most often the cause of total drive failure. Any other ideas/comments?

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 14, 2007 7:03 UTC (Thu) by bronson (subscriber, #4806) [Link] (1 responses)

Er, you listen to salespeople at Fry's??

I've only ever managed clusters of disks numbering in the few tens. Overall I find that hard disks are surprisingly reliable. I think most disks fall out of rotation because of a lack of capacity, not because they break. (I still remember how thrilled I was when they fit 20 GB of data into a 3.5" package... today I can fit 50 of those drives into a single 3.5" package!)

One lesson that I've learned is that manufacturer loyalty is pretty much meaningless. I had an early batch of Quantum 15G drives that were so stone-cold reliable I'm sure they would still be working today. However, Quantum Fireballs would reliably die after two years. I remember Maxtor producing utter crap in the past but I have a set of their 60GB drives still spinning. Seagate used to be fairly mediocre and now they're top notch across the board. When deciding what drive to buy, go to Storage Review and read about the individual models; brand is meaningless.

I also find that motherboard failures (i.e. CPU or memory socket corrosion, weak power supply, etc.) tend to destroy the drive as well. I read the Google paper back in February and I don't remember them taking this into account... It would be nice to know what portion of failures were drive-only, and what portion was chaotic damage that just happened to include the drive as well.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 15, 2007 0:51 UTC (Fri) by pr1268 (guest, #24648) [Link]

> Er, you listen to salespeople at Fry's??

I listen to them, but I don't always believe everything they say. But, again, my Seagate discussion was more tuned into the idea that their warranty (which they proudly advertise in big print on the sides of the box) is much longer than those of most other name-brand hard disk manufacturers (whose warranties are often hidden in the fine print). I agree, brand loyalty doesn't mean much these days, and besides, I used to think WDC (Western Digital) drives were the finest-quality consumer hard disks. Nowadays I don't think WDC drives are nearly as good as they were 5-10 years ago, and I've never had any good experiences with Maxtor. Times change, and so do hard drive manufacturers' quality control (and standards).

Maybe in 5-10 years I'll think the same about Maxtor, WDC, or Fujitsu as I do now about Seagate...

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 15, 2007 4:13 UTC (Fri) by njs (subscriber, #40338) [Link] (3 responses)

It's a funny thing about us humans: if you put us in a data vacuum, we find some way to fill it, whether it makes sense or not. Most of us have no useful statistics at all on what hard drive brands or models will be reliable; heck, as the article points out, the people *making* the drives don't even know this.

But this ignorance makes us so *uncomfortable* that everyone finds some random fact to base their decisions on, like an anecdote about that time they bought a *** brand drive and it died after a week so they never buy *** anymore, or how they heard the new *** brand drives use a fancier production process, or the warranty labels on the side of the box.

Warranties aren't a measure of how proud some engineers somewhere are, they're a measure of some sales/accounting decision about how the cost of providing that warranty will compare to the extra sales they get by putting it on the box. (5 years ago we were using what, 40 GB drives? If that died today, assuming you even still had it online, would you figure out how to ship it back for a new 40 GB under the warranty, or just pick up 400 GB at the local mall? Whenever it's the latter, the warranty costs Seagate nothing, and Seagate knows how many people fall into each camp.)

Hard drives are a commodity. Any given model has some greater-than-zero failure rate; people who care about their data make backups and the failure rate doesn't matter, people who don't care about their data worry and fret over exactly what the best lottery ticket to buy is. Me, I figure hard drives are all close enough in speed I'm never going to notice, but I have the thing sitting right next to me all day long, so I buy drives by checking Silent PC Review's recommended list, and picking the top-rated drive I can find on sale.

Hard Disk Drive Warranties

Posted Jun 15, 2007 5:06 UTC (Fri) by pr1268 (guest, #24648) [Link] (2 responses)

Even assuming that hard drive warranties are written by the sales/accounting department, don't you suppose that they looked at the return rates of their products in order to set that warranty period?

WDC used to pledge a 3-year warranty. Now it's 1 year (again, for their consumer drives--IIRC their "Raptor" series gets a longer warranty). Whether it was the sales/marketing folks at WDC or the engineers, around 3-4 years ago they decided that the warranty claim rate was too high to justify maintaining the 3-year warranty, so they reduced it to 1 year.

Certainly the folks over at Seagate were wise enough to perform the same cost vs. benefit analysis before pledging such a long warranty, regardless of whether it was the engineering team or the sales/marketing folks. But, with Seagate's substantially longer warranty, I can only assume that their cost vs. benefit analysis demonstrated one of two things: (1) their drives were of high enough quality that return rates were low and they could warranty their drives for 5 years whilst remaining profitable, or (2) they could absorb the cost of replacing defective drives under warranty at will for the indicated warranty period given a failure rate no better or worse than the commodity average.

I just don't see (2) above happening without Seagate making drives of such sorry quality and such cheap manufacturing cost that they can justify the long warranty (analogy: I sell you a television for $150 which cost me $20 to build, and it has a 20% annual failure rate, so I can justify warrantying it for 5 years and still make a profit of $50), and I don't see them making drives of such unusually high quality that their manufacturing costs (and retail prices) spiral upwards. Their drives are priced competitively with WDC, Maxtor, and Fujitsu.
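
For what it's worth, the warranty-cost side of that analysis is easy to sketch. All of the inputs below - failure rate, cost of honoring a claim, and the fraction of owners who actually bother to claim - are invented for illustration; the point is only the shape of the calculation:

    def expected_warranty_cost(afr, warranty_years, claim_cost, claim_rate):
        # Toy model: probability the drive fails within the warranty period,
        # times the fraction of owners who actually file a claim, times the
        # cost of honoring one claim. All inputs are guesses.
        p_fail_in_warranty = 1.0 - (1.0 - afr) ** warranty_years
        return p_fail_in_warranty * claim_rate * claim_cost

    # Invented figures: 3% AFR, $30 to ship a replacement, 40% of owners claim.
    for years in (1, 3, 5):
        cost = expected_warranty_cost(0.03, years, 30, 0.4)
        print(f"{years}-year warranty: ~${cost:.2f} expected cost per drive sold")

With these particular guesses the 5-year warranty adds only a couple of dollars of expected cost per drive sold; plugging in real claim rates and costs is exactly the analysis being argued about here.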

I don't mean to argue; rather, I wanted to share my experiences and perhaps spark a mildly stimulating discussion. I totally agree that doing some basic consumer research on hard drive quality and features (you mentioned Silent PC Review) is a good idea for anyone wanting to invest in spinning-platter data storage. :-)

Hard Disk Drive Warranties

Posted Mar 28, 2014 1:46 UTC (Fri) by know_it_all (guest, #96208) [Link] (1 responses)

Google's study was recently referenced on a forum, which led me to search for references to it and to come across this discussion. Even though this is an ancient article in terms of disk drive design cycles, I can shed some light on it for others who come across the discussion.

The transducer used to write and read data flies over the disk; in the time frame of the article, the flying height was somewhere around 5 to 10 nm. The tracks written to the disk were around 1 um wide, and the read sensor was a magnetoresistive element (MR or GMR) constructed at the trailing edge of the read/write transducer (commonly called the R/W head). The sensor has a resistance of around 35-100 ohms and senses the magnetization of the disk, which produces a change of less than 1-2 ohms of resistance, by passing a current through the sensor. The read-back signal would typically be under 5 mV and would be amplified by wide-band, low-noise amplifiers at the actuator's amplifier up to a signal level of 100-300 mVpp before being passed back differentially to the disk drive's card electronics over a Kapton flexible printed circuit (called a flex tape).

SMART tracking systems measure amplitude and waveform properties of the read-back signal, as measured by the read channel and firmware, to predict degradation and wear-out mechanisms. They also monitor and track the positioning error of the servo system (commonly called TMR, or track mis-registration) as it follows the magnetically written positioning information on the disk.

An example of an error mechanism that SMART can predict as a trend, and that the system can use to identify a failing drive: the head may slowly accumulate debris on its air-bearing surface and degrade the signal amplitude, causing loss of SNR and an increase in bit errors, and eventually sector read errors as the SNR margin is lost.

While drives are assembled in an extremely clean environment, similar to silicon processes, extremely important cleaning processes and assembly conditions must be maintained at the plant sites to ensure that the manufacturing processes used to make the drive's components do not introduce particulate contamination into the disk enclosure.

When hard particles, and especially conductive hard particles, come into contact with the sensor during operation, the result can be an instantaneous, detrimental scratch to the disk or to the sensor that partially damages it and/or causes partial demagnetization of the microscopic hard magnets used to bias the magnetic layers making up the MR or GMR structure of the sensor. Such sudden degradation - manifesting as low amplitude, instability in the read signal, or simply a dead sensor - results in the sudden, warning-free failures described in the articles, failures that a SMART trend algorithm cannot detect early enough to give warning.

These impact-type failures do not follow the principles of temperature-induced silicon failures, to which some of the commentary attempts to attribute the failure mechanisms, and they can be component-dependent, varying with the cleaning systems in place at the time of component manufacture and assembly.

Other factors that are affected by temperature do exist. These involve the sensitivity of fly height to the viscosity of air, and the change in coercivity, the magnetic property describing how hard it is to write or switch the direction of the disk's magnetization with the head's magnetic field. At lower temperatures the air is thicker, so the head may fly higher, and the disk's magnetic film coercivity may increase, making the disk harder to write. As a result, any mechanism that increases spacing - smears, particulate pickup on the head surface, or damage to the writer by hard particles - can cause poor overwrite of the disk and loss of written data during the write process. The SMART algorithms monitor the overwrite property but, just as described for the read transducer, will not predict sudden failure from hard-particle damage to the writer.

All drive manufacturers have invested heavily in state-of-the-art cleaning equipment and clean rooms for assembly cleanliness, and they continuously improve designs for disk flatness and for the detection and avoidance of any hard particles that might become embedded in the disk surfaces and affect reliability. Current drives fly the read/write transducer at around 1 nm off the disk and employ means to protrude and retract the critical write/read transducer, allowing the head to fly several nm off the surface and limiting its exposure to damage.

The observation that failure rates may also be a function of a drive's rest time comes from the fact that particles will not attach to a spinning disk; if a disk is shut off, any particulate in the air of the disk enclosure is going to settle on a surface and, by molecular attraction, may bond itself to that surface. When the drive is started up later, the head may sweep the particle away during an access, or it may pound it into the disk surface, creating a hard particle and an opportunity for failure. Disk drives have air-flow filters, directing air flow off the disk pack, to trap and remove particulates from the air within the enclosure. These filters may also contain sacrificial elements that protect the surfaces of the enclosure, disks, and heads from corrosion, avoiding another source of early failure.

The reader should conclude from the above dissertation that every drive manufacturer's engineering team works responsibly to ensure that disk drives incorporate SMART algorithms to detect the kinds of events that monitoring can detect, but that not all failure mechanisms manifest themselves in a way the algorithms can detect and report to the system in advance, as would be desirable for both manufacturers and system integrators.


Hard Disk Drive Warranties

Posted Apr 12, 2014 23:41 UTC (Sat) by nix (subscriber, #2304) [Link]

This rates as one of the best comments ever on LWN, I think. (Despite slight grammar garbling, perhaps due to a non-native speaker, causing tricky parsing here and there.)

It was worth waiting seven years for. Bravo!

Strong correlation?

Posted Jun 15, 2007 10:08 UTC (Fri) by joern (guest, #22392) [Link] (4 responses)

> Overall, any kind of SMART error correlated strongly with disk failure.

The likelihood of a disk failure going up from an annual 7% to 15-30% within eight months is hardly what I would call a strong correlation. 80-90% would have been a strong correlation. 30% leads me to the same conclusion Google reached: keep the disk, but also keep a backup somewhere.

And this conclusion is no different for 0.5%, 7% or 30%. If anything changes at all, the number of backups might.

Strong correlation?

Posted Jun 15, 2007 19:08 UTC (Fri) by vaurora (guest, #38407) [Link] (3 responses)

Perhaps my wording was influenced by my personal disk failure rate of 100% within one week of the first SMART error - but I stand by it. :)

The story is slightly more complex than 7% -> 15-30% annual failure rate. The disks in the study from CMU averaged a yearly failure rate of 3%, varying from 0.5% to 13.5% (after throwing out a 7-year-old batch of disks with a failure rate of 24%). The failure rate of the Google disks varied from 1.7% to 8.6%, depending on the age of the disks. I can't find the average in the paper, but eyeballing it and doing the math gives me 6.34% overall. So we can call it 3-7% average.

More importantly, the failure rate of a disk with no errors is lower than the overall average of 3-7% a year. Figures 6 and 7 in the Google paper show the different failure probabilities for disks with and without scan errors. A disk less than 6 months old with no scan errors has only a 2% probability of failure, while a disk with one or more scan errors has a 33% failure probability. Beginning on page 8 of the Google paper, the authors break down the consequences of scan errors based on time since last error, age of disk, and number of errors. For example, a single scan error on a disk older than 2 years results in a nearly 40% probability of failure in the next 6 months. Take a closer look at those graphs; there's more data than I could summarize in the article.
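
Those two numbers also connect back to the article's "about 70% of the time" figure; a quick back-of-the-envelope check, using only the probabilities quoted above:

    # Failure probabilities quoted above (Google paper, figures 6 and 7),
    # for disks less than 6 months old.
    p_fail_no_scan_error = 0.02
    p_fail_with_scan_error = 0.33

    print(f"relative risk: {p_fail_with_scan_error / p_fail_no_scan_error:.1f}x")
    # Fraction of error-flagged disks that would not have failed in the window,
    # which lines up roughly with the ~70% "wasted replacements" figure.
    print(f"would not have failed: {1 - p_fail_with_scan_error:.0%}")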

Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.

Strong correlation?

Posted Jun 15, 2007 19:26 UTC (Fri) by joern (guest, #22392) [Link] (2 responses)

> Finally, whether you consider a change in failure rate even from 2% to 33% significant really depends on how much you value your data and how hard it is to get it back. For the average user, the answers are "A lot," and "Nearly impossible." Raise your hand if you've backed up in the last week.

I've done that today. Admittedly, having shell access to four separate servers in different locations is uncommon for normal users.

In the end it is a matter of definition where a weak correlation ends and a strong correlation starts. I wouldn't speak of a strong correlation if I'd lose money betting on the correlated event. So for me it would have to be x% -> 50+%.

Strong correlation?

Posted Jun 15, 2007 21:26 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

In an operation of the size these papers talk about, gut feelings about "strong" and "weak" correlation and the pain of data loss aren't even significant. It's pure numbers. Somebody somewhere has decided how much a data loss costs, and probability, repair costs, and interest rates fill out the equation.

Sometimes the cost of data loss is really simple. I had a telephone company customer years ago who said an unreadable tape cost him exactly $16,000. The tapes contained billing records of calls; without the record, the company simply couldn't bill for the call. Another, arguing against backing up his product source code, showed the cost of hiring engineers to rewrite a megabyte of code from scratch.

In the Google situation, I believe single drive data loss is virtually cost-free. That's because of all that replication and backup. In that situation, the cost of the failure is just the cost of service interruption (or degradation) and drive replacement. And since such interruptions and replacements happen regularly, the only question is whether it's cheaper to replace a drive earlier and thereby suffer the interruption later.

Anyway, my point is that with all the different ways disk drives are used, I'm sure there are plenty where replacing the drive when its expected failure rate jumps to 30% is wise and plenty where doing so at 90% is unwise.
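
That trade-off fits in a few lines of arithmetic. The sketch below is deliberately crude, with invented drive and interruption costs; only the 30% failure probability and the $16,000 data-loss figure come from this thread:

    def cost_replace_now(drive_cost, planned_interruption):
        # Replace preemptively: pay for the drive plus a planned interruption.
        return drive_cost + planned_interruption

    def cost_wait(p_fail, drive_cost, unplanned_interruption, data_loss_cost):
        # Wait and see: with probability p_fail, pay for the drive, an
        # unplanned interruption, and whatever a data loss costs (near zero
        # when the data is replicated, enormous when it is not). This ignores
        # the fact that the drive must eventually be replaced anyway; it is a
        # one-window comparison only.
        return p_fail * (drive_cost + unplanned_interruption + data_loss_cost)

    # Invented costs: $100 drive, $20 planned vs. $50 unplanned interruption.
    for loss in (0, 16_000):      # replicated data vs. the $16,000 tape above
        now = cost_replace_now(100, 20)
        wait = cost_wait(0.30, 100, 50, loss)
        print(f"data-loss cost ${loss}: replace now ${now:.0f}, wait ${wait:.0f}")

With the data-loss cost at zero (the replicated case), waiting is the cheaper expected outcome; attach a real data-loss cost and early replacement wins easily, which is the point about the use case.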

Strong correlation?

Posted Jun 16, 2007 2:16 UTC (Sat) by vaurora (guest, #38407) [Link]

This is an excellent point - the utility of failure probability data depends on the use case. Google in general has all data replicated a minimum of three times (see the GoogleFS paper), and as a result, in most situations it is not cost-effective to replace a drive before it actually fails. For any sort of professional operation with regular backups and/or replication, this data is not particularly useful except as input into how many thousands of new hard drives to order next month. But for an individual user without automated backups, it can provide a valuable hint on the wisdom of conducting that long-delayed manual backup within the next few hours.

KHB: Real-world disk failure rates: temperature

Posted Jun 16, 2007 2:41 UTC (Sat) by giraffedata (guest, #1954) [Link] (5 responses)

About those temperature correlations: temperature of what, how do they know, and why did temperatures vary?

KHB: Real-world disk failure rates: temperature

Posted Jun 16, 2007 10:44 UTC (Sat) by vaurora (guest, #38407) [Link] (4 responses)

Part of the SMART protocol includes reading temperatures from sensors inside the disk. The recorded temperatures are those of the inside of the disk enclosure. The temperatures vary according to the external temperature, the air flow around and in the drive, and the waste heat generated by the drive itself.

KHB: Real-world disk failure rates: temperature

Posted Jun 18, 2007 3:52 UTC (Mon) by giraffedata (guest, #1954) [Link] (2 responses)

That puts the question of temperature/failure correlation in a whole different light. The correlation people expect has to do with the idea that if you put a disk drive in a hot environment, it will die sooner than if you put it in a cool environment.

But my guess is that the outside temperature and air flow around the drives doesn't vary among Google's sample, so any temperature difference inside the drive is due to the design of the disk drive. IOW, the correlation shows that the designs that run cooler are also the ones that fail more.

Considering that the engineers do design for and test for failure rates, i.e. the failure rate is the independent variable, I would not expect a drive that runs hotter to fail more. Engineers would have designed it to run that hot.

But I might be convinced that in their struggle to make a drive consume less power, and thus run cooler, the engineers sacrificed longevity. (I don't know enough about disk drive design to know how such a tradeoff would be made, but I'm sure there's a way). That could explain the negative correlation.

KHB: Real-world disk failure rates: temperature

Posted Jun 20, 2007 2:09 UTC (Wed) by mhelsley (guest, #11324) [Link] (1 responses)

Considering the energy needed to manufacture the hard drives and their lower lifetimes, I wonder if it truly results in net energy savings.

KHB: Real-world disk failure rates: temperature

Posted Jun 20, 2007 16:40 UTC (Wed) by giraffedata (guest, #1954) [Link]

Considering the energy needed to manufacture the hard drives and their lower lifetimes I wonder if it truly results in net energy savings.

That's an insightful question, but as the true goal is to save resources in general, and not just energy (or fossil fuels and clean and clear air), the right question is actually a little simpler: does the money a data center saves on electricity make up for the cost of more frequent replacements? The amount of energy used in manufacturing a drive is reflected in its price.

Of course, we're only speculating at this point that there is any correlation between energy efficiency and failure rate. The one really useful result of these two studies is that manufacturers don't know what their drive lifetimes are, so users have been making buying decisions based on wrong numbers.

KHB: Real-world disk failure rates: temperature

Posted Jun 19, 2007 23:21 UTC (Tue) by peterc (guest, #1605) [Link]

Note also that those temperature sensors are not calibrated. I wouldn't be surprised if they have a design variability of 10%!

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 22, 2007 3:27 UTC (Fri) by NedLudd (guest, #37615) [Link]

Thank you Valerie for the wonderful write-up. I really enjoy reading your articles!!

--brian

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 22, 2007 11:05 UTC (Fri) by jengelh (guest, #33263) [Link] (1 responses)

>they found no correlation between failure rate and disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set was composed entirely of SATA drives, which are commonly regarded as less reliable than SCSI or Fibre Channel.

No more SCSI myths heh.

KHB: Real-world disk failure rates: surprises, surprises, and more surprises

Posted Jun 27, 2007 15:13 UTC (Wed) by dvanzandt (guest, #45962) [Link]

I'm not going to give up my SCSI myths too quickly.<g>

I would like to know more about the rotational speed, number of platters, total capacity, I/O parameters, and usage (db vs. file share vs. near-line storage, etc.) of the various drives first.

In addition, it would be very interesting to see which drive "types" died more often due to component and logic failure as opposed to "shedding" media.

Having said all that, I'm happier about specing some SATA rigs for my less affluent customers.

Were there any SAS drives in this mix? I am trying to download the study, but the server keeps timing out. If the answer is in it, "never mind."


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds