
Failure Trends in a Large Disk Drive Population

Google Labs has released a paper [PDF] that details the failure modes from a large population of hard disk drives. "Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported." (Thanks to Hale Landis).


Failure Trends in a Large Disk Drive Population

Posted Mar 2, 2007 21:54 UTC (Fri) by maney (subscriber, #12630)

Well, this is a rarity: a solidly technical item that showed up on Slashdot before LWN. (Caveat: when items appear close together I'll usually see them here first, as Slashdot is near the bottom of the list of sites I stop by more or less daily.) Even more unexpectedly, there was a followup mention there of another paper about disk drives that's in some ways even more interesting: "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" The usual failure model seems not to fit observed failure rates very well at all...

Both articles will be of interest to anyone who cares about disk drives' lifespans.
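For anyone wondering what an MTTF of 1,000,000 hours would predict if you took it at face value, here is a rough back-of-the-envelope sketch. It assumes a constant failure rate (exponentially distributed lifetimes) and drives powered on all year, which is exactly the sort of textbook model the second paper calls into question. The function names are mine, purely for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annualized_failure_rate(mttf_hours: float) -> float:
    """Probability that one drive fails within a year of continuous use.

    ASSUMPTION: constant failure rate (exponential lifetimes). This is the
    textbook model, not something measured in either paper.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def expected_failures_per_year(mttf_hours: float, drive_count: int) -> float:
    """Expected number of failures per year in a fleet, same assumption."""
    return drive_count * HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    mttf = 1_000_000  # hours, the datasheet-style figure from the paper's title
    print(f"Nominal AFR: {annualized_failure_rate(mttf):.2%}")
    print(f"Expected failures per 1000 drives per year: "
          f"{expected_failures_per_year(mttf, 1000):.2f}")
```

So the nominal figure works out to a little under 1% of drives per year, or roughly nine failures a year in a fleet of a thousand. Comparing that number with what you actually see in your own machine room is the interesting part.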

Failure Trends in a Large Disk Drive Population

Posted Mar 2, 2007 22:37 UTC (Fri) by dlang (guest, #313)

Actually, weren't both of these papers here a couple of weeks ago?

Failure Trends in a Large Disk Drive Population

Posted Mar 9, 2007 14:10 UTC (Fri) by NRArnot (subscriber, #3033)

This is well worth reading, especially if you are operating hardware RAID (such as 3ware) and monitoring with SMART.

Clearly much folklore is wrong. Some of the conclusions:

The best pre-failure indicator is when your drive reallocates its first block.

Nearly half the drives failed without SMART giving any hint that a failure was coming.

Over-cooled disks (20C) are not obviously more reliable than ones that run warm (up to 40C); if anything, the converse holds. (I presume they are designed to run at the temperature that you typically get inside a desktop PC, namely 30 to 35C. It's what any real engineer would do!)

There is little or no evidence that [S]ATA disks running continuously are less reliable or shorter-lived than ones that are powered on and off daily.

I'm not the only one who has trouble with drives that are seriously degraded (so slooooow!) while SMART still reports them as perfect. (They are also hard to track down in a hardware RAID set: how do you benchmark a particular drive when you suspect one of slowing a whole array?)

I wish they'd said what make(s) they studied, and whether they were desktop-grade or enterprise-grade. I guess their lawyers wouldn't let them. Though 4 years ago did they do enterprise-grade ATA at all?
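Since the first reallocated block is the indicator worth watching, here is a minimal sketch of how one might pull that number out of SMART data. It assumes the attribute table layout printed by smartmontools' `smartctl -A` for ATA drives (ten whitespace-separated columns, attribute 5 named `Reallocated_Sector_Ct`); other tools, or drives that report the attribute differently, will need adjusting. The function name and the sample text are made up for illustration. Given the finding above that nearly half the failed drives gave no SMART warning, a zero here is no guarantee of anything.

```python
from typing import Optional


def reallocated_sectors(smartctl_output: str) -> Optional[int]:
    """Return the raw reallocated-sector count, or None if it is not found.

    ASSUMPTION: input is the text of `smartctl -A /dev/...` for an ATA
    drive, with columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH,
    TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if (len(fields) >= 10
                and fields[0] == "5"
                and fields[1] == "Reallocated_Sector_Ct"):
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value in a format this sketch doesn't handle
    return None


# Made-up sample in the assumed layout, not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34
"""

if __name__ == "__main__":
    count = reallocated_sectors(SAMPLE)
    if count is None:
        print("No reallocated-sector attribute found")
    elif count > 0:
        print(f"Drive has reallocated {count} sector(s); consider replacing it")
    else:
        print("No reallocations reported (which proves little)")
```

Feeding this the captured output from each drive on a schedule, and alerting when the count first goes above zero, is about as much as the paper's results seem to justify.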


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds