
Failure Trends in a Large Disk Drive Population


Posted Mar 9, 2007 14:10 UTC (Fri) by NRArnot (subscriber, #3033)
Parent article: Failure Trends in a Large Disk Drive Population

This is extremely well worth reading, especially if you are operating hardware RAID (such as 3Ware) and monitoring it with SMART.

Clearly much folklore is wrong. Some of the conclusions:

The best pre-failure indicator is when your drive reallocates its first block.
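As a rough illustration of watching for that first reallocation, here is a minimal Python sketch that reads the attribute table printed by `smartctl -A` (from smartmontools) and checks the raw value of attribute 5, which is conventionally named Reallocated_Sector_Ct. The function names and the sample text are made up for this example, and the parsing assumes the usual ten-column table layout. Drives behind a 3Ware controller normally need an extra device-type option such as `-d 3ware,N`; check the smartctl man page for your controller.

```python
import subprocess

# SMART attribute 5 is conventionally Reallocated_Sector_Ct.
REALLOCATED_SECTOR_ID = 5


def parse_raw_values(smartctl_output):
    """Map SMART attribute ID -> integer raw value from `smartctl -A` text.

    Assumes the usual ten-column attribute table; other lines are skipped.
    """
    raw_values = {}
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw column may carry trailing notes, e.g. "35 (Min/Max 20/45)".
        first_token = fields[9].split()[0]
        if first_token.isdigit():
            raw_values[int(fields[0])] = int(first_token)
    return raw_values


def has_reallocations(smartctl_output):
    """True if the drive reports at least one reallocated sector."""
    return parse_raw_values(smartctl_output).get(REALLOCATED_SECTOR_ID, 0) > 0


def smartctl_attributes(device):
    """Hypothetical helper: run `smartctl -A` on a device (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout


# Invented sample in the usual smartctl layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       12345
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (Min/Max 20/45)
"""

if __name__ == "__main__":
    if has_reallocations(SAMPLE):
        print("drive has reallocated sectors: plan a replacement")
```

On a real system you would pass the output of `smartctl_attributes("/dev/sda")` (the device path is an assumption) instead of `SAMPLE`. Since nearly half the failed drives gave no SMART warning at all, a clean result from a check like this says little about the drive's health.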

Nearly half the drives failed without SMART giving any hint that failure was coming.

Over-cooled disks (20C) are not obviously more reliable than ones that run warm (up to 40C), in fact the converse. (I presume they are designed to run at the temperature that you typically get inside a desktop PC, namely 30 to 35C. It's what any real engineer would do!)

There is little or no evidence that [S]ATA disks running continuously are less reliable, or have shorter lives, than ones that are powered on and off daily.

It's not just me who has trouble with drives that are seriously degraded (so slooooow!) while SMART still says they are perfect. (And they are hard to track down in a hardware RAID set: how do you benchmark a particular drive when you suspect it of slowing down a whole array?)

I wish they'd said what make(s) they studied, and whether they were desktop-grade or enterprise-grade. I guess their lawyers wouldn't let them. Though 4 years ago did they do enterprise-grade ATA at all?



Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds