Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 14, 2008 12:24 UTC (Fri) by NRArnot (subscriber, #3033)
In reply to: Monitor disks with the S.M.A.R.T. monitoring tools by hmh
Parent article: Monitor disks with the S.M.A.R.T. monitoring tools

Manufacturers don't want to RMA disks that still "work", just because they are no longer
working as well as they did when shipped. That's why they set stupid SMART thresholds.
However, if you monitor the SMART counters yourself, you can get advance warning that a disk
is starting to deteriorate, and swap it at that time. 
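As a rough illustration of watching the counters yourself, here is a minimal Python sketch that reads the attribute table printed by `smartctl -A` (from smartmontools) and reports watched counters whose raw value is non-zero. The watch list and the "anything above zero" rule are my own assumptions for the example, not vendor guidance, and the helper names are hypothetical.

```python
import subprocess

# Assumed watch list: counters often treated as early-warning signs.
# Attribute names vary by vendor, so adjust for your drives.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():  # skip raw values that are not plain integers
                attrs[fields[1]] = int(raw)
    return attrs


def find_warnings(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is above zero, sorted by name."""
    return sorted(name for name in watched if attrs.get(name, 0) > 0)


def check_device(device):
    """Run smartctl on a device (usually needs root) and return any warnings."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return find_warnings(parse_attributes(result.stdout))
```

Calling `check_device("/dev/sda")` from a cron job and mailing any non-empty result would be one way to use it; logging the raw values over time and alerting on growth is probably more useful than a fixed cutoff.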

Unless you then put the removed disk into a test rig or unimportant system and exercise it for
months or even years, you will never know if you caught a failing disk before failure or just
replaced a good disk. However, the value of the data is usually much greater than the cost of
the disk, so it's quite an easy decision.

Google published some statistics on SMART's predictive value and on disk reliability in
general. (One surprise: keeping disks cooled under 30C *reduces* life expectancy!)
http://labs.google.com/papers/disk_failures.html



Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 14, 2008 21:45 UTC (Fri) by giraffedata (subscriber, #1954)

> However, the value of the data is usually much greater than the cost of the disk, so it's quite an easy decision.

I don't think that's true. Often, the data is relatively unimportant, like a Google web page cache or a small part of a stream of undifferentiated experimental data. The rest of the time, the data is easily reconstructable, e.g. by copying from a mirror disk or backup tape. People set up storage systems so that the value of preserving the data is commensurate with the cost of preserving it. If you perturb that system by replacing drives more often based on SMART data, I think you'll have a net loss.

On the other hand, if you could exploit SMART data so as to get the same reliability with fewer redundant copies, that would be a win.

Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it.

> One surprise: keeping disks cooled under 30C *reduces* life expectancy

Only if you want to jump to conclusions; the study didn't actually isolate the cooling policy. It merely showed that drives that failed tended to be the ones that were cooler. That's a long way from saying that if you speed up the fans, the disks will fail more. Just as likely is that the cool drives were of models where the engineers traded durability for low power consumption. Remember, the one great, consistent, fully controlled correlation these studies show is between failure rate and model.
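To make the confounding argument concrete, here is a small Python sketch with an entirely invented fleet (the numbers are made up for illustration and are not from the Google paper). Model "A" is assumed to run cool and fail often, model "B" to run warm and fail rarely. Within each model, temperature makes no difference, yet the pooled figures make cool drives look worse.

```python
# Hypothetical fleet: (model, temperature band) -> (drives, failures).
# All figures are invented for illustration only.
FLEET = {
    ("A", "cool"): (900, 90),
    ("A", "warm"): (100, 10),
    ("B", "cool"): (100, 2),
    ("B", "warm"): (900, 18),
}


def failure_rate(rows):
    """Failures divided by drives over a list of (drives, failures) pairs."""
    drives = sum(d for d, _ in rows)
    failures = sum(f for _, f in rows)
    return failures / drives


def rate_by_temp(temp):
    """Pooled failure rate for one temperature band, ignoring model."""
    return failure_rate([v for (_, t), v in FLEET.items() if t == temp])


def rate_by_model_and_temp(model, temp):
    """Failure rate for one model in one temperature band."""
    return failure_rate([FLEET[(model, temp)]])


if __name__ == "__main__":
    for temp in ("cool", "warm"):
        print(temp, "pooled:", rate_by_temp(temp))
    for model in ("A", "B"):
        for temp in ("cool", "warm"):
            print(model, temp, rate_by_model_and_temp(model, temp))
```

In this toy fleet the pooled rate is higher for cool drives only because most cool drives happen to be the less durable model; speeding up the fans would change nothing.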

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 20, 2008 5:01 UTC (Thu) by roelofs (subscriber, #2599)

> Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it.

...for some definition of "fail." Keep in mind that performance drops, sometimes significantly, before unrecoverable data loss occurs.

Greg

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds