Monitor disks with the S.M.A.R.T. monitoring tools
Posted Mar 13, 2008 22:18 UTC (Thu) by xav (subscriber, #18536)
Parent article: Monitor disks with the S.M.A.R.T. monitoring tools
Mmmh .. so apparently it can predict 30% of 60% of the failures, which is less than 20% of the failures. Doesn't seem too useful.
Posted Mar 13, 2008 22:46 UTC (Thu) by hmh (subscriber, #3838)
SMART failure prediction is crap, really. Most vendors set the thresholds too low. By the time an attribute crosses its threshold, it is usually too late to do anything.
But the SMART attributes, the error logging, and the self-tests are really useful. And so are smartd's mails to root when anything weird happens :-)
I run self-tests often, and long tests at least once a week. These find marginal and bad sectors in the RAID 1 components well before they become an issue. mdadm "array checks" should also be able to do this, but I've found that the SMART long test on my current set of disks is a lot more sensitive than simply telling the disk to read every sector. Your mileage will vary, of course :-)
Posted Mar 14, 2008 12:24 UTC (Fri) by NRArnot (subscriber, #3033)
Manufacturers don't want to RMA disks that still "work", just because they are no longer working as well as they did when shipped. That's why they set stupid SMART thresholds.
However, if you monitor the SMART counters yourself, you can get advance warning that a disk is starting to deteriorate, and swap it at that time. Unless you then put the removed disk into a test rig or unimportant system and exercise it for months or even years, you will never know if you caught a failing disk before failure or just replaced a good disk. However, the value of the data is usually much greater than the cost of the disk, so it's quite an easy decision.
Google published some statistics on SMART's predictive value and on disk reliability in general. (One surprise: keeping disks cooled under 30C *reduces* life expectancy!)
http://labs.google.com/papers/disk_failures.html
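As a rough illustration of "monitoring the SMART counters yourself", here is a minimal Python sketch that parses the attribute table printed by `smartctl -A` and reports watched counters whose raw value has grown since an earlier reading. It is only a sketch under stated assumptions: the list of watched attribute names, the helper names `parse_attributes` and `changed_counters`, and the sample text are all illustrative and not from the thread, and the attributes a drive actually reports vary by vendor and model.

```python
# Attribute names assumed (not from the thread) to be worth watching.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(smartctl_output):
    """Return {attribute_name: raw_value} from `smartctl -A` text output."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the start of RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def changed_counters(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched counters that grew."""
    return {
        name: (previous.get(name, 0), current[name])
        for name in watched
        if name in current and current[name] > previous.get(name, 0)
    }


# Hypothetical sample in the usual `smartctl -A` table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (0 13 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    last_week = {"Reallocated_Sector_Ct": 3}  # an earlier, saved reading
    print(changed_counters(last_week, parse_attributes(SAMPLE)))
```

In practice the text would come from running `smartctl -A` on the device and the previous reading from wherever you keep history; smartd can already do this kind of tracking and mail root, so a script like this is mainly useful when you want your own policy for when to swap a disk.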
Posted Mar 14, 2008 21:45 UTC (Fri) by giraffedata (subscriber, #1954)
> However, the value of the data is usually much greater than the cost of the disk, so it's quite an easy decision.
I don't think that's true. Often, the data is relatively unimportant, like a Google web page cache or a small part of a stream of undifferentiated experimental data. The rest of the time, the data is easily reconstructable, e.g. by copying from a mirror disk or backup tape.
People set up storage systems so that the value of preserving the data is commensurate with the cost of preserving it. If you perturb that system by replacing drives more often based on SMART data, I think you'll have a net loss. On the other hand, if you could exploit SMART data so as to get the same reliability with fewer redundant copies, that would be a win.
Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it.
> One surprise: keeping disks cooled under 30C *reduces* life expectancy
Only if you want to jump to conclusions; the study didn't actually isolate the cooling policy. It merely showed that the drives that failed tended to be the ones that were cooler. That's a long way from saying that if you speed up the fans, the disks will fail more. It is just as likely that the cool drives were of models where the engineers traded durability for low power consumption. Remember, the one great, consistent, fully controlled correlation these studies show is between failure rate and model.
Posted Mar 20, 2008 5:01 UTC (Thu) by roelofs (subscriber, #2599)
> Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it.
...for some definition of "fail." Keep in mind that performance drops, sometimes significantly, before unrecoverable data loss occurs.
Greg
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.