By Forrest Cook
March 11, 2008

The S.M.A.R.T. Monitoring Tools (Smartmontools) is a cross-platform set of utilities that are able to monitor operating data from hard drives:

The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI hard disks. In many cases, these utilities will provide advanced warning of disk degradation and failure. It should run on any modern Darwin (Mac OSX), Linux, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, eComStation, QNX, or Windows system.

Wikipedia defines SMART as the Self-Monitoring, Analysis, and Reporting Technology: "Mechanical failures, which are usually predictable failures, account for 60 percent of drive failure. The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventative action — such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T."

Version 5.38 of Smartmontools was recently announced. Improvements include:

  • Several Libata/Marvell driver improvements.
  • New additions to the drive database.
  • ATA-8 updates.
  • New Dragonfly support.
  • Support for the QNX operating system.
  • A new no-fork option for smartd.
  • Better support for systems with large numbers of disks.
  • Improvements to the descriptions of the SMART Attribute list.
  • A workaround for a Samsung firmware bug.
  • Improvements to the CCISS support system.
  • New selective self-test command line options.
  • Build system portability improvements.
  • Numerous bug fixes.

Building Smartmontools was straightforward. The code was downloaded and unpacked. The usual configure, make and make install steps were performed on an Ubuntu 7.04 system with no troubles. The operation instructions from the README file were followed and the software was able to discover data from the one hard drive on the test system. This example output shows the wide variety of drive information that Smartmontools can display. The drive appears to be healthy.

If you are a systems administrator who needs to keep track of hard drive reliability data, Smartmontools should be able to provide some useful drive information. With the addition of a small amount of glue-logic scripting, it should not be too difficult to set up an automated drive monitoring system.
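As a rough illustration of the kind of glue logic meant here, the following Python sketch parses the attribute table printed by smartctl -A and flags attributes that look worrying. It is only a sketch under stated assumptions: the function names, the list of watched attributes, and the sample text are hypothetical and were not part of the test described above, and the parser assumes the usual ten-column attribute table layout.

```python
import subprocess

# Illustrative sample only (not captured from a real drive); it mimics the
# ten-column table that `smartctl -A` normally prints.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   035   048   000    Old_age   Always       -       35
"""


def parse_attributes(text):
    """Turn the attribute table from `smartctl -A` into a dict keyed by name."""
    attrs = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            in_table = False  # a blank line ends the table
            continue
        if fields[0] == "ID#":
            in_table = True   # header row; data rows follow
            continue
        if in_table and fields[0].isdigit() and len(fields) >= 10:
            try:
                attrs[fields[1]] = {
                    "id": int(fields[0]),
                    "value": int(fields[3]),
                    "worst": int(fields[4]),
                    "thresh": int(fields[5]),
                    "raw": " ".join(fields[9:]),
                }
            except ValueError:
                continue      # skip rows with non-numeric columns
    return attrs


def find_warnings(attrs, watched=("Reallocated_Sector_Ct", "Current_Pending_Sector")):
    """Return messages for failed attributes and nonzero watched raw counts.

    The `watched` list is an assumption for this sketch, not a recommendation
    taken from the smartmontools documentation.
    """
    warnings = []
    for name, attr in attrs.items():
        if attr["thresh"] > 0 and attr["value"] <= attr["thresh"]:
            warnings.append(
                f"{name}: normalized value {attr['value']} "
                f"at or below threshold {attr['thresh']}"
            )
        elif name in watched and attr["raw"].isdigit() and int(attr["raw"]) > 0:
            warnings.append(f"{name}: raw count {attr['raw']} is nonzero")
    return warnings


def read_attributes(device):
    """Run `smartctl -A` on a device (usually needs root) and parse the result."""
    # smartctl uses its exit status as a bit mask, so a nonzero status is not
    # treated as an error here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_attributes(result.stdout)


if __name__ == "__main__":
    for message in find_warnings(parse_attributes(SAMPLE_OUTPUT)):
        print(message)
```

Run from cron against read_attributes("/dev/sda") (the device path is an example), a script like this could mail its warnings to an administrator. The smartd daemon shipped with the package already provides this sort of monitoring, so a home-grown script mainly makes sense for custom reporting.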


Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 13, 2008 13:04 UTC (Thu) by nix (subscriber, #2304)

You mean an automated drive monitoring system like, say, smartd(8) in the smartmontools? :)

SMART & Failures...

Posted Mar 13, 2008 17:31 UTC (Thu) by leoc (subscriber, #39773)

Google put out an interesting paper about this very topic.

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 13, 2008 21:24 UTC (Thu) by malex (subscriber, #15692)

Unfortunately, smartmontools still can't provide the SMART test information from external USB
drives - one has to use manufacturer's MSWindows(TM) based Software to do that. It's a pity.
If that capability were present I wouldn't be sitting without current backups while I'm
waiting for a replacement drive to arrive.

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 18, 2008 22:58 UTC (Tue) by jimparis (subscriber, #38647)

This is not necessarily a problem with smartmontools. USB ATA passthrough is still new -- even Mark Lord, author of hdparm and a big contributor to ATA/IDE code in Linux, expressed surprise at finding an enclosure that actually supports it.

Some vendors (Cypress) have invented their own custom protocol for getting SMART data this way, and there has been some recent discussion about including support for it in smartmontools...

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 19, 2008 17:58 UTC (Wed) by malex (subscriber, #15692)

I've realized that in time and switched to using a combination of linux-supported eSATA card
and an eSATA enclosure. Now, SMART works, my backups are fast and I just don't care anymore
for USB2. smartmontools work great with my current setup.

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 13, 2008 22:18 UTC (Thu) by xav (subscriber, #18536)

Mmmh .. so apparently it can predict 30% of 60% of the failures, which is less than 20% of the
failures. Doesn't seem too useful.

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 13, 2008 22:46 UTC (Thu) by hmh (subscriber, #3838)

SMART failure prediction is crap, really.  Most vendors set the thresholds too low.  When you
get one beyond the safety limit, it is usually too late for anything.

But the SMART attributes, error logging, and the self tests are really useful.  And so are
smartd's mails to root when anything weird happens :-)

I do self tests often, and long tests at least once a week. These find marginal and bad
sectors in the RAID 1 components well before it becomes an issue.

mdadm "array checks" also should be able to do it, but I've found that the SMART long test in
my current set of disks is a lot more sensitive than simply telling the disk to read every
sector.  Your mileage will vary, of course :-)

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 14, 2008 12:24 UTC (Fri) by NRArnot (subscriber, #3033)

Manufacturers don't want to RMA disks that still "work", just because they are no longer
working as well as they did when shipped. That's why they set stupid SMART thresholds.
However, if you monitor the SMART counters yourself, you can get advance warning that a disk
is starting to deteriorate, and swap it at that time. 

Unless you then put the removed disk into a test rig or unimportant system and exercise it for
months or even years, you will never know if you caught a failing disk before failure or just
replaced a good disk. However, the value of the data is usually much greater than the cost of
the disk, so it's quite an easy decision.

Google published some statistics on SMART's predictive value and on disk reliability in
general. (One surprise: keeping disks cooled under 30C *reduces* life expectancy!)
http://labs.google.com/papers/disk_failures.html

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 14, 2008 21:45 UTC (Fri) by giraffedata (subscriber, #1954)

"However, the value of the data is usually much greater than the cost of the disk, so it's quite an easy decision."

I don't think that's true. Often, the data is relatively unimportant, like a Google web page cache or a small part of a stream of undifferentiated experimental data. The rest of the time, the data is easily reconstructable, e.g. by copying from a mirror disk or backup tape. People set up storage systems so that the value of preserving the data is commensurate with the cost of preserving it. If you perturb that system by replacing drives more often based on SMART data, I think you'll have a net loss.

On the other hand, if you could exploit SMART data so as to get the same reliability with fewer redundant copies, that would be a win.

Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it.

"One surprise: keeping disks cooled under 30C *reduces* life expectancy"

Only if you want to jump to conclusions; the study didn't actually isolate the cooling policy. It merely showed that drives that failed tended to be the ones that were cooler. That's a long way from saying that if you speed up the fans, the disks will fail more. Just as likely is that the cool drives were of models where the engineers traded durability for low power consumption. Remember, the one great, consistent, fully controlled correlation these studies show is between failure rate and model.

Monitor disks with the S.M.A.R.T. monitoring tools

Posted Mar 20, 2008 5:01 UTC (Thu) by roelofs (subscriber, #2599)

"Either the Google paper or another that came out around the same time concluded that the best policy was to wait for a drive to fail, then replace it."

...for some definition of "fail." Keep in mind that performance drops, sometimes significantly, before unrecoverable data loss occurs.

Greg

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds