June 12, 2007
This article was contributed by Valerie Henson
At this year's USENIX Conference on File and Storage Technologies
(FAST), we were treated to two
papers studying failure rates in disk populations numbering over
100,000. These kinds of data sets are hard to get - first you have to
have 100,000 disks, then you have to record failure-related data
faithfully for years on end, and then you have to release the data in
a form that doesn't get anyone sued. The storage community has
salivated over this kind of real-world data for years, and now we
have not one, but two (!) long-term studies of disk failure rates. The
conference hall was packed during these two presentations. When the
talks were done, we stumbled out into the hallway, dazed and excited
by the many surprising results. Heat is negatively correlated with
failure! Failures show short AND long-term correlation! SMART errors
do mean the drive is more likely to fail, but a third of drives die
with no warning at all! The size of the data sets, the quality of
analysis, and the non-intuitive results win these two papers a place
on the Kernel Hacker's Bookshelf.
The first paper (and winner of Best Paper) was Disk failures
in the real world: What does an MTTF of 1,000,000 hours mean to
you?, by Bianca Schroeder and Garth Gibson. They reviewed failure
data from a collection of 100,000 disks, over a period of up to 5
years. The disks were deployed in a variety of HPC clusters and at an
Internet service provider. Disk failure was defined as the disk being
replaced. The date of replacement was also used as the date of the
failure, since determining exactly when a disk failed was not
possible.
Their first major result was that the real-world annualized failure
rate (average percentage of disks failing per year) was
much higher than the manufacturer's estimate - an
average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers
obviously can't test disks for a year before shipping them, so they
stress test disks in high-temperature, high-vibration, high-workload
environments, and use data from previous models to estimate MTTF.
Only one set of disks had a real-world failure rate less than the
estimated failure rate, and one set of disks had a 13.5% annualized
failure rate!
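To put those numbers in context, the question in the paper's title has
a simple back-of-the-envelope answer. Here is a minimal sketch (in
Python, assuming a constant failure rate, a simplification the paper's
own age-dependence results undercut) converting between a quoted MTTF
and an annualized failure rate:

    # Convert between MTTF (hours) and annualized failure rate (AFR),
    # assuming an exponential (constant-rate) lifetime model. This is
    # only a rough sketch; the paper shows failure rates actually climb
    # with age.
    import math

    HOURS_PER_YEAR = 8760

    def afr_from_mttf(mttf_hours):
        """Fraction of disks expected to fail within one year."""
        return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

    def mttf_from_afr(afr):
        """Effective MTTF implied by an observed annualized failure rate."""
        return -HOURS_PER_YEAR / math.log(1 - afr)

    print(afr_from_mttf(1000000))   # ~0.009: a 1,000,000-hour MTTF is ~0.9% per year
    print(mttf_from_afr(0.03))      # ~288,000 hours for the observed 3% average
    print(mttf_from_afr(0.135))     # ~60,000 hours for the worst 13.5% population

In other words, the observed 3% average failure rate corresponds to an
effective MTTF of roughly 290,000 hours, less than a third of the
commonly quoted million-hour figure.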
More surprisingly, they found no correlation between failure rate and
disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set
was composed entirely of SATA drives, which are commonly regarded as
less reliable than SCSI or Fibre Channel.
In another surprise, they debunked the "bathtub model" of disk failure
rates. In this theory, disks experience a higher "infant mortality"
initial rate of failure, then settle down for a few years of low
failure rate, and then begin to wear out and fail. A graph of failure
probability vs. time looks like a bathtub, flat in the middle and
sloping up at the ends. Instead, the real-world failure rate began
low and steadily increased over the years. Disks don't have a sweet
spot of low failure rate.
Failures within a batch of disks were strongly correlated over both
short and long time periods. If one disk in a batch failed, the
probability of another failure in the same batch remained elevated for
at least two years afterward. That is scary news for RAID arrays built
from disks of the same batch. A recent paper in the 2006 Storage Security and
Survivability Workshop, Using
Device Diversity to Protect Data against Batch-Correlated Disk
Failures, by Jehan-François Pâris and Darrell D. E. Long,
calculated the increase in RAID reliability from mixing batches of
disks. Using more than one kind of disk increases costs, but by
combining the data from these two papers, RAID users can calculate
the value of the extra reliability and make the most economical
decision.
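As a rough illustration of why batch correlation matters to RAID users
(this is not a model from either paper; the batch multiplier below is a
made-up knob, and the group size and rebuild time are arbitrary), here
is a sketch of the chance that a second disk fails while a degraded
group rebuilds:

    # Sketch: probability that a second disk in an N-disk RAID group
    # fails during the rebuild window, assuming a constant per-disk
    # failure rate. The batch_multiplier is a hypothetical factor
    # standing in for batch correlation; neither paper supplies one.
    import math

    HOURS_PER_YEAR = 8760

    def second_failure_prob(n_disks, afr, rebuild_hours, batch_multiplier=1.0):
        per_hour_rate = -math.log(1 - afr) / HOURS_PER_YEAR
        surviving = n_disks - 1
        effective_rate = per_hour_rate * surviving * batch_multiplier
        return 1 - math.exp(-effective_rate * rebuild_hours)

    # 8-disk group, 3% AFR, 24-hour rebuild:
    print(second_failure_prob(8, 0.03, 24))        # ~0.06% if failures are independent
    print(second_failure_prob(8, 0.03, 24, 5.0))   # ~0.3% if a shared batch makes them 5x likelier

Even with made-up numbers, the point stands: any batch effect
multiplies the risk during exactly the window when the array has no
redundancy left.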
The second paper, Failure Trends
in a Large Disk Drive Population, by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso, reports on disk
failure rates at Google. They used a Google tool for recording system
health parameters and many other staples of Google software
(MapReduce, Bigtable, etc.) to collect and analyze the data. They
focused on SMART statistics - the monitoring facility built into many
modern disk drives, which records statistics such as scan errors and
relocated blocks.
The first result agrees with the first paper: The annualized failure
rate was much higher than estimated, between 1.7% and 8.6%. They next
looked for correlation between failure rate and drive utilization (as
estimated by the amount of data read from or written to the drive). They
found a much weaker correlation between high utilization and failure
rate than expected; low-utilization disks often had higher
failure rates than medium-utilization disks and, in the case of the
3-year-old vintage of disks, higher than the high-utilization group.
Now for the most surprising result. In Google's population of cheap
ATA disks, high temperature was negatively correlated
with failure! In the authors' words:
In fact, there is a clear trend showing that lower temperatures are
associated with higher failure rates. Only at very high temperatures
is there a slight reversal of this trend.
This correlation held true over a temperature range of 17-55°C. Only
in the 3-year-old disk population was there correlation between high
temperatures and failure rates. My completely unsupported and
untested hypothesis is that drive manufacturers stress test their
drives in high temperature environments to simulate longer wear.
Perhaps they have unwittingly designed drives that work better in
their high-temperature test environment at the expense of a more
typical low-temperature field environment.
Finally, they looked at the SMART data gathered from the drives.
Overall, any kind of SMART error correlated strongly with disk
failure. A scan error is reported when the drive finds a bad sector
while scanning the disk surface in the background. Within 8 months of the first
scan error, about 30% of drives would fail completely. A reallocation
error occurs when a block can't be written, and the block is
reassigned to another location on disk. A reallocation error resulted
in about 15% of affected drives failing within 8 months. On the other
hand, 36% of the drives that failed had no warning whatsoever, either
from SMART errors or from exceptionally high temperatures.
For Google's purposes, the predictive power of SMART is of limited
utility. Replacing every disk that reported a SMART error would, about
70% of the time, mean replacing a good disk that would have run for
years to come. For Google, this isn't cost-effective, since all their data is
replicated several times. But for an individual user for whom losing
their disk is a disaster, replacing the disk at the first sign of a
SMART error makes eminent sense. I have personally had two laptop
drives start spitting SMART errors in time to get my data off the disk
before it died completely.
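For anyone wanting to act on that advice, a minimal sketch of such a
check might look like the following (assuming Linux with smartmontools
installed and root privileges; the attribute names below are common
ones, but vendors vary in what they report):

    # Minimal SMART check using smartmontools' smartctl -A output.
    # Attribute names and raw-value semantics vary between vendors;
    # treat any nonzero count on these attributes as a reason to
    # back up and think about replacement.
    import subprocess

    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable")

    def check_smart(device="/dev/sda"):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # smartctl -A rows: ID# NAME FLAG VALUE WORST THRESH TYPE
            # UPDATED WHEN_FAILED RAW_VALUE
            if len(fields) >= 10 and fields[1] in WATCH and fields[9] != "0":
                print(f"{device}: {fields[1]} raw value {fields[9]} - "
                      "back up now and consider replacing the drive")

    check_smart()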
Overall, these are two exciting papers with long-awaited real-world
failure data on large disk populations. We should expect to see more
publications analyzing these data sets in the years to come.
Valerie Henson is a Linux file systems consultant specializing in file
system check and repair.