
The 2006 Linux File Systems Workshop


Posted Jul 6, 2006 5:44 UTC (Thu) by Los__D (guest, #15263)
Parent article: The 2006 Linux File Systems Workshop

"Another interesting change in hardware is the rate of increase in capacity versus the rate of reduction in I/O errors per bit. In order for a disk to have the same overall number of I/O errors, every time capacity doubles, the per-bit I/O error rate must halve. Needless to say, this isn't happening, so I/O errors are actually more common even though the per-bit error rate has dropped."

I don't understand how these are connected... Isn't an I/O error an artifact of reading or writing, and not a defect in the hardware? If it is, then capacity hardly matters, as you only care about I/O errors per second, or per bit read/written, not about I/O errors per bit of total disk size.

Or maybe I misunderstood something?

Dennis



The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 7:19 UTC (Thu) by arjan (subscriber, #36785)

You can express the error rate as a probability, like this:

What is the probability that I can read this sector X days in the future?

That number seems to be more or less constant (compared to the capacity increase), but as the number of sectors increases, the answer to

What is the probability that I get an I/O error in ANY sector in X days?

goes up.
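arjan's point can be sketched numerically. This is a minimal illustration, assuming that sectors fail independently and that each has the same (hypothetical) probability p of becoming unreadable in the period; real failures are often correlated. The function name `p_any_sector_error` and the numbers are illustrative only.

```python
import math

# A minimal sketch, assuming each sector independently becomes unreadable
# with the same probability p over the period (a simplifying assumption).

def p_any_sector_error(p_sector: float, n_sectors: int) -> float:
    """Probability that at least one of n_sectors is unreadable."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_sectors * math.log1p(-p_sector))

p = 1e-10  # hypothetical per-sector probability, held constant
for n in (1_000_000, 2_000_000, 4_000_000):
    # Same per-sector odds, more sectors: the whole-disk odds roughly double.
    print(n, p_any_sector_error(p, n))
```

For small values the whole-disk probability is close to n times p, so doubling capacity at a fixed per-sector rate roughly doubles the chance of hitting an error somewhere.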

Disk error rates

Posted Jul 8, 2006 16:27 UTC (Sat) by giraffedata (subscriber, #1954)

The article did not fully specify the denominators of the error rates it mentioned.

The article means to say that the error rate per bit read has gone down, but the error rate per disk per day has gone up. With the implicit assumption that people have one filesystem per disk and use all the space, the error rate per filesystem per day has gone up as well. This assumes a system that accesses more data as it stores more data.
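The difference between the two denominators can be shown with a toy calculation. The figures below are hypothetical, and the sketch assumes, as the comment does, that the amount of data read per day grows in proportion to capacity; `errors_per_day` is an illustrative helper, not anything from the article.

```python
# A toy sketch, assuming (hypothetically) that bits read per day scale with
# capacity while the per-bit error rate improves more slowly.

def errors_per_day(bits_read_per_day: float, errors_per_bit_read: float) -> float:
    """Expected unreadable-data events per day for one disk."""
    return bits_read_per_day * errors_per_bit_read

# Hypothetical numbers: capacity (and daily reads) double, but the
# per-bit error rate only improves by a factor of 1.5, not 2.
old = errors_per_day(bits_read_per_day=1e12, errors_per_bit_read=1e-14)
new = errors_per_day(bits_read_per_day=2e12, errors_per_bit_read=1e-14 / 1.5)

# The per-bit rate went down, yet the per-disk-per-day rate went up.
print(old, new, new > old)
```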

Oh, and an error is an instance of trying to read back a particular piece of data and not being able to. It usually means permanent data loss.

The reason the article brings it up is that if the error rate per filesystem per day goes up, so does the FSCK rate per filesystem per day, and with the cost per FSCK also going up in proportion to the filesystem size, the FSCK cost per day goes up.
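Under the additional hypothetical assumption that both factors grow linearly with filesystem size (errors per day because more data is accessed, cost per fsck because there is more to scan), their product grows roughly with the square of the size. A toy sketch follows; the constants and the one-fsck-per-error rule are made up for illustration.

```python
# A rough scaling sketch. Assumptions (hypothetical): each error triggers
# one fsck, errors per day grow linearly with size, and fsck time grows
# linearly with size. The default constants are invented.

def fsck_cost_per_day(size_tb: float,
                      errors_per_tb_per_day: float = 0.001,
                      fsck_hours_per_tb: float = 0.5) -> float:
    fscks_per_day = size_tb * errors_per_tb_per_day  # more data, more errors
    hours_per_fsck = size_tb * fsck_hours_per_tb     # more data, longer fsck
    return fscks_per_day * hours_per_fsck            # so cost ~ size squared

for size in (1, 2, 4):
    print(size, fsck_cost_per_day(size))
```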

The 2006 Linux File Systems Workshop

Posted Oct 19, 2006 0:02 UTC (Thu) by eatsapizza (guest, #41199)

How do folks resolve the difference between the MTBF of a disk and
its bit error rate?

I've read of MTBFs of about 1M hours (1E6 hours), and bit error rates
of about 1E-15 per bit read (so the mean number of bits read between
failures is 1E15).

The reliability of a drive is R = exp(-t/MTBF), so the reliability
of a drive over a year (8760 hours) is about
R = exp(-8760/1E6) = about 99%.

But if you're reading from your disk at even, say, only 1 MB/s
(about 1 to 3% of its possible read rate), in a year you would read
1E6 bytes/s x 8 bits/byte x 31,536,000 s = about 2.523E14 bits,
and then:
R = exp(-2.523E14/1E15) = 77.7%. That is a very different result, and
it assumes the drive isn't even that busy.
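Both of the figures above can be reproduced with a short script. It assumes, as the post does, the exponential model R = exp(-exposure/mean), 1 MB/s meaning 1E6 bytes per second, and a 365-day year; the helper name `reliability` is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365                 # 8760 hours (365-day year assumed)
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600  # 31,536,000 seconds

def reliability(exposure: float, mean_between_failures: float) -> float:
    """Exponential reliability model: R = exp(-exposure / mean)."""
    return math.exp(-exposure / mean_between_failures)

# Time-based view: MTBF of 1E6 hours, one year of operation (about 99%).
r_time = reliability(HOURS_PER_YEAR, 1e6)

# Read-based view: 1 MB/s (assumed 1E6 bytes/s, 8 bits/byte) for a year,
# with a mean of 1E15 bits read between unrecoverable errors (about 77.7%).
bits_read = 1e6 * 8 * SECONDS_PER_YEAR
r_bits = reliability(bits_read, 1e15)

print(f"{r_time:.4f} {bits_read:.4g} {r_bits:.4f}")
```

The two numbers measure different things: the first is exposure in hours of operation, the second is exposure in bits read, so they need not agree.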

I know there are RAID 0+1, RAID5, and RAID6 (6+2), all of which
make things better, but how can the single-disk result be so
different?

eatsapizza

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds