Runtime filesystem consistency checking
Posted Apr 4, 2012 12:07 UTC (Wed) by nix (subscriber, #2304)
Posted Apr 4, 2012 13:16 UTC (Wed) by ashvin (guest, #83894)
Posted Apr 12, 2012 12:47 UTC (Thu) by nye (guest, #51576)
My ZFS experience on a ~5TB pool consisting of six commodity HDs under fairly light load (i.e. it's a home file server) is that every couple of months a scrub detects checksum errors in a block or a small handful of blocks, without the device reporting any corresponding read/write errors.
Not sure if that's the situation you're talking about.
(Also, the same experience has taught me that Western Digital should be avoided like ebola. I actually wonder if their green series might be drives that have failed QC and been re-badged for the unsuspecting consumer.)
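For readers unfamiliar with the idea, here is a minimal sketch of how a scrub can notice corruption that the device never reports: keep a checksum per block, re-read everything later, and compare. This is an illustration only, not how ZFS implements it; the function names, the 4096-byte block size and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def scrub(data: bytes, stored: list[str], block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indices of blocks whose checksum no longer matches the stored one."""
    current = checksum_blocks(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


if __name__ == "__main__":
    original = bytes(3 * BLOCK_SIZE)      # three blocks of zeros
    stored = checksum_blocks(original)    # checksums recorded at write time

    damaged = bytearray(original)
    damaged[5000] ^= 0x01                 # flip one bit inside the second block
    print(scrub(bytes(damaged), stored))  # indices of blocks that fail the check
```

The point is that the read itself succeeds in both cases; only the comparison against a checksum stored elsewhere reveals that the data changed.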
Posted Apr 13, 2012 10:33 UTC (Fri) by etienne (subscriber, #25256)
I am not sure I am interpreting what is happening on my own PC correctly, but I suspect something like:
- one block of sectors develops a bit fault in the magnetic data
- the ECC corrects it each time the PC reads the sector, resulting in a *very long* delay of a few seconds
- the Linux driver does not notice that there was a long ECC correction, and does not decide to rewrite the sector identically to get the magnetic data corrected
- in the long term, a second error will appear in the magnetic data and the ECC will no longer be sufficient.
I do not know why the sector is not rewritten by the Linux driver. I do know that I solved the same problem on another PC by touching a file in a directory, forcing the sector containing the directory entry to be rewritten.
I never noticed the problem when the "old" ATA/IDE driver was used, but I am not sure I am correctly interpreting what has happened on my PC over the last few days...
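One rough way to look for the symptom described above from user space is to time each read and flag the ones that take suspiciously long. The sketch below is only that, a sketch: `find_slow_blocks`, the block size and the one-second threshold are hypothetical choices, and the page cache can hide delays entirely unless the data has not been read recently (reading a raw device also needs suitable permissions).

```python
import time


def find_slow_blocks(path, block_size=4096, threshold_s=1.0):
    """Read path block by block; return (offset, seconds) for each read slower than threshold_s."""
    slow = []
    # buffering=0 avoids Python's own buffering; the kernel page cache still applies.
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while True:
            start = time.monotonic()
            chunk = f.read(block_size)
            elapsed = time.monotonic() - start
            if not chunk:
                break  # end of file
            if elapsed > threshold_s:
                slow.append((offset, elapsed))
            offset += len(chunk)
    return slow


if __name__ == "__main__":
    import sys

    for offset, seconds in find_slow_blocks(sys.argv[1]):
        print(f"offset {offset}: read took {seconds:.2f}s")
```

A slow read found this way does not say why it was slow; it only narrows down where to look.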
Posted Apr 13, 2012 12:34 UTC (Fri) by james (subscriber, #1325)
And it can do all of that without having to worry about which operating system is running, whether it's a database using raw access, or whether it's a light layer using BIOS calls but no filesystem. It can preserve this information across reformats.
In your case, by causing the sector containing the directory entry to be rewritten, the disk probably decided that this was a great time to remap in a spare sector, so it actually went to a different part of the disk. (Unless the filesystem you were using put the new directory entry somewhere else anyway.)
And ECC correction doesn't take seconds; re-reading the same sector repeatedly in the hope that you can get a last good read does.
¹ If you've got command queueing turned on, several requests outstanding, and there's a delay, which sector caused the problem?
Posted Apr 13, 2012 14:51 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
It will also let you fire up background health checks (these can take quite a long time to complete -- as long as a day, as I recall) that may help turn up other problems.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds