
Using fsck to defend against disk failures?

Posted Jan 27, 2008 15:45 UTC (Sun) by anton (guest, #25547)
In reply to: ext3 metaclustering by rfunk
Parent article: ext3 metaclustering

That and the "spreading inconsistency" theory and some other things I have read by people writing about fsck are failure types that I have never seen or read a first-hand report of, so I guess they are just myths or a perverted form of wishful thinking.

The kinds of disk failures I have seen have always been different. In particular, even if a drive developed a bad block, it recognized that itself (very slowly) and returned an error rather than wrong data. I'm not sure if fsck programs are up to dealing with a bad block of this kind in the metadata, but if a drive has a bad block, that's certainly a good time to replace the drive and restore the data from backup. Or, if you run RAID 1 or RAID 5, you just need to replace the drive (and make the replacement known to the RAID driver).

Moreover, even if a disk drive deteriorates over time, that's more likely to hit the data first rather than the metadata. But fsck checks only for some kinds of errors in the metadata, so if fsck is your defense against bad blocks, you don't value your data at all. Making a backup is more likely than fsck to uncover bad blocks (including those in the data), and has obvious additional benefits.
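To illustrate why a backup run tends to surface bad blocks: a backup has to read every file, and a drive that cannot read a block returns an I/O error to the reader. The following is only a rough sketch of that idea, not a backup tool; the function name `read_all_files` and its parameters are made up for this example. Note also that data already in the page cache may be served without touching the disk at all, so a clean run proves less than a failed one.

```python
import os


def read_all_files(root, chunk_size=1 << 20):
    """Read every regular file under root, as a backup would.

    Returns (bytes_read, failures), where failures is a list of
    (path, error message) pairs for files that could not be read.
    """
    bytes_read = 0
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
            except OSError as exc:
                # An unreadable block typically shows up here as EIO.
                failures.append((path, str(exc)))
    return bytes_read, failures


if __name__ == "__main__":
    total, failed = read_all_files(".")
    print(f"read {total} bytes, {len(failed)} unreadable file(s)")
    for path, message in failed:
        print(f"  {path}: {message}")
```

Unlike fsck, this touches file contents rather than metadata, which is the point being made above.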

Finally, a good way (much better than fsck) to test the drive for bad blocks is "smartctl -t long", even though I am sceptical about the predictive capabilities of SMART.
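For reference, a minimal sketch of driving that test from a script. It assumes smartmontools is installed, that the script runs with enough privileges to talk to the drive, and that `/dev/sda` is the drive in question (a placeholder; substitute your own device). The helper names are invented for this example. `smartctl -t long` only starts the test and returns immediately; the result is read later from the self-test log with `smartctl -l selftest`.

```python
import shutil
import subprocess


def smartctl_commands(device):
    """Return the two smartctl invocations for a long self-test.

    The first starts the extended (long) self-test on the drive;
    the second prints the drive's self-test log, which holds the result.
    """
    return [
        ["smartctl", "-t", "long", device],
        ["smartctl", "-l", "selftest", device],
    ]


def start_long_test(device):
    """Start the long self-test and return smartctl's text output."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found; is smartmontools installed?")
    start_cmd, _log_cmd = smartctl_commands(device)
    result = subprocess.run(start_cmd, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # "/dev/sda" is a placeholder device name.
    print(start_long_test("/dev/sda"))
```

The test runs inside the drive itself and can take hours on a large disk, so the log has to be checked separately once it has finished.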

Overall, I am very sceptical about the value of fsck for dealing with hardware failures, and a little bit less sceptical about its value when dealing with software failures (but I think I have not been bitten by a file system bug yet); in many cases (especially the hardware ones) we have to restore from backup anyway.


Using fsck to defend against disk failures?

Posted Jan 27, 2008 16:32 UTC (Sun) by nix (subscriber, #2304)

My mum's ancient 486 laptop had a really strange disk failure this 
Christmas. It started with a single bad sector, but then within about 
fifteen minutes one third of the sectors on the disk (in contiguous runs 
of varying length) were returning, not bad sectors, but `sector not 
found', i.e. the drive couldn't even find the sector address markers.

What I suspect may have happened, based on my extensive lack of experience 
in hard drive design, is that all the G forces the head assembly is 
exposed to whenever a seek happens had over time twisted the head reading 
the farthest side of whichever platter didn't contain the servo track out 
of true, so that when the servo track said it was over track X, the 
topmost heads were actually midway between tracks or something like that. 
In that position they couldn't read the sector addresses, couldn't find 
any data, and whoompfh, goodbye data.

(I've never heard of this failure mode anywhere else, and perhaps it was 
something different, but still, it was very strange. Disks *can* go mostly 
bad all at once. It's just rare.)

Disk failures

Posted Jan 27, 2008 21:58 UTC (Sun) by anton (guest, #25547)

Disk drives have not used servo tracks for a long time, because one could no longer align all the heads precisely enough (e.g., because of thermal expansion). Instead, servo information exists on each platter, interspersed in some way with the data. I don't know when this change happened; a 15+-year-old disk (486 generation) might still have a servo track. But couldn't the symptoms also be explained by the failure of just one of the heads?

Disk failures

Posted Jan 27, 2008 22:55 UTC (Sun) by nix (subscriber, #2304)

I said it was a prehistoric system, and indeed anything more modern than 
about, what, 1991 won't have this problem.

I'm not sure if a head failure could cause a failure to find sector 
address markers: I'm not sure if you could even distinguish the two cases 
without digging into the drive. (As I said, my expertise in hard drive 
engineering is notable mainly by its absence.)

It's just that heads are solid-state, and solid-state stuff doesn't die 
all that often, while the head assembly itself is being wrenched all over 
the place: simple bending could explain this, I think.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds