Disks deteriorate underneath the filesystem. Usually it's age causing
increasingly large bad spots. The filesystem may have no bugs, but things
can happen to the disk to corrupt the filesystem. I'd rather know about
that.
Posted Jan 17, 2008 23:15 UTC (Thu) by sbergman27 (guest, #10767)
Then have "badblocks" run independently of e2fsck at 3AM once a week or so? Instead of
forcing the user and conference attendees to sit through a mandatory hour-long e2fsck at the
whim of the machine, hoping that the e2fsck happens to catch what badblocks is far better
designed to catch?
ext3 metaclustering
Posted Jan 17, 2008 23:31 UTC (Thu) by rfunk (subscriber, #4054)
Hour-long e2fsck?
1. Partition your disk -- root, /var, /usr, /home on different filesystems.
2. If you ever do enable the automatic checks, set the mounts-before-check count on
each filesystem to be a different prime number. That way multiple filesystems almost
never get checked at the same time.
I've never had an fsck on a non-server system (which seems to be the topic here) go
anywhere near an hour. Maybe five minutes at most.
In my experience, badblocks is far far slower than e2fsck.
And running anything automatically at 3am generally isn't an option on
conference-presentation laptops.
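The prime-number suggestion above can be sanity-checked with a few lines of Python. This is only a sketch under a simplified model (an assumption, not how e2fsprogs is documented to behave in every detail): all filesystems start at mount count 0, each is mounted once per boot, and each is checked whenever the boot number is a multiple of its interval. The intervals are made-up values for illustration.

```python
def boots_with_multiple_checks(intervals, boots):
    """Return the boot numbers on which two or more filesystems are due.

    Simplified model (an assumption): every filesystem starts at mount
    count 0, is mounted once per boot, and is checked on each boot whose
    number is a multiple of its interval.
    """
    clashes = []
    for boot in range(1, boots + 1):
        due = [n for n in intervals if boot % n == 0]
        if len(due) >= 2:
            clashes.append(boot)
    return clashes


# Hypothetical intervals for /, /var, /usr and /home.
primes = [23, 29, 31, 37]
print(boots_with_multiple_checks(primes, 1000))            # → [667, 713, 851, 899]

# The same interval everywhere makes every check a multiple check.
print(boots_with_multiple_checks([30, 30, 30, 30], 100))   # → [30, 60, 90]
```

With distinct primes, two checks first coincide at the product of the two smallest intervals, so clashes are rare but not impossible.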
ext3 metaclustering
Posted Jan 18, 2008 14:56 UTC (Fri) by fatrat (subscriber, #1518)
Not sure that partitions help here. If we are talking about a personal box/laptop, /home is the only
thing I care about and it'll have all the disk space as well.
ext3 metaclustering
Posted Jan 18, 2008 15:06 UTC (Fri) by rfunk (subscriber, #4054)
OK, then you won't mind if I rm -rf /usr on your machine. :-)
Try: du -shc /var /usr /home
(There's also the root stuff not in those, but it's harder to measure that.)
You may be surprised at how much is in /var and /usr.
ext3 metaclustering
Posted Jan 18, 2008 15:19 UTC (Fri) by fatrat (subscriber, #1518)
My home dir contains ~82 GB. Compared to that, /usr and /var don't contain a lot (under 10 GB).
I'm sure most people are similar, hence my comment.
ext3 metaclustering
Posted Jan 18, 2008 15:45 UTC (Fri) by rfunk (subscriber, #4054)
10GB is still a big important chunk of disk, whether the rest is 20GB or 82GB. Checking
it separately *will* speed up each check, and separating it into a separate filesystem will
make sure that errors on one part won't mess up the other part.
(Come to think of it, I suspect that the fsck speed is more dependent on number of files
than data size, though I don't know for sure.)
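To compare directories on both measures raised in this exchange, data size and number of files, something like the following could be used. It is a rough sketch: the function name is made up, it sums apparent file sizes rather than the disk blocks that du reports, and it counts hard-linked files once per name.

```python
import os
import stat


def tree_stats(root):
    """Return (total_bytes, file_count) for regular files under `root`.

    Symlinks are not followed and unreadable entries are skipped.
    Sizes are apparent sizes (st_size), not allocated blocks.
    """
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # vanished or inaccessible; ignore it
            if stat.S_ISREG(info.st_mode):
                total_bytes += info.st_size
                file_count += 1
    return total_bytes, file_count
```

Calling tree_stats("/usr") and tree_stats("/home") would then show whether a tree that is small in bytes is nevertheless large in file count.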
ext3 metaclustering
Posted Jan 19, 2008 22:22 UTC (Sat) by Frej (subscriber, #4165)
Partitioning is fixing the symptoms, not the problem.
Multiple fscks
Posted Jan 30, 2008 3:28 UTC (Wed) by Max.Hyre (subscriber, #1054)
> [S]et the mounts-before-check count on
> each filesystem to be a different prime number. That way multiple filesystems almost
> never get checked at the same time.
Even better is setting the maximum mount count to the same number on all filesystems, then using tune2fs to set the starting count to a different value on each.
Voila! Never a multiple fsck.
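A small simulation illustrates why staggered starting counts avoid simultaneous checks. It is a sketch under a simplified model (an assumption): each boot increments every filesystem's mount count by one, a check runs when the count reaches the shared interval, and the count then resets to zero. In e2fsprogs the two values would, as far as I know, be set with tune2fs -c (maximum mount count) and tune2fs -C (current mount count). The interval and starting counts below are made up.

```python
def checks_per_boot(interval, start_counts, boots):
    """Map boot number -> list of filesystem indices checked on that boot.

    Simplified model (an assumption): one mount per boot per filesystem;
    a check runs when the mount count reaches `interval`, after which the
    count resets to zero.
    """
    counts = list(start_counts)
    schedule = {}
    for boot in range(1, boots + 1):
        due = []
        for i in range(len(counts)):
            counts[i] += 1
            if counts[i] >= interval:
                due.append(i)
                counts[i] = 0
        if due:
            schedule[boot] = due
    return schedule


# Same interval everywhere, different (hypothetical) starting counts.
schedule = checks_per_boot(interval=30, start_counts=[0, 7, 15, 22], boots=300)
print(all(len(due) == 1 for due in schedule.values()))   # → True
```

As long as the starting counts differ modulo the interval, no two filesystems in this model ever fall due on the same boot.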
ext3 metaclustering
Posted Jan 17, 2008 23:19 UTC (Thu) by magila (subscriber, #49627)
Disks these days are pretty good at hiding bad sectors from the host. If it gets bad enough
that the OS starts seeing bad data then the drive is probably on its last legs and will soon
fail completely. In any case monitoring the SMART logs will usually catch a drive that is
gradually degrading without the frustrating fsck delays.
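As a rough illustration of what watching the SMART data for gradual degradation might look like, the sketch below pulls raw values out of the attribute table that smartctl -A prints for ATA drives. The function names and the choice of watched attributes are assumptions for illustration, and the parser assumes the classic ten-column table layout; smartd already does this kind of monitoring properly.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def raw_attribute(smartctl_output, name):
    """Return the integer RAW_VALUE of the named attribute, or None.

    Assumes the classic ATA attribute table from `smartctl -A`: each row
    starts with a numeric ID and the attribute name, and the raw value is
    the tenth whitespace-separated column.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == name:
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value not a plain integer
    return None


def looks_degraded(smartctl_output, threshold=0):
    """True if any watched sector counter exceeds `threshold`."""
    for name in WATCHED:
        value = raw_attribute(smartctl_output, name)
        if value is not None and value > threshold:
            return True
    return False
```

The text would come from running smartctl -A on the device (typically as root), for example via subprocess.run with capture_output=True and text=True.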
ext3 metaclustering
Posted Jan 17, 2008 23:33 UTC (Thu) by rfunk (subscriber, #4054)
True, but how many people monitor SMART logs on a laptop, or even a desktop?
More to the point, how many of the people disabling the auto-fsck monitor their SMART
logs?
ext3 metaclustering
Posted Jan 18, 2008 22:00 UTC (Fri) by nix (subscriber, #2304)
smartd can send you emails when things go suspiciously wrong.
ext3 metaclustering
Posted Jan 18, 2008 22:05 UTC (Fri) by rfunk (subscriber, #4054)
True. How many people have system-level email working properly on their laptops, and
are able to get such emails?
ext3 metaclustering
Posted Jan 18, 2008 22:14 UTC (Fri) by nix (subscriber, #2304)
Um, anyone competent? All sorts of other email, some security-important,
gets sent by various daemons and shouldn't just be binned or ignored... of
course a lot of people aren't competent :/
ext3 metaclustering
Posted Jan 18, 2008 22:20 UTC (Fri) by rfunk (subscriber, #4054)
My programmer coworkers have enough trouble with the task, and they're techies.
Forget about the non-techie user that is adopting Linux more and more.
Everyone sets up their GUI mail program, and totally ignores the system-level MTA
(sendmail/postfix/exim). They just never get those emails.
(Sysadmin types being the exception, of course, but they're few and far between these
days.)
ext3 metaclustering
Posted Jan 19, 2008 18:35 UTC (Sat) by raxyx (subscriber, #50026)
So THAT's what these MTAs are for. Cool.
> Everyone sets up their GUI mail program, and totally ignores the system-level MTA
> (sendmail/postfix/exim). They just never get those emails.
Full ack on that. On some of my Debian machines, during the boot sequence, the thing that
takes the most time to get loaded is exim4, so one day I got fed up with it and removed it,
didn't notice any difference afterwards. I guess I'm going to rethink that move :-)
lightweight MTAs for outgoing mail only
Posted Jan 19, 2008 20:08 UTC (Sat) by liamh (subscriber, #4872)
I have taken to removing exim4 and installing either ssmtp or nullmailer
aptitude install ssmtp exim4- exim4-base- exim4-config- exim4-daemon-light-
Just enough MTA to get the word out. Since few people want/need a full MTA, this seems like
it should be the default. But I don't do SMART disk monitoring; a few years back I tried it and
it led to some unreliable system behavior.
ext3 metaclustering
Posted Jan 19, 2008 1:56 UTC (Sat) by cortana (subscriber, #24596)
Well, Debian configures smartd to both mail root and display a notification on the desktops of
currently-logged-in users. :)
ext3 metaclustering
Posted Jan 17, 2008 23:56 UTC (Thu) by dberkholz (subscriber, #23346)
Google published a paper fairly recently on a large study of disk failures. As I recall, they
found that SMART logs were not reliable indicators.
ext3 metaclustering
Posted Jan 18, 2008 4:12 UTC (Fri) by magila (subscriber, #49627)
Notice I said gradually degrading. SMART won't help in the event of a catastrophic mechanical
failure, which is what most of the unanticipated failures in the Google study probably were.
Fsck doesn't help in that case either though. It's only the kinds of failures that cause a
slow accumulation of bad sectors that fsck would matter for, and those are the kinds of
failures that SMART is piratically guaranteed to catch.
ext3 metaclustering
Posted Jan 18, 2008 8:51 UTC (Fri) by njs (guest, #40338)
piratically... guaranteed...?
ext3 metaclustering
Posted Jan 18, 2008 22:03 UTC (Fri) by nix (subscriber, #2304)
That's SMArrrT for you.
Using fsck to defend against disk failures?
Posted Jan 27, 2008 15:45 UTC (Sun) by anton (guest, #25547)
That and the "spreading inconsistency" theory and some other things I
have read by people writing about fsck are failure types that I have
never seen or read a first-hand report of, so I guess they are just
myths or a perverted form of wishful thinking.
The kinds of disk failures I have seen have always been different.
In particular, even if a drive developed a bad block, it recognized
that itself (very slowly) and returned an error rather than wrong
data. I'm not sure if fsck programs are up to dealing with a bad
block of this kind in the metadata, but if a drive has a bad block,
that's certainly a good time to replace the drive and restore the data
from backup. Or, if you run RAID 1 or RAID 5, you
just need to replace the drive (and make it known to the RAID driver).
Moreover, even if a disk drive deteriorates over time, that's more
likely to hit the data first rather than the meta-data. But fsck
checks only some kinds of errors in the meta-data, so if fsck is your
defense against bad blocks, you don't value your data at all. Making
a backup is more likely to unveil bad blocks than fsck (also in data),
and has obvious additional benefits.
Finally, a good way (much better than fsck) to test the drive for
bad blocks is "smartctl -t long", even though I am sceptical about the
predictive capabilities of SMART.
Overall, I am very sceptical about the value of fsck for dealing
with hardware failures, and a little bit less sceptical about its
value when dealing with software failures (but I think I have not been
bitten by a file system bug yet); in many cases (especially the
hardware ones) we have to restore from backup anyway.
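The point above that making a backup surfaces bad blocks in data as well as metadata comes down to reading every file to the end. A minimal sketch of that idea, with a made-up function name, might look like this. One caveat: reads may be served from the page cache rather than the disk, so a clean result is weaker evidence than a drive self-test.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Read every regular file under `root`; return [(path, error_text)].

    A drive that reports a bad block surfaces here as an OSError
    (typically an I/O error) while reading the affected file.
    Symlinks and special files are skipped.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Discard the data; only read errors matter here.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures
```

An empty list means every file could be read in full; any entries name the files to restore from backup.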
Using fsck to defend against disk failures?
Posted Jan 27, 2008 16:32 UTC (Sun) by nix (subscriber, #2304)
My mum's ancient 486 laptop had a really strange disk failure this
Christmas. It started with a single bad sector, but then within about
fifteen minutes one third of the sectors on the disk (in contiguous runs
of varying length) were returning, not bad sectors, but `sector not
found', i.e. the drive couldn't even find the sector address markers.
What I suspect may have happened, based on my extensive lack of experience
in hard drive design, is that all the G forces the head assembly is
exposed to whenever a seek happens had over time twisted the head reading
the farthest side of whichever platter didn't contain the servo track out
of true, so that when the servo track said it was over track X, the
topmost heads were actually midway between tracks or something like that.
In that position they couldn't read the sector addresses, couldn't find
any data, and whoompfh, goodbye data.
(I've never heard of this failure mode anywhere else, and perhaps it was
something different, but still, it was very strange. Disks *can* go mostly
bad all at once. It's just rare.)
Disk failures
Posted Jan 27, 2008 21:58 UTC (Sun) by anton (guest, #25547)
Disk drives have not used servo tracks for a long time, because one
could no longer align all the heads precisely enough (e.g., because of
thermal expansion). Instead, servo information exists on each
platter, interspersed in some way with the data. I don't know when
this change happened; a 15+-year old disk (486 generation) might still
have a servo track. But couldn't the symptoms also be explained by
the failure of just one of the heads?
Disk failures
Posted Jan 27, 2008 22:55 UTC (Sun) by nix (subscriber, #2304)
I said it was a prehistoric system, and indeed anything more modern than
about, what, 1991 won't have this problem.
I'm not sure if a head failure could cause a failure to find sector
address markers: I'm not sure if you could even distinguish the two cases
without digging into the drive. (As I said, my expertise in hard drive
engineering is notable mainly by its absence.)
It's just that heads are solid-state, and solid-state stuff doesn't die
all that often, while the head assembly itself is being wrenched all over
the place: simple bending could explain this, I think.