crash vs. power drop

Posted Sep 10, 2009 17:50 UTC (Thu) by ncm (subscriber, #165)
In reply to: crash vs. power drop by njs
Parent article: POSIX v. reality: A position on O_PONIES

Scragged sectors are a risk, but not the main risk, of power drops. What's worse is blocks still in buffer RAM on the drive that the system thought were already physically on disk, because the drive told it so.

http://lwn.net/Articles/352002/

When the drives lie to the file system, the file system can't guarantee anything. Most drives (i.e. the ones you have) lie. Drives that don't lie cost way more, and are slower; it's much cheaper to add some battery backup, and even some redundancy, than to buy the expensive drives. There's no point in discussing fine points of data ordering if you haven't got one or the other.



crash vs. power drop

Posted Sep 11, 2009 6:27 UTC (Fri) by zmi (guest, #4829) [Link]

> Second, more subtle but probably more important, drives lie about what is
> physically on disk.

That's why in a RAID you really *must* turn off the drive's write cache.
I've tried to explain that in the XFS FAQ:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_w...
and also in the questions below that one.

Short version: I've got new 2TB WD drives with 64MB of cache each, and we
intend to use 16 of them in a RAID. That adds up to 1024MB (1GB) of write
cache. So in the worst case,
1) you've got a UPS, but your power supply fails and the PC/server loses
power
2) the drives' caches are full, so up to 1GB of data that the filesystem
believed was already on disk is lost. There's a *very* high chance that
lots of metadata is included in those cached writes.
3) each of the 16 drives could write "half sectors", effectively destroying
both the previous and the new content.

In all this discussion, it would have been worth noting that if you really
*care* about your data, you *must* turn off the drive write cache. Yes,
power failures are rare in countries with a reliable power grid. Still,
I use a UPS, and in the last half year I've had
a) my daughter playing around and turning the server's power off
b) a dead power supply in my workstation
so even with a UPS, "drive write cache off" is a must. Simply put a
hdparm -W0 /dev/sda
in your boot scripts.
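
A minimal sketch of what such a boot-script snippet could look like,
assuming the drives show up as /dev/sda through /dev/sdp (adjust the glob
for your own setup):

# disable the volatile write cache on every drive in the array
for dev in /dev/sd[a-p]; do
    hdparm -W0 "$dev"
done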

Note that this still only helps with 1) and 2); for problem 3) there's
nothing anybody but the disk manufacturers can do. I must say I have no
evidence of ever having hit that problem myself. It might be what happened
in some cases of "strange filesystem problems" after a crash, but you can't
tell for sure.

As for the rename: really, there should only ever be a chance of ending up
with either the old file or the new one, and the filesystem should
guarantee this even in crash situations.
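
For what it's worth, here is a minimal sketch of the usual application-side
pattern for getting "either the old file or the new one": write a temporary
file, flush it, then rename over the target. The path and program name are
just examples:

tmp=$(mktemp /data/config.XXXXXX)
generate_config > "$tmp"     # write the complete new contents
sync                         # an application would fsync() just the temp file
mv "$tmp" /data/config       # rename() atomically replaces the old file

The whole O_PONIES debate is essentially about whether that explicit flush
step should be necessary before the rename.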

Note: In Linux you can tune writeback behaviour in /etc/sysctl.conf:
# start writeback at around 16MB, max. 250MB
vm.dirty_background_bytes = 16123456
vm.dirty_bytes = 250123456
# older kernels had this:
#vm.dirty_background_ratio = 5
#vm.dirty_ratio = 10
# write dirty data to disk once it is 10 seconds old (default: 3000 = 30s)
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100

Note that dirty_bytes/dirty_ratio is the point at which processes doing new
writes start to block until the cache is flushed. With the default ratios
on systems with 8GB of RAM or more, you could end up with gigabytes of
dirty data sitting in the cache.
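
To see where a system currently stands and to apply such settings without
a reboot, something like this should work (the values are just the ones
from the example above):

grep -E 'Dirty|Writeback' /proc/meminfo    # how much dirty data is pending
sysctl -p /etc/sysctl.conf                 # re-read the settings
sysctl vm.dirty_bytes vm.dirty_background_bytes   # confirm what's in effect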

Sorry for putting all in one post, but I hope it helps people who care
about their data to have some tunings to start with.

mfg zmi

crash vs. power drop

Posted Sep 11, 2009 11:40 UTC (Fri) by Cato (subscriber, #7643) [Link]

Thanks for explaining all that.

On the topic of writing 'half sectors' due to a power drop: the author of http://lwn.net/Articles/351521/ has done quite a lot of tests on various hard drives and generally found that they usually don't do this, though some do. He has a useful program that can test any drive for this behaviour, though it's mostly intended to test for out-of-order writes due to caching - I believe only some drives lie about this.
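
The basic idea behind such a test is simple enough to sketch in a few lines of shell; the catch is that the record of what was supposedly durable has to live on a second machine, since the local copy dies with the power. Something along these lines (the mount point and logging host are made up):

i=0
while true; do
    echo "$i" > /mnt/testdisk/rec.$i            # numbered record on the drive under test
    sync                                        # ask the kernel (and, in theory, the drive) to make it durable
    echo "durable $i" | nc logger.example.com 9999   # tell a second machine it should now survive power loss
    i=$((i+1))
done
# After pulling the plug and rebooting: any record the logger saw as "durable"
# but which is missing or corrupt on the disk means something in the stack lied.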

crash vs. power drop

Posted Sep 14, 2009 14:13 UTC (Mon) by nye (guest, #51576) [Link]

This is why barriers *should* be a good thing.

If all you care about is benchmark results, there's an obvious incentive for the drive to claim falsely that some data has been physically written to disk.

On the other hand, there's far less incentive to lie about barriers. If all you're saying is that '*if and when* this data has been written, then that other data will have also been written', you can still happily claim that it's all done when it isn't, without breaking the barrier commitment.

When you have that commitment, it's possible to build a transaction system upon it which works even under the assumption that the drive will lie. You're not going to achieve the full benchmark speed, but it's going to be far better than turning off the cache.

Of course, whether drive manufacturers see it that way is another matter. Is there any data on whether drives actually honour write barriers? It would be interesting to see if there are indeed drives that aren't expensive enough to report accurately on when data has been written, but still honour the barrier request.
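
Short of pulling the plug, one thing you can at least check from userspace is whether a drive has its write cache enabled and advertises the FLUSH CACHE commands that Linux barriers rely on; for example (device name assumed):

hdparm -W /dev/sda                        # is the volatile write cache currently on?
hdparm -I /dev/sda | grep -i 'write cache'
hdparm -I /dev/sda | grep -i 'flush'      # FLUSH_CACHE / FLUSH_CACHE_EXT support?

Whether the drive actually empties its cache when asked is, of course, exactly what only a power-pull test can show.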

crash vs. power drop

Posted Sep 14, 2009 15:06 UTC (Mon) by ncm (subscriber, #165) [Link]

This is where Scott Adams's "which is more likely?" principle is useful. We just frame the question, thus:

Drive manufacturer A can sell almost as many drives made this way as that way, but "that way" costs more development time and might make the drive come out a little slower in benchmarks. Some purchase decisions depend on claiming it's made "that way". The manufacturer can make it "that way", or just say it is but not make it so. Which is more likely?

crash vs. power drop

Posted Sep 14, 2009 16:12 UTC (Mon) by nye (guest, #51576) [Link]

A fair point, but in all things there's a strong economic incentive to claim that a product does something it doesn't if that will make more people buy it, and yet most things don't claim to do something which is simply factually untrue.

Usually when it's a feature that either works or doesn't - so it's not a subjective measurement - it's likely that a product does at least technically do what it says it does.

Presumably the argument is that the manufacturers aren't specifically claiming a particular feature; the disk just behaves in a particular way that happens not to be what the user expected - so they're not technically lying. That does rather weaken the idea that they're doing it to improve the chances of people buying the drive, though, if they're not stating it as a feature.

Just out of interest, I've just spent a while trying to work out the difference between the Samsung HE502IJ and HD502IJ - two drives which are identical on paper, but one is sold as 'RAID-class'. Neither is even remotely expensive enough not to lie about its actions, so what's the difference? Well, some forum post claims that one has a 'rotational vibration sensor', whatever that means.

In conclusion, people who try to sell you things are all liars and cheats, and I intend to grow a beard and live out the rest of my days as a hermit, never having to worry about these things again. Perhaps I shall raise yaks.

crash vs. power drop

Posted Sep 15, 2009 12:52 UTC (Tue) by Cato (subscriber, #7643) [Link]

I believe that TLER (time-limited error recovery) is one characteristic of enterprise drives - it simply means fewer retries, so that the RAID controller or OS sees a drive failure more quickly and knows to use redundancy to complete the I/O.

crash vs. power drop

Posted Sep 15, 2009 12:49 UTC (Tue) by Cato (subscriber, #7643) [Link]

Maybe the best way round this is to make sure that performance benchmarks always include a 'reliability benchmark' that detects drives which are lying about writes to hard disk.

