RAID 6 is slightly less bad. If you want to avoid problems with crashes and outages, you should have multiple hot standbys. If you want performance, you should use RAID 10.
Either way, you should use backups as your data-loss reduction strategy.
Ext3 and RAID: silent data killers?
Posted Sep 1, 2009 7:46 UTC (Tue) by job (guest, #670)
Posted Sep 1, 2009 8:05 UTC (Tue) by drag (subscriber, #31333)
With RAID 5, the amount of time it takes to recover is so long nowadays that the chances of having a double fault are pretty good. It was one thing to have 20GB with 30MB/s performance, but it's quite another to have 1000GB with 50MB/s performance...
Posted Sep 11, 2009 1:18 UTC (Fri) by Pc5Y9sbv (guest, #41328)
My cheap MD RAID5 with three 500 GB SATA drives gives me 1TB and approximately 100 MB/s per-drive throughput, which implies a full scan to re-add a replacement drive might take 2 hours or so (reading all 500 GB from 2 drives and writing 500 GB to the third at 75% of full speed). I have never been in a position where this I/O time was worrisome as far as double-fault hazard goes. Having a commodity box running degraded for several days until replacement parts are delivered is a more common consumer-level concern, and that has not changed with drive sizes.
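The rough arithmetic behind that 2-hour figure can be checked directly. This is just a sketch of the commenter's own estimate; the 75%-of-100 MB/s streaming rate is their assumption, not a measured number:

```shell
# Rebuild time: stream 500 GB at 75% of the ~100 MB/s per-drive throughput.
drive_gb=500
per_drive_mb_s=100
awk -v gb="$drive_gb" -v mbs="$per_drive_mb_s" \
    'BEGIN { printf "%.1f hours\n", (gb * 1000) / (mbs * 0.75) / 3600 }'
# prints "1.9 hours", i.e. roughly the 2 hours estimated above
```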
Posted Sep 3, 2009 5:05 UTC (Thu) by k8to (subscriber, #15413)
Meanwhile, you also get vastly better performance, and higher reliability of implementation.
It's really a no-brainer unless you're poor.
Posted Sep 3, 2009 5:26 UTC (Thu) by dlang (✭ supporter ✭, #313)
In digging further I discovered that the key to performance was to have enough queries in flight to keep all disk heads fully occupied (one outstanding query per drive spindle), and you can do this with both RAID 6 and RAID 10.
Posted Sep 1, 2009 8:11 UTC (Tue) by drag (subscriber, #31333)
RAID = availability/performance
BACKUPS = data protection.
Any other way of looking at it is pretty much doomed to be flawed.
Posted Sep 1, 2009 15:43 UTC (Tue) by Cato (subscriber, #7643)
Posted Sep 1, 2009 16:05 UTC (Tue) by jonabbey (subscriber, #2736)
Posted Sep 1, 2009 16:47 UTC (Tue) by martinfick (subscriber, #4455)
Backups are good for certain limited chores such as backing up your version control system! :) But ONLY if you have a mechanism to verify the sanity of your previous backup and the original before making the next backup. Else, you are back to backing up corrupted data.
A good version control system protects you from corruption and accidental deletion since you can always go to an older version. And the backup system with checksums (often built into VCS) should protect the version control system.
If you don't have space to version-control your data, you likely don't really have space to back it up either, so do not accept that as an excuse to back it up instead of putting it under version control.
Posted Sep 1, 2009 17:44 UTC (Tue) by Cato (subscriber, #7643)
rsnapshot is pretty good as a 'sort of' version control system for any type of file, including binaries. It doesn't do any compression, just rsync plus hard links, but works very well within its design limits. It can back up filesystems including hard links (use rsync -avH in the config file), and is focused on 'pull' backups, i.e. the backup server ssh's into the server to be backed up. It's used by some web hosting providers who back up tens of millions of files every few hours, with scans taking a surprisingly short time due to the efficiency of rsync. Generally rsnapshot is best if you have a lot of disk space available and not much time to run the backups in.
rdiff-backup may be closer to what you are thinking of - unlike rsnapshot it only stores the deltas between versions of a file, and stores permissions etc as metadata (so you don't have to have root on the box being backed up to rsync arbitrary files). It's a bit slower than rsnapshot but a lot of people like it. It does include checksums which is a very attractive feature.
duplicity is somewhat like rsnapshot, but can also do encryption, so it's more suitable for backup to a system you don't control.
There are a lot of these tools around, based on Mike Rubel's original ideas, but these ones seem the most actively discussed.
For a non-rsync backup, dar is excellent but not widely mentioned - it includes per-block encryption and compression, and per-file checksums, and is generally much faster for recovery than tar, where you must read through the whole archive to recover.
rdiff-backup, like VCS tools, will have difficulty with files of 500 MB or more - it's been reported that such files don't get backed up, or are not delta'ed. Very large files that change frequently (databases, VM images, etc) are a problem for all these tools.
Posted Sep 1, 2009 17:55 UTC (Tue) by dlang (✭ supporter ✭, #313)
there are lots of things that can happen to your computer (including your house burning down) that will destroy everything on it.
no matter how much protection you put into your storage system, you still need backups.
Posted Sep 1, 2009 18:05 UTC (Tue) by martinfick (subscriber, #4455)
Thus, locality is unrelated to whether you are using backups or version control. Yes, it is better to put it on another computer, or at least another physical device. But this is in no way an argument for using backups instead of version control.
Posted Sep 1, 2009 18:05 UTC (Tue) by joey (subscriber, #328)
I'd agree, but you may not have memory to VCS your data. Git, in particular, scales memory usage badly with large data files.
Posted Sep 1, 2009 18:16 UTC (Tue) by martinfick (subscriber, #4455)
Posted Sep 2, 2009 0:39 UTC (Wed) by drag (subscriber, #31333)
If you're using version control for backups, then that is your backup.
There is no difference, so your sentence does not really make sense.
My favorite form of backup is to use Git to sync data on geographically
disparate machines. But this is only suitable for text data. If I have to
back up photographs, then source code management systems are utter shit.
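A minimal sketch of that Git-as-backup pattern, with a local bare repository standing in for the geographically distant machine. All paths and the branch name "backup" are hypothetical placeholders:

```shell
# Sketch: commit text data locally, then push it to a second "machine"
# (here a bare repo on the same box; in practice an ssh remote).
set -e
rm -rf /tmp/demo-repo /tmp/demo-remote.git
git init -q /tmp/demo-repo
echo "notes" > /tmp/demo-repo/notes.txt
git -C /tmp/demo-repo add notes.txt
git -C /tmp/demo-repo -c user.email=me@example.com -c user.name=me \
    commit -q -m "snapshot"
git init -q --bare /tmp/demo-remote.git
# Push the current commit regardless of the local default branch name.
git -C /tmp/demo-repo push -q /tmp/demo-remote.git HEAD:refs/heads/backup
```

Every sync is a commit, so the "backup" also carries full history, which is exactly the property the parent comments argue for.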
> Backups are horrible to recover from.
They are only horrible to recover with if the backup was done poorly. If
you (or anybody else) do a shitty job of setting them up, then it's your
(or their) fault they are difficult.
Backing up is a concept.
Anyway, it's much more horrible to recover data that has ceased to exist.
> Backups provide no sane automatable mechanism for pruning older data
> (backups) that doesn't suffer from the same corruption/accidental deletion
> problem that originals have, but worse, amplified since they don't even
> have a good mechanism for sanity checking (usage)! Backups tend to backup
> corrupted data without complaining.
You're doing it wrong.
The best form of backup is to make full backups to multiple places. Ideally they
should be in a different region. You don't go back and prune data or clean
them up. That's WRONG. Incremental backups are only useful to reduce the
amount of data loss between full backups. A full copy of _EVERYTHING_ is a
requirement, and you save it for as long as that data is valuable.
It depends on what you're doing, but an ideal setup would be like this:
* On-site full backups every weekend, stored for a few months.
* Incremental backups twice a day, resetting at the weekend with the full backup.
* Every month, 2 full backups are stored for 2-3 years.
* Off-site backups once a month, stored for 5 years.
That would probably be a good idea for most small/medium businesses.
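That cadence could be sketched as a crontab along these lines. The backup-full and backup-incr script names and all paths are hypothetical placeholders, not real tools:

```
# m  h   dom mon dow  command
0  2   *   *   0    backup-full /srv/data /backup/onsite    # weekly on-site full (Sunday)
0  6,18 *  *   1-6  backup-incr /srv/data /backup/onsite    # twice-daily incrementals
0  3   1   *   *    backup-full /srv/data /backup/offsite   # monthly off-site full
```

Retention (months on-site, years for the monthly and off-site sets) would be handled by whatever rotates the destination directories, not by cron itself.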
If you're relying on a single server or a single datacenter to store your data
reliably, then you're a fool. I don't give a shit how high-quality your
server hardware is, or your file system, or anything else. A single fire, vandalism,
hardware failure, disaster, sabotage, or any number of things can utterly destroy it.
Posted Sep 3, 2009 7:51 UTC (Thu) by Cato (subscriber, #7643)
Posted Sep 3, 2009 5:06 UTC (Thu) by k8to (subscriber, #15413)
Posted Sep 4, 2009 10:38 UTC (Fri) by nix (subscriber, #2304)
Furthermore, reliability is fine *if* you can be sure that once RAID parity computations have happened the stripe will always hit the disk, even if there is a power failure. With battery-backed RAID, this is going to be true (modulo RAID controller card failure or a failure of the drive you're writing to). Obviously if the array is sufficiently degraded reliability isn't going to be good anymore, but doesn't everyone know that?
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds