
Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 16:47 UTC (Tue) by martinfick (subscriber, #4455)
In reply to: Ext3 and RAID: silent data killers? by drag
Parent article: Ext3 and RAID: silent data killers?

BACKUPS are poor, version control is the only sane backup. Backups are horrible to recover from. Backups provide no sane automatable mechanism for pruning older data (backups) that doesn't suffer from the same corruption/accidental deletion problem that originals have, but worse, amplified since they don't even have a good mechanism for sanity checking (usage)! Backups tend to backup corrupted data without complaining.

Backups are good for certain limited chores such as backing up your version control system! :) But ONLY if you have a mechanism to verify the sanity of your previous backup and the original before making the next backup. Else, you are back to backing up corrupted data.

A good version control system protects you from corruption and accidental deletion since you can always go to an older version. And the backup system with checksums (often built into VCS) should protect the version control system.
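The recovery path being argued for here can be sketched with git. This is only a minimal illustration under assumed names (the repo and file are hypothetical), showing both the roll-back-to-an-older-version step and the built-in checksumming mentioned above:

```shell
# Minimal sketch of VCS-based recovery (hypothetical repo and file names).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

echo "good data" > notes.txt
git add notes.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "known-good version"

# Simulate corruption / accidental overwrite of the working copy...
echo "garbage" > notes.txt

# ...and recover by checking out the committed, known-good version.
git checkout -- notes.txt

# git content-addresses every object by checksum, so corruption of the
# object store itself is detectable:
git fsck --full
```

`git fsck` plays the "sanity checking" role the comment says plain backups lack.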

If you don't have space for VCing your data you don't likely really have space for backing it up either, so do not accept this as an excuse to not vcs your data instead of backing it up.



Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 17:44 UTC (Tue) by Cato (subscriber, #7643) [Link]

Since I've researched this a lot recently, here are some rsync/librsync-based tools that work somewhat like version control systems but are intended for system backups. They qualify as 'near-CDP' (continuous data protection) since rsync is efficient at scanning for changes.

rsnapshot is pretty good as a 'sort of' version control system for any type of file, including binaries. It doesn't do any compression, just rsync plus hard links, but works very well within its design limits. It can back up filesystems including hard links (use rsync -avH in the config file), and is focused on 'pull' backups, i.e. the backup server ssh's into the server to be backed up. It's used by some web hosting providers who back up tens of millions of files every few hours, with scans taking a surprisingly short time due to the efficiency of rsync. Generally rsnapshot is best if you have a lot of disk space available and not much time to run the backups in.

rdiff-backup may be closer to what you are thinking of - unlike rsnapshot it only stores the deltas between versions of a file, and stores permissions etc. as metadata (so you don't have to have root on the box being backed up to rsync arbitrary files). It's a bit slower than rsnapshot, but a lot of people like it. It does include checksums, which is a very attractive feature.

duplicity is somewhat like rsnapshot, but can also do encryption, so it's more suitable for backup to a system you don't control.

There are a lot of these tools around, based on Mike Rubel's original ideas, but these ones seem the most actively discussed.

For a non-rsync backup, dar is excellent but not widely mentioned - it includes per-block encryption and compression, and per-file checksums, and is generally much faster for recovery than tar, where you must read through the whole archive to recover.

rdiff-backup, like VCS tools, will have difficulty with files of 500 MB or more - it's been reported that such files don't get backed up, or are not delta'ed. Very large files that change frequently (databases, VM images, etc) are a problem for all these tools.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 17:55 UTC (Tue) by dlang (subscriber, #313) [Link]

unless your version control stores your data somewhere other than on your computer, it's a poor substitute for a backup.

there are lots of things that can happen to your computer (including your house burning down) that will destroy everything on it.

no matter how much protection you put into your storage system, you still need backups.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:05 UTC (Tue) by martinfick (subscriber, #4455) [Link]

Local backups suffer from the same problem as local version control.

Thus, locality is unrelated to whether you are using backups or version control. Yes, it is better to put the data on another computer, or at least another physical device. But this is in no way an argument for using backups instead of version control.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:05 UTC (Tue) by joey (subscriber, #328) [Link]

> If you don't have space for VCing your data you don't likely really have
> space for backing it up either, so do not accept this as an excuse to not
> vcs your data instead of backing it up.

I'd agree, but you may not have memory to VCS your data. Git, in particular, scales memory usage badly with large data files.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:16 UTC (Tue) by martinfick (subscriber, #4455) [Link]

If you have disk space, you have memory: it's called swap. Use it appropriately. With terabyte disks at around $60, there is no excuse for lacking either the memory or the space to version-control your data.

Ext3 and RAID: silent data killers?

Posted Sep 2, 2009 0:39 UTC (Wed) by drag (guest, #31333) [Link]

> BACKUPS are poor, version control is the only sane backup.

If you're using version control for backups, then that is your backup. Your
sentence does not really make sense: there is no difference.

My favorite form of backup is to use Git to sync data on geographically
disparate machines. But this is only suitable for text data. If I have to
backup photographs then source code management systems are utter shit.

> Backups are horrible to recover from.

They are only horrible to recover with if the backup was done poorly. If
you (or anybody else) do a shitty job of setting them up, then it's your
(or their) fault they are difficult.

Backing up is a concept.

Anyway, it's much more horrible to recover data that has ceased to
exist.

> Backups provide no sane automatable mechanism for pruning older data
> (backups) that doesn't suffer from the same corruption/accidental deletion
> problem that originals have, but worse, amplified since they don't even
> have a good mechanism for sanity checking (usage)! Backups tend to backup
> corrupted data without complaining.

You're doing it wrong.

The best form of backup is to make full backups to multiple places. Ideally
they should be in a different region. You don't go back and prune data or
clean them up. That's WRONG. Incremental backups are only useful to reduce
the amount of data loss between full backups. A full copy of _EVERYTHING_
is a requirement. And you save it for as long as that data is valuable.
Usually 5 years.

It depends on what you're doing, but an ideal setup would be like this:
* On-site backups every weekend. Full backups. Stored for a few months.
* Incremental backups twice a day, resetting at the weekend with the full
backup.
* Every month 2 full backups are stored for 2-3 years.
* Off-site backups once a month, stored for 5 years.
etc. etc.
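A schedule like this would typically be driven from cron; a sketch, where the backup scripts themselves are hypothetical names standing in for whatever tool you use:

```
# m   h   dom mon dow  command                        (scripts hypothetical)
0     1   *   *   6    /usr/local/sbin/full-backup     # weekly full, Sat 01:00
0   6,18  *   *   *    /usr/local/sbin/incr-backup     # incrementals twice a day
0     3   1   *   *    /usr/local/sbin/offsite-backup  # monthly off-site full
```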

That would probably be a good idea for most small/medium businesses.

If you're relying on a server or a single datacenter to store your data
reliably, then you're a fool. I don't give a shit how high-quality your
server hardware or file system or anything else is. A single fire,
vandalism, hardware failure, disaster, sabotage, or any number of things
can utterly destroy _everything_.

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 7:51 UTC (Thu) by Cato (subscriber, #7643) [Link]

On full backups: one of the nice things about rsnapshot and similar rsync-based tools is that every backup is both a full backup and an incremental backup. Full in that previous backups can be deleted without any effect on this backup (thanks to hard links), and incremental in that the data transfer required is proportional to the specific data blocks that have changed (thanks to rsync).
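That independence of snapshots is easy to demonstrate: with hard links each snapshot is a complete view of the tree, and deleting an older snapshot only decrements link counts (paths here are hypothetical):

```shell
# Each hard-linked snapshot stands alone (hypothetical paths).
set -e
d=$(mktemp -d)
mkdir "$d/snap.1"
echo "payload" > "$d/snap.1/file"

cp -al "$d/snap.1" "$d/snap.0"   # "full" snapshot, but no data copied
# Both snapshots now share one inode (link count 2).

rm -rf "$d/snap.1"               # prune the older snapshot...
cat "$d/snap.0/file"             # ...the newer one is unaffected: payload
```

The data blocks are only freed when the last snapshot referencing them is removed, which is exactly why pruning old rsnapshot directories is safe.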


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds