
Fighting massive data loss bugs

Posted Jul 23, 2009 22:54 UTC (Thu) by Cato (subscriber, #7643)
In reply to: Fighting massive data loss bugs by Baylink
Parent article: Fighting small bugs

I've now had a look at the machine - there are two disks, mostly carrying LVM-managed partitions, and one disk shows classic signs of LVM errors resulting in ext3 corruption (not the one with the root FS, but one holding several logical volumes for local backup). I initially booted SystemRescueCD; the LVM commands showed that the LVM state was quite messed up, and the kernel logged messages like this:

Jul 23 19:06:57 sysresccd attempt to access beyond end of device
Jul 23 19:06:57 sysresccd sda: rw=0, want=198724030, limit=66055248
Jul 23 19:07:20 sysresccd attempt to access beyond end of device
Jul 23 19:07:20 sysresccd sda: rw=0, want=198723798, limit=66055248

One weird thing is that the volume group on the main disk, which hosted the root FS, didn't show any LVM-related errors, yet that was the one with the major filesystem corruption. The backup VG was also badly corrupted, but I wasn't focusing on that. A generally odd thing is that the Ubuntu logs didn't show any errors on either disk (i.e. ext3 or LVM type errors) apart from the 'FS remounted' one, yet SystemRescueCD showed them right away.

Another weird thing is that, despite the root FS being remounted read-only, the logs in /var were still being written to for 10 days after the first root corruption - surely this is a bug, as continuing to write can only increase FS corruption.

I haven't yet run a memory test but the system doesn't show any other signs of bad RAM such as randomly crashing applications. The logs also don't show any disk hardware errors.

Anyway, the lesson is simple: never, ever use LVM again. Gparted is pretty good these days for resizing/moving partitions, and the time I have saved on LVM is far less than the hassle of this recovery exercise.

Sorry for going so far off topic, but perhaps LWN would like to write a piece on data loss bugs and how best the community should address them - maybe starting with LVM...



Fighting massive data loss bugs

Posted Jul 23, 2009 22:56 UTC (Thu) by dlang (subscriber, #313) [Link]

is /var a separate mount? if so it should keep going even if / gets remounted ro

if the underlying device becomes ro, the OS can buffer writes in ram that it wants to get to the filesystem, but can't because it's ro.

this causes more lost data, but not more corruption.

Fighting massive data loss bugs

Posted Jul 23, 2009 23:27 UTC (Thu) by Cato (subscriber, #7643) [Link]

In this case there was no separate /var FS, and the updates to /var/log/messages persisted across at least one reboot. So somehow the root FS was mounted read-only (attempts to write to files gave errors, so it really was read-only), yet the log files on /var were still being updated ...
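For anyone chasing the same symptom: the kernel records the ro/rw state of each mount in /proc/mounts, so a quick check is to see which mount a path's writes actually land on and what flag that mount carries. A minimal sketch (the `mountflags` helper name is mine, not a standard tool):

```shell
# Hedged sketch: print the ro/rw flag the kernel holds for a given
# mount point, as recorded in /proc/mounts. A path under /var can stay
# writable while / is read-only only if it sits on a different mount;
# if it doesn't, writes still reaching disk suggest the remount never
# fully took effect.
mountflags() {
    # "$1" must be an exact mount point as listed in /proc/mounts.
    awk -v p="$1" '
        $2 == p { split($4, f, ","); r = $2 " " f[1] }
        END { if (r) print r }' /proc/mounts
}

mountflags /    # e.g. "/ rw"
```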

Fighting massive data loss bugs

Posted Jul 25, 2009 20:22 UTC (Sat) by Cato (subscriber, #7643) [Link]

Now that I've rebuilt the PC... I actually lost the contents of two filesystems (root and backup), each on a separate physical disk, and each hosted on LVM logical volumes within separate volume groups. There weren't any physical disk errors, or noticeable errors relating to PATA/SATA cables, memory, etc. The only common factor is that both VGs were, well, handled by LVM. It's also suspicious that the only filesystems to survive were one LVM-hosted FS and all of the non-LVM ones.

I suspect that some combination of disk write caching plus LVM and possibly ext3 caused these problems. At least some of the problem was purely at the LVM level, since I couldn't even access the VGs on the backup disk, and got LVM errors.

In the hope that it helps someone else:

- To help avoid integrity problems in future, I used the ext3 'data=journal,barrier=1' options in fstab, and also used tune2fs to set the journal_data option on the root FS (the only way that worked for root). I also disabled disk-level write caching with 'hdparm -W0 /dev/sdX' on both hard disks. This will have some performance cost, but this PC is ridiculously fast for light email and web surfing anyway.

- I've dropped SpiderOak for online backup - it failed to back up most of the files (on two PCs, in different ways), generated a corrupt ZIP file when recovering some files via the web interface, got stuck in the GUI client while restoring files, and generally makes it hard to track backups/restores.

- I have implemented local backups with rsnapshot, which is really outstanding for multi-version, rsync-based backups, and will extend this to online backups, possibly using DAR to encrypt and compress for remote storage.

- Sbackup (Simple Backup) is great for really quick backup setup (literally 2 minutes to install, configure and have first backup running), but I wouldn't rely on that alone.
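For reference, the hardening steps above might look roughly like this. Treat it as notes rather than a runnable script: the device and volume-group names are placeholders for this machine's layout, and the commands need root on a real system.

```shell
# Sketch of the hardening described above; "vg0/root", /dev/sda and
# /dev/sdb are placeholder names for this machine's layout.

# /etc/fstab entry for the root LV: full data journalling plus write
# barriers (the ext3 option is spelled "barrier", singular):
#
#   /dev/vg0/root  /  ext3  data=journal,barrier=1  0  1

# Make data=journal a default mount option on the root FS itself, so
# it applies even before fstab is parsed:
tune2fs -o journal_data /dev/vg0/root

# Disable the drives' on-board write caches:
hdparm -W0 /dev/sda
hdparm -W0 /dev/sdb
```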

Also, if you haven't used etckeeper before, it's worth a try - version control for the whole of /etc using git, hg, bzr, or darcs, and also tracks APT package installs that generate /etc changes. Great if you need to replicate some or all of the setup at a later date.
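If you're curious what etckeeper does under the hood, it is essentially a thin wrapper that keeps /etc in an ordinary VCS and snapshots it around changes. The core idea can be sketched with plain git in a scratch directory (paths, file contents, and commit messages here are invented for illustration):

```shell
# Minimal sketch of the idea behind etckeeper: keep a config tree in
# git and snapshot it before and after changes. Uses a scratch
# directory instead of the real /etc; file contents are invented.
dir=$(mktemp -d)
cd "$dir" || exit 1
git init -q .
git config user.email demo@example.invalid
git config user.name "demo"

echo "PermitRootLogin no" > sshd_config
git add -A
git commit -qm "initial snapshot of config tree"

echo "PermitRootLogin prohibit-password" > sshd_config
git add -A
git commit -qm "tighten sshd config"

git log --oneline    # shows the two snapshots
```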


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds