Fighting massive data loss bugs

Posted Jul 23, 2009 6:59 UTC (Thu) by Cato (guest, #7643)
Parent article: Fighting small bugs

This is important stuff, and well worth supporting. Right now, however, I'm far more concerned with the really huge bugs - the ones that lose large amounts of data. Here's a recent example...

A relative's PC runs Ubuntu 8.04 (stable version) - a couple of weeks ago there was a filesystem corruption on the root FS, which uses ext3 on LVM. The kernel remounted the FS as readonly, but without notifying the user. I noticed this only recently, so I did a remote login - there weren't any hardware or kernel errors in the logs, only the remount message. The system was still usable and bootable at that point.

I ran e2fsck to fix the block device (FS was still read-only) - thousands of errors were found. One of these must have been in a key library, so that executing any command failed, although most files were still there. I now have to spend at least a day driving over there, re-installing Ubuntu, recovering from backups, etc.

This PC was built a year ago, with only high quality components (robust PSU, good motherboard capacitors, etc) and a conservative setup, including a UPS, and I chose Linux largely so I could maintain it remotely without hassles. Clearly this has not worked...

There are several things wrong here:

- the data loss bug itself - given the reports on Launchpad I strongly suspect a kernel or e2fsck bug, probably the latter. This is the second time I've had data loss due to an ext3 corruption without hardware errors - both times on LVM, perhaps that's a factor. I don't think I've ever lost a whole filesystem on Windows with FAT or NTFS other than with hardware errors.

- the fact that a stable version of a major Linux distro can have such major data loss problems over a year after its release

- (somewhat on topic) lack of an unmissable and persistent notification to the user (or ideally a remote administrator) that a significant error has happened (kernel noticing FS corruption and remounting the FS read-only. In fact, https://bugs.launchpad.net/bugs/28622 which covers this could be called a 'papercut' bug if the consequences weren't more serious

- the lack of really good and low cost online remote backup for Linux - I use SpiderOak which worked well in this case, and I find better than JungleDisk or Dropbox, but on another PC it has silently not done any backups for over a month.

- lack of continuous fscks (with meaningful notification to user or administrator) for PCs that are left switched on most of the time.

I use Linux and ext3 because I like reliability - I'm really stunned to find a massive data loss bug like this in 2009 on such a mature filesystem in a stable distro. Obviously such bugs are hard to reproduce but they are reported a lot.

Ubuntu 9.04 Jaunty apparently has a post-release kernel update that *introduces* a new ext3 data loss bug, yet this update has not been pulled... https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691 has the details. I really like Ubuntu but the non-handling of this data loss regression is rather horrifying - simply removing the updated kernel would be enough. I just installed Ubuntu 9.04 on a friend's PC last weekend - fortunately it was only as a recovery OS alongside Windows, in light of this.

Fighting massive data loss bugs

Posted Jul 23, 2009 7:16 UTC (Thu) by Cato (guest, #7643) [Link]

Correction: the Ubuntu Jaunty data loss bug mentioned in last paragraph is not ext3 related - most likely it's in the ata_piix module.

Fighting massive data loss bugs

Posted Jul 23, 2009 11:06 UTC (Thu) by mjthayer (guest, #39183) [Link] (1 responses)

Quite agree, although I don't see a contradiction (in my own experience, the "paper cut" sort of bugs are the ones that you fix while you are taking a break from tracking down the serious ones). My own priority here would be, on the one hand bugs with serious consequences (see your example), and on the other, easily fixed bugs which have a reasonable chance of being noticed by more than one or two people. Again, my experience is that you can often quickly tell the last category when a group of users start collaborating over the bugtracker and successfully isolate the bug enough to make it trivial to fix.

Fighting massive data loss bugs

Posted Jul 23, 2009 11:09 UTC (Thu) by mjthayer (guest, #39183) [Link]

Slightly (but only slightly) off-topic, I feel that it is a shame that people suggesting ways non-programmers can contribute to free software don't emphasise this sort of collaboration enough. It can be done with minimal (though not non-existent) technical skills and no knowledge of programming, and I'm pretty sure that seeing bugs fixed as a result of this sort of work is the sort of rewarding experience that makes people want to use free software.

Fighting massive data loss bugs

Posted Jul 23, 2009 14:04 UTC (Thu) by Baylink (guest, #755) [Link] (7 responses)

I could ask why the root FS was on an LVM, but I probably shouldn't. :-)

Fighting massive data loss bugs

Posted Jul 23, 2009 16:55 UTC (Thu) by Cato (guest, #7643) [Link] (6 responses)

Fair point, but isn't LVM supposed to be production quality these days? This particular corruption doesn't have the indicators of LVM being involved (such as writing beyond the end of the volume.) Mostly I thought by avoiding RAID and LVM snapshots it was safe enough. However I will remove LVM on the recovered machine, mostly so I can enable ext3 barriers which LVM doesn't permit.

Most of the FS corruption reports I've seen don't mention LVM, so I suspect a bug elsewhere. One report had repeatable corruption on VMware VMs with Ubuntu 8.04 guests, for example, and other corruption reports are on ext3, XFS, JFS, etc, with common factor being an Intel chipset. Quite a few are probably due to dodgy hardware, which makes this hard to pin down. In fact for all I know this is due to bad RAM or a hard disk problem that doesn't appear in the system logs.

Assuming LVM is stable, it's quite easy to recover an LVM machine these days - Knoppix, SystemRescueCD and others support LVM. It's only if there's disk corruption that LVM makes things harder, but that's what backups are for.

I am discovering that SpiderOak is not as good at recovery as it should be - client doesn't work on one machine, and the web download feature generates an invalid ZIP for one directory... So I'm open to recommendations for inexpensive online backup for Linux machines that don't involve rolling your own (I've already done that and want something easier to maintain - but I may go with rsnapshot in future just to avoid the hassles of backup services that don't quite work the way they should).

Fighting massive data loss bugs

Posted Jul 23, 2009 18:07 UTC (Thu) by Baylink (guest, #755) [Link] (4 responses)

I suppose it is, but personally, the only unusual thing I want / and /boot on is "real" hardware RAID 1. When things start to go to hell, the fewer distractions you have, the better off you are.

I'm a sysadmin; I'm paid to be paranoid.

Fighting massive data loss bugs

Posted Jul 23, 2009 22:54 UTC (Thu) by Cato (guest, #7643) [Link] (3 responses)

I've now had a look at the machine - there are two disks which have mostly LVMed partitions, and one disk is showing classic signs of LVM errors resulting in ext3 corruption (not the one with the root FS, but one with several LVMs for local backup). I used SystemRescueCD initially and the LVM commands showed the LVM state was quite messed up, plus log messages like this:

Jul 23 19:06:57 sysresccd attempt to access beyond end of device
Jul 23 19:06:57 sysresccd sda: rw=0, want=198724030, limit=66055248
Jul 23 19:07:20 sysresccd attempt to access beyond end of device
Jul 23 19:07:20 sysresccd sda: rw=0, want=198723798, limit=66055248

One weird thing is that the LVM on the main disk that hosted the root FS didn't show any LVM related errors, but that was the one with the major corruption. Of course the backup LVM had major corruption but I wasn't focusing on that. In fact a generally odd thing is that the Ubuntu logs didn't show any errors on either disk (i.e. ext3 or LVM type errors), apart from the 'FS remounted' one, yet SystemRescueCD showed them right away.

Another weird thing is that despite the root FS being remounted read-only, the logs in /var were still being written to for 10 days after the first root corruption - surely this is a bug as it can only increase FS corruption.

I haven't yet run a memory test but the system doesn't show any other signs of bad RAM such as randomly crashing applications. The logs also don't show any disk hardware errors.

Anyway, the lesson is simple: never, ever use LVM again. Gparted is pretty good these days for resizing/moving partitions, and the time I have saved on LVM is far less than the hassle of this recovery exercise.

Sorry for going so far off topic, but perhaps LWN would like to write a piece on data loss bugs and how best the community should address them - maybe starting with LVM...

Fighting massive data loss bugs

Posted Jul 23, 2009 22:56 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

is /var a separate mount? if so it should keep going even if / gets remounted ro

if the underlying device becomes ro, the OS can buffer writes in ram that it wants to get to the filesystem, but can't because it's ro.

this causes more lost data, but not more corruption.

Fighting massive data loss bugs

Posted Jul 23, 2009 23:27 UTC (Thu) by Cato (guest, #7643) [Link]

In this case there was no separate /var FS, and the updates to /var/log/messages have persisted across at least one reboot. So somehow the root FS was mounted readonly (I got errors on trying to write to files so it really was readonly), yet the log files were being updated on /var ...

Fighting massive data loss bugs

Posted Jul 25, 2009 20:22 UTC (Sat) by Cato (guest, #7643) [Link]

Now that I've rebuilt the PC... I actually lost the contents of two filesystems (root and backup), each on a separate physical disk, and hosted on LVM logical volumes within separate volume groups. There weren't any physical disk errors, or noticeable errors relating to PATA/SATA cables, memory, etc. The only common factor is that both LVM VGs were, well, handled by LVM. It's also suspicious that only one LVM FS was uncorrupted, plus all of the non-LVM FSs.

I suspect that some combination of disk write caching plus LVM and possibly ext3 caused these problems. At least some of the problem was purely at the LVM level, since I couldn't even access the VGs on the backup disk, and got LVM errors.

In the hope that it helps someone else:

- To help avoid integrity problems in future, I used the ext3 'data=journal,barriers=1' options in fstab, and also used tune2fs to set the journal_data option on the root FS (only way that worked for root). I also disabled disk level write caching with "hdparm -W0 /dev/sdX' on both hard disks. This will have some performance cost but this PC is ridiculously fast for light email and web surfing anyway.

- I've dropped SpiderOak for online backup - it didn't back up most of the files (on two PCs, in different ways), generated a corrupt ZIP file on recovering some files via web interface, and the GUI client got stuck recovering files, and generally makes it hard to track backups/restores.

- I have implemented local backups with rsnapshot, which is really outstanding for multi-verson rsync based, and will extend this for online backups, possibly using DAR to encrypt and compress for remote backups.

- Sbackup (Simple Backup) is great for really quick backup setup (literally 2 minutes to install, configure and have first backup running), but I wouldn't rely on that alone.

Also, if you haven't used etckeeper before, it's worth a try - version control for the whole of /etc using git, hg, bzr, or darcs, and also tracks APT package installs that generate /etc changes. Great if you need to replicate some or all of the setup at a later date.

Fighting massive data loss bugs

Posted Jul 23, 2009 21:42 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

My #1 guess would be failing RAM. DIMMs don't often go bad, but it does happen and there is nothing in most PCs that will detect it, you just start to see the wrong bits, and of course most of those bits are either coming from or going to files, so it's easy to blame the filesystem.

Filesystem bugs are like any other bugs, they tend to be repeatable, they do something stupid and wrong but not entirely ridiculous (e.g. they don't flip a few bits in the middle of a file, but overwrite an entire block with something else) and so on. If you see weird problems, and especially if you see problems that don't have any clear pattern, that's _much_ more likely to be bad RAM.