Fighting massive data loss bugs
Posted Jul 23, 2009 6:59 UTC (Thu) by Cato (guest, #7643)
Parent article: Fighting small bugs
A relative's PC runs Ubuntu 8.04 (stable version). A couple of weeks ago the root FS, which uses ext3 on LVM, became corrupted. The kernel remounted the FS read-only, but without notifying the user. I noticed this only recently, so I did a remote login - there weren't any hardware or kernel errors in the logs, only the remount message. The system was still usable and bootable at that point.
I ran e2fsck to fix the block device (FS was still read-only) - thousands of errors were found. One of these must have been in a key library, so that executing any command failed, although most files were still there. I now have to spend at least a day driving over there, re-installing Ubuntu, recovering from backups, etc.
This PC was built a year ago, with only high quality components (robust PSU, good motherboard capacitors, etc) and a conservative setup, including a UPS, and I chose Linux largely so I could maintain it remotely without hassles. Clearly this has not worked...
There are several things wrong here:
- the data loss bug itself - given the reports on Launchpad I strongly suspect a kernel or e2fsck bug, probably the latter. This is the second time I've had data loss due to an ext3 corruption without hardware errors - both times on LVM, perhaps that's a factor. I don't think I've ever lost a whole filesystem on Windows with FAT or NTFS other than with hardware errors.
- the fact that a stable version of a major Linux distro can have such major data loss problems over a year after its release
- (somewhat on topic) lack of an unmissable and persistent notification to the user (or ideally a remote administrator) that a significant error has happened (the kernel noticing FS corruption and remounting the FS read-only). In fact, https://bugs.launchpad.net/bugs/28622, which covers this, could be called a 'papercut' bug if the consequences weren't so serious.
- the lack of really good, low-cost online remote backup for Linux - I use SpiderOak, which worked well in this case and which I find better than JungleDisk or Dropbox, but on another PC it has silently not done any backups for over a month.
- lack of continuous fscks (with meaningful notification to user or administrator) for PCs that are left switched on most of the time.
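To illustrate the notification point above, here is a rough sketch of the kind of check I'd like to see run periodically. It assumes the usual /proc/mounts layout (device, mount point, FS type, comma-separated options) and that a filesystem the kernel has remounted read-only shows "ro" in its options there. The function name and the sample text are made up for illustration; this is not how Ubuntu or any existing tool does it.

```python
from typing import Iterable, List


def read_only_mounts(mounts_text: str,
                     watched_types: Iterable[str] = ("ext3",)) -> List[str]:
    """Return mount points of watched filesystem types mounted read-only.

    mounts_text is expected to look like the contents of /proc/mounts:
    one mount per line, whitespace-separated fields, options in field 4.
    """
    watched = set(watched_types)
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # "ro" must be a whole option, not a substring of e.g. "errors=remount-ro"
        if fs_type in watched and "ro" in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical sample, shaped like /proc/mounts after an error remount
SAMPLE = (
    "/dev/mapper/vg-root / ext3 ro,errors=remount-ro,data=ordered 0 0\n"
    "/dev/sda1 /boot ext3 rw,data=ordered 0 0\n"
    "proc /proc proc rw 0 0\n"
)

for mount_point in read_only_mounts(SAMPLE):
    print(f"WARNING: {mount_point} is mounted read-only")
```

In practice one would feed it the real contents of /proc/mounts from a cron job and mail any warnings to whoever administers the machine. A filesystem that is deliberately mounted read-only would also be flagged, so the list of watched mounts would need adjusting for that.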
I use Linux and ext3 because I like reliability - I'm really stunned to find a massive data loss bug like this in 2009 on such a mature filesystem in a stable distro. Obviously such bugs are hard to reproduce but they are reported a lot.
Ubuntu 9.04 Jaunty apparently has a post-release kernel update that *introduces* a new ext3 data loss bug, yet this update has not been pulled... https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691 has the details. I really like Ubuntu but the non-handling of this data loss regression is rather horrifying - simply removing the updated kernel would be enough. I just installed Ubuntu 9.04 on a friend's PC last weekend - fortunately it was only as a recovery OS alongside Windows, in light of this.
