I caught the bad RAM pretty quickly, and considered myself lucky that I
hadn't obviously lost any big chunks of data... I had seen the ReiserFS
journal check making some noise in dmesg but everything seemed to work.
However, replacing the RAM didn't solve the stability problem. The nature
of the problem changed... it became a random system freeze. At the time, I
didn't realize that I had a new problem - hidden filesystem corruption.
After a couple of big scares with "md" after the system had randomly
frozen, I made a full backup of the filesystem. I continued using the
computer, but the stability problem seemed to be getting worse. I
installed a brand new monster power supply and over the course of the next
month or two I burned a lot of money replacing the rest of the system,
thoroughly confused that I hadn't nailed the problem. (Maddeningly,
replacing a part often seemed to make the problem go away for a day or
two, leading me to believe I'd fixed it, until it slapped me in the face
in the middle of my work yet again.)
My full filesystem backup came in handy when, one day, I was unable to
bring the filesystem online. reiserfsck made lots of noise about problems
with my data and was unable to repair it. I was frustrated to have lost a
month's worth of data, but thrilled that I had a backup at all.
Sadly, I lost the filesystem a few more times and burned even more time
and money on the computer before I realized that, with all of the
hardware replaced, I needed to consider what I had dismissed as the
unlikely cause: the software. I became suspicious of reiserfs. This time,
rather than restoring again from my old reiserfs image, I made an ext3
partition, mounted the reiserfs image read-only and migrated.
My system never froze again.
I don't know enough about the reiserfs design to know how plausible my
hypothesis is, but it seems that the bad RAM I dealt with a long time ago
had led to a reiserfs filesystem which was "doomed". I assume the bad RAM
provided the initial corruption, some sort of corruption that made the
reiserfs kernel code fall on its face. Sometimes, the system accessed the
"wrong" bit of corrupted data and the kernel would panic or hang somewhere
inside reiserfs, spreading the corruption in the process.
There's a shocking bit of irony in this particular failure mode. Because
the backup I always restored from was a reiserfs image taken with dd,
every restore presumably brought the latent corruption back along with my
data. The only way I was ever going to escape the crashes and repeated
loss of my data was to abandon reiserfs.
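For anyone wondering what "migrating" rather than "restoring the image"
looks like in practice, here is a rough sketch in Python. It is not what I
actually ran; the function names (sha256_of, migrate_tree) are made up for
illustration, and it assumes the old image is already mounted read-only
and the new filesystem is mounted somewhere else. The point is only that
files are read through the filesystem one at a time and written to a
freshly made filesystem, so the old on-disk structures are never copied
the way dd copies them.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_tree(src_root, dst_root):
    """Copy regular files from src_root to dst_root, one file at a time.

    Hypothetical helper: returns a list of (relative_path, reason) for
    files that could not be copied or whose copy differs from the source.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            rel = os.path.normpath(os.path.join(rel_dir, name))
            # Sketch only: skip symlinks, devices, sockets and the like.
            if not os.path.isfile(src) or os.path.islink(src):
                continue
            try:
                shutil.copy2(src, dst)  # copies data plus basic metadata
                if sha256_of(src) != sha256_of(dst):
                    problems.append((rel, "checksum mismatch"))
            except OSError as err:
                problems.append((rel, f"read/write error: {err}"))
    return problems
```

Calling something like migrate_tree("/mnt/old", "/mnt/new") (both paths
are placeholders) returns the list of files that need a second look. A
matching checksum only shows that the copy is faithful to what the old
filesystem handed back; it says nothing about whether that data was
already damaged.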
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds