A nasty file corruption bug - fixed
Posted Jan 2, 2007 4:51 UTC (Tue) by iabervon (subscriber, #722)
Parent article: A nasty file corruption bug - fixed
This, of course, leaves out three-quarters of the story, in which quite a number of people, including Linus, found things which were confusing or were actual bugs, but weren't the real issue. There was quite a bit of argument about whether dirty bits on pages or page tables were getting lost in complicated situations in the VM (including Linus finding something that probably was a bug, and probably would cause the right sort of corruption, but fixing it didn't solve the problem), but that turned out not to be the issue at all.
I'm not sure I completely follow what was going on, but I think it's a bit more subtle than the article concludes. If the PTE is already dirty, further writes don't lead to set_page_dirty() being called. But the buffer heads may be cleaned by the filesystem after the PTE is initially marked dirty and before later writes. Then, when the page is finally written out, the buffer heads are already marked clean, so they're skipped. Linus finally found that, when the bug triggered, the kernel was deciding to write out the page, at a point where there was no activity, and then doing nothing because all of the buffer heads were clean.
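To make that sequence concrete, here is a toy Python model of it. This is only a sketch of the ordering as I described it above, not kernel code: `ToyPage`, `mmap_write`, `fs_clean_buffers`, `writeout_page` and `run_scenario` are all made-up names, and the model assumes that the first write to a clean PTE dirties every buffer head on the page.

```python
class ToyPage:
    """Toy model of one file-backed, mmap'ed page (hypothetical, not kernel code)."""

    def __init__(self, nbuffers=4):
        self.memory = [0] * nbuffers             # what the process sees
        self.disk = [0] * nbuffers               # what is on disk
        self.pte_dirty = False                   # dirty bit in the page table entry
        self.buffer_dirty = [False] * nbuffers   # one flag per buffer head

    def mmap_write(self, index, value):
        """A store through the mapping."""
        self.memory[index] = value
        if not self.pte_dirty:
            # Only the first write to a clean PTE reaches the dirtying path,
            # which (by assumption here) marks all buffer heads dirty.
            self.pte_dirty = True
            self.buffer_dirty = [True] * len(self.buffer_dirty)
        # If the PTE was already dirty, nothing else happens.

    def fs_clean_buffers(self):
        """The filesystem writes out dirty buffers on its own; the PTE is untouched."""
        for i, dirty in enumerate(self.buffer_dirty):
            if dirty:
                self.disk[i] = self.memory[i]
                self.buffer_dirty[i] = False

    def writeout_page(self):
        """Page writeout: clear the PTE dirty bit, write only buffers still dirty."""
        self.pte_dirty = False
        written = 0
        for i, dirty in enumerate(self.buffer_dirty):
            if dirty:
                self.disk[i] = self.memory[i]
                self.buffer_dirty[i] = False
                written += 1
        return written


def run_scenario():
    page = ToyPage()
    page.mmap_write(0, 1)      # first write: PTE and buffer heads become dirty
    page.fs_clean_buffers()    # filesystem cleans the buffer heads
    page.mmap_write(0, 2)      # PTE already dirty, so buffers stay clean
    written = page.writeout_page()
    return page, written
```

In this model `run_scenario()` ends with the page written out "successfully" while zero buffers were actually written, so memory holds the second write and the disk still holds the first one.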
(Linus had previously thought the issue was that, somewhere, a dirty bit was getting cleared when I/O was completed rather than when I/O started. If you clear the dirty bit when I/O is completed, you'd lose any writes which happen during I/O. But he couldn't find anywhere this was happening, because the real issue was different.)
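The difference between the two orderings in that earlier theory can be shown with an equally small sketch. Again, this is a hypothetical illustration (`simulate_io` is a made-up name), with a single boolean standing in for the dirty bit and one write arriving while the I/O is in flight.

```python
def simulate_io(clear_dirty_at):
    """Return True if a write made during I/O is still marked dirty afterwards.

    clear_dirty_at is "start" or "completion" (hypothetical toy parameter).
    """
    dirty = True                       # page was modified before writeback begins

    if clear_dirty_at == "start":
        dirty = False                  # bit cleared as the I/O is submitted

    # I/O is in flight; a new write arrives and sets the dirty bit again.
    dirty = True

    if clear_dirty_at == "completion":
        dirty = False                  # bit cleared when the I/O finishes

    return dirty
```

With `"start"` the function returns `True`, so the later write is still tracked and will be written out again. With `"completion"` it returns `False`, which is the lost-write case.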
