Couple of clarifications
Posted Jan 13, 2007 6:11 UTC (Sat) by Nick
Parent article: A nasty file corruption bug - fixed
The article is quite good, but there may have been one thing unclear or not exactly right (to me, at least).
Firstly, there was no bug in 2.6.18 or earlier. Two bugs were introduced with the dirty shared mmap accounting patches: one was that pte dirty information would be thrown away, the other was removal of some vital lock coverage that exposed a race.
Secondly, the actual problem was not IO started before set_page_dirty() being called. As other people have noted, the buffers will be marked clean _before_ the IO starts, and set_page_dirty will redirty all buffers including the ones currently under IO.
The main problem was very simple: ptes were getting their dirty bits cleared without transferring that dirtyness into the page. Now this *appeared* to be safe, because that was only happening when we wanted to clean the dirty information, before starting page writeback. Now if the filesystem had previously cleaned some buffers, many filesystems will not write them out again when doing this page writeback.
Data is lost when the memory represented by one of these "clean" buffers has actually been modified via this pte.
Before the page dirty tracking patches went in, such a situation would also see the writeout of such a buffer to be skipped (because the dirty state is only in the pte, and not known to the pagecache). The difference is: that dirty info in the pte does not get chucked away -- it will get transferred to the page (and its buffers) either with msync, or when that memory is unmapped (munmap or exit).
Was that even slightly understandable or helpful? :)
to post comments)