
A gnarly 2.6.19 file corruption bug

When Linus released 2.6.19, he expressed a certain degree of confidence about its quality:

It's one of those rare "perfect" kernels. So if it doesn't happen to compile with your config (or it does compile, but then does unspeakable acts of perversion with your pet dachshund), you can rest easy knowing that it's all your own d*mn fault, and you should just fix your evil ways.

While this kernel may have lived up to expectations in a number of ways, it would appear that somebody's evil ways have messed things up - and dachshunds would be well advised to keep a low profile. It seems that this kernel can corrupt ext3 filesystems - behavior which was not in the original set of design goals.

The good news (for users) is that the bug is hard to trigger, and that most access patterns work just fine. The bulk of the trouble seems to come with a certain BitTorrent client, which has an unusual access pattern at best. On occasion, part of a page will end up being written as zeroes, from some point through to the end of the page. Please do not expect your editor to explain why this is happening; it seems that nobody really understands that yet. The solution, however, may involve some relatively serious low-level memory management surgery.

The apparent origin of the problem is a change in how dirty pages are tracked in the kernel. Prior to 2.6.19, this information lived in the page tables; the 2.6.19 kernel, however, moves some of this information into the page structure. This change enables better tracking of dirty pages in the system, which is a good thing, but it could also be bringing some old bugs out to play.

Not all of those bugs are necessarily in the kernel; at one point, Linus went off and wrote a demonstration program showing how a buggy program would work with older kernels but get surprising results in 2.6.19. What it comes down to is that if a program maps a file into memory, it cannot put data into that memory beyond the current length of the file and expect that data to make it to disk. It was a nice demonstration, but this behavioral change does not appear to be behind the problem reports.
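The pattern described above can be sketched in a few lines of C. This is not Linus's demonstration program, only a minimal illustration under stated assumptions: the helper name mmap_past_eof_demo(), the 100-byte file length, and the offsets 10 and 200 are all invented for the example. It maps one page of a short file, stores data both inside the file and past its end, and then reports what can be relied upon: the file does not grow, and only the in-bounds store is guaranteed to reach the file. What happens to the bytes stored past end-of-file is exactly the part a program must not depend on.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): returns 0 on success,
 * -1 on failure. On success, *size_out is the file size after the
 * mapping has been torn down and inside_out holds the 5 bytes read
 * back from offset 10 of the file.
 */
int mmap_past_eof_demo(const char *path, long *size_out, char inside_out[5])
{
    const off_t file_len = 100;   /* file is much shorter than a page */
    const size_t inside = 10;     /* offset within the file */
    const size_t past_eof = 200;  /* offset beyond the end of the file */
    long page = sysconf(_SC_PAGESIZE);
    struct stat st;

    /* The demo assumes both stores land within the first page. */
    assert(page > 0 && (size_t)page > past_eof + 5);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, file_len) != 0) {
        close(fd);
        return -1;
    }

    /* Map a whole page, although the file covers only part of it. */
    char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map + inside, "valid", 5);    /* within the file: will be written */
    memcpy(map + past_eof, "lost?", 5);  /* past EOF: no guarantee at all */

    if (msync(map, (size_t)page, MS_SYNC) != 0 ||
        munmap(map, (size_t)page) != 0) {
        close(fd);
        return -1;
    }

    /* Stores through a mapping never extend the file. */
    if (fstat(fd, &st) != 0 ||
        pread(fd, inside_out, 5, (off_t)inside) != 5) {
        close(fd);
        return -1;
    }
    *size_out = (long)st.st_size;
    close(fd);
    return 0;
}
```

A caller would pass a scratch path and then inspect the results; a program that wants the second store to survive must extend the file (with ftruncate() or write()) before touching that part of the mapping.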

Confusion surrounding the propagation and management of the page dirty bits is at the top of the suspect list, as of this writing. Nobody seems to be able to point at anything specific, however, beyond the fact that the code appears to be rather badly messed up. Says Linus:

A lot of this is actually historical cruft. Some of it may even be code that was never supposed to work, but because we maintained _other_ dirty bits in the PTE's, and never touched them before, we never even realized that the code that played with PG_dirty was totally insane.

So the approach being taken by Linus is to rework the dirty page accounting code into something a little more reasonable. To that end, test_clear_page_dirty() is no more, having been pronounced "insane" by Linus. Instead, the new code tries for a better defined sense of when the dirty bit on a page can be cleared; it comes down to either (1) the page is being written to backing store, or (2) the page is no longer relevant (when a file is truncated, for example). In typical fashion, Linus fixed enough to make his own configuration work, leaving the rest as an exercise for the reader.
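As a rough illustration of that rule, here is a toy user-space model, not kernel code. The struct and every function name below are hypothetical, chosen only for this sketch. The point is that the dirty flag has exactly two exits: one taken when the page is handed off for writing, and one taken when the page's contents no longer matter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; these fields are assumptions for the sketch. */
struct toy_page {
    bool dirty;      /* contents differ from backing store */
    bool writeback;  /* a write to backing store is in flight */
};

/* Any modification of the page contents marks it dirty. */
void toy_mark_dirty(struct toy_page *p)
{
    p->dirty = true;
}

/*
 * Case (1): the page is about to be written to backing store.
 * Returns true if the caller must actually perform the write.
 */
bool toy_clear_dirty_for_io(struct toy_page *p)
{
    if (!p->dirty)
        return false;       /* nothing to write */
    p->dirty = false;
    p->writeback = true;    /* the data is now the I/O path's problem */
    return true;
}

/* Completion of the write started by toy_clear_dirty_for_io(). */
void toy_end_writeback(struct toy_page *p)
{
    assert(p->writeback);
    p->writeback = false;
}

/*
 * Case (2): the page is no longer relevant (the file was truncated,
 * for example), so its dirty data can simply be thrown away.
 */
void toy_cancel_dirty(struct toy_page *p)
{
    p->dirty = false;
}
```

There is deliberately no general-purpose "test and clear" operation in this model: a caller has to say which of the two reasons applies, which is the better-defined sense of clearing that the paragraph above describes.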

He makes no claims that this rework will have solved the problem, only that it makes the code more sane than it was before. As of this writing, there have been no responses from the people who are able to reproduce this problem. If the problem goes away - and the developers can convince themselves that it has not just been papered over - then some version of this fix will likely need to be prepared for a 2.6.19 update. Then, maybe, the dachshunds can come out of hiding.



"When Linux released 2.6.19..."

Posted Dec 21, 2006 4:24 UTC (Thu) by PaulDickson (subscriber, #478) [Link]

I think you meant "When Linus released...", unless the code is now releasing itself.

A gnarly 2.6.19 file corruption bug

Posted Dec 21, 2006 9:51 UTC (Thu) by nix (subscriber, #2304) [Link]

The original reporter pretty much provided an object lesson in how to get bugs fixed, as well: countless test runs with slightly tweaked kernels and even one accidental mispatching that provided an important clue.

We need more reporters like that. :)

A gnarly 2.6.19 file corruption bug

Posted Dec 21, 2006 10:26 UTC (Thu) by k8to (subscriber, #15413) [Link]

Which bittorrent client might that be? I don't have a lot of interest in corrupting my filesystem.

A gnarly 2.6.19 file corruption bug

Posted Dec 21, 2006 10:43 UTC (Thu) by cate (subscriber, #1359) [Link]

IIRC from the reporter, the bug corrupts only the downloaded file, and the BitTorrent client detects the error (wrong checksum). So it should not go unnoticed, and it doesn't corrupt other files.

A gnarly 2.6.19 file corruption bug

Posted Dec 21, 2006 12:11 UTC (Thu) by Randakar (guest, #27808) [Link]

Looking at the link to Linus' testcase, it seems to be 'rtorrent'.

http://lwn.net/Articles/215115/

"Btw, here's a simpler test-program that actually shows the difference
between 2.6.18 and 2.6.19 in action, and why it could explain why a
program like rtorrent might show corruption behavious that it didn't show
before."

rtorrent

Posted Dec 21, 2006 14:17 UTC (Thu) by Webexcess (guest, #197) [Link]

rtorrent is a very nice program.

I've also found that it's a great burn-in for a system. It's really good at testing ram, cpu and network (for lots of "light" connections). Apparently it can be a good filesystem test too!

A gnarly 2.6.19 file corruption bug

Posted Dec 21, 2006 23:17 UTC (Thu) by k8to (subscriber, #15413) [Link]

Ah, I suspected as much. I use that program regularly, and it makes heavy use of mmap.

I guess I'll move back to 2.6.18 for now.

A gnarly 2.6.19 file corruption bug

Posted Jan 1, 2007 18:15 UTC (Mon) by erich (subscriber, #7127) [Link]

Note that the Debian and Ubuntu (apparently) 2.6.18 kernels have the same problem. Maybe other distributions as well.
But I guess there will be a fixed 2.6.18 kernel coming in any day now.

A gnarly 2.6.19 file corruption bug

Posted Jan 2, 2007 20:30 UTC (Tue) by rvfh (subscriber, #31018) [Link]

Ubuntu uses neither 2.6.18 nor 2.6.19, so Ubuntu users are safe as far as this bug is concerned.

  • Edgy: 2.6.17
  • Feisty: 2.6.20

File corruption? Or memory corruption?

Posted Dec 21, 2006 10:55 UTC (Thu) by rankincj (subscriber, #4865) [Link]

My 2.6.19.1 kernel tripped up when all I was doing was compiling xine-lib with gcc-4.1.1. Yes, I was compiling on an ext3 filesystem. No, I wasn't using BitTorrent.

See bug 7707 in Bugzilla: "Eeek! page_mapcount(page) went negative! (-1)"

Can't wipe the grin from my face...

Posted Dec 22, 2006 6:02 UTC (Fri) by xoddam (subscriber, #2322) [Link]

> dachshunds would be well advised to keep a low profile.

Isn't a low profile the very essence of dachshundity?

Can't wipe the grin from my face...

Posted Dec 25, 2006 9:15 UTC (Mon) by BackSeat (subscriber, #1886) [Link]

> Isn't a low profile the very essence of dachshundity?

Worth the year's LWN sub on its own. Thank you, and Merry Christmas!

Can't wipe the grin from my face...

Posted Dec 28, 2006 0:31 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link]

Can't...resist...

There is an excellent Les Barker poem on this theme, which he later recorded on The Mrs Ackroyd Band's Gnus and Roses album.

We now return you to your regularly scheduled Linux news..

A gnarly 2.6.19 file corruption bug

Posted Dec 29, 2006 22:13 UTC (Fri) by dwheeler (guest, #1216) [Link]

Problem fixed and the fix has been confirmed.

A gnarly 2.6.19 file corruption bug

Posted Jan 1, 2007 20:24 UTC (Mon) by sadyc (guest, #29140) [Link]

The actual code can be found here

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds