Not logged in
Log in now
Create an account
Subscribe to LWN
LWN.net Weekly Edition for June 20, 2013
Pencil, Pencil, and Pencil
Dividing the Linux desktop
LWN.net Weekly Edition for June 13, 2013
A report from pgCon 2013
Try to nail down whether your problem is LVM, one of your disks dying, or ext4, before changing things like that. Otherwise you'll be debugging for a long time to come...
Improving ext4: bigalloc, inline data, and metadata checksums
Posted Nov 30, 2011 4:49 UTC (Wed) by ringerc (subscriber, #3071)
I recently had a batch of disks in a backup server start eating data because of a HDD firmware bug. It does happen.
Posted Nov 30, 2011 8:29 UTC (Wed) by hmh (subscriber, #3838)
Posted Nov 30, 2011 12:02 UTC (Wed) by tialaramex (subscriber, #21167)
Screams RAM or cache fault to me. It's that word "occasionally" which does it. Bugs tend to be systematic. Their symptoms may be bizarre, but there's usually something consistent about them, because after all someone has specifically (albeit accidentally) programmed the computer to do exactly whatever it was that happened. Even the most subtle Heisenbug will have some sort of pattern to it.
Yoe should be especially suspicious of their "blame ext4" idea if this "corruption" is one or two corrupted bits rather than big holes in the file. Disks don't tend to lose individual bits. Disk controllers don't tend to lose individual bits. Filesystems don't tend to lose individual bits. These things all deal in blocks, when they lose something they will tend to lose really big pieces.
But dying RAM, heat-damaged CPU cache, or a serial link with too little margin of error, those lose bits. Those are the places to look when something mysteriously becomes slightly corrupted.
Low-level network protocols often lose bits. But because there are checksums in so many layers you won't usually see this in a production system even when someone has goofed (e.g. not implemented Ethernet checksums at all) because the other layers act as a safety net.
Posted Nov 30, 2011 12:44 UTC (Wed) by tialaramex (subscriber, #21167)
Posted Nov 30, 2011 15:42 UTC (Wed) by pr1268 (subscriber, #24648)
The corruption I was getting was not merely "one or two bits" but rather a hole in the OGG file big enough to cause an audible "skip" in the playback—large enough to believe it was a whole block disappearing from the filesystem. Also, the discussion of write barriers came up; I have noatime,data=ordered,barrier=1 as mount options for this filesystem in my /etc/fstab file—I'm pretty sure those are the "safe" defaults (but I could be wrong).
Posted Nov 30, 2011 17:31 UTC (Wed) by rillian (subscriber, #11344)
That means that a few bit errors will cause the decoder to drop ~100 ms of audio at a time, and tools will report this as 'hole in data'. To see if it's disk or filesystem corruption, look for pages of zeros in a hexdump around where the glitch is.
Posted Dec 1, 2011 3:17 UTC (Thu) by quotemstr (subscriber, #45331)
Posted Dec 1, 2011 10:07 UTC (Thu) by mpr22 (subscriber, #60784)
Posted Dec 1, 2011 18:25 UTC (Thu) by rillian (subscriber, #11344)
The idea with the Ogg checksums was to protect the listener's ears (and possibly speakers) from corrupt output. It's also nice to have a built-in check for data corruption in your archives, which is working as designed here.
What you said is valid for video, because we're more tolerant of high frequency visual noise, and because the extra data dimensions and longer prediction intervals mean you can get more useful information from a corrupt frame than you do with audio. Making the checksum optional for the packet data is one of the things we'd do if we ever revised the Ogg format.
Posted Dec 2, 2011 22:09 UTC (Fri) by giraffedata (subscriber, #1954)
What you're trying to do with moving back to ext3 is what the Jargon File calls shotgun debugging: trying out some radical move in hopes that this will fix your problem.
That's not shotgun debugging (and not what the Jargon File calls it). The salient property of a shotgun isn't that it makes radical changes, but that it makes widespread changes. So you hit what you want to hit without aiming at it.
Shotgun debugging is trying lots of little things, none that you particularly believe will fix the bug.
In this case, the fallback to ext3 is fairly well targeted: the problem came contemporaneously with this one major and known change to the system, so it's not unreasonable to try undoing that change.
The other comments give good reason to believe this is not the best way forward, but it isn't because it's shotgun debugging.
There must be a term for the debugging mistake in which you give too much weight to the one recent change you know about in the area; I don't know what it is. (I've lost count of how many people accused me of breaking their Windows system because after I used it, there was a Putty icon on the desktop and something broke soon after that).
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds