Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 4:49 UTC (Wed) by ringerc (subscriber, #3071)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by yoe
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

... or bad RAM, bad CPU cache, CPU heat, an unrelated kernel bug in (e.g.) a disk controller driver, a disk controller firmware bug, a disk firmware bug, or all sorts of other exciting possibilities.

I recently had a batch of disks in a backup server start eating data because of an HDD firmware bug. It does happen.



Posted Nov 30, 2011 8:29 UTC (Wed) by hmh (subscriber, #3838)

Please disclose the disk model and firmware level. This kind of information is important, as it often helps someone avoid data loss...

Posted Nov 30, 2011 12:02 UTC (Wed) by tialaramex (subscriber, #21167)

"occasionally ogginfo(1) reports corrupt OGG files"

Screams RAM or cache fault to me. It's that word "occasionally" which does it. Bugs tend to be systematic. Their symptoms may be bizarre, but there's usually something consistent about them, because after all someone has specifically (albeit accidentally) programmed the computer to do exactly whatever it was that happened. Even the most subtle Heisenbug will have some sort of pattern to it.

Yoe should be especially suspicious of their "blame ext4" idea if this "corruption" is one or two corrupted bits rather than big holes in the file. Disks don't tend to lose individual bits. Disk controllers don't tend to lose individual bits. Filesystems don't tend to lose individual bits. These things all deal in blocks; when they lose something, they tend to lose really big pieces.

But dying RAM, heat-damaged CPU cache, or a serial link with too little margin of error, those lose bits. Those are the places to look when something mysteriously becomes slightly corrupted.
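One way to tell the two failure modes apart, assuming a known-good copy of the file is available (e.g. from a backup), is to compare the copies and look at the size of each damaged region. The following is only a rough sketch; the function name diff_runs is made up for illustration, and it assumes both copies have the same length.

```python
def diff_runs(good, bad):
    """Return (offset, length, bits_flipped) for each contiguous run of
    bytes that differ between two equal-length byte strings."""
    if len(good) != len(bad):
        raise ValueError("copies must have the same length")
    runs = []
    start = None
    bits = 0
    for i, (x, y) in enumerate(zip(good, bad)):
        if x != y:
            if start is None:
                start, bits = i, 0
            # Count the bits that differ in this byte.
            bits += bin(x ^ y).count("1")
        elif start is not None:
            runs.append((start, i - start, bits))
            start = None
    if start is not None:
        runs.append((start, len(good) - start, bits))
    return runs
```

A handful of runs that are one byte long with one flipped bit each points toward RAM, cache, or a marginal link. Runs that are hundreds or thousands of bytes long and start on a sector or block boundary point toward the storage stack. Note that a lost block can show up as several adjacent runs if some of its bytes happen to match the original.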

Low-level network protocols often lose bits. But because there are checksums in so many layers you won't usually see this in a production system even when someone has goofed (e.g. not implemented Ethernet checksums at all) because the other layers act as a safety net.

Posted Nov 30, 2011 12:44 UTC (Wed) by tialaramex (subscriber, #21167)

Bah, should specify pr1268 not Yoe. Sorry.

Posted Nov 30, 2011 15:42 UTC (Wed) by pr1268 (subscriber, #24648)

The corruption I was getting was not merely "one or two bits" but rather a hole in the OGG file big enough to cause an audible "skip" in the playback, large enough that I believed a whole block was disappearing from the filesystem. Also, since the discussion of write barriers came up: I have noatime,data=ordered,barrier=1 as mount options for this filesystem in my /etc/fstab file. I'm pretty sure those are the "safe" defaults (but I could be wrong).

Posted Nov 30, 2011 17:31 UTC (Wed) by rillian (subscriber, #11344)

Ogg files have block-level checksums too.

That means that a few bit errors will cause the decoder to drop ~100 ms of audio at a time, and tools will report this as 'hole in data'. To see if it's disk or filesystem corruption, look for pages of zeros in a hexdump around where the glitch is.
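Instead of reading the hexdump by eye, the search for pages of zeros can be scripted. This is a minimal sketch; the function name find_zero_runs and the 4096-byte threshold are assumptions (4096 is a common filesystem block size, but check what your filesystem actually uses).

```python
import re
import sys


def find_zero_runs(data, min_len=4096):
    """Return (offset, length) for every run of zero bytes in data
    that is at least min_len bytes long."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py suspect.ogg
    with open(sys.argv[1], "rb") as f:
        contents = f.read()
    for offset, length in find_zero_runs(contents):
        print("zeros at offset %d (0x%x), %d bytes" % (offset, offset, length))
```

A long run of zeros that starts at a multiple of the block size is suspicious. Compressed audio should essentially never contain thousands of consecutive zero bytes on its own.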

Posted Dec 1, 2011 3:17 UTC (Thu) by quotemstr (subscriber, #45331)

Maybe Ogg's block-level checksums aren't such a good idea after all. Most likely, a few wrong bits won't affect the sound output much, and a 100 ms skip sounds much worse than just playing a single wrong sample. Checksums make sense for things that must stay intact, but I don't think most multimedia benefits from this kind of robustness.

Posted Dec 1, 2011 10:07 UTC (Thu) by mpr22 (subscriber, #60784)

A few wrong bits in a Vorbis stream seem likely to give you more than just "one wrong sample".

Posted Dec 1, 2011 18:25 UTC (Thu) by rillian (subscriber, #11344)

Indeed. A few corrupt bits in a compressed format can result in a whole block of nasty noise in the output.

The idea with the Ogg checksums was to protect the listener's ears (and possibly speakers) from corrupt output. It's also nice to have a built-in check for data corruption in your archives, which is working as designed here.
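For the archive-checking use, the per-page checksum can be verified without a decoder. The sketch below follows my reading of the Ogg page layout (RFC 3533): a 27-byte header starting with "OggS", the CRC stored little-endian at byte offset 22, a segment table, then the body. The CRC uses polynomial 0x04c11db7 with a zero initial value and no final XOR, computed over the whole page with the CRC field zeroed. The function names are made up for illustration, and this is not a substitute for a real tool such as ogginfo.

```python
import struct

POLY = 0x04C11DB7


def _build_table():
    # Table for an unreflected (MSB-first) CRC-32 with the polynomial above.
    table = []
    for i in range(256):
        r = i << 24
        for _ in range(8):
            if r & 0x80000000:
                r = ((r << 1) ^ POLY) & 0xFFFFFFFF
            else:
                r = (r << 1) & 0xFFFFFFFF
        table.append(r)
    return table


CRC_TABLE = _build_table()


def ogg_crc(data):
    """CRC as used for Ogg pages: initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ CRC_TABLE[(crc >> 24) ^ byte]
    return crc


def check_pages(data):
    """Return (offset, crc_ok) for each page found in data."""
    results = []
    pos = data.find(b"OggS")
    while pos != -1 and pos + 27 <= len(data):
        nsegs = data[pos + 26]
        header_len = 27 + nsegs
        if pos + header_len > len(data):
            break  # truncated header
        body_len = sum(data[pos + 27:pos + header_len])
        end = pos + header_len + body_len
        if end > len(data):
            break  # truncated page
        (stored,) = struct.unpack_from("<I", data, pos + 22)
        page = bytearray(data[pos:end])
        page[22:26] = b"\x00\x00\x00\x00"  # CRC field is zeroed for the calculation
        ok = ogg_crc(page) == stored
        results.append((pos, ok))
        # After a bad page, resync by searching for the next capture pattern.
        pos = data.find(b"OggS", end if ok else pos + 4)
    return results
```

Pages reported with crc_ok False mark where the file is damaged, which narrows down where to look in a hexdump.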

What you said is valid for video, because we're more tolerant of high-frequency visual noise, and because the extra data dimensions and longer prediction intervals mean you can get more useful information from a corrupt frame than you do with audio. Making the checksum optional for the packet data is one of the things we'd do if we ever revised the Ogg format.


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds