End-to-end is great, and it absolutely makes sense that special-purpose systems like databases may want both additional guarantees and low-overhead access to the drive. But basically none of my important data is in a database; it's scattered all over my hard drive in ordinary files, in a dozen or more formats. If the filesystem *is* your database, as it is for ordinary desktop storage, then that's the only place you can reasonably put your integrity checking.
Backups are also great, but there are cases where they do not protect you (slow, quiet, unreported corruption that can easily persist undetected for weeks or more; see upthread).
(In some cases you can actually increase integrity too -- if your app checks its checksum when loading a file and it fails, then the data is lost but at least you know it; if btrfs checks a checksum while loading a block and it fails, then it can go pull an uncorrupted copy from the RAID mirror and prevent the data from being lost at all.)
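To make the mirror case concrete, here is a minimal sketch of "verify on read, fall back to the other copy". It is not how btrfs is actually implemented; `read_verified`, `ChecksumError`, and the in-memory list standing in for the RAID mirrors are all hypothetical names made up for illustration.

```python
import zlib


class ChecksumError(IOError):
    """Raised when no stored copy matches the recorded checksum."""


def read_verified(copies, expected_crc):
    """Return (data, index) for the first copy whose CRC32 matches.

    `copies` stands in for the same block as read from each mirror,
    in preference order (an assumption made for this sketch).
    """
    for index, data in enumerate(copies):
        if zlib.crc32(data) == expected_crc:
            return data, index
    # Nothing matched: the data is lost, but at least we know it.
    raise ChecksumError("every copy failed its checksum")


good = b"important data"
stored_crc = zlib.crc32(good)   # recorded when the block was written
damaged = b"important dbta"     # silent corruption on the first mirror

data, source = read_verified([damaged, good], stored_crc)
print(source, data)
```

An application-level checksum only gets you the `ChecksumError` branch; the layer that knows about the mirrors is the one that can try the second copy.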
>If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.
Again, I'm not convinced. My year-old laptop does SHA-1 at 200 MB/s (using one core only); the fastest hard drive in the world (according to storagereview.com) streams at 135 MB/s. Granted, you don't want to devote a whole CPU to this sort of thing, and RAID arrays can stream faster than a single disk. But CRC32 goes *way* faster than SHA-1, and my laptop has neither RAID nor a fancy 15k RPM server drive anyway.
And anyway my desktop is often seek-bound, alas, and yours is too; it does make things slow, but I don't see why it should make me care less about my data.
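The throughput comparison above is easy to re-check on your own machine. A rough sketch follows; `throughput_mb_s` is a helper name invented here, and the buffer size and round count are arbitrary choices. The numbers depend heavily on the CPU and on how the libraries were built, so treat the result as a ballpark figure, not a benchmark.

```python
import hashlib
import time
import zlib


def throughput_mb_s(func, data, rounds=3):
    """Run func over data `rounds` times and return MB/s (1 MB = 10**6 bytes)."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(data)
    elapsed = time.perf_counter() - start
    return len(data) * rounds / elapsed / 1e6


buf = bytes(4 * 1024 * 1024)  # 4 MiB of zeros, checksummed in memory on one core

sha1_rate = throughput_mb_s(lambda b: hashlib.sha1(b).digest(), buf)
crc_rate = throughput_mb_s(zlib.crc32, buf)

print(f"SHA-1: {sha1_rate:.0f} MB/s")
print(f"CRC32: {crc_rate:.0f} MB/s")
```

Compare both figures with the sustained streaming rate of your own disk to see whether checksumming could plausibly be the bottleneck.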
Posted Dec 7, 2008 21:33 UTC (Sun) by ncm (subscriber, #165)
For most uses we would benefit from the file system doing as much as it can, and even backing itself up -- although we'd like to be able to bypass whatever gets in the way. But if the file system is going to do less at first, the first thing to checksum is the metadata.