I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).
Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.
I totally understand not being able to implement everything at once, but if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).
Posted Dec 5, 2008 22:52 UTC (Fri) by ncm (subscriber, #165)
Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise. Then applications that matter (e.g. PostgreSQL, or your DVCS) can provide their own data checksums (and not pay twice) and operate on a reliable file system.
This suggests a reminder for applications providing their own checksums: mix in not just the data, but your own metadata (block number, file id). Getting the right checksum on the wrong block is just embarrassing.
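A minimal sketch of that advice, assuming CRC32 from Python's zlib and a hypothetical (file_id, block_no) pair as the block's identity. The header layout and function names are made up for illustration and are not taken from any real application:

```python
import struct
import zlib


def block_checksum(file_id: int, block_no: int, data: bytes) -> int:
    """CRC32 over a small identity header followed by the block contents."""
    # Hypothetical header: two little-endian 64-bit integers.
    header = struct.pack("<QQ", file_id, block_no)
    # Feed the header first, then continue the same CRC over the data.
    return zlib.crc32(data, zlib.crc32(header))


def verify_block(file_id: int, block_no: int, data: bytes, stored: int) -> bool:
    """True only if the data matches AND it is the block we asked for."""
    return block_checksum(file_id, block_no, data) == stored
```

With this, a perfectly intact block returned from the wrong location fails verification, because the block number it was checksummed with differs from the one being requested.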
Tux3: the other next-generation filesystem
Posted Dec 5, 2008 23:58 UTC (Fri) by njs (guest, #40338)
>Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise
Well, maybe...
Within reason, my goal is to have as much confidence as possible in my data's safety, with as little investment of my time and attention. Leaving safety up to individual apps is a pretty wretched system for achieving this -- it defaults to "unsafe", then I have to manually figure out which stuff needs more guarantees, which I'll screw up, plus I have to worry about all the bugs that may exist in the eleventeen different checksumming systems being used in different codebases... This is the same reason I do whole disk backups instead of trying to pick and choose which files to save, or leaving backup functionality up to each individual app. (Not as crazy an idea as it sounds -- that DVCS basically has its own backup system, for instance; but I'm not going around adding that functionality to my photo editor and word processor too.)
Obviously if checksumming ends up causing unacceptable slowdowns, then compromises have to be made. But I'm pretty skeptical; it's not like CRC (or even SHA-1) is expensive compared to disk access latency, and the Btrfs and ZFS folks seem to think usable full disk checksumming is possible.
If it's possible I want it.
Posted Dec 6, 2008 8:26 UTC (Sat) by ncm (subscriber, #165)
This is another case where the end-to-end argument applies. Either (a) it's a non-critical application, and backups (which you have to do anyway) provide enough reliability; or (b) it's a critical application, and the file system can't provide enough assurance anyway, and what it could do would interfere with overall performance.
Similarly, if your application is seek-bound, it's in trouble anyway. If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.
Hence the argument for reliable metadata, anyway: the application can't do that for itself, and it had better not depend on metadata operations being especially fast. Traditionally, serious databases used raw block devices to avoid depending on file system metadata.
Posted Dec 6, 2008 8:55 UTC (Sat) by njs (guest, #40338)
End-to-end is great, and it absolutely makes sense that special purpose systems like databases may want both additional guarantees and low-overhead access to the drive. But basically none of my important data is in a database; it's scattered all over my hard drive in ordinary files, in a dozen or more formats. If the filesystem *is* your database, as it is for ordinary desktop storage, then that's the only place you can reasonably put your integrity checking.
Backups are also great, but there are cases (slow quiet unreported corruption that can easily persist undetected for weeks+, see upthread) where they do not protect you.
(In some cases you can actually increase integrity too -- if your app checks its checksum when loading a file and it fails, then the data is lost but at least you know it; if btrfs checks a checksum while loading a block and it fails, then it can go pull an uncorrupted copy from the RAID mirror and prevent the data from being lost at all.)
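To make the parenthetical concrete, here is a rough sketch of checksum-driven recovery from a mirror. It is not how btrfs is implemented; the `readers` callables standing in for the mirror copies and the function name are assumptions for illustration only:

```python
import zlib
from typing import Callable, Optional, Sequence


def read_verified(
    readers: Sequence[Callable[[], bytes]], expected_crc: int
) -> Optional[bytes]:
    """Return the first copy whose CRC32 matches, or None if every copy is bad.

    Each entry in `readers` is a hypothetical stand-in for reading the same
    block from one mirror.
    """
    for read in readers:
        data = read()
        if zlib.crc32(data) == expected_crc:
            return data  # a good copy exists, so nothing is lost
    return None  # all copies failed: the loss is at least detected
```

An application that checks only its own file checksum is stuck in the `None` case as soon as the one copy it was handed is bad; a layer that can see both mirrors gets to try the other one.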
>If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.
Again, I'm not convinced. My year-old laptop does SHA-1 at 200 MB/s (using one core only); the fastest hard-drive in the world (according to storagereview.com) streams at 135 MB/s. Not that you want to devote a CPU to this sort of thing, and RAID arrays can stream faster than a single disk, but CRC32 goes *way* faster than SHA-1 too, and my laptop has neither RAID nor a fancy 15k RPM server drive anyway.
And anyway my desktop is often seek-bound, alas, and yours is too; it does make things slow, but I don't see why it should make me care less about my data.
Posted Dec 7, 2008 21:33 UTC (Sun) by ncm (subscriber, #165)
For most uses we would benefit from the file system doing as much as it can, and even backing itself up -- although we'd like to be able to bypass whatever gets in the way. But if the file system does less, at first, the first thing to checksum is the metadata.
Posted Dec 16, 2008 1:42 UTC (Tue) by daniel (subscriber, #3181)
> I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).
> Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.
Our ddsnap-style checksumming at replication time would have caught that corruption promptly.
> if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).
It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful. But yes, if extra checking is important to you, you should be able to have it. Whether block checksums belong in the filesystem rather than the volume manager is another question. There may be a powerful efficiency argument that checksumming has to be done by the filesystem, not the volume manager. If so, I would like to see it.
Anyway, when the time comes that block checksumming rises to the top of the list of things to do, we will make sure Tux3 has something respectable, one way or another. Note that checksumming at replication time already gets nearly all the benefit at a very modest CPU cost.
If you want to rank the relative importance of features, replication way beats checksumming. It takes you instantly from having no backup or really awful backup, to having great backup with error detection. So getting to that state with minimal distractions seems like an awfully good idea.
Posted Dec 21, 2008 12:26 UTC (Sun) by njs (guest, #40338)
> Our ddsnap-style checksumming at replication time would have caught that corruption promptly.
What is that, and how does it work? I'm curious...
In general, I don't see how replication can help in the situation I encountered -- basically, some data on the disk magically changed without OS intervention. The only way to distinguish between that and a real data change is if you are somehow hooked into the OS and watching the writes it issues. Maybe ddsnap does that?
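For illustration, here is the kind of write-path hook I mean. This is a toy sketch and not a description of what ddsnap actually does; the `TrackedVolume` class and its methods are invented for the example:

```python
import zlib


class TrackedVolume:
    """Toy in-memory volume that records a CRC32 for every write it sees."""

    def __init__(self, nblocks: int, block_size: int = 4096):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self.sums = [zlib.crc32(b) for b in self.blocks]

    def write(self, n: int, data: bytes) -> None:
        # Legitimate change: contents and recorded checksum move together.
        self.blocks[n] = data
        self.sums[n] = zlib.crc32(data)

    def suspect_blocks(self) -> list:
        # Any block whose contents no longer match the checksum recorded at
        # write time changed without going through write().
        return [
            n for n, b in enumerate(self.blocks)
            if zlib.crc32(b) != self.sums[n]
        ]
```

A scan at replication time could call something like `suspect_blocks()` before shipping data. Without the record made in `write()`, a silently flipped bit looks exactly like an ordinary update.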
>It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful.
Can you elaborate? On my year-old laptop, crc32 over 4k-blocks does >625 MiB/s on one core (adler32 is faster still), and the disk with perfect streaming manages to write at ~60 MiB/s, so by my calculation the worst case is 5% CPU. Enough that it could matter occasionally, but in fact seek-free workloads are very rare... and CPUs continue to follow Moore's law (checksumming is parallelizable), so it seems to me that that number will be <1% by the time tux3 is in production :-).
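If anyone wants to repeat the measurement, here is a rough sketch using Python's zlib. The function names and the 64 MiB default are my own choices, and Python loop overhead means the result somewhat understates what a C implementation would reach:

```python
import time
import zlib


def crc32_throughput_mib_s(total_mib: int = 64, block_size: int = 4096) -> float:
    """Time CRC32 over `total_mib` MiB of zero-filled blocks on one core."""
    block = bytes(block_size)
    nblocks = total_mib * 1024 * 1024 // block_size
    start = time.perf_counter()
    for _ in range(nblocks):
        zlib.crc32(block)
    elapsed = time.perf_counter() - start
    return total_mib / elapsed


def cpu_fraction(disk_mib_s: float, checksum_mib_s: float) -> float:
    """Share of one core needed to checksum a disk streaming at full speed."""
    return disk_mib_s / checksum_mib_s
```

Plugging in the figures quoted above, `cpu_fraction(60, 625)` comes out at roughly a tenth of one core, and that only while the disk is streaming flat out.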
No opinion on volume manager vs. filesystem (so long as the interface doesn't devolve into distinct camps of developers pushing responsibilities off on each other); I could imagine there being locality benefits if your merkle tree follows the filesystem topology, but eh.
>If you want to rank the relative importance of features, replication way beats checksumming.
Fair enough, but I'll just observe that since I do have a perfectly adequate backup system in place already, replication doesn't get *me* anything extra, while checksumming does :-).