
Quote of the week

I don't know what it is about filesystems turning into religions that do not brook questioning but what I am seeing in these emails is what turns me off of btrfs every time it is brought up in the same way I couldn't stand reiser, ZFS, or various other filesystems.. I realize filesystems take a lot of faith as people have to put something they value into a leap of faith it will be there the next day.. but it seems to morph quickly into some sort of fanatical evangelical movement.
Stephen John Smoogen


Quote of the week

Posted Jul 16, 2020 7:16 UTC (Thu) by ncm (guest, #165) [Link] (15 responses)

We know that disk drives produce errors. The manufacturers are happy to tell you how many. When I last checked, it was one bit per 10**13. We push that many bits through in a short time, nowadays, so maybe the rate is better now, but not too much better. If they could get much better, they would cram in more bits and sell those at an error rate you will tolerate.
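
A quick back-of-the-envelope sketch, in Python, of what that rate implies for reading one modern drive end to end (the 12 TB capacity is a made-up figure for illustration):

    # The quoted rate: one bad bit per 10**13 bits read. The capacity
    # below is an assumption for illustration only.
    BIT_ERROR_RATE = 1e-13      # errors per bit read
    DRIVE_BYTES = 12e12         # hypothetical 12 TB drive

    bits_read = DRIVE_BYTES * 8
    print(f"expected errors per full read: {bits_read * BIT_ERROR_RATE:.1f}")
    # prints 9.6; even at 1e-14, a commonly quoted figure for consumer
    # drives, you would still expect about one error per full read.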

Most of those bit errors will be in your data, but some will be in the metadata.

Better file systems can be configured to keep their own checksums on metadata and data, but they can't always guarantee that the contents of a block they find to have a bad checksum can be recovered. With a bit of RAID, there might be another copy. I don't know if they would necessarily have another copy of the metadata. Anyway, not all file systems can keep checksums, and not all that can always do so.
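
A minimal sketch of that verify-then-recover idea, with two mirrored copies and a CRC per block; the names are illustrative, not any real filesystem's API:

    import zlib

    def read_block(mirrors, index):
        # Try each replica; trust the first whose checksum verifies.
        for copy in mirrors:
            data, stored_sum = copy[index]
            if zlib.crc32(data) == stored_sum:
                return data
        raise IOError(f"block {index}: no copy with a valid checksum")

    block = b"important metadata"
    good = [(block, zlib.crc32(block))]
    bad = [(b"important metadatb", zlib.crc32(block))]  # silently corrupted copy
    print(read_block([bad, good], 0))  # recovered from the intact mirror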

So there is always a hint of chaos lurking in any disk drive even when the software is right.

Quote of the week

Posted Jul 16, 2020 8:32 UTC (Thu) by anton (subscriber, #25547) [Link] (9 responses)

This comment is an example of the fear-based ZFS sales pitch that cautioned me to steer clear of ZFS, and is therefore a perfect example of the original quote. It cautioned me because the sales pitch agrees neither with my knowledge of how HDDs work, nor with my experience of file system and HDD failures. (HDDs use error-correcting codes, with some error-detection distance between correcting to the right data and producing an undetected bit error. They also do not write blindly to whatever track the stepper motor has moved to: they have not used stepper motors for a long time, and instead use servo motors with feedback from the content of the disk to determine where they are.)

Nevertheless, we have recently used ZFS on a server, because it has features we want (snapshots), and because it's better at handling disk failure for RAID-1 setups than btrfs. Maybe a sales pitch based on features would be better for ZFS.

Quote of the week

Posted Jul 16, 2020 11:36 UTC (Thu) by ianmcc (subscriber, #88379) [Link]

I recently upgraded my system and used ZFS. Having played around with it for a bit, I would be hesitant to use a system that *doesn't* provide RAID/mirroring facilities and ease of use similar to ZFS (at least for anything beyond scratch storage). Copying my one-year-old Samsung EVO SSD to the new drive turned up a handful of unrecoverable errors; by chance they only affected files that weren't important.

All of my disks now are in mirrored pairs. Upgrading to more space is just a matter of replacing one of the pair with a larger disk, resilvering, and replacing the other disk (or something fancier, such as striping). I haven't used snapshots much yet, but they look very good for keeping some off-site backup as well.

As far as I understand it, most HDD failures are physical failures in mechanical parts that are not repairable by in-built error correction. Recovering data from a dead HDD is expensive (probably impractical for a home user? Not sure). After an HDD failure a few years ago where I lost a lot of photos that didn't have a backup, I've now got all my critical stuff in 'the cloud', but it is also nice to know that my desktop itself has some resilience to hardware failures.

Quote of the week

Posted Jul 16, 2020 13:44 UTC (Thu) by mchouque (subscriber, #62087) [Link] (2 responses)

Sure, hard drives have CRCs, but you shouldn't think of checksumming filesystems as a cure for bad media: they're a way to catch bad media, bad transport, bad sender, bad RAM, bad firmware, bad controller, and so on.

The real and true value is not so much about checking whether your media is lying to you about the data it's reading back, but about knowing that what you read from the whole chain (e.g. media -> firmware -> transport -> controller -> memory) is exactly what you wanted to write in the first place.

With other, non-checksumming filesystems, when an I/O goes from the kernel through the disk controller to end up on a disk, you have no idea whether it was corrupted in flight or not.

Maybe you're just writing random crap that was corrupted in memory, during transfer, by your controller, by your disk array, by your FC infrastructure, and so on. In that case, when you read it back, your hard drive, array, or whatever will happily say it's all good, since it's what it got in the first place. Or equally, maybe what it read was correct but was corrupted on its way to you: you just don't know.

And that's what a checksumming filesystem is good at.

You're thinking filesystem checksums are about catching media corruption, while you should be looking at the big picture: it's an end-to-end checksum.
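
A toy sketch of that end-to-end property: the checksum is computed and verified in host memory, so it covers everything in between. The dict here stands in for the entire I/O path, and all the names are illustrative:

    import hashlib

    device = {}  # stands in for media -> firmware -> transport -> controller

    def write_block(index, data):
        # Checksum computed in the CPU, before the data leaves host memory.
        device[index] = (data, hashlib.sha256(data).digest())

    def read_block(index):
        data, stored = device[index]
        # Verified in the CPU, after the data traverses the whole chain.
        if hashlib.sha256(data).digest() != stored:
            raise IOError(f"block {index}: corrupted somewhere along the chain")
        return data

    write_block(0, b"what you wanted to write in the first place")
    print(read_block(0))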

Quote of the week

Posted Jul 16, 2020 21:39 UTC (Thu) by barryascott (subscriber, #80640) [Link] (1 response)

No, not CRC. HDDs use 512 bits of ECC for each 512-byte block and can recover runs of up to 10 bit errors, as I recall.
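
As a rough illustration of what such a code buys you, here is a sketch using the third-party reedsolo package (pip install reedsolo) as a stand-in for the Reed-Solomon-style codes drive firmware uses; it is emphatically not the drive's actual ECC:

    from reedsolo import RSCodec

    rsc = RSCodec(10)              # 10 parity bytes: corrects up to 5 bad bytes
    sector = bytes(range(32))      # toy "sector" payload
    stored = bytearray(rsc.encode(sector))

    stored[7:10] = b"\xff\xff\xff"     # a three-byte burst of damage
    recovered = rsc.decode(stored)[0]  # recent reedsolo returns a tuple
    assert bytes(recovered) == sector  # the burst was corrected transparently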

Of course you are correct that if the data is corrupted on the bus, that counts for nothing.

Barry

Quote of the week

Posted Jul 17, 2020 5:01 UTC (Fri) by ncm (guest, #165) [Link]

I guess your comment indicates that you are not aware that handling a limited run of bad bits is exactly what a CRC is designed to enable. It will detect all error bursts up to that length, and then has a pretty good chance of noticing more widely separated errors. There are other algorithms with similar characteristics.
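
That burst property is easy to demonstrate: a 32-bit CRC detects any single error burst of at most 32 bits, wherever it lands. A sketch using Python's zlib.crc32:

    import os
    import random
    import zlib

    block = os.urandom(512)
    good_crc = zlib.crc32(block)

    for _ in range(10_000):
        corrupted = bytearray(block)
        start = random.randrange(512 * 8 - 32)  # burst start, in bits
        for bit in range(start, start + random.randint(1, 32)):
            corrupted[bit // 8] ^= 1 << (bit % 8)  # flip every bit in the burst
        # A CRC-32 is guaranteed to catch any burst no longer than 32 bits.
        assert zlib.crc32(corrupted) != good_crc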

So a block that gets past that will most often have two smaller clusters of bad bits, or one bigger cluster. Before delivering a bad block, the drive will read the sector over and over, maybe a hundred times, before it gives up. If it gets one good read, it will write that to a new sector, maybe mark the old spot as bad, and not write anything more there.

After a while, you will have a growing collection of those copied-off blocks, each with a higher-than-typical likelihood of containing undetected errors.

Quote of the week

Posted Jul 16, 2020 17:37 UTC (Thu) by zlynx (guest, #2285) [Link] (2 responses)

Another thing that should really worry you: SAS Protection Information includes the block address.

It is a *real life problem* that block addresses get corrupted during a write operation, which means the data block your system wrote went to the *wrong block*. Now the new data is not where it is expected, *and* it just wrote over some unknown block of data.

ZFS or btrfs can help with that: by keeping multiple copies of the data, when they go to read or scrub it, they will notice that the data is either missing or overwritten and can recover it from another copy.

Quote of the week

Posted Jul 16, 2020 18:04 UTC (Thu) by ncm (guest, #165) [Link] (1 response)

Thus, you want the block checksum to fold in the block number, because otherwise a block that was written to the wrong place will be reported to have a good checksum, and you won't notice when you read it back.

Furthermore, the block that was supposed to have been overwritten (instead of the one that actually was) still has a good checksum, even though its contents are stale and wrong. To detect this you need a generation counter (the number of writes that have occurred in the filesystem) folded into the checksum, with each block's metadata also recording the generation count. Again, I don't know which filesystems do this, if any.
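
A sketch of both ideas together, folding the intended block number and a generation counter into the checksum. The names are illustrative, not any real filesystem's on-disk format:

    import zlib

    def block_checksum(data, block_nr, generation):
        # Fold location and write generation into the checksum itself.
        meta = block_nr.to_bytes(8, "little") + generation.to_bytes(8, "little")
        return zlib.crc32(meta + data)

    stored = block_checksum(b"payload", block_nr=42, generation=7)

    # Misdirected write: same bytes, wrong location -> checksum mismatch.
    assert block_checksum(b"payload", block_nr=43, generation=7) != stored
    # Stale block: right location, superseded generation -> also a mismatch.
    assert block_checksum(b"payload", block_nr=42, generation=6) != stored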

Filesystem integrity maintenance is a lot like security: you need a threat model, a list of error types you hope to detect, and a list of which of those you hope to be able to correct automatically. The more of each you have, the more complex it all gets, and the less confidence you can have in the code that implements it all.

Quote of the week

Posted Jul 16, 2020 18:09 UTC (Thu) by zlynx (guest, #2285) [Link]

I'm not sure about ZFS, but btrfs does all of that. The metadata tree has sequence numbers, and higher-level tree nodes checksum the lower nodes.

Quote of the week

Posted Aug 3, 2020 13:48 UTC (Mon) by Darkstar (guest, #28767) [Link] (1 response)

Every time I hear this argument that hard disks have become better, have better internal error correction, etc., I steer people to the excellent paper "Parity Lost and Parity Regained" by Krioukov et al. (https://www.usenix.org/legacy/events/fast08/tech/full_pap...).

That paper, even though it is 12 years old by now, gives very accurate models for those errors, and even if you tweak the percentages down to compensate for "new developments" in HDD engineering, you still get non-negligible chances of visible errors during the lifetime of the hard disk.

Yes, there are error checks and corrections, and yes, these might drop the BER by another order of magnitude or two. But there are also software bugs, race conditions, etc. in the drive firmware that counter these improvements.

And these issues actually happen in practice. We have thousands of hard disks deployed in the field inside storage systems, and we see the blocks that just so happen to not make it to disk (or back) correctly.

So while the actual numbers might be considered "fear-mongering", and you might argue that they should be one or two orders of magnitude higher or lower, if you have hundreds or thousands of disks deployed, it will have an impact.

Quote of the week

Posted Aug 4, 2020 8:28 UTC (Tue) by Wol (subscriber, #4433) [Link]

You don't need hundreds or thousands of disks. I did a quick calculation for RAID and worked out that if you rebuild a three-disk array using modern, decently large disks, you should on average get ONE soft error. That's a "reset, retry, it's gone away" error.

A single very large modern hard drive read end-to-end is also large enough to average one error.
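
The arithmetic is easy to reproduce. The disk size and the error rate below are assumptions; manufacturers commonly quote around one unrecoverable read error per 10**14 bits for consumer drives:

    URE_PER_BIT = 1e-14     # assumed unrecoverable-read-error rate
    DISK_BYTES = 12e12      # three hypothetical 12 TB disks

    # Rebuilding one member of a three-disk array reads the other two.
    bits_read = 2 * DISK_BYTES * 8
    print(f"expected soft errors during rebuild: {bits_read * URE_PER_BIT:.1f}")
    # prints 1.9: on the order of the ONE error estimated above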

Cheers,
Wol

Quote of the week

Posted Jul 17, 2020 0:42 UTC (Fri) by Fowl (subscriber, #65667) [Link]

Surely whatever AEAD scheme you're using would protect against that, and many other failures.

You *are* using encryption on all your disks, aren't you? ;)
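
With an AEAD cipher, any corruption of the stored ciphertext fails authentication outright instead of being silently returned. A sketch using the third-party cryptography package; feeding the block number in as associated data is my own illustrative addition, not how any particular disk-encryption scheme works:

    import os
    from cryptography.exceptions import InvalidTag
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)
    aead = AESGCM(key)
    nonce = os.urandom(12)
    block_nr = (42).to_bytes(8, "little")  # bind the ciphertext to its address

    sealed = aead.encrypt(nonce, b"secret block contents", block_nr)
    aead.decrypt(nonce, sealed, block_nr)  # intact, right address: fine

    tampered = bytes([sealed[0] ^ 1]) + sealed[1:]
    try:
        aead.decrypt(nonce, tampered, block_nr)
    except InvalidTag:
        print("corruption detected, not silently returned")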

Quote of the week

Posted Jul 17, 2020 1:11 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> We know that disk drives produce errors. The manufacturers are happy to tell you how many. When I last checked, it was one bit per 10**13. We push that many bits through in a short time, nowadays, so maybe the rate is better now, but not too much better. If they could get much better, they would cram in more bits and sell those at an error rate you will tolerate.

And those errors are unlikely to be in the data on the hard disk. Reset the bus, re-read, and the correct data will be returned.

So chances are the manufacturers CAN'T improve on those error stats, because they're the background level of errors caused by things like cosmic rays cascading through the electronics and flipping the odd random bit here and there ...

Okay, I edit the raid wiki, but for those who want filesystems that protect their data, I'd to some extent recommend dm-integrity/raid/lvm/ext. That uses the KISS principle, with each layer "doing its thing". The trouble with btrfs/ZFS et al. is that they try to be the Swiss Army knife: the filesystem that does everything. That is totally at odds with the Unix principle of "do one thing and do it well".

But it's like the monolithic/microkernel holy wars: which approach is actually best for you depends on what you want or need.

Cheers,
Wol

Quote of the week

Posted Jul 17, 2020 5:25 UTC (Fri) by ncm (guest, #165) [Link] (1 response)

Data as stored on the platter is an analog affair that it is the job of the disk electronics to tease into a collection of bits matching what you sent to it. The analog signal never matches exactly what was written, but is almost always close enough to get the same bits out. By spacing transitions out more, the drive could be more sure of getting the right bits back, but then not so many bits would fit.

So, no, there really are errors coming off the platter that the controller can, most of the time, tease back into the correct bits, when it can tell they were wrong. But some errors produce the right checksum, and the controller is none the wiser. The designer has a target error rate; producing fewer errors than the budget allows means the bits are not packed in as tightly as they could have been, and the disk advertises less capacity for less money. If you care about errors, you will have made arrangements to tolerate some.

Those are in addition to any errors resulting from noise on the bus, which is ultimately another analog affair, albeit with better margins than on the platter. The bus doesn't get to retry, and anyway can't tell whether it should.

Quote of the week

Posted Jul 22, 2020 12:15 UTC (Wed) by ras (subscriber, #33059) [Link]

> But some errors produce the right checksum, and the controller is none the wiser.

The same is true of any checksum scheme, including ZFS's. But add enough bits and the problem becomes negligible. The 512 bits someone here said a disk controller uses are more than enough to ensure you are unlikely to see an undetected error before the universe goes dark.

What ZFS and btrfs do provide is end-to-end checking. The drive controller can only correct errors that occur on the disk. Bus errors, SATA controller errors, DRAM errors, and cosmic rays hitting CPU caches all happen after that check is done, and so go through unnoticed. Doing the checking in the CPU catches them, which is where ZFS and btrfs win.

Quote of the week

Posted Jul 30, 2020 14:15 UTC (Thu) by azz (subscriber, #371) [Link]

> When I last checked, it was one bit per 10**13.

When I built a new RAID array in 2018, I copied over 27.2 TB of data from my existing arrays and b2summed everything before and afterwards. I found and fixed 16 single-bit errors, which works out to about one error per 1.4 x 10**13 bits: pretty much spot on for that number!
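
That works out as claimed:

    # 16 single-bit errors while copying 27.2 TB:
    print(f"one error per {27.2e12 * 8 / 16:.2e} bits")  # 1.36e+13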

