LWN: Comments on "Tux3: the other next-generation filesystem" https://lwn.net/Articles/309094/ This is a special feed containing comments posted to the individual LWN article titled "Tux3: the other next-generation filesystem". Tux3: the other next-generation filesystem https://lwn.net/Articles/312386/ https://lwn.net/Articles/312386/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Our ddsnap-style checksumming at replication time would have caught that corruption promptly.</font><br> <p> What is that, and how does it work? I'm curious...<br> <p> In general, I don't see how replication can help in the situation I encountered -- basically, some data on the disk magically changed without OS intervention. The only way to distinguish between that and a real data change is if you are somehow hooked into the OS and watching the writes it issues. Maybe ddsnap does that?<br> <p> <font class="QuotedText">&gt;It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful.</font><br> <p> Can you elaborate? On my year-old laptop, crc32 over 4k-blocks does &gt;625 MiB/s on one core (adler32 is faster still), and the disk with perfect streaming manages to write at ~60 MiB/s, so by my calculation the worst case is 5% CPU. Enough that it could matter occasionally, but in fact seek-free workloads are very rare... and CPUs continue to follow Moore's law (checksumming is parallelizable), so it seems to me that that number will be &lt;1% by the time tux3 is in production :-).<br> <p> No opinion on volume manager vs. filesystem (so long as the interface doesn't devolve into distinct camps of developers pushing responsibilities off on each other); I could imagine there being locality benefits if your merkle tree follows the filesystem topology, but eh.<br> <p> <font class="QuotedText">&gt;If you want to rank the relative importance of features, replication way beats checksumming.</font><br> <p> Fair enough, but I'll just observe that since I do have a perfectly adequate backup system in place already, replication doesn't get *me* anything extra, while checksumming does :-).<br> </div> Sun, 21 Dec 2008 12:26:35 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/312332/ https://lwn.net/Articles/312332/ joern <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; I expect we can just ignore the problem</font><br> <p> In that case I am a step ahead of you. :)<br> <p> The situation may be easier to reach than you expect. Removable media can move from a beefy machine to some embedded device with 8M of RAM. Might not be likely for tux3, but is reasonably likely for logfs.<br> <p> And order is important. If B is rewritten _after_ C, the promise made by C' is released. If it is rewritten _before_ C, both promises exist in parallel.<br> <p> What I did to handle this problem may not apply directly to tux3, as the filesystem designs don't match 100%. Logfs has the old-fashioned indirect blocks and stores a "promise" by marking a pointer in the indirect block as such. Each commit walks a list of promise-containing indirect blocks and writes all promises to the journal.<br> <p> On mount the promises are added to an in-memory btree. Each promise occupies about 32 bytes - while it would occupy a full page if stored in the indirect block and no other promises share this block.
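<p> To make the bookkeeping concrete, here is a rough sketch of what such a fixed-size promise record and its mount-time replay might look like. This is plain user-space C for illustration only, not the real logfs kernel code; every name in it is made up.<br>
<pre>
#include <stdint.h>
#include <stdlib.h>

/*
 * One "promise": a pointer inside an indirect block that still has to be
 * rewritten, remembered as a small record instead of a pinned dirty page.
 */
struct promise {
	uint64_t ino;       /* inode owning the indirect block */
	uint64_t index;     /* which pointer slot inside it is promised */
	uint64_t new_addr;  /* on-disk address of the rewritten child block */
	uint64_t flags;     /* e.g. unfulfilled vs. superseded */
};                          /* 32 bytes, in line with the figure above */

struct promise_set {
	struct promise *v;
	size_t n, cap;
};

/* Called once per journal record during mount-time replay. */
int remember_promise(struct promise_set *s, const struct promise *p)
{
	if (s->n == s->cap) {
		size_t cap = s->cap ? 2 * s->cap : 64;
		struct promise *v = realloc(s->v, cap * sizeof(*v));
		if (!v)
			return -1;
		s->v = v;
		s->cap = cap;
	}
	s->v[s->n++] = *p;  /* the real code keeps these sorted in a btree */
	return 0;
}
</pre>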
That allows the read-only case to work correctly and consume fairly little memory.<br> <p> When going to read-write mode, the promises can be moved into the indirect blocks again. If those consume too much memory, they are written back. However, for some period promises may exist both in the btree and in indirect blocks. Care must be taken that those two never disagree.<br> <p> Requires a bit more RAM than your outlined algorithm, but still bounded to a reasonable amount - nearly identical to the size occupied in the journal.<br> </div> Sat, 20 Dec 2008 13:08:05 +0000 Please, stop... https://lwn.net/Articles/312308/ https://lwn.net/Articles/312308/ giraffedata You must have seriously misread the post to which you responded. It doesn't mention special features of hardware. It does mention special flaws in hardware and how XFS works in spite of them. <p> I too remember reports that in testing, systems running early versions of XFS didn't work because XFS assumed, like pretty much everyone else, that the hardware would not write garbage to the disk and subsequently read it back with no error indication. The testing showed that real world hardware does in fact do that and, supposedly, XFS developers improved XFS so it could maintain data integrity in spite of it. Sat, 20 Dec 2008 03:55:39 +0000 Please, stop... https://lwn.net/Articles/312306/ https://lwn.net/Articles/312306/ sandeen <div class="FormattedComment"> Can we just drop the whole "XFS expects and/or works around special hardware" meme? This has been kicked around for years without a shred of evidence. I may as well assert that XFS requires death-rays from mars for proper functionality.<br> <p> XFS, like any journaling filesystem, expects that when the storage says data is safe on disk, it is safe on disk and the filesystem can proceed with whatever comes next. That's it; no special capacitors, no power-fail interrupts, no death-rays from mars. There is no special-ness required (unless you consider barriers to prevent re-ordering to be special, and xfs is not unique in that respect either).<br> </div> Sat, 20 Dec 2008 03:31:02 +0000 File checksums needed? https://lwn.net/Articles/311544/ https://lwn.net/Articles/311544/ daniel <i>There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.</i> <br><br> Scribble on final write is something we plan to detect, by checksumming the commit block. I seem to recall reading that SGI ran into hardware that would lose power to the memory before the drive controller lost its power-good, and had to do something special in XFS to survive it. It would be better if hardware were engineered not to do that. Tue, 16 Dec 2008 01:57:14 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/311440/ https://lwn.net/Articles/311440/ daniel <i>I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).
<br><br> Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No telling how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.</i> <br><br> Our ddsnap-style checksumming at replication time would have caught that corruption promptly. <br><br> <i>if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).</i> <br><br> It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful. But yes, if extra checking is important to you, you should be able to have it. Whether block checksums belong in the filesystem rather than the volume manager is another question. There may be a powerful efficiency argument that checksumming has to be done by the filesystem, not the volume manager. If so, I would like to see it. <br><br> Anyway, when the time comes that block checksumming rises to the top of the list of things to do, we will make sure Tux3 has something respectable, one way or another. Note that checksumming at replication time already gets nearly all the benefit at a very modest CPU cost. <br><br> If you want to rank the relative importance of features, replication way beats checksumming. It takes you instantly from having no backup or really awful backup, to having great backup with error detection. So getting to that state with minimal distractions seems like an awfully good idea. Tue, 16 Dec 2008 01:42:12 +0000 Correctness https://lwn.net/Articles/311499/ https://lwn.net/Articles/311499/ grundler <div class="FormattedComment"> ncm wrote:<br> <font class="QuotedText">&gt; There's a widespread myth (originating where?!) that disks</font><br> <font class="QuotedText">&gt; detect a power drop and use the last few milliseconds to do</font><br> <font class="QuotedText">&gt; something safe, such as finish up the sector they're writing.</font><br> <font class="QuotedText">&gt; It's not true. A disk will happily write half a sector and scribble trash.</font><br> <p> It was true for SCSI disks in the 90's. The feature was called "Sector Atomicity". As expected, there is a patent for one implementation:<br> <a href="http://www.freepatentsonline.com/5359728.html">http://www.freepatentsonline.com/5359728.html</a><br> <p> AFAIK, every major server vendor required it. I have no idea if this was ever implemented for IDE/ATA/SATA drives. But UPS's became the norm for avoiding power failure issues.<br> </div> Mon, 15 Dec 2008 21:06:52 +0000 Correctness https://lwn.net/Articles/310882/ https://lwn.net/Articles/310882/ anton <blockquote>A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.</blockquote> Given that disk drives do their own checksumming, you get pretty good odds. And if you think they are not good, why would you think that FS checksums are any better? <p>Concerning getting such damage on power-off, most drives don't do that; we would hear a lot about drive-level read errors after turning off computers if that was a general characteristic.
However, I have seen such things a few times, and it typically leads to me avoiding the brand of the drive for a long time (i.e., no Hitachi drives for me, even though they were still IBM when it happened, and no Maxtor, either; hmm, could it be that selling such drives leads to having to sell the division/company soon after?); they usually did not happen on an ordinary power-off, but in some unusual situations that might result in funny power characteristics (that's still no excuse to corrupt the disk). Thu, 11 Dec 2008 16:50:54 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/310775/ https://lwn.net/Articles/310775/ daniel Hi Joern, <br><br> <i>there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time.</i> <br><br> You mean, first run with lots of ram, get tons of metadata blocks pinned, then remount with too little ram to hold all the pinned metadata blocks. A rare situation, you would have to work at that. All of ram is available for pinned metadata on remount, and Tux3 is pretty stingy about metadata size. <br><br> In your example, when B is rewritten (a btree split or merge) the promise made by C' to update B is released because B' is on disk. So the situation is not as complex as you feared. <br><br> I expect we can just ignore the problem of running out of dirtyable cache on replay and nobody will ever hit it. But for completeness, note that writing out the dirty metadata is not the only option. By definition, one can reconstruct each dirty metadata block from the log. So choose a dirty metadata block with no dirty children, reconstruct it and write it out, complete with promises (a mini-rollup). Keep doing that until all the dirty metadata fits in cache, then go live. This may not be fast, but it clearly terminates. Unwinding these promises is surely much easier than unwinding credit default swaps :-) <br><br> Regards, <br><br> Daniel Thu, 11 Dec 2008 08:01:07 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/310771/ https://lwn.net/Articles/310771/ daniel <i>How do you deal with inode->i_size and inode->i_blocks changing on behalf of the "promise"?</i> <br><br> These are updated with the inode table block and not affected by promises. Note that we can sometimes infer the i_size and i_blocks changes from the logical positions of the written data blocks and could defer inode table block updates until rollup time. And in the cases where we can't infer it, write the i_size into the log commit block. More optimization fun. Thu, 11 Dec 2008 06:42:51 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/310427/ https://lwn.net/Articles/310427/ martinfick <div class="FormattedComment"> You are correct, that is actually quite a bit more advanced than what I was proposing. But since I did not go into any details about what I was asking for, I can hardly object. :) The real problem with the above, apart from perhaps being difficult to achieve, is that it would likely break posix semantics!<br> <p> The yum proposal probably assumes that I could have multiple writes interleaved with reads from the same locations that could succeed in one transaction and then possibly be rolled back. Posix requires that once a write succeeds, any reads to the same location that succeed after the write report the newly written bytes.
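<p> As a minimal illustration of that rule (plain C, with a made-up scratch file name; not tied to any particular filesystem):<br>
<pre>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4] = { 0 };
	int fd = open("scratchfile", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	if (pwrite(fd, "new!", 4, 0) != 4)   /* the write reports success... */
		return 1;
	if (pread(fd, buf, 4, 0) != 4)       /* ...so a later successful read... */
		return 1;
	assert(memcmp(buf, "new!", 4) == 0); /* ...must already see the new bytes */
	close(fd);
	return 0;
}
</pre>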
To return a read of some written bytes to any process, even the writer, with the transaction pending, and to then roll back the transaction and return in a read what was there before the write, to any process, would break this requirement. The yum example above probably requires such "broken" semantics.<br> <p> What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or roll back the write (transaction). <br> <p> The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. This would not allow any concurrent reads or writes to the portion of the object being modified during this block, ensuring full posix semantics. <br> <p> For this to be effective with distributed redundant filesystems, once the FS has signaled ready to commit, the write has to be able to survive a crash so that if the node hosting the FS crashes, the rollback or commit can be issued upon recovery (depending on the coordinator's decision) and reads/writes must continue to be blocked until then (even after the crash!)<br> <p> If the commit is performed, things continue as usual; if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides to commit or roll back the transaction.<br> <p> <p> That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement; the tricky part is defining a useful generic interface to the controller that would allow higher level distributed FSes to use it effectively.<br> <p> <p> </div> Tue, 09 Dec 2008 20:26:59 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/310348/ https://lwn.net/Articles/310348/ rwmj <pre>
begin transaction
yum update
test test test
commit transaction
</pre> <p> This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple transactions outstanding, you need some way to combine the results to get a consistent filesystem.</p> <p>For example, suppose that the yum transaction modified <code>/etc/motd</code>, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?</p> <p> From DBMSes you can find lots of different strategies to deal with these cases. A non-exhaustive list might include: Don't permit the second transaction to succeed. Always take the result of the first (or second) transaction and overwrite the other. Use a merge strategy (and there are many different sorts). </p> <p> As usual in computer science, there is a whole load of interesting, accessible theory here, which is being completely ignored. My favorite, which is directly relevant here, is <a href="http://okmij.org/ftp/Computation/Continuations.html#zipper-fs">Oleg's Zipper filesystem</a>. </p> <p>Rich.</p> Tue, 09 Dec 2008 14:48:03 +0000 Correctness https://lwn.net/Articles/310109/ https://lwn.net/Articles/310109/ ncm <div class="FormattedComment"> We have, elsewhere in this same thread, reports of bad data delivered as good, and causing trouble, Mr. Phillips's opinion notwithstanding. The incidence is, therefore, not negligible for data many people care about.
Partially-written blocks are only one cause of bad sectors, which I noted only because they are an example of one that occurs much more frequently for some users than for others. Bad sectors may occur in the journal as well as in file contents. The drive will detect and report only a large, but not always a large enough, fraction of these.<br> </div> Sun, 07 Dec 2008 22:28:32 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/310101/ https://lwn.net/Articles/310101/ ncm <div class="FormattedComment"> For most uses we would benefit from the file system doing as much as it can, and even backing itself up -- although we'd like to be able to bypass whatever gets in the way. But if the file system does less, at first, the first thing to checksum is the metadata.<br> </div> Sun, 07 Dec 2008 21:33:54 +0000 File checksums needed? https://lwn.net/Articles/310038/ https://lwn.net/Articles/310038/ giraffedata <blockquote> Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, </blockquote> <p> If TUX3 is for small systems, Phillips is probably right. I don't know what "continuous replication" means or how much data he's talking about here, but I have a feeling that studies I've seen calling for file checksumming did maybe 10,000 times as much I/O as this. Sat, 06 Dec 2008 19:07:27 +0000 File checksums needed? https://lwn.net/Articles/310037/ https://lwn.net/Articles/310037/ giraffedata <blockquote> A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds. </blockquote> <p> Actually, I think the probability of reading such a sector without error indication <em>is</em> negligible. There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it. <p> I've seen a handful of studies that showed these failure modes, and I'm pretty sure none of them showed simple sector CRC failure. <p> If sector CRC failure were the problem, adding a file checksum is probably no better than just using stronger sector CRC. Sat, 06 Dec 2008 18:57:06 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309999/ https://lwn.net/Articles/309999/ njs <div class="FormattedComment"> End-to-end is great, and it absolutely makes sense that special purpose systems like databases may want both additional guarantees and low-overhead access to the drive. But basically none of my important data is in a database; it's scattered all over my hard drive in ordinary files, in a dozen or more formats.
If the filesystem *is* your database, as it is for ordinary desktop storage, then that's the only place you can reasonably put your integrity checking.<br> <p> Backups are also great, but there are cases (slow quiet unreported corruption that can easily persist undetected for weeks+, see upthread) where they do not protect you.<br> <p> (In some cases you can actually increase integrity too -- if your app checks its checksum when loading a file and it fails, then the data is lost but at least you know it; if btrfs checks a checksum while loading a block and it fails, then it can go pull an uncorrupted copy from the RAID mirror and prevent the data from being lost at all.)<br> <p> <font class="QuotedText">&gt;If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.</font><br> <p> Again, I'm not convinced. My year-old laptop does SHA-1 at 200 MB/s (using one core only); the fastest hard-drive in the world (according to storagereview.com) streams at 135 MB/s. Not that you want to devote a CPU to this sort of thing, and RAID arrays can stream faster than a single disk, but CRC32 goes *way* faster than SHA-1 too, and my laptop has neither RAID nor a fancy 15k RPM server drive anyway.<br> <p> And anyway my desktop is often seek-bound, alas, and yours is too; it does make things slow, but I don't see why it should make me care less about my data.<br> </div> Sat, 06 Dec 2008 08:55:22 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309995/ https://lwn.net/Articles/309995/ ncm <div class="FormattedComment"> This is another case where the end-to-end argument applies. Either (a) it's a non-critical application, and backups (which you have to do anyway) provide enough reliability; or (b) it's a critical application, and the file system can't provide enough assurance anyway, and what it could do would interfere with overall performance.<br> <p> Similarly, if your application is seek-bound, it's in trouble anyway. If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.<br> <p> Hence the argument for reliable metadata, anyway: the application can't do that for itself, and it had better not depend on metadata operations being especially fast. Traditionally, serious databases used raw block devices to avoid depending on file system metadata.<br> </div> Sat, 06 Dec 2008 08:26:15 +0000 Correctness https://lwn.net/Articles/309956/ https://lwn.net/Articles/309956/ man_ls Interesting point: it seems I misread your post, so let me re-elaborate. Data journaling protects against half-written sectors, since they will not count as written. This leaves a power-off which causes physical damage to a disk, and yet the disk will not realize the sector is bad. Keep in mind that we have data journaling, so this particular sector will not be used until it is completely overwritten. That kind of damage must be permanent yet remain hidden when writing, which is why I deemed it impossible. It seems you have good cause to believe it can happen, so it would be most enlightening to hear any data points you may have. <p> As to your concerns about high data density and error rates, they are exactly what Mr Phillips happily dismisses: in practice they do not seem to cause any trouble. <p> Over-engineering is not a sound engineering practice either.
Sat, 06 Dec 2008 00:06:35 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309954/ https://lwn.net/Articles/309954/ njs <div class="FormattedComment"> <font class="QuotedText">&gt;Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise</font><br> <p> Well, maybe...<br> <p> Within reason, my goal is to have as much confidence as possible in my data's safety, with as little investment of my time and attention. Leaving safety up to individual apps is a pretty wretched system for achieving this -- it defaults to "unsafe", then I have to manually figure out which stuff needs more guarantees, which I'll screw up, plus I have to worry about all the bugs that may exist in the eleventeen different checksumming systems being used in different codebases... This is the same reason I do whole disk backups instead of trying to pick and choose which files to save, or leaving backup functionality up to each individual app. (Not as crazy an idea as it sounds -- that DVCS basically has its own backup system, for instance; but I'm not going around adding that functionality to my photo editor and word processor too.)<br> <p> Obviously if checksumming ends up causing unacceptable slowdowns, then compromises have to be made. But I'm pretty skeptical; it's not like CRC (or even SHA-1) is expensive compared to disk access latency, and the Btrfs and ZFS folks seem to think usable full disk checksumming is possible.<br> <p> If it's possible I want it.<br> <p> </div> Fri, 05 Dec 2008 23:58:56 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309945/ https://lwn.net/Articles/309945/ ncm <div class="FormattedComment"> Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise. Then applications that matter (e.g. PostgreSQL, or your DVCS) can provide their own data checksums (and not pay twice) and operate on a reliable file system.<br> <p> This suggests a reminder for applications providing their own checksums: mix in not just the data, but your own metadata (block number, file id). Getting the right checksum on the wrong block is just embarrassing.<br> </div> Fri, 05 Dec 2008 22:52:05 +0000 Correctness https://lwn.net/Articles/309941/ https://lwn.net/Articles/309941/ ncm <i>Everything you say can be prevented by a more robust filesystem ...</i> <p> <b>FALSE</b>. I'm talking about hardware-level sector failures. A filesystem without checksumming can be made robust against reported bad blocks, but a bad block that the drive delivers as good can completely bollix ext3 or any fs without its own checksums. Drive manufacturers specify and (just) meet a rate of such bad blocks, low enough for non-critical applications, and low enough not to kill performance of critical applications that perform their own checking and recovery methods. <p> Denial is not a sound engineering practice. Fri, 05 Dec 2008 22:18:41 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309910/ https://lwn.net/Articles/309910/ liljencrantz <div class="FormattedComment"> Cool. Doesn't fix the lack of comments, though. *hint* :)<br> </div> Fri, 05 Dec 2008 19:39:03 +0000 Correctness https://lwn.net/Articles/309902/ https://lwn.net/Articles/309902/ man_ls Everything you say can be prevented by a more robust filesystem with data journaling, even without checksums. Ext3 with <a href="http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html">data=ordered</a> is an example.
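<p> (Picking up ncm's "mix in your own metadata" suggestion a few comments up, here is a minimal sketch of what that looks like for an application-level block checksum, assuming zlib's crc32(); the struct and function names are invented for illustration.)<br>
<pre>
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

/* Mix the block's identity (file id, block number) into the checksum so
 * that a stale or misplaced block fails verification even when its data
 * is internally consistent. */
struct blk_tag {
	uint64_t file_id;   /* which file this block belongs to */
	uint64_t block_nr;  /* logical position within the file */
};

uint32_t block_checksum(uint64_t file_id, uint64_t block_nr,
                        const unsigned char *data, size_t len)
{
	struct blk_tag tag = { file_id, block_nr };
	uLong crc = crc32(0L, Z_NULL, 0);

	crc = crc32(crc, (const Bytef *)&tag, (uInt)sizeof(tag)); /* metadata first */
	crc = crc32(crc, data, (uInt)len);                        /* then the payload */
	return (uint32_t)crc;
}
</pre>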
<p> Even with checksumming data integrity is not guaranteed: yes, the filesystem will detect that a sector is corrupt, but it still needs to locate a good previous version and be able to roll back to that version. Isn't it easier to just do data journaling? Fri, 05 Dec 2008 18:22:58 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309765/ https://lwn.net/Articles/309765/ njs <div class="FormattedComment"> +1<br> <p> I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).<br> <p> Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No telling how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.<br> <p> I totally understand not being able to implement everything at once, but if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).<br> <p> </div> Thu, 04 Dec 2008 21:38:24 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309755/ https://lwn.net/Articles/309755/ joern <div class="FormattedComment"> Oh, and there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time. However, before they have all been loaded, one might have an incorrect (slightly dated) picture of the filesystem. The easiest example I can come up with involves three blocks, A, B and C, where A points to B and B points to C:<br> A -&gt; B -&gt; C<br> <p> Now both B and C are rewritten, without updating their respective parent blocks (A and B):<br> A -&gt; B -&gt; C<br> B' C'<br> <p> B' and C' appear disconnected without reading up on all the promises. At this point, when mounting under memory pressure, order becomes important. If A is written out first, to release the "promise" on B', everything works fine. But when B is written out first, to release the "promise" on C', we get something like this:<br> A -&gt; B -&gt; C<br> B' C'<br> B"---^<br> <p> And now there are two conflicting "promises" on B' and B". A rather ugly situation.<br> </div> Thu, 04 Dec 2008 21:05:51 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309707/ https://lwn.net/Articles/309707/ dlang <div class="FormattedComment"> there is interest, but currently the only way to do this is via FUSE<br> <p> the hooks that are being proposed for file scanning are also being looked at as possibly being used for HSM type uses.<br> </div> Thu, 04 Dec 2008 17:42:33 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309706/ https://lwn.net/Articles/309706/ lysse <div class="FormattedComment"> Yes, and you'll still have backups for it in the future.
But wouldn't it be nice to have a way out of your last backup having gone up in flames at a really inconvenient time? Is there some reason why it would be desirable to limit the number of ways of thwarting Murphy we permit ourselves? Because honestly, I can't think of one...<br> </div> Thu, 04 Dec 2008 17:40:24 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309658/ https://lwn.net/Articles/309658/ joern <div class="FormattedComment"> <font class="QuotedText">&gt; Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.</font><br> <p> Excellent, you had the same idea. How do you deal with inode-&gt;i_size and inode-&gt;i_blocks changing on behalf of the "promise"?<br> </div> Thu, 04 Dec 2008 16:13:04 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309632/ https://lwn.net/Articles/309632/ etienne_lorrain@yahoo.fr <div class="FormattedComment"> For RAIDs, there should be a few selectable options:<br> - read (all mirrors) after writes, report an error if the contents differ (slow).<br> - write (all mirrors) and return if all writes are successful, post a read of the same data and report a delayed error if the contents differ.<br> - write (one mirror) and return as soon as possible, post writes to the other mirrors, then post a read of the same data (all mirrors) and report a delayed error if the contents differ.<br> Obviously, for the previous tests, you should run the disks with their caches disabled.<br> <p> These can run with the cache enabled:<br> - read all mirrors and compare the contents, report an error to the read operation if the contents differ (slow).<br> - read and return the first available data, but keep the data and compare when the other mirrors deliver theirs; report a delayed error if the mirrors have different data.<br> <p> That is better handled in the controller hardware itself; I do not know if any hardware RAID controller does it correctly.<br> I am not sure there is a defined clean way to report "delayed errors" in either SCSI or SATA, and there isn't any in the ATA interface (so booting from those RAID drives using the BIOS may be difficult).<br> Moreover, the "check data" (i.e.
read and compare) in SCSI is sometimes simply ignored by devices, so that may have to be implemented by reads in the controller itself.<br> I am not sure a lot of users would accept the delay penalties due to the amount of data transferred between the controller and the RAID disks...<br> </div> Thu, 04 Dec 2008 12:47:51 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309631/ https://lwn.net/Articles/309631/ smitty_one_each <div class="FormattedComment"> As a side effect, consider that you could end up making real deletion of information (say, credit card numbers) harder.<br> Amidst all the great work (which is well above my skill level, kudos to all) there are ramifications.<br> </div> Thu, 04 Dec 2008 12:15:19 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309625/ https://lwn.net/Articles/309625/ meuh <blockquote>The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review</blockquote> And here is <a href="http://hg.tux3.org/tux3?cs=05354dc10bec">changeset 580</a>: "The "Jon Corbet" patch. Get rid of SB and BTREE macros, spell it like it is." Thu, 04 Dec 2008 11:55:16 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309620/ https://lwn.net/Articles/309620/ biolo <div class="FormattedComment"> Does anyone know if there has been any talk about implementing HSM (Hierarchical Storage Management) in Linux? I'm aware there are one or two (very expensive) proprietary solutions out there, but it strikes me that now is a good time to at least consider how you would implement it and what you need from the various layers to handle it. Since we have two potential new generic file systems in the works, whose on-disk layouts haven't been fixed yet, I can't think of a better time. <br> <p> Obviously HSM is one of those things that crosses the traditional layering, but BTRFS at least is already handling multi layer issues.<br> <p> Implementing a Linux native HSM strikes me as one of those game changers; we'd have a huge feature none of the other OS's can currently match without large expenditure. I've lost count of the number of situations where organizations have bought hugely expensive SCSI or FC storage systems with loads of capacity, where what they actually needed was just a few high performance disks (or even SSDs nowadays) backed by a slower but high capacity set of SATA disks.
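<p> As a toy illustration of the kind of placement policy such an HSM would automate (a user-space sketch only, with a made-up age threshold; it merely reports candidates, whereas a real HSM would migrate data transparently below the filesystem):<br>
<pre>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define AGE_DAYS 90  /* arbitrary "cold data" threshold for the example */

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	DIR *d = opendir(dir);
	struct dirent *de;
	time_t now = time(NULL);

	if (!d)
		return 1;
	while ((de = readdir(d)) != NULL) {
		char path[4096];
		struct stat st;

		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
		    now - st.st_atime > (time_t)AGE_DAYS * 24 * 3600)
			printf("cold, candidate for the slow tier: %s\n", path);
	}
	closedir(d);
	return 0;
}
</pre>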
Even small servers or desktops probably have a use for this: that new disk you just bought to expand capacity is probably faster than the old one.<br> <p> Using tape libraries at the second or third level of the HSM has a few more complications, but could be tackled later.<br> <p> </div> Thu, 04 Dec 2008 11:32:52 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309619/ https://lwn.net/Articles/309619/ mjthayer <div class="FormattedComment"> Actually this patch<br> <p> <a href="http://patchwork.ozlabs.org/patch/6047/">http://patchwork.ozlabs.org/patch/6047/</a> (filesystem-freeze-implement-generic-freeze-feature.patch)<br> <p> might make general online fs checking doable.<br> </div> Thu, 04 Dec 2008 11:09:29 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309600/ https://lwn.net/Articles/309600/ zmi <cite>Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, I consider the spectre of disks passing bad data as good to be largely vendor FUD.</cite><br> <p> I must strongly object here. Over the last years, I have had 3 different customers, using 2 different RAID-controller vendors with 2 different disk types (SCSI, SATA), whose RAID contents were destroyed because of a broken disk that did not report (or detect) its errors. <p> The problem is that even RAID controllers do not "read-after-write" to verify the contents of a disk. So if the disk says "OK" after a write when in reality it is not, your RAID and filesystem contents still get destroyed (because the drive reads back different data than it wrote). <p> Another check could be "on every read also calculate the RAID checksum to verify", but for performance reasons nobody does that. <p> There REALLY should be filesystem-level checksumming, and a generic interface between filesystem and disk controller, where the filesystem can tell the RAID controller to switch to "paranoid mode", doing read-after-write of disk data. It's gonna be slow then, but at least the controller will find a broken disk and disable it - after that, it can switch to performance mode again. <p> Yes, our customers were quite unsatisfied that even with RAID controllers their data got broken. But the worst is that it takes a long time for customers to see and identify that there is a problem - you can only hope for a good backup strategy! Or for a filesystem doing checksumming. <p> Regards, zmi Thu, 04 Dec 2008 08:05:37 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309599/ https://lwn.net/Articles/309599/ daniel <div class="FormattedComment"> Online checking is planned. Offline checking is in progress.<br> </div> Thu, 04 Dec 2008 07:26:57 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309591/ https://lwn.net/Articles/309591/ mjthayer <div class="FormattedComment"> What about online fsck-ing, which I have not seen mentioned here yet? Surely that ought to be feasible if there is no update in place.
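<p> A rough sketch of how that freeze feature could be used to quiesce a filesystem around a read-only check, assuming the FIFREEZE/FITHAW ioctls from the patch linked above end up in linux/fs.h (the checking step itself is elided, and the function name is made up):<br>
<pre>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int check_while_frozen(const char *mountpoint)
{
	int fd = open(mountpoint, O_RDONLY);  /* an fd on the mounted fs */

	if (fd < 0)
		return -1;
	if (ioctl(fd, FIFREEZE, 0) < 0) {     /* flush dirty data, block new writes */
		close(fd);
		return -1;
	}
	/* ... walk and verify the now-stable on-disk structures here ... */
	ioctl(fd, FITHAW, 0);                 /* let writes resume */
	close(fd);
	return 0;
}
</pre>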
(I'm not actually sure why it is not generally feasible with journalling filesystems, possibly excluding the journal itself, at least if you temporarily disable journal write out).<br> </div> Thu, 04 Dec 2008 06:11:26 +0000 Correctness https://lwn.net/Articles/309576/ https://lwn.net/Articles/309576/ ncm <i>having caught exactly zero blocks of bad data passed as good</i> <p> Evidently Daniel hasn't worked much with disks that are powered off unexpectedly. There's a widespread myth (originating where?!) that disks detect a power drop and use the last few milliseconds to do something safe, such as finish up the sector they're writing. <b>It's not true</b>. A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds. Some hard read failures, even if duly reported, count as real damage, and are not unlikely. <p> Your typical journaled file system doesn't protect against power-off scribbling damage, as fondly as so many people wish and <i>believe</i> with all their little hearts. <p> Even without unexpected power drops, it's foolish to depend on more reliable reads than the manufacturer promises, because they trade off marginal correctness (which is hard to measure) against density (which is on the box in big bold letters). What does the money say to do? PostgreSQL uses 64-bit block checksums because they care about integrity. It's possibly reasonable to say that theirs is the right level for such checking, but not to say there's no need for it. Thu, 04 Dec 2008 04:06:45 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309537/ https://lwn.net/Articles/309537/ Ze >Hardware failures and accidental deletion is what we have backups for I would argue that accidental deletion is one of the things that versioning should handle. Unfortunately backups offer only limited granularity, along with people failing to use or test them. When you combine all that, you can see why people have a clear need for data recovery tools. People clearly feel that need, given that there are quite a few such tools on the market, both free and commercial. It makes sense to consider that use case when designing a file system. It can only lead to a better documented and designed file system. Thu, 04 Dec 2008 03:54:45 +0000 Tux3: the other next-generation filesystem https://lwn.net/Articles/309345/ https://lwn.net/Articles/309345/ daniel <i>Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd.</i> <br><br> And OCFS2. JBD was created at a time when it seemed as though all future filesystems would be journalling filesystems. Incidentally, any filesystem developer who overlooks Stephen Tweedie's copious writings on the JBD design does so at their peril, whether they intend to use journalling or some other atomic commit model.