
Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 23:17 UTC (Sat) by masoncl (subscriber, #47138)
In reply to: Ts'o: Delayed allocation and the zero-length file problem by masoncl
Parent article: Ts'o: Delayed allocation and the zero-length file problem

Testing here shows that I can change the btrfs rename code to make sure the data for the new file is on disk before the rename commits without any performance penalty in most workloads.

This works differently in btrfs than in xfs and ext4 because fsyncs go through a special logging mechanism, so an fsync on one file won't have to wait for the rename flush on any other file in the FS.

I'll go ahead and queue this patch for 2.6.30.
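
For reference, a minimal sketch in C of the open-write-close-rename pattern this change protects (file names here are made up for illustration). With the queued patch, the temporary file's data reaches disk before the rename commits, so a crash leaves either the old file or the complete new one:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *tmp = "config.tmp";   /* hypothetical temporary name */
        const char *dst = "config";       /* hypothetical destination */
        const char buf[] = "new contents\n";

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, buf, sizeof buf - 1) != (ssize_t)(sizeof buf - 1)) {
            perror("write"); close(fd); return 1;
        }
        /* With the queued btrfs patch (as with ext3-style data=ordered)
         * this fsync is not needed for the rename itself to be crash
         * safe; a fully portable application would still issue it. */
        if (fsync(fd) < 0)
            perror("fsync");
        close(fd);

        /* Atomic replace: after a crash, "config" is either the old
         * file or the complete new one. */
        if (rename(tmp, dst) < 0) { perror("rename"); return 1; }
        return 0;
    }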


Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 8:38 UTC (Mon) by njs (guest, #40338)

So, uh... doesn't the Btrfs FAQ claim that this is the default, indeed required, behavior already?

http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_hav...

I'm curious what I'm missing...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525)

I'm curious, too. I thought btrfs did it right, by using copy-on-write logging of data and metadata and making data=ordered mandatory, with all the explanations in the FAQ that make complete sense (correct checksums in the metadata also imply correct data). Now Chris Mason tells us it didn't? Ok, this will be fixed in 2.6.30, and for now, none of us expects btrfs to be perfect. We expect bugs to be fixed, and that is going well.

IMHO a robust file system should preserve data operation ordering, so that the file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation, for example, can place all new inodes into a single consecutive block, and delayed directory name allocation can likewise place all new names into consecutive data).
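
To make the coalescing argument concrete, a toy illustration in C (all names invented; this is not real file system code): delaying inode allocation lets many new inodes land in one consecutive block, flushed with a single write instead of one seek-and-write each.

    #include <stdio.h>

    #define INODES_PER_BLOCK 32

    struct inode { int num; };

    /* Stand-in for the disk: one call == one contiguous write, no seeks. */
    static void write_block(const struct inode *batch, int n)
    {
        printf("one write covering %d inodes (%d..%d)\n",
               n, batch[0].num, batch[n - 1].num);
    }

    int main(void)
    {
        struct inode pending[INODES_PER_BLOCK];
        int n = 0;

        /* Delayed allocation: just queue the new inodes in memory... */
        for (int i = 0; i < 10; i++)
            pending[n++] = (struct inode){ .num = 1000 + i };

        /* ...then place them consecutively and flush them in one write. */
        write_block(pending, n);
        return 0;
    }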

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138)

The btrfs data=ordered implementation is different from ext3/4 and reiserfs. It decouples data writes from the metadata transaction, and simply updates the metadata for file extents after the data blocks are on disk.

This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.

When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.

I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similarly to the way ext3 does today. Files that have been through a rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that were from the current transaction).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 21:23 UTC (Mon) by njs (guest, #40338)

...Is what you're saying that for btrfs, metadata about extents (like disk location and checksums, I guess) is handled separately from metadata about filenames, and traditionally only the former had data=ordered-style guarantees? (Just trying to see if I understand.)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138)

That's correct. The main point behind data=ordered is to make sure that if you crash you don't have extent pointers in the file pointing to extents that haven't been written since they were allocated.

Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.
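
A toy model of that invariant in C (all names invented; this is not btrfs code): extent pointers are published into the metadata only after the data they reference is durable.

    #include <stdio.h>

    struct extent { long block; long len; int on_disk; };
    struct inode_meta { struct extent *ext; };

    /* Stand-in for writing the data blocks and waiting for the I/O. */
    static void write_data(struct extent *e)
    {
        e->on_disk = 1;
    }

    /* The data=ordered invariant: never publish a pointer to an
     * extent whose data has not reached disk. */
    static void point_metadata(struct inode_meta *m, struct extent *e)
    {
        if (!e->on_disk) {
            fprintf(stderr, "bug: extent not durable yet\n");
            return;
        }
        m->ext = e;
    }

    int main(void)
    {
        struct extent e = { .block = 1024, .len = 8, .on_disk = 0 };
        struct inode_meta m = { 0 };

        write_data(&e);          /* data first...      */
        point_metadata(&m, &e);  /* ...metadata second */
        printf("metadata now references block %ld\n", m.ext->block);
        return 0;
    }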

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:56 UTC (Mon) by njs (guest, #40338)

That makes sense. Thanks.

Ts'o: Delayed allocation and the zero-length file problem

Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (guest, #4931)

"When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash."

Sorry, this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst case scenario -- a size of zero after a crash -- precisely violates atomicity.

For the record, the first two paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e., update the metadata -- before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.

