It works differently in btrfs than in xfs and ext4 because fsyncs go through a special logging mechanism, so an fsync on one file won't have to wait for the rename flush on any other files in the FS.
I'll go ahead and queue this patch for 2.6.30.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 8:38 UTC (Mon) by njs (guest, #40338)
I'm curious what I'm missing...
Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525)
I'm curious, too. I thought btrfs did it right, by doing COW logging of data & metadata and making data=ordered mandatory, with all the explanation in the FAQ that makes complete sense (correct checksums in the metadata also mean correct data). Now Chris Mason tells us he didn't? OK, this will be fixed in 2.6.30, and for now none of us expects btrfs to be perfect. We expect bugs to be fixed, and that is going well.

IMHO a robust file system should preserve data operation ordering, so that the file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation, for example, can allocate all new inodes into a single consecutive block, and delayed directory name allocation can put all new names into consecutive data, as
Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138)
This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.
When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.
I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similarly to the way ext3 does today. Files that have been through a rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that came from the current transaction).
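To make the pattern under discussion concrete, here is a minimal sketch (not from the thread; the replace_file() helper and the file names are made up) of the conservative version of the "open-write-close-rename" sequence, with an explicit fsync() so the new contents reach disk before the rename makes them visible:

    /* A minimal sketch, not from the thread: the replace_file() helper and
     * file names are illustrative.  It writes the new contents to a
     * temporary file, forces them to disk, and only then renames over the
     * destination. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* Write the new contents and push them to stable storage before
         * the rename below makes them visible under the old name. */
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        /* rename() atomically replaces the destination: a reader (or a
         * crash) sees either the complete old file or the complete new one. */
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        const char msg[] = "new configuration\n";
        return replace_file("example.conf", msg, strlen(msg)) ? 1 : 0;
    }

With the 2.6.30 patches described above (and ext3's data=ordered behaviour today), the renamed file's data is flushed before the rename is committed, so even without the fsync() a crash leaves either the old or the new contents; the explicit fsync() remains the portable, conservative form.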
Posted Mar 16, 2009 21:23 UTC (Mon) by njs (guest, #40338)
Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138)
Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.
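As a rough illustration of what that guarantee saves applications from doing (the path and record below are made up), this is the defensive append an application is forced into when it cannot rely on the filesystem ordering data before the metadata commit:

    /* Illustrative only; the path and record are made up.  fdatasync()
     * does not return until the appended bytes, and the metadata needed
     * to read them back (such as the new size), are on stable storage. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int append_record(const char *path, const char *rec, size_t len)
    {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
            return -1;

        /* Without data-before-metadata ordering, a crash after the size
         * update is committed but before the data reaches disk could
         * expose stale block contents in the newly extended tail. */
        if (write(fd, rec, len) != (ssize_t)len || fdatasync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        const char rec[] = "event: 42\n";
        return append_record("app.log", rec, strlen(rec)) ? 1 : 0;
    }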
Posted Mar 16, 2009 22:56 UTC (Mon) by njs (guest, #40338)
Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (subscriber, #4931)
Sorry, this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst-case scenario -- a size of zero after a crash -- precisely violates atomicity.

For the record, the first two paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e., update the metadata -- before the data blocks are committed, then the operations are occurring out of order and the ext4 open-write-close-rename mayhem ensues.
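For reference, the sequence being argued about looks roughly like this (the file names are illustrative, and there is deliberately no fsync()); whether a crash can leave the destination empty depends on whether the filesystem orders the data writes ahead of the transaction that commits the rename:

    /* The "open-write-close-rename" sequence under discussion, with no
     * fsync(); names are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char data[] = "user preferences\n";

        int fd = open("settings.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        if (write(fd, data, strlen(data)) < 0)   /* data may still sit in the page cache */
            perror("write");
        close(fd);

        /* If the rename is committed to disk before the data blocks are,
         * a crash at this point leaves a zero-length "settings": the
         * outcome described above as a violation of rename atomicity. */
        if (rename("settings.tmp", "settings") != 0)
            perror("rename");
        return 0;
    }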