I'm definitely in the app programmer camp, not kernel, yet to me it seems bloody obvious that if you open a file using O_TRUNC you're creating a period of time where the file will be empty.
The only thing that surprised me was that writing to a second file and then renaming it over the first wasn't fully sufficient.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:15 UTC (Fri) by dcoutts (guest, #5387)
Exactly. The truncate method can obviously cause data loss, which is why we use the atomic rename method. The problem is that ext4 re-orders the rename() before the write(). That is the broken behavior.
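For anyone following along, here is a rough sketch of the two methods being compared. It is only an illustration: the function names and the ".tmp" suffix are made up for the example, and neither function calls fsync, so it shows what other processes can observe, not what survives a crash.

```python
import os
import tempfile

def _write_all(fd, data):
    # os.write() may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]

def save_by_truncate(path, data):
    # O_TRUNC empties the file at open() time; until the write completes
    # the file is zero-length or partial.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size_after_open = os.fstat(fd).st_size  # old contents are already gone
        _write_all(fd, data)
    finally:
        os.close(fd)
    return size_after_open

def save_by_rename(path, data):
    # Write the new contents to a temporary file in the same directory,
    # then rename it over the target. Other processes see either the old
    # file or the new one, never an empty one.
    tmp = path + ".tmp"  # hypothetical temp name, chosen for this sketch
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic replace on POSIX filesystems
```

The value returned by save_by_truncate is the file size right after the open, which is the empty window the truncate method creates.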
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:06 UTC (Fri) by jgg (guest, #55211)
Yeah, I think you have hit this exactly on the head. Reading through Ted Ts'o's comments on the blog, I think he even confirmed that not preserving ordering is a change in behavior since ext3.
This whole discussion has really not been focused much on what actually are sane behaviors for a FS to have across an unclean shutdown. To date most writings by FS designers I've read seem to focus entirely on avoiding FSCK and avoiding FS meta-data inconsistencies. Very few people seem to talk about things like what the application sees/wants.
One of the commenters on the blog had the best point - insisting on adding fsync before every close/rename sequence (either implicitly in the kernel as has been done, or explicitly in all apps) is going to badly harm performance. 99% of these cases do not need the data on the disk, just the write/close/rename order preserved.
Getting great performance by providing weak guarantees is one thing, but then insisting everyone who cares about their data use explicit calls that provide a much stronger and slower guarantee is kinda crazy. Just because POSIX is silent on this matter doesn't mean FS designers should get a free pass on transactional behaviors that are so weak they are useless.
For instance, under the same POSIX arguments Ted is making, it would be perfectly legitimate for a write/fsync/close/rename to still erase both files because you didn't do an fsync on the directory! Down this path lies madness - at some point the FS has to preserve order!
I wonder how bad a hit performance sensitive apps like rsync will get due to the flushing on rename patches?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:16 UTC (Fri) by endecotp (guest, #36428)
> you didn't do a fsync on the directory!
Yes, I was just thinking the same thing! Come on Ted, what exactly do you want us to write to be portably safe? I have just added an fsync() to my write() close() rename() code, but I checked man fsync first and it tells me that I need to fsync the directory. So is it:
? Or some re-ordering of that? Is there more? Do I have to fsync() the directories up to the root? Can I avoid all this if I call sync()?
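One possible reading of that sequence, as a hedged sketch rather than a confirmed answer: write to a temp file, fsync it, close it, rename it over the target, then fsync the containing directory. The name durable_replace is invented for the example, and whether the directory fsync is actually required depends on the filesystem, which is exactly what is being asked here.

```python
import os
import tempfile

def durable_replace(path, data):
    """Replace `path` with `data`, flushing the file and then its directory."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so rename() stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        view = memoryview(data)
        while view:                      # os.write() may write fewer bytes than asked
            view = view[os.write(fd, view):]
        os.fsync(fd)                     # flush file data before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)                 # atomic replace on POSIX
    dfd = os.open(dirname, os.O_RDONLY)  # a directory can be opened read-only
    try:
        os.fsync(dfd)                    # flush the directory entry itself
    finally:
        os.close(dfd)
```

This does not fsync parent directories up to the root, and it leaves the temp file behind if the write fails; both are simplifications of the sketch.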
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:17 UTC (Fri) by alexl (subscriber, #19068)
I don't think that is quite necessary for durability.
If the metadata is not written out but the data is and then things crash, you will just have the old file as it was, and either a written file+inode with no name (moved to lost+found) or the written file with the temporary name.
As far as I can see, syncing the directory is not needed. (Unless you want to guarantee the file being on disk, rather than just not breaking the atomic file replace behaviour.)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:41 UTC (Fri) by masoncl (subscriber, #47138)
The directory fsync requirements came from ext2. For the journaled filesystems, an fsync on the file will get you the dir as well.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:03 UTC (Fri) by endecotp (guest, #36428)
OK. So if I want code that's portable to ext2, I need to fsync the directory. Maybe there aren't many people using ext2 these days, but I would like code that's genuinely portable; I do personally care about the various flash filesystems, and when I break things for BSD users they complain. So I guess the directory fsync is needed.
Thinking a bit more about this from the "application requirements" point of view, I can see three cases:
1- The change needs to be atomic wrt other processes running concurrently.
2- The change needs to be atomic if this process terminates (ctrl-C, OOM).
3- The change needs to be atomic if the system crashes.
I can't think of a scenario where the application author would reasonably say, "I need this data to be safe in cases 1 and 2 but I don't care about 3." Can anyone else?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 22:45 UTC (Fri) by jgg (guest, #55211)
> I can't think of a scenario where the application author would reasonably say, "I need this data to be safe in cases 1 and 2 but I don't care about 3." Can anyone else?
It isn't that uncommon really. Anytime you want to protect against a failing program and not mess up its output file, rename is the best way. For instance, programs that are used with make should be careful not to leave garbage around if they crash. rsync and other downloading software do the rename trick too, for the same basic reasons. None of these uses require fsync or other performance-sucking things.
The reason things like emacs and vim are so careful is because they are almost always handling critical data. I don't think anyone would advocate rsync should use fsync.
The considerable variations in what FSs do is also why, as an admin, I have a habit of knocking off a quick 'sync' command after finishing some adminy task just to be certain :)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 10:37 UTC (Mon) by endecotp (guest, #36428)
> Come on Ted, what exactly do you want us to write to be portably safe?
Ted seems to have answered this in his second blog post: YES you DO need to fsync the directory if you want to be certain that the metadata has been saved.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:28 UTC (Fri) by amk (subscriber, #19)
There's also the broken behaviour of the applications that are updating their config files every minute, which shouldn't be forgotten. If saves are being explicitly requested by the user, that's pretty infrequent and the crash has to occur within a window of vulnerability, but constant resaves mean a crash will inevitably cause data loss.
Why is this behaviour broken? It's perfectly normal behaviour...
Posted Mar 14, 2009 11:10 UTC (Sat) by khim (subscriber, #9252)
Take a P2P client. A good P2P client will keep information about peers for
each file - this way, if the system is rebooted, the lengthy process of
finding peers can be avoided. Since there are hundreds (sometimes
thousands) of peers, this means hundreds of files are rewritten every minute
or so. If a filesystem cannot provide guarantees without fsync, I just refuse
to use it. XFS went this way. XFS developers long argued for their right to
destroy files on a crash, and we've all agreed that they can do this
and that I can answer the question "What do you think about XFS?" with just
"Don't use it. Ever." And everyone was happy.
Looks like tytso actually fixed the problem in ext4 (even if the actual
words were akin to "application developers are crazy and this is incorrect
usage but we can not replace all of them"), so at least I can conclude he's
more sane than the XFS developers...