
Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 18:47 UTC (Fri) by masoncl (subscriber, #47138)
In reply to: Ts'o: Delayed allocation and the zero-length file problem by forthy
Parent article: Ts'o: Delayed allocation and the zero-length file problem

Just to clarify what will happen on btrfs.

If you:

create(file1)
write(file1, good data)
fsync(file1)

create(file2)
write(file2, new good data)
rename(file2 to file1)

FS transaction happens to commit

<crash>

file2 has been renamed over file1, and that rename was atomic. file2 either completely replaces file1 in the FS or file1 and file2 both exist.

But, no fsync was done on file2's data, and file2 replaced file1. After the crash, the file1 you see in the directory tree may or may not have data in it.

The btrfs delalloc implementation is similar to XFS in this area. File metadata is updated after the file IO is complete, so you won't see garbage or null bytes in the file, but the file might be zero length.
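
For reference, the fsync-before-rename variant of the sequence above, which closes the zero-length window, can be sketched in C. `safe_replace()` is an illustrative name, not an existing API, and error handling is kept minimal:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write data to a temporary file, fsync() it, then rename() it over
 * the destination. After a crash the destination is either the complete
 * old file or the complete new file, never a zero-length one. */
int safe_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* rename() is atomic: other processes see old or new, nothing else. */
    return rename(tmp, path);
}
```

The fsync() is what forces the new file's data to disk before the rename's metadata can commit; without it you are back to the scenario above.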



Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 19:13 UTC (Fri) by foom (subscriber, #14868) [Link] (14 responses)

> The btrfs delalloc implementation is similar to XFS in this area. File metadata
> is updated after the file IO is complete, so you won't see garbage or null
> bytes in the file, but the file might be zero length.

Better get to fixing that, then!

I'm rather amused at the number of comments along the lines of "XFS already does it so it must be
okay!" A filesystem known for pissing off users by losing their data after power-outages is not one
I'm happy to see filesystem developers hold up as a shining example of what to do...

(and apparently XFS doesn't even do this anymore, after the volume of complaints raised against it,
according to other comments!)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 19:44 UTC (Fri) by masoncl (subscriber, #47138) [Link] (13 responses)

XFS closes this window on truncate but not renames.

This is a fundamental discussion about what the FS is supposed to implement when it does a rename.

The part where applications expect rename to also mean fsync is a new invention with broad performance implications.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:11 UTC (Fri) by alexl (guest, #19068) [Link] (12 responses)

Nobody is expecting rename to imply fsync!
This isn't about having the data on disk *now*.

We just expect it not to write the new metadata for "newpath" to disk before the data in oldpath is on disk.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:34 UTC (Fri) by masoncl (subscriber, #47138) [Link] (11 responses)

The goal is to make sure the data for the new file is on disk before (or in the same transaction as) the metadata for the rename.

We have two basic choices to accomplish this:

1) Put the new file into a list of things that must be written before the commit is done. This is pretty much what the proposed ext4 changes do.

2) Write the data before the rename is complete.

The problem with #1 is that it reintroduces the famous ext3 fsync behavior that caused so many problems with firefox. It does this in a more limited scope, just for files that have been renamed, but it makes for very large side effects when you want to rename big files.

The problem with #2 is that it is basically fsync-on-rename.

The btrfs fsync log would allow me to get away with #1 without too much pain, because fsyncs don't force full btrfs commits and so they won't actually wait for the renamed file data to hit disk.

But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?

Applications have known how to get consistent data on disk for a looong time. Mail servers do it, databases do it. Changing rename to include significant performance penalties when it isn't documented or expected to work this way seems like a very bad idea to me.

I'd much rather make a new system call or flag for open that explicitly documents the extra syncing, and give application developers the choice.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:46 UTC (Fri) by alexl (guest, #19068) [Link] (7 responses)

I think "and give application developers the choice." is a fallacy.

At the level of the close happening we don't really know what kind of data this is, as this is generally some library helper function. And even at the application level, how do you know that it's important not to zero out a file on crash? It depends on how the user uses the application.

It all comes back to the fact that for "proper" use of this, more or less all cases would turn into sync-on-write (or the new flag or whatever). So the difference from the filesystem-wide implementation of this will get smaller as apps get "fixed", until the difference is more or less zero.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:19 UTC (Fri) by drag (guest, #31333) [Link] (6 responses)

Well all you guys are over my head.

But as has been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc. All sorts of them. Right?

So you have the choice of making undocumented behavior documented and then forcing that behavior on all the file systems that Linux supports and all the file systems that you expect your application to run on, or you can fix the application to do it right.

And the assumptions that were made to create this bad behavior are not even true. So it's not even a question of backward compatibility... They've always gotten it wrong; it's just been dumb luck that it wasn't a bigger issue in the past.

As long as file systems are async, you're going to have a delay between when the data is created and when that data is committed to disk. You can do all sorts of things to help reduce the damage that can cause, but it's still the fundamental nature of the beast. If you lose power or crash your computer you WILL lose some data.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 0:38 UTC (Sat) by drag (guest, #31333) [Link] (2 responses)

Alright...

I've been reading what has been written and I think I am understanding what is going on better now.

But here is my new thought on the subject:

The reason why they are going:
create file 1
write file 1
.... (time passes)
create file 2
write file 2
rename file 2 over file 1

Is because they are trying to get an 'atomic' operation, right?

They are making a few assumptions about the file system:
A. The file system does not operate in atomic operations, and thus they have to do this song and dance to do the file system's work for it (protect their data).
B. While the file system is going to fail them otherwise, it is still able to do a rename in one single operation.
C. By renaming the file they are actually telling the file system to commit it to disk.

So in effect they are trying to outguess or outthink the operating system. But their assumptions, in the case of ext4 and most others, are not correct, and their software is doing what they told it to do; it's just that what they told it to do is not what they think it's doing.

You see, they are already putting extra effort into compensating for the file system. So if they are putting the extra effort into outthinking the OS, why don't they at least do it correctly?

Instead of writing out hundreds of files and trying the atomic-rename trick, which isn't really right anyway, there are half a dozen different design approaches that would yield better results.

Or am I completely off-base here?

I understand the need for the file system to protect a user's data despite what the application developers actually wrote. Really I do.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 8:15 UTC (Sat) by alexl (guest, #19068) [Link] (1 responses)

Rename has been documented by POSIX and Unix since forever to be atomic. That is not some form of "workaround" or "compensation", but a solid, safe, well-documented way to write files. However, those atomicity guarantees are only valid if the system doesn't crash (as crashes are not specified by POSIX).

The "atomic" part is protection against other apps that are saving to the file at the same time, not crashes. The fsync is only required not to get problems when the system crashes.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 16:45 UTC (Sat) by drag (guest, #31333) [Link]

Thanks. I realise now that I was off-base. I understand now it has to do with application-land and not so much with file system stuff. :)

Not getting it right

Posted Mar 14, 2009 12:17 UTC (Sat) by man_ls (guest, #15091) [Link]

The "other applications get it right with fsync" part is a bit of a fallacy. When you save a document in an application (be it a word processor or a mail client) you really want it to be safe, and it is normally not an issue to wait a few seconds. IOW the user is expected to wait for disk activity because we have been trained this way, so we are willing to accept this trade-off.

But there are other programs doing file operations all the time, and nobody wants to wait a few seconds for them. Most notably background tasks like the operating system or a desktop environment. Is it reasonable to expect all of them to do something which potentially slows the system to a crawl on other filesystems, just to play safe with the newcomer ext4?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 19, 2009 19:34 UTC (Thu) by anton (subscriber, #25547) [Link] (1 responses)

> But as it's been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc. All sorts of them. Right?

Emacs did get it right on the occasion when UFS lost the file I had just written out from Emacs, along with the autosave file. It was UFS that got it wrong, just as ext4 gets it wrong now. There may be applications that manage to work around the most crash-vulnerable file systems in many cases, but that does not mean that the file system has sensible crash consistency guarantees.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 19, 2009 19:50 UTC (Thu) by alexl (guest, #19068) [Link]

That is true, and I "fixed" the glib saver code to also fsync() before rename in the case where the rename would replace an existing file.

However, all apps doing such syncing results in lower overall system performance than if the system could guarantee data-before-metadata on rename. So, ideally we would either want a new call to use instead of fsync, or a way to tell if the guarantees are met so that we don't have to fsync.
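
A minimal sketch of that conditional sync, assuming the same approach as the glib fix (presumably in `g_file_set_contents()`); the function name here is illustrative, not the actual glib code:

```c
#include <fcntl.h>
#include <unistd.h>

/* fsync() the freshly written file only when the rename target already
 * exists -- only in that case can a crash destroy previously good data.
 * When the destination is new, a crash loses nothing that was safe before,
 * so the expensive fsync() can be skipped. */
int rename_replace(int fd, const char *tmp_path, const char *dest)
{
    if (access(dest, F_OK) == 0 && fsync(fd) != 0)
        return -1;
    return rename(tmp_path, dest);
}
```

This keeps the common create-a-new-file path cheap while still protecting the replace-an-existing-file path that started this whole discussion.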

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:16 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

It used to be that losing *your entire filesystem* upon power loss was a possible failure mode.
Whether an app called fsync for a file or not is rather irrelevant in such a case. Obviously, this
kind of failure is allowed by the standards. Just as obviously, it sucks for users, so people made
better filesystems that don't have that failure mode. That's good, I quite enjoy using a system
that doesn't lose my entire filesystem randomly if the power fails.

So it seems to me that claiming that, because failing to call fsync before rename is "incorrect" API
usage, it's okay to lose both old data and new is simply wishful thinking on the part of
the filesystem developers. Sure, it may be allowed by the standards (as would be zeroing out the
entire partition...), but it sucks for users of that filesystem. So filesystems shouldn't do it. That's
really all there is to it.

Unless *every* call to rename is *always* preceded by a call to fsync (including those in "mv"
and such), it will suck for users. And there's really no point in forcing everyone to put fsync()s
before every rename, when you could just make the filesystem work without that, and get to the
same place faster.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 7:35 UTC (Sun) by phiggins (guest, #5605) [Link]

Not every rename() call needs to be preceded by fsync(). If the source file is known to be on disk already, there's no point in calling fsync(). The problem case is code that knowingly creates a new file, does not call fsync() (so there is no reason whatsoever to assume that the data is on disk), and then calls rename() to replace an existing file. I do think that the behavior of persisting the update to the directory before saving the new file's data is bizarre and likely to cause problems, though. There may not be a required ordering for those operations, but having them reordered is clearly confusing.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 18:29 UTC (Fri) by anton (subscriber, #25547) [Link]

> But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?

The API in the non-crash case is defined by POSIX, so I translate this as: What guarantees in case of a crash should the file system give?

One ideal is to perform all operations synchronously. That's very expensive, however.

The next-cheaper ideal is to preserve the order of the operations (in-order semantics), i.e., after crash recovery you will find the file system in one of the states that it logically had before the crash; the file system may lose the operations that happened after some point in time, but it will be just as consistent as it was at that point in time. If the file system gives this guarantee, any application that is written to be safe against being killed will also have consistent (but not necessarily up-to-date) data in case of a system crash.

This guarantee can be implemented relatively cheaply in a copy-on-write file system, so I really would like Btrfs to give that guarantee, and give it for its default mode (otherwise things like ext3's data=journal debacle will happen).

How to implement this guarantee? When you decide to do another commit, just remember the then-current logical state of the file system (i.e., which blocks have to be written out), then write them out, then do a barrier, and finally the root block. There are some complications: e.g., you have to deal with some processes being in the middle of some operation at the time; and if a later operation wants to change a block before it is written out, you have to make a new working copy of that block (in addition to the one waiting to be written out), but that's just a variation on the usual copy-on-write routine.
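
The scheme can be modelled with a toy in-memory sketch: freeze the dirty blocks, "write" them out, barrier, then update the root. Writes arriving mid-commit only touch the working copies, so the frozen snapshot stays consistent. All names here are invented for illustration; this is not Btrfs code:

```c
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   32

static char disk[NBLOCKS][BLKSZ];     /* what would survive a crash    */
static char working[NBLOCKS][BLKSZ];  /* current logical state         */
static char frozen[NBLOCKS][BLKSZ];   /* snapshot taken at commit time */
static int  dirty[NBLOCKS];           /* changed since the last commit */
static int  to_flush[NBLOCKS];        /* blocks this commit must write */
static int  root_generation;          /* stands in for the root block  */

void fs_write(int blk, const char *data)
{
    strncpy(working[blk], data, BLKSZ - 1);
    working[blk][BLKSZ - 1] = '\0';
    dirty[blk] = 1;
}

void begin_commit(void)
{
    /* Remember the then-current logical state: freeze the dirty blocks.
     * Later writes go to `working`, the new working copy. */
    memcpy(frozen, working, sizeof(frozen));
    memcpy(to_flush, dirty, sizeof(to_flush));
    memset(dirty, 0, sizeof(dirty));
}

void finish_commit(void)
{
    /* Write the frozen blocks out... */
    for (int i = 0; i < NBLOCKS; i++)
        if (to_flush[i])
            memcpy(disk[i], frozen[i], BLKSZ);
    memset(to_flush, 0, sizeof(to_flush));
    /* ...a write barrier would go here, and only then does the root
     * block make the new state visible after a crash. */
    root_generation++;
}
```

A crash before `root_generation` is bumped leaves the old consistent state; a crash after it leaves the new one. Either way, no out-of-order mixture is visible.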

You would also have to decide how to deal with fsync() when you give this guarantee: can fsync()ed operations run ahead of the rest (unlike what you normally guarantee), or do you just perform a full sync when an fsync is requested?

The benefits of providing such a guarantee would be:

  • Many applications that work well when killed would work well on Btrfs even upon a crash.
  • It would be a unique selling point for Btrfs. Other popular Linux file systems don't guarantee anything at all, and their maintainers only grudgingly address the worst shortcomings when there's a large enough outcry, while complaining about "incorrect API usage" by applications; some play fast and loose in other ways (e.g., by not using barriers). Many users value their data more than these maintainers do and would hopefully flock to a filesystem that actually gives crash consistency guarantees.
If you don't even give crash consistency guarantees, I don't really see a point in having the checksums that are one of the main features of Btrfs. I have seen many crashes, including some where the file system lost data, but I have never seen hardware go bad in a way where checksums would help.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 23:17 UTC (Sat) by masoncl (subscriber, #47138) [Link] (7 responses)

Testing here shows that I can change the btrfs rename code to make sure the data for the new file is on disk before the rename commits without any performance penalty in most workloads.

It works differently in btrfs than xfs and ext4 because fsyncs go through a special logging mechanism, and so an fsync on one file won't have to wait for the rename flush on any other files in the FS.

I'll go ahead and queue this patch for 2.6.30.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 8:38 UTC (Mon) by njs (subscriber, #40338) [Link] (6 responses)

So, uh... doesn't the Btrfs FAQ claim that this is the default, indeed required, behavior already?

http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_hav...

I'm curious what I'm missing...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525) [Link]

I'm curious, too. I thought btrfs did it right, by being COW-logging of data & metadata and having data=ordered mandatory, with all the explanation in the FAQ that makes complete sense (correct checksums in the metadata also mean correct data). Now Chris Mason tells us he didn't? OK, this will be fixed in 2.6.30, and for now, we all don't expect btrfs to be perfect. We expect bugs to be fixed, and that's going well.

IMHO a robust file system should preserve data operation ordering, so that a file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation e.g. can allocate all new inodes into a single consecutive block, delayed directory name allocation all new names into consecutive data, as well).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138) [Link] (4 responses)

The btrfs data=ordered implementation is different from ext3/4 and reiserfs. It decouples data writes from the metadata transaction, and simply updates the metadata for file extents after the data blocks are on disk.

This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.

When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.

I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similarly to the way ext3 does today. Files that have been through a rename will get flushed before the commit is finalized (plus or minus some optimizations to skip it for destination files that were created in the current transaction).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 21:23 UTC (Mon) by njs (subscriber, #40338) [Link] (2 responses)

...Is what you're saying that for btrfs, metadata about extents (like disk location and checksums, I guess) is handled separately from metadata about filenames, and traditionally only the former had data=ordered-style guarantees? (Just trying to see if I understand.)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138) [Link] (1 responses)

That's correct. The main point behind data=ordered is to make sure that if you crash you don't have extent pointers in the file pointing to extents that haven't been written since they were allocated.

Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:56 UTC (Mon) by njs (subscriber, #40338) [Link]

That makes sense. Thanks.

Ts'o: Delayed allocation and the zero-length file problem

Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (guest, #4931) [Link]

"When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash."

Sorry, this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst-case scenario -- a size of zero after a crash -- precisely violates atomicity.

For the record, the first two paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e., update the metadata -- before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds