LWN: Comments on "Ts'o: Delayed allocation and the zero-length file problem" https://lwn.net/Articles/323169/ This is a special feed containing comments posted to the individual LWN article titled "Ts'o: Delayed allocation and the zero-length file problem". en-us Fri, 24 Oct 2025 22:58:21 +0000 Fri, 24 Oct 2025 22:58:21 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Atomicity vs durability https://lwn.net/Articles/327850/ https://lwn.net/Articles/327850/ pgoetz <div class="FormattedComment"> Who gives a flying fruitcake about what POSIX requires?! It's not acceptable for a user to edit, say, her thesis, which she's been working on for 18 months and which has been saved thousands of times, and -- upon system crash -- find that not only did she lose her most recent 15 minutes' worth of changes (acceptable) but in fact THE ENTIRE FILE. Putting the onus on application developers to fsync early and often is beyond ridiculous.<br> <p> </div> Wed, 08 Apr 2009 15:30:18 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/327739/ https://lwn.net/Articles/327739/ pgoetz <div class="FormattedComment"> "When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash."<br> <p> Sorry, this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst-case scenario -- a size of zero after a crash -- precisely violates atomicity.<br> <p> For the record, the first 2 paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e.
update the metadata before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.<br> </div> Tue, 07 Apr 2009 22:27:25 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324854/ https://lwn.net/Articles/324854/ dedekind <div class="FormattedComment"> It's not just awful, it is silly.<br> <p> $ man 2 write<br> <p> ...<br> <p> "A successful return from write() does not make any guarantee that data has been committed to disk."<br> </div> Mon, 23 Mar 2009 05:02:54 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324853/ https://lwn.net/Articles/324853/ dedekind <div class="FormattedComment"> I'm confused. So userspace people refuse to understand that before they call 'fsync()', the data does not have to be on the disk. And "man 2 write" even says this at the end.<br> <p> And now what Theo is doing - he is fixing userspace bugs and pleasing angry bloggers by hacking ext4? Because ext4 wants more users? And now we have zero chance to have userspace ever fixed? <br> </div> Mon, 23 Mar 2009 04:56:16 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324720/ https://lwn.net/Articles/324720/ spitzak <div class="FormattedComment"> This flag certainly is needed, but I would go further and say that Linux should change the behavior of O_CREAT|O_WRONLY|O_TRUNC to do exactly what you specify. This is because probably every program using these flags (or using creat()) is written to implicitly expect this behavior anyway.<br> <p> </div> Sat, 21 Mar 2009 00:34:24 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324680/ https://lwn.net/Articles/324680/ anton <blockquote> But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage.
The important discussion is, what is the API today and what should it really be? </blockquote> The API in the non-crash case is defined by POSIX, so I translate this as: What guarantees in case of a crash should the file system give? <p>One ideal is to perform all operations synchronously. That's very expensive, however. <p>The next-cheaper ideal is to preserve the order of the operations (<a href="http://www.complang.tuwien.ac.at/papers/czezatke%26ertl00/#tth_sEc3.2">in-order semantics</a>), i.e., after crash recovery you will find the file system in one of the states that it logically had before the crash; the file system may lose the operations that happened after some point in time, but it will be just as consistent as it was at that point in time. If the file system gives this guarantee, any application that is written to be safe against being killed will also have consistent (but not necessarily up-to-date) data in case of a system crash. <p>This guarantee can be implemented relatively cheaply in a copy-on-write file system, so I really would like Btrfs to give that guarantee, and give it for its default mode (otherwise things like ext3's data=journal debacle will happen). <p>How to implement this guarantee? When you decide to do another commit, just remember the then-current logical state of the file system (i.e., which blocks have to be written out), then write them out, then do a barrier, and finally write the root block. There are some complications: e.g., you have to deal with some processes being in the middle of some operation at the time; and if a later operation wants to change a block before it is written out, you have to make a new working copy of that block (in addition to the one waiting to be written out), but that's just a variation on the usual copy-on-write routine.
<p>You would also have to decide how to deal with fsync() when you give this guarantee: Can fsync()ed operations run ahead of the rest (unlike what you normally guarantee), or do you just perform a sync when an fsync is requested? <p>The benefits of providing such a guarantee would be: <ul> <li>Many applications that work well when killed would work well on Btrfs even upon a crash. <li>It would be a unique selling point for Btrfs. Other popular Linux file systems don't guarantee anything at all, and their maintainers only grudgingly address the worst shortcomings when there's a large enough outcry while complaining about "incorrect API usage" by applications, and some play fast and loose in other ways (e.g., by not using barriers). Many users value their data more than these maintainers do and would hopefully flock to a filesystem that actually gives crash consistency guarantees. </ul> If you don't even give crash consistency guarantees, I don't really see a point in having the checksums that are one of the main features of Btrfs. I have seen many crashes, including some where the file system lost data, but I have never seen hardware go bad in a way where checksums would help. Fri, 20 Mar 2009 18:29:00 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324615/ https://lwn.net/Articles/324615/ anton <blockquote> If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires. </blockquote> By what definition? It's perfectly possible to delay allocation (and writing) as long as desired, and also delay all the other operations that are supposed to happen afterwards (in particular the rename of the same file) until after the allocation and writing has happened. Delayed allocation is a red herring here; the problem is the ordering of the writes. <p>LinLogFS, which had the goal of providing good crash recovery, did implement delayed allocation.
Fri, 20 Mar 2009 15:04:55 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324607/ https://lwn.net/Articles/324607/ anton <blockquote> To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process. </blockquote> You have the right idea about the proper behaviour, but there is more freedom in implementing it: You can commit a batch of operations at the same time (e.g., after the 60s flush interval), and you need only a few barriers for each batch: essentially one barrier between writing everything but the commit block and writing the commit block, and another barrier between writing the commit block and writing the free-blocks information for the blocks that were freed by the commit (and if you delay the freeing long enough, you can combine the latter barrier with the former barrier of the next cycle). <p>This can be done relatively easily on a copy-on-write file system. For an update-in-place file system, you probably need more barriers or write more stuff in the journal (including data that's written to pre-existing blocks). Fri, 20 Mar 2009 14:52:07 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324605/ https://lwn.net/Articles/324605/ anton Write barriers or something equivalent (properly-used tagged commands, write cache flushes, or disabling write-back caches) are needed for any file system that wants to provide any consistency guarantee. Otherwise the disk drive can delay writing one block indefinitely while writing out others that are much later logically. 
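[Editor's note] anton's barrier argument can be checked mechanically with a toy model of a write-back disk cache: a crash may persist any subset of the writes queued since the last barrier, while a barrier orders whole groups of writes. The model below is purely illustrative (class and method names are mine), but it shows why a single barrier between the data blocks and the commit block suffices:

```python
import itertools

class CachedDisk:
    """Toy write-back disk: a crash may persist any subset of the
    writes queued since the last barrier; a barrier guarantees that
    everything before it reaches the media before anything after it."""

    def __init__(self):
        self.epochs = [[]]  # lists of (block, value), split by barrier()

    def write(self, block, value):
        self.epochs[-1].append((block, value))

    def barrier(self):
        self.epochs.append([])

    def possible_crash_images(self):
        """Every on-media state a crash could leave behind."""
        images = []
        for n in range(len(self.epochs) + 1):
            durable = [w for epoch in self.epochs[:n] for w in epoch]
            partial = self.epochs[n] if n < len(self.epochs) else []
            for r in range(len(partial) + 1):
                for subset in itertools.combinations(partial, r):
                    media = {}
                    for block, value in durable + list(subset):
                        media[block] = value
                    images.append(media)
        return images

# The protocol from the comment: data blocks first, barrier, commit block.
disk = CachedDisk()
disk.write("data1", "new")
disk.write("data2", "new")
disk.barrier()
disk.write("commit", "v2")

# In every reachable crash image, a visible commit block implies that
# the data it describes is also on the media.
for image in disk.possible_crash_images():
    if image.get("commit") == "v2":
        assert image.get("data1") == "new" and image.get("data2") == "new"
```

Drop the barrier() call and the model immediately produces a crash image containing the commit block but not the data, which is the corruption anton describes the drive being free to create.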
Fri, 20 Mar 2009 14:40:20 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324601/ https://lwn.net/Articles/324601/ anton <blockquote> I'm shocked and dismayed that a developer of an ext# filesystem can be so cavalier regarding a data integrity issue. This attitude would *never* have been taken by a dev during this period in ext3's life-cycle. </blockquote> The more recent ext3 developers have a pretty cavalier attitude to data integrity: For some time the data=journal mode (which should provide the highest integrity) was broken. Also, <a href="http://lwn.net/Articles/283379/">ext3 disables using barriers by default</a>, essentially eliminating all the robustness that adding a journal gives you (and without getting any warning that your file system is corrupted). Fri, 20 Mar 2009 14:22:37 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324594/ https://lwn.net/Articles/324594/ anton <blockquote> [...] it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow) </blockquote> That's actually a very good example: ext4 performs the writes in a different order than what corresponds to the operations in the process(es), resulting in an on-disk state that never was the logical state of the file system at any point in time. One difference is that file systems have been crash-vulnerable ("crazy") for a long time, so in a variation of the Stockholm syndrome a number of people now insist that that's the right thing. 
Fri, 20 Mar 2009 13:48:03 +0000 Atomicity vs durability https://lwn.net/Articles/324581/ https://lwn.net/Articles/324581/ anton <blockquote> Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-) </blockquote> If the fsync() has to write out 500MB, I certainly would expect it to take several seconds and the fsync call to block for several seconds. fsync() is just an inherently slow operation. And if an application works around the lack of robustness of a file system by calling fsync() frequently, the application will be slow (even on robust file systems). Fri, 20 Mar 2009 11:24:16 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324493/ https://lwn.net/Articles/324493/ alexl <div class="FormattedComment"> That is true, and I "fixed" the glib saver code to also fsync() before rename in the case where the rename would replace an existing file.<br> <p> However, all apps doing such syncing results in lower overall system performance than if the system could guarantee data-before-metadata on rename. So, ideally we would either want a new call to use instead of fsync, or a way to tell if the guarantees are met so that we don't have to fsync.<br> <p> </div> Thu, 19 Mar 2009 19:50:28 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/324491/ https://lwn.net/Articles/324491/ anton <blockquote> But as it's pointed out that many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc etc. All sorts of them. Right? </blockquote> Emacs did get it right when <a href="http://www.complang.tuwien.ac.at/anton/sync-metadata-updates.html">UFS lost my file</a> that I had just written out from emacs, as well as the autosave file. But UFS got it wrong, just as ext4 gets it wrong now.
There may be applications that manage to work around the most crash-vulnerable file systems in many cases, but that does not mean that the file system has sensible crash consistency guarantees. Thu, 19 Mar 2009 19:34:39 +0000 Where did the correctness go? https://lwn.net/Articles/324068/ https://lwn.net/Articles/324068/ nix <div class="FormattedComment"> Sure. I meant nobody else had done it *in a filesystem*.<br> <p> </div> Tue, 17 Mar 2009 20:42:44 +0000 Where did the correctness go? https://lwn.net/Articles/324001/ https://lwn.net/Articles/324001/ butlerm <div class="FormattedComment"> I refer to filesystem *meta-data* operations of course. <br> </div> Tue, 17 Mar 2009 17:38:48 +0000 Where did the correctness go? https://lwn.net/Articles/323983/ https://lwn.net/Articles/323983/ butlerm <div class="FormattedComment"> "and all the filesystems (including ext4 prior to the patches) provide the <br> atomicity you are looking for."<br> <p> I am afraid not. Atomic means that the pertinent operation always appears <br> either to have completed OR to have never started in the first place. If <br> the system recovers in a state where some of the effects of the operation <br> have been preserved and other parts have disappeared, that is not atomic.<br> <p> The operation here is replacing a file with a new version. Atomic <br> operation means when the system recovers there is either the old version or <br> the new version, not any other possibility. You can do this now of course, <br> you simply have to pay the price for durability in addition to <br> atomicity. <br> <p> Per accident of design, filesystems require a much higher price (in terms <br> of latency) to be paid for durability than databases do.
That <br> factor is multiplied by a hundred or more if atomicity is required, but <br> durability is not.<br> <p> </div> Tue, 17 Mar 2009 17:30:56 +0000 This is a regression https://lwn.net/Articles/323876/ https://lwn.net/Articles/323876/ ikm <div class="FormattedComment"> Yep, it was MyISAM.<br> </div> Tue, 17 Mar 2009 11:59:53 +0000 Where did the correctness go? https://lwn.net/Articles/323871/ https://lwn.net/Articles/323871/ dlang <div class="FormattedComment"> and all the filesystems (including ext4 prior to the patches) provide the atomicity you are looking for.<br> <p> it's just the durability in the face of a crash that isn't there. but it wasn't there on ext3 either (there was just a smaller window of vulnerability), and even if you mount your filesystem with the sync option many commodity hard drives would not let you disable their internal disk caches, and so you would still have the vulnerability (with an even smaller window)<br> </div> Tue, 17 Mar 2009 09:48:37 +0000 Where did the correctness go? https://lwn.net/Articles/323864/ https://lwn.net/Articles/323864/ butlerm <div class="FormattedComment"> ACID has four letters for a reason. Atomicity is logically independent of <br> durability. A decent database will let you turn (synchronous) durability <br> off while fully guaranteeing atomicity and consistency.<br> <p> The reason is that with a typical rotating disk, any durable commit is <br> going to take at least one disk revolution time, i.e. about 10 ms. Single <br> threaded atomic (but not necessarily durable) commits can be issued a <br> hundred times faster than that, because no synchronous disk I/O is required <br> at all. <br> <p> </div> Tue, 17 Mar 2009 08:31:34 +0000 Where did the correctness go?
https://lwn.net/Articles/323860/ https://lwn.net/Articles/323860/ dlang <div class="FormattedComment"> how do you think the databases make sure their data is on disk?<br> <p> they use f(data)sync calls to the filesystem.<br> <p> so your assertion that databases can make atomic changes to their data faster than the filesystem can do an fsync means that either you don't know what you are saying, or you don't really have the data safety that you think you have.<br> </div> Tue, 17 Mar 2009 07:18:59 +0000 Where did the correctness go? https://lwn.net/Articles/323855/ https://lwn.net/Articles/323855/ butlerm <div class="FormattedComment"> On the contrary, every decent database in the world does this, and will run <br> circles around contemporary filesystems for comparable synchronous and <br> asynchronous operations. Check out Gray and Reuter's Transaction Processing <br> book for details. The edition I have was published in 1993.<br> <p> There are two basic problems here:<br> <p> The first is that fsync is a ridiculously *expensive* way to get the needed <br> functionality. The second is that most filesystems cannot implement atomic <br> operations any other way (i.e. without forcing both the metadata and the <br> data and any other pending metadata changes to disk).<br> <p> fsync is orders of magnitude more expensive than necessary for the case <br> under consideration. A properly designed filesystem (i.e. one with <br> metadata undo) can issue an atomic rename in microseconds. The only option <br> that POSIX provides can take hundreds if not thousands of milliseconds on a <br> busy filesystem. <br> <p> Databases do *synchronous*, durable commits on busy systems in ten <br> milliseconds or less. Ten to twenty times faster than it takes <br> contemporary filesystems to do an fsync under comparable conditions.<br> <p> Even that is still a hundred times more expensive than necessary, because <br> synchronous durability is not required here. Just atomicity.
Nothing has <br> to hit the disk. No synchronous I/O overhead. Just metadata undo <br> capability.<br> <p> </div> Tue, 17 Mar 2009 06:42:18 +0000 This is a regression https://lwn.net/Articles/323849/ https://lwn.net/Articles/323849/ efexis <div class="FormattedComment"> Note that MySQL isn't always ACID-compliant and clearly states that fact, e.g. when using MyISAM tables. Converting to InnoDB should fix that for you. If you were running InnoDB tables... then shut me up! Hehe, never done any testing of this myself. Which storage engine are you using?<br> </div> Tue, 17 Mar 2009 05:41:59 +0000 Write barriers https://lwn.net/Articles/323828/ https://lwn.net/Articles/323828/ xoddam <div class="FormattedComment"> In the context of a journalling filesystem the application-level guarantee doesn't really need to be implemented with an explicit write barrier at the disk level. Write barriers may or may not be used to maintain the journal; journals can work (perhaps somewhat less effectively) without them.<br> <p> Because the journal is already able to provide the guarantee of filesystem metadata consistency, it can be used in the same way to ensure an effective ordering between write() and rename().<br> </div> Tue, 17 Mar 2009 01:43:16 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323796/ https://lwn.net/Articles/323796/ njs <div class="FormattedComment"> That makes sense. Thanks.<br> </div> Mon, 16 Mar 2009 22:56:38 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323795/ https://lwn.net/Articles/323795/ masoncl <div class="FormattedComment"> That's correct.
The main point behind data=ordered is to make sure that if you crash you don't have extent pointers in the file pointing to extents that haven't been written since they were allocated.<br> <p> Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.<br> </div> Mon, 16 Mar 2009 22:51:50 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323764/ https://lwn.net/Articles/323764/ njs <div class="FormattedComment"> ...Is what you're saying that for btrfs, metadata about extents (like disk location and checksums, I guess) is handled separately from metadata about filenames, and traditionally only the former had data=ordered-style guarantees? (Just trying to see if I understand.)<br> </div> Mon, 16 Mar 2009 21:23:26 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323671/ https://lwn.net/Articles/323671/ masoncl <div class="FormattedComment"> The btrfs data=ordered implementation is different from ext3/4 and reiserfs. It decouples data writes from the metadata transaction, and simply updates the metadata for file extents after the data blocks are on disk.<br> <p> This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.<br> <p> When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.<br> <p> I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similarly to the way ext3 does today.
Files that have been through rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that were from the current transaction).<br> </div> Mon, 16 Mar 2009 16:50:05 +0000 Temporary files. https://lwn.net/Articles/323642/ https://lwn.net/Articles/323642/ nix <div class="FormattedComment"> Alternatively you could use... tmpfs! Writing a large file to /tmp isn't <br> going to crash the box unless you explicitly raised the size limits <br> on /tmp...<br> </div> Mon, 16 Mar 2009 15:22:46 +0000 This is a regression https://lwn.net/Articles/323616/ https://lwn.net/Articles/323616/ ikm <div class="FormattedComment"> <font class="QuotedText">&gt; That would break ACID guarantees for all databases, etc.</font><br> <p> Once I had MySQL running on an XFS filesystem, and the system hung for some reason. The database got broken so horribly I had to restore it from backups. I wouldn't really count on any 'ACID guarantees' here :) A UPS and a ventilated dust-free environment is our only ACID guarantee :)<br> </div> Mon, 16 Mar 2009 14:09:37 +0000 This is a regression https://lwn.net/Articles/323608/ https://lwn.net/Articles/323608/ ikm <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; Er, no, 'going beyond the standard' is what *ext4* should do</font><br> <p> <font class="QuotedText">&gt; And that's going to help the broken application running on another filesystem exactly how? </font><br> <p> It's not. We are talking about fixing problems users start to experience when they switch from ext3 to ext4. None of the other goals, such as fixing all the apps, making all filesystems happy, feeding the hungry and making the world a better place are being pursued here. The 2.6.30 fixes do what they are supposed to do, without breaking anything else. So it is a good thing, and I don't understand why you seem to be against it.<br> <p> Sure, there's lots of stuff which ain't working right, but it's not a subject here.
World's not perfect, and it's not going to be any time soon.<br> </div> Mon, 16 Mar 2009 13:45:47 +0000 Where did the correctness go? https://lwn.net/Articles/323606/ https://lwn.net/Articles/323606/ jamesh <div class="FormattedComment"> Of course, if the drive supports barriers in its command queueing implementation it should be possible to prevent it reordering those writes.<br> <p> That is likely to restrict reorderings that won't break correctness guarantees though.<br> </div> Mon, 16 Mar 2009 13:28:29 +0000 Where did the correctness go? https://lwn.net/Articles/323600/ https://lwn.net/Articles/323600/ nye <div class="FormattedComment"> POSIX also allows a system crash to cause your computer to explode and hurl shrapnel into your face, because crash-behaviour is *undefined*. Are you seriously arguing that *any* POSIX-compliant behaviour is automatically the right thing? Clearly not, because you are arguing against one POSIX-compliant method in favour of another. There are an infinite number of ways to be POSIX-compliant, some of which are more useful than others.<br> </div> Mon, 16 Mar 2009 12:00:23 +0000 Atomicity vs durability https://lwn.net/Articles/323594/ https://lwn.net/Articles/323594/ forthy <p>Any reasonable hard disk (SATA, SCSI) has write barriers which allow file system implementers to actually implement atomicity.</p> Mon, 16 Mar 2009 11:00:41 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323586/ https://lwn.net/Articles/323586/ forthy <p>I'm curious, too. I thought btrfs did it right, by being COW-logging of data&amp;metadata and having data=ordered mandatory, with all the explanation in the FAQ that makes complete sense (correct checksums in the metadata also mean correct data). Now Chris Mason tells us he didn't? Ok, this will be fixed in 2.6.30, and for now, we all don't expect that btrfs is perfect.
We expect bugs to be fixed, and that's going well.</p> <p>IMHO a robust file system should preserve data operation ordering, so that a file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation e.g. can allocate all new inodes into a single consecutive block, delayed directory name allocation all new names into consecutive data, as well).</p> Mon, 16 Mar 2009 10:46:46 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323587/ https://lwn.net/Articles/323587/ endecotp <div class="FormattedComment"> <font class="QuotedText">&gt; Come on Ted, what exactly do you want us to write to be portably safe?</font><br> <p> Ted seems to have answered this in his second blog post: YES you DO need to fsync the directory if you want to be certain that the metadata has been saved.<br> </div> Mon, 16 Mar 2009 10:37:48 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323584/ https://lwn.net/Articles/323584/ oseemann <div class="FormattedComment"> So it turns out there are really very different use cases for files. As the name implies, temporary files need never hit the disk and could thus even happily reside on a ramdisk (many systems clear /tmp upon reboot anyway).<br> <p> For /home or /var many users might want a more conservative approach, e.g.
fsync on close or something similar, accepting performance penalties where necessary.<br> <p> I believe this is a larger issue and I'm glad the current behavior of ext4 receives such wide attention and makes people think about the actual requirements for persistent storage.<br> <p> I'm certain in the long run the community will come up with a proper approach for a solution.<br> </div> Mon, 16 Mar 2009 10:24:44 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323579/ https://lwn.net/Articles/323579/ dlang <div class="FormattedComment"> think about temporary files created during a compile. you may create them, fill them, and close them with one program. then a second program comes along a few seconds later to read and delete the file. it never actually needs to hit the disk.<br> <p> not all temporary files are only used by a single program that keeps them open the entire time.<br> </div> Mon, 16 Mar 2009 09:28:15 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323572/ https://lwn.net/Articles/323572/ kornelix <div class="FormattedComment"> I do not understand why there is a debate about this. If a file is written and closed by the application, I see no reason to delay committing it to disk. No work will be saved, only delayed. Nothing can be better optimized by the delay (well maybe a bit of seek time on a busy disk but this only applies to commit delays &lt; 1 second or so). The only impact of the delay is greater risk that the update will get lost. Of course the buffers should be marked "clean" and retained in cache for a while in case a read of the same file is requested shortly later. <br> </div> Mon, 16 Mar 2009 08:57:58 +0000 Ts'o: Delayed allocation and the zero-length file problem https://lwn.net/Articles/323569/ https://lwn.net/Articles/323569/ njs <div class="FormattedComment"> So, uh...
doesn't the Btrfs FAQ claim that this is the default, indeed required, behavior already?<br> <p> <a href="http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_have_data.3Dordered_mode_like_Ext3.3F">http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_hav...</a><br> <p> I'm curious what I'm missing...<br> </div> Mon, 16 Mar 2009 08:38:05 +0000
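[Editor's note] For readers puzzling over the same FAQ question: the data=ordered rule masoncl explains upthread — metadata for file extents is committed only after the data blocks it points at are on disk — can be sketched as a toy model. Class and method names are mine; the point is that the worst case after a crash is a missing or stale file, never garbage or bits of deleted files:

```python
class OrderedCommitFS:
    """Toy of the data=ordered rule: a transaction's metadata is only
    committed after the data blocks it references are durable, so
    committed extent pointers never reference unwritten blocks."""

    def __init__(self):
        self.durable_data = {}   # data blocks actually on disk
        self.durable_meta = {}   # committed metadata: name -> block key
        self.cached_data = {}    # dirty data, lost on crash
        self.pending_meta = {}   # metadata of the open transaction

    def write_file(self, name, content):
        key = (name, content)            # stand-in for a block address
        self.cached_data[key] = content
        self.pending_meta[name] = key

    def commit(self):
        self.durable_data.update(self.cached_data)   # data first...
        self.cached_data.clear()
        self.durable_meta.update(self.pending_meta)  # ...then metadata
        self.pending_meta.clear()

    def crash(self):
        # everything that was not committed is simply lost
        self.cached_data.clear()
        self.pending_meta.clear()

    def read(self, name):
        key = self.durable_meta.get(name)
        if key is None:
            return None      # worst case: file absent / zero-length
        return self.durable_data[key]    # never dangling: data preceded meta

fs = OrderedCommitFS()
fs.write_file("a.txt", "old")
fs.commit()
fs.write_file("a.txt", "new")    # rewritten but not yet committed
fs.crash()
assert fs.read("a.txt") == "old" # stale data survives; never garbage
```

This is the trade-off the whole thread circles: ordering alone bounds the damage to "old or missing contents", while durability of the newest write still requires an explicit fsync.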