
Ts'o: Delayed allocation and the zero-length file problem

Here's a posting from Ted Ts'o on ext4, delayed allocation, and robustness. "What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4, to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync."
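
For reference, the pattern being recommended looks roughly like the sketch below. This is a minimal illustration, not Ts'o's code; the names (save_file, tmp_path) are made up for the example, error handling is abbreviated, and a real program would also need to handle short writes:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch of the "write temp file, fsync, rename" save pattern.
   Error handling is abbreviated; a real program should also handle
   short writes and clean up the temporary file on every failure path. */
int save_file(const char *path, const char *tmp_path,
              const char *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len)   /* write the new contents */
        goto fail;
    if (fsync(fd) != 0)                        /* force the data to disk first */
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return rename(tmp_path, path);             /* atomically replace the old file */

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}

The point of the ordering is that the file's data reaches the disk before the rename makes it visible under the final name; without the fsync(), delayed allocation may commit the rename first, and a crash can then leave a zero-length file.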


Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 14:20 UTC (Fri) by welinder (guest, #4699) [Link]

That's an awful rant.

Ts'o says "POSIX allows this behaviour".

Compare that to the gcc developers who say "The C standard allows this
or that behaviour" and the kernel people's response that gcc should do
the reasonable thing (whatever that is) regardless of what the standard
allows.

Similarly, all FSs should try hard to make sure open-write-close-rename
leaves you with either the new or the old file. Anything else isn't
reasonable.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 14:33 UTC (Fri) by jbailey (guest, #16890) [Link]

@welinder: By that same argument, glibc should never have removed the PLT entries for all of its internal functions that many programs were twiddling to try to improve their own performance. The result of that approach is Windows, where the scheduler has so many hacks and tweaks for different applications that there's no way out of the mess.

Relying on undocumented behaviour and then crying foul when it changes is just bad. The OS cannot and should not provide bug for bug compatibility with itself.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 14:46 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

The OS can provide bug-for-bug compatibility with itself. To do so is clearly possible given the evidence that Windows does it.

To what extent that compatibility is required or desired is a different matter.

In any case, "Linux" has, for many years now, been commonly found with EXT3 as its default filesystem. This filesystem did not exhibit data-loss for the scenario in question. EXT4 does. How is this not a regression? We're not talking about a program crashing or running slowly or anything like that, we're talking about data loss. DATA LOSS. If there's one thing a computer should get right, it's storing the darn data and not losing it.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:42 UTC (Fri) by drag (guest, #31333) [Link]

Let's get this straight first:

1. It's not the file system causing data loss. It's a combination of buggy drivers and incorrect application developer assumptions that are causing data loss. The file system is working correctly.

2. Ext3 exhibits the same behavior, as do all modern file systems. You have the same problems with badly designed applications with XFS or Btrfs, for example.

The only difference between Ext3 and Ext4 in this regard is that with Ext3 you had a 5 second window and with Ext4 you generally have up to 60 seconds. That is, there is no difference in behavior if your system crashes within 4 seconds of a write.

3. Linux supports multiple different file systems and it always has. You're not dealing with Windows, where your only choices in life are NTFS or FAT32.

Therefore if you want an OS that can benefit from the positive qualities of anything other than Ext3, then it's shitty policy to bend over backwards to support badly written applications whose developers never bothered to test on anything other than Ext3 or to understand what the code they are writing actually does.

4. If you want your software to be portable at all to other operating systems, say OS X, Windows, FreeBSD, etc etc... then depending on the dumb-luck characteristics of a common configuration of a nearly obsolete file system on a single operating system is not a good way to go.

5. Ts'o introduced a patch that helps mitigate this issue anyway.

----------------------------------

Anyway: if you have ever complained about the lack of standards compliance of banks that demand IE only, then what you're saying now is slightly hypocritical. POSIX is a standard for accessing file systems; the specific chance behavior of Ext3 isn't.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:14 UTC (Fri) by forthy (guest, #1525) [Link]

> You have the same problems with badly designed applications with XFS or Btrfs, for example.

Actually, not with Btrfs. This is a log-structured file system with COW, i.e. it preserves file system operation order. You either get new data and new metadata, or you get old data and old metadata, but no mix like in ext4 or XFS. BTW: this discussion is as old as dirt. See Anton Ertl's very old rant on it: What's wrong with synchronous metadata updates. If you don't understand what's wrong with synchronous metadata updates, don't even try to write a robust file system. You can still write a fast but fragile file system, but that doesn't need synchronous metadata updates.

If you move the problem of creating a consistent file system to the applications, you will fail. fsync() only syncs the data - the metadata can still be whatever the file system designer likes (old, new, not yet allocated, trash that will settle down in five seconds). The file system can still be broken beyond repair after a crash. And using fsync() is horribly slow, increases file system fragmentation, and so on. If you are a responsible file system designer, you don't offload your consistency problems onto the user. You solve them.

The fact that ext3 is broken the same way, just with a shorter window, is no excuse. ReiserFS was broken in a similar way, until they made data journalling the default. Btrfs gets it right by design, but then we had better wait for Btrfs to mature than "fix" ext4.

The applications in question, like KDE, are not broken. They just rely on a robust, consistent file system. There is no guarantee of that, as POSIX does not specify that the file system should be robust in case of crashes. But it is a sound assumption for writing user code. If you can't rely on your operating system, write one yourself.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 18:47 UTC (Fri) by masoncl (subscriber, #47138) [Link]

Just to clarify what will happen on btrfs.

If you:

create(file1)
write(file1, good data)
fsync(file1)

create(file2)
write(file2, new good data)
rename(file2 to file1)

FS transaction happens to commit

<crash>

file2 has been renamed over file1, and that rename was atomic. file2 either completely replaces file1 in the FS or file1 and file2 both exist.

But, no fsync was done on file2's data, and file2 replaced file1. After the crash, the file1 you see in the directory tree may or may not have data in it.

The btrfs delalloc implementation is similar to XFS in this area. File metadata is updated after the file IO is complete, so you won't see garbage or null bytes in the file, but the file might be zero length.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 19:13 UTC (Fri) by foom (subscriber, #14868) [Link]

> The btrfs delalloc implementation is similar to XFS in this area. File metadata
> is updated after the file IO is complete, so you won't see garbage or null
> bytes in the file, but the file might be zero length.

Better get to fixing that, then!

I'm rather amused at the number of comments along the lines of "XFS already does it so it must be
okay!" A filesystem known for pissing off users by losing their data after power-outages is not one
I'm happy to see filesystem developers hold up as a shining example of what to do...

(and apparently XFS doesn't even do this anymore, after the volume of complaints raised against it,
according to other comments!)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 19:44 UTC (Fri) by masoncl (subscriber, #47138) [Link]

XFS closes this window on truncate but not renames.

This is a fundamental discussion about what the FS is supposed to implement when it does a rename.

The part where applications expect rename to also mean fsync is a new invention with broad performance implications.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:11 UTC (Fri) by alexl (subscriber, #19068) [Link]

Nobody is expecting rename to imply fsync!
This isn't about having the data on disk *now*.

We just expect it not to write the new metadata for "newpath" to disk before the data for oldpath is on disk.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:34 UTC (Fri) by masoncl (subscriber, #47138) [Link]

The goal is to make sure the data for the new file is on disk before (or in the same transaction as) the metadata for the rename.

We have two basic choices to accomplish this:

1) Put the new file into a list of things that must be written before the commit is done. This is pretty much what the proposed ext4 changes do.

2) Write the data before the rename is complete.

The problem with #1 is that it reintroduces the famous ext3 fsync behavior that caused so many problems with firefox. It does this in a more limited scope, just for files that have been renamed, but it makes for very large side effects when you want to rename big files.

The problem with #2 is that it is basically fsync-on-rename.

The btrfs fsync log would allow me to get away with #1 without too much pain, because fsyncs don't force full btrfs commits and so they won't actually wait for the renamed file data to hit disk.

But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?

Applications have known how to get consistent data on disk for a looong time. Mail servers do it, databases do it. Changing rename to include significant performance penalties when it isn't documented or expected to work this way seems like a very bad idea to me.

I'd much rather make a new system call or flag for open that explicitly documents the extra syncing, and give application developers the choice.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:46 UTC (Fri) by alexl (subscriber, #19068) [Link]

I think "and give application developers the choice." is a fallacy.

At the level of the close happening we don't really know what kind of data this is, as this is generally some library helper function. And even at the application level, how do you know that it's important not to zero out a file on crash? It depends on how the user uses the application.

It all comes back to the fact that for "proper" use of this, more or less all cases would turn into sync-on-write (or the new flag or whatever). So, the difference wrt the filesystem-wide implementation of this will get smaller as apps get "fixed", until the difference is more or less zero.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:19 UTC (Fri) by drag (guest, #31333) [Link]

Well all you guys are over my head.

But as has been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc. All sorts of them. Right?

So you have the choice of making undocumented behavior documented and then forcing that behavior on all the file systems that Linux supports and all the file systems that you expect your application to run on, or you can fix the application to do it right.

And the assumptions that were made to create this bad behavior are not even true. So it's not even a question of backward compatibility... They've always gotten it wrong; it's just been dumb luck that it wasn't a bigger issue in the past.

As long as file systems are async you're going to have a delay between when the data is created and when that data is committed to disk. You can do all sorts of things to help reduce the damage that can cause, but it's still the fundamental nature of the beast. If you lose power or crash your computer you WILL lose some data.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 0:38 UTC (Sat) by drag (guest, #31333) [Link]

Alright...

I've been reading what has been written and I think I am understanding what is going on better now.

But here is my new thought on the subject:

The reason why they are going:
create file 1
write file 1
.... (time passes)
create file 2
write file 2
rename file 2 over file 1

Is because they are trying to get an 'atomic' operation, right?

They are making a few assumptions about the file system that:
A. the file system does not operate in atomic operations and thus they have to do this song and dance to do the file system's work for them... (protect their data)
B. that while the file system is going to fail them otherwise, it is still able to do rename in one single operation.
C. That by renaming the file they are actually telling the file system to commit it to disk.

So in effect they are trying to outguess or outthink the operating system. But their assumptions, in the case of Ext4 and most others, are not correct; their software is doing what they told it to do, but what they told it to do is not what they think it is doing.

You see, they are putting extra effort into compensating for the file system already. So if they are putting the extra effort into outthinking the OS, then why don't they at least do it correctly?

Instead of writing out hundreds of files and trying the rename atomic trick, which isn't really right anyway, there are half a dozen different design approaches that would yield better results.

Or am I completely off-base here?

I understand the need for the file system to protect a user's data despite what the application developers actually wrote. Really I do.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 8:15 UTC (Sat) by alexl (subscriber, #19068) [Link]

The rename is documented by POSIX and Unix since forever to be atomic. That is not some form of "workaround" or "compensation", but a solid, safe, well-documented way to write files. However, those atomicity guarantees are only valid if the system doesn't crash (as crashes are not specified by POSIX).

The "atomic" part is protection against other apps that are saving to the file at the same time, not crashes. The fsync is only required not to get problems when the system crashes.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 16:45 UTC (Sat) by drag (guest, #31333) [Link]

Thanks. I realise now that I was off-base. I understand now it has to do with application-land and not so much with file system stuff. :)

Not getting it right

Posted Mar 14, 2009 12:17 UTC (Sat) by man_ls (guest, #15091) [Link]

The "other applications get it right with fsync" part is a bit of a fallacy. When you save a document in an application (be it a word processor or a mail client) you really want it to be safe, and it is normally not an issue to wait a few seconds. IOW the user is expected to wait for disk activity because we have been trained this way, so we are willing to accept this trade-off.

But there are other programs doing file operations all the time, and nobody wants to wait a few seconds for them. Most notably background tasks like the operating system or a desktop environment. Is it reasonable to expect all of them to do something which potentially slows the system to a crawl on other filesystems, just to play safe with the newcomer ext4?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 19, 2009 19:34 UTC (Thu) by anton (subscriber, #25547) [Link]

> But as has been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc. All sorts of them. Right?
Emacs did get it right when UFS lost my file that I had just written out from emacs, as well as the autosave file. But UFS got it wrong, just as ext4 gets it wrong now. There may be applications that manage to work around the most crash-vulnerable file systems in many cases, but that does not mean that the file system has sensible crash consistency guarantees.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 19, 2009 19:50 UTC (Thu) by alexl (subscriber, #19068) [Link]

That is true, and I "fixed" the glib saver code to also fsync() before rename in the case where the rename would replace an existing file.

However, all apps doing such syncing results in lower overall system performance than if the system could guarantee data-before-metadata on rename. So, ideally we would either want a new call to use instead of fsync, or a way to tell if the guarantees are met so that we don't have to fsync.
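
A rough sketch of that conditional approach (not the actual glib code; replace_file and its arguments are illustrative, and the temporary file is assumed to be already written and still open as fd):

#include <stdio.h>
#include <unistd.h>

/* Only pay the fsync() cost when the rename will replace an existing
   file, since that is the case where delayed allocation can otherwise
   leave a zero-length file after a crash.  Not the actual glib code. */
int replace_file(const char *path, const char *tmp_path, int fd)
{
    int replacing = (access(path, F_OK) == 0);   /* does the target exist? */

    if (replacing && fsync(fd) != 0)
        return -1;
    if (close(fd) != 0)
        return -1;
    return rename(tmp_path, path);
}

The idea, as discussed above, is that when no existing file is being replaced a crash can only cost you the new file, so the fsync() latency can be skipped in that case.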

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:16 UTC (Fri) by foom (subscriber, #14868) [Link]

It used to be that losing *your entire filesystem* upon power loss was a possible failure mode.
Whether an app called fsync for a file or not is rather irrelevant in such a case. Obviously, this
kind of failure is allowed by the standards. Just as obviously, it sucks for users, so people made
better filesystems that don't have that failure mode. That's good, I quite enjoy using a system
that doesn't lose my entire filesystem randomly if the power fails.

So it seems to me that claiming that, since failing to call fsync before rename is "incorrect" API
usage, it's okay to lose both old data and new, is simply wishful thinking on the part of
the filesystem developers. Sure it may be allowed by standards (as would be zeroing out the
entire partition...), but it sucks for users of that filesystem. So filesystems shouldn't do it. That's
really all there is to it.

Unless *every* call to rename is *always* preceded by a call to fsync (including those in "mv"
and such), it will suck for users. And there's really no point in forcing everyone to put fsync()s
before every rename, when you could just make the filesystem work without that, and get to the
same place faster.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 7:35 UTC (Sun) by phiggins (guest, #5605) [Link]

Not every rename() call needs to be preceded by fsync(). If the source file is known to be on disk already, there's no point in calling fsync(). This is code that knowingly creates a new file, does not call fsync() (so there is no reason whatsoever to assume that the data is on disk), and then calls rename() to replace an existing file. I do think that the behavior of persisting the update to the directory before saving the new file's data is bizarre and likely to cause problems, though. There may not be a required ordering for those operations, but having them reordered is clearly confusing.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 18:29 UTC (Fri) by anton (subscriber, #25547) [Link]

> But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?
The API in the non-crash case is defined by POSIX, so I translate this as: What guarantees in case of a crash should the file system give?

One ideal is to perform all operations synchronously. That's very expensive, however.

The next-cheaper ideal is to preserve the order of the operations (in-order semantics), i.e., after crash recovery you will find the file system in one of the states that it logically had before the crash; the file system may lose the operations that happened after some point in time, but it will be just as consistent as it was at that point in time. If the file system gives this guarantee, any application that is written to be safe against being killed will also have consistent (but not necessarily up-to-date) data in case of a system crash.

This guarantee can be implemented relatively cheaply in a copy-on-write file system, so I really would like Btrfs to give that guarantee, and give it for its default mode (otherwise things like ext3's data=journal debacle will happen).

How to implement this guarantee? When you decide to do another commit, just remember the then-current logical state of the file system (i.e., which blocks have to be written out), then write them out, then do a barrier, and finally the root block. There are some complications: e.g., you have to deal with some processes being in the middle of some operation at the time; and if a later operation wants to change a block before it is written out, you have to make a new working copy of that block (in addition to the one waiting to be written out), but that's just a variation on the usual copy-on-write routine.

You would also have to decide how to deal with fsync() when you give this guarantee: can fsync()ed operations run ahead of the rest (unlike what you normally guarantee), or do you just perform a full sync when an fsync is requested?
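
As a toy illustration of that ordering (this is not Btrfs code; the commit() function, the flat "device" file and the root block at offset 0 are all invented for the example), the same idea can be sketched with ordinary writes and fdatasync() as the barrier:

#include <sys/types.h>
#include <unistd.h>

/* Toy commit: new copies of the dirty blocks go out first, then a
   barrier, and only then the root pointer that makes them reachable.
   A crash before the root write leaves the old, consistent tree; a
   crash after it leaves the new, consistent tree. */
int commit(int dev_fd,
           const char *blocks, size_t blocks_len, off_t blocks_off,
           const char *root, size_t root_len)
{
    if (pwrite(dev_fd, blocks, blocks_len, blocks_off) != (ssize_t)blocks_len)
        return -1;
    if (fdatasync(dev_fd) != 0)       /* barrier: blocks before root */
        return -1;
    if (pwrite(dev_fd, root, root_len, 0) != (ssize_t)root_len)
        return -1;
    return fdatasync(dev_fd);         /* make the new root durable */
}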

The benefits of providing such a guarantee would be:

  • Many applications that work well when killed would work well on Btrfs even upon a crash.
  • It would be a unique selling point for Btrfs. Other popular Linux file systems don't guarantee anything at all, and their maintainers only grudgingly address the worst shortcomings when there's a large enough outcry, while complaining about "incorrect API usage" by applications, and some play fast and loose in other ways (e.g., by not using barriers). Many users value their data more than these maintainers do and would hopefully flock to a filesystem that actually gives crash consistency guarantees.
If you don't even give crash consistency guarantees, I don't really see a point in having the checksums that are one of the main features of Btrfs. I have seen many crashes, including some where the file system lost data, but I have never seen hardware go bad in a way where checksums would help.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 23:17 UTC (Sat) by masoncl (subscriber, #47138) [Link]

Testing here shows that I can change the btrfs rename code to make sure the data for the new file is on disk before the rename commits without any performance penalty in most workloads.

It works differently in btrfs than xfs and ext4 because fsyncs go through a special logging mechanism, and so an fsync on one file won't have to wait for the rename flush on any other files in the FS.

I'll go ahead and queue this patch for 2.6.30.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 8:38 UTC (Mon) by njs (guest, #40338) [Link]

So, uh... doesn't the Btrfs FAQ claim that this is the default, indeed required, behavior already?

http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_hav...

I'm curious what I'm missing...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525) [Link]

I'm curious, too. I thought btrfs did it right, by being COW-logging of data and metadata and having data=ordered mandatory, with all the explanations in the FAQ that make complete sense (correct checksums in the metadata also mean correct data). Now Chris Mason tells us it didn't? OK, this will be fixed in 2.6.30, and for now none of us expect btrfs to be perfect. We expect bugs to be fixed; and that's going on well.

IMHO a robust file system should preserve data operation ordering, so that a file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation e.g. can allocate all new inodes into a single consecutive block, delayed directory name allocation all new names into consecutive data, as well).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138) [Link]

The btrfs data=ordered implementation is different from ext3/4 and reiserfs. It decouples data writes from the metadata transaction, and simply updates the metadata for file extents after the data blocks are on disk.

This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.

When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.

I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similar to the way ext3 does today. Files that have been through rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that were from the current transaction).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 21:23 UTC (Mon) by njs (guest, #40338) [Link]

...Is what you're saying that for btrfs, metadata about extents (like disk location and checksums, I guess) is handled separately from metadata about filenames, and traditionally only the former had data=ordered-style guarantees? (Just trying to see if I understand.)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138) [Link]

That's correct. The main point behind data=ordered is to make sure that if you crash you don't have extent pointers in the file pointing to extents that haven't been written since they were allocated.

Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 22:56 UTC (Mon) by njs (guest, #40338) [Link]

That makes sense. Thanks.

Ts'o: Delayed allocation and the zero-length file problem

Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (subscriber, #4931) [Link]

"When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash."

Sorry, this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst-case scenario -- a size of zero after a crash -- precisely violates atomicity.

For the record, the first two paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e. update the metadata -- before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.

"bug for bug compatibility" is why Windows sucks

Posted Mar 13, 2009 16:50 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

> The OS can provide bug-for-bug compatibility with itself. To do so is clearly possible given the evidence that Windows does it.

Microsoft has managed to hire some of the most brilliant developers on the planet. But they have a heavy burden: every mistake they ever made in the last 20 years, every misdesigned API, every unspecified behavior, has come to be relied on by some significant application developer, so they have to keep this mountain of crap duct-taped together and running. Some of the worst offenders have been Microsoft's own application developers, relying on undocumented behavior that they can find out by reading the source code as a competitive edge.

The wisest thing the Linux developers have done is that they decided they're willing to regularly break kernel APIs, other than system calls. It's the main reason that Linux has been able to catch up so rapidly.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:41 UTC (Fri) by qg6te2 (guest, #52587) [Link]

Hang on, comparing the changing functionality of glibc, Windows, etc. to a file system is misleading. Ensuring that "open-write-close-rename" does what it says is a reasonable requirement, even if POSIX is silent about it.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:03 UTC (Fri) by rsidd (subscriber, #2582) [Link]

It reminds me of Richard Gabriel's famous "Worse is better" screed.

From Gabriel's article:

> The New Jersey guy said that the Unix folks were aware of the [PC-losering] problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again.
Which, to me, sounds very much the same as saying, like Ted Ts'o, that a correct user program has to fsync() its data and not rely on fclose() actually flushing anything to disk. Also, there are lots of people (like scientific programmers) who write their own short file-handling code without being fanatically "correct" C programmers; buffer overflows and other such bugs are probably OK for them, since their systems are trusted, but data loss really is not OK.

Still, as Gabriel says, Unix won against Lisp systems. And Windows (which was even worse up until Windows ME) won against Unix. So there's food for thought there.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:19 UTC (Fri) by ajross (guest, #4563) [Link]

Agreed. What's the point of putting all these fancy journaling and reliability features into a file system if they don't work by default? I mean, hell, we could lose data after a system crash with ufs in 1983. Why bother with ext4?

Hiding behind POSIX here is just ridiculous. POSIX allows this absurd lack of reliability not because it's a good idea, but because filesystems available when the standard was drafted couldn't support anything better.

Worse is better

Posted Mar 13, 2009 23:03 UTC (Fri) by pboddie (guest, #50784) [Link]

It's very apt to bring up "worse is better" because that particular rant is all about the applications programmer having to jump through hoops so that the systems programmer can save some effort.

Although people can argue that UNIX "got things about right" in comparison to competing (and presumably discontinued) operating systems which were more clever in their implementation, there's a lot to be said for not pestering application programmers with, for example, the tedious details of fsync and friends at the expense of common-sense idioms that just work, like those which assume that closed files can safely have filesystem operations performed on them. Those tedious details involve, of course, figuring out which sync-related function actually does what the developer might anticipate from one platform to the next.

Sometimes worse really is worse.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 3:37 UTC (Sat) by bojan (subscriber, #14302) [Link]

> Which, to me, sounds very much the same as saying, like Ted Ts'o, that a correct user program has to fsync() its data and not rely on fclose() actually flushing anything to disk.

It's not Ted saying this. It is how it works. From man 2 close:

> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)

OK?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 6:13 UTC (Mon) by jamesh (guest, #1159) [Link]

People aren't asking for sync on close. Rather they'd like the rename() operation to only occur if the new file data has been written to disk.

Conversely, if the file data hasn't been written to disk then they expect that the rename over the old data won't occur.

There is no expectation that either of these operations will occur immediately, which is why they don't request that happen via fsync().

If the current method applications use when expecting this behaviour is considered incorrect, then it'd be nice to define an API that does provide the desired semantics. That said, I can't think of any cases where you wouldn't want the new data blocks written out before renaming over an existing file.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:44 UTC (Fri) by forthy (guest, #1525) [Link]

"POSIX allows this" is an anal point. Like "the C standard allows this". This is not what standard documents are about. I'm working in a standard committee, and standards are about compromises between different implementations who all have different design goals. If you design a file system, it is your task to define those parts left out by the standard. Do you want it fast, do you want it robust, or do you want it compatible with some cruft out there for 25 years (like FAT)? A robust file system has to preserve a consistent state of the file system in case of a crash - and POSIX defines what consistent is: no reordering of operations allowed. The real solution therefore is log-structured file system with COW. Ok, so wait for btrfs.

But in the meantime, ext4 could just delay metadata writes as well, and make sure that at these sync points it does not commit metadata until it has also written the data out. The bug here is that ext4 writes a "commit" for a metadata update even though data updates issued prior to this metadata update are still pending.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:58 UTC (Fri) by kh (subscriber, #19413) [Link]

I wonder if the default for distributions will not become the "nodelalloc" mount option. I'm pretty happy with the speed of ext3, and would even accept a small performance penalty relative to ext3 for additional data integrity guarantees, and I think my position is pretty normal - so I am left wondering who the target audience of ext4 is. Only server farms with multiple power backup systems?

Server farms are not the target

Posted Mar 13, 2009 19:21 UTC (Fri) by khim (subscriber, #9252) [Link]

Ext2 is plenty fast for most operations and if you can just throw away everything (like Google can) it's not a bad choice. And if ext4 cannot provide a sane guarantee (and yes, open/write/close/rename must be atomic no matter what POSIX says) then it's a filesystem for nobody...

Server farms are not the target

Posted Mar 13, 2009 19:41 UTC (Fri) by martinfick (subscriber, #4455) [Link]

It sounds like it is atomic, just not durable.

Server farms are not the target

Posted Mar 14, 2009 1:21 UTC (Sat) by droundy (subscriber, #4559) [Link]

No, if it were atomic, you'd be guaranteed to end up with either the old version of the file or the new version of the file.

Server farms are not the target

Posted Mar 14, 2009 1:47 UTC (Sat) by martinfick (subscriber, #4455) [Link]

Well, it sounds like you are guaranteed to see the new version of the file unless the system crashes. In other words, running processes will always see the operations as atomic without a crash. The question is, can an operation be atomic without being durable? I cannot find a satisfying reference that says yes or no. Wikipedia says no, but has zero references. However, none of the first ten Google results defining ACID transactions mention anything about operations needing to be durable to be atomic.

You may be right, but now you have me questioning both wikipedia and my own interpretation. :)

Atomicity vs durability

Posted Mar 14, 2009 13:42 UTC (Sat) by man_ls (guest, #15091) [Link]

"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash. I think that durability is not needed here:
> In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.
In ext3 it does not matter (too much) if the transaction stays committed, but it cannot be left in the middle of an operation (crashes notwithstanding).

Let's see an example. Say we have the following sequence:

  atomic change -> commit -> 5 secs -> flushed to disk
The change might be a rename operation. If the system crashes during those 5 secs, the transaction might not survive the crash -- the filesystem would appear as before the atomic change, and thus ext3 is not durable. But the transaction can only appear as not started or as finished, and not in any state in between; thus ext3 is atomic. I guess that is what fsync() is about: durability of the requested change.

But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.

I may very well have gotten everything muddled up, and would be grateful for any clarification. My coffee intake is not what it used to be these days.

Atomicity vs durability

Posted Mar 15, 2009 3:21 UTC (Sun) by bojan (subscriber, #14302) [Link]

> But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.

I think you're missing the crucial distinction here. When the atomic rename happens (and atomicity refers to file names _only_ here), "so that there is no point at which another process attempting to access newpath will find it missing" (from rename(2) manual page), the new file will replace the old file. However, because the application writer didn't commit the _data_ to the new file yet, it may not be on disk.

In other words, rename(2) does _not_ specify atomicity of _data_ inside the file, only that at no point will the file _name_ be missing. For data to be in that new file, durability is required. Ergo, fsync.

The whole API, including write(), close(), fsync() and rename() has absolutely no idea that the application writer is trying to atomically replace the file. Only the application writer knows this and must act accordingly.

Atomicity vs durability

Posted Mar 15, 2009 9:44 UTC (Sun) by alexl (subscriber, #19068) [Link]

I don't think this is a correct description of the specs.

POSIX guarantees an atomic replacement of the file, and this means *both* the data and the filenames[1]. However, POSIX doesn't specify anything about system crashes. So, this guarantee is only valid for the non-crashing case.

For the crashing case POSIX doesn't guarantee anything. In fact, many POSIX filesystems such as ext2 can (correctly, by the spec) result in a total loss of all filesystem data in the case of a system crash. And in fact, this is allowed even if the application fsync()ed some data before the crash.

Now, in order to have some way of getting better guarantees than this, POSIX also supplies fsync, which guarantees that the files have been written to disk. However, nowhere in the spec of fsync does it say that this is guaranteed to survive a system crash. Of course, if it *does* survive, it is nice to have the fsync guarantee, because if the metadata change survived we're more likely to get the whole new file.

But your claim that the "atomic" part only refers to the filenames is bullshit. POSIX does give full guarantees for both filename and content in the case it specifies. Everything else is up to the implementation. This is why it's a good idea for a robust filesystem to give the write-data-before-metadata-on-rename guarantee, since it turns a non-crash POSIX guarantee into a post-crash guarantee. (Of course, this is by no means necessary; even ext2 with full data loss on crash is POSIX compliant. It's just a *good* implementation.)

[1] From the POSIX spec:
If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.
(Notice how this makes no separation between "filenames" and "data".)

Atomicity vs durability

Posted Mar 15, 2009 10:18 UTC (Sun) by bojan (subscriber, #14302) [Link]

> (Notice how this has no separation about "filenames" and "data")

Notice how it doesn't say _anything_ at all about the content of the file _on_ _disk_ that is being renamed into the old file. That is because in order to see the file _durably_, you have to commit its content. Completely unrelated to committing the _rename_.

Just because another process can see the link (and access the data correctly, which may still just be cached by the kernel) does _not_ mean data reached the disk yet.

BTW, thanks for working on fixes of this in Gnome.

Atomicity vs durability

Posted Mar 15, 2009 10:29 UTC (Sun) by alexl (subscriber, #19068) [Link]

Of course not. It says *nothing* about what's on disk, because durability wrt system crashes (not process crashes) is not part of POSIX. So, any behaviour better than full data loss on crash is a robustness property of the implementation.

I argue that a robust, useful filesystem should give data-before-metadata-on-rename guarantees, as that would make it a better filesystem. And without this guarantee I'd rather use another filesystem for my data. This is clearly not a requirement, and important code should still fsync() to handle other filesystems. But it's still the mark of a "good" filesystem.

Atomicity vs durability

Posted Mar 15, 2009 10:37 UTC (Sun) by bojan (subscriber, #14302) [Link]

Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)

Atomicity vs durability

Posted Mar 15, 2009 11:04 UTC (Sun) by man_ls (guest, #15091) [Link]

That is why we are willing to replace it in the first place! But not if it means losing its good properties in the process (be they in the spec or not).

Atomicity vs durability

Posted Mar 20, 2009 11:24 UTC (Fri) by anton (subscriber, #25547) [Link]

> Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)
If the fsync() has to write out 500MB, I certainly would expect it to take several seconds and the fsync call to block for several seconds. fsync() is just an inherently slow operation. And if an application works around the lack of robustness of a file system by calling fsync() frequently, the application will be slow (even on robust file systems).

Atomicity vs durability

Posted Mar 15, 2009 10:36 UTC (Sun) by bojan (subscriber, #14302) [Link]

> But, your discussions about how the "atomic" part is only refering to the filenames is bullshit.

Consider this. A file has been renamed and just then the kernel decides that the directory will be evicted from the cache (i.e. committed to disk). What will be written to disk? The _new_ content of the directory will be written out, because any other process looking for the file must see the _new_ file (and must never _not_ see the file). At the same time, the content of the file can still be cached just fine and _everything_ is atomically visible to all processes.

And yet, the atomicity of rename _only_ refers to filenames.

Atomicity vs durability

Posted Mar 15, 2009 11:02 UTC (Sun) by man_ls (guest, #15091) [Link]

No, you are just blurring the issue; transactionality does not work that way. I think alexl's interpretation is correct here. It does not matter if the contents of the file are still cached; other processes can see either the old contents or the new contents, but not both and not a broken file. The rename cannot be atomic if the name points to e.g. an empty file; not only the filename must be valid but the contents of the file as well, up to the point when the rename is done. It is no good if the file appears as it was before the atomic operation was issued (e.g. empty).

With fsync you make the contents persistent i.e. durable, but the operation should be atomic even without the fsync.

Atomicity vs durability

Posted Mar 15, 2009 11:34 UTC (Sun) by dlang (guest, #313) [Link]

what you are missing is that unless the system crashes everything does work. processes on the system see either the old file content or the new file content.

the only time they could see a blank or corrupted file is if the system crashes.

so the atomicity is there now

it's the durability that isn't there unless you do f(data)sync on the file and directory (and have hardware that doesn't do unsafe caching, which most drives do by default)
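
For completeness, the file-and-directory part of that looks roughly like the sketch below (durable_rename and dirpath are illustrative names; whether the result survives a power cut still depends on the drive's write cache, as noted above):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: after the rename, fsync the containing directory so the
   updated directory entry itself reaches the disk, not just the file
   data.  Error handling is abbreviated. */
int durable_rename(const char *oldpath, const char *newpath,
                   const char *dirpath)
{
    if (rename(oldpath, newpath) != 0)
        return -1;

    int dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0)
        return -1;

    int ret = fsync(dirfd);           /* flush the directory update */
    close(dirfd);
    return ret;
}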

Atomicity vs durability

Posted Mar 15, 2009 12:48 UTC (Sun) by man_ls (guest, #15091) [Link]

Exactly. What is missing now is atomicity in the face of a crash. To quote myself from a few posts above:
"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash.
Durability (what fsync provides) is not needed here; durability means that the transaction is safely stored to disk. What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.

Even if the POSIX standard does not speak about system crashes it is good engineering to take them into account IMHO.

Atomicity vs durability

Posted Mar 15, 2009 13:06 UTC (Sun) by bojan (subscriber, #14302) [Link]

> What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.

Which is not what POSIX requires.

Atomicity vs durability

Posted Mar 15, 2009 13:13 UTC (Sun) by man_ls (guest, #15091) [Link]

That is right, that is why we are not using ext2 (which is POSIX-compliant), FreeBSD (which is POSIX-compliant) or even Windows Vista (which can be made to be POSIX-compliant). We are running a journalling file system in the (apparently silly) hope that the system will hold our data and then give it back.

Atomicity vs durability

Posted Mar 15, 2009 13:19 UTC (Sun) by bojan (subscriber, #14302) [Link]

Look, I'm all for reliability. But, if the manual says: "fsync if you want your data on disk" and we don't fsync, then it is us that are creating the problem.

I think we should come up with a new API that guarantees what people really want. Making the existing API do that on a particular FS is just going to make applications non-portable to any FS that doesn't work that way using existing POSIX API. We've seen this with XFS. Who knows what's lurking out there. Better do the proper thing, fsync and be done with it. Then we can invent the new, better, smarter API.

Atomicity vs durability vs reliability

Posted Mar 15, 2009 13:35 UTC (Sun) by man_ls (guest, #15091) [Link]

No, you are not all for reliability if you cannot see beyond your little POSIX manual. Or if you don't care about system crashes because the manual is silent about this particular point. Sorry to break it to you: reliability is about little details like having a predictable response to a crash, or surviving the crash while retaining all the nice properties.
> I think we should come up with a new API that guarantees what people really want.
APIs are good enough as they are -- we don't need a special "reliability API" so we can build a special "reliability manual" for guys who just follow the book.
> We've seen this with XFS.
Nope. What we have seen with XFS is how some anal-retentive developers lost most of their user base while arguing such points as "POSIX compliance", and then finally gave in. With ext4 we are hoping to get to the point where the devs give in before they lose most of their user base, just because ext4 is important for Linux and for our world domination agenda. Meanwhile you can keep waving the POSIX standard in our faces. The POSIX standard seems to be about compatibility, not about reliability, and it should keep playing that role. Reliability is left as an exercise for the attentive reader. Let us hope that Mr Ts'o is attentive and can tell atomicity, reliability and durability apart.

Actually it's done deal...

Posted Mar 15, 2009 17:34 UTC (Sun) by khim (subscriber, #9252) [Link]

If you read the comments on tytso's blog you'll see that the current position is: "POSIX is right while applications are broken, yet we'll save them anyway". Even if the "proper way" is to fix thousands of applications, it's just not realistic - so ext4 (starting from 2.6.30) will try to save these broken applications by default. And if you want performance, there is a switch. Good enough for me. Can we close the discussion?

Actually it's done deal...

Posted Mar 15, 2009 21:10 UTC (Sun) by bojan (subscriber, #14302) [Link]

Exactly. Ted is a practical man, so he already put a workaround in place, until applications are fixed.

Sorry

Posted Mar 15, 2009 21:20 UTC (Sun) by man_ls (guest, #15091) [Link]

Sure, I have polluted the interwebs enough with my ignorance, and there is little chance to learn anything else.

Atomicity vs durability vs reliability

Posted Mar 15, 2009 21:06 UTC (Sun) by bojan (subscriber, #14302) [Link]

> No, you are not all for reliability if you cannot see beyond your little POSIX manual.

POSIX manual is not little ;-)

Seriously, we tell Microsoft that going out of spec is bad, bad, bad. But, we can go out of spec no problem. There is a word for that:

http://en.wikipedia.org/wiki/Hypocrisy

> What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in.

Yep, blame the people that _didn't_ cause the problem. We've seen that before.

Sorry, but I don't see it this way...

Posted Mar 15, 2009 22:08 UTC (Sun) by khim (subscriber, #9252) [Link]

I'm yet to see anyone who asks Microsoft to never go beyond the spec. That would be just insane: if you cannot ever add anything beyond what the spec says, how can any progress occur?

When Microsoft is blamed it's because Microsoft
1. Does not implement spec correctly, or
2. Doesn't say which parts are spec requirements and which are extensions.

When Microsoft says "JNI is not sexy so we'll provide RMI instead" the ire is NOT about problems with RMI. Lack of JNI is to blame.

I don't see anything of the sort here: POSIX does not require making open/write/close/rename atomic, but it certainly does not forbid it. And it's a useful thing to have, so why not? It would be best to actually document this behaviour, of course - after that, applications can safely rely on it and other systems can implement it as well if they wish. We even have a nice flag to disable this extension if someone wants that :-)

Sorry, but I don't see it this way...

Posted Mar 15, 2009 22:24 UTC (Sun) by bojan (subscriber, #14302) [Link]

> 1. Does not implement spec correctly

Which is exactly what our applications are doing. POSIX says: commit. We don't, and then we blame others for it.

This is the same thing HTML5 is doing

Posted Mar 15, 2009 22:33 UTC (Sun) by khim (subscriber, #9252) [Link]

Sorry, but it's not a problem with POSIX or the FS - it's a problem with a number of applications. Once a lot of applications start to depend on some weird feature (content sniffing in the case of HTML, atomicity of open/write/close/rename in the case of the filesystem) it makes no sense to try to fix them all. Much better to document it and make it official. This is what Microsoft did with a lot of "internal" functions in MS-DOS 5 (and it was praised for it, not ostracized), this is what HTML is doing in HTML5, and this is what Linux filesystems should do.

Was it a good idea to depend on said atomicity? Maybe, maybe not. But the time to fix these problems has come and gone - today it's much better to extend the spec.

This is the same thing HTML5 is doing

Posted Mar 15, 2009 23:37 UTC (Sun) by bojan (subscriber, #14302) [Link]

> But the time to fix these problems come and gone - today it's much better to extend the spec.

The time to fix these problems using the existing API is now, because right now we have everyone's attention on how to use the API properly. To the credit of some in this discussion, bugs are already being fixed in Gnome (as I already mentioned in another comment). I also have bugs to fix in my own code - there is no denying that :-(

In general, I agree with you on extending the spec. But, before the spec gets extended officially, we need to make sure that _every_ POSIX-compliant file system implements it that way. Otherwise, apps depending on this new spec will not be reliable until that's the case. So, can we actually make sure that's the case? I very much doubt it. There are a lot of different systems out there implementing POSIX, some of them very old. Auditing all of them and then fixing them may be harder than fixing the applications.

Why do we need such blessing?

Posted Mar 16, 2009 0:05 UTC (Mon) by khim (subscriber, #9252) [Link]

Linux extends POSIX all the time. New syscalls, new features (things like "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait."), etc. If an application wants to use such an "extended feature" it can do so; if not, it can use POSIX-approved features only.

As for old POSIX systems... it's up to application writers again. And you can be pretty sure A LOT OF them don't give a damn about POSIX compliance. They are starting to consider Linux as a third platform for their products (the first two are obviously Windows and MacOS, in that order), but if you try to talk to them about POSIX it'll just lead to the removal of Linux from the list of supported platforms. Support for many distributions is already hard enough, support for some exotic filesystems gets a "we'll think about it but don't hold your breath...", and support for old exotic POSIX systems... fuggetaboudit!

Now - the interesting question is: do we welcome such selfish developers or not? This is a hard question, because the answer "no, they should play by our rules" will just lead to an exodus of users - because they need these applications and WINE is not a good long-term solution...

Atomicity vs durability

Posted Mar 15, 2009 22:05 UTC (Sun) by dcoutts (subscriber, #5387) [Link]

Remember, we do not care if the data is on disk or not, just that if it does make it to disk that it preserves the atomic property we were after. All that needs to happen is for the rename not to be reordered in front of the write. That hardly restricts performance.

As for a new API, yes, that'd be great. There are doubtless other situations where it would be useful to be able to constrain write re-ordering. For example for writes within a single file if we're implementing a persistent tree structure where the ordering is important to provide atomicity in the face of system failure.

Having a nice new API does not mean that the obvious cases that app writers have been using for ages are wrong. We should just insert the obvious write barriers in those cases.

Atomicity vs durability

Posted Mar 16, 2009 4:52 UTC (Mon) by dlang (guest, #313) [Link]

remember that the drive has its own buffer (which usually isn't battery backed), and it will tell the OS that the data is written when it's in the buffer, not when it is on the disk. it can then re-order the writes to the disk.

so everything that you are screaming that the OS should guarantee can be broken by the hardware after the OS has done its best.

you can buy/configure your hardware to not behave this way, but it costs a bunch (either in money or in performance). similarly you can configure your filesystem to give you added protection, at a significant added cost in performance.

Atomicity vs durability

Posted Mar 16, 2009 11:00 UTC (Mon) by forthy (guest, #1525) [Link]

Any reasonable hard disk (SATA, SCSI) has write barriers which allow file system implementers to actually implement atomicity.

Atomicity vs durability

Posted Mar 15, 2009 23:51 UTC (Sun) by vonbrand (guest, #4458) [Link]

I just don't understand all this "extN isn't crash-proof" whining... Yes, Linux systems do crash on occasion. It is thankfully very rare. Yes, hardware does fail. Even disks do fail. Yes, if you are unlucky you will lose data. Yes, the system could fail horribly and scribble all over the disk. Yes, the operating system could mess up its internal (and external) data structures.

It is just completely impossible for the operating system to "do the right thing with respect to whatever data the user values more", even more so in the face of random failures. Want performance? Then you have to do tricks like caching/buffering data; disks are horribly _s_l_o_w_ when compared to your processor or memory.

Asking Linux developers to create some Linux-only beast of a filesystem in order to make application developers' lives easier doesn't cut it; there are other operating systems (and Linux systems with other filesystems) around, and always will be. Plus, asking for a filesystem that is impossible in principle won't get you too far either.

Atomicity vs durability

Posted Mar 16, 2009 0:08 UTC (Mon) by man_ls (guest, #15091) [Link]

Yes, isn't it silly to ask for the moon like this? Apart from the fact that ext3 does exactly what we are asking for; and XFS since 2007; and now ext4 with the new patches. Oh wait... maybe you really didn't understand what we were asking for.

Listen, the sky might fall on our heads tomorrow and eventually we are all to die, we understand that. But until then we really want our filesystems to do atomic renames in the face of a crash (i.e. what the rest of the world [except POSIX] understands as "atomic"). Not durable, not crash-proof, not magically indestructible -- just all-or-nothing. Atomic.

YMMV

Posted Mar 16, 2009 0:26 UTC (Mon) by khim (subscriber, #9252) [Link]

Yes, Linux systems do crash on occasion. It is thankfully very rare.

Depends on what hardware and what kind of drivers you have.

Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.

The problem is: a fast filesystem is useless if it can't keep my data safe. Microsoft knows this - that's why you don't need to explicitly unmount a flash drive there. Yes, the cost is huge - it means flash wears down faster and the speed is horrible - but anything else is unacceptable. Oh, and I know A LOT OF users who just turn off the computer at the end of the day. This problem is somewhat mitigated by the design of current systems (the "power off" button is actually a "shutdown" button), but people find ways to cope: they just flip the power switch for the whole desk.

The same thing applies to developers. They are lazy. Most application writers do not use fsync and do not check the error code from close. Yet if data is lost, the OS will be blamed. Is it fair to OS and FS developers? Not at all! Can it be changed? Nope. Life is unfair - deal with it.

The whining started when it was found that the new filesystem can lose valuable data in a way ext3 never does (ext3 can do this with O_TRUNC, but not with rename). This is a pretty serious regression to most people. The approach of "let's fix thousands upon thousands of applications" (including proprietary ones) was thankfully rejected. This is a good sign: it means Linux is almost ready to be usable by normal people. The last time such a problem happened (the OSS->ALSA switch), the offered solution was beyond the pale.

Atomicity vs durability

Posted Apr 8, 2009 15:30 UTC (Wed) by pgoetz (subscriber, #4931) [Link]

Who gives a flying fruitcake about what POSIX requires?! It's not acceptable for a user to edit, say, her thesis, which she's been working on for 18 months and which has been saved thousands of times, and -- upon system crash -- find that not only did she lose her most recent 15 minutes' worth of changes (acceptable) but in fact THE ENTIRE FILE. Putting the onus on application developers to fsync early and often is beyond ridiculous.

Atomicity vs durability

Posted Mar 15, 2009 13:14 UTC (Sun) by bojan (subscriber, #14302) [Link]

> The rename cannot be atomic if the name points to e.g. an empty file

As long as processes running on the system can see a consistent picture (and they can), according to POSIX it is.

> not only the filename must be valid but the contents of the file as well

From the point of view of any process running on the system, it is.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 5:45 UTC (Sat) by Nick (guest, #15060) [Link]

> "POSIX allows this" is an anal point. Like
> "the C standard allows this". This is not
> what standard documents are about.

To everyone whining about this behaviour because "posix allows it", it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow). POSIX in this case *IS* basically ratifying existing practices.

fsync() is something that you have had to do on many OSes and many filesystems for a long time in order to get correct semantics. Programmers who come along a decade or two later, decide to ignore basics like that because it mostly appears to work on one filesystem on one OS, and then complain that a new filesystem or OS breaks their app really don't have a high horse on which to sit.... a mini pony at best.

It's quite fair to hope for a best effort, but it is also fair to make an appropriate choice about the performance tradeoff so that well written applications don't get horribly penalised.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 13:48 UTC (Fri) by anton (subscriber, #25547) [Link]

[...] it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow)
That's actually a very good example: ext4 performs the writes in a different order than what corresponds to the operations in the process(es), resulting in an on-disk state that never was the logical state of the file system at any point in time. One difference is that file systems have been crash-vulnerable ("crazy") for a long time, so in a variation of the Stockholm syndrome a number of people now insist that that's the right thing.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:25 UTC (Fri) by droundy (subscriber, #4559) [Link]

Similarly, all FSs should try hard to make sure open-write-close-rename leaves you with either the new or the old file. Anything else isn't reasonable.

Absolutely. The problem with the argument that it must be open-write-fsync-close-rename is that that describes a *different* goal, which is to ensure that the new file is actually saved. When an application doesn't particularly care whether the new or old file is present in case of a crash, it'd be nice to allow them to ask only for the ordinary POSIX guarantees of atomicity, without treating system crashes as a special exception.

And on top of that, I'd prefer to think of all those Ubuntu gamers as providing the valuable service of searching for race conditions relating to system crashes. Personally, I prefer not to run nvidia drivers, but I'd like to use a file system that has been thoroughly tested by hard rebooting the computer, so that on the rare power outages, I'll not be likely to lose data. It'd be even nicer if lots of people were to stress-test their ext4 file systems by simply repeatedly hitting the hard reset button, but it's hard to see how to motivate people to do this.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 18:26 UTC (Fri) by oak (guest, #2786) [Link]

> It'd be even nicer if lots of people were to stress-test their ext4 file
> systems by simply repeatedly hitting the hard reset button, but it's hard
> to see how to motivate people to do this.

If you have very small children, they can do it for you. When you least
want/expect it...

Anyway, it's not so hard to build an automated system for this, and you can
also buy one. All the file system maintainers I know do this kind of
testing.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 18:03 UTC (Sat) by dkite (guest, #4577) [Link]

That used to happen back when reset buttons were red and prominent. I speak from experience.

Derek

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 23, 2009 5:02 UTC (Mon) by dedekind (guest, #32521) [Link]

It's not just awful, it is also silly.

$ man 2 write

...

"A successful return from write() does not make any guarantee that data has been committed to disk."

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 15:57 UTC (Fri) by sbergman27 (guest, #10767) [Link]

This blog entry was about 50% a platform for trolling against Ubuntu and its users, and about 50% Ted wildly blaming everyone he can think of for a significant data integrity flaw in his filesystem.

I'm shocked and dismayed that a developer of an ext# filesystem can be so cavalier regarding a data integrity issue. This attitude would *never* have been taken by a dev during this period in ext3's life-cycle.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:09 UTC (Fri) by johnkarp (guest, #39285) [Link]

I'm definitely in the app programmer camp, not kernel, yet to me it seems bloody obvious that if you open a file with O_TRUNC you're creating a period of time where the file will be empty.

The only thing that surprised me was that writing to a second file and then renaming it over the first wasn't fully sufficient.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:15 UTC (Fri) by dcoutts (subscriber, #5387) [Link]

Exactly. The truncate method can obviously cause data loss, which is why we use the atomic rename method. The problem is that ext4 re-orders the rename() before the write(). That is the broken behavior.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 18:06 UTC (Fri) by jgg (subscriber, #55211) [Link]

Yeah, I think you have hit this exactly on the head. Reading through Ts'o's comments on the blog, I think he even confirmed that not preserving ordering is a change in behavior since ext3.

This whole discussion has really not been focused much on what actually are sane behaviors for a FS to have across an unclean shutdown. To date most writings by FS designers I've read seem to focus entirely on avoiding FSCK and avoiding FS meta-data inconsistencies. Very few people seem to talk about things like what the application sees/wants.

One of the commenters on the blog had the best point - insisting on adding fsync before every close/rename sequence (either implicitly in the kernel, as has been done, or explicitly in all apps) is going to badly harm performance. 99% of these cases do not need the data on the disk, just the write/close/rename order preserved.

Getting great performance by providing weak guarantees is one thing, but then insisting that everyone who cares about their data use explicit calls that provide a much stronger and slower guarantee is kinda crazy. Just because POSIX is silent on this matter doesn't mean FS designers should get a free pass on transactional behaviors that are so weak they are useless.

For instance under the same POSIX arguments Ted is making it would be perfectly legitimate for a write/fsync/close/rename to still erase both files because you didn't do a fsync on the directory! Down this path lies madness - at some point the FS has to preserve order!

I wonder how bad a hit performance-sensitive apps like rsync will take from the flush-on-rename patches?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 19:16 UTC (Fri) by endecotp (guest, #36428) [Link]

> you didn't do a fsync on the directory!

Yes, I was just thinking the same thing! Come on Ted, what exactly do you want us to write to be portably safe? I have just added an fsync() to my write() close() rename() code, but I checked man fsync first and it tells me that I need to fsync the directory. So is it:

open()
write()
fsync()
close()
rename()
opendir()
fsync()
closedir()

? Or some re-ordering of that? Is there more? Do I have to fsync() the directories up to the root? Can I avoid all this if I call sync()?
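
For what it's worth, one possible concrete reading (an assumption on my part, not necessarily what Ted has in mind) is to open the directory itself and fsync that descriptor:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only; paths are examples and error handling is abbreviated. */
int replace_file(const char *buf, size_t len)
{
    int fd = open("dir/file.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0 || close(fd) != 0)
        return -1;
    if (rename("dir/file.new", "dir/file") != 0)
        return -1;
    int dirfd = open("dir", O_RDONLY);   /* fsync(2) wants an fd, so open the directory itself */
    if (dirfd < 0 || fsync(dirfd) != 0 || close(dirfd) != 0)
        return -1;
    return 0;
}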

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:17 UTC (Fri) by alexl (subscriber, #19068) [Link]

I don't think that is quite necessary for durability.

If the metadata is not written out but the data is and then things crash, you will just have the old file as it was, and either a written file+inode with no name (moved to lost+found) or the written file with the temporary name.

As far as I can see, syncing the directory is not needed. (Unless you want to guarantee the file being on disk, rather than just not breaking the atomic file-replace behaviour.)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:41 UTC (Fri) by masoncl (subscriber, #47138) [Link]

The directory fsync requirements came from ext2. For the journaled filesystems, an fsync on the file will get you the dir as well.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:03 UTC (Fri) by endecotp (guest, #36428) [Link]

OK. So if I want code that's portable to ext2, I need to fsync the directory. Maybe there aren't many people using ext2 these days, but I would like code that's genuinely portable; I do personally care about the various flash filesystems, and when I break things for BSD users they complain. So I guess the directory fsync is needed.

Thinking a bit more about this from the "application requirements" point of view, I can see three cases:

1- The change needs to be atomic wrt other processes running concurrently.
2- The change needs to be atomic if this process terminates (ctrl-C, OOM).
3- The change needs to be atomic if the system crashes.

I can't think of a scenario where the application author would reasonably say, "I need this data to be safe in cases 1 and 2 but I don't care about 3." Can anyone else?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 22:45 UTC (Fri) by jgg (subscriber, #55211) [Link]

> I can't think of a scenario where the application author would reasonably say, "I need this data to be safe in cases 1 and 2 but I don't care about 3." Can anyone else?

It isn't that uncommon really; any time you want to protect against a failing program messing up its output file, rename is the best way. For instance, programs that are used with make should be careful not to leave garbage around if they crash. rsync and other downloading software do the rename trick too, for the same basic reasons. None of these uses require fsync or other performance-sucking things.

The reason things like emacs and vim are so careful is because they are almost always handling critical data. I don't think anyone would advocate rsync should use fsync.

The considerable variation in what FSs do is also why, as an admin, I have a habit of knocking off a quick 'sync' command after finishing some adminy task, just to be certain :)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 10:37 UTC (Mon) by endecotp (guest, #36428) [Link]

> Come on Ted, what exactly do you want us to write to be portably safe?

Ted seems to have answered this in his second blog post: YES you DO need to fsync the directory if you want to be certain that the metadata has been saved.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:28 UTC (Fri) by amk (subscriber, #19) [Link]

There's also the broken behaviour of the applications that are updating their config. files every minute, which shouldn't be forgotten. If saves are being explicitly requested by the user, that's pretty infrequent and the crash has to occur within a window of vulnerability, but constant resaves mean a crash will inevitably cause data loss.

Why is this behaviour broken? It's perfectly normal behaviour...

Posted Mar 14, 2009 11:10 UTC (Sat) by khim (subscriber, #9252) [Link]

Take a P2P client. A good P2P client will keep information about peers for each file - this way, if the system is rebooted, the lengthy process of finding peers can be avoided. Since there are hundreds (sometimes thousands) of peers, this means hundreds of files are rewritten every minute or so. If a filesystem cannot provide guarantees without fsync, I just refuse to use it. XFS went this way. The XFS developers long argued their right to destroy files on crash, we've all agreed that they can do this, and I can answer the question "What do you think about XFS?" with just "Don't use it. Ever." And everyone was happy.

Looks like tytso actually fixed the problem in ext4 (even if the actual words were akin to "application developers are crazy and this is incorrect usage, but we can not replace all of them"), so at least I can conclude he's more sane than the XFS developers...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 12:32 UTC (Sat) by nix (subscriber, #2304) [Link]

Er, you do know that Ted was one of the ext3 devs as well, right? ;)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:22 UTC (Fri) by anton (subscriber, #25547) [Link]

I'm shocked and dismayed that a developer of an ext# filesystem can be so cavalier regarding a data integrity issue. This attitude would *never* have been taken by a dev during this period in ext3's life-cycle.
The more recent ext3 developers have a pretty cavalier attitude to data integrity: For some time the data=journal mode (which should provide the highest integrity) was broken. Also, ext3 disables using barriers by default, essentially eliminating all the robustness that adding a journal gives you (and without getting any warning that your file system is corrupted).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:11 UTC (Fri) by dcoutts (subscriber, #5387) [Link]

Surely the solution is just to put an implicit write barrier between the file content data being written and the file meta-data being written. Then the write followed by rename thing would actually be atomic (as we had always assumed it was).

There's a very low performance penalty for using a write barrier. All modern disks support it without having to issue a full flush.

App authors are not demanding that the file data make it to the disk immediately. They're demanding that the file update is atomic. It should preserve the old content or the new, but never give us a zero-length file.

Can this be that difficult? We do not need a total ordering on all file system requests. We just need it for certain meta-data and content data writes.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:25 UTC (Fri) by forthy (guest, #1525) [Link]

We don't even need write barriers. The file system can batch up lots of data and metadata updates in one go. But it should keep the consistency POSIX promises: all file system operations are performed in order. This is actually a POSIX promise; it just doesn't hold for crashes (because crashes are not specified). I.e., if delayed updates are used, they should be delayed all together and then done in an atomic way - either complete them or roll them back to the previous state. This is actually not difficult.

Btrfs does this; Ted Ts'o doesn't seem to get it. Many file system designers don't get it; they are anal about their metadata, and don't care at all about user data. Unix file systems have lost data in that way since the invention of synchronous metadata updates (prior to that, they also lost metadata ;-).

IMHO there is absolutely nothing wrong with the create-write-close-rename way to replace a file. As application writers, we have to rely somehow on our OS. If we can't, we'd better write it ourselves. When the file system designers don't get it, and are anal about some vague spec, fsck them.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 17:04 UTC (Fri) by jwb (guest, #15467) [Link]

What we actually need is a user-space API that makes sense and wasn't invented in the Sputnik era.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:25 UTC (Fri) by alexl (subscriber, #19068) [Link]

In terms of POSIX API additions, what would be nice for atomic safe rewrite of files is something like:

fd1=open(dirname(file))
fd2=openat(fd1, NULL, O_CREAT) // Creates an unlinked file
write(fd2)
flinkat(fd1, fd2, basename(file)) // Should guarantee fd2 is written to disk before linking.
close(fd2)
close(fd1)

This seems race free:
doesn't break if the directory is moved during write
doesn't let other apps see or modify the temp file while writing
doesn't leave a broken tempfile around on app crash
doesn't end up with an empty file on system crash

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:41 UTC (Sat) by dcoutts (subscriber, #5387) [Link]

Yes, that would be great. It's a natural extension of the POSIX notion that files are separate from their directory entries.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 23:25 UTC (Sun) by halfline (guest, #31920) [Link]

Another idea would be to introduce a new open flag, O_REWRITE or some such, that gives the application developer the same straightforwardness as O_TRUNC but under the hood works on a detached file and atomically renames on close. Since all the I/O operations are grouped (via the fd), the kernel should be able to ensure proper ordering relatively easily (I think?) and apps wouldn't have to introduce a costly "sync it now" operation.
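
Purely to illustrate the proposal, in the style of the other listings here (O_REWRITE does not exist; this is hypothetical API):

fd = open("prefs.ini", O_WRONLY | O_CREAT | O_REWRITE, 0644)   // hypothetical flag
write(fd, buf, len)                                            // writes go to a detached file
close(fd)                                                      // kernel atomically replaces prefs.ini here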

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 21, 2009 0:34 UTC (Sat) by spitzak (guest, #4593) [Link]

This flag certainly is needed, but I would go further and say that Linux should change the behavior of O_CREAT|O_WRONLY|O_TRUNC to do exactly what you specify. This is because probably every program using these flags (or using creat()) is written to implicitly expect this behavior anyway.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:35 UTC (Sat) by dcoutts (subscriber, #5387) [Link]

Sorry, I wasn't clear. I meant that the write barrier should be implicit in the create-write-close-rename dance. I didn't mean application authors should explicitly have to add a write barrier. Of course in the kernel the write barrier would be explicit.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 15:59 UTC (Sat) by Pc5Y9sbv (guest, #41328) [Link]

I thought the same thing as you at first, but I thought about it more and am no longer sure. To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process. This may lead to far too much serialization of IO operations for the typical desktop use case.

So, is there an appropriate set of heuristics to infer write barriers sometimes but not others? The specific case in this discussion would be something like "insert a write barrier after file content operations requested before metadata operations affecting linkage of that file's inode"? Is this sufficient and defensible?

Ideally, we should have POSIX write-barriers that can be applied to a set of open file and directory handles, and use them to get the proper level of ordering across crashes. The fsync solution is far too blunt an instrument to provide the transactionality that everyone is looking for when they relink newly created files into place.

But then what about all those shell scripts out there which do "foo > file.tmp && mv file.tmp file"? We would need a new write-barrier operation applicable from the shell script (somehow selecting partial ordering of requests issued from separate processes), or a heuristic write-barrier as above...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 17:32 UTC (Sat) by dcoutts (subscriber, #5387) [Link]

Yes, it does raise a more general issue. We're not asking for a write barrier between every operation, but it's not entirely obvious which ones we can safely omit (or "mostly safely" omit). I don't have a complete answer either.

Certainly, since rename is supposed to be atomic and is used in this common idiom, it should have a write barrier wrt other operations on the same file. I don't think we should demand barriers between write operations within the same file or between different files. As you say, it would be useful to be able to add explicit barriers sometimes, just as we can for CPU operations on memory.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 7:57 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]

I think the relink atomicity is a red herring here. That refers to the fact that the file is present under either the old or new name, i.e. it is an atomic directory metadata change, ignoring crash behaviors. Our main concern is that we want to extend the POSIX IO ordering semantics of non-atomic sequences across crash boundaries. We could actually recover from non-atomic relink (e.g. file is linked under old and new names) more easily than reordered content and name commits.

I think I agree now that it would be sensible to infer a write barrier between file content requests and inode linking requests for the same inode. This would cover a large percentage of "making data available" scenarios.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:52 UTC (Fri) by anton (subscriber, #25547) [Link]

To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process.
You have the right idea about the proper behaviour, but there is more freedom in implementing it: You can commit a batch of operations at the same time (e.g., after the 60s flush interval), and you need only a few barriers for each batch: essentially one barrier between writing everything but the commit block and writing the commit block, and another barrier between writing the commit block and writing the free-blocks information for the blocks that were freed by the commit (and if you delay the freeing long enough, you can combine the latter barrier with the former barrier of the next cycle).

This can be done relatively easily on a copy-on-write file system. For an update-in-place file system, you probably need more barriers or write more stuff in the journal (including data that's written to pre-existing blocks).

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 9:40 UTC (Sun) by k8to (subscriber, #15413) [Link]

NOTES
Note that fclose() only flushes the user space buffers provided by the C library. To ensure that the data
is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).

It seems fclose() doesn't imply that the data reaches the disk.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:40 UTC (Fri) by anton (subscriber, #25547) [Link]

Write barriers or something equivalent (properly-used tagged commands, write cache flushes, or disabling write-back caches) are needed for any file system that wants to provide any consistency guarantee. Otherwise the disk drive can delay writing one block indefinitely while writing out others that are much later logically.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 17:52 UTC (Fri) by iabervon (subscriber, #722) [Link]

Ted does claim in the comments that rename() without fsync() is unsafe, but that applications should be granted some sort of leeway in that case. I think he'll have to move away from this position if he wants to have his filesystem used in Linus's kernel, because Linus thinks that rename() without fsync() is safe and is the correct way to do things.

The case where you want fsync() is when you do something like: file A names a file, B, which exists and has valid data. You create file C, put valid data in it, and atomically change file A to name file C instead of B, and you want to be sure that file A always names a file which exists and has valid data. You can't be sure, without an fsync(), that the disk won't see the change to file A without seeing some operation on file C.
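
In the style of the other listings in this thread, that might look like this (a sketch; file names are placeholders, and the point is the fsync() on C before A is switched over):

fd = open("C", O_WRONLY | O_CREAT | O_EXCL, 0644)
write(fd, data, len)
fsync(fd)                        // make sure C's contents are on disk...
close(fd)
fd = open("A.new", O_WRONLY | O_CREAT | O_TRUNC, 0644)
write(fd, "C\n", 2)              // ...before A is switched over to name C
fsync(fd)
close(fd)
rename("A.new", "A")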

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:41 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

Linus thinks that rename() without fsync() is safe

"is safe" or _SHOULD_BE_ safe? Far be it from Linus being naïve on these things, but then again I'm certain he's been following this discussion closely for the past few days.

I do know that Linus' big soap box is about programming abstraction, and he'd certainly take the side that open/write/close/rename (in that order) should do exactly that, without any mysterious data loss.

My own "from-the-cuff" perception is that fsync(2)/fdatasync(2) are "band-aids" to address POSIX's lack of specification in this matter. One suggestion is that close(2) should implicitly include an fsync() call, and that programmers should be taught that open() and close() are expensive and best used judiciously.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 21:51 UTC (Fri) by iabervon (subscriber, #722) [Link]

Linus thinks rename() without fsync() "is safe", in the sense that if he were writing an application that was intended to safely maintain important data (like, for example, the Linux source code), he would use rename() on files without using fsync() on them first. I'm not entirely sure that he reads any forum where this has been discussed so far.

I think fsync() makes sense to have. If the system stops running, there are some things that would have happened, except that the system stopped running first. Furthermore, when the whole system stops running, it becomes difficult to know what things had happened and which had not happened. Furthermore, it's too inefficient to serialize everything, particularly for a multi-process system. Falling back to the concurrency model, you can say that the filesystem after an emergency restart should be in some state that could have been seen by a process that was running before the restart. But there needs to be a further restriction, so that you know that the system won't go back to the blank filesystem that you had before installing anything; so fsync() makes sense as a requirement that the filesystem after a restart will be some state that could have been seen after the last fsync() that returned successfully.

(Of course, any time the system crashes, you might lose some arbitrary data, since the system has crashed; but a better system will lose less or be less likely to lose things. This is qualitatively different from the perfectly reasonable habit of ext4 of deciding that the post-restart state is right after every truncate.)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 0:47 UTC (Sat) by njs (guest, #40338) [Link]

But now you've just redefined fsync(2) to mean sync(2), and that has unacceptable overhead for many real uses. (Durably spooling a 1k email message should not force that multi-gigabyte rsync to flush to disk!)

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 12:40 UTC (Sat) by nix (subscriber, #2304) [Link]

Um, GNU coreutils 7.1's mv doesn't do it.

If even *coreutils*, written by some of the most insanely
portability-minded Unix hackers on the planet, doesn't do this
fsync()-source-and-target-directories thing, it's safe to say that, to a
first approximation, nobody ever does it.

The standard here is outvoted by reality.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 14:20 UTC (Sat) by endecotp (guest, #36428) [Link]

That's interesting, but there are two ways to look at it:

- Since mv doesn't fsync and mv is expected to leave things in a sane state after a crash, the kernel must be expected to "do the right thing" wrt rename().

OR

- Since mv doesn't sync, mv is not guaranteed to leave things in a sane state after a crash; if you thought that it was guaranteed to do so you were wrong.

Both

Posted Mar 14, 2009 15:25 UTC (Sat) by man_ls (guest, #15091) [Link]

What I read from this very interesting discussion is that both assumptions are right depending on the circumstances. An inherently unsafe fs like ext2 is not expected to guarantee anything, and mv on ext2 may be left in an unstable state after a crash (including zero-length files). Coreutils developers probably did not see fit to fsync since it would not increase the robustness significantly in these cases: the system might crash in the middle of the fsync anyway.

But on a journalled fs like ext3 users will expect their system to be robust in the event of a crash -- and as the XFS debacle shows, not only for metadata. Both are POSIX-compliant, only ext3 is held to higher standards than ext2. What this means for ext4 is obvious.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 17:56 UTC (Fri) by MisterIO (guest, #36192) [Link]

Well, after I saw that with nodelalloc "the performance will (still) be better than ext3", I immediately made that my default mount option. Frankly, I don't give a f... about extreme performance if I risk losing my data (at least relative to the common data-loss rate of ext3 with its default options). This is not a discussion about which API is better and whether we need to break an old API to create a better one; we're talking about data loss. Since ext4 is the evolution of ext3, I consider this a regression.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 23:34 UTC (Fri) by sbergman27 (guest, #10767) [Link]

"""
Well, after I saw that with nodelalloc "the performance will (still) be better than ext3", I immediately made that my default mount option.
"""

On the other hand, you are now on the less well tested path. I remember an ext3 bug a few years ago that caused data loss... but only for those people who mounted "data=journal" just to be safe. I'm always a bit nervous about straying from defaults.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:44 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

A lot of commenters take a position to which the only reasonable reply is "disable delayed allocation". If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires.

If you're in that camp, you need to get out and start campaigning for programmers to fallocate() more, because without that you're losing a lot of performance to ensure your ordering constraint. With fallocate() the allocation step can be brought forward and the performance loss avoided. At the very least, file downloaders (e.g. curl, or in Firefox) and basic utilities like /bin/cp and the built-in copy of modern mv implementations for crossing filesystems need to fallocate(), or you'll fragment just as badly as in ext3 and perhaps worse (since now the maintainers assume you have delayed allocation protecting you).
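
For anyone who hasn't used it, pre-allocation is a one-liner; a sketch (error handling elided; posix_fallocate() is the portable wrapper, fallocate(2) the Linux-specific call):

#include <fcntl.h>
#include <sys/types.h>

/* Sketch: reserve space up front when the final size is known
   (e.g. a download or a cross-filesystem copy). */
int create_preallocated(const char *path, off_t expected_size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0)
        posix_fallocate(fd, 0, expected_size);   /* or fallocate(fd, 0, 0, expected_size) on Linux */
    return fd;
}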

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 22:24 UTC (Fri) by MisterIO (guest, #36192) [Link]

Fragmentation IMO is not such a big problem, because the online defragmenter will solve or mitigate it.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:10 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

1. Create a 400MB file, allocating as you go, resulting in hundreds of fragments
2. Run defragmenter on file to collect fragments into contiguous areas

is crazy. It's so crazy that ext4's default behaviour waits as long as possible to allocate in order to avoid this scenario, which causes this "bug" that got Ubuntu testers in such a tizzy. The online defragmenter, if and when it arrives in mainline, is a workaround not a fix; you don't want to make it part of your daily routine, so most likely what you'll actually do is live with the reduced performance, all so that some utility developers can avoid writing a few lines of code.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:40 UTC (Sat) by MisterIO (guest, #36192) [Link]

Couldn't it be run as a cron job on the whole fs, daily for example?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 4:19 UTC (Sat) by nlucas (subscriber, #33793) [Link]

The fallocate() man page says it's only supported since kernel 2.6.23. That is a VERY recent kernel version. In a year or two maybe I will look at that page again. For now it's just too soon to care.

fallocate fdatasync sync_file_range

Posted Mar 14, 2009 8:30 UTC (Sat) by stephen_pollei (guest, #23348) [Link]

Yes, fallocate is a good thing for a programmer to know. tytso has mentioned that sqlite most likely should have used fdatasync and fallocate. He also mentioned that fsync wouldn't have really been a problem even in ext3 with data=ordered mode if it was called in a thread. I also think sync_file_range(), fiemap and msync() are good things to know about. I can kind of see how something like mincore() that returned more of the information that is in the page tables might be nice, so you could check whether a page you scheduled for writeout is still dirty or not.

I don't think any of these things would help the case of many small text files being replaced by a rename, though -- you need an fsync() to flush the metadata for the file size increase, I assume.

a) open and read file ~/.kde/foo/bar/baz
b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
d) sync_file_range or msync to schedule but not wait for stuff to hit disk --- this is optional
e) close(fd)
f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") 
g) wait for the stuff to hit the disk somehow
h) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
I think a lot of the time, not being in such a rush to clobber the old data and keeping both around for a while might work just fine. Heck, keep a few versions around to roll back to and lazily garbage-collect them once you can see they are stale. I could be totally wrong though -- just brainstorming.
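
For step (d) above, the "schedule but don't wait" call on Linux could be sync_file_range(); note that for step (g) its WAIT_* flags alone give no durability guarantee, so one option (an assumption on my part) is to keep the fd open a little longer and use fdatasync() there instead:

sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);   // d) start writeback of the whole file, return immediately
/* ... later, while the fd is still open ... */
fdatasync(fd);                                      // g) actually wait for the data (and the metadata needed to reach it)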

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 15:04 UTC (Fri) by anton (subscriber, #25547) [Link]

If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires.
By what definition? It's perfectly possible to delay allocation (and writing) as long as desired, and also delay all the other operations that are supposed to happen afterwards (in particular the rename of the same file) until after the allocation and writing have happened. Delayed allocation is a red herring here; the problem is the ordering of the writes.

LinLogFS, which had the goal of providing good crash recovery, did implement delayed allocation.

Where did the correctness go?

Posted Mar 13, 2009 23:34 UTC (Fri) by bojan (subscriber, #14302) [Link]

It is sad to see people attack Ted here for simply pointing out bugs in applications. Once upon a time, people associated with Linux would tell Windows folks that we're all for open specifications and well documented behaviour. But, when it comes to the behaviour of one file system in one of its modes which was masking incorrect usage of the API, we quickly revert to screaming bloody murder and asking for more hand holding. How did that come about?

On the other hand, nobody is complaining about this:

> On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.

That seems like the real problem to me. If I ask for fsync on _my_ file, why on earth does the file system flush the lot out? Shouldn't _that_ be fixed instead?

Not to mention the problem of indiscriminately writing out hundreds of configuration files, when nobody actually changed anything in them.

Where did the correctness go?

Posted Mar 14, 2009 1:34 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

It's tricky/impractical to fix the fsync() behaviour in data=ordered ext3.

My understanding of the problem is that it looks something like:

You, the application programmer, delete file A and create files B, C and D. You are writing to file B and call fsync(). I, the filesystem driver, must now ensure that the data & metadata for B are on disk (survive a reboot etc.) before returning from fsync(), or you will be very unhappy.

So I must flush B's data blocks. To do that I must allocate blocks for them, and it so happens I use some blocks freed by deleting A. So now I need to commit A's metadata update (deleting it) or else things may go tits up. Unfortunately that involves touching a directory, which mentions C and D. So now I need to commit C and D's metadata. And if I do that without pushing C and D's data to disk, I am disobeying the data=ordered setting. So I will push C and D to disk.

Figuring out that I need to do all this is expensive, whereas just shrugging and pushing everything to disk is cheap, yet even if I do all the hard figuring I may still have to push everything to disk and now I've also wasted a lot of CPU figuring it out. So you will understand that nobody is anxious to take on the thankless (see the feedback to Tytso from Ubuntu users) yet very difficult task of trying to do better.

Feel free to volunteer though :)

Where did the correctness go?

Posted Mar 14, 2009 2:10 UTC (Sat) by bojan (subscriber, #14302) [Link]

> Feel free to volunteer though :)

I just switch to ext4 or some other FS that does this properly instead ;-)

Where did the correctness go?

Posted Mar 14, 2009 1:38 UTC (Sat) by foom (subscriber, #14868) [Link]

> It is sad to see people attack Ted here for simply pointing out bugs in applications.

I don't see anyone attacking Ted. I see people arguing against the idea that zeroing out files is a
good quality for a filesystem to have.

> But, when it comes to the behaviour of one file system in one of its modes which was
> masking incorrect usage of the API, we quickly revert to screaming bloody murder and
> asking for more hand holding.

So perhaps ext5 should erase the entire directory if it's had any entries added or removed since
the last time someone called fsync on the directory fd? Or how about *never* writing any data to
disk unless you've called fsync on the file, its parent directory, and all parent directories up to
the root?

> That seems like the real problem to me. If I ask for fsync on _my_ file, why on earth
> does the file system flush the lot out? Shouldn't _that_ be fixed instead?

Yes, it would be nice, but it's a performance issue, not a data-loss issue.

Presumably at some point someone will figure out how to make a filesystem such that it can
avoid writing out metadata updates related to data which is not yet on disk, without actually
forcing out unrelated data just because you need to write out a metadata update in a different
part of the filesystem.

Where did the correctness go?

Posted Mar 14, 2009 2:15 UTC (Sat) by bojan (subscriber, #14302) [Link]

> I don't see anyone attacking Ted.

So, calling Ted's reasonable analysis of the situation a rant is not attacking?

> So perhaps ext5 should erase the entire directory if it's had any entries added or removed since the last time someone called fsync on the directory fd?

And the directory has been truncated by someone? People truncate their files in this scenario _explicitly_, don't commit the data and then expect it to be there. Well, if you don't make it durable, it isn't going to be durable, just atomic.

> Yes, it would be nice, but it's a performance issue, not a data-loss issue.

It's the issue that is causing proper programming practice of fsyncing to be abandoned, because it makes the machine unusable. See the FF issue.

Where did the correctness go?

Posted Mar 14, 2009 2:22 UTC (Sat) by bojan (subscriber, #14302) [Link]

> I see people arguing against the idea that zeroing out files is a good quality for a filesystem to have.

Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and _not_ committing subsequent changes. That's what's zeroing out the files, not the file system.

Where did the correctness go?

Posted Mar 14, 2009 4:14 UTC (Sat) by foom (subscriber, #14868) [Link]

> Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and
> _not_ committing subsequent changes. That's what's zeroing out the files, not the file system.

That's not the only scenario. There's the one involving rename... You open a *new* file, write to
it, close it, and rename it over an existing file. Then the filesystem commits the metadata change
(that is: the unlinking of the file with data in it, and the creation of the new empty file with the
same name), but *not* the data written to the new file.

No explicit truncation.

Now, there is also the scenario involving truncation. I expect everybody agrees that if you
truncate a file and then later overwrite it, there's a chance that the empty version of the file
might end up on disk. The thing that's undesirable about what ext4 was doing in *that* scenario
is that it essentially eagerly committed to disk the truncation, but lazily committed the new data.
Thus making it *much* more likely that you end up with the truncated file than you'd expect
given that the application called write() with the new data a few milliseconds after truncating the
file.

Where did the correctness go?

Posted Mar 14, 2009 6:23 UTC (Sat) by bojan (subscriber, #14302) [Link]

Where did the correctness go?

Posted Mar 14, 2009 8:22 UTC (Sat) by alexl (subscriber, #19068) [Link]

That comment is totally misguided. We do not want our data guaranteed to be on disk. Nobody said they wanted that. What we want is for the traditional unix way to save a file (write to tempfile, rename over target) to *either* result in the old file or the new file. Not a zero-byte file, losing both the old and the new data.

fsync is just the only way to work around this in the POSIX API, but it is much heavier and gives much stronger guarantees than we want.

Where did the correctness go?

Posted Mar 14, 2009 8:30 UTC (Sat) by bojan (subscriber, #14302) [Link]

Please. When you open a new file (which is empty) and then write to it, but do not commit (by calling fsync), that file may not contain anything on disk for a while. Even after close, this can still be the case.

If you rename such a file over an existing file that contained data, you may legitimately end up with an empty file on crash.

If you want the data to be in the new file that will be renamed over the old file, you have to call fsync on the new file. Then you atomically rename.

This is what emacs and other safe programs already do. From https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug...

What emacs (and very sophisticated, careful application writers) will do is this:

3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

Where did the correctness go?

Posted Mar 14, 2009 12:03 UTC (Sat) by alexl (subscriber, #19068) [Link]

The only way with the current POSIX APIs to get this guarantee is to fsync() the fd before renaming. But this imposes an unnecessary overhead on both the app (generally) and the whole system (with ext3 data=ordered).

Now, what ext4 does is clearly correct according to what is "allowed" by POSIX (actually, this is kinda vague as POSIX allows fsync() to be empty, and doesn't actually specify anything about system crashes.)

However, even if it's "posixly correct", it is imho broken. In the sense that I wouldn't let any of my data near such a filesystem, and I would recommend to everyone who asks me not to use it.

Take for example this command:
sed -i s/user1/user2/ foo.conf

This does an in-place update using write-to-temp and rename-over, without fsync. The result of running this command is that if your machine locks up within the next minute or so, you lose both versions of foo.conf.

Now, is foo.conf important? How the heck is sed to know? Is sed broken? Should it fsync? That's more or less arguing that every app should fsync on close, which on ext4 is the same as the filesystem doing it, but on ext3 is unnecessary and a massive system slowdown.

Or should we try to avoid the performance implications of fsync (due to its guarantees being far more than what we need to solve our requirements)? We could do this by punting this to the users of sed, by having a -important-data argument, and then pushing this further out to any script that uses sed, etc, etc.

Or we could just rely on filesystems to guarantee that this common behaviour works, even if it's not specified by POSIX (and choose not to use filesystems that don't give us that guarantee, just as so many people switched away from XFS after data losses).

Ideally of course there would be another syscall, flag or whatever that says "don't write metadata before data is written". That way we could get both efficient and correct apps, but that doesn't exist today.

Where did the correctness go?

Posted Mar 14, 2009 21:20 UTC (Sat) by bojan (subscriber, #14302) [Link]

> However, even if it's "posixly correct", it is imho broken.

Look, this may well be true, but the fact is that all of us who are creating applications have one thing to rely on - documentation. And the documentation says what it says.

Where did the correctness go?

Posted Mar 16, 2009 12:00 UTC (Mon) by nye (guest, #51576) [Link]

POSIX also allows a system crash to cause your computer to explode and hurl shrapnel into your face, because crash-behaviour is *undefined*. Are you seriously arguing that *any* POSIX-compliant behaviour is automatically the right thing? Clearly not, because you are arguing against one POSIX-compliant method in favour of another. There are an infinite number of ways to be POSIX-compliant, some of which are more useful than others.

Where did the correctness go?

Posted Mar 14, 2009 13:32 UTC (Sat) by nix (subscriber, #2304) [Link]

There's a window there where it can leave you with baz~ and baz.new, but
no baz, on crash.

Hardly ideal, but probably unavoidable.

Why doesn't someone add real DB-style transactions to at least one
filesystem, again? They'd be really useful...

Where did the correctness go?

Posted Mar 14, 2009 21:25 UTC (Sat) by bojan (subscriber, #14302) [Link]

> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.

Yep, very true.

But, no zero length file, which was the original problem. Essentially, you will get at least _something_.

> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...

Who knows, maybe we'll get proper API for that behaviour out of this discussion.

Where did the correctness go?

Posted Mar 14, 2009 21:39 UTC (Sat) by foom (subscriber, #14868) [Link]

> Essentially, you will get at least _something_.

There's no guarantee of that. A filesystem could simply erase itself upon unexpected
poweroff/crash. *Anything* better than that is an implementation detail.

Where did the correctness go?

Posted Mar 15, 2009 1:58 UTC (Sun) by bojan (subscriber, #14302) [Link]

I knew someone was going to have a silly comment here. I was expecting, however, that it would be more technical, along the lines of "see, you cannot rely on it after all". Just for the record, the first rename emacs does is optional (in order to get the backup file) and would not be done for configuration files, hence full atomicity and durability.

Where did the correctness go?

Posted Mar 15, 2009 2:42 UTC (Sun) by njs (guest, #40338) [Link]

> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.

Yeah, 3.f is supposed to say "link", not "rename". (Programming against POSIX correctly makes those Raymond Smullyan books seem like light reading. If only everything else weren't worse...)

> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...

The problem is that a filesystem has a bazillion mostly independent "transactions" going on all the time, and no way to tell which ones are actually dependent on each other. (Besides, app developers would just screw up their rollback-and-retry logic anyway...)

Completely off the wall solution: move to a plan9/capability-ish system where apps all live in little micro-worlds and can only see the stuff that is important to them (this is a good idea anyway), and then use these to infer safe but efficient transaction domain boundaries. (Anyone looking for a PhD project...?)

Transactions, ordering, rollback...

Posted Mar 15, 2009 8:09 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]

The entire point of transactions is to say "these operations are related to one another" by opening a transaction and performing multiple read/write actions which populate a data dependency map. Then the commit says that either the dependency map is satisfied and all writes are made, or no writes are made. Thus it would not be difficult to obtain the map from the application, but it is a huge expansion of scope for the filesystem abstraction.

However, as we were discussing further up the page, a write-barrier is really all that is needed for the intuitive crash-proof behavior desired by everything doing the "create a temp file; relink to real name". An awful lot of the discussion seems to conflate request ordering with synchronous disk operations, when all we really desire is ordering constraints to flow through the entire filesystem and block layer to the physical medium.

All people want is for the POSIX ordering semantics of "file content writes" preceding "file name linkage" to be preserved across crashes. It is OK if the crash drops cached data and forgets the link, or the data and link, but not the data while preserving the link.

Where did the correctness go?

Posted Mar 15, 2009 12:26 UTC (Sun) by nix (subscriber, #2304) [Link]

Even the off-the-wall solution won't work, because the reason for
transactions getting entangled with each other is dependencies *within the
fs metadata*. i.e. what you'd actually need to do is put off *all*
operations on fs metadata areas that may be shared with other transactions
until such time as the entire transaction is ready to commit. And that's a
huge change.

Where did the correctness go?

Posted Mar 16, 2009 3:13 UTC (Mon) by k8to (subscriber, #15413) [Link]

There's an easy way to avoid that problem.

link the name to name~
rename the name.new to name
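
i.e., in call terms (a sketch):

link("name", "name~");        /* keep the old version reachable (unlink a stale "name~" first if needed) */
rename("name.new", "name");   /* atomically put the new version in place */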

Yes, explicit transaction support in the filesystem would be great, though hammering out the api will probably be hairy.

Where did the correctness go?

Posted Mar 14, 2009 9:19 UTC (Sat) by bojan (subscriber, #14302) [Link]

> What we want is for the traditional unix way to save a file to (write to tempfile, rename over target) to *either* result in the old file or the new file.

These semantics (where data in the new file is magically committed) may or may not be a result of a particular file system implementation. From the rename() man page:

> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.

Nowhere does it specify what _data_ will be in either file, just that the file will be there. ext4 dutifully obeys that.

In short, what you are referring to as the "traditional unix way" doesn't really exist. Proof: emacs code.

PS. Sure, it would be nice to have such "one shot" API. But, the current API isn't it.

Where did the correctness go?

Posted Mar 14, 2009 14:30 UTC (Sat) by endecotp (guest, #36428) [Link]

> Nowhere does it specify what _data_ will be in either file,
> just that the file will be there.

No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.

Of course this is not true of crashes where POSIX doesn't say anything at all about what should happen. Behaviour after a crash is purely a QoI issue.

Where did the correctness go?

Posted Mar 14, 2009 21:18 UTC (Sat) by bojan (subscriber, #14302) [Link]

> No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.

We are talking about data _on_ _disk_ here, not what the process may see (which may be buffers just written, as presented by the kernel). What is on disk is _durable_, which is what we are discussing here. For durable, you need fsync.

So, rename does not specify which data will be on disk, or when.

Where did the correctness go?

Posted Mar 15, 2009 12:44 UTC (Sun) by endecotp (guest, #36428) [Link]

But __NOTHING__ specifies what data you'll find left on the disk after a crash (and after a crash is the only time when the difference between "on disk" and "in memory buffers" makes any difference). fsync() does NOT guarantee durability - it can be a no-op.

So what this all boils down to is how close each filesystem implementation comes to "non-crash" behaviour after a crash, which is a quality-of-implementation choice for the filesystems.

As far as I can see, for portable code the best bet is to stick with the write-close-rename pattern. This is sufficient for atomic changes in the non-crash case. Adding fsync in there makes it safe in the crash case for some filesystems, but not all, and there are others where it was safe without it, and others where it has a performance penalty: it's far from a clear winner at the moment.

Where did the correctness go?

Posted Mar 15, 2009 21:24 UTC (Sun) by bojan (subscriber, #14302) [Link]

> fsync() does NOT guarantee durability - it can be a no-op.

Hence, you need to have various #ifs and ifs() to figure out what works on your platform. See Mac OS X. fsync is just an example here. The point is that you must use _something_ to commit. Without that, POSIX does not guarantee anything beyond currently running processes seeing the same picture.
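To make the Mac OS X point concrete, here is a rough sketch of the kind of #ifdef dance being referred to: on OS X, fsync() does not ask the drive to flush its write cache, so durability-minded code falls back to fcntl(F_FULLFSYNC). Treat this as an illustration under that assumption, not a complete portability layer:

    #include <fcntl.h>
    #include <unistd.h>

    /* Best-effort "get my data onto stable storage" for this discussion. */
    static int commit_to_stable_storage(int fd)
    {
    #ifdef F_FULLFSYNC
        /* Mac OS X: request a flush all the way through the drive cache. */
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0;
        /* F_FULLFSYNC can fail on some volumes; fall back to plain fsync(). */
    #endif
        return fsync(fd);
    }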

Where did the correctness go?

Posted Mar 16, 2009 4:49 UTC (Mon) by dlang (guest, #313) [Link]

even doing an fsync doesn't mean that you won't have this corruption. the two writes could go to the disk drive's buffer and it could write the metadata out before it writes the data blocks. if it loses power in between these two steps you have the same problem

Where did the correctness go?

Posted Mar 16, 2009 13:28 UTC (Mon) by jamesh (guest, #1159) [Link]

Of course, if the drive supports barriers in its command queueing implementation it should be possible to prevent it reordering those writes.

That is likely to also restrict reorderings that wouldn't break any correctness guarantees, though.

Where did the correctness go?

Posted Mar 16, 2009 3:19 UTC (Mon) by k8to (subscriber, #15413) [Link]

A no-op fsync is not compliant. You've taken it quite a bit too far.

fsync explicitly says that when it returns success, the data has been handed to the storage system successfully.

It doesn't guarantee that that storage system has committed it in a durable way for all scenarios. That's another issue.

fsync does guarantee that the data has been handed to the storage medium, but makes no guarantees about the implementation of that storage medium.

Where did the correctness go?

Posted Mar 16, 2009 1:07 UTC (Mon) by vonbrand (guest, #4458) [Link]

Sorry, "opening a file for writing it from scratch" is truncating, quite explicitly.

Where did the correctness go?

Posted Mar 14, 2009 4:24 UTC (Sat) by flewellyn (subscriber, #5047) [Link]

> I see people arguing against the idea that zeroing out files is a good quality for a filesystem to have.

Which has what to do with the filesystem itself? I mean, if you use O_TRUNC when you call open(), zeroing out the file is exactly what you are asking to happen. Doing this and then writing new data to the file without calling fsync() before closing it is where the problem comes from.

People should not blame the filesystem for doing what they ask it to do.
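For illustration, this is roughly the overwrite-in-place pattern being criticised (a sketch; the file name and data are placeholders): O_TRUNC throws the old contents away immediately, and if the machine goes down before the delayed write() data reaches the disk, the zero-length file is exactly what was asked for.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* The risky save: truncate in place, write, close, hope for the best. */
    void risky_save(const char *data)
    {
        int fd = open("settings.conf", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;
        write(fd, data, strlen(data));  /* may sit in the page cache for a while */
        close(fd);                      /* close() does not imply fsync()        */
    }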

Where did the correctness go?

Posted Mar 14, 2009 5:16 UTC (Sat) by foom (subscriber, #14868) [Link]

Open a brand new file, write, close, "atomic" rename on top of an existing file. No O_TRUNC. This
sequence causes the problem.

Where did the correctness go?

Posted Mar 14, 2009 5:33 UTC (Sat) by flewellyn (subscriber, #5047) [Link]

Hmm...in that case, definitely fsync after rename.

Where did the correctness go?

Posted Mar 15, 2009 1:43 UTC (Sun) by droundy (subscriber, #4559) [Link]

Sorry, that's wrong. fsync before rename.
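For the record, a minimal sketch of the order being argued for here - write the new contents to a temporary file, fsync it, close it, and only then rename it over the target (file names are placeholders and error paths are abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int save_atomically(const char *target, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;

        /* Write everything, then force it to stable storage... */
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
            fsync(fd) == -1) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        /* ...and only then publish the new name atomically. */
        return rename(tmp, target);
    }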

Where did the correctness go?

Posted Mar 14, 2009 6:21 UTC (Sat) by bojan (subscriber, #14302) [Link]

Please read the manual page for close(). Just because you closed the file doesn't mean your data is on disk.

As for rename, your file gets renamed _atomically_ just fine. However, if you don't _commit_ your writes (make the changes durable, call fsync), the renamed file will have zero size.

This is not a file system problem, but an application problem.

Where did the correctness go?

Posted Mar 14, 2009 13:06 UTC (Sat) by nix (subscriber, #2304) [Link]

> Presumably at some point someone will figure out how to make a filesystem such that it can avoid writing out metadata updates related to data which is not yet on disk, without actually forcing out unrelated data just because you need to write out a metadata update in a different part of the filesystem.

The BSD people did. It's so difficult that nobody else has done it since.

Where did the correctness go?

Posted Mar 15, 2009 8:14 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]

Why are write-barriers so difficult here? Is there something special about the filesystem domain, or is it a lack of cross-pollination of ideas between different systems/CS communities?

I know that contemporary disk controller protocols support write-barriers in their command streams. They were intended to make this sort of thing easy. You don't have to micro-manage the requests all the way to the platter, but just decorate them with the correct ordering relations when you issue the commands.

Write barriers

Posted Mar 17, 2009 1:43 UTC (Tue) by xoddam (subscriber, #2322) [Link]

In the context of a journalling filesystem the application-level guarantee doesn't really need to be implemented with an explicit write barrier at the disk level. Write barriers may or may not be used to maintain the journal; journals can work (perhaps somewhat less effectively) without them.

Because the journal is already able to provide the guarantee of filesystem metadata consistency, it can be used in the same way to ensure an effective ordering between write() and rename().

Where did the correctness go?

Posted Mar 17, 2009 6:42 UTC (Tue) by butlerm (subscriber, #13312) [Link]

On the contrary, every decent database in the world does this, and will run
circles around contemporary filesystems for comparable synchronous and
asynchronous operations. Check out Gray and Reuter's Transaction Processing
book for details. The edition I have was published in 1993.

There are two basic problems here:

The first is that fsync is a ridiculously *expensive* way to get the needed
functionality. The second is that most filesystems cannot implement atomic
operations any other way (i.e. without forcing both the metadata and the
data and any other pending metadata changes to disk).

fsync is orders of magnitude more expensive than necessary for the case
under consideration. A properly designed filesystem (i.e. one with
metadata undo) can issue an atomic rename in microseconds. The only option
that POSIX provides can take hundreds if not thousands of milliseconds on a
busy filesystem.

Databases do *synchronous*, durable commits on busy systems in ten
milliseconds or less. Ten to twenty times faster than it takes
contemporary filesystems to do an fsync under comparable conditions.

Even that is still a hundred times more expensive than necessary, because
synchronous durability is not required here. Just atomicity. Nothing has
to hit the disk. No synchronous I/O overhead. Just metadata undo
capability.

Where did the correctness go?

Posted Mar 17, 2009 7:18 UTC (Tue) by dlang (guest, #313) [Link]

how do you think the databases make sure their data is on disk?

they use f(data)sync calls to the filesystem.

so your assertion that databases can make atomic changes to their data faster than the filesystem can do an fsync means that either you don't know what you are saying, or you don't really have the data safety that you think you have.

Where did the correctness go?

Posted Mar 17, 2009 8:31 UTC (Tue) by butlerm (subscriber, #13312) [Link]

ACID has four letters for a reason. Atomicity is logically independent of
durability. A decent database will let you turn (synchronous) durability
off while fully guaranteeing atomicity and consistency.

The reason is that with a typical rotating disk, any durable commit is
going to take at least one disk revolution time, i.e. about 10 ms. Single
threaded atomic (but not necessarily durable) commits can be issued a
hundred times faster than that, because no synchronous disk I/O is required
at all.

Where did the correctness go?

Posted Mar 17, 2009 9:48 UTC (Tue) by dlang (guest, #313) [Link]

and all the filesystems (including ext4 prior to the patches) provide the atomicity you are looking for.

it's just the durability in the face of a crash that isn't there. but it wasn't there on ext3 either (there was just a smaller window of vulnerability), and even if you mount your filesystem with the sync option, many commodity hard drives won't let you disable their internal disk caches, and so you would still have the vulnerability (with an even smaller window)

Where did the correctness go?

Posted Mar 17, 2009 17:30 UTC (Tue) by butlerm (subscriber, #13312) [Link]

"and all the filesystems (including ext4 prior to the patches) provide the
atomicity you are looking for."

I am afraid not. Atomic means that the pertinent operation always appears
either to have completed OR to have never started in the first place. If
the system recovers in a state where some of the effects of the operation
have been preserved and other parts have disappeared, that is not atomic.

The operation here is replacing a file with a new version. Atomic
operation means when the system recovers there is either the old version or
the new version, not any other possibility. You can do this now of course,
you simply have to pay the price for durability in addition to
atomicity.

Per accident of design, filesystems require a much higher price (in terms
of latency) to be paid for durability than databases do. That
factor is multiplied by a hundred or more if atomicity is required, but
durability is not.

Where did the correctness go?

Posted Mar 17, 2009 17:38 UTC (Tue) by butlerm (subscriber, #13312) [Link]

I refer to filesystem *meta-data* operations of course.

Where did the correctness go?

Posted Mar 17, 2009 20:42 UTC (Tue) by nix (subscriber, #2304) [Link]

Sure. I meant nobody else had done it *in a filesystem*.

This is a regression

Posted Mar 13, 2009 23:59 UTC (Fri) by ikm (subscriber, #493) [Link]

I'd say that as long as this kind of thing wasn't happening with ext3, it shouldn't be happening with ext4 as well, or else it's a regression. So the "nodelalloc" should be the default, really.

This thing alone has actually made me decide I wouldn't be moving to ext4 any time soon.

This is a regression

Posted Mar 14, 2009 1:53 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

Zero length files were a possibility in ext3 for the truncate & overwrite scenario already. They were probably rarer, but certainly possible. The patch should make them equally rare in ext4 (unless you disable it). In any case anyone writing for Unix/Linux should know about and use at least the rename trick for replacing small files. Not doing so causes much worse problems than this one.

Zero length files were probably not possible (or at least so rare that you'd never see it) in ext3 for the rename case if you have data=ordered. The patch makes them similarly rare in ext4.

Neither happens if you run normally, or even if you soft-hang, losing interactivity but allowing the kernel to flush to disk. Neither happens if your laptop doesn't wake up from sleep, so long as the sleep code properly calls sync(). Neither happens if your changes were at least 5 seconds old (ext3 data=ordered) or 60 seconds old (other cases). The people getting bitten either lost power suddenly while working, or hit the reset button.

I agree that zero length files are undesirable, and shouldn't be common even if you pull the plug. Evidently Ted does too, since the patches are enabled by default. Still, it remains the case that applications which must have data integrity need to be more careful than this, because otherwise things can (even in ext3 with data=ordered) go badly wrong for you.

I believe that nodelalloc is just as much overkill as fully preserving atime is. Sure, in theory it might be slightly safer to disable the delayed allocator, but in practice it doesn't make enough difference to worry about, and the performance gain is very attractive. Sooner or later if you use computers you will lose some data, that's why we have backups.

This is a regression

Posted Mar 14, 2009 2:26 UTC (Sat) by bojan (subscriber, #14302) [Link]

> Zero length files were a possibility in ext3 for the truncate & overwrite scenario already. They were probably rarer, but certainly possible.

Thanks for pointing this out. Essentially, relying on this behaviour was an accident, waiting to bite. Unfortunately, due to broken semantics of fsync on ext3, having a correct application would break the performance of the system. Looks to me that ext3 is far more broken than ext4 (which doesn't seem broken at all to me).

This is a regression

Posted Mar 14, 2009 12:30 UTC (Sat) by ikm (subscriber, #493) [Link]

The user doesn't care whether this or that is wrong, or broken, or whatever. If he wasn't losing data on ext3 but started to lose it on ext4, technical points don't matter to him.

This is a regression

Posted Mar 14, 2009 21:37 UTC (Sat) by bojan (subscriber, #14302) [Link]

Users cared plenty when Firefox locked up their machines for a few seconds every few minutes on ext3. What was blamed? Firefox, not ext3, where the problem actually comes from. This time applications lose data, and ext4 gets blamed.

Which just proves that most users are irrational, because they don't know any better. So, should the people that _know_ what the problem really is listen to the people that don't in order to fix it?

This is a regression

Posted Mar 14, 2009 22:50 UTC (Sat) by ikm (subscriber, #493) [Link]

No one says that the users aren't listening. However, the point was and still is about this problem being a regression from ext3. And it's just unwise to say to users that they should actually "fix all their programs that work with files". That's the tail wagging the dog. It's totally unrealistic and not doable in any short- or even mid-term. Why suggest this then? And who is irrational after all?

Users don't care which solution is the right one as long as it *works*. And the solution went into 2.6.30 indeed. Distributors will hopefully backport it. Problem solved. Hooray. But all the blabbering about how POSIX allows this and such is unhelpful to the end user, even if it is surely interesting and inspiring to developers.

This is a regression

Posted Mar 15, 2009 2:02 UTC (Sun) by bojan (subscriber, #14302) [Link]

> However, the point was and still is about this problem being a regression from ext3.

And that is exactly why Ted, being a practical person, reverted to the old behaviour in some situations. Doesn't mean application writers should continue using incorrect idioms.

> It's totally unrealistic and not doable in any short- or even mid-term. Why suggest this then? And who is irrational after all?

Sorry, fixing bugs is irrational?

> But all the blabbering about how POSIX allows this and stuff is unhelpful to end-user, if surely interesting and inspiring to developers.

POSIX isn't blabbering (see http://www.kernel.org/):

> Linux is a clone of the operating system Unix, written from scratch by Linus Torvalds with assistance from a loosely-knit team of hackers across the Net. It aims towards POSIX and Single UNIX Specification compliance.

This is a regression

Posted Mar 15, 2009 3:12 UTC (Sun) by bojan (subscriber, #14302) [Link]

Speaking of regressions, the only _real_ regression here is the fact that fsync on ext3 in ordered mode may lock up your system for a few seconds. I think the real fix sequence for this should be:

1. By default make ext3 ordered mode have fsync as a no-op. People that want current broken behaviour could specify a mount option to get it.

2. Tell folks that they _must_ use fsync in order to commit their data.

3. Once a critical mass of applications has achieved the above, remove all the hacks from ext4, XFS etc.

4. Retire ext3.

This is a regression

Posted Mar 15, 2009 4:53 UTC (Sun) by foom (subscriber, #14868) [Link]

I'm waiting for evilfs: the filesystem that *always* writes the deletion of data to disk synchronously,
but *never* writes any new data (file data, directory data, anything) to a permanent location until
you've called fsync on the file's fd, the containing directory's fd, and the fd of every directory up the
tree to the root (or call sync, of course).

Hopefully it can be the default fs for Ubuntu Jaded Jackal. If anyone complains, I'm sure "But POSIX
says it's okay to do that, the apps are broken for not obsessively calling sync after every write!" will
satisfy everyone. :)

This is a regression

Posted Mar 15, 2009 5:26 UTC (Sun) by bojan (subscriber, #14302) [Link]

Whatever.

All the people here suggesting that well-established standards Linux _aims_ to implement should be ignored should remember the screaming Microsoft had to face from the FOSS community when they started twisting various standards to their own ends.

http://en.wikipedia.org/wiki/Hypocrisy

This is a regression

Posted Mar 15, 2009 12:34 UTC (Sun) by nix (subscriber, #2304) [Link]

Going beyond a standard is not ignoring that standard.

This is a regression

Posted Mar 15, 2009 21:09 UTC (Sun) by bojan (subscriber, #14302) [Link]

So, when the spec says that applications need to call fsync to get data down on disk and they don't, that's going _beyond_ the standard? Sorry, that's falling short of it, ignoring it.

Of course, Ted put hacks into ext4 because application writers missed the above and it will take time to fix it. That's called a workaround.

This is a regression

Posted Mar 15, 2009 23:50 UTC (Sun) by nix (subscriber, #2304) [Link]

Er, no, 'going beyond the standard' is what *ext4* should do; i.e. it
should arrange that even if an app does something under which POSIX
*permits* data loss, that data loss is still considered bad and should be
avoided.

Agreed the apps are buggy, but I think this is a deficiency in POSIX,
rather than anything else.

This is a regression

Posted Mar 16, 2009 0:17 UTC (Mon) by bojan (subscriber, #14302) [Link]

> Er, no, 'going beyond the standard' is what *ext4* should do

And that's going to help the broken application running on another filesystem exactly how? The problem with hypocrisy here is not related to ext4 - it is related to application code.

BTW, it is obvious that Ted already decided to make sure ext4 does that. The man is not stupid - he doesn't want the file system rejected over this - no matter how wrong the people blaming ext4 for this are.

> Agreed the apps are buggy, but I think this is a deficiency in POSIX, rather than anything else.

Well, yeah - the spec is, shall we say - demanding. But, it is what it is. We tell Microsoft not to ignore the specs. What makes us so special that we can? I would suggest nothing. If we take the right to demand that from Microsoft, we should make sure we do it ourselves.

This is a regression

Posted Mar 16, 2009 1:07 UTC (Mon) by nix (subscriber, #2304) [Link]

OK, so you consider 'do not want to lose data' to be 'hypocrisy'.

There's no point talking to you at all, IMNSHO.

This is a regression

Posted Mar 16, 2009 2:19 UTC (Mon) by bojan (subscriber, #14302) [Link]

No, I consider telling others to do something and then not doing it ourselves hypocrisy. I think that would be the definition of it, no?

If you don't want to talk to me, then don't. That's OK.

This is a regression

Posted Mar 16, 2009 13:45 UTC (Mon) by ikm (subscriber, #493) [Link]

>> Er, no, 'going beyond the standard' is what *ext4* should do

> And that's going to help the broken application running on another filesystem exactly how?

It's not. We are talking about fixing problems users start to experience when they switch from ext3 to ext4. None of the other goals, such as fixing all the apps, making all filesystems happy, feeding the hungry and making the world a better place, are being pursued here. The 2.6.30 fixes do what they are supposed to do, without breaking anything else. So it is a good thing, and I don't understand why you seem to be against it.

Sure, there's lots of stuff which ain't working right, but it's not a subject here. World's not perfect, and it's not going to be any time soon.

This is a regression

Posted Mar 15, 2009 12:57 UTC (Sun) by ikm (subscriber, #493) [Link]

> All the people here suggesting that well established standards Linux _aims_ to implement should be ignored

Gosh. What people suggest here is that standards should not be used as an excuse for unwanted filesystem behavior.

This is a regression

Posted Mar 16, 2009 0:21 UTC (Mon) by bojan (subscriber, #14302) [Link]

The problem with ignoring the standards is related to the applications, not the filesystem behaviour. ext4 implements POSIX specs just fine, both with and without the patches destined for 2.6.30. It is the applications that do not call fsync before rename that are ignoring the standard.

This is a regression

Posted Mar 15, 2009 12:34 UTC (Sun) by nix (subscriber, #2304) [Link]

Also known as DoSfs for the way it fills your memory up with unwritten
crap generated by apps that didn't call fsync().

(I wonder if we can allow it to write dirty data to disk when under memory
pressure, as well? ;) )

This is a regression

Posted Mar 15, 2009 15:18 UTC (Sun) by dcoutts (subscriber, #5387) [Link]

It doesn't have to keep everything in memory. It would be a simple variation on btrfs or any similar persistent filesystem. It could write all the new data to disk but not update the root node until you bludgeon it with fsync.

This is a regression

Posted Mar 15, 2009 12:20 UTC (Sun) by alexl (subscriber, #19068) [Link]

> 1. By default make ext3 ordered mode have fsync as a no-op. People that want current broken behaviour could specify a mount option to get it.

Are you crazy? That would break ACID guarantees for all databases, etc.
fsync() is about much more than data-before-metadata.

This is a regression

Posted Mar 15, 2009 21:28 UTC (Sun) by bojan (subscriber, #14302) [Link]

> Are you crazy?

Close to it ;-)

I admit, that was a bit tongue-in-cheek, to point out that current ext3 "lock up on fsync" behaviour is total nonsense.

This is a regression

Posted Mar 16, 2009 14:09 UTC (Mon) by ikm (subscriber, #493) [Link]

> That would break ACID guarantees for all databases, etc.

Once I had MySQL running on an XFS filesystem, and the system hung for some reason. The database got corrupted so horribly I had to restore it from backups. I wouldn't really count on any 'ACID guarantees' here :) A UPS and a ventilated, dust-free environment is our only ACID guarantee :)

This is a regression

Posted Mar 17, 2009 5:41 UTC (Tue) by efexis (guest, #26355) [Link]

Note that MySQL isn't always ACID compliant and clearly states that fact, e.g. when using MyISAM tables. Converting to InnoDB should fix that for you. If you were running InnoDB tables... then shut me up! Hehe, I've never done any testing of this myself. Which storage engine are you using?

This is a regression

Posted Mar 17, 2009 11:59 UTC (Tue) by ikm (subscriber, #493) [Link]

Yep, it was myisam.

This is a regression

Posted Mar 14, 2009 13:09 UTC (Sat) by nix (subscriber, #2304) [Link]

> In any case anyone writing for Unix/Linux should know about and use at least the rename trick for replacing small files. Not doing so causes much worse problems than this one.

I checked at work (a random Unix/Oracle financial shop) some time ago.

One person knew this trick (the only person there other than me who reads standards documents for fun), and not even he had spotted the old oops-better-retry-my-writes-on-EINTR trap. Most people assumed that 'oh, the O_TRUNC and the writes will all get put off until I close the file, won't they?' and hadn't even thought it through that much until I pressed them on it.

J. Random Programmer is much, much less competent than you seem to think.
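For readers who have not met that trap: write() can be interrupted by a signal (EINTR) or can write fewer bytes than asked, so robust code loops. A sketch of the usual retry loop, purely for illustration (not taken from any program mentioned above):

    #include <errno.h>
    #include <unistd.h>

    /* Keep calling write() until everything is out or a real error occurs. */
    ssize_t write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;
        size_t left = count;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted; nothing written, try again */
                return -1;      /* genuine error */
            }
            p += n;             /* short write: advance past what made it */
            left -= (size_t)n;
        }
        return (ssize_t)count;
    }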

fsync and spin-ups

Posted Mar 14, 2009 2:10 UTC (Sat) by njs (guest, #40338) [Link]

It really is a great thing that ext4 is relaxing the old ordering requirements of ext3('s default mode): this makes it possible for fsync to become much much cheaper. (fsync on ext3 is so expensive that *people avoid using it*, and that's no good for data reliability at all!)

I also take Ted's point that if apps want atomic-rename to be portably atomic, then they have to call fsync anyway and accept that they will pay an additional cost for getting durability in addition to atomicity.

BUT, in those cases where durability is not required, and I'm using a filesystem like ext3 or (soon) ext4 that supports atomic-but-not-durable-rename, I *really* want some way to access that.

My problem is that I spend all my time on my laptop, and most of the time my hard drive is spun down. I've had to disable emacs's fsync-on-save, because otherwise my editor freezes for a good 0.5-1 seconds every time I hit save, while it blocks waiting for the disk to spin up. Even if the filesystem side of fsync is made cheap like in ext4, fsync will never be cheap if it blocks waiting on hardware like this. And if everyone adds fsync calls everywhere (because they've read Ted's rant, or just because fsync became cheaper), then most apps won't have a handy knob like emacs does to disable it, and that will suck for laptop user experience and power saving.

So I think we need an interface that lets user-space be more expressive about what ordering requirements it actually has -- rename_after_write(2), or something.

What's the optimization?

Posted Mar 14, 2009 7:21 UTC (Sat) by dfsmith (guest, #20302) [Link]

Can someone explain to me (briefly) what optimization ext4 is exploiting by avoiding writing data "soon"?

I.e., what is the pathological situation where a background fsync() initiated on close() fails?

I fully appreciate delayed allocation on files that are still open though.

Temporary files.

Posted Mar 14, 2009 11:21 UTC (Sat) by khim (subscriber, #9252) [Link]

A lot of files are written, used and removed in short order (think of a "configure" script, for example). If allocation is delayed then the file never hits the disk and the savings can be huge.

Temporary files.

Posted Mar 14, 2009 12:47 UTC (Sat) by mgb (guest, #3226) [Link]

Sounds like ext4 might be a good choice for /tmp.

Temporary files.

Posted Mar 16, 2009 3:24 UTC (Mon) by k8to (subscriber, #15413) [Link]

All Linux filesystems are good choices for tmp.

This allows linux to avoid the idiocy of solaris's tmpfs where writing a large file to /tmp can crash the box.

Temporary files.

Posted Mar 16, 2009 15:22 UTC (Mon) by nix (subscriber, #2304) [Link]

Alternatively you could use... tmpfs! Writing a large file to /tmp isn't
going to crash the box unless you explicitly raised the size limits
on /tmp...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 9:15 UTC (Sat) by petegn (guest, #847) [Link]

It's all very well for people to rant on about how POSIX allows this, that and the other, etc., but the fact of the matter is that yet again the Linux community is having ALPHA quality pushed on it as distro-ready. The very best thing would be for people to STOP using EXT4 completely, and if their distro tries to force EXT4 then simply change distro; there are enough out there to choose from.

Me personally, I gave up on the entire EXT* filesystem family some years ago and will NEVER go back. You can slag Reiserfs off all you want, but it beats the living crap outta EXT*.

I use Reiserfs for the small boot partition, then XFS, and it just behaves correctly ALL the time.

This may be slightly off topic, but this is my experience with EXT systems: total unreliability.

YMMV; mine is VERY fixed.

Funny that - you must be pretty unique

Posted Mar 14, 2009 11:26 UTC (Sat) by khim (subscriber, #9252) [Link]

> I use Reiserfs for the small boot partition, then XFS, and it just behaves correctly ALL the time.

Actually XFS is much worse in case of a crash than any EXT* filesystem, and it is often used as the justification: "you can do this because XFS is doing this". Sorry, but I just don't buy your story. How often does your XFS-based system crash? How consistent is it afterwards? We need to know before your evidence can be used in this discussion...

My personal experience with XFS was horrible, with exactly the behavior described: save the file or configuration to disk, power off the system - then reboot and find out that the file has zero bytes now. A few such incidents were enough for me. If your answer is "don't do this", then how the hell is your experience relevant to the topic?

Funny that - you must be pretty unique

Posted Mar 14, 2009 22:56 UTC (Sat) by ikm (subscriber, #493) [Link]

Seconded -- over time, I've lost numerous hours of my work because of XFS (which I happened to have on some machines out of the popular belief that it's cool) due to poor crash recovery, and I've never lost *any* bit of information on ext2/3.

Funny that - you must be pretty unique

Posted Mar 15, 2009 3:32 UTC (Sun) by dgc (subscriber, #6611) [Link]

> If you answer is "don't do this" then how the hell your
> experience is relevant to topic?

XFSQA tests 136-140, 178 and a couple of others "do this"
explicitly and have done so for a couple of years now. This failure
scenario is tested every single time XFSQA is run by a developer.
Run those tests on 2.6.21 and they will fail. Run them on 2.6.22
and they will pass...

FWIW, XFS is alone in the fact that it has:

a) a publicly available automated regression test suite;

(http://git.kernel.org/?p=fs/xfs/xfstests-dev.git;a=summary)

b) a test suite that is run all the time by developers;

c) ioctls and built-in framework to simulate the effect of
power-off crashes that the QA tests use.

This doesn't mean XFS is perfect, but it does mean that it is known
immediately when a regression appears and where we need improvements
to be made.

IOWs, we can *prove* that the (once) commonly seen problems in XFS
have been fixed and will remain fixed. It would be nice if people
recognised this rather relevant fact instead of flaming
indiscriminately about stuff that has been fixed....

Funny that - you must be pretty unique

Posted Mar 15, 2009 13:15 UTC (Sun) by hmh (subscriber, #3838) [Link]

Some of us do know the XFS guys are doing an excellent job tracking down regressions and bugs :-)

The only thing I never use XFS for is the root filesystem, and that's because nobody has seen fit to fix the XFS fsck to detect it is being run on a read-only filesystem, and to switch to xfs_repair on-the-fly.

It is no fun to need boot CDs to make sure everything is kosher in the root filesystem (or worse, to repair it)...

But it really works well for the MTA queues and squid spool, for example.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 4:02 UTC (Sun) by bojan (subscriber, #14302) [Link]

> It's all very well people ranting on about Posix allows this that and other ect but the fact of the matter is yet again the Linux community is having ALPHA quality pushed on it as distro ready.

POSIX is a standard that Linux kernel is attempting to implement. Explaining to people what it says is not ranting.

Just remember, when Firefox developers got blasted for the performance hit in FF 3.0, it wasn't their fault. It was the fsync problem in ext3, which is still there.

Similarly this time, Ted is getting blasted for other people's bugs, partially caused by completely unusable fsync on ext3.

Finally, if you don't like using new software, then don't use it. Nobody is twisting your arm.

Credit where credit is due

Posted Mar 14, 2009 18:29 UTC (Sat) by sbergman27 (guest, #10767) [Link]

It's probably worth noting that despite all of Ted's trash talk regarding Ubuntu and its users, it was Ubuntu testers who first uncovered and reported this flaw in ext4. And that distro was also the first to include a safe ext4 implementation in its development branch, having applied the patches which work around this data-munching problem some 5 weeks ago.

Credit where credit is due

Posted Mar 14, 2009 19:45 UTC (Sat) by jspaleta (subscriber, #50639) [Link]

That's not entirely true.... they attempted to fix this problem several weeks ago by cherry-picking proposed patches from Ted before the merge into mainline. But the cherry-pick doesn't appear to have worked.

As of 03-06 people were still reporting problems in that ticket.

Even Kirkland reports the problem persists in some form as of 03-07, affecting encrypted home directories. Let's just hope this wasn't the issue which forced Ubuntu to pull encrypted home directory support from the installer in Alpha6 and Jaunty final. That was a pretty cool feature to make available at install time for laptop users; it's a shame they had to pull the plug on it.

It's unfortunate that an upstream ticket was never created as part of that launchpad process to keep Ted in the loop. It seems that in the past, real kernel bugs discovered by Ubuntu users have gotten a faster fix when the bugs find their way to the upstream kernel tracker for other kernel developers to be made aware of. For example:

http://bugzilla.kernel.org/show_bug.cgi?id=12821

But the external Ubuntu community certainly gets the credit for finding the bug...that's for sure. It seems Canonical has decided to make it easy for Ubuntu users to crash a system at unexpected times quite frequently. We should definitely thank Canonical for that. That is an important part of fuzz testing that I think the upstream kernel developers may overlook, given their rabid, passionate focus on keeping crashes from happening at all. How many of us do things like pull the power at random times on our systems to recreate a situation that looks like an unexpected crash scenario? I rarely think to do it. Physically pulling out the battery on my laptop to fuzz-test a crash scenario isn't something I've even thought about doing.

It's good to see Canonical strongly committing to the idea of comprehensive testing to the extent that they are willing to give their users highly unstable combinations of kernel+module environments to work with to find these sorts of bugs, which only crop up in unexpected and unlooked-for kernel crashes. Hats off to them. By sacrificing overall system reliability for millions of users on a daily basis, they are able to flush out bugs which can only be seen when other problems (which can't be fixed by the kernel developers themselves, such as out-of-tree drivers) cause the kernel to crash in an unexpected way. I'm not sure this is the sort of commitment to directly support upstream development that some of us were hoping to see from Canonical...but it's something.

-jef

Credit where credit is due

Posted Mar 14, 2009 21:30 UTC (Sat) by drag (guest, #31333) [Link]

Well, it's not really due to Canonical or anybody like that.

What it seems to me is that they have been able to garner a following among neophyte Linux users who not only want to test and run bleeding-edge software, they want to get all their graphics goodness to go with it.

Since graphics in Linux sucked for so long, the only effective vendor for people wanting superior 3D performance is Nvidia. It's not the users' fault that this has simply been a fact of life for Linux for a long time now.

Nvidia makes unstable drivers. For Vista or XP or Linux they are going to be a major source of issues; it doesn't matter. With bleeding-edge versions of Linux that are unstable to begin with, you're going to run into a lot of crashes.

So it's nobody's fault really... not anybody here or involved with Ubuntu. This is why we have beta testers, and they are doing their jobs. Everybody should be thankful that this problem is being resolved now rather than on your servers and workstations.

Credit where credit is due

Posted Mar 14, 2009 23:19 UTC (Sat) by jspaleta (subscriber, #50639) [Link]

No it really is Canonical's fault...for making it easy for non-technical users to build and use these proprietary drivers. Drivers that would be a gpl violation for Canonical to build and distribute in binary form.

http://www.linux.com/feature/55915

Canonical is doing a very very neat tap dance here. They don't distribute the binary drivers...if they did that, they'd be engaging in an activity that is restricted by the gpl and could easily be called out on it. No, Canonical is much more clever: they encourage users to download the source and binary bits together and compile them locally on the system, avoiding the distribution of the infringing binary module altogether. They've even made a point-and-click gui to kick off this process. Clever Canonical, skating around the edges of gpl infringement and at the same time introducing significant instability into their userbase's experience, without an effort to adequately inform users as to what is going on and the potential downsides.

I don't think it's exactly fair to expect the upstream kernel developers to support this sort of behaviour. And I'll note that, because of the tricks being used to avoid a gpl-violation scenario, Canonical can't even do its own integration testing on these modules. There are no centrally built and distributed binaries.

Canonical might be avoiding a gpl violation by doing things this way, but they are certainly breaking some of the spirit of the intent.

-jef

Credit where credit is due

Posted Mar 15, 2009 13:29 UTC (Sun) by ovitters (guest, #27950) [Link]

> No it really is Canonical's fault

Fedora dude blaming/attacking Canonical again...

Credit where credit is due

Posted Mar 15, 2009 17:25 UTC (Sun) by drag (guest, #31333) [Link]

> No it really is Canonical's fault...for making it easy for non-technical users to build and use these proprietary drivers. Drivers that would be a gpl violation for Canonical to build and distribute in binary form.

No.

Canonical exists in reality; they are not the ones that created it.

People were installing nvidia drivers on Nvidia, Fedora, Red Hat, Debian, etc. etc. etc., long, long before Ubuntu ever came onto the scene.

Credit where credit is due

Posted Mar 14, 2009 23:13 UTC (Sat) by sbergman27 (guest, #10767) [Link]

Jef, your defensive and hateful attitude pretty well sums up (one of) the reason(s) that I gave up on Fedora after having used Red Hat at my home and in my consulting since the old Red Hat Linux 4.2 back in 1997 or whenever it was. Yes, I'm still working on migrating machines off of it. (BTW, instability is another major reason I gave up on Fedora.)

To think that a Fedora advocate, of all people, could have the gall to criticize *any* other distro's stability is nearly unbelievable.

I currently have Intel video. But I have experience with NVidia's drivers as well as Fedora's FOSS Radeon and Intel drivers. And NVidia wins hands down for stability. Which is not to say that my stability complaints end there. Not by a long shot.

It seems that the Jef Spaleta FUD-fountain flows eternal. Fortunately, the more it flows, the less credibility people accord it.

Credit where credit is due

Posted Mar 14, 2009 23:46 UTC (Sat) by jspaleta (subscriber, #50639) [Link]

Please, don't take my word for it. My word is worth about as much as a piece of SCO stock...which I have to say looks like a safe bet as far as stocks go, given all the other craziness going around right now.

How NVidia impedes Free Desktop adoption.
http://vizzzion.org/?blogentry=819
July 2008
"As a Free Software developer, user and advocate, I feel screwed by NVidia, and as a customer, even more so. I would recommend not buying NVidia hardware at this point. For both political reasons, and for practical ones: Pretty much all other graphics cards around there work better with KDE4 than those requiring nvidia.ko."

Linux Graphics Essay
https://www.linuxfoundation.org/en/Linux_Graphics_Essay

"Nvidia isn't in the list of top oops causers as part of some grand strategy to make itself (and Linux) look bad. It's there because the cost of doing the QA and continuous engineering to support the changing interfaces and to detect and correct these problems outweighs the revenue it can bring in from the Linux market. In essence this is because binary drivers totally negate the shared support and maintenance burden which is what makes Open Source so compelling."

If you must, feel free to shoot the messenger. Make it personal if that's what you need in order to engage on the issues. I'm more than happy to take a few bullets. But for anyone who cares about sustainable growth of the linux ecosystem to disregard the validity of the information is pure folly. Binary drivers are a real and significant problem for the stability of linux systems. The people doing the actual development of the linux desktop realize this, even if individual users and the entire Canonical workforce do not.

-jef

Credit where credit is due

Posted Mar 15, 2009 0:47 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"""
Please, don't take my word for it.
"""

There surely was never any danger of that. If bias were black body radiation, you'd be glowing at about 3500K. (Which hurts your credibility substantially. And likely not just with me.)

Now, I certainly don't care for NVidia keeping their drivers closed source. But in my experience with graphics drivers under Linux (which goes back to the Voodoo1 and the original, pre-Daryll-Strauss Glide driver), and having lived through the hell that has been the FOSS Radeon driver, I'd have to say that on all the cards I've had, including some NVidias, the NVidia driver has been flawless compared to the big mess that the usually incomplete FOSS video drivers often seem to be.

I wish that were not the case. But it has been my experience for about 10 years now.

Credit where credit is due

Posted Mar 15, 2009 1:15 UTC (Sun) by jspaleta (subscriber, #50639) [Link]

I guess your 10 year streak of good luck is the silver lining of hopeful comfort for all the Ubuntu users running the nvidia drivers and experiencing lock ups when exiting World of Goo. And I'm sure the upstream kernel developers are willing to discount their own long experience looking at crash reports now that you've made us aware of your high regard of the nvidia driver's stability.

I like that vibe you've got going on there. One man's personal experience against a mountain of contrary opinion and evidence. That's the sort of awe-inspiring situational awareness and view of the big picture that causes me to respect opponents of global warming and evolutionary theory so very, very highly. I salute you!

I'm really glad the Ubuntu developers decided to finally enable kerneloop reporting so we can get a more comprehensive and unbiased view of the sources of instability in the Ubuntu kernel. Though I'm not sure they are in the kerneloops.org database yet. I personally fully expect that the Ubuntu experience will be much like the Fedora one in the kerneloops record. The proprietary drivers will dominate the crash reporting statistics...Canonical will introduce some bugs via patches, which will be quickly fixed (just as Fedora has)...but the proprietary driver bugs, like nvidia, will linger and linger...contrary to your singular personal experience.

-jef

Credit where credit is due

Posted Mar 15, 2009 12:23 UTC (Sun) by nix (subscriber, #2304) [Link]

What's wrong with the FOSS radeon driver? I've had no trouble with it
(mach64 and now 9250). 3D is fast enough for my purposes, 2D is blinding
(and is now even faster thanks to a not-yet-committed patch from Michel
Dänzer to defragment the EXA glyph cache)... I've had a total of one crash
in ten years, and that was due to a device-independent Mesa bug.

Credit where credit is due

Posted Mar 16, 2009 1:18 UTC (Mon) by motk (subscriber, #51120) [Link]

I plugged a USB thumbdrive into my x86-64 machine running the nvidia driver last night and the driver crashed, hard. WTF?

Of course, anecdote != data, but the kerneloops website tells the story.

Credit where credit is due

Posted Mar 15, 2009 14:18 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"""
That's not entirely true.... they attempted to fix this problem several weeks ago by cherry-picking proposed patches from Ted before the merge into mainline. But the cherry-pick doesn't appear to have worked.
"""

Well, Ted has clearly pointed to the patches he thinks are the best work-around he has (for at least sparing files which already exist). I started to say "safer" instead of "safe" in my post, but decided to give Ted's patches the benefit of any doubt. Apparently, they are not such an effective work-around, and the Ext4 guys need to come up with something better that does not eat data. Ext4 has about zero chance of becoming the default for any distro except Fedora until they do.

BTW, I hope you don't mind that I'm starting to quote you as an example of how threatened advocates of some distros feel about other distros which they perceive as being more successful. You're the most illustrative example of the phenomenon of which I am aware. (Though I'll stop short of thanking you for providing such a clear example.)

Credit where credit is due

Posted Mar 16, 2009 1:25 UTC (Mon) by motk (subscriber, #51120) [Link]

Warggahble, etc.

Did you even *read* the article?

Credit where credit is due

Posted Mar 15, 2009 3:54 UTC (Sun) by bojan (subscriber, #14302) [Link]

> first uncovered and reported this flaw in ext4

And if you repeat a lie enough times, it becomes the truth.

Credit where credit is due

Posted Mar 15, 2009 14:06 UTC (Sun) by sbergman27 (guest, #10767) [Link]

Please present your evidence. Who found it and reported it before the launchpad bug documented in the launchpad link in Ted's own blog posting? And do please provide source.

The lengths to which some people will go to avoid crediting Ubuntu is simply amazing. I've never seen such a notable sour grapes response in the community as the reactions I see today from advocates of less popular distros. Most of the really notable ones seem to come from the Fedora camp, where the perceived threat level is apparently particularly high. But then, Fedora has always had more than its fair share of spiteful advocates.

Credit where credit is due

Posted Mar 15, 2009 19:03 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

It seems that the side effect of delayed allocation and the new multi-block allocator was well known to Ted and others for a while.

http://lxer.com/module/newswire/view/116126/#ext4

"Ext4 isn't all good news though, the new allocator that it uses is likely to expose old bugs in software running on top of it. With ext3, applications were pretty much guaranteed that data was actually written to disk about 5 seconds after a write call. This was never official but simply resulted from the way ext3 was implemented. The new allocator used for ext4 means that this can take between 30 seconds and 5 minutes or more if you are running in laptop mode. It exposes a lot of applications that forget to call fsync() to force data to the disk but nevertheless assume that it has been written."

I recommend that you listen to

http://fosdem.unixheads.org/2009/maintracks/ext4.xvid.avi

His talk actually goes into quite a bit of detail about how to avoid this problem and the potential workarounds he was looking at. He mentioned that Eric Sandeen, an XFS developer now working on Ext4 at Red Hat, had talked to Ted about how XFS had some hacks at the filesystem level to work around this problem of application writers relying on Ext3-like behaviour. The current Ext4 patches are based on the same ideas. The rawhide kernel already had the patches backported.

https://www.redhat.com/archives/fedora-devel-list/2009-Ma...

It appears that proprietary kernel modules causing more instability aggravates the problem as well. Good to get more exposure of the gotchas, however. It looks like Btrfs now has similar patches as well, in part as a result of such wider discussions.

Credit where credit is due

Posted Mar 15, 2009 21:14 UTC (Sun) by bojan (subscriber, #14302) [Link]

> Please present your evidence.

The lie is that there was a flaw in ext4. There is no flaw in ext4 (not when it comes to this, at least) - applications are broken, because they don't do what's required. They are falling short.

Ted put a workaround into ext4 to address the shortcomings of applications.

The evidence is in your manual pages. Just read them.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 15:59 UTC (Sun) by NinjaSeg (guest, #33460) [Link]

"I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do."

Why yes, they want to play games. Bitch about proprietary drivers all you want, but open source drivers simply don't cut it for games less than 5 years old. Recent games have started *requiring* shader support. Do open source drivers provide it? Other than maybe the newest Intel chips, no. And ever since the great modesetting revolution, performance has gone to crap. Quake 3 isn't even playable on a Radeon 9600XT anymore.

And open source drivers aren't particularly stable at the moment either:

https://bugzilla.redhat.com/show_bug.cgi?id=441665
https://bugzilla.redhat.com/show_bug.cgi?id=474977
https://bugzilla.redhat.com/show_bug.cgi?id=487432

Just when things were starting to work, modesetting comes along and breaks everything. With great reluctance I have given up on gaming in Fedora, and have gone back to multibooting WinXP to game. When in Rome...

Here's to another year or two of waiting before drivers stabilize... again.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 8:57 UTC (Mon) by kornelix (guest, #57169) [Link]

I do not understand why there is a debate about this. If a file is written and closed by the application, I see no reason to delay committing it to disk. No work will be saved, only delayed. Nothing can be better optimized by the delay (well, maybe a bit of seek time on a busy disk, but this only applies to commit delays of less than a second or so). The only impact of the delay is a greater risk that the update will get lost. Of course the buffers should be marked "clean" and retained in cache for a while, in case a read of the same file is requested shortly afterwards.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 9:28 UTC (Mon) by dlang (guest, #313) [Link]

think about temporary files created during a compile. you may create them, fill them, and close them with one program. then a second program comes along a few seconds later to read and delete the file. it never actually needs to hit the disk

not all temporary files are only used by a single program that keeps them open the entire time.
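As a toy illustration of that producer/consumer temp-file pattern (the file name and contents below are made up): with delayed allocation the intermediate file can live and die entirely in the page cache and never touch the disk at all.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Step 1: producer writes an intermediate file and closes it. */
        int fd = open("stage1.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        write(fd, "intermediate result\n", 20);
        close(fd);                     /* no fsync(): nothing forces it to disk */

        /* Step 2: a little later, a consumer reads it back and discards it. */
        char buf[64];
        fd = open("stage1.tmp", O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf));
        close(fd);
        unlink("stage1.tmp");          /* gone before any writeback happens */

        return n > 0 ? 0 : 1;
    }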

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 16, 2009 10:24 UTC (Mon) by oseemann (subscriber, #6687) [Link]

So it turns out there are really very different use cases for files. As the name implies, temporary files need never hit the disk and could thus even happily reside on a ramdisk (many systems clear /tmp upon reboot anyways).

For /home or /var many users might want a more conservative approach, e.g. fsync on close or something similar, accepting performance penalties where necessary.

I believe this is a larger issue and I'm glad the current behavior of ext4 receives such wide attention and makes people think about the actual requirements for persistent storage.

I'm certain in the long run the community will come up with a proper approach for a solution.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 23, 2009 4:56 UTC (Mon) by dedekind (guest, #32521) [Link]

I'm confused. So userspace people refuse to understand that before they call 'fsync()', the data does not have to be on the disk. And "man 2 write" even says this at the end.

And now what Ted is doing - he is fixing userspace bugs and pleasing angry bloggers by hacking ext4? Because ext4 wants more users? And now we have zero chance of ever getting userspace fixed?


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds