Ts'o: Delayed allocation and the zero-length file problem
"What's the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven't quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won't have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync."
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 14:20 UTC (Fri) by welinder (guest, #4699) [Link]
Ts'o says "POSIX allows this behaviour".
Compare that to the gcc developers who say "The C standard allows this or that behaviour" and the kernel people's response that gcc should do the reasonable thing (whatever that is) regardless of what the standard allows.
Similarly, all FSs should try hard to make sure open-write-close-rename leaves you with either the new or the old file. Anything else isn't reasonable.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 14:33 UTC (Fri) by jbailey (guest, #16890) [Link]
Relying on undocumented behaviour and then crying foul when it changes is just bad. The OS cannot and should not provide bug for bug compatibility with itself.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 14:46 UTC (Fri) by mrshiny (subscriber, #4266) [Link]
To what extent that compatibility is required or desired is a different matter.
In any case, "Linux" has, for many years now, been commonly found with EXT3 as its default filesystem. This filesystem did not exhibit data-loss for the scenario in question. EXT4 does. How is this not a regression? We're not talking about a program crashing or running slowly or anything like that, we're talking about data loss. DATA LOSS. If there's one thing a computer should get right, it's storing the darn data and not losing it.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:42 UTC (Fri) by drag (guest, #31333) [Link]
1. It's not the file system causing data loss. It's a combination of buggy drivers and incorrect application developer assumptions that are causing data loss. The file system is working correctly.
2. Ext3 exhibits the same behavior, as do all modern file systems. You have the same problems with badly designed applications with XFS or Btrfs, for example.
The only difference between Ext3 and Ext4 in this regard is that with Ext3 you had a 5 second window and with Ext4 you generally have up to 60 seconds. That is, there is no difference in behavior if your system crashes within 4 seconds of a write.
3. Linux supports multiple different file systems and it always has. You're not dealing with Windows, where your only choices in life are NTFS or FAT32.
Therefore, if you want an OS that can benefit from the positive qualities of anything other than Ext3, it's shitty policy to bend over backwards to support badly written applications whose developers never bothered to test on anything other than Ext3 or to understand what the code they are writing actually does.
4. If you want your software to be portable at all to other operating systems, say OS X, Windows, FreeBSD, etc etc... then depending on the dumb-luck behavior of a common configuration of a nearly obsolete file system on a single operating system is not a good way to go.
5. Ts'o has introduced a patch that helps mitigate this issue anyway.
----------------------------------
Anyway: if you have ever complained about banks that demand IE only and ignore standards, then what you're saying now is slightly hypocritical. POSIX is a standard for accessing file systems; the specific incidental behavior of Ext3 isn't.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:14 UTC (Fri) by forthy (guest, #1525) [Link]
You have the same problems with badly designed applications with XFS or Btrfs, for example.
Actually, not with Btrfs. This is a log structured file system with COW, i.e. it preserves file system operation order. You either get new data and new metadata, or you get old data and old metadata, but no mix like in ext4 or XFS. BTW: This discussion is old as dirt. See Anton Ertl's very old rant on it: What's wrong with synchronous metadata updates. If you don't understand what's wrong with synchronous metadata updates, don't even try to write a robust file system. You can still write a fast, but fragile file system, but this doesn't need synchronous metadata updates.
If you move the problem of creating a consistent file system to the applications, you will fail. fsync() only syncs the data - the metadata can still be whatever the file system designer likes (old, new, not yet allocated, trash that will settle down in five seconds). The file system can still be broken beyond repair after a crash. And using fsync() is horribly slow, increases file system fragmentation, and so on. If you are a responsible file system designer, you don't push your consistency problems onto the user. You solve them.
The fact that ext3 is broken the same way, just with a shorter window, is no excuse. ReiserFS was broken in a similar way, until they made data journalling the default. Btrfs gets it right by design, but then, it is better to wait for Btrfs to mature than to "fix" ext4.
The applications in question like KDE are not broken. They just rely on a robust, consistent file system. There is no guarantee for that, as POSIX does not specify that the file system should be robust in case of crashes. But it is a sound assumption for writing user code. If you can't rely on your operating system, write one yourself.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:47 UTC (Fri) by masoncl (subscriber, #47138) [Link]
If you:
create(file1)
write(file1, good data)
fsync(file1)
create(file2)
write(file2, new good data)
rename(file2 to file1)
FS transaction happens to commit
<crash>
file2 has been renamed over file1, and that rename was atomic. file2 either completely replaces file1 in the FS or file1 and file2 both exist.
But, no fsync was done on file2's data, and file2 replaced file1. After the crash, the file1 you see in the directory tree may or may not have data in it.
The btrfs delalloc implementation is similar to XFS in this area. File metadata is updated after the file IO is complete, so you won't see garbage or null bytes in the file, but the file might be zero length.
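To make the pattern under discussion concrete, here is a minimal C sketch of the replace-via-rename sequence with the fsync() on the new file that the article recommends. The helper name, paths and error handling are illustrative only, not taken from any particular application.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write new contents to a temporary file, force them to disk with
 * fsync(), then rename() the temporary file over the real name.
 * After a crash the name should refer to either the complete old
 * file or the complete new file, never a zero-length one. */
static int save_atomically(const char *path, const char *tmp,
                           const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {  /* short writes treated as errors for brevity */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (fsync(fd) != 0) {                        /* data (and its allocation) reaches the disk */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp, path);                    /* atomic with respect to the name */
}

int main(void)
{
    const char *msg = "new good data\n";
    return save_atomically("file1", "file1.tmp", msg, strlen(msg)) ? 1 : 0;
}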
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:13 UTC (Fri) by foom (subscriber, #14868) [Link]
> is updated after the file IO is complete, so you won't see garbage or null
> bytes in the file, but the file might be zero length.
Better get to fixing that, then!
I'm rather amused at the number of comments along the lines of "XFS already does it so it must be okay!" A filesystem known for pissing off users by losing their data after power-outages is not one I'm happy to see filesystem developers hold up as a shining example of what to do...
(and apparently XFS doesn't even do this anymore, after the volume of complaints raised against it, according to other comments!)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:44 UTC (Fri) by masoncl (subscriber, #47138) [Link]
This is a fundamental discussion about what the FS is supposed to implement when it does a rename.
The part where applications expect rename to also mean fsync is a new invention with broad performance implications.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:11 UTC (Fri) by alexl (subscriber, #19068) [Link]
This isn't about having the data on disk *now*.
We just expect it not to write the new metadata for "newpath" to disk before the data in oldpath is on disk.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:34 UTC (Fri) by masoncl (subscriber, #47138) [Link]
We have two basic choices to accomplish this:
1) Put the new file into a list of things that must be written before the commit is done. This is pretty much what the proposed ext4 changes do.
2) Write the data before the rename is complete.
The problem with #1 is that it reintroduces the famous ext3 fsync behavior that caused so many problems with firefox. It does this in a more limited scope, just for files that have been renamed, but it makes for very large side effects when you want to rename big files.
The problem with #2 is that it is basically fsync-on-rename.
The btrfs fsync log would allow me to get away with #1 without too much pain, because fsyncs don't force full btrfs commits and so they won't actually wait for the renamed file data to hit disk.
But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?
Applications have known how to get consistent data on disk for a looong time. Mail servers do it, databases do it. Changing rename to include significant performance penalties when it isn't documented or expected to work this way seems like a very bad idea to me.
I'd much rather make a new system call or flag for open that explicitly documents the extra syncing, and give application developers the choice.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:46 UTC (Fri) by alexl (subscriber, #19068) [Link]
At the level of the close happening we don't really know what kind of data this is, as this is generally some library helper function. And even at the application level, how do you know that it's important not to zero out a file on crash? It depends on how the user uses the application.
It all comes back to the fact that for "proper" use of this, more or less all cases would turn into sync-on-write (or the new flag or whatever). So, the difference wrt the filesystem-wide implementation of this will get smaller as apps get "fixed", until the difference is more or less zero.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:19 UTC (Fri) by drag (guest, #31333) [Link]
But as has been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc etc. All sorts of them. Right?
So you have the choice of making undocumented behavior documented and then forcing that behavior on all the file systems that Linux supports and all the file systems that you expect your application to run on, or you can fix the application to do it right.
And the assumptions that were made to create this bad behavior are not even true. So it's not even a question of backward compatibility... They've always gotten it wrong; it's just been dumb luck that it wasn't a bigger issue in the past.
As long as file systems are async, you're going to have a delay between when the data is created and when that data is committed to disk. You can do all sorts of things to help reduce the damage that can cause, but it's still the fundamental nature of the beast. If you lose power or crash your computer you WILL lose some data.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 0:38 UTC (Sat) by drag (guest, #31333) [Link]
I've been reading what has been written and I think I am understanding what is going on better now.
But here is my new thought on the subject:
The reason why they are going:
create file 1
write file 1
.... (time passes)
create file 2
write file 2
rename file 2 over file 1
Is because they are trying to get an 'atomic' operation, right?
They are making a few assumptions about the file system that:
A. the file system does not perform operations atomically, and thus they have to do this song and dance to do the file system's work for them... (protect their data)
B. that while the file system is going to fail them otherwise, it is still able to do rename in one single operation.
C. That by renaming the file they are actually telling the file system to commit it to disk.
So in effect they are trying to outguess or outthink the operating system. But their assumptions, in the case of Ext4 and most others, are not correct, and their software is doing what they told it to do; it's just that what they told it to do is not what they think it's doing.
You see, they are already putting extra effort into compensating for the file system. So if they are putting the extra effort into outthinking the OS, then why don't they at least do it correctly?
Instead of writing out hundreds of files and trying the rename-atomic trick, which isn't really right anyway, there are half a dozen different design approaches that would yield better results.
Or am I completely off-base here?
I understand the need for the file system to protect a user's data despite what the application developers actually wrote. Really I do.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 8:15 UTC (Sat) by alexl (subscriber, #19068) [Link]
The "atomic" part is protection against other apps that are saving to the file at the same time, not crashes. The fsync is only required not to get problems when the system crashes.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 16:45 UTC (Sat) by drag (guest, #31333) [Link]
Not getting it right
Posted Mar 14, 2009 12:17 UTC (Sat) by man_ls (guest, #15091) [Link]
The "other applications get it right with fsync" part is a bit of a fallacy. When you save a document in an application (be it a word processor or a mail client) you really want it to be safe, and it is normally not an issue to wait a few seconds. IOW the user is expected to wait for disk activity because we have been trained this way, so we are willing to accept this trade-off.But there are other programs doing file operations all the time, and nobody wants to wait a few seconds for them. Most notably background tasks like the operating system or a desktop environment. Is it reasonable to expect all of them to do something which potentially slows the system to a crawl on other filesystems, just to play safe with the newcomer ext4?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 19, 2009 19:34 UTC (Thu) by anton (subscriber, #25547) [Link]
> But as has been pointed out, many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc etc. All sorts of them. Right?
Emacs did get it right when UFS lost my file that I had just written out from emacs, as well as the autosave file. But UFS got it wrong, just as ext4 gets it wrong now. There may be applications that manage to work around the most crash-vulnerable file systems in many cases, but that does not mean that the file system has sensible crash consistency guarantees.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 19, 2009 19:50 UTC (Thu) by alexl (subscriber, #19068) [Link]
However, all apps doing such syncing results in lower overall system performance than if the system could guarantee data-before-metadata on rename. So, ideally we would either want a new call to use instead of fsync, or a way to tell if the guarantees are met so that we don't have to fsync.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:16 UTC (Fri) by foom (subscriber, #14868) [Link]
Whether an app called fsync for a file or not is rather irrelevant in such a case. Obviously, this kind of failure is allowed by the standards. Just as obviously, it sucks for users, so people made better filesystems that don't have that failure mode. That's good; I quite enjoy using a system that doesn't lose my entire filesystem randomly if the power fails.
So it seems to me that claiming that, because failing to call fsync before rename is "incorrect" API usage, it's okay to lose both old data and new is simply wishful thinking on the part of the filesystem developers. Sure, it may be allowed by the standards (as would be zeroing out the entire partition...), but it sucks for users of that filesystem. So filesystems shouldn't do it. That's really all there is to it.
Unless *every* call to rename is *always* preceded by a call to fsync (including those in "mv" and such), it will suck for users. And there's really no point in forcing everyone to put fsync()s before every rename, when you could just make the filesystem work without that, and get to the same place faster.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 7:35 UTC (Sun) by phiggins (guest, #5605) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 18:29 UTC (Fri) by anton (subscriber, #25547) [Link]
> But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?
The API in the non-crash case is defined by POSIX, so I translate this as: what guarantees in case of a crash should the file system give?
One ideal is to perform all operations synchronously. That's very expensive, however.
The next-cheaper ideal is to preserve the order of the operations (in-order semantics), i.e., after crash recovery you will find the file system in one of the states that it logically had before the crash; the file system may lose the operations that happened after some point in time, but it will be just as consistent as it was at that point in time. If the file system gives this guarantee, any application that is written to be safe against being killed will also have consistent (but not necessarily up-to-date) data in case of a system crash.
This guarantee can be implemented relatively cheaply in a copy-on-write file system, so I really would like Btrfs to give that guarantee, and give it for its default mode (otherwise things like ext3's data=journal debacle will happen).
How to implement this guarantee? When you decide to do another commit, just remember the then-current logical state of the file system (i.e., which blocks have to be written out), then write them out, then do a barrier, and finally the root block. There are some complications: e.g., you have to deal with some processes being in the middle of some operation at the time; and if a later operation wants to change a block before it is written out, you have to make a new working copy of that block (in addition to the one waiting to be written out), but that's just a variation on the usual copy-on-write routine.
You would also have to decide how to deal with fsync() when you give this guarantee: can fsync()ed operations run ahead of the rest (unlike what you normally guarantee), or do you just perform a sync whenever an fsync is requested?
The benefits of providing such a guarantee would be:
- Many applications that work well when killed would work well on Btrfs even upon a crash.
- It would be a unique selling point for Btrfs. Other popular Linux file systems don't guarantee anything at all, and their maintainers only grudgingly address the worst shortcomings when there's a large enough outcry, while complaining about "incorrect API usage" by applications, and some play fast and loose in other ways (e.g., by not using barriers). Many users value their data more than these maintainers do, and would hopefully flock to a filesystem that actually gives crash consistency guarantees.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 23:17 UTC (Sat) by masoncl (subscriber, #47138) [Link]
It works differently in btrfs than xfs and ext4 because fsyncs go through a special logging mechanism, and so an fsync on one file won't have to wait for the rename flush on any other files in the FS.
I'll go ahead and queue this patch for 2.6.30.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 8:38 UTC (Mon) by njs (guest, #40338) [Link]
http://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_hav...
I'm curious what I'm missing...
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525) [Link]
I'm curious, too. I thought btrfs did it right, by COW-logging data & metadata and having data=ordered mandatory, with all the explanations in the FAQ that make complete sense (correct checksums in the metadata also mean correct data). Now Chris Mason tells us it didn't? OK, this will be fixed in 2.6.30, and for now, we don't expect btrfs to be perfect. We expect bugs to be fixed, and that's going well.
IMHO a robust file system should preserve data operation ordering, so that a file system after a crash follows the same consistency semantics as during operation (and during operation, POSIX is clear about consistency). Delaying metadata updates until all data is committed to disk at the update points should actually speed things up, not slow them down, since there is an opportunity to coalesce several metadata updates into single writes without seeks (delayed inode allocation e.g. can allocate all new inodes into a single consecutive block, delayed directory name allocation all new names into consecutive data, as well).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138) [Link]
This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.
When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.
I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similar to the way ext3 does today. Files that have been through rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that were from the current transaction).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 21:23 UTC (Mon) by njs (guest, #40338) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138) [Link]
Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 22:56 UTC (Mon) by njs (guest, #40338) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (subscriber, #4931) [Link]
Sorry this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst case scenario -- a size of zero after crash -- precisely violates atomicity.
For the record, the first two paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e., update the metadata -- before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.
"bug for bug compatibility" is why Windows sucks
Posted Mar 13, 2009 16:50 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]
The OS can provide bug-for-bug compatibility with itself. To do so is clearly possible given the evidence that Windows does it.
Microsoft has managed to hire some of the most brilliant developers on the planet. But they have a heavy burden: every mistake they ever made in the last 20 years, every misdesigned API, every unspecified behavior, has come to be relied on by some significant application developer, so they have to keep this mountain of crap duct-taped together and running. Some of the worst offenders have been Microsoft's own application developers, relying on undocumented behavior that they can find out by reading the source code as a competitive edge.
The wisest thing the Linux developers have done is that they decided they're willing to regularly break kernel APIs, other than system calls. It's the main reason that Linux has been able to catch up so rapidly.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:41 UTC (Fri) by qg6te2 (guest, #52587) [Link]
Hang on, comparing the changing functionality of glibc, windows, etc to a file system is misleading. Ensuring that "open-write-close-rename" does what it says is a reasonable requirement, even if POSIX is silent about it.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:03 UTC (Fri) by rsidd (subscriber, #2582) [Link]
It reminds me of Richard Gabriel's famous "Worse is better" screed.
From Gabriel's article:
> The New Jersey guy said that the Unix folks were aware of the [PC-losering] problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again.
Which, to me, sounds very much the same as saying, like Ted Ts'o, that a correct user program has to fsync() its data and not rely on fclose() actually flushing anything to disk. Also, there are lots of people (like scientific programmers) who write their own short file-handling code without being fanatically "correct" C programmers; buffer overflows and other such bugs are probably OK for them, since their systems are trusted, but data loss really is not OK.
Still, as Gabriel says, Unix won against Lisp systems. And Windows (which was even worse up until Windows ME) won against Unix. So there's food for thought there.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:19 UTC (Fri) by ajross (guest, #4563) [Link]
Agreed. What's the point of putting all these fancy journaling and reliability features into a file system if they don't work by default? I mean, hell, we could lose data after a system crash with ufs in 1983. Why bother with ext4?
Hiding behind POSIX here is just ridiculous. POSIX allows this absurd lack of reliability not because it's a good idea, but because the filesystems available when the standard was drafted couldn't support anything better.
Worse is better
Posted Mar 13, 2009 23:03 UTC (Fri) by pboddie (guest, #50784) [Link]
Although people can argue that UNIX "got things about right" in comparison to competing (and presumably discontinued) operating systems which were more clever in their implementation, there's a lot to be said for not pestering application programmers with, for example, the tedious details of fsync and friends at the expense of common sense idioms that just work, like those which assume that closed files can safely have filesystem operations performed on them. Those tedious details involving, of course, figuring out which sync-related function actually does what the developer might anticipate from one platform to the next.
Sometimes worse really is worse.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 3:37 UTC (Sat) by bojan (subscriber, #14302) [Link]
It's not Ted saying this. It is how it works. From man 2 close:
> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
OK?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 6:13 UTC (Mon) by jamesh (guest, #1159) [Link]
Conversely, if the file data hasn't been written to disk then they expect that the rename over the old data won't occur.
There is no expectation that either of these operations will occur immediately, which is why they don't request that happen via fsync().
If the current method applications use when expecting this behaviour isn't going to be honoured, then it'd be nice to define an API that does provide the desired semantics. That said, I can't think of any cases where you wouldn't want the new data blocks written out before renaming over an existing file.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:44 UTC (Fri) by forthy (guest, #1525) [Link]
"POSIX allows this" is an anal point. Like "the C standard allows this". This is not what standard documents are about. I'm working in a standard committee, and standards are about compromises between different implementations who all have different design goals. If you design a file system, it is your task to define those parts left out by the standard. Do you want it fast, do you want it robust, or do you want it compatible with some cruft out there for 25 years (like FAT)? A robust file system has to preserve a consistent state of the file system in case of a crash - and POSIX defines what consistent is: no reordering of operations allowed. The real solution therefore is log-structured file system with COW. Ok, so wait for btrfs.
But in the meantime, ext4 could just delay metadata writes as well, and make sure that at these sync points it does not commit meta data until it also has written the data out. The bug here is that ext4 writes "commit" to a metadata update even though data updates issued prior to this metadata update are still pending.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:58 UTC (Fri) by kh (subscriber, #19413) [Link]
Server farms are not the target
Posted Mar 13, 2009 19:21 UTC (Fri) by khim (subscriber, #9252) [Link]
Ext2 is plenty fast for most operations, and if you can just throw away everything (like Google can) it's not a bad choice. And if ext4 cannot provide a sane guarantee (and yes, open/write/close/rename must be atomic no matter what POSIX says) it's a filesystem for nobody...
Server farms are not the target
Posted Mar 13, 2009 19:41 UTC (Fri) by martinfick (subscriber, #4455) [Link]
Server farms are not the target
Posted Mar 14, 2009 1:21 UTC (Sat) by droundy (subscriber, #4559) [Link]
Server farms are not the target
Posted Mar 14, 2009 1:47 UTC (Sat) by martinfick (subscriber, #4455) [Link]
You may be right, but now you have me questioning both wikipedia and my own interpretation. :)
Atomicity vs durability
Posted Mar 14, 2009 13:42 UTC (Sat) by man_ls (guest, #15091) [Link]
"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash. I think that durability is not needed here:In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.In ext3 it does not matter (too much) if the transaction stays committed, but it cannot be left in the middle of an operation (crashes notwithstanding).
Let's see an example. Say we have the following sequence:
atomic change -> commit -> 5 secs -> flushed to disk
The change might be a rename operation. If the system crashes during those 5 secs, the transaction might not survive the crash -- the filesystem would appear as before the atomic change, and thus ext3 is not durable. But the transaction can only appear as not started or as finished, and not in any state in between; thus ext3 is atomic. I guess that is what fsync() is about: durability of the requested change.
But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.
I may very well have confused everything up, and would be grateful for any clarification. My coffee intake is not what it used to be these days.
Atomicity vs durability
Posted Mar 15, 2009 3:21 UTC (Sun) by bojan (subscriber, #14302) [Link]
I think you're missing the crucial distinction here. When the atomic rename happens (and atomicity refers to file names _only_ here), "so that there is no point at which another process attempting to access newpath will find it missing" (from rename(2) manual page), the new file will replace the old file. However, because the application writer didn't commit the _data_ to the new file yet, it may not be on disk.
In other words, rename(2) does _not_ specify atomicity of _data_ inside the file, only that at no point the file _name_ will be missing. For data to be in that new file, durability is required. Ergo, fsync.
The whole API, including write(), close(), fsync() and rename() has absolutely no idea that the application writer is trying to atomically replace the file. Only the application writer knows this and must act accordingly.
Atomicity vs durability
Posted Mar 15, 2009 9:44 UTC (Sun) by alexl (subscriber, #19068) [Link]
POSIX guarantees an atomic replacement of the file, and this means *both* the data and the filenames[1]. However, POSIX doesn't specify anything about system crashes. So, this guarantee is only valid for the non-crashing case.
For the crashing case POSIX doesn't guarantee anything. In fact, many POSIX filesystems such as ext2 can (correctly, by the spec) result in a total loss of all filesystem data in the case of a system crash. And in fact, this is allowed even if the application fsync()ed some data before the crash.
Now, in order to have some way of getting better guarantees than this POSIX also supplies fsync that guarantees that the files have been written to disk. However, nowhere in the specs of fsync does it say that it guarantees that this will survive a system crash. Of course if it *does* survive it is nice to have the fsync guarantee because that means if the metadata change survived we're more likely to get the whole new file.
But, your argument that the "atomic" part only refers to the filenames is bullshit. POSIX does give full guarantees for both filename and content in the case it specifies. Everything else is up to the implementation. This is why it's a good idea for a robust filesystem to give the write-data-before-metadata-on-rename guarantee, since it turns a non-crash POSIX guarantee into a post-crash guarantee. (Of course, this is by no means necessary, since even ext2 with full data loss on crash is POSIX compliant; it's just the mark of a *good* implementation.)
[1] From the POSIX spec:
If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.
(Notice how this has no separation about "filenames" and "data")
Atomicity vs durability
Posted Mar 15, 2009 10:18 UTC (Sun) by bojan (subscriber, #14302) [Link]
Notice how it doesn't say _anything_ at all about the content of the file _on_ _disk_ that is being renamed into the old file. That is because in order to see the file _durably_, you have to commit its content. Completely unrelated to committing the _rename_.
Just because another process can see the link (and access the data correctly, which may still just be cached by the kernel) does _not_ mean data reached the disk yet.
BTW, thanks for working on fixes of this in Gnome.
Atomicity vs durability
Posted Mar 15, 2009 10:29 UTC (Sun) by alexl (subscriber, #19068) [Link]
I argue that a robust, useful filesystem should give data-before-metadata-on-rename guarantees, as that would make it a better filesystem. And without this guarantee I'd rather use another filesystem for my data. This is clearly not a requirement, and important code should still fsync() to handle other filesystems. But it's still the mark of a "good" filesystem.
Atomicity vs durability
Posted Mar 15, 2009 10:37 UTC (Sun) by bojan (subscriber, #14302) [Link]
Atomicity vs durability
Posted Mar 15, 2009 11:04 UTC (Sun) by man_ls (guest, #15091) [Link]
That is why we are willing to replace it in the first place! But not if it means losing its good properties in the process (be they in the spec or not).
Atomicity vs durability
Posted Mar 20, 2009 11:24 UTC (Fri) by anton (subscriber, #25547) [Link]
> Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)
If the fsync() has to write out 500MB, I certainly would expect it to take several seconds and the fsync call to block for several seconds. fsync() is just an inherently slow operation. And if an application works around the lack of robustness of a file system by calling fsync() frequently, the application will be slow (even on robust file systems).
Atomicity vs durability
Posted Mar 15, 2009 10:36 UTC (Sun) by bojan (subscriber, #14302) [Link]
Consider this. A file has been renamed and just then the kernel decides that the directory will be evicted from the cache (i.e. committed to disk). What will be written to disk? The _new_ content of the directory will be written out, because any other process looking for the file must see the _new_ file (and must never _not_ see the file). At the same time, the content of the file can still be cached just fine and _everything_ is atomically visible to all processes.
And yet, the atomicity of rename _only_ refers to filenames.
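For what it's worth, here is a small sketch of the non-crash behaviour being described; the file names are purely illustrative, and it assumes a file named "data" already exists. A process holding an open descriptor keeps reading the old contents, while any open() by name after the rename sees the new file, with no window in which the name is missing.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    ssize_t n;

    /* Prepare the replacement file. */
    int tmpfd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(tmpfd, "new contents\n", 13);
    close(tmpfd);

    /* Hold a descriptor to the existing "data" file (the old inode). */
    int oldfd = open("data", O_RDONLY);

    /* Atomic with respect to the name: no reader ever finds "data" missing. */
    rename("data.tmp", "data");

    /* The old descriptor still refers to the old inode and its contents... */
    n = read(oldfd, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("via old descriptor: %s", buf);
    }

    /* ...while a fresh open() by name sees the replacement. */
    int newfd = open("data", O_RDONLY);
    n = read(newfd, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("via new open():     %s", buf);
    }

    close(oldfd);
    close(newfd);
    return 0;
}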
Atomicity vs durability
Posted Mar 15, 2009 11:02 UTC (Sun) by man_ls (guest, #15091) [Link]
No, you are just blurring the issue; transactionality does not work that way. I think alexl's interpretation is correct here. It does not matter if the contents of the file are still cached; other processes can see either the old contents or the new contents, but not both and not a broken file. The rename cannot be atomic if the name points to e.g. an empty file; not only the filename must be valid but the contents of the file as well, up to the point when the rename is done. It is no good if the file appears as it was before the atomic operation was issued (e.g. empty).
With fsync you make the contents persistent, i.e. durable, but the operation should be atomic even without the fsync.
Atomicity vs durability
Posted Mar 15, 2009 11:34 UTC (Sun) by dlang (guest, #313) [Link]
the only time they could see a blank or corrupted file is if the system crashes.
so the atomicity is there now
it's the durability that isn't there unless you do f(data)sync on the file and directory (and have hardware that doesn't do unsafe caching, which most drives do by default)
Atomicity vs durability
Posted Mar 15, 2009 12:48 UTC (Sun) by man_ls (guest, #15091) [Link]
Exactly. What is missing now is atomicity in the face of a crash. To quote myself from a few posts above:
> "Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash.
Durability (what fsync provides) is not needed here; durability means that the transaction is safely stored to disk. What we are requesting from ext4 is that the property of atomicity be preserved even after a crash.
Even if the POSIX standard does not speak about system crashes it is good engineering to take them into account IMHO.
Atomicity vs durability
Posted Mar 15, 2009 13:06 UTC (Sun) by bojan (subscriber, #14302) [Link]
Which is not what POSIX requires.
Atomicity vs durability
Posted Mar 15, 2009 13:13 UTC (Sun) by man_ls (guest, #15091) [Link]
That is right, that is why we are not using ext2 (which is POSIX-compliant), FreeBSD (which is POSIX-compliant) or even Windows Vista (which can be made to be POSIX-compliant). We are running a journalling file system in the (apparently silly) hope that the system will hold our data and then give it back.
Atomicity vs durability
Posted Mar 15, 2009 13:19 UTC (Sun) by bojan (subscriber, #14302) [Link]
I think we should come up with a new API that guarantees what people really want. Making the existing API do that on a particular FS is just going to make applications non-portable to any FS that doesn't work that way using existing POSIX API. We've seen this with XFS. Who knows what's lurking out there. Better do the proper thing, fsync and be done with it. Then we can invent the new, better, smarter API.
Atomicity vs durability vs reliability
Posted Mar 15, 2009 13:35 UTC (Sun) by man_ls (guest, #15091) [Link]
No, you are not all for reliability if you cannot see beyond your little POSIX manual, or if you don't care about system crashes because the manual is silent about this particular point. Sorry to break it to you: reliability is little details such as having a predictable response to a crash, or surviving the crash while retaining all the nice properties.
> I think we should come up with a new API that guarantees what people really want.
The APIs are good enough as they are -- we don't need a special "reliability API" so we can build a special "reliability manual" for guys who just follow the book.
> We've seen this with XFS.
Nope. What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX compliance", and then finally gave in. With ext4 we are hoping to get to the point where the devs give in before they lose most of their user base, just because ext4 is important for Linux and for our world domination agenda. Meanwhile you can keep waving the POSIX standard in our face. The POSIX standard seems to be about compatibility, not about reliability, and it should keep playing that role. Reliability is left as an exercise for the attentive reader. Let us hope that Mr Ts'o is attentive and can tell atomicity, reliability and durability apart.
Actually it's done deal...
Posted Mar 15, 2009 17:34 UTC (Sun) by khim (subscriber, #9252) [Link]
If you read the comments on tytso's blog you'll see that the current position is: "POSIX is right and the applications are broken, but we'll save them anyway". Even if the "proper way" is to fix thousands of applications, it's just not realistic - so ext4 (starting from 2.6.30) will try to save these broken applications by default. And if you want performance, there is a switch. Good enough for me. Can we close the discussion?
Actually it's done deal...
Posted Mar 15, 2009 21:10 UTC (Sun) by bojan (subscriber, #14302) [Link]
Sorry
Posted Mar 15, 2009 21:20 UTC (Sun) by man_ls (guest, #15091) [Link]
Sure, I have polluted the interwebs enough with my ignorance, and there is little chance to learn anything else.
Atomicity vs durability vs reliability
Posted Mar 15, 2009 21:06 UTC (Sun) by bojan (subscriber, #14302) [Link]
POSIX manual is not little ;-)
Seriously, we tell Microsoft that going out of spec is bad, bad, bad. But, we can go out of spec no problem. There is a word for that:
http://en.wikipedia.org/wiki/Hypocrisy
> What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in.
Yep, blame the people that _didn't_ cause the problem. We've seen that before.
Sorry, but I don't see it this way...
Posted Mar 15, 2009 22:08 UTC (Sun) by khim (subscriber, #9252) [Link]
I have yet to see anyone who asks Microsoft never to go beyond the spec. That would be insane: if you cannot ever add anything beyond what the spec says, how can any progress occur?
When Microsoft is blamed it's because Microsoft
1. does not implement the spec correctly, or
2. doesn't say which are the spec's requirements and which are extensions.
When Microsoft says "JNI is not sexy so we'll provide RMI instead" the ire is NOT about problems with RMI. The lack of JNI is to blame.
I don't see anything of the sort here: POSIX does not require making open/write/close/rename atomic, but it certainly does not forbid it. And it's a useful thing to have, so why not? It would be best to actually document this behaviour, of course - after that, applications can safely rely on it and other systems can implement it as well if they wish. We even have a nice flag to disable this extension if someone wants that :-)
Sorry, but I don't see it this way...
Posted Mar 15, 2009 22:24 UTC (Sun) by bojan (subscriber, #14302) [Link]
Which is exactly what our applications are doing. POSIX says: commit. We don't, and then we blame others for it.
This is the same thing HTML5 is doing
Posted Mar 15, 2009 22:33 UTC (Sun) by khim (subscriber, #9252) [Link]
Sorry, but it's not a problem with POSIX or the FS - it's a problem with a number of applications. Once a lot of applications start to depend on some weird feature (content sniffing in the case of HTML, atomicity of open/write/close/rename in the case of filesystems) it makes no sense to try to fix them all. Much better to document it and make it official. This is what Microsoft did with a lot of "internal" functions in MS-DOS 5 (and it was praised for it, not ostracized), this is what HTML5 is doing, and this is what Linux filesystems should do.
Was it a good idea to depend on said atomicity? Maybe, maybe not. But the time to fix these problems has come and gone - today it's much better to extend the spec.
This is the same thing HTML5 is doing
Posted Mar 15, 2009 23:37 UTC (Sun) by bojan (subscriber, #14302) [Link]
Time to fix these problems using the existing API is now, because right now we have the attention of everyone on how to use the API properly. To the credit of some in this discussion, bugs are already being fixed in Gnome (as I already mentioned in another comment). I also have bugs to fix in my own code - there is no denying that :-(
In general, I agree with you on extending the spec. But, before the spec gets extended officially, we need to make sure that _every_ POSIX compliant file system implements it that way. Otherwise, apps depending on this new spec will not be reliable until that's the case. So, can we actually make sure that's the case? I very much doubt it. There are a lot of different systems out there implementing POSIX, some of them very old. Auditing all of them and then fixing them may be harder than fixing the applications.
Why do we need such blessing?
Posted Mar 16, 2009 0:05 UTC (Mon) by khim (subscriber, #9252) [Link]
Linux extends POSIX all the time. New syscalls, new features (things like "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait."), etc. If application wants to use such "extended feature" - it can do this, if not - it can use POSIX-approved features only.
As for old POSIX systems... it's up to application writers again. And you can be pretty sure A LOT OF them don't give a damn about POSIX compliance. They are starting to consider Linux as a third platform for their products (the first two being, obviously, Windows and MacOS in that order), but if you try to talk to them about POSIX it'll just lead to the removal of Linux from the list of supported platforms. Supporting many distributions is already hard enough, support for some exotic filesystems is "we'll think about it but don't hold your breath...", and support for old exotic POSIX systems... fuggetaboudit!
Now the interesting question is: do we welcome such selfish developers or not? This is a hard question, because the answer "no, they should play by our rules" will just lead to an exodus of users - because they need these applications, and WINE is not a good long-term solution...
Atomicity vs durability
Posted Mar 15, 2009 22:05 UTC (Sun) by dcoutts (subscriber, #5387) [Link]
As for a new API, yes, that'd be great. There are doubtless other situations where it would be useful to be able to constrain write re-ordering. For example for writes within a single file if we're implementing a persistent tree structure where the ordering is important to provide atomicity in the face of system failure.
Having a nice new API does not mean that the obvious cases that app writers have been using for ages are wrong. We should just insert the obvious write barriers in those cases.
Atomicity vs durability
Posted Mar 16, 2009 4:52 UTC (Mon) by dlang (guest, #313) [Link]
so everything that you are screaming that the OS should guarantee can be broken by the hardware after the OS has done its best.
you can buy/configure your hardware to not behave this way, but it costs a bunch (either in money or in performance). similarly you can configure your filesystem to give you added protection, at a significant added cost in performance.
Atomicity vs durability
Posted Mar 16, 2009 11:00 UTC (Mon) by forthy (guest, #1525) [Link]
Any reasonable hard disk (SATA, SCSI) has write barriers which allow file system implementers to actually implement atomicity.
Atomicity vs durability
Posted Mar 15, 2009 23:51 UTC (Sun) by vonbrand (guest, #4458) [Link]
I just don't understand all this "extN isn't crash-proof" whining... Yes, Linux systems do crash on occasion. It is thankfully very rare. Yes, hardware does fail. Even disks do fail. Yes, if you are unlucky you will lose data. Yes, the system could fail horribly and scribble all over the disk. Yes, the operating system could mess up its internal (and external) data structures.
It is just completely impossible for the operating system to "do the right thing with respect to whatever data the user values more", even more so in the face of random failures. Want performance? Then you have to do tricks like caching/buffering data; disks are horribly _s_l_o_w_ when compared to your processor or memory.
Asking Linux developers to create some Linux-only beast of a filesystem in order to make application developers' lives easier doesn't cut it; there are other operating systems (and Linux systems with other filesystems) around, and always will be. Plus, asking for a filesystem that is impossible in principle won't get you too far either.
Atomicity vs durability
Posted Mar 16, 2009 0:08 UTC (Mon) by man_ls (guest, #15091) [Link]
Yes, isn't it silly to ask for the moon like this? Apart from the fact that ext3 does exactly what we are asking for; and XFS since 2007; and now ext4 with the new patches. Oh wait... maybe you really didn't understand what we were asking for.
Listen, the sky might fall on our heads tomorrow and eventually we are all to die, we understand that. But until then we really want our filesystems to do atomic renames in the face of a crash (i.e. what the rest of the world [except POSIX] understands as "atomic"). Not durable, not crash-proof, not magically indestructible -- just all-or-nothing. Atomic.
YMMV
Posted Mar 16, 2009 0:26 UTC (Mon) by khim (subscriber, #9252) [Link]
Yes, Linux systems do crash on occasion. It is thankfully very rare.
Depends on what hardware and what kind of drivers you have.
Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.
The problem is: a fast filesystem is useless if it can't keep my data safe. Microsoft knows this - that's why you don't need to explicitly unmount a flash drive there. Yes, the cost is huge: it means the flash wears down faster and the speed is horrible - but anything else is unacceptable. Oh, and I know A LOT OF users who just turn off the computer at the end of the day. This problem is somewhat mitigated by the design of current systems (the "power off" button is actually a "shutdown" button), but people find ways to cope: they just cut power to the whole desk.
The same thing applies to developers. They are lazy. Most application writers do not use fsync and do not check the error code from close. Yet if data is lost, the OS will be blamed. Is it fair to OS and FS developers? Not at all! Can it be changed? Nope. Life is unfair - deal with it.
The whining started when it was found that the new filesystem can lose valuable data in a way that ext3 never does (ext3 can do this with O_TRUNC, but not with rename). This is a pretty serious regression to most people. The approach "let's fix thousands upon thousands of applications" (including proprietary ones) was thankfully rejected. This is a good sign: it means Linux is almost ready to be usable by normal people. The last time such a problem happened (the OSS->ALSA switch) the offered solution was beyond the pale.
Atomicity vs durability
Posted Apr 8, 2009 15:30 UTC (Wed) by pgoetz (subscriber, #4931) [Link]
Atomicity vs durability
Posted Mar 15, 2009 13:14 UTC (Sun) by bojan (subscriber, #14302) [Link]
As long as processes running on the system can see a consistent picture (and they can), according to POSIX it is.
> not only the filename must be valid but the contents of the file as well
From the point of view of any process running on the system, it is.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 5:45 UTC (Sat) by Nick (guest, #15060) [Link]
> "the C standard allows this". This is not
> what standard documents are about.
To everyone whining about this behaviour because "POSIX allows it": it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow). POSIX in this case *IS* basically ratifying existing practices.
fsync() is something that you have had to do on many OSes and many filesystems for a long time in order to get correct semantics. Programmers who come along a decade or two later and decide to ignore basics like that, because it mostly appears to work on one filesystem on one OS, and then complain when a new filesystem or OS breaks their app, really don't have a high horse on which to sit... a mini pony at best.
It's quite fair to hope for a best effort, but it is also fair to make an appropriate choice about the performance tradeoff so that well written applications don't get horribly penalised.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 13:48 UTC (Fri) by anton (subscriber, #25547) [Link]
> [...] it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow)
That's actually a very good example: ext4 performs the writes in a different order than what corresponds to the operations in the process(es), resulting in an on-disk state that never was the logical state of the file system at any point in time. One difference is that file systems have been crash-vulnerable ("crazy") for a long time, so in a variation of the Stockholm syndrome a number of people now insist that that's the right thing.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:25 UTC (Fri) by droundy (subscriber, #4559) [Link]
> Similarly, all FSs should try hard to make sure open-write-close-rename leaves you with either the new or the old file. Anything else isn't reasonable.
Absolutely. The problem with the argument that it must be open-write-fsync-close-rename is that that describes a *different* goal, which is to ensure that the new file is actually saved. When an application doesn't particularly care whether the new or old file is present in case of a crash, it'd be nice to allow them to ask only for the ordinary POSIX guarantees of atomicity, without treating system crashes as a special exception.
And on top of that, I'd prefer to think of all those Ubuntu gamers as providing the valuable service of searching for race conditions relating to system crashes. Personally, I prefer not to run nvidia drivers, but I'd like to use a file system that has been thoroughly tested by hard rebooting the computer, so that on the rare power outages, I'll not be likely to lose data. It'd be even nicer if lots of people were to stress-test their ext4 file systems by simply repeatedly hitting the hard reset button, but it's hard to see how to motivate people to do this.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:26 UTC (Fri) by oak (guest, #2786) [Link]
> ... systems by simply repeatedly hitting the hard reset button, but it's hard to see how to motivate people to do this.
If you have very small children, they can do it for you. When you least want/expect it...
Anyway, it's not so hard to build an automated system for this, and you can also buy one. All the file system maintainers I know do this kind of testing.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 18:03 UTC (Sat) by dkite (guest, #4577) [Link]
experience.
Derek
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 23, 2009 5:02 UTC (Mon) by dedekind (guest, #32521) [Link]
$ man 2 write
...
"A successful return from write() does not make any guarantee that data has been committed to disk."
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:57 UTC (Fri) by sbergman27 (guest, #10767) [Link]
I'm shocked and dismayed that a developer of an ext# filesystem can be so cavalier regarding a data integrity issue. This attitude would *never* have been taken by a dev during this period in ext3's life-cycle.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:09 UTC (Fri) by johnkarp (guest, #39285) [Link]
The only thing that surprised me was that writing to a second file and then renaming it to the first wasn't fully sufficient.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:15 UTC (Fri) by dcoutts (subscriber, #5387) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:06 UTC (Fri) by jgg (subscriber, #55211) [Link]
This whole discussion has really not been focused much on what actually are sane behaviors for a FS to have across an unclean shutdown. To date most writings by FS designers I've read seem to focus entirely on avoiding FSCK and avoiding FS meta-data inconsistencies. Very few people seem to talk about things like what the application sees/wants.
One of the commenters on the blog had the best point: insisting on adding fsync before every close/rename sequence (either implicitly in the kernel, as has been done, or explicitly in all apps) is going to badly harm performance. 99% of these cases do not need the data on the disk, just the write/close/rename order preserved.
Getting great performance by providing weak guarantees is one thing, but then insisting that everyone who cares about their data use explicit calls that provide a much stronger and slower guarantee is kinda crazy. Just because POSIX is silent on this matter doesn't mean FS designers should get a free pass on transactional behaviours that are so weak they are useless.
For instance, under the same POSIX arguments Ted is making, it would be perfectly legitimate for a write/fsync/close/rename to still erase both files because you didn't do an fsync on the directory! Down this path lies madness - at some point the FS has to preserve order!
I wonder how bad a hit performance sensitive apps like rsync will get due to the flushing on rename patches?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:16 UTC (Fri) by endecotp (guest, #36428) [Link]
Yes, I was just thinking the same thing! Come on Ted, what exactly do you want us to write to be portably safe? I have just added an fsync() to my write() close() rename() code, but I checked man fsync first and it tells me that I need to fsync the directory. So is it:
open()
write()
fsync()
close()
rename()
opendir()
fsync()
closedir()
? Or some re-ordering of that? Is there more? Do I have to fsync() the directories up to the root? Can I avoid all this if I call sync()?
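For what it's worth, a minimal C sketch of the sequence being asked about, assuming (as Ted's second blog post, mentioned further down, confirms) that the directory does need its own fsync() if you want the rename itself to be durable. Note that fsync() takes a file descriptor, so the directory is opened with open() rather than opendir(); the names here are placeholders and the error handling is minimal.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
/* Write new contents to tmp (in the same directory as dst), fsync it,
   rename it over dst, then fsync the directory so the rename is durable. */
int replace_file(const char *dir, const char *tmp, const char *dst,
                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) { close(fd); return -1; }
    if (close(fd) < 0) return -1;
    if (rename(tmp, dst) < 0) return -1;       /* atomic swap of the name */
    int dfd = open(dir, O_RDONLY);             /* fsync() wants an fd, not a DIR* */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);                       /* make the directory update durable */
    close(dfd);
    return rc;
}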
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:17 UTC (Fri) by alexl (subscriber, #19068) [Link]
If the metadata is not written out but the data is, and then things crash, you will just have the old file as it was, plus either a written file+inode with no name (moved to lost+found) or the written file under the temporary name.
As far as I can see, syncing the directory is not needed. (Unless you want to guarantee the file being on disk, rather than just not breaking the atomic file-replace behaviour.)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:41 UTC (Fri) by masoncl (subscriber, #47138) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:03 UTC (Fri) by endecotp (guest, #36428) [Link]
Thinking a bit more about this from the "application requirements" point of view, I can see three cases:
1- The change needs to be atomic wrt other processes running concurrently.
2- The change needs to be atomic if this process terminates (ctrl-C, OOM).
3- The change needs to be atomic if the system crashes.
I can't think of a scenario where the application author would reasonably say, "I need this data to be safe in cases 1 and 2 but I don't care about 3." Can anyone else?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 22:45 UTC (Fri) by jgg (subscriber, #55211) [Link]
It isn't that uncommon, really: any time you want to protect against a failing program without messing up its output file, rename is the best way. For instance, programs that are used with make should be careful not to leave garbage around if they crash. rsync and other downloading software do the rename trick too, for the same basic reasons. None of these uses require fsync or other performance-sucking things.
The reason things like emacs and vim are so careful is that they are almost always handling critical data. I don't think anyone would advocate that rsync should use fsync.
The considerable variations in what FSs do is also why, as an admin, I have a habit of knocking off a quick 'sync' command after finishing some adminy task just to be certain :)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 10:37 UTC (Mon) by endecotp (guest, #36428) [Link]
Ted seems to have answered this in his second blog post: YES you DO need to fsync the directory if you want to be certain that the metadata has been saved.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:28 UTC (Fri) by amk (subscriber, #19) [Link]
Why is this behaviour broken? It's perfectly normal behaviour...
Posted Mar 14, 2009 11:10 UTC (Sat) by khim (subscriber, #9252) [Link]
Take a P2P client. A good P2P client will keep information about peers for each file; this way, if the system is rebooted, the lengthy process of finding peers can be avoided. Since there are hundreds (sometimes thousands) of peers, this means hundreds of files are rewritten every minute or so. If a filesystem cannot provide guarantees without fsync, I just refuse to use it. XFS went this way: the XFS developers long argued for their right to destroy files on a crash, and we've all agreed that they can do this, and I can answer the question "What do you think about XFS?" with just "Don't use it. Ever." And everyone was happy.
Looks like tytso actually fixed the problem in ext4 (even if the actual words were akin to "application developers are crazy and this is incorrect usage, but we cannot replace all of them"), so at least I can conclude he's saner than the XFS developers...
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 12:32 UTC (Sat) by nix (subscriber, #2304) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 14:22 UTC (Fri) by anton (subscriber, #25547) [Link]
> I'm shocked and dismayed that a developer of an ext# filesystem can be so cavalier regarding a data integrity issue. This attitude would *never* have been taken by a dev during this period in ext3's life-cycle.
The more recent ext3 developers have a pretty cavalier attitude to data integrity: for some time the data=journal mode (which should provide the highest integrity) was broken. Also, ext3 disables using barriers by default, essentially eliminating all the robustness that adding a journal gives you (and without getting any warning that your file system is corrupted).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:11 UTC (Fri) by dcoutts (subscriber, #5387) [Link]
There's a very low performance penalty for using a write barrier. All modern disks support it without having to issue a full flush.
App authors are not demanding that the file data make it to the disk immediately. They're demanding that the file update is atomic. It should preserve the old content or the new, but never give us a zero-length file.
Can this be that difficult? We do not need a total ordering on all file system requests. We just need it for certain meta-data and content data writes.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:25 UTC (Fri) by forthy (guest, #1525) [Link]
We don't even need write barriers. The file system can update lots of data and metadata in one go. But it should keep the consistency POSIX promises: all file system operations are performed in order. This is actually a POSIX promise; it just doesn't hold for crashes (because crashes are not specified). I.e. if delayed updates are used, they should all be delayed together and then done in an atomic way - either complete them or roll them back to the previous state. This is actually not difficult.
Btrfs does this; Ted Ts'o doesn't seem to get it. Many file system designers don't get it: they are anal about their metadata, and don't care at all about user data. Unix file systems have lost data in that way since the invention of synchronous metadata updates (prior to that, they also lost metadata ;-).
IMHO there is absolutely nothing wrong with the create-write-close-rename way to replace a file. As application writers, we have to rely on our OS somehow. If we can't, we'd better write it ourselves. When the file system designers don't get it, and are anal about some vague spec, fsck them.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 17:04 UTC (Fri) by jwb (guest, #15467) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:25 UTC (Fri) by alexl (subscriber, #19068) [Link]
fd1=open(dirname(file))
fd2=openat(fd1, NULL, O_CREAT) // Creates an unlinked file
write(fd2)
flinkat(fd1, fd2, basename(file)) // Should guarantee fd2 is written to disk before linking.
close(fd2)
close(fd1)
This seems race free:
doesn't break if the directory is moved during write
doesn't let other apps see or modify the temp file while writing
doesn't leave a broken tempfile around on app crash
doesn't end up with an empty file on system crash
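As an aside, later Linux kernels grew something close to this pattern: O_TMPFILE (Linux 3.11 and newer) creates an unlinked file inside a directory, and linkat() can give it a name afterwards via the /proc/self/fd trick documented in open(2). A minimal sketch under those assumptions; note that, unlike the hypothetical flinkat() above, linkat() promises no ordering, so an fsync() is still needed if data-before-link matters:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
/* Create an anonymous file in dir, write it, then give it a name. */
int write_then_link(const char *dir, const char *name, const void *buf, size_t len)
{
    char proc[64];
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0) return -1;
    int fd = openat(dfd, ".", O_TMPFILE | O_WRONLY, 0644);  /* unlinked file in dir */
    if (fd < 0) { close(dfd); return -1; }
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) { close(fd); close(dfd); return -1; }
    snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
    /* linkat() refuses to replace an existing name; link to a temporary name
       and rename() over the target if atomic replacement is wanted. */
    int rc = linkat(AT_FDCWD, proc, dfd, name, AT_SYMLINK_FOLLOW);
    close(fd);
    close(dfd);
    return rc;
}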
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 1:41 UTC (Sat) by dcoutts (subscriber, #5387) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 23:25 UTC (Sun) by halfline (guest, #31920) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 21, 2009 0:34 UTC (Sat) by spitzak (guest, #4593) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 1:35 UTC (Sat) by dcoutts (subscriber, #5387) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 15:59 UTC (Sat) by Pc5Y9sbv (guest, #41328) [Link]
So, is there an appropriate set of heuristics to infer write barriers sometimes but not others? The specific case in this discussion would be something like "insert a write barrier after file content operations requested before metadata operations affecting linkage of that file's inode"? Is this sufficient and defensible?
Ideally, we should have POSIX write-barriers that can be applied to a set of open file and directory handles, and use them to get the proper level of ordering across crashes. The fsync solution is far too blunt an instrument to provide the transactionality that everyone is looking for when they relink newly created files into place.
But then what about all those shell scripts out there which do "foo > file.tmp && mv file.tmp file"? We would need a new write-barrier operation applicable from the shell script (somehow selecting partial ordering of requests issued from separate processes), or a heuristic write-barrier as above...
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 17:32 UTC (Sat) by dcoutts (subscriber, #5387) [Link]
Certainly, since rename is supposed to be atomic and is used in this common idiom, it should have a write barrier wrt other operations on the same file. I don't think we should demand barriers between write operations within the same file or between different files. As you say, it would be useful to be able to add explicit barriers sometimes, just as we can for CPU operations on memory.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 7:57 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]
I think I agree now that it would be sensible to infer a write barrier between file content requests and inode linking requests for the same inode. This would cover a large percentage of "making data available" scenarios.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 14:52 UTC (Fri) by anton (subscriber, #25547) [Link]
> To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process.
You have the right idea about the proper behaviour, but there is more freedom in implementing it: you can commit a batch of operations at the same time (e.g., after the 60s flush interval), and you need only a few barriers for each batch: essentially one barrier between writing everything but the commit block and writing the commit block, and another barrier between writing the commit block and writing the free-blocks information for the blocks that were freed by the commit (and if you delay the freeing long enough, you can combine the latter barrier with the former barrier of the next cycle).
This can be done relatively easily on a copy-on-write file system. For an update-in-place file system, you probably need more barriers or write more stuff in the journal (including data that's written to pre-existing blocks).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 9:40 UTC (Sun) by k8to (subscriber, #15413) [Link]
> Note that fclose() only flushes the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
It seems fclose() doesn't imply that the data gets written out.
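In other words, with stdio there are two layers of buffering to flush. A minimal sketch of a "really save this stream" helper, purely for illustration:
#include <stdio.h>
#include <unistd.h>
/* Flush the stdio buffer, then ask the kernel to push its buffers to storage. */
int save_stream(FILE *fp)
{
    if (fflush(fp) != 0) return -1;        /* C library buffer -> kernel */
    if (fsync(fileno(fp)) != 0) return -1; /* kernel buffer -> disk (as far as the drive admits) */
    return fclose(fp);
}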
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 14:40 UTC (Fri) by anton (subscriber, #25547) [Link]
Write barriers or something equivalent (properly-used tagged commands, write cache flushes, or disabling write-back caches) are needed for any file system that wants to provide any consistency guarantee. Otherwise the disk drive can delay writing one block indefinitely while writing out others that are much later logically.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 17:52 UTC (Fri) by iabervon (subscriber, #722) [Link]
The case where you want fsync() is when you do something like: file A names a file, B, which exists and has valid data. You create file C, put valid data in it, and atomically change file A to name file C instead of B, and you want to be sure that file A always names a file which exists and has valid data. You can't be sure, without an fsync(), that the disk won't see the change to file A without seeing some operation on file C.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:41 UTC (Fri) by pr1268 (subscriber, #24648) [Link]
> Linus thinks that rename() without fsync() is safe
"Is safe", or _SHOULD_BE_ safe? Far be it from Linus to be naïve about these things, but then again I'm certain he's been following this discussion closely for the past few days.
I do know that Linus' big soap box is about programming abstraction, and he'd certainly take the side that open/write/close/rename (in that order) should do exactly that, without any mysterious data loss.
My own "off-the-cuff" perception is that fsync(2)/fdatasync(2) are "band-aids" to address POSIX's lack of specification in this matter. One suggestion is that close(2) should implicitly include an fsync() call, and that programmers should be taught that open() and close() are expensive and best used judiciously.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:51 UTC (Fri) by iabervon (subscriber, #722) [Link]
I think fsync() makes sense to have. If the system stops running, there are some things that would have happened, except that the system stopped running first. Furthermore, when the whole system stops running, it becomes difficult to know what things had happened and which had not happened. Furthermore, it's too inefficient to serialize everything, particularly for a multi-process system. Falling back to the concurrency model, you can say that the filesystem after an emergency restart should be in some state that could have been seen by a process that was running before the restart. But there needs to be a further restriction, so that you know that the system won't go back to the blank filesystem that you had before installing anything; so fsync() makes sense as a requirement that the filesystem after a restart will be some state that could have been seen after the last fsync() that returned successfully.
(Of course, any time the system crashes, you might lose some arbitrary data, since the system has crashed; but a better system will lose less or be less likely to lose things. This is qualitatively different from the perfectly reasonable habit of ext4 of deciding that the post-restart state is right after every truncate.)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 0:47 UTC (Sat) by njs (guest, #40338) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 12:40 UTC (Sat) by nix (subscriber, #2304) [Link]
If even *coreutils*, written by some of the most insanely
portability-minded Unix hackers on the planet, doesn't do this
fsync()-source-and-target-directories thing, it's safe to say that, to a
first approximation, nobody ever does it.
The standard here is outvoted by reality.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 14:20 UTC (Sat) by endecotp (guest, #36428) [Link]
- Since mv doesn't fsync and mv is expected to leave things in a sane state after a crash, the kernel must be expected to "do the right thing" wrt rename().
OR
- Since mv doesn't sync, mv is not guaranteed to leave things in a sane state after a crash; if you thought that it was guaranteed to do so you were wrong.
Both
Posted Mar 14, 2009 15:25 UTC (Sat) by man_ls (guest, #15091) [Link]
What I read from this very interesting discussion is that both assumptions are right depending on the circumstances. An inherently unsafe fs like ext2 is not expected to guarantee anything, and mv on ext2 may be left in an unstable state after a crash (including zero-length files). Coreutils developers probably did not see fit to fsync since it would not increase the robustness significantly in these cases: the system might crash in the middle of the fsync anyway. But on a journalled fs like ext3, users will expect their system to be robust in the event of a crash -- and as the XFS debacle shows, not only for metadata. Both are POSIX-compliant; only ext3 is held to higher standards than ext2. What this means for ext4 is obvious.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 17:56 UTC (Fri) by MisterIO (guest, #36192) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 23:34 UTC (Fri) by sbergman27 (guest, #10767) [Link]
Well, after I saw that with nodelalloc "the performance will (still) be better than ext3", I immediately made that my default mount option.
"""
On the other hand, you are now on the less well tested path. I remember an ext3 bug a few years ago that caused data loss... but only for those people who mounted "data=journal" just to be safe. I'm always a bit nervous about straying from defaults.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:44 UTC (Fri) by tialaramex (subscriber, #21167) [Link]
If you're in that camp, you need to get out and start campaigning for programmers to fallocate() more, because without that you're losing a lot of performance to ensure your ordering constraint. With fallocate() the allocation step can be brought forward and the performance loss avoided. At the very least, file downloaders (e.g. curl, or in Firefox) and basic utilities like /bin/cp and the built-in copy of modern mv implementations for crossing filesystems need to fallocate(), or you'll fragment just as badly as in ext3 and perhaps worse (since now the maintainers assume you have delayed allocation protecting you).
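A rough sketch of what that looks like for a utility that knows the final size up front; posix_fallocate() is the portable wrapper, and the path and size here are placeholders:
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
/* Reserve space for a file of known size before streaming data into it,
   so the allocator can pick a contiguous extent instead of many fragments. */
int create_preallocated(const char *path, off_t expected_size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, expected_size) != 0) {
        close(fd);
        return -1;
    }
    return fd;   /* the caller streams the downloaded data into fd */
}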
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 22:24 UTC (Fri) by MisterIO (guest, #36192) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 1:10 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
2. Run defragmenter on file to collect fragments into contiguous areas
is crazy. It's so crazy that ext4's default behaviour waits as long as possible to allocate in order to avoid this scenario, and causes this "bug" that got Ubuntu testers in such a tizzy. The online defragmenter, if and when it arrives in mainline, is a workaround, not a fix; you don't want to make it part of your daily routine, so most likely what you'll actually do is live with the reduced performance, all so that some utility developers can avoid writing a few lines of code.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 1:40 UTC (Sat) by MisterIO (guest, #36192) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 4:19 UTC (Sat) by nlucas (subscriber, #33793) [Link]
fallocate fdatasync sync_file_range
Posted Mar 14, 2009 8:30 UTC (Sat) by stephen_pollei (guest, #23348) [Link]
Yes, fallocate is a good thing for a programmer to know. tytso has mentioned that sqlite most likely should have used fdatasync and fallocate. He also mentioned that fsync wouldn't have really been a problem even in ext3 with data=ordered mode if it was called in a thread. I also think sync_file_range() and fiemap and msync() are good things to know about. I can kind of see how something like mincore() that returned more information than what's in the page tables might be nice, so you could check whether a page you scheduled for writeout is still dirty or not.
I don't think any of these things would help the case of many small text files being replaced by a rename, though -- you need an fsync() to flush the metadata of the file size increasing, I assume.
a) open and read file ~/.kde/foo/bar/baz
b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
d) sync_file_range or msync to schedule but not wait for stuff to hit disk --- this is optional
e) close(fd)
f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~")
g) wait for the stuff to hit the disk somehow
h) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
I think a lot of the time, not being in such a rush to clobber the old data and instead keeping both around for a while might work just fine. Heck, keep a few versions around to roll back to and lazily garbage-collect when you can see that things are more stale. I could be totally wrong though -- just brain-storming.
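For the curious, steps (b) through (e) with the optional hint in (d) might look roughly like the sketch below. sync_file_range() only starts writeback and flushes no metadata, so it is a hint rather than a durability guarantee; the path is a placeholder:
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
/* Write the new contents and ask the kernel to start writeback without blocking. */
int write_new_version(const char *tmppath, const void *buf, size_t len)
{
    int fd = open(tmppath, O_WRONLY | O_TRUNC | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    /* schedule, but don't wait for, writeback of the dirty pages */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    return close(fd);
}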
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 15:04 UTC (Fri) by anton (subscriber, #25547) [Link]
> If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires.
By what definition? It's perfectly possible to delay allocation (and writing) as long as desired, and also delay all the other operations that are supposed to happen afterwards (in particular the rename of the same file) until after the allocation and writing have happened. Delayed allocation is a red herring here; the problem is the ordering of the writes.
LinLogFS, which had the goal of providing good crash recovery, did implement delayed allocation.
Where did the correctness go?
Posted Mar 13, 2009 23:34 UTC (Fri) by bojan (subscriber, #14302) [Link]
On the other hand, nobody is complaining about this:
> On some rather common Linux configurations, especially using the ext3 filesystem in the data=ordered mode, calling fsync doesn't just flush out the data for the file it's called on, but rather all the buffered data for that filesystem.
That seems like the real problem to me. If I ask for fsync on _my_ file, why on earth does the file system flush the lot out? Shouldn't _that_ be fixed instead?
Not to mention the problem of indiscriminately writing out hundreds of configuration files, when nobody actually changed anything in them.
Where did the correctness go?
Posted Mar 14, 2009 1:34 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
My understanding of the problem is that it looks something like:
You, the application programmer delete file A, and create files B, C and D. You are writing to file B and call fsync(). I, the filesystem driver must now ensure that the data & metadata for B are on disk (survive a reboot etc.) before returning from fsync() or you will be very unhappy.
So I must flush B's data blocks. To do that I must allocate for them, and it so happens I use some blocks free'd by deleting A. So now I need to commit A's metadata update (deleting it) or else things may go tits up. Unfortunately that involves touching a directory, which mentions C and D. So now I need to commit C and D's metadata. And if I do that without pushing C and D's data to disk I am disobeying the data=ordered setting. So I will push C and D to disk.
Figuring out that I need to do all this is expensive, whereas just shrugging and pushing everything to disk is cheap, yet even if I do all the hard figuring I may still have to push everything to disk and now I've also wasted a lot of CPU figuring it out. So you will understand that nobody is anxious to take on the thankless (see the feedback to Tytso from Ubuntu users) yet very difficult task of trying to do better.
Feel free to volunteer though :)
Where did the correctness go?
Posted Mar 14, 2009 2:10 UTC (Sat) by bojan (subscriber, #14302) [Link]
I just switch to ext4 or some other FS that does this properly instead ;-)
Where did the correctness go?
Posted Mar 14, 2009 1:38 UTC (Sat) by foom (subscriber, #14868) [Link]
I don't see anyone attacking Ted. I see people arguing against the idea that zeroing out files is a
good quality for a filesystem to have.
> But, when it comes to the behaviour of one file system in one of its modes which was
> masking incorrect usage of the API, we quickly revert to screaming bloody murder and
> asking for more hand holding.
So perhaps ext5 should erase the entire directory if it's had any entries added or removed since
the last time someone called fsync on the directory fd? Or how about *never* writing any data to
disk unless you've called fsync on the file, its parent directory, and all parent directories up to
the root?
> That seems like the real problem to me. If I ask for fsync on _my_ file, why on earth
> does the file system flush the lot out? Shouldn't _that_ be fixed instead?
Yes, it would be nice, but it's a performance issue, not a data-loss issue.
Presumably at some point someone will figure out how to make a filesystem such that it can
avoid writing out metadata updates related to data which is not yet on disk, without actually
forcing out unrelated data just because you need to write out a metadata update in a different
part of the filesystem.
Where did the correctness go?
Posted Mar 14, 2009 2:15 UTC (Sat) by bojan (subscriber, #14302) [Link]
So, calling Ted's reasonable analysis of the situation a rant is not attacking?
> So perhaps ext5 should erase the entire directory if it's had any entries added or removed since the last time someone called fsync on the directory fd?
And the directory has been truncated by someone? People truncate their files in this scenario _explicitly_, don't commit the data and then expect it to be there. Well, if you don't make it durable, it isn't going to be durable, just atomic.
> Yes, it would be nice, but it's a performance issue, not a data-loss issue.
It's the issue that is causing proper programming practice of fsyncing to be abandoned, because it makes the machine unusable. See the FF issue.
Where did the correctness go?
Posted Mar 14, 2009 2:22 UTC (Sat) by bojan (subscriber, #14302) [Link]
Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and _not_ committing subsequent changes. That's what's zeroing out the files, not the file system.
Where did the correctness go?
Posted Mar 14, 2009 4:14 UTC (Sat) by foom (subscriber, #14868) [Link]
> Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and _not_ committing subsequent changes. That's what's zeroing out the files, not the file system.
That's not the only scenario. There's the one involving rename... You open a *new* file, write to
it, close it, and rename it over an existing file. Then the filesystem commits the metadata change
(that is: the unlinking of the file with data in it, and the creation of the new empty file with the
same name), but *not* the data written to the new file.
No explicit truncation.
Now, there is also the scenario involving truncation. I expect everybody agrees that if you
truncate a file and then later overwrite it, there's a chance that the empty version of the file
might end up on disk. The thing that's undesirable about what ext4 was doing in *that* scenario
is that it essentially eagerly committed to disk the truncation, but lazily committed the new data.
Thus making it *much* more likely that you end up with the truncated file than you'd expect
given that the application called write() with the new data a few milliseconds after truncating the
file.
Where did the correctness go?
Posted Mar 14, 2009 6:23 UTC (Sat) by bojan (subscriber, #14302) [Link]
Where did the correctness go?
Posted Mar 14, 2009 8:22 UTC (Sat) by alexl (subscriber, #19068) [Link]
fsync is just the only way to work around this in the POSIX API, but it is much heavier and gives far more guarantees than we want.
Where did the correctness go?
Posted Mar 14, 2009 8:30 UTC (Sat) by bojan (subscriber, #14302) [Link]
If you rename such a file to an existing file that contained data, you may legitimately end up with an empty file on crash.
If you want the data to be in the new file that will be renamed into the old file, you have to call fsync on the new file. Then you atomically rename.
This is what emacs and other safe programs already do. From https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug...
What emacs (and very sophisticated, careful application writers) will do is this:
3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
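Spelled out in C, that emacs-style sequence looks roughly like the sketch below; paths are placeholders and error handling is minimal. (As pointed out further down the thread, step 3.f is better done with link() than rename(), to avoid a window in which no file carries the original name.)
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
/* 3.b-3.g: write the new version, fsync it (checking the result), keep a
   backup of the old version, then atomically rename the new one into place. */
int emacs_style_save(const char *path, const char *newpath, const char *bakpath,
                     const void *buf, size_t len)
{
    int fd = open(newpath, O_WRONLY | O_TRUNC | O_CREAT, 0644);           /* 3.b */
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }    /* 3.c */
    if (fsync(fd) < 0) { close(fd); return -1; }                          /* 3.d */
    if (close(fd) < 0) return -1;                                         /* 3.e */
    rename(path, bakpath);    /* 3.f, optional; link() would avoid the no-"path" window */
    return rename(newpath, path);                                         /* 3.g */
}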
Where did the correctness go?
Posted Mar 14, 2009 12:03 UTC (Sat) by alexl (subscriber, #19068) [Link]
Now, what ext4 does is clearly correct according to what is "allowed" by POSIX (actually, this is kinda vague as POSIX allows fsync() to be empty, and doesn't actually specify anything about system crashes.)
However, even if it's "posixly correct", it is imho broken. In the sense that I wouldn't let any of my data near such a filesystem, and I would recommend everyone who asks me to not use it.
Take for example this command:
sed -i s/user1/user2/ foo.conf
This does an in-place update using write-to-temp-and-rename-over, without fsync. The result of running this command is that if your machine locks up within the next minute or so, you lose both versions of foo.conf.
Now, is foo.conf important? How the heck is sed to know? Is sed broken? Should it fsync? That's more or less arguing that every app should fsync on close, which on ext4 is the same as the filesystem doing it, but on ext3 is unnecessary and a massive system slowdown.
Or should we try to avoid the performance implications of fsync (due to its guarantees being far more than what we need to solve our requirements)? We could do this by punting it to the users of sed, by having a -important-data argument, and then pushing this further out to any script that uses sed, etc, etc.
Or we could just rely on filesystems to guarantee this common behaviour to work, even if it's not specified by POSIX (and choose not to use filesystems that don't give us that guarantee, as so many people did when they switched away from XFS after data losses).
Ideally of course there would be another syscall, flag or whatever that says "don't write metadata before data is written". That way we could get both efficient and correct apps, but that doesn't exist today.
Where did the correctness go?
Posted Mar 14, 2009 21:20 UTC (Sat) by bojan (subscriber, #14302) [Link]
Look, this may as well be true, but the fact is that all of us that are creating applications have one thing to rely on - documentation. And the documentation says what it says.
Where did the correctness go?
Posted Mar 16, 2009 12:00 UTC (Mon) by nye (guest, #51576) [Link]
Where did the correctness go?
Posted Mar 14, 2009 13:32 UTC (Sat) by nix (subscriber, #2304) [Link]
no baz, on crash.
Hardly ideal, but probably unavoidable.
Why doesn't someone add real DB-style transactions to at least one
filesystem, again? They'd be really useful...
Where did the correctness go?
Posted Mar 14, 2009 21:25 UTC (Sat) by bojan (subscriber, #14302) [Link]
Yep, very true.
But, no zero length file, which was the original problem. Essentially, you will get at least _something_.
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
Who knows, maybe we'll get proper API for that behaviour out of this discussion.
Where did the correctness go?
Posted Mar 14, 2009 21:39 UTC (Sat) by foom (subscriber, #14868) [Link]
There's no guarantee of that. A filesystem could simply erase itself upon unexpected poweroff/crash. *Anything* better than that is an implementation detail.
Where did the correctness go?
Posted Mar 15, 2009 1:58 UTC (Sun) by bojan (subscriber, #14302) [Link]
Where did the correctness go?
Posted Mar 15, 2009 2:42 UTC (Sun) by njs (guest, #40338) [Link]
> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.
Yeah, 3.f is supposed to say "link", not "rename". (Programming against POSIX correctly makes those Raymond Smullyan books seem like light reading. If only everything else weren't worse...)
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
The problem is that a filesystem has a bazillion mostly independent "transactions" going on all the time, and no way to tell which ones are actually dependent on each other. (Besides, app developers would just screw up their rollback-and-retry logic anyway...)
Completely off the wall solution: move to a plan9/capability-ish system where apps all live in little micro-worlds and can only see the stuff that is important to them (this is a good idea anyway), and then use these to infer safe but efficient transaction domain boundaries. (Anyone looking for a PhD project...?)
Transactions, ordering, rollback...
Posted Mar 15, 2009 8:09 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]
However, as we were discussing further up the page, a write-barrier is really all that is needed for the intuitive crash-proof behavior desired by everything doing the "create a temp file; relink to real name". An awful lot of the discussion seems to conflate request ordering with synchronous disk operations, when all we really desire is ordering constraints to flow through the entire filesystem and block layer to the physical medium.
All people want is for the POSIX ordering semantics of "file content writes" preceding "file name linkage" to be preserved across crashes. It is OK if the crash drops cached data and forgets the link, or the data and link, but not the data while preserving the link.
Where did the correctness go?
Posted Mar 15, 2009 12:26 UTC (Sun) by nix (subscriber, #2304) [Link]
transactions getting entangled with each other is dependencies *within the
fs metadata*. i.e. what you'd actually need to do is put off *all*
operations on fs metadata areas that may be shared with other transactions
until such time as the entire transaction is ready to commit. And that's a
huge change.
Where did the correctness go?
Posted Mar 16, 2009 3:13 UTC (Mon) by k8to (subscriber, #15413) [Link]
link the name to name~
rename the name.new to name
Yes, explicit transaction support in the filesystem would be great, though hammering out the api will probably be hairy.
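A sketch of that tail end, assuming the new contents already sit (fsync'ed) in name.new: link() keeps the old version reachable under both names, so there is no instant at which the original name is missing.
#include <stdio.h>
#include <unistd.h>
/* Keep the old version reachable as bak, then swing name over to the new file. */
int backup_and_replace(const char *name, const char *bak, const char *newname)
{
    unlink(bak);                          /* link() refuses to overwrite, so drop any stale backup */
    if (link(name, bak) < 0) return -1;   /* the old data now has two names */
    return rename(newname, name);         /* name now refers to the new data */
}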
Where did the correctness go?
Posted Mar 14, 2009 9:19 UTC (Sat) by bojan (subscriber, #14302) [Link]
This semantics (where data in the new file is magically committed) may or may not be a result of particular file system implementation. From the rename() man page:
> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
Nowhere does it specify what _data_ will be in either file, just that the file will be there. ext4 dutifully obeys that.
In short, what you are referring to as "traditional unix way" doesn't really exist. Proof: emacs code.
PS. Sure, it would be nice to have such "one shot" API. But, the current API isn't it.
Where did the correctness go?
Posted Mar 14, 2009 14:30 UTC (Sat) by endecotp (guest, #36428) [Link]
> just that the file will be there.
No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
Of course this is not true of crashes where POSIX doesn't say anything at all about what should happen. Behaviour after a crash is purely a QoI issue.
Where did the correctness go?
Posted Mar 14, 2009 21:18 UTC (Sat) by bojan (subscriber, #14302) [Link]
We are talking about data _on_ _disk_ here, not what the process may see (which may be buffers just written, as presented by the kernel). What is on disk is _durable_, which is what we are discussing here. For durable, you need fsync.
So rename() does not specify which data will be on disk, or when.
Where did the correctness go?
Posted Mar 15, 2009 12:44 UTC (Sun) by endecotp (guest, #36428) [Link]
So what this all boils down to is how close each filesystem implementation comes to "non-crash" behaviour after a crash, which is a quality-of-implementation choice for the filesystems.
As far as I can see, for portable code the best bet is to stick with the write-close-rename pattern. This is sufficient for atomic changes in the non-crash case. Adding fsync in there makes it safe in the crash case for some filesystems, but not all, and there are others where it was safe without it, and others where it has a performance penalty: it's far from a clear winner at the moment.
Where did the correctness go?
Posted Mar 15, 2009 21:24 UTC (Sun) by bojan (subscriber, #14302) [Link]
Hence, you need to have various #ifs and ifs() to figure out what works on your platform. See Mac OS X. fsync is just an example here. The point is that you must use _something_ to commit. Without that, POSIX does not guarantee anything beyond currently running processes seeing the same picture.
Where did the correctness go?
Posted Mar 16, 2009 4:49 UTC (Mon) by dlang (guest, #313) [Link]
Where did the correctness go?
Posted Mar 16, 2009 13:28 UTC (Mon) by jamesh (guest, #1159) [Link]
That is likely to restrict reorderings that won't break correctness guarantees though.
Where did the correctness go?
Posted Mar 16, 2009 3:19 UTC (Mon) by k8to (subscriber, #15413) [Link]
fsync explicitly says that when it returns success, the data has been handed to the storage system successfully.
It doesn't guarantee that that storage system has committed it in a durable way for all scenarios. That's another issue.
fsync does guarantee that the data has been handed to the storage medium, but makes no guarantees about the implementation of that storage medium.
Where did the correctness go?
Posted Mar 16, 2009 1:07 UTC (Mon) by vonbrand (guest, #4458) [Link]
Sorry, "opening a file for writing it from scratch" is truncating, quite explicitly.
Where did the correctness go?
Posted Mar 14, 2009 4:24 UTC (Sat) by flewellyn (subscriber, #5047) [Link]
Which has what to do with the filesystem itself? I mean, if you use O_TRUNC when you call open(), zeroing out the file is exactly what you are asking to happen. Doing this and then writing new data to the file without calling fsync() before closing it, is where the problem comes from.
People should not blame the filesystem for doing what they ask it to do.
Where did the correctness go?
Posted Mar 14, 2009 5:16 UTC (Sat) by foom (subscriber, #14868) [Link]
sequence causes the problem.
Where did the correctness go?
Posted Mar 14, 2009 5:33 UTC (Sat) by flewellyn (subscriber, #5047) [Link]
Where did the correctness go?
Posted Mar 15, 2009 1:43 UTC (Sun) by droundy (subscriber, #4559) [Link]
Where did the correctness go?
Posted Mar 14, 2009 6:21 UTC (Sat) by bojan (subscriber, #14302) [Link]
As for rename, your file gets renamed _atomically_ just fine. However, if you don't _commit_ your writes (make the changes durable, call fsync), the renamed file will have zero size.
This is not a file system problem, but an application problem.
Where did the correctness go?
Posted Mar 14, 2009 13:06 UTC (Sat) by nix (subscriber, #2304) [Link]
> Presumably at some point someone will figure out how to make a filesystem such that it can avoid writing out metadata updates related to data which is not yet on disk, without actually forcing out unrelated data just because you need to write out a metadata update in a different part of the filesystem.
The BSD people did. It's so difficult that nobody else has done it since.
Where did the correctness go?
Posted Mar 15, 2009 8:14 UTC (Sun) by Pc5Y9sbv (guest, #41328) [Link]
I know that contemporary disk controller protocols support write-barriers in their command streams. They were intended to make this sort of thing easy. You don't have to micro-manage the requests all the way to the platter, but just decorate them with the correct ordering relations when you issue the commands.
Write barriers
Posted Mar 17, 2009 1:43 UTC (Tue) by xoddam (subscriber, #2322) [Link]
Because the journal is already able to provide the guarantee of filesystem metadata consistency, it can be used in the same way to ensure an effective ordering between write() and rename().
Where did the correctness go?
Posted Mar 17, 2009 6:42 UTC (Tue) by butlerm (subscriber, #13312) [Link]
circles around contemporary filesystems for comparable synchronous and
asynchronous operations. Check out Gray and Reuter's Transaction Processing
book for details. The edition I have was published in 1993.
There are two basic problems here:
The first is that fsync is a ridiculously *expensive* way to get the needed
functionality. The second is that most filesystems cannot implement atomic
operations any other way (i.e. without forcing both the metadata and the
data and any other pending metadata changes to disk).
fsync is orders of magnitude more expensive than necessary for the case
under consideration. A properly designed filesystem (i.e. one with
metadata undo) can issue an atomic rename in microseconds. The only option
that POSIX provides can take hundreds if not thousands of milliseconds on a
busy filesystem.
Databases do *synchronous*, durable commits on busy systems in ten
milliseconds or less. Ten to twenty times faster than it takes
contemporary filesystems to do an fsync under comparable conditions.
Even that is still a hundred times more expensive than necessary, because
synchronous durability is not required here. Just atomicity. Nothing has
to hit the disk. No synchronous I/O overhead. Just metadata undo
capability.
Where did the correctness go?
Posted Mar 17, 2009 7:18 UTC (Tue) by dlang (guest, #313) [Link]
they use f(data)sync calls to the filesystem.
so your assertion that databases can make atomic changes to their data faster than the filesystem can do an fsync means that either you don't know what you are saying, or you don't really have the data safety that you think you have.
Where did the correctness go?
Posted Mar 17, 2009 8:31 UTC (Tue) by butlerm (subscriber, #13312) [Link]
durability. A decent database will let you turn (synchronous) durability
off while fully guaranteeing atomicity and consistency.
The reason is that with a typical rotating disk, any durable commit is
going to take at least one disk revolution time, i.e. about 10 ms. Single
threaded atomic (but not necessarily durable) commits can be issued a
hundred times faster than that, because no synchronous disk I/O is required
at all.
Where did the correctness go?
Posted Mar 17, 2009 9:48 UTC (Tue) by dlang (guest, #313) [Link]
it's just the durability in the face of a crash that isn't there. but it wasn't there on ext3 either (there was just a smaller window of vulnerability), and even if you mount your filesystem with the sync option many commodity hard drives would not let you disable their internal disk caches, and so you would still have the vulnerability (with an even smaller window)
Where did the correctness go?
Posted Mar 17, 2009 17:30 UTC (Tue) by butlerm (subscriber, #13312) [Link]
atomicity you are looking for."
I am afraid not. Atomic means that the pertinent operation always appears
either to have completed OR to have never started in the first place. If
the system recovers in a state where some of the effects of the operation
have been preserved and other parts have disappeared, that is not atomic.
The operation here is replacing a file with a new version. Atomic
operation means when the system recovers there is either the old version or
the new version, not any other possibility. You can do this now of course,
you simply have have to pay the price for durability in addition to
atomicity.
Per accident of design, filesystems require a much higher price (in terms
of latency) to be paid for durability than databases do. That
factor is multiplied by a hundred or more if atomicity is required, but
durability is not.
Where did the correctness go?
Posted Mar 17, 2009 17:38 UTC (Tue) by butlerm (subscriber, #13312) [Link]
Where did the correctness go?
Posted Mar 17, 2009 20:42 UTC (Tue) by nix (subscriber, #2304) [Link]
This is a regression
Posted Mar 13, 2009 23:59 UTC (Fri) by ikm (subscriber, #493) [Link]
This thing alone has actually made me decide I wouldn't be moving to ext4 any time soon.
This is a regression
Posted Mar 14, 2009 1:53 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
Zero-length files were probably not possible (or at least so rare that you'd never see them) in ext3 for the rename case if you have data=ordered. The patch makes them similarly rare in ext4.
Neither happens if you run normally, or even if you soft-hang, losing interactivity but allowing the kernel to flush to disk. Neither happens if your laptop doesn't wake up from sleep, so long as the sleep code properly calls sync(). Neither happens if your changes were at least 5 seconds old (ext3 data=ordered) or 60 seconds old (other cases). The people getting bitten either lost power suddenly while working, or hit the reset button.
I agree that zero length files are undesirable, and shouldn't be common even if you pull the plug. Evidently Ted does too, since the patches are enabled by default. Still, it remains the case that applications which must have data integrity need to be more careful than this, because otherwise things can (even in ext3 with data=ordered) go badly wrong for you.
I believe that nodelalloc is just as much overkill as fully preserving atime is. Sure, in theory it might be slightly safer to disable the delayed allocator, but in practice it doesn't make enough difference to worry about, and the performance gain is very attractive. Sooner or later if you use computers you will lose some data, that's why we have backups.
This is a regression
Posted Mar 14, 2009 2:26 UTC (Sat) by bojan (subscriber, #14302) [Link]
Thanks for pointing this out. Essentially, relying on this behaviour was an accident, waiting to bite. Unfortunately, due to broken semantics of fsync on ext3, having a correct application would break the performance of the system. Looks to me that ext3 is far more broken than ext4 (which doesn't seem broken at all to me).
This is a regression
Posted Mar 14, 2009 12:30 UTC (Sat) by ikm (subscriber, #493) [Link]
This is a regression
Posted Mar 14, 2009 21:37 UTC (Sat) by bojan (subscriber, #14302) [Link]
Which just proves that most users are irrational, because they don't know any better. So, people that _know_ what really is the problem should listen to people that don't in order to fix it?
This is a regression
Posted Mar 14, 2009 22:50 UTC (Sat) by ikm (subscriber, #493) [Link]
Users don't care which solution is the right one as long as it *works*. And the solution went into 2.6.30 indeed. Distributors would hopefully backport. Problem solved. Hooray. But all the blabbering about how POSIX allows this and so on is unhelpful to the end user, even if it is interesting and inspiring to developers.
This is a regression
Posted Mar 15, 2009 2:02 UTC (Sun) by bojan (subscriber, #14302) [Link]
And that is exactly why Ted, being a practical person, reverted to the old behaviour in some situations. Doesn't mean application writers should continue using incorrect idioms.
> It's totally unrealistic and not doable in any short- or even mid-term. Why suggest this then? And who is irrational after all?
Sorry, fixing bugs is irrational?
> But all the blabbering about how POSIX allows this and stuff is unhelpful to end-user, if surely interesting and inspiring to developers.
POSIX isn't blabbering (see http://www.kernel.org/):
> Linux is a clone of the operating system Unix, written from scratch by Linus Torvalds with assistance from a loosely-knit team of hackers across the Net. It aims towards POSIX and Single UNIX Specification compliance.
This is a regression
Posted Mar 15, 2009 3:12 UTC (Sun) by bojan (subscriber, #14302) [Link]
1. By default make ext3 ordered mode have fsync as a no-op. People that want current broken behaviour could specify a mount option to get it.
2. Tell folks that they _must_ use fsync in order to commit their data.
3. Once critical mass of applications achieved the above, remove all hacks from ext4, XFS etc.
4. Retire ext3.
This is a regression
Posted Mar 15, 2009 4:53 UTC (Sun) by foom (subscriber, #14868) [Link]
but *never* writes any new data (file data, directory data, anything) to a permanent location until
you've called fsync on the file's fd, the containing directory's fd, and the fd of every directory up the
tree to the root (or call sync, of course).
Hopefully it can be the default fs for Ubuntu Jaded Jackal. If anyone complains, I'm sure "But POSIX
says it's okay to do that, the apps are broken for not obsessively calling sync after every write!" will
satisfy everyone. :)
This is a regression
Posted Mar 15, 2009 5:26 UTC (Sun) by bojan (subscriber, #14302) [Link]
All the people here suggesting that the well-established standards Linux _aims_ to implement should be ignored should remember the screaming Microsoft had to face from the FOSS community when they started twisting various standards to their own ends.
This is a regression
Posted Mar 15, 2009 12:34 UTC (Sun) by nix (subscriber, #2304) [Link]
This is a regression
Posted Mar 15, 2009 21:09 UTC (Sun) by bojan (subscriber, #14302) [Link]
Of course, Ted put hacks into ext4 because application writers missed the above and it will take time to fix it. That's called a workaround.
This is a regression
Posted Mar 15, 2009 23:50 UTC (Sun) by nix (subscriber, #2304) [Link]
should arrange that even if an app does something under which POSIX
*permits* data loss, that data loss is still considered bad and should be
avoided.
Agreed the apps are buggy, but I think this is a deficiency in POSIX,
rather than anything else.
This is a regression
Posted Mar 16, 2009 0:17 UTC (Mon) by bojan (subscriber, #14302) [Link]
And that's going to help the broken application running on another filesystem exactly how? The problem with hypocrisy here is not related to ext4 - it is related to application code.
BTW, it is obvious that Ted already decided to make sure ext4 does that. The man is not stupid - he doesn't want the file system rejected over this - no matter how wrong the people blaming ext4 for this are.
> Agreed the apps are buggy, but I think this is a deficiency in POSIX, rather than anything else.
Well, yeah - the spec is, shall we say - demanding. But, it is what it is. We tell Microsoft not to ignore the specs. What makes us so special that we can? I would suggest nothing. If we take the right to demand that from Microsoft, we should make sure we do it ourselves.
This is a regression
Posted Mar 16, 2009 1:07 UTC (Mon) by nix (subscriber, #2304) [Link]
There's no point talking to you at all, IMNSHO.
This is a regression
Posted Mar 16, 2009 2:19 UTC (Mon) by bojan (subscriber, #14302) [Link]
If you don't want to talk to me, then don't. That's OK.
This is a regression
Posted Mar 16, 2009 13:45 UTC (Mon) by ikm (subscriber, #493) [Link]
> And that's going to help the broken application running on another filesystem exactly how?
It's not. We are talking about fixing problems users start to experience when they switch from ext3 to ext4. None of the other goals, such as fixing all the apps, making all filesystems happy, feeding the hungry and making the world a better place, are being pursued here. The 2.6.30 fixes do what they are supposed to do, without breaking anything else. So it is a good thing, and I don't understand why you seem to be against it.
Sure, there's lots of stuff which ain't working right, but it's not a subject here. World's not perfect, and it's not going to be any time soon.
This is a regression
Posted Mar 15, 2009 12:57 UTC (Sun) by ikm (subscriber, #493) [Link]
Gosh. What people suggest here is that standards should not be used as an excuse for unwanted filesystem behavior.
This is a regression
Posted Mar 16, 2009 0:21 UTC (Mon) by bojan (subscriber, #14302) [Link]
This is a regression
Posted Mar 15, 2009 12:34 UTC (Sun) by nix (subscriber, #2304) [Link]
crap generated by apps that didn't call fsync().
(I wonder if we can allow it to write dirty data to disk when under memory
pressure, as well? ;) )
This is a regression
Posted Mar 15, 2009 15:18 UTC (Sun) by dcoutts (subscriber, #5387) [Link]
This is a regression
Posted Mar 15, 2009 12:20 UTC (Sun) by alexl (subscriber, #19068) [Link]
Are you crazy? That would break ACID guarantees for all databases, etc.
fsync() is about much more than data-before-metadata.
This is a regression
Posted Mar 15, 2009 21:28 UTC (Sun) by bojan (subscriber, #14302) [Link]
Close to it ;-)
I admit, that was a bit tongue-in-cheek, to point out that current ext3 "lock up on fsync" behaviour is total nonsense.
This is a regression
Posted Mar 16, 2009 14:09 UTC (Mon) by ikm (subscriber, #493) [Link]
Once I had MySQL running on an XFS filesystem, and the system hung for some reason. The database got broken so horribly I had to restore it from backups. I wouldn't really count on any 'ACID guarantees' here :) A UPS and a ventilated dust-free environment is our only ACID guarantee :)
This is a regression
Posted Mar 17, 2009 5:41 UTC (Tue) by efexis (guest, #26355) [Link]
This is a regression
Posted Mar 17, 2009 11:59 UTC (Tue) by ikm (subscriber, #493) [Link]
This is a regression
Posted Mar 14, 2009 13:09 UTC (Sat) by nix (subscriber, #2304) [Link]
> In any case anyone writing for Unix/Linux should know about and use at least the rename trick for replacing small files. Not doing so causes much worse problems than this one.
I checked at work (a random Unix/Oracle financial shop) some time ago.
One person knew this trick (the only person there other than me who reads standards documents for fun), and not even he had spotted the old oops-better-retry-my-writes-on-EINTR trap. Most people assumed that 'oh, the O_TRUNC and the writes will all get put off until I close the file, won't they?' and hadn't even thought it through that much until I pressed them on it.
J. Random Programmer is much, much less competent than you seem to think.
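For anyone who hasn't met the trap nix mentions: write() may transfer fewer bytes than asked, and may fail with EINTR, so a single unchecked write() can silently drop data. A minimal retry loop, purely for illustration:
#include <errno.h>
#include <unistd.h>
/* Write the whole buffer, retrying on EINTR and on short writes. */
ssize_t full_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before anything was written: retry */
            return -1;
        }
        p += n;                  /* short write: advance and keep going */
        left -= n;
    }
    return (ssize_t)len;
}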
fsync and spin-ups
Posted Mar 14, 2009 2:10 UTC (Sat) by njs (guest, #40338) [Link]
I also take Ted's point that if apps want atomic-rename to be portably atomic, then they have to call fsync anyway and accept that they will pay an additional cost for getting durability in addition to atomicity.
BUT, in those cases where durability is not required, and I'm using a filesystem like ext3 or (soon) ext4 that supports atomic-but-not-durable-rename, I *really* want some way to access that.
My problem is that I spend all my time on my laptop, and most of the time my hard drive is spun down. I've had to disable emacs's fsync-on-save, because otherwise my editor freezes for a good 0.5-1 seconds every time I hit save, while it blocks waiting for the disk to spin up. Even if the filesystem side of fsync is made cheap like in ext4, fsync will never be cheap if it blocks waiting on hardware like this. And if everyone adds fsync calls everywhere (because they've read Ted's rant, or just because fsync became cheaper), then most apps won't have a handy knob like emacs does to disable it, and that will suck for laptop user experience and power saving.
So I think we need an interface that lets user-space be more expressive about what ordering requirements it actually has -- rename_after_write(2), or something.
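Short of a new kernel interface like the hypothetical rename_after_write(), about the only mitigation available to an application today is to take the blocking commit off the interactive path. A minimal sketch (file names invented, compile with -pthread) that avoids the UI freeze described above, though not the eventual disk spin-up:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *commit_thread(void *arg)
{
    int fd = *(int *)arg;
    free(arg);

    if (fsync(fd) == 0)                        /* may block for seconds on a spun-down disk */
        rename("notes.txt.tmp", "notes.txt");  /* publish only after the data is on disk */
    close(fd);
    return NULL;
}

int main(void)
{
    int fd = open("notes.txt.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "buffer contents\n";
    if (write(fd, buf, sizeof buf - 1) < 0) { perror("write"); return 1; }

    int *fdp = malloc(sizeof *fdp);
    if (!fdp) { close(fd); return 1; }
    *fdp = fd;

    pthread_t tid;
    if (pthread_create(&tid, NULL, commit_thread, fdp) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        free(fdp);
        close(fd);
        return 1;
    }

    /* ... the editor goes back to handling keystrokes here ... */

    pthread_join(tid, NULL);   /* a real editor would not block on this either */
    return 0;
}

The rename is only issued after fsync() succeeds, so the old file survives a crash; what this cannot express is "order the rename after the data, whenever that happens to reach disk", which is exactly the interface being asked for.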
What's the optimization?
Posted Mar 14, 2009 7:21 UTC (Sat) by dfsmith (guest, #20302) [Link]
I.e., what is the pathological situation where a background fsync() initiated on close() fails?
I fully appreciate delayed allocation on files that are still open though.
Temporary files.
Posted Mar 14, 2009 11:21 UTC (Sat) by khim (subscriber, #9252) [Link]
A lot of files are written, used and removed in short order (think about what a "configure" script does, for example). If allocation is delayed, then the file never hits the disk and the savings can be huge.
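A minimal sketch of that pattern, with an invented probe file name: the data is created, used and unlinked entirely in the page cache, so with delayed allocation no blocks ever need to be allocated or written.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The short-lived probe file a configure-style script creates. */
    int fd = open("conftest.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    const char probe[] = "int main(void) { return 0; }\n";
    if (write(fd, probe, sizeof probe - 1) < 0) { perror("write"); return 1; }

    /* ... compile the probe, read the result, and so on ... */

    unlink("conftest.tmp");   /* gone before writeback ever gets around to it */
    close(fd);
    return 0;
}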
Temporary files.
Posted Mar 14, 2009 12:47 UTC (Sat) by mgb (guest, #3226) [Link]
Temporary files.
Posted Mar 16, 2009 3:24 UTC (Mon) by k8to (subscriber, #15413) [Link]
This allows Linux to avoid the idiocy of Solaris's tmpfs, where writing a large file to /tmp can crash the box.
Temporary files.
Posted Mar 16, 2009 15:22 UTC (Mon) by nix (subscriber, #2304) [Link]
going to crash the box unless you explicitly raised the size limits on /tmp...
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 9:15 UTC (Sat) by petegn (guest, #847) [Link]
Me personally, I gave up on the entire EXT* family of filesystems some years ago and will NEVER go back. You can slag Reiserfs off all you want, but it beats the living crap outta EXT*.
I use Reiserfs for the small boot partition, then XFS, and it just behaves correctly ALL the time.
This may be slightly off topic, but this is my experience with EXT systems: total unreliability.
YMMV; mine is VERY fixed.
Funny that - you must be pretty unique
Posted Mar 14, 2009 11:26 UTC (Sat) by khim (subscriber, #9252) [Link]
I use Reiserfs for the small boot partition, then XFS, and it just behaves correctly ALL the time.
Actually, XFS is much worse in case of a crash than any EXT* filesystem, and it is often cited as "you can do this because XFS is doing this". Sorry, but I just don't buy your story. How often does your XFS-based system crash? How consistent is it afterwards? We need to know before your evidence can be used in this discussion...
My personal experience with XFS was horrible, with exactly the behavior described: save the file or configuration to disk, power off the system - then reboot and find that the file now has zero bytes. A few such incidents were enough for me. If your answer is "don't do this", then how the hell is your experience relevant to the topic?
Funny that - you must be pretty unique
Posted Mar 14, 2009 22:56 UTC (Sat) by ikm (subscriber, #493) [Link]
Funny that - you must be pretty unique
Posted Mar 15, 2009 3:32 UTC (Sun) by dgc (subscriber, #6611) [Link]
> experience is relevant to topic?
XFSQA tests 136-140, 178 and a couple of others "do this" explicitly, and have done so for a couple of years now. This failure scenario is tested every single time XFSQA is run by a developer. Run those tests on 2.6.21 and they will fail. Run them on 2.6.22 and they will pass...
FWIW, XFS is alone in the fact that it has:
a) a publicly available automated regression test suite (http://git.kernel.org/?p=fs/xfs/xfstests-dev.git;a=summary);
b) a test suite that is run all the time by developers;
c) ioctls and a built-in framework to simulate the effect of power-off crashes, which the QA tests use.
This doesn't mean XFS is perfect, but it does mean that it is known immediately when a regression appears and where improvements need to be made.
IOWs, we can *prove* that the (once) commonly seen problems in XFS have been fixed and will remain fixed. It would be nice if people recognised this rather relevant fact instead of flaming indiscriminately about stuff that has been fixed....
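For the curious, the power-off simulation referred to in (c) is driven from userspace roughly as sketched below. The XFS_IOC_GOINGDOWN ioctl and its flag names are written here from memory of xfsprogs' <xfs/xfs_fs.h>, so treat the exact identifiers as assumptions and check your headers; xfs_io's "shutdown" command is the usual front end for the same call.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs_fs.h>   /* XFS_IOC_GOINGDOWN and the GOING_FLAGS_* values (names assumed) */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <xfs-mountpoint>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Stop all I/O without flushing the log: roughly "pull the plug".
     * The filesystem stays shut down until it is unmounted and remounted. */
    uint32_t flags = XFS_FSOP_GOING_FLAGS_NOLOGFLUSH;
    if (ioctl(fd, XFS_IOC_GOINGDOWN, &flags) < 0) {
        perror("XFS_IOC_GOINGDOWN");
        return 1;
    }
    close(fd);
    return 0;
}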
Funny that - you must be pretty unique
Posted Mar 15, 2009 13:15 UTC (Sun) by hmh (subscriber, #3838) [Link]
The only thing I never use XFS for is the root filesystem, and that's because nobody has seen fit to fix the XFS fsck to detect it is being run on a read-only filesystem, and to switch to xfs_repair on-the-fly.
It is no fun to need boot CDs to make sure everything is kosher in the root filesystem (or worse, to repair it)...
But it really works well for the MTA queues and squid spool, for example.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 4:02 UTC (Sun) by bojan (subscriber, #14302) [Link]
POSIX is a standard that the Linux kernel is attempting to implement. Explaining to people what it says is not ranting.
Just remember, when the Firefox developers got blasted for the performance hit in FF 3.0, it wasn't their fault. It was the fsync problem in ext3, which is still there.
Similarly this time, Ted is getting blasted for other people's bugs, partially caused by completely unusable fsync on ext3.
Finally, if you don't like using new software, then don't use it. Nobody is twisting your arm.
Credit where credit is due
Posted Mar 14, 2009 18:29 UTC (Sat) by sbergman27 (guest, #10767) [Link]
Credit where credit is due
Posted Mar 14, 2009 19:45 UTC (Sat) by jspaleta (subscriber, #50639) [Link]
As of 03-06 people were still reporting problems in that ticket.
Even Kirkland reports that the problem persists in some form as of 03-07 and affects encrypted home directories. Let's just hope this wasn't the issue which forced Ubuntu to pull encrypted home directory support from the installer in Alpha6 and Jaunty final. That was a pretty cool feature to make available at install time for laptop users; it's a shame they had to pull the plug on it.
It's unfortunate that an upstream ticket was never created as part of that Launchpad process to keep Ted in the loop. It seems that, in the past, real kernel bugs discovered by Ubuntu users have been fixed faster when they found their way to the upstream kernel tracker, where other kernel developers could be made aware of them. For example:
http://bugzilla.kernel.org/show_bug.cgi?id=12821
But the external Ubuntu community certainly gets the credit for finding the bug... that's for sure. It seems Canonical has decided to make it easy for Ubuntu users to crash a system at unexpected times, quite frequently. We should definitely thank Canonical for that. That is an important part of fuzz testing that I think the upstream kernel developers may overlook, given their rabid, passionate focus on keeping crashes from happening at all. How many of us do things like pull the power at random times on our systems to recreate a situation that looks like an unexpected crash? I rarely think to do it. Physically pulling out the battery on my laptop to fuzz-test a crash scenario isn't something I've even thought about doing.
It's good to see Canonical strongly committing to the idea of comprehensive testing, to the extent that they are willing to give their users highly unstable combinations of kernel and module environments to work with to find these sorts of bugs, which only crop up in unexpected and unlooked-for kernel crashes. Hats off to them. By sacrificing overall system reliability for millions of users on a daily basis, they are able to flush out bugs which can only be seen when other problems (which can't be fixed by the kernel developers themselves, such as out-of-tree drivers) cause the kernel to crash in an unexpected way. I'm not sure this is the sort of commitment to directly supporting upstream development that some of us were hoping to see from Canonical... but it's something.
-jef
Credit where credit is due
Posted Mar 14, 2009 21:30 UTC (Sat) by drag (guest, #31333) [Link]
What it seems to me is that they have been able to garner a following among neophyte Linux users who not only want to test and run bleeding-edge software, but want all the graphical goodness to go with it.
Since graphics in Linux sucked for so long, the only effective vendor for people wanting superior 3D performance has been Nvidia. It's not the user's fault that this has simply been a fact of life on Linux for a long time now.
Nvidia makes unstable drivers. Whether it's Vista or XP or Linux, they are going to be a major source of issues; it doesn't matter. With bleeding-edge versions of Linux that are unstable to begin with, you're going to run into a lot of crashes.
So it's nobody's fault, really... not anybody here or involved with Ubuntu. This is why we have beta testers, and they are doing their jobs. Everybody should be thankful that this problem is being resolved now rather than on your servers and workstations.
Credit where credit is due
Posted Mar 14, 2009 23:19 UTC (Sat) by jspaleta (subscriber, #50639) [Link]
http://www.linux.com/feature/55915
Canonical is doing a very, very neat tap dance here. They don't distribute the binary drivers... if they did, they'd be engaging in an activity that is restricted by the GPL and could easily be called out on it. No, Canonical is much more clever: they encourage users to download the source and binary bits together and compile them locally on the system, avoiding distribution of the infringing binary module altogether. They've even made a point-and-click GUI to kick off this process. Clever Canonical, skating around the edges of GPL infringement and at the same time introducing significant instability into their userbase's experience, without an effort to adequately inform users about what is going on and the potential downsides.
I don't think it's exactly fair to expect the upstream kernel developers to support this sort of behaviour. And I'll note that, because of the tricks being used to avoid a GPL-violation scenario, Canonical can't even do its own integration testing on these modules. There are no centrally built and distributed binaries.
Canonical might be avoiding a GPL violation by doing things this way, but they are certainly breaking the spirit of it.
-jef
Credit where credit is due
Posted Mar 15, 2009 13:29 UTC (Sun) by ovitters (guest, #27950) [Link]
> No it really is Canonical's fault
Fedora dude blaming/attacking Canonical again...
Credit where credit is due
Posted Mar 15, 2009 17:25 UTC (Sun) by drag (guest, #31333) [Link]
No.
Canonical exists in reality; they are not the ones that created it.
People were installing Nvidia drivers on Fedora, Red Hat, Debian, etc., etc., long, long before Ubuntu ever came onto the scene.
Credit where credit is due
Posted Mar 14, 2009 23:13 UTC (Sat) by sbergman27 (guest, #10767) [Link]
To think that a Fedora advocate, of all people, could have the gall to criticize *any* other distro's stability is nearly unbelievable.
I currently have Intel video. But I have experience with NVidia's drivers, Fedora's FOSS Radeon drivers, and the Intel drivers. And NVidia wins hands down for stability. Which is not to say that my stability complaints end there. Not by a long shot.
It seems that the Jef Spaleta FUD-fountain flows eternal. Fortunately, the more it flows, the less credibility people accord it.
Credit where credit is due
Posted Mar 14, 2009 23:46 UTC (Sat) by jspaleta (subscriber, #50639) [Link]
How NVidia impedes Free Desktop adoption.
http://vizzzion.org/?blogentry=819
July 2008
"As a Free Software developer, user and advocate, I feel screwed by NVidia, and as a customer, even more so. I would recommend not buying NVidia hardware at this point. For both political reasons, and for practical ones: Pretty much all other graphics cards around there work better with KDE4 than those requiring nvidia.ko."
Linux Graphics Essay
https://www.linuxfoundation.org/en/Linux_Graphics_Essay
"Nvidia isn't in the list of top oops causers as part of some grand strategy to make itself (and Linux) look bad. It's there because the cost of doing the QA and continuous engineering to support the changing interfaces and to detect and correct these problems outweighs the revenue it can bring in from the Linux market. In essence this is because binary drivers totally negate the shared support and maintenance burden which is what makes Open Source so compelling."
If you must, feel free to shoot the messenger. Make it personal if that's what you need in order to engage on the issues. I'm more than happy to take a few bullets. But for anyone who cares about sustainable growth of the Linux ecosystem to disregard the validity of the information is pure folly. Binary drivers are a real and significant problem for the stability of Linux systems. The people doing the actual development of the Linux desktop realize this, even if individual users and the entire Canonical workforce do not.
-jef
Credit where credit is due
Posted Mar 15, 2009 0:47 UTC (Sun) by sbergman27 (guest, #10767) [Link]
"""
Please, don't take my word for it.
"""
There surely was never any danger of that. If bias were black body radiation, you'd be glowing at about 3500K. (Which hurts your credibility substantially. And likely not just with me.)
Now, I certainly don't care for NVidia keeping their drivers closed source. But given my experience with graphics drivers under Linux (which goes back to the Voodoo1 and the original, pre-Daryll Strauss Glide driver), and having lived through the hell that has been the FOSS Radeon driver, I'd have to say that on all the cards I've had, including some NVidias, the NVidia driver has been flawless compared to the big mess that the usually incomplete FOSS video drivers often seem to be.
I wish that were not the case. But it has been my experience for about 10 years now.
Credit where credit is due
Posted Mar 15, 2009 1:15 UTC (Sun) by jspaleta (subscriber, #50639) [Link]
I like that vibe you've got going on there. One man's personal experience against a mountain of contrary opinion and evidence. That's the sort of awe-inspiring situational awareness and view of the big picture that causes me to respect opponents of global warming and evolutionary theory so very, very highly. I salute you!
I'm really glad the Ubuntu developers decided to finally enable kerneloops reporting so we can get a more comprehensive and unbiased view of the sources of instability in the Ubuntu kernel, though I'm not sure they are in the kerneloops.org database yet. I personally fully expect that the Ubuntu experience will be much like the Fedora one in the kerneloops record: the proprietary drivers will dominate the crash-reporting statistics. Canonical will introduce some bugs via patches, which will be quickly fixed (just as Fedora's have been)... but the proprietary driver bugs, like nvidia's, will linger and linger... contrary to your singular personal experience.
-jef
Credit where credit is due
Posted Mar 15, 2009 12:23 UTC (Sun) by nix (subscriber, #2304) [Link]
(mach64 and now 9250). 3D is fast enough for my purposes, 2D is blinding (and is now even faster thanks to a not-yet-committed patch from Michel Dänzer to defragment the EXA glyph cache)... I've had a total of one crash in ten years, and that was due to a device-independent Mesa bug.
Credit where credit is due
Posted Mar 16, 2009 1:18 UTC (Mon) by motk (subscriber, #51120) [Link]
Of course, anecdote != data, but the kerneloops website tells the story.
Credit where credit is due
Posted Mar 15, 2009 14:18 UTC (Sun) by sbergman27 (guest, #10767) [Link]
"""
That's not entirely true... they attempted to fix this problem several weeks ago by cherry-picking proposed patches from Ted before the merge into mainline. But the cherry-pick doesn't appear to have worked.
"""
Well, Ted has clearly pointed to the patches he thinks are the best work-around (for at least sparing files which already exist) he has. I started to say "safer" instead of "safe" in my post, but decided to give Ted's patches the benefit of any doubt. Apparently, they are not such an effective work-around, and the Ext4 guys need to come up with something better that does not eat data. Ext4 has about zero chance of becoming the default for any distro except Fedora until they do.
BTW, I hope you don't mind that I'm starting to quote you as an example of how threatened advocates of some distros feel by other distros which they perceive as being more successful. You're the most illustrative example of the phenomenon of which I am aware. (Though I'll stop short of thanking you for providing such a clear example.)
Credit where credit is due
Posted Mar 16, 2009 1:25 UTC (Mon) by motk (subscriber, #51120) [Link]
Did you even *read* the article?
Credit where credit is due
Posted Mar 15, 2009 3:54 UTC (Sun) by bojan (subscriber, #14302) [Link]
And if you repeat a lie enough times, it becomes the truth.
Credit where credit is due
Posted Mar 15, 2009 14:06 UTC (Sun) by sbergman27 (guest, #10767) [Link]
The lengths to which some people will go to avoid crediting Ubuntu is simply amazing. I've never seen such a notable sour grapes response in the community as the reactions I see today from advocates of less popular distros. Most of the really notable ones seem to come from the Fedora camp, where the perceived threat level is apparently particularly high. But then, Fedora has always had more than its fair share of spiteful advocates.
Credit where credit is due
Posted Mar 15, 2009 19:03 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]
http://lxer.com/module/newswire/view/116126/#ext4
"Ext4 isn't all good news though, the new allocator that it uses is likely to expose old bugs in software running on top of it. With ext3, applications were pretty much guaranteed that data was actually written to disk about 5 seconds after a write call. This was never official but simply resulted from the way ext3 was implemented. The new allocator used for ext4 means that this can take between 30 seconds and 5 minutes or more if you are running in laptop mode. It exposes a lot of applications that forget to call fsync() to force data to the disk but nevertheless assume that it has been written."
I recommend that you listen to
http://fosdem.unixheads.org/2009/maintracks/ext4.xvid.avi
His talk actually goes into quite a bit of detail about how to avoid this problem and the potential workarounds he was looking at. He mentioned that Eric Sandeen, an XFS developer now working on Ext4 at Red Hat, had talked to Ted about how XFS had some hacks at the filesystem level to work around this problem of application writers relying on Ext3-like behaviour. The current Ext4 patches are based on the same ideas. The rawhide kernel already has the backported patches:
https://www.redhat.com/archives/fedora-devel-list/2009-Ma...
It appears that proprietary kernel modules causing more instability aggravates the problem as well. Good to get more exposure on the gotchas, however. It looks like Btrfs now has similar patches as well, in part as a result of such wider discussions.
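For readers who want to see where the quoted numbers come from on their own machine: ext3's roughly five-second window comes from its journal commit interval, while the 30-second-to-minutes window with delayed allocation mostly falls out of the VM writeback knobs (stretched further by laptop mode), which are easy to inspect. A minimal sketch; the interpretation of the values is the quote's, not something this program measures:

#include <stdio.h>

/* Print one writeback knob from /proc; silently skip it if missing. */
static void show(const char *path)
{
    FILE *f = fopen(path, "r");
    char buf[64];

    if (f && fgets(buf, sizeof buf, f))
        printf("%-42s %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/sys/vm/dirty_writeback_centisecs"); /* how often the flusher threads wake up */
    show("/proc/sys/vm/dirty_expire_centisecs");    /* how old dirty data may get before writeback */
    show("/proc/sys/vm/laptop_mode");               /* non-zero stretches the above to save spin-ups */
    return 0;
}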
Credit where credit is due
Posted Mar 15, 2009 21:14 UTC (Sun) by bojan (subscriber, #14302) [Link]
The lie is that there was a flaw in ext4. There is no flaw in ext4 (not when it comes to this, at least) - applications are broken, because they don't do what's required. They are falling short.
Ted put a workaround into ext4 to address the shortcomings of applications.
The evidence is in your manual pages. Just read them.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 15:59 UTC (Sun) by NinjaSeg (guest, #33460) [Link]
Why yes, they want to play games. Bitch about proprietary drivers all you want, but open source drivers simply don't cut it for games less than 5 years old. Recent games have started *requiring* shader support. Do open source drivers provide it? Other than maybe the newest Intel chips, no. And ever since the great modesetting revolution, performance has gone to crap. Quake 3 isn't even playable on a Radeon 9600XT anymore.
And open source drivers aren't particularly stable at the moment either:
https://bugzilla.redhat.com/show_bug.cgi?id=441665
https://bugzilla.redhat.com/show_bug.cgi?id=474977
https://bugzilla.redhat.com/show_bug.cgi?id=487432
Just when things were starting to work, modesetting comes along and breaks everything. With great reluctance I have given up on gaming in Fedora, and have gone back to multibooting WinXP to game. When in Rome...
Here's to another year or two of waiting before drivers stabilize... again.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 8:57 UTC (Mon) by kornelix (guest, #57169) [Link]
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 9:28 UTC (Mon) by dlang (guest, #313) [Link]
not all temporary files are only used by a single program that keeps them open the entire time.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 10:24 UTC (Mon) by oseemann (subscriber, #6687) [Link]
For /home or /var many users might want a more conservative approach, e.g. fsync on close or something similar, accepting performance penalties where necessary.
I believe this is a larger issue and I'm glad the current behavior of ext4 receives such wide attention and makes people think about the actual requirements for persistent storage.
I'm certain in the long run the community will come up with a proper approach for a solution.
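Applications that want that conservative behaviour don't have to wait for the filesystem to offer it; it can be opted into per file today. A minimal sketch of an application-side "fsync on close" helper (close_durable is an invented name, not a libc call):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* close_durable() is hypothetical: flush the file's data to stable
 * storage before giving up the descriptor, accepting the penalty. */
static int close_durable(int fd)
{
    int rc = fsync(fd);        /* pay the performance cost here, per file */
    if (close(fd) < 0)
        return -1;
    return rc;
}

int main(void)
{
    int fd = open("example.conf", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char text[] = "setting = value\n";
    if (write(fd, text, sizeof text - 1) < 0) { perror("write"); return 1; }

    if (close_durable(fd) < 0) { perror("close_durable"); return 1; }
    return 0;
}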
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 23, 2009 4:56 UTC (Mon) by dedekind (guest, #32521) [Link]
And now what is Ted doing? He is fixing userspace bugs and pleasing angry bloggers by hacking ext4? Because ext4 wants more users? And now we have zero chance of ever getting userspace fixed?
