Compare that to the gcc developers who say "The C standard allows this
or that behaviour" and the kernel people's response that gcc should do
the reasonable thing (whatever that is) regardless of what the standard
allows.
Similarly, all FSs should try hard to make sure open-write-close-rename
leaves you with either the new or the old file. Anything else isn't
reasonable.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 14:33 UTC (Fri) by jbailey (subscriber, #16890)
@welinder: By that same argument, glibc should never have removed PLT entries for all of its internal functions that many programs were twiddling to try and improve their own performance. The result of that is Windows: where the scheduler has so many hacks and tweaks for different applications and there's no way out of the mess.
Relying on undocumented behaviour and then crying foul when it changes is just bad. The OS cannot and should not provide bug for bug compatibility with itself.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 14:46 UTC (Fri) by mrshiny (subscriber, #4266)
The OS can provide bug-for-bug compatibility with itself. To do so is clearly possible given the evidence that Windows does it.
To what extent that compatibility is required or desired is a different matter.
In any case, "Linux" has, for many years now, been commonly found with EXT3 as its default filesystem. This filesystem did not exhibit data-loss for the scenario in question. EXT4 does. How is this not a regression? We're not talking about a program crashing or running slowly or anything like that, we're talking about data loss. DATA LOSS. If there's one thing a computer should get right, it's storing the darn data and not losing it.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:42 UTC (Fri) by drag (subscriber, #31333)
Let's get this straight first:
1. It's not the file system causing data loss. It's a combination of buggy drivers and incorrect application developer assumptions that is causing data loss. The file system is working correctly.
2. Ext3 exhibits the same behavior, as do all modern file systems. You have the same problems with badly designed applications with XFS or Btrfs, for example.
The only difference between Ext3 and Ext4 in this matter is that with Ext3 you had a 5 second window and with Ext4 you generally have up to 60 seconds. That is, there is no difference in behavior if your system crashes 4 seconds after a write.
3. Linux supports multiple different file systems and it always has. You're not dealing with Windows, where your only choices in life are NTFS or FAT32.
Therefore if you want an OS that can benefit from the positive qualities of anything other than Ext3, then it's shitty policy to bend over backwards to support badly written applications because those developers never bothered to test on anything other than Ext3 or to understand what the code they are writing actually does.
4. If you want your software to be portable at all to other operating systems, say OS X, Windows, FreeBSD, etc etc... then depending on the dumb-luck chance characteristics of a common configuration of a nearly obsolete file system on a single operating system is not a good way to go.
5. Ts'o introduced a patch that helps mitigate this issue anyway.
----------------------------------
Anyways. If you at any time complained about the lack of standards of banks that demand IE only, then what you're saying now is slightly hypocritical. POSIX is a standard for accessing file systems and specific chance behavior of Ext3 isn't.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:14 UTC (Fri) by forthy (guest, #1525)
> You have the same problems with badly designed
> applications with XFS or Btrfs, for example.
Actually, not with Btrfs. This is a log structured file system with
COW, i.e. it preserves file system operation order. You either get new
data and new metadata, or you get old data and old metadata, but no mix
like in ext4 or XFS. BTW: This discussion is old as dirt. See Anton
Ertl's very old rant on it: "What's wrong with synchronous metadata
updates". If you don't understand
what's wrong with synchronous metadata updates, don't even try to write a
robust file system. You can still write a fast, but fragile file system,
but this doesn't need synchronous metadata updates.
If you move the problem of creating a consistent file system to the
applications, you will fail. fsync() only syncs the data - the metadata
still can be whatever the file system designer likes (old, new, not yet
allocated, trash that will settle down in five seconds). The file system
still can be broken beyond repair after a crash. And using fsync() is
horribly slow, increases file system fragmentation, and so on. If you are
a responsible file system designer, you don't delegate your consistency
problems to the user. You solve them.
The fact that ext3 is broken the same way, just with a shorter period,
is no excuse. ReiserFS was broken in a similar way, until they added data
journalling as default. Btrfs gets it right by design, but then, we
better wait for Btrfs to mature than to "fix" ext4.
The applications in question like KDE are not broken. They just rely
on a robust, consistent file system. There is no guarantee for that, as
POSIX does not specify that the file system should be robust in case of
crashes. But it is a sound assumption for writing user code. If you can't
rely on your operating system, write one yourself.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:47 UTC (Fri) by masoncl (subscriber, #47138)
Just to clarify what will happen on btrfs.
If you:
create(file1)
write(file1, good data)
fsync(file1)
create(file2)
write(file2, new good data)
rename(file2 to file1)
FS transaction happens to commit
<crash>
file2 has been renamed over file1, and that rename was atomic. file2 either completely replaces file1 in the FS or file1 and file2 both exist.
But, no fsync was done on file2's data, and file2 replaced file1. After the crash, the file1 you see in the directory tree may or may not have data in it.
The btrfs delalloc implementation is similar to XFS in this area. File metadata is updated after the file IO is complete, so you won't see garbage or null bytes in the file, but the file might be zero length.
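For readers following along, the sequence in the comment above can be sketched in Python. This is only an illustration: the function name, the file names and the payloads are made up, and nothing here simulates the crash itself. It just marks where the missing fsync sits.

```python
import os


def replace_without_fsync(directory: str) -> str:
    """Replay the create/write/rename sequence; return the final path."""
    file1 = os.path.join(directory, "file1")  # hypothetical names
    file2 = os.path.join(directory, "file2")

    # create(file1); write(file1, good data); fsync(file1)
    with open(file1, "wb") as f:
        f.write(b"good data")
        f.flush()
        os.fsync(f.fileno())

    # create(file2); write(file2, new good data)
    # Note: no fsync here, so the data may exist only in the page cache.
    with open(file2, "wb") as f:
        f.write(b"new good data")

    # rename(file2 to file1): os.replace is rename(2) on POSIX. It is atomic
    # in the namespace but says nothing about file2's data being on disk.
    os.replace(file2, file1)
    return file1
```

Without a crash the result is always the new contents; the zero-length outcome discussed above can only show up if the rename is committed before the data written to file2.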
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:13 UTC (Fri) by foom (subscriber, #14868)
> The btrfs delalloc implementation is similar to XFS in this area. File metadata
> is updated after the file IO is complete, so you won't see garbage or null
> bytes in the file, but the file might be zero length.
Better get to fixing that, then!
I'm rather amused at the number of comments along the lines of "XFS already does it so it must be
okay!" A filesystem known for pissing off users by losing their data after power-outages is not one
I'm happy to see filesystem developers hold up as a shining example of what to do...
(and apparently XFS doesn't even do this anymore, after the volume of complaints raised against it,
according to other comments!)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 19:44 UTC (Fri) by masoncl (subscriber, #47138)
XFS closes this window on truncate but not renames.
This is a fundamental discussion about what the FS is supposed to implement when it does a rename.
The part where applications expect rename to also mean fsync is a new invention with broad performance implications.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:11 UTC (Fri) by alexl (subscriber, #19068)
Nobody is expecting rename to imply fsync!
This isn't about having the data on disk *now*.
We just expect it not to write the new metadata for "newpath" to disk before the data in oldpath is on disk.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:34 UTC (Fri) by masoncl (subscriber, #47138)
The goal is to make sure the data for the new file is on disk before (or in the same transaction as) the metadata for the rename.
We have two basic choices to accomplish this:
1) Put the new file into a list of things that must be written before the commit is done. This is pretty much what the proposed ext4 changes do.
2) Write the data before the rename is complete.
The problem with #1 is that it reintroduces the famous ext3 fsync behavior that caused so many problems with firefox. It does this in a more limited scope, just for files that have been renamed, but it makes for very large side effects when you want to rename big files.
The problem with #2 is that it is basically fsync-on-rename.
The btrfs fsync log would allow me to get away with #1 without too much pain, because fsyncs don't force full btrfs commits and so they won't actually wait for the renamed file data to hit disk.
But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?
Applications have known how to get consistent data on disk for a looong time. Mail servers do it, databases do it. Changing rename to include significant performance penalties when it isn't documented or expected to work this way seems like a very bad idea to me.
I'd much rather make a new system call or flag for open that explicitly documents the extra syncing, and give application developers the choice.
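Choice #1 above (a list of things that must be written before the commit) can be illustrated with a toy in-memory model. This is a sketch under heavy assumptions: the `ToyJournal` class, its fields and the `ordered_rename` switch are invented for illustration and do not reflect how ext4 or btrfs are actually implemented.

```python
class ToyJournal:
    """Toy model: the 'disk' dictionaries are what survives a crash."""

    def __init__(self, ordered_rename: bool = True) -> None:
        self.ordered_rename = ordered_rename
        self.cached_data = {}    # inode -> data not yet on disk
        self.disk_data = {}      # inode -> data on disk
        self.cached_names = {}   # name -> inode, not yet committed
        self.disk_names = {}     # name -> inode, committed
        self.flush_before_commit = set()

    def write(self, name: str, inode: int, data: bytes) -> None:
        self.cached_names[name] = inode
        self.cached_data[inode] = data

    def rename(self, old: str, new: str) -> None:
        inode = self.cached_names.pop(old)
        self.cached_names[new] = inode
        if self.ordered_rename:
            # Choice #1: this inode's data must reach disk before the
            # commit that makes the rename durable.
            self.flush_before_commit.add(inode)

    def commit(self) -> None:
        for inode in sorted(self.flush_before_commit):
            if inode in self.cached_data:
                self.disk_data[inode] = self.cached_data.pop(inode)
        self.flush_before_commit.clear()
        self.disk_names = dict(self.cached_names)

    def read_after_crash(self, name: str) -> bytes:
        # Data that was never flushed reads back as a zero-length file.
        return self.disk_data.get(self.disk_names[name], b"")
```

With `ordered_rename=False` the same write/rename/commit sequence leaves a committed name pointing at data that never reached the toy disk, which is the zero-length result under discussion.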
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 20:46 UTC (Fri) by alexl (subscriber, #19068)
I think "and give application developers the choice." is a fallacy.
At the level of the close happening we don't really know what kind of data this is, as this is generally some library helper function. And even at the application level, how do you know that it's important to not zero out a file on crash? It depends on how the user uses the application.
It all comes back to the fact that for "proper" use of this more or less all cases would turn into sync-on-write (or the new flag or whatever). So, the difference wrt the filesystem wide implementation of this will get smaller as apps get "fixed" until the difference is more or less zero.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:19 UTC (Fri) by drag (subscriber, #31333)
Well all you guys are over my head.
But as it's pointed out that many applications do get it right consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc etc. All sorts of them. Right?
So you have the choice of making undocumented behavior documented and then forcing that behavior on all the file systems that Linux supports and all the file systems that you expect your application to run on, or you can fix the application to do it right.
And the assumptions that were made to create this bad behavior are not even true. So even then it's not even a question of backward compatibility... They've always gotten it wrong, it's just been dumb luck that it wasn't a bigger issue in the past.
As long as file systems are async then you're going to have a delay between when the data is created and when that data is committed to disk. You can do all sorts of things to help reduce the damage that can cause, but it's still the fundamental nature of the beast. If you lose power or crash your computer you WILL lose some data.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 0:38 UTC (Sat) by drag (subscriber, #31333)
Alright...
I've been reading what has been written and I think I am understanding what is going on better now.
But here is my new thought on the subject:
The reason why they are going:
create file 1
write file 1
.... (time passes)
create file 2
write file 2
rename file 2 over file 1
Is because they are trying to get an 'atomic' instruction, right?
They are making a few assumptions about the file system that:
A. the file system does not operate in atomic operations and thus they have to do this song and dance to do the file system's work for them... (protect their data)
B. that while the file system is going to fail them otherwise, it is still able to do rename in one single operation.
C. That by renaming the file they are actually telling the file system to commit it to disk.
So in effect they are trying to outguess or outthink the operating system. But their assumptions, in the case of Ext4 and most others, are not correct and their software is doing what they told it to do, but what they told it to do is not what they think it's doing.
You see they are putting extra effort into compensating for the file system already. So if they are putting the extra effort into outthinking the OS, then why don't they at least do it correctly?
Instead of writing out hundreds of files and trying the rename atomic trick, which isn't really right anyways, there are a half a dozen different design approaches that would yield better results.
Or am I completely off-base here?
I understand the need for the file system to protect a user's data despite what the application developers actually wrote. Really I do.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 8:15 UTC (Sat) by alexl (subscriber, #19068)
The rename is documented by POSIX and Unix since forever to be atomic. That is not some form of "workaround" or "compensation", but a solid, safe, well-documented way to write files. However, those atomicity guarantees are only valid if the system doesn't crash (as crashes are not specified by POSIX).
The "atomic" part is protection against other apps that are saving to the file at the same time, not crashes. The fsync is only required not to get problems when the system crashes.
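The distinction drawn above (rename for atomicity against other processes, fsync for crashes) is what the commonly recommended write-fsync-rename idiom combines. A minimal sketch in Python, assuming a POSIX system; the function name and the ".tmp" suffix are illustrative choices, not something from this thread:

```python
import os


def atomic_replace(path: str, data: bytes) -> None:
    """Replace path with data so readers see either old or new contents."""
    tmp = path + ".tmp"  # hypothetical temp name, same directory as path
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> kernel
        os.fsync(f.fileno())  # kernel -> storage, for the crash case
    os.replace(tmp, path)     # rename(2) on POSIX: atomic for other processes

    # Assumption: also fsync the containing directory so the rename itself
    # is durable. Opening a directory like this works on Linux and other
    # POSIX systems, not on Windows.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

How durable the result really is still depends on the file system and the disk hardware honoring the flush.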
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 16:45 UTC (Sat) by drag (subscriber, #31333)
Thanks. I realise now that I was off-base. I understand now it has to do with application-land and not so much with file system stuff. :)
Not getting it right
Posted Mar 14, 2009 12:17 UTC (Sat) by man_ls (subscriber, #15091)
The "other applications get it right with fsync" part is a bit of a fallacy. When you save a document in an application (be it a word processor or a mail client) you really want it to be safe, and it is normally not an issue to wait a few seconds. IOW the user is expected to wait for disk activity because we have been trained this way, so we are willing to accept this trade-off.
But there are other programs doing file operations all the time, and nobody wants to wait a few seconds for them. Most notably background tasks like the operating system or a desktop environment. Is it reasonable to expect all of them to do something which potentially slows the system to a crawl on other filesystems, just to play safe with the newcomer ext4?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 19, 2009 19:34 UTC (Thu) by anton (guest, #25547)
> But as it's pointed out that many applications do get it right
> consistently. Vim, OpenOffice, Emacs, mail clients, databases, etc
> etc. All sorts of them. Right?
Emacs did get it right when UFS lost my file that I had
just written out from emacs, as well as the autosave file. But UFS got it
wrong, just as ext4 gets it wrong now. There may be applications
that manage to work around the most crash-vulnerable file systems in
many cases, but that does not mean that the file system has sensible crash consistency guarantees.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 19, 2009 19:50 UTC (Thu) by alexl (subscriber, #19068)
That is true, and I "fixed" the glib saver code to also fsync() before rename in the case where the rename would replace an existing file.
However, all apps doing such syncing results in lower overall system performance than if the system could guarantee data-before-metadata on rename. So, ideally we would either want a new call to use instead of fsync, or a way to tell if the guarantees are met so that we don't have to fsync.
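The policy described above (fsync before rename only when the rename would replace an existing file) could look roughly like this. It is a Python sketch of the idea, not the actual glib code; `save_like_glib` and the ".tmp" suffix are made-up names.

```python
import os


def save_like_glib(path: str, data: bytes) -> bool:
    """Write via a temp file; fsync only when replacing an existing file.

    Returns True if an fsync was issued.
    """
    replacing = os.path.exists(path)
    tmp = path + ".tmp"  # hypothetical temp name
    with open(tmp, "wb") as f:
        f.write(data)
        if replacing:
            # Old contents are at stake, so pay for the sync.
            f.flush()
            os.fsync(f.fileno())
    os.replace(tmp, path)  # rename(2) on POSIX
    return replacing
```

A brand-new file skips the sync, on the reasoning that a crash then loses only data that never existed before.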
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 21:16 UTC (Fri) by foom (subscriber, #14868)
It used to be that losing *your entire filesystem* upon power loss was a possible failure mode.
Whether an app called fsync for a file or not is rather irrelevant in such a case. Obviously, this
kind of failure is allowed by the standards. Just as obviously, it sucks for users, so people made
better filesystems that don't have that failure mode. That's good, I quite enjoy using a system
that doesn't lose my entire filesystem randomly if the power fails.
So it seems to me that claiming that since failing to call fsync before rename is "incorrect" API
usage, and thus it's okay to lose both old data and new, is simply wishful thinking on the part of
the filesystem developers. Sure it may be allowed by standards (as would be zeroing out the
entire partition...), but it sucks for users of that filesystem. So filesystems shouldn't do it. That's
really all there is to it.
Unless *every* call to rename is *always* preceded by a call to fsync (including those in "mv"
and such), it will suck for users. And there's really no point in forcing everyone to put fsync()s
before every rename, when you could just make the filesystem work without that, and get to the
same place faster.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 15, 2009 7:35 UTC (Sun) by phiggins (subscriber, #5605)
Every rename() call does not need to be preceded by fsync(). If the source file is known to be on disk already, there's no point in calling fsync(). This is code that knowingly creates a new file, does not call fsync() so that there is no reason whatsoever to assume that the data is on disk, and then calls rename() to replace an existing file. I do think that the behavior of persisting the update to the directory before saving the new file's data is bizarre and likely to cause problems, though. There may not be a required ordering for those operations, but having them reordered is clearly confusing.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 18:29 UTC (Fri) by anton (guest, #25547)
> But, the important discussion isn't if I can sneak in a good
> implementation for popular but incorrect API usage. The important
> discussion is, what is the API today and what should it really be?
The API in the non-crash case is defined by POSIX, so I translate this
as: What guarantees in case of a crash should the file system give?
One ideal is to perform all operations synchronously. That's very
expensive, however.
The next-cheaper ideal is to preserve the order of the operations
(in-order semantics), i.e., after crash recovery you will find the file
system in one of the states that it logically had before the crash;
the file system may lose the operations that happened after some point
in time, but it will be just as consistent as it was at that point in
time. If the file system gives this guarantee, any application that is
written to be safe against being killed will also have consistent (but
not necessarily up-to-date) data in case of a system crash.
This guarantee can be implemented relatively cheaply in a
copy-on-write file system, so I really would like Btrfs to give that
guarantee, and give it for its default mode (otherwise things like
ext3's data=journal debacle will happen).
How to implement this guarantee? When you decide to do another
commit, just remember the then-current logical state of the file
system (i.e., which blocks have to be written out), then write them
out, then do a barrier, and finally the root block. There are some
complications: e.g., you have to deal with some processes being in the
middle of some operation at the time; and if a later operation wants
to change a block before it is written out, you have to make a new
working copy of that block (in addition to the one waiting to be
written out), but that's just a variation on the usual copy-on-write
routine.
You would also have to decide how to deal with fsync() when you
give this guarantee: Can fsync()ed operations run ahead of the rest
(unlike what you normally guarantee), or do you just perform a sync
when an fsync is requested?
The benefits of providing such a guarantee would be:
1. Many applications that work well when killed would work well on
Btrfs even upon a crash.
2. It would be a unique selling point for Btrfs. Other popular Linux
file systems don't guarantee anything at all, and their maintainers
only grudgingly address the worst shortcomings when there's a large
enough outcry while complaining about "incorrect API usage" by
applications, and some play fast and loose in other ways (e.g., by not
using barriers). Many users value their data more than these
maintainers and would hopefully flock to a filesystem that actually
gives crash consistency guarantees.
If you don't even give crash consistency guarantees, I don't really
see a point in having the checksums that are one of the main features
of Btrfs. I have seen many crashes, including some where the file
system lost data, but I have never seen hardware go bad in a way where
checksums would help.
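The in-order guarantee and the commit procedure described in the comment above (remember the logical state, write it out, barrier, then the root block) can be illustrated with a toy model. Everything here is an assumption made for illustration: `ToyCowFs` keeps whole-file contents in dictionaries and stands in for no real Btrfs code.

```python
class ToyCowFs:
    """Toy copy-on-write model of in-order crash semantics."""

    def __init__(self) -> None:
        self.working = {}  # name -> contents: current logical state
        self.on_disk = {}  # state reachable from the on-disk root block

    def write(self, name: str, data: bytes) -> None:
        self.working[name] = data

    def rename(self, old: str, new: str) -> None:
        self.working[new] = self.working.pop(old)

    def commit(self) -> None:
        # Remember the then-current logical state. Later operations work on
        # self.working and cannot disturb this snapshot.
        snapshot = dict(self.working)
        # A real file system would write the snapshot's blocks and issue a
        # barrier here, before touching the root block.
        self.on_disk = snapshot  # finally "write the root block"

    def crash(self) -> None:
        # After recovery, only the last committed state is visible.
        self.working = dict(self.on_disk)
```

In this model a crash after write-then-rename yields either the old file1 or the new one, never an empty file, because data and names always come from the same snapshot.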
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 23:17 UTC (Sat) by masoncl (subscriber, #47138)
Testing here shows that I can change the btrfs rename code to make sure the data for the new file is on disk before the rename commits without any performance penalty in most workloads.
It works differently in btrfs than xfs and ext4 because fsyncs go through a special logging mechanism, and so an fsync on one file won't have to wait for the rename flush on any other files in the FS.
I'll go ahead and queue this patch for 2.6.30.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 8:38 UTC (Mon) by njs (guest, #40338)
So, uh... doesn't the Btrfs FAQ claim that this is the default, indeed required, behavior already?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 10:46 UTC (Mon) by forthy (guest, #1525)
I'm curious, too. I thought btrfs did it right, by being COW-logging
of data&metadata and having data=ordered mandatory, with all the
explanations in the FAQ that make complete sense (correct checksums in the
metadata also mean correct data). Now Chris Mason tells us it didn't? Ok,
this will be fixed in 2.6.30, and for now, none of us expects btrfs
to be perfect. We expect bugs to be fixed; and that's going on well.
IMHO a robust file system should preserve data operation ordering, so
that a file system after a crash follows the same consistency semantics
as during operation (and during operation, POSIX is clear about
consistency). Delaying metadata updates until all data is committed to
disk at the update points should actually speed things up, not slow them
down, since there is an opportunity to coalesce several metadata updates
into single writes without seeks (delayed inode allocation e.g. can
allocate all new inodes into a single consecutive block, delayed
directory name allocation all new names into consecutive data, as
well).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 16:50 UTC (Mon) by masoncl (subscriber, #47138)
The btrfs data=ordered implementation is different from ext3/4 and reiserfs. It decouples data writes from the metadata transaction, and simply updates the metadata for file extents after the data blocks are on disk.
This means the transaction commit doesn't have to wait for the data blocks because the metadata for the file extents always reflects extents that are actually on disk.
When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash.
I hope that made some kind of sense. At any rate, 2.6.30 will have patches that make the rename case work similarly to the way ext3 does today. Files that have been through rename will get flushed before the commit is finalized (+/- some optimizations to skip it for destination files that were from the current transaction).
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 21:23 UTC (Mon) by njs (guest, #40338)
...Is what you're saying that for btrfs, metadata about extents (like disk location and checksums, I guess) is handled separately from metadata about filenames, and traditionally only the former had data=ordered-style guarantees? (Just trying to see if I understand.)
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 22:51 UTC (Mon) by masoncl (subscriber, #47138)
That's correct. The main point behind data=ordered is to make sure that if you crash you don't have extent pointers in the file pointing to extents that haven't been written since they were allocated.
Without data=ordered, after a crash the file could have garbage in it, or bits of old files that had been deleted.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 22:56 UTC (Mon) by njs (guest, #40338)
That makes sense. Thanks.
Ts'o: Delayed allocation and the zero-length file problem
Posted Apr 7, 2009 22:27 UTC (Tue) by pgoetz (subscriber, #4931)
"When you rename one file over another, the destination file is atomically replaced with the new file. The new file is fully consistent with the data that has already been written, which in the worst case means it has a size of zero after a crash."
Sorry this doesn't make any sense. Atomicity in this context means that when executing a rename, you always get either the old data (exactly) or the new data. Your worst case scenario -- a size of zero after crash -- precisely violates atomicity.
For the record, the first 2 paragraphs are equally mysterious: "This means the transaction commit doesn't have to wait for the data blocks...". Um, is the data ordered or not? If you commit the transaction -- i.e. update the metadata before the data blocks are committed, then the operations are occurring out of order and ext4 open-write-close-rename mayhem ensues.
"bug for bug compatibility" is why Windows sucks
Posted Mar 13, 2009 16:50 UTC (Fri) by JoeBuck (subscriber, #2330)
> The OS can provide bug-for-bug compatibility with itself. To do so is clearly possible given the evidence that Windows does it.
Microsoft has managed to hire some of the most brilliant developers on the planet. But they have a heavy burden: every mistake they ever made in the last 20 years, every misdesigned API, every unspecified behavior, has come to be relied on by some significant application developer, so they have to keep this mountain of crap duct-taped together and running. Some of the worst offenders have been Microsoft's own application developers, relying on undocumented behavior that they can find out by reading the source code as a competitive edge.
The wisest thing the Linux developers have done is that they decided they're willing to regularly break kernel APIs, other than system calls. It's the main reason that Linux has been able to catch up so rapidly.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:41 UTC (Fri) by qg6te2 (guest, #52587)
Hang on, comparing the changing functionality of glibc, Windows, etc. to a file system is misleading. Ensuring that "open-write-close-rename" does what it says is a reasonable requirement, even if POSIX is silent about it.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:03 UTC (Fri) by rsidd (subscriber, #2582)
> The New Jersey guy said that the Unix folks were aware of the [PC-losering] problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again.
Which, to me, sounds very much the same as saying, like Ted Ts'o, that a correct user program has to fsync() its data and not rely on fclose() actually flushing anything to disk. Also, there are lots of people (like scientific programmers) who write their own short file-handling code without being fanatically "correct" C programmers; buffer overflows and other such bugs are probably OK for them, since their systems are trusted, but data loss really is not OK.
Still, as Gabriel says, Unix won against Lisp systems. And Windows (which was even worse up until Windows ME) won against Unix. So there's food for thought there.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:19 UTC (Fri) by ajross (subscriber, #4563)
Agreed. What's the point of putting all these fancy journaling and reliability features into a file system if they don't work by default? I mean, hell, we could lose data after a system crash with ufs in 1983. Why bother with ext4?
Hiding behind POSIX here is just ridiculous. POSIX allows this absurd lack of reliability not because it's a good idea, but because the filesystems available when the standard was drafted couldn't provide anything better.
Worse is better
Posted Mar 13, 2009 23:03 UTC (Fri) by pboddie (subscriber, #50784)
It's very apt to bring up "worse is better" because that particular rant is all about the applications programmer having to jump through hoops so that the systems programmer can save some effort.
Although people can argue that UNIX "got things about right" in comparison to competing (and presumably discontinued) operating systems which were more clever in their implementation, there's a lot to be said for not pestering application programmers with, for example, the tedious details of fsync and friends at the expense of common sense idioms that just work, like those which assume that closed files can safely have filesystem operations performed on them. Those tedious details involve, of course, figuring out which sync-related function actually does what the developer might anticipate from one platform to the next.
Sometimes worse really is worse.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 3:37 UTC (Sat) by bojan (subscriber, #14302)
> Which, to me, sounds very much the same as saying, like Ted Ts'o, that a correct user program has to fsync() its data and not rely on fclose() actually flushing anything to disk.
It's not Ted saying this. It is how it works. From man 2 close:
> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
OK?
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 16, 2009 6:13 UTC (Mon) by jamesh (guest, #1159)
People aren't asking for sync on close. Rather they'd like the rename() operation to only occur if the new file data has been written to disk.
Conversely, if the file data hasn't been written to disk then they expect that the rename over the old data won't occur.
There is no expectation that either of these operations will occur immediately, which is why they don't request that happen via fsync().
If the current method applications use when expecting this behaviour is wrong, then it'd be nice to define an API that does provide the desired semantics. That said, I can't think of any cases where you wouldn't want the new data blocks written out before renaming over an existing file.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:44 UTC (Fri) by forthy (guest, #1525)
[Link]
"POSIX allows this" is an anal point. Like "the C standard allows
this". This is not what standard documents are about. I'm working in a
standard committee, and standards are about compromises between different
implementations who all have different design goals. If you design a file
system, it is your task to define those parts left out by the standard.
Do you want it fast, do you want it robust, or do you want it compatible
with some cruft out there for 25 years (like FAT)? A robust file system
has to preserve a consistent state of the file system in case of a
crash - and POSIX defines what consistent is: no reordering of operations
allowed. The real solution therefore is a log-structured file system with
COW. Ok, so wait for btrfs.
But in the meantime, ext4 could just delay metadata writes as well,
and make sure that at these sync points it does not commit metadata
until it also has written the data out. The bug here is that ext4
writes "commit" to a metadata update even though data updates issued
prior to this metadata update are still pending.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 15:58 UTC (Fri) by kh (subscriber, #19413)
[Link]
I wonder if the default for distributions will not become the "nodelalloc mount option". I'm pretty happy with the speed of ext3, and would even take a small performance penalty from ext3 for additional data integrity guarantees, and I think my position is pretty normal - so I am left wondering who the target audience of ext4 is? Only server farms with multiple power backup systems?
Server farms are not the target
Posted Mar 13, 2009 19:21 UTC (Fri) by khim (subscriber, #9252)
[Link]
Ext2 is plenty fast for most operations and if you can just throw away
everything (like Google can) - it's not a bad choice. And if ext4 cannot
provide a sane guarantee (and yes, open/write/close/rename must be atomic no
matter what POSIX says) it's a filesystem for nobody...
Server farms are not the target
Posted Mar 13, 2009 19:41 UTC (Fri) by martinfick (subscriber, #4455)
[Link]
It sounds like it is atomic, just not durable.
Server farms are not the target
Posted Mar 14, 2009 1:21 UTC (Sat) by droundy (subscriber, #4559)
[Link]
No, if it were atomic, you'd be guaranteed to end up with either the old version of the file or the new version of the file.
Server farms are not the target
Posted Mar 14, 2009 1:47 UTC (Sat) by martinfick (subscriber, #4455)
[Link]
Well, it sounds like you are guaranteed to see the new version of the file unless the system crashes. In other words, running processes will always see the operations as atomic without a crash. The question is, can an operation be atomic without being durable? I cannot find a satisfying reference that says yes or no. Wikipedia says no, but has zero references. However, all of the first ten Google results which define ACID transactions mention nothing about operations needing to be durable to be atomic.
You may be right, but now you have me questioning both wikipedia and my own interpretation. :)
Atomicity vs durability
Posted Mar 14, 2009 13:42 UTC (Sat) by man_ls (subscriber, #15091)
[Link]
"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash. I think that durability is not needed here:
> In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.
In ext3 it does not matter (too much) if the transaction stays committed, but it cannot be left in the middle of an operation (crashes notwithstanding).
Let's see an example. Say we have the following sequence:
atomic change -> commit -> 5 secs -> flushed to disk
The change might be a rename operation. If the system crashes during those 5 secs, the transaction might not survive the crash -- the filesystem would appear as before the atomic change, and thus ext3 is not durable. But the transaction can only appear as not started or as finished, and not in any state in between; thus ext3 is atomic. I guess that is what fsync() is about: durability of the requested change.
But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.
I may very well have confused everything up, and would be grateful for any clarification. My coffee intake is not what it used to be these days.
Atomicity vs durability
Posted Mar 15, 2009 3:21 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.
I think you're missing the crucial distinction here. When the atomic rename happens (and atomicity refers to file names _only_ here), "so that there is no point at which another process attempting to access newpath will find it missing" (from rename(2) manual page), the new file will replace the old file. However, because the application writer didn't commit the _data_ to the new file yet, it may not be on disk.
In other words, rename(2) does _not_ specify atomicity of _data_ inside the file, only that at no point the file _name_ will be missing. For data to be in that new file, durability is required. Ergo, fsync.
The whole API, including write(), close(), fsync() and rename() has absolutely no idea that the application writer is trying to atomically replace the file. Only the application writer knows this and must act accordingly.
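A minimal sketch of the sequence this comment argues for (write, fsync, close, rename), in Python. The function name replace_file is hypothetical, and os.replace is used because it maps to rename(2) on POSIX systems. It illustrates the idiom only and makes no claim about what any particular filesystem does after a crash.

```python
import os
import tempfile

def replace_file(path, data):
    """Replace path with data so that readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # rename(2) on POSIX systems
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that tempfile.mkstemp creates the file with mode 0600, so a real application would likely want to copy the permissions of the file being replaced.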
Atomicity vs durability
Posted Mar 15, 2009 9:44 UTC (Sun) by alexl (subscriber, #19068)
[Link]
I don't think this is a correct description of the specs.
POSIX guarantees an atomic replacement of the file, and this means *both* the data and the filenames[1]. However, POSIX doesn't specify anything about system crashes. So, this guarantee is only valid for the non-crashing case.
For the crashing case POSIX doesn't guarantee anything. In fact, many POSIX filesystems such as ext2 can (correctly, by the spec) result in a total loss of all filesystem data in the case of a system crash. And in fact, this is allowed even if the application fsync()ed some data before the crash.
Now, in order to have some way of getting better guarantees than this POSIX also supplies fsync that guarantees that the files have been written to disk. However, nowhere in the specs of fsync does it say that it guarantees that this will survive a system crash. Of course if it *does* survive it is nice to have the fsync guarantee because that means if the metadata change survived we're more likely to get the whole new file.
But, your discussions about how the "atomic" part is only referring to the filenames is bullshit. POSIX does give full guarantees for both filename and content in the case it specifies. Everything else is up to the implementation. This is why it's a good idea for a robust filesystem to give the write-data-before-metadata-on-rename guarantee, since it turns a non-crash POSIX guarantee into a post-crash guarantee. (Of course, this is by no means necessary, even ext2 with full data loss on crash is POSIX compliant, it's just a *good* implementation.)
[1] From the POSIX spec:
If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.
(Notice how this has no separation about "filenames" and "data")
Atomicity vs durability
Posted Mar 15, 2009 10:18 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> (Notice how this has no separation about "filenames" and "data")
Notice how it doesn't say _anything_ at all about the content of the file _on_ _disk_ that is being renamed into the old file. That is because in order to see the file _durably_, you have to commit its content. Completely unrelated to committing the _rename_.
Just because another process can see the link (and access the data correctly, which may still just be cached by the kernel) does _not_ mean data reached the disk yet.
BTW, thanks for working on fixes of this in Gnome.
Atomicity vs durability
Posted Mar 15, 2009 10:29 UTC (Sun) by alexl (subscriber, #19068)
[Link]
Of course not. It says *nothing* about what's on disk, because durability wrt system crashes (not process crashes) is not part of POSIX. So, any behaviour better than full data loss on crash is a robustness property of the implementation.
I argue that a robust useful filesystem should give data-before-metadata-on-rename guarantees, as that would make it a better filesystem. And without this guarantee I'd rather use another filesystem for my data. This is clearly not a requirement, and important code should still fsync() to handle other filesystems. But it's still the mark of a "good" filesystem.
Atomicity vs durability
Posted Mar 15, 2009 10:37 UTC (Sun) by bojan (subscriber, #14302)
[Link]
Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)
Atomicity vs durability
Posted Mar 15, 2009 11:04 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
That is why we are willing to replace it in the first place! But not if it means losing its good properties in the process (be they in the spec or not).
Atomicity vs durability
Posted Mar 20, 2009 11:24 UTC (Fri) by anton (guest, #25547)
[Link]
> Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)
If the fsync() has to write out 500MB, I certainly would expect it to
take several seconds and the fsync call to block for several seconds.
fsync() is just an inherently slow operation. And if an application
works around the lack of robustness of a file system by calling
fsync() frequently, the application will be slow (even on robust file systems).
Atomicity vs durability
Posted Mar 15, 2009 10:36 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> But, your discussions about how the "atomic" part is only referring to the filenames is bullshit.
Consider this. A file has been renamed and just then kernel decides that the directory will be evicted from the cache (i.e. committed to disk). What will be written to disk? The _new_ content of the directory will be written out to disk, because any other process looking for the file must see the _new_ file (and must never _not_ see the file). At the same time, the content of the file can still be cached just fine and _everything_ is atomically visible by all processes.
And yet, the atomicity of rename _only_ refers to filenames.
Atomicity vs durability
Posted Mar 15, 2009 11:02 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
No, you are just blurring the issue; transactionality does not work that way. I think the interpretation of alexl is correct here. It does not matter if the contents of the file are still cached; other processes can see either the old contents or the new contents, but not both and not a broken file. The rename cannot be atomic if the name points to e.g. an empty file; not only the filename must be valid but the contents of the file as well, up to the point when the rename is done. It is no good if the file appears as it was before the atomic operation was issued (e.g. empty).
With fsync you make the contents persistent i.e. durable, but the operation should be atomic even without the fsync.
Atomicity vs durability
Posted Mar 15, 2009 11:34 UTC (Sun) by dlang (✭ supporter ✭, #313)
[Link]
what you are missing is that unless the system crashes everything does work. processes on the system see either the old file content or the new file content.
the only time they could see a blank or corrupted file is if the system crashes.
so the atomicity is there now
it's the durability that isn't there unless you do f(data)sync on the file and directory (and have hardware that doesn't do unsafe caching, which most drives do by default)
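A rough sketch of the "sync the directory as well" step mentioned above, in Python. The helper names fsync_directory and rename_durably are invented for illustration. Opening a directory with os.open works on Linux and other POSIX systems but not on Windows, and the sketch assumes the file's own data has already been fsync()ed before the rename.

```python
import os
import tempfile

def fsync_directory(dirpath):
    """Ask the kernel to flush a directory's entries (POSIX systems only)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def rename_durably(src, dst):
    """Rename src over dst, then flush the directory that holds dst."""
    os.replace(src, dst)  # rename(2) on POSIX systems
    fsync_directory(os.path.dirname(os.path.abspath(dst)))
```

Even with both calls in place, a drive with an unsafe write cache can still lose the update, which is the hardware caveat made in the comment.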
Atomicity vs durability
Posted Mar 15, 2009 12:48 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
Exactly. What is missing now is atomicity in the face of a crash. To quote myself from a few posts above:
"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash.
Durability (what fsync provides) is not needed here; durability means that the transaction is safely stored to disk. What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.
Even if the POSIX standard does not speak about system crashes it is good engineering to take them into account IMHO.
Atomicity vs durability
Posted Mar 15, 2009 13:06 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.
Which is not what POSIX requires.
Atomicity vs durability
Posted Mar 15, 2009 13:13 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
That is right, that is why we are not using ext2 (which is POSIX-compliant), FreeBSD (which is POSIX-compliant) or even Windows Vista (which can be made to be POSIX-compliant). We are running a journalling file system in the (apparently silly) hope that the system will hold our data and then give it back.
Atomicity vs durability
Posted Mar 15, 2009 13:19 UTC (Sun) by bojan (subscriber, #14302)
[Link]
Look, I'm all for reliability. But, if the manual says: "fsync if you want your data on disk" and we don't fsync, then it is us that are creating the problem.
I think we should come up with a new API that guarantees what people really want. Making the existing API do that on a particular FS is just going to make applications non-portable to any FS that doesn't work that way using existing POSIX API. We've seen this with XFS. Who knows what's lurking out there. Better do the proper thing, fsync and be done with it. Then we can invent the new, better, smarter API.
Atomicity vs durability vs reliability
Posted Mar 15, 2009 13:35 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
No, you are not all for reliability if you cannot see beyond your little POSIX manual. Or if you don't care about system crashes because the manual is silent about this particular point. Sorry to break it to you: reliability is about such little details as having a predictable response to a crash, or surviving the crash while retaining all the nice properties.
> I think we should come up with a new API that guarantees what people really want.
APIs are good enough as they are -- we don't need a special "reliability API" so we can build a special "reliability manual" for guys who just follow the book.
> We've seen this with XFS.
Nope. What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in. With ext4 we are hoping to get to the point where the devs give in before they lose most of their user base. Just because ext4 is important for Linux and for our world domination agenda. Meanwhile you can keep waving the POSIX standard in our face. The POSIX standard seems to be about compatibility, not about reliability, and it should keep playing that role. Reliability is left as an exercise for the attentive reader. Let us hope that Mr Ts'o is attentive and can tell atomicity, reliability and durability apart.
Actually it's done deal...
Posted Mar 15, 2009 17:34 UTC (Sun) by khim (subscriber, #9252)
[Link]
If you read the comments on tytso's blog you'll see that the current
position is: "POSIX is right while applications are broken yet we'll save
them anyway". Even if the "proper way" is to fix thousands of applications, it's
just not realistic - so ext4 (starting from 2.6.30) will try to save these
broken applications by default. And if you want performance - there is a
switch. Good enough for me. Can we close the discussion?
Actually it's done deal...
Posted Mar 15, 2009 21:10 UTC (Sun) by bojan (subscriber, #14302)
[Link]
Exactly. Ted is a practical man, so he already put a workaround in place, until applications are fixed.
Sorry
Posted Mar 15, 2009 21:20 UTC (Sun) by man_ls (subscriber, #15091)
[Link]
Sure, I have polluted the interwebs enough with my ignorance, and there is little chance to learn anything else.
Atomicity vs durability vs reliability
Posted Mar 15, 2009 21:06 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> No, you are not all for reliability if you cannot see beyond your little POSIX manual.
POSIX manual is not little ;-)
Seriously, we tell Microsoft that going out of spec is bad, bad, bad. But, we can go out of spec no problem. There is a word for that:
> What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in.
Yep, blame the people that _didn't_ cause the problem. We've seen that before.
Sorry, but I don't see it this way...
Posted Mar 15, 2009 22:08 UTC (Sun) by khim (subscriber, #9252)
[Link]
I have yet to see anyone who asks Microsoft to never go beyond the spec.
It would be just insane: if you cannot ever add anything beyond what
the spec says, how can any progress occur?
When Microsoft is blamed it's because Microsoft
1. Does not implement spec correctly, or
2. Doesn't say which parts are spec requirements and which are extensions.
When Microsoft says "JNI is not sexy so we'll provide RMI instead" the
ire is NOT about problems with RMI. Lack of JNI is to blame.
I don't see anything of the sort here: POSIX does not require
open/write/close/rename to be atomic but it certainly does not forbid it. And
it's a useful thing to have so why not? It would be best to actually document
this behaviour, of course - after that applications can safely rely on it
and other systems can implement it as well if they wish. We even have a nice
flag to disable this extension if someone wants that :-)
Sorry, but I don't see it this way...
Posted Mar 15, 2009 22:24 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> 1. Does not implement spec correctly
Which is exactly what our applications are doing. POSIX says, commit. We don't and then we blame others for it.
This is the same thing HTML5 is doing
Posted Mar 15, 2009 22:33 UTC (Sun) by khim (subscriber, #9252)
[Link]
Sorry, but it's not a problem with POSIX or the FS - it's a problem with
the number of applications. Once a lot of applications start to depend
on some weird feature (content sniffing in the case of HTML, atomicity of
open/write/close/rename in the case of the filesystem) it makes no sense to try to
fix them all. Much better to document it and make it official. This is what
Microsoft did with a lot of "internal" functions in MS-DOS 5 (and it was
praised for it, not ostracized), this is what HTML is doing in HTML5 and
this is what Linux filesystems should do.
Was it a good idea to depend on said atomicity? Maybe, maybe not. But
the time to fix these problems has come and gone - today it's much better to
extend the spec.
This is the same thing HTML5 is doing
Posted Mar 15, 2009 23:37 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> But the time to fix these problems has come and gone - today it's much better to extend the spec.
Time to fix these problems using the existing API is now, because right now we have the attention of everyone on how to use the API properly. To the credit of some in this discussion, bugs are already being fixed in Gnome (as I already mentioned in another comment). I also have bugs to fix in my own code - there is no denying that :-(
In general, I agree with you on extending the spec. But, before the spec gets extended officially, we need to make sure that _every_ POSIX compliant file system implements it that way. Otherwise, apps depending on this new spec will not be reliable until that's the case. So, can we actually make sure that's the case? I very much doubt it. There are a lot of different systems out there that are implementing POSIX, some of them very old. Auditing all of them and then fixing them may be harder than fixing the applications.
Why do we need such blessing?
Posted Mar 16, 2009 0:05 UTC (Mon) by khim (subscriber, #9252)
[Link]
Linux extends POSIX all the time. New syscalls, new features (things
like "According to the standard specification (e.g., POSIX.1-2001),
sync() schedules the writes, but may return before the actual writing is
done. However, since version 1.3.20 Linux does actually wait."), etc.
If an application wants to use such an "extended feature" - it can do this, if
not - it can use POSIX-approved features only.
As for old POSIX systems... it's up to application writers again. And
you can be pretty sure A LOT OF them don't give a damn about POSIX
compliance. They are starting to consider Linux as a third platform for their
products (first two are obviously Windows and MacOS in that order), but if
you'll try to talk to them about POSIX it'll just lead to the removal of
Linux from list of supported platforms. Support of many distributions is
already hard enough, support of some exotic filesystems "we'll think about
it but don't hold your breath...", support for old exotic POSIX systems...
fuggetaboudit!
Now - the interesting question is: do we welcome such selfish developers
or not? This is a hard question because the answer "no, they should play by
our rules" will just lead to exodus of users - because they need these
applications and WINE is not a good long-term solution...
Atomicity vs durability
Posted Mar 15, 2009 22:05 UTC (Sun) by dcoutts (guest, #5387)
[Link]
Remember, we do not care if the data is on disk or not, just that if it does make it to disk that it preserves the atomic property we were after. All that needs to happen is for the rename not to be reordered in front of the write. That hardly restricts performance.
As for a new API, yes, that'd be great. There are doubtless other situations where it would be useful to be able to constrain write re-ordering. For example for writes within a single file if we're implementing a persistent tree structure where the ordering is important to provide atomicity in the face of system failure.
Having a nice new API does not mean that the obvious cases that app writers have been using for ages are wrong. We should just insert the obvious write barriers in those cases.
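To make the reordering concern in this comment concrete, here is a toy model in Python. It is not how ext4 or any real filesystem is implemented: the "disk" is just two dictionaries (names to inode numbers, inode numbers to contents), and every name in the sketch is invented for illustration. It lists what a reader could find under the name "config" if a crash happened after any prefix of the operations had reached the disk.

```python
import copy

def apply_op(disk, op):
    """Apply one operation to the toy on-disk state."""
    kind = op[0]
    if kind == "create":
        _, name, inode = op
        disk["inodes"][inode] = b""      # new file starts out empty
        disk["names"][name] = inode
    elif kind == "write":
        _, inode, data = op
        disk["inodes"][inode] += data    # data blocks reach the disk
    elif kind == "rename":
        _, old, new = op
        disk["names"][new] = disk["names"].pop(old)

def contents_after_crash(ops, initial, name):
    """Contents visible under name if a crash follows any prefix of ops."""
    seen = set()
    for k in range(len(ops) + 1):
        disk = copy.deepcopy(initial)
        for op in ops[:k]:
            apply_op(disk, op)
        seen.add(disk["inodes"][disk["names"][name]])
    return seen

INITIAL = {"names": {"config": 1}, "inodes": {1: b"old"}}

# Order issued by the application: data first, then the rename.
IN_ORDER = [
    ("create", "config.tmp", 2),
    ("write", 2, b"new"),
    ("rename", "config.tmp", "config"),
]

# Same operations, but the rename reaches the disk before the data.
REORDERED = [
    ("create", "config.tmp", 2),
    ("rename", "config.tmp", "config"),
    ("write", 2, b"new"),
]

if __name__ == "__main__":
    print(contents_after_crash(IN_ORDER, INITIAL, "config"))
    print(contents_after_crash(REORDERED, INITIAL, "config"))
```

With the operations kept in order, the model only ever yields the old or the new contents. With the rename moved ahead of the write, an empty file also becomes a possible outcome, which is the zero-length file case discussed in this thread.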
Atomicity vs durability
Posted Mar 16, 2009 4:52 UTC (Mon) by dlang (✭ supporter ✭, #313)
[Link]
remember that the drive has its own buffer (that usually isn't battery backed), and it will tell the OS that the data is written when it's in the buffer, not when it is on the disk. it then can re-order the writes to the disk.
so everything that you are screaming that the OS should guarantee can be broken by the hardware after the OS has done its best.
you can buy/configure your hardware to not behave this way, but it costs a bunch (either in money or in performance). similarly you can configure your filesystem to give you added protection, at a significant added cost in performance.
Atomicity vs durability
Posted Mar 16, 2009 11:00 UTC (Mon) by forthy (guest, #1525)
[Link]
Any reasonable hard disk (SATA, SCSI) has write barriers which allow
file system implementers to actually implement atomicity.
Atomicity vs durability
Posted Mar 15, 2009 23:51 UTC (Sun) by vonbrand (subscriber, #4458)
[Link]
I just don't understand all this "extN isn't crash-proof" whining...
Yes, Linux systems do crash on occasion. It is thankfully very rare.
Yes, hardware does fail. Even disks do fail. Yes, if you are unlucky you will lose data. Yes, the system could fail horribly and scribble all over the disk. Yes, the operating system could mess up its internal (and external) data structures.
It is just completely impossible for the operating system to "do the right thing with respect to whatever data the user values more", even more so in the face of random failures. Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.
Asking Linux developers to create some Linux-only beast of a filesystem in order to make application developers' lives easier doesn't cut it, there are other operating systems (and Linux systems with other filesystems) around, and always will be. Plus asking for a filesystem that is impossible in principle won't get you too far either.
Atomicity vs durability
Posted Mar 16, 2009 0:08 UTC (Mon) by man_ls (subscriber, #15091)
[Link]
Yes, isn't it silly to ask for the moon like this? Apart from the fact that ext3 does exactly what we are asking for; and XFS since 2007; and now ext4 with the new patches. Oh wait... maybe you really didn't understand what we were asking for.
Listen, the sky might fall on our heads tomorrow and eventually we are all to die, we understand that. But until then we really want our filesystems to do atomic renames in the face of a crash (i.e. what the rest of the world [except POSIX] understands as "atomic"). Not durable, not crash-proof, not magically indestructible -- just all-or-nothing. Atomic.
YMMV
Posted Mar 16, 2009 0:26 UTC (Mon) by khim (subscriber, #9252)
[Link]
> Yes, Linux systems do crash on occasion. It is thankfully very rare.
Depends on what hardware and what kind of drivers you have.
> Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.
The problem is: a fast filesystem is useless if it can't keep my data
safe. Microsoft knows this - that's why you don't need to explicitly
unmount a flash drive there. Yes, the cost is huge, it means flash wears down
faster and speed is horrible - but anything else is unacceptable. Oh, and I
know A LOT OF users who just turn off the computer at the end of the day. This
problem is somewhat mitigated by the design of current systems (the "power off"
button is actually a "shutdown" button), but people are finding ways to cope:
they just switch off the power to the desk.
The same thing applies to developers. They are lazy. Most application
writers do not use fsync and do not check the error code from
close. Yet if data is lost - OS will be blamed. Is it fair to OS and FS
developers? Not at all! Can it be changed? Nope. Life is unfair - deal with
it.
The whining started when it was found out that the new filesystem can lose
valuable data - where ext3 never does it in this fashion (it can do
this with O_TRUNC, but not with rename). This is a pretty serious regression
to most people. The approach "let's fix thousands upon thousands of
applications" (including proprietary ones) was thankfully rejected. This is
a good sign: this means Linux is almost ready to be usable by normal people.
Last time such a problem happened (the OSS->ALSA switch) the offered solution was
beyond the pale.
Atomicity vs durability
Posted Apr 8, 2009 15:30 UTC (Wed) by pgoetz (subscriber, #4931)
[Link]
Who gives a flying fruitcake about what POSIX requires?! It's not acceptable for a user to edit, say her thesis, which she's been working on for 18 months and which has been saved thousands of times, and -- upon system crash -- find that not only did she lose her most recent 15 minutes worth of changes (acceptable) but in fact THE ENTIRE FILE. Putting the onus on application developers to fsync early and often is beyond ridiculous.
Atomicity vs durability
Posted Mar 15, 2009 13:14 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> The rename cannot be atomic if the name points to e.g. an empty file
As long as processes running on the system can see a consistent picture (and they can), according to POSIX it is.
> not only the filename must be valid but the contents of the file as well
From the point of view of any process running on the system, it is.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 5:45 UTC (Sat) by Nick (guest, #15060)
[Link]
> "POSIX allows this" is an anal point. Like
> "the C standard allows this". This is not
> what standard documents are about.
To everyone whining about this behaviour because "posix allows it", it is not like a C compiler that adds some new crazy optimisation that can break any threaded program that previously was working (like speculatively writing to a variable that is never touched in logical control flow). POSIX in this case *IS* basically ratifying existing practices.
fsync() is something that you have had to do on many OSes and many filesystems for a long time in order to get correct semantics. Programmers who come along a decade or two later and decide to ignore basics like that, because it mostly appears to work on one filesystem on one OS, cannot complain that a new filesystem or OS breaks their app. They really don't have a high horse on which to sit.... a mini pony at best.
It's quite fair to hope for a best effort, but it is also fair to make an appropriate choice about the performance tradeoff so that well written applications don't get horribly penalised.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 13:48 UTC (Fri) by anton (guest, #25547)
[Link]
> [...] it is not like a C compiler that adds some new
> crazy optimisation that can break any threaded program that previously
> was working (like speculatively writing to a variable that is never
> touched in logical control flow)
That's actually a very good example: ext4 performs the writes in a
different order than what corresponds to the operations in the
process(es), resulting in an on-disk state that never was the logical
state of the file system at any point in time. One difference is that
file systems have been crash-vulnerable ("crazy") for a long time, so
in a variation of the Stockholm syndrome a number of people now insist
that that's the right thing.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 16:25 UTC (Fri) by droundy (subscriber, #4559)
[Link]
> Similarly, all FSs should try hard to make sure open-write-close-rename
> leaves you with either the new or the old file. Anything else isn't
> reasonable.
Absolutely. The problem with the argument that it must be open-write-fsync-close-rename is that that describes a *different* goal, which is to ensure that the new file is actually saved. When an application doesn't particularly care whether the new or old file is present in case of a crash, it'd be nice to allow them to ask only for the ordinary POSIX guarantees of atomicity, without treating system crashes as a special exception.
And on top of that, I'd prefer to think of all those Ubuntu gamers as providing the valuable service of searching for race conditions relating to system crashes. Personally, I prefer not to run nvidia drivers, but I'd like to use a file system that has been thoroughly tested by hard rebooting the computer, so that on the rare power outages, I'll not be likely to lose data. It'd be even nicer if lots of people were to stress-test their ext4 file systems by simply repeatedly hitting the hard reset button, but it's hard to see how to motivate people to do this.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 13, 2009 18:26 UTC (Fri) by oak (subscriber, #2786)
[Link]
> It'd be even nicer if lots of people were to stress-test their ext4 file
> systems by simply repeatedly hitting the hard reset button, but it's hard
> to see how to motivate people to do this.
If you have very small children, they can do it for you. When you least
want/expect it...
Anyway, it's not so hard to do an automated system for this and you can
also buy one. All the file system maintainers I know do this kind of
testing.
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 14, 2009 18:03 UTC (Sat) by dkite (guest, #4577)
[Link]
Used to be when the reset buttons were red and prominent. I speak from
experience.
Derek
Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 23, 2009 5:02 UTC (Mon) by dedekind (guest, #32521)
[Link]
It's not just awful, it is silly.
$ man 2 write
...
"A successful return from write() does not make any guarantee that data has been committed to disk."