Server farms are not the target
Posted Mar 14, 2009 1:21 UTC (Sat) by droundy (subscriber, #4559)
Posted Mar 14, 2009 1:47 UTC (Sat) by martinfick (subscriber, #4455)
You may be right, but now you have me questioning both wikipedia and my own interpretation. :)
Atomicity vs durability
Posted Mar 14, 2009 13:42 UTC (Sat) by man_ls (subscriber, #15091)
In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.
Let's see an example. Say we have the following sequence:
atomic change -> commit -> 5 secs -> flushed to disk
But the problem Ts'o is talking about is different: the transaction has been committed but only part of it may appear on disk -- a zero-length version of the transaction to be precise. So the system is not atomic. It can be made durable with fsync() but that is not really the point.
I may very well have mixed everything up, and would be grateful for any clarification. My coffee intake is not what it used to be these days.
Posted Mar 15, 2009 3:21 UTC (Sun) by bojan (subscriber, #14302)
I think you're missing the crucial distinction here. When the atomic rename happens (and atomicity refers to file names _only_ here), "so that there is no point at which another process attempting to access newpath will find it missing" (from rename(2) manual page), the new file will replace the old file. However, because the application writer didn't commit the _data_ to the new file yet, it may not be on disk.
In other words, rename(2) does _not_ specify atomicity of _data_ inside the file, only that at no point will the file _name_ be missing. For data to be in that new file, durability is required. Ergo, fsync.
The whole API, including write(), close(), fsync() and rename() has absolutely no idea that the application writer is trying to atomically replace the file. Only the application writer knows this and must act accordingly.
Posted Mar 15, 2009 9:44 UTC (Sun) by alexl (subscriber, #19068)
POSIX guarantees an atomic replacement of the file, and this means *both* the data and the filenames. However, POSIX doesn't specify anything about system crashes. So, this guarantee is only valid for the non-crashing case.
For the crashing case POSIX doesn't guarantee anything. In fact, many POSIX filesystems such as ext2 can (correctly, by the spec) result in a total loss of all filesystem data in the case of a system crash. And in fact, this is allowed even if the application fsync()ed some data before the crash.
Now, in order to have some way of getting better guarantees than this POSIX also supplies fsync that guarantees that the files have been written to disk. However, nowhere in the specs of fsync does it say that it guarantees that this will survive a system crash. Of course if it *does* survive it is nice to have the fsync guarantee because that means if the metadata change survived we're more likely to get the whole new file.
But, your discussions about how the "atomic" part is only referring to the filenames is bullshit. POSIX does give full guarantees for both filename and content in the case it specifies. Everything else is up to the implementation. This is why it's a good idea for a robust filesystem to give the write-data-before-metadata-on-rename guarantee, since it turns a non-crash POSIX guarantee into a post-crash guarantee. (Of course, this is by no means necessary; even ext2 with full data loss on crash is POSIX compliant, it's just a *good* implementation.)
From the POSIX spec:
If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.
(Notice how this has no separation about "filenames" and "data")
Posted Mar 15, 2009 10:18 UTC (Sun) by bojan (subscriber, #14302)
Notice how it doesn't say _anything_ at all about the content of the file _on_ _disk_ that is being renamed into the old file. That is because in order to see the file _durably_, you have to commit its content. Completely unrelated to committing the _rename_.
Just because another process can see the link (and access the data correctly, which may still just be cached by the kernel) does _not_ mean the data has reached the disk yet.
BTW, thanks for working on fixes of this in Gnome.
Posted Mar 15, 2009 10:29 UTC (Sun) by alexl (subscriber, #19068)
I argue that a robust useful filesystem should give data-before-metadata-on-rename guarantees, as that would make it a better filesystem. And without this guarantee I'd rather use another filesystem for my data. This is clearly not a requirement, and important code should still fsync() to handle other filesystems. But it's still the mark of a "good" filesystem.
Posted Mar 15, 2009 10:37 UTC (Sun) by bojan (subscriber, #14302)
Posted Mar 15, 2009 11:04 UTC (Sun) by man_ls (subscriber, #15091)
Posted Mar 20, 2009 11:24 UTC (Fri) by anton (guest, #25547)
Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)
Posted Mar 15, 2009 10:36 UTC (Sun) by bojan (subscriber, #14302)
Consider this. A file has been renamed and just then kernel decides that the directory will be evicted from the cache (i.e. committed to disk). What will be written to disk? The _new_ content of the directory will be written out to disk, because any other process looking for the file must see the _new_ file (and must never _not_ see the file). At the same time, the content of the file can still be cached just fine and _everything_ is atomically visible by all processes.
And yet, the atomicity of rename _only_ refers to filenames.
Posted Mar 15, 2009 11:02 UTC (Sun) by man_ls (subscriber, #15091)
With fsync you make the contents persistent i.e. durable, but the operation should be atomic even without the fsync.
Posted Mar 15, 2009 11:34 UTC (Sun) by dlang (✭ supporter ✭, #313)
the only time they could see a blank or corrupted file is if the system crashes.
so the atomicity is there now
it's the durability that isn't there unless you do f(data)sync on the file and directory (and have hardware that doesn't do unsafe caching, which most drives do by default)
Posted Mar 15, 2009 12:48 UTC (Sun) by man_ls (subscriber, #15091)
"Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash.
Even if the POSIX standard does not speak about system crashes it is good engineering to take them into account IMHO.
Posted Mar 15, 2009 13:06 UTC (Sun) by bojan (subscriber, #14302)
Which is not what POSIX requires.
Posted Mar 15, 2009 13:13 UTC (Sun) by man_ls (subscriber, #15091)
Posted Mar 15, 2009 13:19 UTC (Sun) by bojan (subscriber, #14302)
I think we should come up with a new API that guarantees what people really want. Making the existing API do that on a particular FS is just going to make applications non-portable to any FS that doesn't work that way using existing POSIX API. We've seen this with XFS. Who knows what's lurking out there. Better do the proper thing, fsync and be done with it. Then we can invent the new, better, smarter API.
Atomicity vs durability vs reliability
Posted Mar 15, 2009 13:35 UTC (Sun) by man_ls (subscriber, #15091)
I think we should come up with a new API that guarantees what people really want.
We've seen this with XFS.
Actually it's a done deal...
Posted Mar 15, 2009 17:34 UTC (Sun) by khim (subscriber, #9252)
If you read the comments on tytso's blog you'll see that the current position is: "POSIX is right while applications are broken, yet we'll save them anyway". Even if the "proper way" is to fix thousands of applications, it's just not realistic - so ext4 (starting from 2.6.30) will try to save these broken applications by default. And if you want performance - there is a switch. Good enough for me. Can we close the discussion?
Posted Mar 15, 2009 21:10 UTC (Sun) by bojan (subscriber, #14302)
Posted Mar 15, 2009 21:20 UTC (Sun) by man_ls (subscriber, #15091)
Posted Mar 15, 2009 21:06 UTC (Sun) by bojan (subscriber, #14302)
POSIX manual is not little ;-)
Seriously, we tell Microsoft that going out of spec is bad, bad, bad. But, we can go out of spec no problem. There is a word for that:
> What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in.
Yep, blame the people that _didn't_ cause the problem. We've seen that before.
Sorry, but I don't see it this way...
Posted Mar 15, 2009 22:08 UTC (Sun) by khim (subscriber, #9252)
I'm yet to see anyone who asks Microsoft to never go beyond the spec. It would be just insane: if you can never add anything beyond what the spec says, how can any progress occur?

When Microsoft is blamed it's because Microsoft

1. Does not implement the spec correctly, or
2. Doesn't say which parts are spec requirements and which are extensions.

When Microsoft says "JNI is not sexy so we'll provide RMI instead" the ire is NOT about problems with RMI. The lack of JNI is to blame.

I don't see anything of the sort here: POSIX does not require making open/write/close/rename atomic, but it certainly does not forbid it. And it's a useful thing to have, so why not? It would be best to actually document this behaviour, of course - after that applications can safely rely on it, and other systems can implement it as well if they wish. We even have a nice flag to disable this extension if someone wants that :-)
Posted Mar 15, 2009 22:24 UTC (Sun) by bojan (subscriber, #14302)
Which is exactly what our applications are doing. POSIX says, commit. We don't and then we blame others for it.
This is the same thing HTML5 is doing
Posted Mar 15, 2009 22:33 UTC (Sun) by khim (subscriber, #9252)
Sorry, but it's not a problem with POSIX or the FS - it's a problem with a number of applications. Once a lot of applications start to depend on some weird feature (content sniffing in the case of HTML, atomicity of open/write/close/rename in the case of filesystems) it makes no sense to try to fix them all. Much better to document it and make it official. This is what Microsoft did with a lot of "internal" functions in MS-DOS 5 (and it was praised for it, not ostracized), this is what HTML is doing in HTML5, and this is what Linux filesystems should do.

Was it a good idea to depend on said atomicity? Maybe, maybe not. But the time to fix these problems has come and gone - today it's much better to extend the spec.
Posted Mar 15, 2009 23:37 UTC (Sun) by bojan (subscriber, #14302)
Time to fix these problems using the existing API is now, because right now we have the attention of everyone on how to use the API properly. To the credit of some in this discussion, bugs are already being fixed in Gnome (as I already mentioned in another comment). I also have bugs to fix in my own code - there is no denying that :-(
In general, I agree with you on extending the spec. But, before the spec gets extended officially, we need to make sure that _every_ POSIX compliant file system implements it that way. Otherwise, apps depending on this new spec will not be reliable until that's the case. So, can we actually make sure that's the case? I very much doubt it. There are a lot of different systems out there implementing POSIX, some of them very old. Auditing all of them and then fixing them may be harder than fixing the applications.
Why do we need such blessing?
Posted Mar 16, 2009 0:05 UTC (Mon) by khim (subscriber, #9252)
Linux extends POSIX all the time. New syscalls, new features (things like "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait."), etc. If an application wants to use such an "extended feature" - it can do so; if not - it can use POSIX-approved features only.

As for old POSIX systems... it's up to application writers again. And you can be pretty sure A LOT OF them don't give a damn about POSIX compliance. They are starting to consider Linux as a third platform for their products (the first two are obviously Windows and MacOS, in that order), but if you try to talk to them about POSIX it'll just lead to the removal of Linux from the list of supported platforms. Support for many distributions is already hard enough, support for some exotic filesystems "we'll think about it but don't hold your breath...", support for old exotic POSIX systems...

Now - the interesting question is: do we welcome such selfish developers or not? This is a hard question, because the answer "no, they should play by our rules" will just lead to an exodus of users - because they need these applications, and WINE is not a good long-term solution...
Posted Mar 15, 2009 22:05 UTC (Sun) by dcoutts (guest, #5387)
As for a new API, yes, that'd be great. There are doubtless other situations where it would be useful to be able to constrain write re-ordering. For example for writes within a single file if we're implementing a persistent tree structure where the ordering is important to provide atomicity in the face of system failure.
Having a nice new API does not mean that the obvious cases that app writers have been using for ages are wrong. We should just insert the obvious write barriers in those cases.
Posted Mar 16, 2009 4:52 UTC (Mon) by dlang (✭ supporter ✭, #313)
so everything that you are screaming that the OS should guarantee can be broken by the hardware after the OS has done its best.
you can buy/configure your hardware to not behave this way, but it costs a bunch (either in money or in performance). similarly you can configure your filesystem to give you added protection, at a significant added cost in performance.
Posted Mar 16, 2009 11:00 UTC (Mon) by forthy (guest, #1525)
Any reasonable hard disk (SATA, SCSI) has write barriers which allow file system implementers to actually implement atomicity.
Posted Mar 15, 2009 23:51 UTC (Sun) by vonbrand (subscriber, #4458)
I just don't understand all this "extN isn't crash-proof" whining...
Yes, Linux systems do crash on occasion. It is thankfully very rare.
Yes, hardware does fail. Even disks do fail. Yes, if you are unlucky you will lose data. Yes, the system could fail horribly and scribble all over the disk. Yes, the operating system could mess up its internal (and external) data structures.
It is just completely impossible for the operating system to "do the right thing with respect to whatever data the user values more", even more so in the face of random failures. Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.
Asking Linux developers to create some Linux-only beast of a filesystem in order to make application developer's life easier doesn't cut it, there are other operating systems (and Linux systems with other filesystems) around, and always will be. Plus asking for a filesystem that is impossible in principle won't get you too far either.
Posted Mar 16, 2009 0:08 UTC (Mon) by man_ls (subscriber, #15091)
Listen, the sky might fall on our heads tomorrow and eventually we are all to die, we understand that. But until then we really want our filesystems to do atomic renames in the face of a crash (i.e. what the rest of the world [except POSIX] understands as "atomic"). Not durable, not crash-proof, not magically indestructible -- just all-or-nothing. Atomic.
Posted Mar 16, 2009 0:26 UTC (Mon) by khim (subscriber, #9252)
> Yes, Linux systems do crash on occasion. It is thankfully very rare.

Depends on what hardware and what kind of drivers you have.

> Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.

The problem is: a fast filesystem is useless if it can't keep my data safe. Microsoft knows this - that's why you don't need to explicitly unmount a flash drive there. Yes, the cost is huge: it means flash wears down faster and speed is horrible - but anything else is unacceptable. Oh, and I know A LOT OF users who just turn off the computer at the end of the day. This problem is somewhat mitigated by the design of current systems (the "power off" button is actually a "shutdown" button), but people are finding ways to cope: they just switch power to the desk.

The same thing applies to developers. They are lazy. Most application writers do not use fsync and do not check the error code from close. Yet if data is lost - the OS will be blamed. Is it fair to OS and FS developers? Not at all! Can it be changed? Nope. Life is unfair - deal with it.

The whining started when it was found out that the new filesystem can lose valuable data - where ext3 never does it in this fashion (it can do this with O_TRUNC, but not with rename). This is a pretty serious regression to most people. The approach "let's fix thousands upon thousands of applications" (including proprietary ones) was thankfully rejected. This is a good sign: it means Linux is almost ready to be usable by normal people. Last time such a problem happened (the OSS->ALSA switch) the offered solution was beyond the pale.
Posted Apr 8, 2009 15:30 UTC (Wed) by pgoetz (subscriber, #4931)
Posted Mar 15, 2009 13:14 UTC (Sun) by bojan (subscriber, #14302)
As long as processes running on the system can see a consistent picture (and they can), according to POSIX it is.
> not only the filename must be valid but the contents of the file as well
From the point of view of any process running on the system, it is.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds