Atomicity vs durability

Posted Mar 15, 2009 9:44 UTC (Sun) by alexl (guest, #19068)
In reply to: Atomicity vs durability by bojan
Parent article: Ts'o: Delayed allocation and the zero-length file problem

I don't think this is a correct description of the specs.

POSIX guarantees an atomic replacement of the file, and this means *both* the data and the filenames[1]. However, POSIX doesn't specify anything about system crashes. So, this guarantee is only valid for the non-crashing case.

For the crashing case POSIX doesn't guarantee anything. In fact, many POSIX filesystems such as ext2 can (correctly, by the spec) result in a total loss of all filesystem data in the case of a system crash. And in fact, this is allowed even if the application fsync()ed some data before the crash.

Now, in order to have some way of getting better guarantees than this, POSIX also supplies fsync(), which guarantees that the files have been written to disk. However, nowhere in the specs of fsync() does it say that this will survive a system crash. Of course, if it *does* survive, it is nice to have the fsync() guarantee, because that means if the metadata change survived we're more likely to get the whole new file.

But, your discussions about how the "atomic" part is only referring to the filenames is bullshit. POSIX does give full guarantees for both filename and content in the case it specifies. Everything else is up to the implementation. This is why it's a good idea for a robust filesystem to give the write-data-before-metadata-on-rename guarantee, since it turns a non-crash POSIX guarantee into a post-crash guarantee. (Of course, this is by no means necessary, even ext2 with full data loss on crash is POSIX compliant, it's just a *good* implementation.)

[1] From the POSIX spec:
If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.
(Notice how this has no separation about "filenames" and "data")



Atomicity vs durability

Posted Mar 15, 2009 10:18 UTC (Sun) by bojan (subscriber, #14302) [Link] (4 responses)

> (Notice how this has no separation about "filenames" and "data")

Notice how it doesn't say _anything_ at all about the content of the file _on_ _disk_ that is being renamed into the old file. That is because in order to see the file _durably_, you have to commit its content. Completely unrelated to committing the _rename_.

Just because another process can see the link (and access the data correctly, which may still just be cached by the kernel) does _not_ mean data reached the disk yet.

BTW, thanks for working on fixes of this in Gnome.

Atomicity vs durability

Posted Mar 15, 2009 10:29 UTC (Sun) by alexl (guest, #19068) [Link] (3 responses)

Of course not. It says *nothing* about what's on disk, because durability wrt system crashes (not process crashes) is not part of POSIX. So, any behaviour better than full data loss on crash is a robustness property of the implementation.

I argue that a robust, useful filesystem should give data-before-metadata-on-rename guarantees, as that would make it a better filesystem. And without this guarantee I'd rather use another filesystem for my data. This is clearly not a requirement, and important code should still fsync() to handle other filesystems. But it's still the mark of a "good" filesystem.
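
The write-then-rename pattern argued about in this thread can be sketched as follows. This is a minimal illustration, not code from any application mentioned here: the function name `atomic_replace`, the `.tmp-` prefix and the demo file name are assumptions made for the example. It writes the new content to a temporary file in the same directory, calls fsync() on it, and only then renames it over the target.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # ask the kernel to commit the file data
        os.replace(tmp_path, path)  # rename(): swap the name in one step
    except BaseException:
        # On any failure, remove the temporary file if it is still there.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


# Hypothetical demo: replace the same file twice.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "settings.conf")
atomic_replace(demo_path, b"version=1\n")
atomic_replace(demo_path, b"version=2\n")
with open(demo_path, "rb") as f:
    final_content = f.read()
```

Note that `mkstemp` creates the file with owner-only permissions, so a real implementation would probably also want to copy the mode and ownership of the file being replaced.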

Atomicity vs durability

Posted Mar 15, 2009 10:37 UTC (Sun) by bojan (subscriber, #14302) [Link] (2 responses)

Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)

Atomicity vs durability

Posted Mar 15, 2009 11:04 UTC (Sun) by man_ls (guest, #15091) [Link]

That is why we are willing to replace it in the first place! But not if it means losing its good properties in the process (be they in the spec or not).

Atomicity vs durability

Posted Mar 20, 2009 11:24 UTC (Fri) by anton (subscriber, #25547) [Link]

> Similarly, any file system that on fsync() locks up for several seconds is not a very good one ;-)

If the fsync() has to write out 500MB, I certainly would expect it to take several seconds and the fsync call to block for several seconds. fsync() is just an inherently slow operation. And if an application works around the lack of robustness of a file system by calling fsync() frequently, the application will be slow (even on robust file systems).

Atomicity vs durability

Posted Mar 15, 2009 10:36 UTC (Sun) by bojan (subscriber, #14302) [Link] (24 responses)

> But, your discussions about how the "atomic" part is only refering to the filenames is bullshit.

Consider this. A file has been renamed and just then kernel decides that the directory will be evicted from the cache (i.e. committed to disk). What will be written to disk? The _new_ content of the directory will be written out to disk, because any other process looking for the file must see the _new_ file (and must never _not_ see the file). At the same time, the content of the file can still be cached just fine and _everything_ is atomically visible by all processes.

And yet, the atomicity of rename _only_ refers to filenames.

Atomicity vs durability

Posted Mar 15, 2009 11:02 UTC (Sun) by man_ls (guest, #15091) [Link] (23 responses)

No, you are just blurring the issue; transactionality does not work that way. I think the interpretation of alexl is correct here. It does not matter if the contents of the file are still cached; other processes can see either the old contents or the new contents, but not both and not a broken file. The rename cannot be atomic if the name points to e.g. an empty file; not only the filename must be valid but the contents of the file as well, up to the point when the rename is done. It is no good if the file appears as it was before the atomic operation was issued (e.g. empty).

With fsync you make the contents persistent i.e. durable, but the operation should be atomic even without the fsync.

Atomicity vs durability

Posted Mar 15, 2009 11:34 UTC (Sun) by dlang (guest, #313) [Link] (21 responses)

what you are missing is that unless the system crashes everything does work. processes on the system see either the old file content or the new file content.

the only time they could see a blank or corrupted file is if the system crashes.

so the atomicity is there now

it's the durability that isn't there unless you do f(data)sync on the file and directory (and have hardware that doesn't do unsafe caching, which most drives do by default)
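
A rough sketch of the "sync the file and the directory" step described above. The helper names `fsync_directory` and `durable_rename` and the demo file names are made up for illustration. It assumes a Linux-like system where a directory can be opened read-only and passed to fsync(), and it does nothing about a drive's own write cache.

```python
import os
import tempfile


def fsync_directory(directory: str) -> None:
    """Ask the kernel to commit a directory's entries, e.g. a finished rename."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(src: str, dst: str) -> None:
    """Rename src over dst, then sync the directory that holds dst."""
    os.replace(src, dst)
    fsync_directory(os.path.dirname(os.path.abspath(dst)))


# Hypothetical demo: sync the file data first, then the rename.
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "report.tmp")
dst = os.path.join(demo_dir, "report.txt")
with open(src, "wb") as f:
    f.write(b"final draft\n")
    f.flush()
    os.fsync(f.fileno())  # file content first
durable_rename(src, dst)  # then the name
names_after = sorted(os.listdir(demo_dir))
```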

Atomicity vs durability

Posted Mar 15, 2009 12:48 UTC (Sun) by man_ls (guest, #15091) [Link] (20 responses)

Exactly. What is missing now is atomicity in the face of a crash. To quote myself from a few posts above:

> "Atomic without a crash" is not good enough; "atomic" means that a transaction is committed or not, no matter at what point we are -- even after a crash.

Durability (what fsync provides) is not needed here; durability means that the transaction is safely stored to disk. What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.

Even if the POSIX standard does not speak about system crashes it is good engineering to take them into account IMHO.

Atomicity vs durability

Posted Mar 15, 2009 13:06 UTC (Sun) by bojan (subscriber, #14302) [Link] (19 responses)

> What we people are requesting from ext4 is that the property of atomicity be preserved even after a crash.

Which is not what POSIX requires.

Atomicity vs durability

Posted Mar 15, 2009 13:13 UTC (Sun) by man_ls (guest, #15091) [Link] (17 responses)

That is right, that is why we are not using ext2 (which is POSIX-compliant), FreeBSD (which is POSIX-compliant) or even Windows Vista (which can be made to be POSIX-compliant). We are running a journalling file system in the (apparently silly) hope that the system will hold our data and then give it back.

Atomicity vs durability

Posted Mar 15, 2009 13:19 UTC (Sun) by bojan (subscriber, #14302) [Link] (13 responses)

Look, I'm all for reliability. But, if the manual says: "fsync if you want your data on disk" and we don't fsync, then it is us that are creating the problem.

I think we should come up with a new API that guarantees what people really want. Making the existing API do that on a particular FS is just going to make applications non-portable to any FS that doesn't work that way using existing POSIX API. We've seen this with XFS. Who knows what's lurking out there. Better do the proper thing, fsync and be done with it. Then we can invent the new, better, smarter API.

Atomicity vs durability vs reliability

Posted Mar 15, 2009 13:35 UTC (Sun) by man_ls (guest, #15091) [Link] (9 responses)

No, you are not all for reliability if you cannot see beyond your little POSIX manual. Or if you don't care about system crashes because the manual is silent about this particular point. Sorry to break it to you: reliability lies in little details such as having a predictable response to a crash, or surviving the crash while retaining all the nice properties.

> I think we should come up with a new API that guarantees what people really want.

APIs are good enough as they are -- we don't need a special "reliability API" so we can build a special "reliability manual" for guys who just follow the book.

> We've seen this with XFS.

Nope. What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in. With ext4 we are hoping to get to the point where the devs give in before they lose most of their user base. Just because ext4 is important for Linux and for our world domination agenda. Meanwhile you can keep waving the POSIX standard in our face. The POSIX standard seems to be about compatibility, not about reliability, and it should keep playing that role. Reliability is left as an exercise for the attentive reader. Let us hope that Mr Ts'o is attentive and can tell atomicity, reliability and durability apart.

Actually it's done deal...

Posted Mar 15, 2009 17:34 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

If you read the comments on tytso's blog you'll see that the current position is: "POSIX is right while applications are broken, yet we'll save them anyway". Even if the "proper way" is to fix thousands of applications, it's just not realistic - so ext4 (starting from 2.6.30) will try to save these broken applications by default. And if you want performance - there is a switch. Good enough for me. Can we close the discussion?

Actually it's done deal...

Posted Mar 15, 2009 21:10 UTC (Sun) by bojan (subscriber, #14302) [Link]

Exactly. Ted is a practical man, so he already put a workaround in place, until applications are fixed.

Sorry

Posted Mar 15, 2009 21:20 UTC (Sun) by man_ls (guest, #15091) [Link]

Sure, I have polluted the interwebs enough with my ignorance, and there is little chance to learn anything else.

Atomicity vs durability vs reliability

Posted Mar 15, 2009 21:06 UTC (Sun) by bojan (subscriber, #14302) [Link] (5 responses)

> No, you are not all for reliability if you cannot see beyond your little POSIX manual.

POSIX manual is not little ;-)

Seriously, we tell Microsoft that going out of spec is bad, bad, bad. But, we can go out of spec no problem. There is a word for that:

http://en.wikipedia.org/wiki/Hypocrisy

> What we have seen with XFS is how some anal-retentive developers lost most of their user base while trying to argue such points as "POSIX-compliance", and then they finally give in.

Yep, blame the people that _didn't_ cause the problem. We've seen that before.

Sorry, but I don't see it this way...

Posted Mar 15, 2009 22:08 UTC (Sun) by khim (subscriber, #9252) [Link] (4 responses)

I have yet to see anyone who asks Microsoft to never go beyond the spec. That would be just insane: if you can never add anything beyond what the spec says, how can any progress occur?

When Microsoft is blamed it's because Microsoft
1. Does not implement the spec correctly, or
2. Does not say which parts are spec requirements and which are extensions.

When Microsoft says "JNI is not sexy so we'll provide RMI instead" the ire is NOT about problems with RMI. Lack of JNI is to blame.

I don't see anything of the sort here: POSIX does not require open/write/close/rename to be atomic, but it certainly does not forbid it. And it's a useful thing to have, so why not? It would be best to actually document this behaviour, of course - after that applications can safely rely on it and other systems can implement it as well if they wish. We even have a nice flag to disable this extension if someone wants that :-)

Sorry, but I don't see it this way...

Posted Mar 15, 2009 22:24 UTC (Sun) by bojan (subscriber, #14302) [Link] (3 responses)

> 1. Does not implement spec correctly

Which is exactly what our applications are doing. POSIX says, commit. We don't and then we blame others for it.

This is the same thing HTML5 is doing

Posted Mar 15, 2009 22:33 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

Sorry, but it's not a problem with POSIX or the FS - it's a problem with the number of applications. Once a lot of applications start to depend on some weird feature (content sniffing in the case of HTML, atomicity of open/write/close/rename in the case of filesystems) it makes no sense to try to fix them all. Much better to document it and make it official. This is what Microsoft did with a lot of "internal" functions in MS-DOS 5 (and it was praised for it, not ostracized), this is what HTML is doing in HTML5 and this is what Linux filesystems should do.

Was it a good idea to depend on said atomicity? Maybe, maybe not. But the time to fix these problems has come and gone - today it's much better to extend the spec.

This is the same thing HTML5 is doing

Posted Mar 15, 2009 23:37 UTC (Sun) by bojan (subscriber, #14302) [Link] (1 responses)

> But the time to fix these problems has come and gone - today it's much better to extend the spec.

Time to fix these problems using the existing API is now, because right now we have the attention of everyone on how to use the API properly. To the credit of some in this discussion, bugs are already being fixed in Gnome (as I already mentioned in another comment). I also have bugs to fix in my own code - there is no denying that :-(

In general, I agree with you on extending the spec. But, before the spec gets extended officially, we need to make sure that _every_ POSIX compliant file system implements it that way. Otherwise, apps depending on this new spec will not be reliable until that's the case. So, can we actually make sure that's the case? I very much doubt it. There is a lot of different systems out there that are implementing POSIX, some of them very old. Auditing all of them and then fixing them may be harder than fixing the applications.

Why do we need such blessing?

Posted Mar 16, 2009 0:05 UTC (Mon) by khim (subscriber, #9252) [Link]

Linux extends POSIX all the time. New syscalls, new features (things like "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait."), etc. If application wants to use such "extended feature" - it can do this, if not - it can use POSIX-approved features only.

As for old POSIX systems... it's up to application writers again. And you can be pretty sure A LOT OF them don't give a damn about POSIX compliance. They are starting to consider Linux as a third platform for their products (the first two are obviously Windows and MacOS, in that order), but if you try to talk to them about POSIX it'll just lead to the removal of Linux from the list of supported platforms. Support of many distributions is already hard enough, support of some exotic filesystems "we'll think about it but don't hold your breath...", support for old exotic POSIX systems... fuggetaboudit!

Now - the interesting question is: do we welcome such selfish developers or not? This is a hard question because the answer "no, they should play by our rules" will just lead to an exodus of users - because they need these applications and WINE is not a good long-term solution...

Atomicity vs durability

Posted Mar 15, 2009 22:05 UTC (Sun) by dcoutts (subscriber, #5387) [Link] (2 responses)

Remember, we do not care if the data is on disk or not, just that if it does make it to disk that it preserves the atomic property we were after. All that needs to happen is for the rename not to be reordered in front of the write. That hardly restricts performance.

As for a new API, yes, that'd be great. There are doubtless other situations where it would be useful to be able to constrain write re-ordering. For example for writes within a single file if we're implementing a persistent tree structure where the ordering is important to provide atomicity in the face of system failure.
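
In the absence of such an ordering API, one common workaround for the single-file case is to frame each record with a length and checksum and to use fsync() as a crude ordering point. The sketch below is only an illustration of that idea, not something proposed in this thread: the record format, the names `append_record` and `read_records`, and the file name are all assumptions.

```python
import os
import struct
import tempfile
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(fd: int, payload: bytes) -> None:
    """Append one framed record and sync before any later record is issued."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    os.write(fd, record)  # sketch only: a short write is not retried here
    os.fsync(fd)          # ordering point: this record before the next one


def read_records(path: str) -> list:
    """Return every complete record, stopping at the first torn one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # incomplete or corrupt: treat it as never written
            records.append(payload)
    return records


log_path = os.path.join(tempfile.mkdtemp(), "tree.log")
fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
append_record(fd, b"node A")
append_record(fd, b"node B")
# Simulate a crash mid-record: the header promises 6 bytes, only 3 arrive.
os.write(fd, HEADER.pack(6, zlib.crc32(b"node C")) + b"nod")
os.close(fd)

recovered = read_records(log_path)
print(recovered)  # [b'node A', b'node B']
```

The cost is the one discussed elsewhere in this thread: every fsync() forces a full commit where a pure ordering constraint would have been enough.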

Having a nice new API does not mean that the obvious cases that app writers have been using for ages are wrong. We should just insert the obvious write barriers in those cases.

Atomicity vs durability

Posted Mar 16, 2009 4:52 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

remember that the drive has its own buffer (that usually isn't battery backed), and it will tell the OS that the data is written when it's in the buffer, not when it is on the disk. it then can re-order the writes to the disk.

so everything that you are screaming that the OS should guarantee can be broken by the hardware after the OS has done its best.

you can buy/configure your hardware to not behave this way, but it costs a bunch (either in money or in performance). similarly you can configure your filesystem to give you added protection, at a significant added cost in performance.

Atomicity vs durability

Posted Mar 16, 2009 11:00 UTC (Mon) by forthy (guest, #1525) [Link]

Any reasonable hard disk (SATA, SCSI) has write barriers which allow file system implementers to actually implement atomicity.

Atomicity vs durability

Posted Mar 15, 2009 23:51 UTC (Sun) by vonbrand (guest, #4458) [Link] (2 responses)

I just don't understand all this "extN isn't crash-proof" whining... Yes, Linux systems do crash on occasion. It is thankfully very rare. Yes, hardware does fail. Even disks do fail. Yes, if you are unlucky you will lose data. Yes, the system could fail horribly and scribble all over the disk. Yes, the operating system could mess up its internal (and external) data structures.

It is just completely impossible for the operating system to "do the right thing with respect to whatever data the user values more", even more so in the face of random failures. Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.

Asking Linux developers to create some Linux-only beast of a filesystem in order to make application developer's life easier doesn't cut it, there are other operating systems (and Linux systems with other filesystems) around, and always will be. Plus asking for a filesystem that is impossible in principle won't get you too far either.

Atomicity vs durability

Posted Mar 16, 2009 0:08 UTC (Mon) by man_ls (guest, #15091) [Link]

Yes, isn't it silly to ask for the moon like this? Apart from the fact that ext3 does exactly what we are asking for; and XFS since 2007; and now ext4 with the new patches. Oh wait... maybe you really didn't understand what we were asking for.

Listen, the sky might fall on our heads tomorrow and eventually we are all to die, we understand that. But until then we really want our filesystems to do atomic renames in the face of a crash (i.e. what the rest of the world [except POSIX] understands as "atomic"). Not durable, not crash-proof, not magically indestructible -- just all-or-nothing. Atomic.

YMMV

Posted Mar 16, 2009 0:26 UTC (Mon) by khim (subscriber, #9252) [Link]

> Yes, Linux systems do crash on occasion. It is thankfully very rare.

Depends on what hardware and what kind of drivers you have.

> Want performance? Then you have to do tricks caching/buffering data, disks are horribly _s_l_o_w_ when compared to your processor or memory.

The problem is: fast filesystem is useless if it can't keep my data safe. Microsoft knows this - that's why you don't need to explicitly unmount flash drive there. Yes, cost is huge, it means flash wears down faster and speed is horrible - but anything else is unacceptable. Oh, and I know A LOT OF users who just turn off computer at the end of day. This problem is somewhat mitigated by design of current systems ("power off" button is actually "shutdown" button), but people are finding ways to cope: they just switch power to the desk.

The same thing applies to developers. They are lazy. Most application writers do not use fsync and do not check the error code from close. Yet if data is lost - OS will be blamed. Is it fair to OS and FS developers? Not at all! Can it be changed? Nope. Life is unfair - deal with it.
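
For illustration, here is roughly what "checking everything" looks like when creating a new file. The names `write_all` and `save_new_file` and the demo file are hypothetical. The point is only that write, fsync and close can each fail, and none of those failures is swallowed.

```python
import os
import tempfile


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of `data`; os.write may accept only part of it."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def save_new_file(path: str, data: bytes) -> None:
    """Create `path`; raise OSError if write, fsync or close fails."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)
    except OSError:
        os.close(fd)
        raise
    # close() can report an error too (for example on some network
    # filesystems), so it is allowed to raise instead of being ignored.
    os.close(fd)


demo_path = os.path.join(tempfile.mkdtemp(), "notes.txt")
save_new_file(demo_path, b"saved with every step checked\n")
with open(demo_path, "rb") as f:
    saved = f.read()
```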

The whining started when it was found out that the new filesystem can lose valuable data - where ext3 never does it in this fashion (it can do this with O_TRUNC, but not with rename). This is a pretty serious regression to most people. The approach "let's fix thousands upon thousands of applications" (including proprietary ones) was thankfully rejected. This is a good sign: this means Linux is almost ready to be usable by normal people. Last time such a problem happened (the OSS->ALSA switch) the offered solution was beyond the pale.
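
The O_TRUNC case differs from rename because the old content is gone the moment the file is reopened for writing, before any new data exists. A small sketch of that window follows; the file name and contents are made up, and it shows only what another process would see at that instant, not what any particular filesystem leaves on disk after a crash.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "thesis.txt")

with open(path, "w") as f:
    f.write("eighteen months of work")

# Overwriting in place: mode "w" truncates the file as soon as it is opened.
f = open(path, "w")
size_during_rewrite = os.path.getsize(path)  # the old content is already gone
f.write("eighteen months of work, plus today's edits")
f.close()

size_after_rewrite = os.path.getsize(path)
print(size_during_rewrite, size_after_rewrite)
```

With write-to-temporary-then-rename there is no such window for running processes: the name always refers to a complete old or complete new file.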

Atomicity vs durability

Posted Apr 8, 2009 15:30 UTC (Wed) by pgoetz (guest, #4931) [Link]

Who gives a flying fruitcake about what POSIX requires?! It's not acceptable for a user to edit, say her thesis, which she's been working on for 18 months and which has been saved thousands of times, and -- upon system crash -- find that not only did she lose her most recent 15 minutes worth of changes (acceptable) but in fact THE ENTIRE FILE. Putting the onus on application developers to fsync early and often is beyond ridiculous.

Atomicity vs durability

Posted Mar 15, 2009 13:14 UTC (Sun) by bojan (subscriber, #14302) [Link]

> The rename cannot be atomic if the name points to e.g. an empty file

As long as processes running on the system can see a consistent picture (and they can), according to POSIX it is.

> not only the filename must be valid but the contents of the file as well

From the point of view of any process running on the system, it is.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds