Posted Mar 14, 2009 21:25 UTC (Sat) by bojan (subscriber, #14302)
> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.
Yep, very true.
But no zero-length file, which was the original problem. Essentially, you will get at least _something_.
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
Who knows, maybe we'll get a proper API for that behaviour out of this discussion.
Where did the correctness go?
Posted Mar 14, 2009 21:39 UTC (Sat) by foom (subscriber, #14868)
> Essentially, you will get at least _something_.
There's no guarantee of that. A filesystem could simply erase itself upon unexpected
poweroff/crash. *Anything* better than that is an implementation detail.
Where did the correctness go?
Posted Mar 15, 2009 1:58 UTC (Sun) by bojan (subscriber, #14302)
I knew someone was going to have a silly comment here. I was expecting, however, that it would be more technical, along the lines of "see, you cannot rely on it after all". Just for the record, the first rename emacs does is optional (in order to get the backup file) and would not be done for configuration files, hence full atomicity and durability.
Where did the correctness go?
Posted Mar 15, 2009 2:42 UTC (Sun) by njs (guest, #40338)
> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.
Yeah, 3.f is supposed to say "link", not "rename". (Programming against POSIX correctly makes those Raymond Smullyan books seem like light reading. If only everything else weren't worse...)
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
The problem is that a filesystem has a bazillion mostly independent "transactions" going on all the time, and no way to tell which ones are actually dependent on each other. (Besides, app developers would just screw up their rollback-and-retry logic anyway...)
Completely off the wall solution: move to a plan9/capability-ish system where apps all live in little micro-worlds and can only see the stuff that is important to them (this is a good idea anyway), and then use these to infer safe but efficient transaction domain boundaries. (Anyone looking for a PhD project...?)
Transactions, ordering, rollback...
Posted Mar 15, 2009 8:09 UTC (Sun) by Pc5Y9sbv (guest, #41328)
The entire point of transactions is to say "these operations are related to one another" by opening a transaction and performing multiple read/write actions which populate a data dependency map. Then the commit says that either the dependency map is satisfied and all writes are made, or no writes are made. Thus it would not be difficult to obtain the map from the application, but it is a huge expansion of scope for the filesystem abstraction.
However, as we were discussing further up the page, a write-barrier is really all that is needed for the intuitive crash-proof behavior desired by everything doing the "create a temp file; relink to real name". An awful lot of the discussion seems to conflate request ordering with synchronous disk operations, when all we really desire is ordering constraints to flow through the entire filesystem and block layer to the physical medium.
All people want is for the POSIX ordering semantics of "file content writes" preceding "file name linkage" to be preserved across crashes. It is OK if the crash drops cached data and forgets the link, or the data and link, but not the data while preserving the link.
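For reference, a minimal sketch of the "create a temp file; relink to real name" pattern under discussion. As far as I know there is no portable barrier-only call, so this sketch falls back on fsync(), which is the synchronous operation the paragraph above says should not really be needed. The function name and the ".new" suffix are made up for illustration.

```python
import os

def replace_atomically(path, data):
    """Replace the contents of `path` with `data` via a temp file and rename."""
    tmp = path + ".new"  # hypothetical temp-name convention, not a standard
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        # Force the file contents to disk *before* the name is switched over.
        # This is stronger (and slower) than a pure ordering constraint.
        os.fsync(f.fileno())
    # rename(2) semantics on POSIX: readers see either the old or the new file.
    os.replace(tmp, path)
```

Without the fsync() call, whether the data reaches the disk before the rename does is up to the filesystem, which is exactly the ordering question being argued here.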
Where did the correctness go?
Posted Mar 15, 2009 12:26 UTC (Sun) by nix (subscriber, #2304)
Even the off-the-wall solution won't work, because the reason for
transactions getting entangled with each other is dependencies *within the
fs metadata*. i.e. what you'd actually need to do is put off *all*
operations on fs metadata areas that may be shared with other transactions
until such time as the entire transaction is ready to commit. And that's a
huge change.
Where did the correctness go?
Posted Mar 16, 2009 3:13 UTC (Mon) by k8to (subscriber, #15413)
There's an easy way to avoid that problem.
link the name to name~
rename the name.new to name
Yes, explicit transaction support in the filesystem would be great, though hammering out the API will probably be hairy.
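A rough sketch of the link-then-rename sequence above, assuming the new contents are written to name.new first. The function name is hypothetical, and the fsync() is an extra precaution that is not part of the two steps listed.

```python
import os

def replace_keeping_backup(name, data):
    """Write `data` to `name`, keeping the previous version as `name~`."""
    new = name + ".new"
    backup = name + "~"
    with open(new, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # precaution: get the data on disk before renaming
    if os.path.exists(name):
        # link() refuses to overwrite, so drop any stale backup first.
        try:
            os.unlink(backup)
        except FileNotFoundError:
            pass
        # Step 1: hard-link the current file to the backup name.
        # `name` itself is never removed, so it exists at every point.
        os.link(name, backup)
    # Step 2: rename name.new over name (atomic replacement on POSIX).
    os.replace(new, name)
```

Because the backup is made with a hard link rather than a rename, a crash between the two steps leaves name, name~ and name.new all present, instead of only name~ and name.new.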