On the contrary, every decent database in the world does this, and will run
circles around contemporary filesystems for comparable synchronous and
asynchronous operations. Check out Gray and Reuter's Transaction Processing
book for details. The edition I have was published in 1993.
There are two basic problems here:
The first is that fsync is a ridiculously *expensive* way to get the needed
functionality. The second is that most filesystems cannot implement atomic
operations any other way (i.e. without forcing both the metadata and the
data and any other pending metadata changes to disk).
fsync is orders of magnitude more expensive than necessary for the case
under consideration. A properly designed filesystem (i.e. one with
metadata undo) can issue an atomic rename in microseconds. The only option
that POSIX provides can take hundreds if not thousands of milliseconds on a
busy filesystem.
Databases do *synchronous*, durable commits on busy systems in ten
milliseconds or less. That is ten to twenty times faster than
contemporary filesystems can do an fsync under comparable conditions.
Even that is still a hundred times more expensive than necessary, because
synchronous durability is not required here. Just atomicity. Nothing has
to hit the disk. No synchronous I/O overhead. Just metadata undo
capability.
Posted Mar 17, 2009 7:18 UTC (Tue) by dlang (✭ supporter ✭, #313)
how do you think the databases make sure their data is on disk?
they use f(data)sync calls to the filesystem.
so your assertion that databases can make atomic changes to their data faster than the filesystem can do an fsync means that either you don't know what you are saying, or you don't really have the data safety that you think you have.
Where did the correctness go?
Posted Mar 17, 2009 8:31 UTC (Tue) by butlerm (subscriber, #13312)
ACID has four letters for a reason. Atomicity is logically independent of
durability. A decent database will let you turn (synchronous) durability
off while fully guaranteeing atomicity and consistency.
The reason is that with a typical rotating disk, any durable commit is
going to take at least one disk revolution time, i.e. about 10 ms. Single
threaded atomic (but not necessarily durable) commits can be issued a
hundred times faster than that, because no synchronous disk I/O is required
at all.
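
To put rough numbers on that, here is a back-of-the-envelope sketch. The drive speeds are just common examples I picked, not measurements, and the "hundred times faster" figure is the estimate from the paragraph above, not a benchmark.

```python
def revolution_time_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


# Illustrative drive speeds only (assumed, not taken from any particular system).
for rpm in (5400, 7200, 15000):
    durable_ms = revolution_time_ms(rpm)
    # The estimate above: atomic-only commits about a hundred times cheaper.
    atomic_only_us = durable_ms * 1000.0 / 100.0
    print(f"{rpm:>5} rpm: one revolution = {durable_ms:.2f} ms, "
          f"atomic-only commit budget = {atomic_only_us:.0f} us")
```

So "about 10 ms" sits between a 5400 rpm and a 7200 rpm drive, and a hundredth of that is on the order of a hundred microseconds per commit.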
Where did the correctness go?
Posted Mar 17, 2009 9:48 UTC (Tue) by dlang (✭ supporter ✭, #313)
and all the filesystems (including ext4 prior to the patches) provide the atomicity you are looking for.
it's just the durability in the face of a crash that isn't there. but it wasn't there on ext3 either (there was just a smaller window of vulnerability), and even if you mount your filesystem with the sync option many commodity hard drives would not let you disable their internal disk caches, and so you would still have the vulnerability (with an even smaller window)
Where did the correctness go?
Posted Mar 17, 2009 17:30 UTC (Tue) by butlerm (subscriber, #13312)
"and all the filesystems (including ext4 prior to the patches) provide the
atomicity you are looking for."
I am afraid not. Atomic means that the pertinent operation always appears
either to have completed OR to have never started in the first place. If
the system recovers in a state where some of the effects of the operation
have been preserved and other parts have disappeared, that is not atomic.
The operation here is replacing a file with a new version. Atomic
operation means when the system recovers there is either the old version or
the new version, not any other possibility. You can do this now of course,
you simply have to pay the price for durability in addition to
atomicity.
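
For concreteness, the "pay for durability too" approach looks roughly like the sketch below. The helper name replace_file_durably is made up for illustration. It writes the new version to a temporary file in the same directory, fsyncs it, and then renames it over the old one. The fsync is the expensive step being complained about here.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) via write-temp, fsync, rename.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the new contents to disk before the rename is issued.
            os.fsync(tmp.fileno())
        # rename(2) on POSIX: the name switches from old file to new in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Without the fsync, whether a crash can leave the new name pointing at empty or partial data depends on the filesystem, which is exactly the point in dispute.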
Per accident of design, filesystems require a much higher price (in terms
of latency) to be paid for durability than databases do. That
factor is multiplied by a hundred or more if atomicity is required, but
durability is not.
Where did the correctness go?
Posted Mar 17, 2009 17:38 UTC (Tue) by butlerm (subscriber, #13312)
I refer to filesystem *meta-data* operations of course.
Where did the correctness go?
Posted Mar 17, 2009 20:42 UTC (Tue) by nix (subscriber, #2304)
Sure. I meant nobody else had done it *in a filesystem*.