Atomicity

Posted Sep 17, 2009 8:17 UTC (Thu) by Nicolas.Boulay (guest, #59722)
Parent article: POSIX v. reality: A position on O_PONIES

fsync() is a way of demanding absolute priority for durability over whatever write ordering the file system would otherwise choose for performance.

rename() is a kind of trick to minimise the problem of ending up with an empty file after a power failure.

But what an application writer really wants is a fast file system that provides _atomicity_: that is, he wants either the previous state of the file or the new content from the last sys_write(), and nothing else.

Instead of fsync(), I think we need an fdone(), which would mean "wait for the transaction to complete" rather than "flush everything quickly".
If fdone() takes too long, I could use threads. If fdone() takes time, it is for the sake of bandwidth optimisation. One great Linux optimisation for systems without important data is to map fsync() to a no-op function; then everything flies :)
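To illustrate the "I could use threads" part: fdone() does not exist, so the sketch below can only stand in for it with an ordinary fsync() run on a worker thread. The name fdone_async and the single-worker pool are my own assumptions, and unlike the fdone() proposed above this still forces a flush; it only shows how the wait can be moved off the caller's thread.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor

# One background worker, so flush requests complete in submission order.
_flusher = ThreadPoolExecutor(max_workers=1)


def fdone_async(fd: int) -> Future:
    """Hypothetical stand-in for fdone(): run os.fsync(fd) on a worker thread.

    The returned Future completes once fsync() has returned; calling
    .result() on it re-raises any OSError raised by fsync().
    """
    return _flusher.submit(os.fsync, fd)
```

The caller keeps working after fdone_async(fd) and calls .result() on the returned future only at the point where it really needs the data to be on disk.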

Is it costly to give a single sys_write() the same behaviour as open()/write()/rename()?

Atomicity

Posted Sep 25, 2009 3:39 UTC (Fri) by xoddam (guest, #2322) (2 responses)

We already have an API for atomicity in POSIX. It is called rename().

Rename is *not* a 'kind of trick'. By specification, it is guaranteed to be atomic in the face of concurrent readers on a working system. Unfortunately the specification has nothing to say about it with respect to unclean shutdown.

Extending the atomicity of rename() so that it still applies in the face of a successful recovery (such as a journal replay) after an unclean shutdown is perfectly logical.
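For reference, on file systems that do not make that promise, the application-side pattern usually looks something like the sketch below. The helper name atomic_write is mine, the directory fsync assumes a POSIX system, and this is an illustration rather than a hardened implementation.

```python
import contextlib
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so that readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the rename
    # never crosses a file system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new data out before the rename
        os.replace(tmp_path, path)  # rename(2) semantics on POSIX
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # Assumption: POSIX. Sync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If rename() were guaranteed to be atomic across a journal replay, the fsync() of the temporary file would be there only for durability, not to avoid ending up with an empty file.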

Atomicity

Posted Oct 26, 2009 10:09 UTC (Mon) by Nicolas.Boulay (guest, #59722) (1 response)

You completely forget the case where the file is too big to be copied.

Copying kilobytes is fine; copying megabytes is not.

This is typical of any database work. In that case, rename() is of no use.

Atomicity

Posted Oct 30, 2009 4:30 UTC (Fri) by xoddam (guest, #2322)

Database implementors have many choices for implementing data stores and any transactional semantics that they need.

Databases traditionally use very large files because their implementors have chosen to re-implement filesystem functionality at the low level for performance reasons.

Most often they use their own journalling implementations and fsync(). This is of course legitimate. But using filesystem-level rename to provide atomicity would also be perfectly reasonable.

The size of the renamed and replaced file is an implementation detail only. Rename doesn't impose a requirement to copy large hunks of data only to throw it away. The unit of replacement might be a btree node, for example.
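A minimal sketch of what a small unit of replacement could look like, assuming a made-up layout in which every btree node is stored as its own small file (the name replace_node and the node-NNNNNNNN naming scheme are hypothetical):

```python
import os


def replace_node(store_dir: str, node_id: int, payload: bytes) -> None:
    """Atomically replace a single node file; other nodes are never touched."""
    final_path = os.path.join(store_dir, f"node-{node_id:08d}")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as tmp:
        tmp.write(payload)
        tmp.flush()
        os.fsync(tmp.fileno())  # node contents reach the disk before the rename
    os.replace(tmp_path, final_path)  # only this one small file is swapped
```

Updating one node rewrites a few kilobytes, whatever the total size of the store.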

Nothing forces an implementor to use large files for any particular purpose.