
Better than POSIX?

Posted Mar 17, 2009 15:42 UTC (Tue) by quotemstr (subscriber, #45331)
Parent article: Better than POSIX?

Thank you for a balanced discussion of one of the most inflammatory technical (as opposed to legal or social) issues in recent memory. The "worse is better" approach is not unique to unixland, and in the days of small computers with limited resources, counting on applications to not do silly things made sense.

But I firmly believe that today the kernel can provide a great deal of additional robustness at practically no performance cost. An "ordered rename" is a no-brainer. Not only do existing applications suddenly do the right thing, but also an "ordered rename" allows application developers to inform the kernel of constraints that are simply impossible to express when applications are required to fsync before every useful rename.

Some people say that a non-ordered rename gives the kernel more freedom to optimize. That view is a red herring: with a non-ordered rename, applications must fsync before the rename to have sensible semantics. So really, the choice isn't between an ordered and a non-ordered rename, but between fsync-unordered_rename and ordered_rename; the latter actually gives the kernel greater latitude in optimizing IO.
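For reference, the pattern applications are stuck with under a non-ordered rename looks roughly like the Python sketch below. The helper name `replace_file_durably` and the `.tmp-` prefix are made up for illustration; the calls themselves (`os.fsync`, `os.replace`) are standard library. Note that the fsync asks for more than ordering: it makes the caller wait until the data is on stable storage, which is a stronger constraint than "write the data before the rename".

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` using write-temp, fsync, rename.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Explicit fsync: blocks until the file data reaches disk.
            # This is the step a non-ordered rename forces on callers.
            os.fsync(tmp.fileno())
        # Atomically swap the new file into place (rename(2) on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Under the ordered-rename semantics argued for here, an application that only needs "either the old contents or the new contents after a crash" could drop the `os.fsync` line and leave the scheduling of the writes to the kernel.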

An ordered rename would be either neutral or beneficial in every real-world situation. Here's my challenge to anyone reading this: come up with a non-contrived scenario in which an ordered rename (i.e., an implicit write barrier) would be harmful.



Better than POSIX?

Posted Mar 17, 2009 17:28 UTC (Tue) by ms (subscriber, #41272)

Is no one doing transactional filesystems? Surely that's ultimately where we're going to end up. Behavior is well defined and understood for databases, so why don't we just adopt them? Apologies if I'm just repeating other people's thoughts from elsewhere - I'm not trying to extend the flames to this very nice and balanced piece.

Transactional filesystems

Posted Mar 17, 2009 17:56 UTC (Tue) by butlerm (subscriber, #13312)

Doing this efficiently requires the adoption of more filesystem internal
transaction processing techniques - meta data undo in particular.

However, "transactional filesystem" usually refers to a much more
complicated setup that allows an arbitrary number of data and meta data
operations to commit or rollback as a group. That is an order of magnitude
more complicated than what would be required to efficiently preserve the
atomicity of file rename operations, for example.

That might well be standard a couple of decades down the road - Google
Transactional NTFS for an example - but for now the ability to provide
atomicity without durability would be a major improvement that has a
fraction of the complexity.

Transactional filesystems

Posted Mar 18, 2009 15:19 UTC (Wed) by alvherre (subscriber, #18730)

> Doing this efficiently requires the adoption of more filesystem internal
> transaction processing techniques - meta data undo in particular.

Not so. PostgreSQL, for example, implements transactional semantics without needing metadata undo.

Transactional filesystems

Posted Mar 20, 2009 20:54 UTC (Fri) by butlerm (subscriber, #13312)

PostgreSQL is not a filesystem. If it were pretending to be one, it would
accomplish what it does by "meta-data" undo, where the meta data of the
higher level filesystem was native to PostgreSQL (i.e. stored in table
rows), as opposed to the completely separate and irrelevant meta data of
the lower level filesystem PostgreSQL was running on top of.

So of course you can implement meta data undo on any filesystem you please,
just as long as the meta data you are referring to is not the meta data of
the filesystem itself.

Transactional filesystems

Posted Mar 21, 2009 18:00 UTC (Sat) by nix (subscriber, #2304)

I don't see why you can't use MVCC for filesystems, since MVCC could be
regarded as a means of mixing the undo log into the data store itself,
eliminating it as a separate entity.

(As for vacuuming, do it incrementally and the data volume pretty much
doesn't matter; you just work through it bit by bit. PostgreSQL scales to
silly data volumes already...)
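
A toy illustration of that idea in Python, under heavy assumptions: everything is in memory, transaction IDs are a simple counter, and the class and method names (`MVCCStore`, `vacuum_step`) are invented for this sketch. It is not how PostgreSQL or any real filesystem lays things out; it only shows that old versions kept alongside new ones can play the role of an undo log, and that cleanup can proceed a few keys at a time.

```python
class MVCCStore:
    """Toy multi-version store (hypothetical, in-memory only)."""

    def __init__(self):
        # key -> list of (commit_txid, value), oldest first.
        self.versions = {}
        self.next_txid = 1

    def commit(self, writes):
        """Append new versions for all keys in `writes` under one txid."""
        txid = self.next_txid
        self.next_txid += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((txid, value))
        return txid

    def snapshot(self):
        """Return the txid of the latest committed transaction."""
        return self.next_txid - 1

    def read(self, key, snapshot):
        """Return the newest value committed at or before `snapshot`."""
        for txid, value in reversed(self.versions.get(key, [])):
            if txid <= snapshot:
                return value
        return None

    def vacuum_step(self, oldest_snapshot, keys):
        """Incremental vacuum: for the given keys only, drop versions
        that no snapshot at or after `oldest_snapshot` can see."""
        for key in keys:
            chain = self.versions.get(key)
            if not chain:
                continue
            keep_from = 0
            for i, (txid, _) in enumerate(chain):
                if txid <= oldest_snapshot:
                    # Newest version still visible to the oldest reader.
                    keep_from = i
            self.versions[key] = chain[keep_from:]
```

A reader holding an older snapshot keeps seeing the old value until it lets go, with no separate undo log involved, and `vacuum_step` can be called on a handful of keys per pass rather than sweeping the whole store.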

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds