Tux3: the other next-generation filesystem

Posted Dec 9, 2008 20:26 UTC (Tue) by martinfick (subscriber, #4455)
In reply to: Tux3: the other next-generation filesystem by rwmj
Parent article: Tux3: the other next-generation filesystem

You are correct, that is actually quite a bit more advanced than what I was proposing. But since I did not go into any details about what I was asking for, I can hardly object. :) The real problem with the above, apart from perhaps being difficult to achieve, is that it would likely break POSIX semantics!

The yum proposal probably assumes that I could have multiple writes interleaved with reads from the same locations, all succeeding within one transaction that might later be rolled back. POSIX requires that once a write succeeds, any read of the same location that succeeds after the write reports the newly written bytes. Returning the newly written bytes to any process, even the writer, while the transaction is still pending, and then rolling back the transaction so that a later read returns what was there before the write, would break this requirement. The yum example above probably requires such "broken" semantics.

What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or roll back the write (transaction).

The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. No concurrent reads or writes to the portion of the object being modified would be allowed while the write is blocked, ensuring full POSIX semantics.
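As a rough illustration of the blocking behavior described above, here is a minimal in-memory Python sketch. Everything in it is hypothetical: `GatedStore`, `WriteRolledBack`, and the coordinator callable are names invented for this example, not part of Tux3 or any real filesystem interface. It also locks the whole object rather than only the modified byte range, which is a simplification.

```python
import threading


class WriteRolledBack(OSError):
    """Raised to the writer when the coordinator rolls the write back."""


class GatedStore:
    """Toy in-memory 'file' whose writes wait for a coordinator decision."""

    def __init__(self, size, coordinator):
        self._data = bytearray(size)
        self._lock = threading.Lock()    # serialises all reads and writes
        self._coordinator = coordinator  # assumed: callable(offset, payload) -> bool

    def write(self, offset, payload):
        with self._lock:
            # "Ready to commit": the new bytes are not visible to anyone yet.
            # The call blocks here for as long as the coordinator takes.
            if not self._coordinator(offset, payload):
                raise WriteRolledBack("coordinator rolled back the write")
            self._data[offset:offset + len(payload)] = payload
            return len(payload)

    def read(self, offset, length):
        with self._lock:  # a reader waits while a write is pending
            return bytes(self._data[offset:offset + length])
```

Because the lock is held while the coordinator decides, a reader never observes bytes that are later rolled back: it sees either the old contents or the committed ones. A rolled-back write surfaces to the caller as an ordinary failed write.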

For this to be effective with distributed redundant filesystems, once the FS has signaled ready to commit, the write has to be able to survive a crash. If the node hosting the FS crashes, the rollback or commit can then be issued upon recovery (depending on the coordinator's decision), and reads/writes must continue to be blocked until then (even after the crash!).
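One way to picture the crash-survival requirement is an intent record that is made durable before "ready to commit" is signaled. The Python sketch below is only an assumption about how that could look: `PreparedWriteLog` and its methods are hypothetical, it tracks a single pending write, and it ignores details a real filesystem would have to handle (atomic record updates, syncing the directory, blocking readers during recovery).

```python
import json
import os


class PreparedWriteLog:
    """Hypothetical on-disk record of a write that is 'ready to commit'."""

    def __init__(self, path):
        self.path = path

    def prepare(self, txid, offset, payload):
        # Persist the intent before telling the coordinator we are ready.
        record = {"txid": txid, "offset": offset, "payload": payload.hex()}
        with open(self.path, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # After a restart, a surviving record means a decision is still owed.
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)

    def resolve(self, data, commit):
        # Apply or discard the prepared write, then clear the record.
        record = self.pending()
        if record is None:
            return False
        if commit:
            payload = bytes.fromhex(record["payload"])
            start = record["offset"]
            data[start:start + len(payload)] = payload
        os.remove(self.path)
        return True
```

On recovery, the node would check `pending()` before serving any reads or writes to the affected range, ask the coordinator for its decision, and call `resolve()` with it.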

If the commit is performed, things continue as usual; if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides whether to commit or roll back the transaction.

That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement; the tricky part is defining a useful generic interface to the coordinator that would allow higher-level distributed FSes to use it effectively.