Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:59 UTC (Tue) by avik (guest, #704)
In reply to: Tux3: the other next-generation filesystem by sbergman27
Parent article: Tux3: the other next-generation filesystem

jbd is an in-kernel interface for journalling changes to block devices. What is described here is filesystem-level, user-visible transaction support.

Consider:

begin transaction
yum update
test test test
commit transaction (or abort transaction)

Useful work can continue to be performed while the update takes place, and is not lost in case of rollback.

I believe NTFS supports this feature.
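To illustrate the begin/commit/abort idea, here is a minimal in-memory sketch. It is not how any real filesystem implements this; the `OverlayFS` and `Transaction` names are hypothetical, files are just dictionary entries, and conflicting changes are not handled (a commit simply overwrites).

```python
class Transaction:
    """Hypothetical transaction: buffers writes until commit or abort."""

    def __init__(self, fs):
        self._fs = fs
        self._writes = {}          # path -> content, invisible to others

    def write(self, path, content):
        self._writes[path] = content

    def read(self, path):
        # The transaction sees its own pending writes first.
        if path in self._writes:
            return self._writes[path]
        return self._fs.files[path]

    def commit(self):
        # Publish all buffered writes at once (no conflict handling here).
        self._fs.files.update(self._writes)
        self._writes.clear()

    def abort(self):
        # Discard buffered writes; the shared state was never touched.
        self._writes.clear()


class OverlayFS:
    """Hypothetical filesystem: a dict of path -> content."""

    def __init__(self):
        self.files = {}

    def begin(self):
        return Transaction(self)
```

Work done outside the transaction goes straight into `files`, so it survives an `abort()` of the update.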

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 14:48 UTC (Tue) by rwmj (subscriber, #5474)

begin transaction
yum update
test test test
commit transaction

This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple transactions outstanding, you need some way to combine the results to get a consistent filesystem.

For example, suppose that the yum transaction modified /etc/motd, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?

From DBMSes you can find lots of different strategies to deal with these cases. A non-exhaustive list might include: don't permit the second transaction to succeed; always take the result of the first (or second) transaction and overwrite the other; or use a merge strategy (and there are many different sorts).
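As a rough sketch of the first two strategies applied to the /etc/motd case, assuming each transaction remembers the file content it started from: the `Txn` type, the `resolve` function and the strategy names are made up for illustration, and merging is left out because it depends entirely on the file format.

```python
from dataclasses import dataclass


class ConflictError(Exception):
    """Raised when a commit is refused because of a concurrent change."""


@dataclass
class Txn:
    name: str
    base: str      # file content the transaction started from
    result: str    # file content the transaction wants to commit


def resolve(committed: str, txn: Txn, strategy: str) -> str:
    """Return the file content after txn tries to commit over `committed`."""
    if committed == txn.base:
        return txn.result          # nobody else changed the file: no conflict
    if strategy == "reject":
        raise ConflictError(f"{txn.name}: file changed since transaction began")
    if strategy == "first-wins":
        return committed           # keep what was committed earlier
    if strategy == "last-wins":
        return txn.result          # overwrite with the later commit
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, if the user's edit commits first and the yum transaction then commits with "last-wins", the user's edit is silently lost; with "reject", the yum transaction fails instead.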

As usual in computer science, there is a whole load of interesting, accessible theory here, which is being completely ignored. My favorite, which is directly relevant here, is Oleg's Zipper filesystem.

Rich.

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 20:26 UTC (Tue) by martinfick (subscriber, #4455)

You are correct; that is actually quite a bit more advanced than what I was proposing. But since I did not go into any details about what I was asking for, I can hardly object. :) The real problem with the above, apart from perhaps being difficult to achieve, is that it would likely break POSIX semantics!

The yum proposal probably assumes that multiple writes, interleaved with reads from the same locations, could succeed within one transaction and then possibly be rolled back. POSIX requires that once a write succeeds, any read of the same location that succeeds after the write reports the newly written bytes. Returning the newly written bytes to any process (even the writer) while the transaction is pending, and then rolling back the transaction so that a later read returns what was there before the write, would break this requirement. The yum example above probably requires such "broken" semantics.
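The read-after-write requirement can be seen with an ordinary file on a Unix-like system. This is only a small demonstration, using a second file descriptor to stand in for another reader; the `read_after_write` helper and the file name are made up for the example.

```python
import os
import tempfile


def read_after_write(old: bytes, new: bytes) -> bytes:
    """Write `new` over `old` via one descriptor, read it back via another."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "motd")
        with open(path, "wb") as f:
            f.write(old)
        writer = os.open(path, os.O_WRONLY)
        reader = os.open(path, os.O_RDONLY)
        try:
            os.pwrite(writer, new, 0)               # the write succeeds...
            return os.pread(reader, len(new), 0)    # ...so this read sees it
        finally:
            os.close(writer)
            os.close(reader)
```

Once the read has returned the new bytes, a rollback that makes a later read return the old bytes again would be visible to applications as a write that succeeded and then vanished.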

What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or roll back that write (the transaction).

The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. No concurrent reads or writes to the portion of the object being modified would be allowed during this block, ensuring full POSIX semantics.
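A minimal in-memory sketch of that flow, under some loud assumptions: `Coordinator` and `CoordinatedFile` are hypothetical names, the coordinator is a plain callback, and one lock covers the whole file rather than only the byte range being modified.

```python
import errno
import threading


class Coordinator:
    """Hypothetical coordinator: decides the fate of each prepared write."""

    def __init__(self, policy):
        self._policy = policy      # callable(offset, data) -> bool

    def decide(self, offset, data):
        # May take arbitrarily long; True means commit, False means rollback.
        return self._policy(offset, data)


class CoordinatedFile:
    """Hypothetical file whose writes wait for the coordinator's decision."""

    def __init__(self, size, coordinator):
        self._data = bytearray(size)
        self._lock = threading.Lock()
        self._coordinator = coordinator

    def write(self, offset, data):
        # Holding the lock blocks all reads and writes while the decision
        # is pending (a whole-file lock, simpler than a byte-range lock).
        with self._lock:
            if offset < 0 or offset + len(data) > len(self._data):
                raise ValueError("write outside file bounds")
            # "Ready to commit": validated, but nothing is visible yet.
            if not self._coordinator.decide(offset, data):
                raise OSError(errno.EIO, "write rolled back by coordinator")
            self._data[offset:offset + len(data)] = data
            return len(data)

    def read(self, offset, length):
        with self._lock:
            return bytes(self._data[offset:offset + length])
```

Because no reader can observe the bytes before the decision, a rollback looks to applications like any other failed write.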

For this to be effective with distributed redundant filesystems, the write has to be able to survive a crash once the FS has signaled ready to commit. If the node hosting the FS crashes, the rollback or commit can then be issued upon recovery (depending on the coordinator's decision), and reads/writes must continue to be blocked until then (even after the crash!).
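One way to sketch the crash-survival part is an append-only intent log that is fsynced before the FS signals ready to commit. This is an assumption about how it could be done, not a description of any existing filesystem; the function names and the JSON record format are invented for the example.

```python
import json
import os


def _append(log_path, record):
    # Append one record and force it to stable storage before returning.
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())


def prepare(log_path, txid, offset, data: bytes):
    """Durably record a write before signaling 'ready to commit'."""
    _append(log_path, {"txid": txid, "state": "prepared",
                       "offset": offset, "data": data.hex()})


def record_decision(log_path, txid, committed: bool):
    """Durably record the coordinator's commit or rollback decision."""
    _append(log_path, {"txid": txid,
                       "state": "committed" if committed else "rolled_back"})


def pending_after_crash(log_path):
    """Return prepared writes that have no recorded decision yet."""
    pending = {}
    if not os.path.exists(log_path):
        return pending
    with open(log_path) as log:
        for line in log:
            rec = json.loads(line)
            if rec["state"] == "prepared":
                pending[rec["txid"]] = (rec["offset"],
                                        bytes.fromhex(rec["data"]))
            else:
                pending.pop(rec["txid"], None)
    return pending
```

On recovery, every range returned by `pending_after_crash` would stay blocked until the coordinator's decision for that write arrives.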

If the commit is performed, things continue as usual; if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides whether to commit or roll back the transaction.

That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement; the tricky part is defining a useful generic interface to the coordinator that would allow higher-level distributed FSes to use it effectively.