Tux3: the other next-generation filesystem
Posted Dec 2, 2008 19:59 UTC (Tue) by avik (guest, #704) In reply to: Tux3: the other next-generation filesystem by sbergman27
Parent article: Tux3: the other next-generation filesystem
Consider:
begin transaction
yum update
test test test
commit transaction (or abort transaction)
Useful work can continue to be performed while the update takes place, and is not lost in case of rollback.
I believe NTFS supports this feature.
Posted Dec 9, 2008 14:48 UTC (Tue)
by rwmj (subscriber, #5474)
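This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple
transactions outstanding, you need some way to combine the results to get a consistent filesystem. For example, suppose that the yum transaction modified /etc/motd, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?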
DBMSes offer lots of different strategies for dealing with these cases. A non-exhaustive
list might include: don't permit the second transaction to succeed; always take the result of the first (or
second) transaction and overwrite the other; use a merge strategy (and there are many different sorts).
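To make the strategies listed above concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the names `Strategy`, `Conflict`, `merge_lines` and `resolve` are invented, file contents are modelled as lists of lines, and the "merge" is a deliberately naive line-based one, not what any real filesystem or DBMS does.

```python
from enum import Enum


class Strategy(Enum):
    REJECT_SECOND = "reject_second"  # the later committer is refused
    FIRST_WINS = "first_wins"        # keep the earlier committed value
    SECOND_WINS = "second_wins"      # the later value overwrites the earlier
    MERGE = "merge"                  # try to combine both sets of changes


class Conflict(Exception):
    """Raised when the second transaction is not permitted to succeed."""


def merge_lines(base, first, second):
    """Naive merge: keep base lines both sides kept, then append additions."""
    kept = [ln for ln in base if ln in first and ln in second]
    added_first = [ln for ln in first if ln not in base]
    added_second = [
        ln for ln in second if ln not in base and ln not in added_first
    ]
    return kept + added_first + added_second


def resolve(base, first, second, strategy):
    """Final content when two transactions both started from `base`.

    `first` is the version committed earlier, `second` the one committed later.
    """
    if first == base:      # only the second transaction changed the file
        return second
    if second == base:     # only the first transaction changed the file
        return first
    if strategy is Strategy.REJECT_SECOND:
        raise Conflict("file changed since the second transaction began")
    if strategy is Strategy.FIRST_WINS:
        return first
    if strategy is Strategy.SECOND_WINS:
        return second
    return merge_lines(base, first, second)


# The /etc/motd case: the user commits first, the yum transaction second.
motd = ["Welcome"]
user_edit = ["Welcome", "Maintenance at 5pm"]
yum_edit = ["Welcome", "Updated by yum"]
merged = resolve(motd, user_edit, yum_edit, Strategy.MERGE)
```

Each strategy gives a different "final, consistent value" for the same pair of edits, which is the point: the filesystem has to pick one, and the pick is policy, not mechanism.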
As usual in computer science, there is a whole load of interesting, accessible theory here, which is
being completely ignored. My favorite, which is directly relevant here, is Oleg's Zipper filesystem.
Rich.
Posted Dec 9, 2008 20:26 UTC (Tue)
by martinfick (subscriber, #4455)
The yum proposal probably assumes that multiple writes could be interleaved with reads from the same locations, succeed within one transaction, and then possibly be rolled back. POSIX requires that once a write succeeds, any read of the same location that succeeds after the write reports the newly written bytes. Returning the written bytes to any process (even the writer) while the transaction is pending, and then rolling back the transaction so that later reads return what was there before the write, would break this requirement. The yum example above probably requires such "broken" semantics.
What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or roll back the write (transaction).
The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. No concurrent reads or writes to the portion of the object being modified would be allowed during this block, ensuring full POSIX semantics.
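A rough sketch of that flow, in Python, assuming a toy in-memory store rather than a real filesystem. The class `GatedStore` and the `coordinator` callable are invented for illustration, and a single lock covers the whole store for brevity, where the idea above would only block the portion of the object being written.

```python
import threading


class GatedStore:
    """Toy in-memory 'filesystem': every write needs coordinator approval."""

    def __init__(self, coordinator):
        # Hypothetical interface: coordinator(path, data) -> bool,
        # True meaning "commit" and False meaning "roll back".
        self._coordinator = coordinator
        self._data = {}
        self._lock = threading.Lock()  # one coarse lock, for brevity only

    def read(self, path):
        # Readers take the same lock, so they wait while a write is pending
        # and can never observe bytes that are later rolled back.
        with self._lock:
            return self._data.get(path)

    def write(self, path, data):
        with self._lock:
            # "Ready to commit": the new data is staged but not yet visible.
            # This call blocks for as long as the coordinator takes to decide.
            if self._coordinator(path, data):
                self._data[path] = data  # commit: the write becomes visible
                return True
            return False                 # rollback: the write simply fails


# Example coordinator (an assumption): refuse anything under /etc/.
store = GatedStore(lambda path, data: not path.startswith("/etc/"))
store.write("/tmp/scratch", b"ok")
store.write("/etc/motd", b"refused")
```

Because the staged data never becomes readable before the decision, a rollback here does not violate the read-after-write rule: from every process's point of view the write either happened or failed.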
For this to be effective with distributed redundant filesystems, once the FS has signaled ready to commit, the write has to be able to survive a crash, so that if the node hosting the FS crashes, the rollback or commit can be issued upon recovery (depending on the coordinator's decision). Reads and writes must continue to be blocked until then (even after the crash!).
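One way to picture the crash-survival requirement is an append-only journal of "prepared" writes that is replayed on recovery. The Python sketch below is only that, a picture: the function names, the JSON-lines record format and the `txid` field are all assumptions, and a real implementation would also have to make the journal's directory entry durable and keep the affected ranges locked after restart.

```python
import json
import os


def _append(journal_path, record):
    """Append one JSON record and force it to stable storage."""
    with open(journal_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def prepare(journal_path, txid, path, data):
    """Record a write that is 'ready to commit' before asking the coordinator."""
    _append(journal_path,
            {"txid": txid, "state": "prepared", "path": path, "data": data})


def decide(journal_path, txid, commit):
    """Record the coordinator's decision for a prepared write."""
    state = "committed" if commit else "rolled_back"
    _append(journal_path, {"txid": txid, "state": state})


def pending(journal_path):
    """Replay the journal; return prepared writes still awaiting a decision."""
    waiting = {}
    if not os.path.exists(journal_path):
        return waiting
    with open(journal_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["state"] == "prepared":
                waiting[rec["txid"]] = (rec["path"], rec["data"])
            else:
                waiting.pop(rec["txid"], None)
    return waiting
```

After a crash, whatever `pending()` returns is the set of writes for which the node must go back to the coordinator, and whose targets must stay blocked until it answers.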
If the commit is performed, things continue as usual; if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides whether to commit or roll back the transaction.
That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement; the tricky part is defining a useful generic interface to the controller that would allow higher-level distributed FSes to use it effectively.
