
Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:20 UTC (Tue) by martinfick (subscriber, #4455)
Parent article: Tux3: the other next-generation filesystem

I look forward to a stable Tux3 FS!

I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel, and now Tux3 will have some form of transaction support! But has anyone considered adding whole-filesystem transactions to the VFS API layer (including the ability to roll back) to help with the future development of distributed redundant filesystems?

It seems like there are many new distributed filesystems also in development. Of those that do have data redundancy, most do not yet do it in a transactional manner, probably because it is hard. However, if these FSes had kernel support for transactions in the underlying filesystem, this might become much easier.

Hmm, maybe some tricks could even be played to use snapshots in this way? A brute-force approach might even be to use LVM snapshots, but this might seriously stress LVM if a new snapshot were required for every FS write, and it could also mean severe performance penalties. However, an LVM fallback method would allow transactions to be added to the kernel VFS layer even for older filesystems such as FAT.

If this suggested in-kernel transaction support could allow commit/rollback decisions to be exported to userspace, I would think that it could easily be used (and would be very welcome) by distributed FS designers.


Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:35 UTC (Tue) by sbergman27 (guest, #10767)

"""
I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel
"""

I thought that was jbd?

http://en.wikipedia.org/wiki/Journaled_block_device

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:59 UTC (Tue) by avik (guest, #704)

jbd is an in-kernel interface for journalling changes to block devices. What is described here is filesystem-level, user-visible transaction support.

Consider:

begin transaction
yum update
test test test
commit transaction (or abort transaction)

Useful work can continue to be performed while the update takes place, and is not lost in case of rollback.

I believe NTFS supports this feature.

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 14:48 UTC (Tue) by rwmj (subscriber, #5474)

"""
begin transaction
yum update
test test test
commit transaction
"""

This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple transactions outstanding, you need some way to combine the results to get a consistent filesystem.

For example, suppose that the yum transaction modified /etc/motd, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?

From DBMSes you can find lots of different strategies to deal with these cases. A non-exhaustive list might include: Don't permit the second transaction to succeed. Always take the result of the first (or second) transaction and overwrite the other. Use a merge strategy (and there are many different sorts).
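The three strategies listed above can be sketched in a few lines of Python. This is only an illustration, not anything Tux3 or the VFS provides: the names `Policy`, `ConflictError` and `resolve` are made up for this example, and file contents are modelled as plain strings that both transactions started from (`base`).

```python
from enum import Enum


class Policy(Enum):
    REJECT_SECOND = "reject"        # do not permit the second transaction to succeed
    LAST_WRITER_WINS = "overwrite"  # the later commit overwrites the earlier one
    MERGE = "merge"                 # combine both results with a caller-supplied function


class ConflictError(Exception):
    """Raised when the second transaction is refused."""


def resolve(base, first, second, policy, merge=None):
    """Return the final content after two transactions that both started from `base`.

    `first` is the content committed first, `second` the content committed later.
    """
    # No real conflict if only one side changed the file, or both agree.
    if first == base:
        return second
    if second == base or second == first:
        return first

    if policy is Policy.REJECT_SECOND:
        raise ConflictError("file changed since the second transaction began")
    if policy is Policy.LAST_WRITER_WINS:
        return second
    if policy is Policy.MERGE:
        if merge is None:
            raise ValueError("MERGE policy needs a merge function")
        return merge(base, first, second)
    raise ValueError(f"unknown policy: {policy!r}")
```

For the /etc/motd case, `resolve(old_motd, yum_motd, user_motd, Policy.REJECT_SECOND)` would raise `ConflictError`, while `Policy.LAST_WRITER_WINS` would silently keep the user's edit. Which of these is right is exactly the policy question the list above raises.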

As usual in computer science, there is a whole load of interesting, accessible theory here, which is being completely ignored. My favorite, which is directly relevant here, is Oleg's Zipper filesystem.

Rich.

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 20:26 UTC (Tue) by martinfick (subscriber, #4455)

You are correct, that is actually considerably more advanced than what I was proposing. But since I did not go into any details about what I was asking for, I can hardly object. :) The real problem with the above, apart from perhaps being difficult to achieve, is that it would likely break POSIX semantics!

The yum proposal probably assumes that I could have multiple writes interleaved with reads from the same locations that could succeed in one transaction and then possibly be rolled back. POSIX requires that once a write succeeds, any read of the same location that succeeds after the write reports the newly written bytes. Returning newly written bytes to any process (even the writer) while the transaction is pending, and then rolling back the transaction so that a later read returns what was there before the write, would break this requirement. The yum example above probably requires such "broken" semantics.
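The read-after-write rule is easy to observe from userspace. Here is a small sketch in Python using only the standard `os` module; the helper name `read_after_write` is invented for this example, and it uses two separate file descriptors to stand in for two processes.

```python
import os


def read_after_write(path, offset, data):
    """Write `data` at `offset` through one descriptor, then read it back through another."""
    wfd = os.open(path, os.O_WRONLY)
    rfd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(wfd, offset, os.SEEK_SET)
        written = os.write(wfd, data)   # once this returns, the bytes must be visible
        os.lseek(rfd, offset, os.SEEK_SET)
        return os.read(rfd, written)    # POSIX: this must return the new bytes
    finally:
        os.close(wfd)
        os.close(rfd)
```

A multi-write transaction that could still be rolled back after that `os.read` has returned would have to "un-show" bytes a reader has already seen, which is the conflict described above.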

What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or roll back the write (transaction).

The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. This would not allow any concurrent reads or writes to the portion of the object being modified during this block, ensuring full POSIX semantics.

For this to be effective with distributed redundant filesystems, the write has to be able to survive a crash once the FS has signaled ready to commit. That way, if the node hosting the FS crashes, the rollback or commit can be issued upon recovery (depending on the coordinator's decision). Reads and writes must continue to be blocked until then (even after the crash!).

If the commit is performed, things continue as usual; if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides whether to commit or roll back the transaction.

That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement; the tricky part is defining a useful generic interface to the coordinator that would allow higher-level distributed FSes to use it effectively.
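To make the single-write scheme concrete, here is a rough in-memory sketch in Python. Everything in it is hypothetical: `CoordinatedStore`, `WriteRejected` and the coordinator callback signature are invented for illustration, one lock covers the whole store rather than just the portion being modified, and the crash-survival requirement is not modelled at all.

```python
import threading


class WriteRejected(OSError):
    """Raised to the writer when the coordinator rolls the write back."""


class CoordinatedStore:
    def __init__(self, coordinator):
        # coordinator(key, old_value, new_value) -> True to commit, False to roll back
        self._coordinator = coordinator
        self._data = {}
        self._lock = threading.Lock()  # a single lock for the whole store, for brevity

    def write(self, key, value):
        with self._lock:
            # "Ready to commit": the change is prepared but not yet visible.
            # The writer blocks here until the coordinator decides.
            if self._coordinator(key, self._data.get(key), value):
                self._data[key] = value  # commit: make the new value visible
            else:
                raise WriteRejected(f"write to {key!r} rolled back")

    def read(self, key):
        # Readers take the same lock, so they wait while a decision is pending
        # and never observe a value that is later rolled back.
        with self._lock:
            return self._data.get(key)
```

A distributed FS would plug in a coordinator that asks the other replicas before returning, for example `CoordinatedStore(lambda key, old, new: all_replicas_ready(key, new))`, where `all_replicas_ready` is again a made-up name. To the application a rolled-back write looks like any other failed write.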

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 0:00 UTC (Wed) by jengelh (subscriber, #33263)

Not that JBD has many users. Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd. Wow.

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 12:40 UTC (Wed) by daniel (subscriber, #3181)

"""
Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd.
"""

And OCFS2. JBD was created at a time when it seemed as though all future filesystems would be journalling filesystems. Incidentally, any filesystem developer who overlooks Stephen Tweedie's copious writings on the JBD design does so at their peril, whether they intend to use journalling or some other atomic commit model.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds