Tux3: the other next-generation filesystem

Posted Dec 4, 2008 16:13 UTC (Thu) by joern (guest, #22392)
Parent article: Tux3: the other next-generation filesystem

> Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.

Excellent, you had the same idea. How do you deal with inode->i_size and inode->i_blocks changing as a result of a "promise"?
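
To make the question concrete, a logged "promise" might look roughly like the sketch below. The names and layout are illustrative guesses, not Tux3's actual log format:

#include <stdint.h>

/* Hypothetical shape of a logged "promise": the new block already
 * sits at its final location; the log records that the parent
 * metadata block still owes it a pointer update.  Illustrative
 * only, not Tux3's on-disk format. */
struct promise_record {
	uint64_t parent;	/* metadata block that must be updated */
	uint64_t child;		/* final disk location of the new block */
	uint64_t logical;	/* logical position the child covers */
	uint32_t ino;		/* inode the change belongs to */
	uint32_t flags;		/* e.g. leaf data vs. index block */
};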



Tux3: the other next-generation filesystem

Posted Dec 4, 2008 21:05 UTC (Thu) by joern (guest, #22392) [Link] (2 responses)

Oh, and there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time. However, before they have all been loaded, one might have an incorrect (slightly dated) picture of the filesystem. The easiest example I can come up with involves three blocks, A, B and C, where A points to B and B points to C:
A -> B -> C

Now both B and C are rewritten, without updating their respective parent blocks (A and B):
A -> B -> C
     B'   C'

B' and C' appear disconnected without reading up on all the promises. At this point, when mounting under memory pressure, order becomes important. If A is written out first, to release the "promise" on B', everything works fine. But when B is written out first, to release the "promise" on C', we get something like this:
A -> B -> C
     B'   C'
     B"---^

And now there are two conflicting "promises" on B' and B". A rather ugly situation.
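
To restate the two orders as a toy program (versions numbered instead of primed; purely illustrative, not tux3 or logfs code):

#include <stdio.h>

/*
 * Toy restatement of the hazard above: B0 is the old on-disk B,
 * B1 is B', B2 is the B" created when the old B0 is rewritten to
 * absorb C'.  A promise (parent -> version) means "parent must be
 * repointed at that version".
 */
int main(void)
{
	/* Pending promises after B and C are rewritten in place:
	 * A -> B1 (made by B') and B -> C1 (made by C').         */

	/* Order 1: fulfil A -> B1 first.  A is the root, so nothing
	 * new is spawned; the C1 promise then lands in B1, which is
	 * exactly the block A now points at.  One lineage remains. */
	printf("order 1: A -> B1 -> C1, nothing left pending\n");

	/* Order 2: fulfil B -> C1 first.  Rewriting the old B0
	 * creates a third version B2 and a fresh promise A -> B2,
	 * which conflicts with the still-pending A -> B1.          */
	printf("order 2: A -> B1 and A -> B2 both pending: conflict\n");
	return 0;
}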

Tux3: the other next-generation filesystem

Posted Dec 11, 2008 8:01 UTC (Thu) by daniel (guest, #3181) [Link] (1 response)

Hi Joern,

> there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time.

You mean: first run with lots of RAM, get tons of metadata blocks pinned, then remount with too little RAM to hold all the pinned metadata blocks. A rare situation; you would have to work at that. All of RAM is available for pinned metadata on remount, and Tux3 is pretty stingy about metadata size.

In your example, when B is rewritten (a btree split or merge), the promise made by C' to update B is released, because B' is on disk. So the situation is not as complex as you feared.

I expect we can just ignore the problem of running out of dirtyable cache on replay and nobody will ever hit it. But for completeness, note that writing out the dirty metadata is not the only option. By definition, one can reconstruct each dirty metadata block from the log. So choose a dirty metadata block with no dirty children, reconstruct it and write it out, complete with promises (a mini-rollup). Keep doing that until all the dirty metadata fits in cache, then go live. This may not be fast, but it clearly terminates. Unwinding these promises is surely much easier than unwinding credit default swaps :-)
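
For the curious, that replay loop might be sketched as follows, assuming hypothetical helpers for the steps named above; this is not actual Tux3 code:

struct sb;		/* filesystem superblock (hypothetical) */
struct buffer;		/* one cached metadata block (hypothetical) */

/* Assumed helpers, named after the steps described above: */
extern unsigned long dirty_bytes(struct sb *sb);
extern unsigned long dirtyable_limit(struct sb *sb);
extern struct buffer *pick_dirty_without_dirty_children(struct sb *sb);
extern void rebuild_from_log(struct sb *sb, struct buffer *b);
extern void write_with_promises(struct sb *sb, struct buffer *b);
extern void evict(struct buffer *b);

/* Mini-rollup at replay: retire one reconstructible block per pass
 * until the remaining dirty metadata fits in cache.  Each pass
 * shrinks the dirty set by one block, so the loop terminates. */
static void replay_until_it_fits(struct sb *sb)
{
	while (dirty_bytes(sb) > dirtyable_limit(sb)) {
		struct buffer *b = pick_dirty_without_dirty_children(sb);

		rebuild_from_log(sb, b);     /* dirty implies rebuildable */
		write_with_promises(sb, b);  /* just as a normal rollup would */
		evict(b);                    /* frees cache before going live */
	}
}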

Regards,

Daniel

Tux3: the other next-generation filesystem

Posted Dec 20, 2008 13:08 UTC (Sat) by joern (guest, #22392) [Link]

>> I expect we can just ignore the problem

In that case I am a step ahead of you. :)

The situation may be easier to reach than you expect. Removable media can move from a beefy machine to some embedded device with 8MB of RAM. That might not be likely for tux3, but it is reasonably likely for logfs.

And order is important. If B is rewritten _after_ C, the promise made by C' is released. If it is rewritten _before_ C, both promises exist in parallel.

What I did to handle this problem may not apply directly to tux3, as the filesystem designs don't match 100%. Logfs has the old-fashioned indirect blocks and stores a "promise" by marking a pointer in the indirect block as such. Each commit walks a list of promise-containing indirect blocks and writes all promises to the journal.
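
For illustration, marking a pointer as a promise can be as cheap as stealing one spare bit from a 64-bit block address; the sketch below is a guess at such an encoding, not logfs's actual one:

#include <stdint.h>

/* One spare address bit distinguishes "ordinary pointer" from
 * "promise: child rewritten, on-disk parent still points at the
 * old version".  Illustrative encoding only. */
#define PROMISE_BIT	(1ULL << 63)

static inline uint64_t mark_promise(uint64_t ptr)  { return ptr | PROMISE_BIT; }
static inline int      is_promise(uint64_t ptr)    { return !!(ptr & PROMISE_BIT); }
static inline uint64_t promise_addr(uint64_t ptr)  { return ptr & ~PROMISE_BIT; }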

On mount the promises are added to an in-memory btree. Each promise occupies about 32 bytes, whereas it would occupy a full page if it were stored in the indirect block and no other promises shared that block. That allows the read-only case to work correctly while consuming fairly little memory.
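
A plausible shape for such a 32-byte entry, with field names guessed rather than taken from logfs:

#include <stdint.h>

/* In-memory promise, keyed by (ino, bix) for btree lookup.  Exactly
 * 32 bytes, so thousands of promises fit in the single page that one
 * sparsely-promised indirect block would otherwise occupy. */
struct promise_entry {
	uint64_t ino;		/* inode owning the indirect block */
	uint64_t bix;		/* logical block index within it */
	uint64_t new_addr;	/* where the rewritten child now lives */
	uint32_t level;		/* height of the parent in the tree */
	uint32_t flags;
};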

When going to read-write mode, the promises can be moved into the indirect blocks again. If those consume too much memory, they are written back. However, for some period promises may exist both in the btree and in indirect blocks. Care must be taken that those two never disagree.

This requires a bit more RAM than the algorithm you outlined, but it is still bounded by a reasonable amount, nearly identical to the space occupied in the journal.

Tux3: the other next-generation filesystem

Posted Dec 11, 2008 6:42 UTC (Thu) by daniel (guest, #3181) [Link]

> How do you deal with inode->i_size and inode->i_blocks changing as a result of a "promise"?

These are updated with the inode table block and are not affected by promises. Note that we can sometimes infer the i_size and i_blocks changes from the logical positions of the written data blocks, and so could defer inode table block updates until rollup time. And in the cases where we can't infer them, write the i_size into the log commit block. More optimization fun.
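
As an illustrative sketch of that inference (assuming 4K blocks and hypothetical names; not Tux3 code):

#include <stdint.h>

#define BLOCKBITS 12	/* assume 4K blocks for the sketch */

/* Replaying one logged data-block write at logical index bix: the
 * file must reach at least the end of that block, and, assuming the
 * write allocated a fresh block, the inode gains one more block.
 * Shrinks and exact in-block tail sizes cannot be inferred this way;
 * those are the cases needing i_size in the commit block. */
static inline void infer_from_write(uint64_t *i_size, uint64_t *i_blocks,
				    uint64_t bix)
{
	uint64_t end = (bix + 1) << BLOCKBITS;

	if (end > *i_size)
		*i_size = end;
	*i_blocks += 1 << (BLOCKBITS - 9);  /* i_blocks is in 512-byte units */
}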

