Tux3: the other next-generation filesystem
Posted Dec 4, 2008 21:05 UTC (Thu) by joern (guest, #22392)
In reply to: Tux3: the other next-generation filesystem by joern
Parent article: Tux3: the other next-generation filesystem

There is another, more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time. Consider a chain of blocks:
A -> B -> C

Now both B and C are rewritten, without updating their respective parent blocks (A and B):

A -> B -> C
     B'   C'
Without reading all the promises, B' and C' appear disconnected. At this point, when mounting under memory pressure, order becomes important. If A is written out first, to release the "promise" on B', everything works fine. But if B is written out first, to release the "promise" on C', we get something like this:
A -> B -> C
     B'   C'
     B"---^
And now there are two conflicting "promises" on B' and B". A rather ugly situation.
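
To make the ordering concrete, here is a small toy model in C of the scenario above. All names and structures are invented for illustration; this is not tux3 or logfs code.

/* Toy model: a chain A -> B -> C where B and C were rewritten as B'
 * and C' without updating their parents, so each rewritten block
 * carries a "promise" that its parent will eventually point at it. */
#include <stdio.h>

struct block {
    const char *name;       /* this version of the block           */
    struct block *parent;   /* who must eventually point at us     */
    int promised;           /* 1: parent does not yet point at us  */
};

/* Writing out a parent to absorb a child's promise releases that
 * promise, but re-dirties the parent, which then needs *its* parent
 * updated in turn (unless it is the root). */
static void absorb(struct block *parent, struct block *child)
{
    printf("write %s pointing at %s: promise on %s released\n",
           parent->name, child->name, child->name);
    child->promised = 0;
    if (parent->parent)
        parent->promised = 1;
}

int main(void)
{
    struct block A  = { "A",  NULL, 0 };
    struct block B1 = { "B'", &A,   1 };   /* rewritten B, A not yet updated */
    struct block C1 = { "C'", &B1,  1 };   /* rewritten C, B not yet updated */

    /* Good order: write A first.  A is the root, so absorbing the
     * promise on B' creates no new promise; only C's remains, and it
     * can be absorbed next in the ordinary way. */
    absorb(&A, &B1);

    /* Bad order: reset, then write B first.  A new version B" is
     * created to point at C'; B" promises an update to A while B'
     * still holds that very same promise. */
    B1.promised = 1;
    C1.promised = 1;
    struct block B2 = { "B\"", &A, 0 };
    absorb(&B2, &C1);
    printf("conflict: %s and %s both promise an update to A\n",
           B1.name, B2.name);
    return 0;
}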
Tux3: the other next-generation filesystem

Posted Dec 11, 2008 8:01 UTC (Thu) by daniel (guest, #3181) [Link] (1 responses)

Hi Joern,

> There is another, more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time.

You mean: first run with lots of RAM and get tons of metadata blocks pinned, then remount with too little RAM to hold all the pinned metadata blocks. That is a rare situation; you would have to work at it. All of RAM is available for pinned metadata on remount, and Tux3 is pretty stingy about metadata size.

In your example, when B is rewritten (a btree split or merge), the promise made by C' to update B is released, because B' is already on disk. So the situation is not as complex as you feared.

I expect we can simply ignore the problem of running out of dirtyable cache on replay, and nobody will ever hit it. But for completeness, note that writing out the dirty metadata is not the only option. By definition, each dirty metadata block can be reconstructed from the log. So choose a dirty metadata block with no dirty children, reconstruct it, and write it out complete with promises (a mini-rollup). Keep doing that until all the dirty metadata fits in cache, then go live. This may not be fast, but it clearly terminates. Unwinding these promises is surely much easier than unwinding credit default swaps :-)

Regards,

Daniel
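
For what it's worth, here is a self-contained toy in C of the mini-rollup loop Daniel describes: repeatedly pick a dirty block with no dirty children, rebuild it from the log, and write it out until the dirty set fits in cache. The block array, sizes, and helper names are all invented for illustration; this is not tux3 code.

#include <stdio.h>

#define NBLOCKS 8
#define CACHE_LIMIT 3            /* pretend we can pin only 3 dirty blocks */

struct metablock {
    int dirty;
    int parent;                  /* index of parent block, -1 for the root */
};

static struct metablock blk[NBLOCKS];

/* A block qualifies for mini-rollup if it is dirty and no dirty block
 * points to it. */
static int pick_dirty_leaf(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (!blk[i].dirty)
            continue;
        int has_dirty_child = 0;
        for (int j = 0; j < NBLOCKS; j++)
            if (blk[j].dirty && blk[j].parent == i)
                has_dirty_child = 1;
        if (!has_dirty_child)
            return i;
    }
    return -1;
}

static int count_dirty(void)
{
    int n = 0;
    for (int i = 0; i < NBLOCKS; i++)
        n += blk[i].dirty;
    return n;
}

int main(void)
{
    /* a chain of 8 dirty blocks: block 0 is the root, i's parent is i - 1 */
    for (int i = 0; i < NBLOCKS; i++) {
        blk[i].dirty = 1;
        blk[i].parent = i - 1;
    }
    while (count_dirty() > CACHE_LIMIT) {
        int i = pick_dirty_leaf();
        if (i < 0)
            break;
        /* "reconstruct from the log" and write out, promises included */
        printf("mini-rollup: reconstruct and write block %d\n", i);
        blk[i].dirty = 0;        /* one block leaves the dirty set per pass */
    }
    printf("%d dirty blocks remain: fits in cache, go live\n", count_dirty());
    return 0;
}

Each pass writes out exactly one block, which is why the loop clearly terminates, as noted above.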
Tux3: the other next-generation filesystem

Posted Dec 20, 2008 13:08 UTC (Sat) by joern (guest, #22392) [Link]

In that case I am a step ahead of you. :)

The situation may be easier to reach than you expect. Removable media can move from a beefy machine to some embedded device with 8M of RAM. That might not be likely for tux3, but it is reasonably likely for logfs.

And order is important. If B is rewritten _after_ C, the promise made by C' is released. If it is rewritten _before_ C, both promises exist in parallel.

What I did to handle this problem may not apply directly to tux3, as the filesystem designs don't match 100%. Logfs has old-fashioned indirect blocks and stores a "promise" by marking a pointer in the indirect block as such. Each commit walks a list of promise-containing indirect blocks and writes all the promises to the journal.

On mount, the promises are added to an in-memory btree. Each promise occupies about 32 bytes there - while it would occupy a full page if stored in the indirect block with no other promises sharing that block. That allows the read-only case to work correctly while consuming fairly little memory.

When going to read-write mode, the promises can be moved back into the indirect blocks. If those consume too much memory, they are written back. However, for some period promises may exist both in the btree and in the indirect blocks, and care must be taken that the two never disagree.

This requires a bit more RAM than your outlined algorithm, but is still bounded to a reasonable amount - nearly identical to the space occupied in the journal.
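
As a rough illustration of the scheme Joern outlines - a 32-byte promise record journaled at commit and replayed into an in-memory index at mount - here is a sketch in C. The record layout and all names are guesses for illustration, not actual logfs structures.

#include <stdio.h>
#include <stdint.h>

/* One promise: "the pointer at (ino, index) must be redirected to ofs".
 * Packed into 32 bytes, matching the size quoted above. */
struct promise {
    uint64_t ino;       /* inode owning the indirect block         */
    uint64_t index;     /* which pointer inside that block         */
    uint64_t ofs;       /* where the rewritten child now lives     */
    uint64_t pad;       /* reserved; rounds the record to 32 bytes */
};

/* Pretend journal contents, as written by a previous commit. */
static const struct promise journal[] = {
    { 1, 0, 0x1000, 0 },
    { 1, 7, 0x2000, 0 },
    { 4, 2, 0x3000, 0 },
};

#define NPROMISES (sizeof(journal) / sizeof(journal[0]))

/* In logfs this would be an in-memory btree keyed by (ino, index);
 * a flat array stands in for it here. */
static struct promise mem_index[NPROMISES];

int main(void)
{
    /* Mount-time replay: each promise costs sizeof(struct promise)
     * bytes in memory instead of a full cached indirect-block page,
     * which is what lets the read-only case stay cheap. */
    for (unsigned i = 0; i < NPROMISES; i++) {
        mem_index[i] = journal[i];
        printf("replayed promise: ino %llu index %llu -> ofs 0x%llx\n",
               (unsigned long long)journal[i].ino,
               (unsigned long long)journal[i].index,
               (unsigned long long)journal[i].ofs);
    }
    printf("in-memory cost: %zu bytes (%zu per promise)\n",
           NPROMISES * sizeof(struct promise), sizeof(struct promise));
    return 0;
}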