The Tux3 filesystem returns

Posted Jan 3, 2013 16:45 UTC (Thu) by butlerm (guest, #13312)
In reply to: The Tux3 filesystem returns by daniel
Parent article: The Tux3 filesystem returns

I am curious what you use as a key or identifier to a changed metadata block, e.g. one that has at least one consistent prior location, one delta entry in the log somewhere, and one pending rewrite in a future rollup.

Are you preallocating the location of the next version, and identifying it that way? Or perhaps referring to the existing version (by location), plus a list of the (accumulated) deltas? Or do you use some other metadata block identifier to correlate the different versions of the metadata block and any logged deltas for recovery purposes?

The Tux3 filesystem returns

Posted Jan 3, 2013 23:07 UTC (Thu) by daniel (guest, #3181) [Link]

We preallocate the location of the next version, your first guess. Let's call this a "retained" metadata block that will be flushed in a future rollup. We say "redirect" when we allocate a new volume position for a read-only block. When we need to modify a clean btree node we redirect it in cache, emit the corresponding balloc log record (thus avoiding writing out the bitmap), modify it, emit the node modification log record, then recursively redirect its parents in the btree access path until we hit a dirty one (usually the immediate parent).
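The redirect sequence described above can be sketched roughly as follows. This is only an illustration of the described steps, not Tux3 code: `Node`, `Volume`, `redirect`, `modify`, the bump allocator, and the log record tuples are all hypothetical names and formats invented for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical cached btree node (not a Tux3 structure)."""
    block: int                       # current volume position
    parent: Optional["Node"] = None  # next node up the btree access path
    dirty: bool = False              # True once redirected in this rollup cycle


class Volume:
    """Hypothetical volume with a trivial bump allocator and an in-memory log."""

    def __init__(self, next_free: int):
        self.next_free = next_free
        self.log: List[tuple] = []

    def alloc(self) -> int:
        block = self.next_free
        self.next_free += 1
        return block


def redirect(vol: Volume, node: Node) -> None:
    """Give a clean node a new volume position and log the allocation."""
    node.block = vol.alloc()
    node.dirty = True
    # Logging the allocation stands in for writing out the bitmap.
    vol.log.append(("balloc", node.block))


def modify(vol: Volume, node: Node, change: str) -> None:
    """Modify a node, redirecting it and its clean ancestors as needed."""
    was_clean = not node.dirty
    if was_clean:
        redirect(vol, node)
    vol.log.append(("modify", node.block, change))
    # A redirected node has moved, so its parent's pointer changes too.
    # Walk up the access path until an already dirty node is found.
    parent = node.parent
    while was_clean and parent is not None and not parent.dirty:
        redirect(vol, parent)
        vol.log.append(("modify", parent.block, "child pointer update"))
        parent = parent.parent
```

In this sketch a second modification of an already dirty node emits only one modify record, which is the case where the number of recursive redirects rounds to zero.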

Though you didn't ask, we can estimate the per-delta writeout cost. We emit an allocation record and a modify record per redirect, which can actually be one record with some modest optimizing. Given a reasonably local btree modification pattern, the number of recursive redirects rounds to zero, so our theoretical cost is roughly one log record per dirty block per rollup. The number of dirty metadata blocks written per rollup, given a localized burst of modifications, is roughly the filesystem tree depth, log_base_196(tree_leaves), a small number that rounds to zero when amortized across many deltas. We can further improve this (rounding closer to zero) by permitting certain dirty metadata to survive rollup, for example, when we need to redirect the root of a file index tree or a high level node of the inode table btree. This is a future optimization. We can round even closer to zero by bumping our average branching factor up from 196 to 213 using b+tree splitting, if anybody cares to add that.
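To put rough numbers on the tree depth estimate, here is a small sketch. The function names and the `deltas_per_rollup` parameter are hypothetical, and the model assumes every index node is filled to exactly the average branching factor, which a real btree will not do.

```python
def tree_depth(leaves: int, branching: int) -> int:
    """Count index levels above `leaves` leaf blocks for a given fan-out."""
    depth = 0
    nodes = leaves
    while nodes > 1:
        # Integer ceiling division: parents needed for this level.
        nodes = -(-nodes // branching)
        depth += 1
    return depth


def blocks_per_delta(leaves: int, branching: int, deltas_per_rollup: int) -> float:
    """Dirty path blocks written per rollup, amortized over its deltas."""
    return tree_depth(leaves, branching) / deltas_per_rollup
```

Under these assumptions, 8,000,000 leaves need four index levels at a branching factor of 196 but only three at 213, and with 100 deltas per rollup the amortized cost is a few hundredths of a block per delta either way.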

The Tux3 filesystem returns

Posted Jan 4, 2013 3:26 UTC (Fri) by butlerm (guest, #13312) [Link]

That sounds great. As I mentioned earlier, many databases do something similar, except on a logical overwrite rather than a copy-on-write basis. If you have a transaction that needs to modify a b-tree index block (for example), a copy of the index block is modified in RAM, a log record describing the block change is created, and the redo log is forced to disk whenever a transaction (or group of transactions) is committed.

Then at some future point the database server process forces a more recent version of the index block back to disk at the same location as the old one, either because it needs to free up pages, is running a checkpoint, or is in the process of shutting down. After a checkpoint is complete the versions of the blocks on disk are up-to-date up through the logical sequence number of the checkpoint.
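A toy model of that scheme, for illustration only: `MiniDB` and its methods are hypothetical and do not correspond to any particular database's internals. Blocks are plain dictionaries, the redo log is a list assumed to be durable, and a crash is simulated by discarding the cache.

```python
class MiniDB:
    """Toy redo logging with in-place overwrite (hypothetical, simplified)."""

    def __init__(self):
        self.disk = {}           # block number -> contents as last written
        self.cache = {}          # modified copies held in RAM
        self.redo_log = []       # (lsn, block, key, value), assumed durable
        self.lsn = 0             # logical sequence number of the last change
        self.checkpoint_lsn = 0  # disk is current up through this LSN

    def _cached(self, block):
        # Copy the on-disk version into RAM the first time it is touched.
        return self.cache.setdefault(block, dict(self.disk.get(block, {})))

    def modify(self, block, key, value):
        """Change a block in RAM and log a record describing the change."""
        self._cached(block)[key] = value
        self.lsn += 1
        self.redo_log.append((self.lsn, block, key, value))

    def checkpoint(self):
        """Write cached blocks back to their original locations."""
        for block, page in self.cache.items():
            self.disk[block] = dict(page)
        self.cache.clear()
        self.checkpoint_lsn = self.lsn

    def recover(self):
        """After a crash, replay redo records newer than the checkpoint."""
        self.cache = {}
        for lsn, block, key, value in self.redo_log:
            if lsn > self.checkpoint_lsn:
                self._cached(block)[key] = value
```

Recovery here trusts that every block on disk is an intact version as of the checkpoint. The sketch has no way to cope with a block that was only partly written, and it omits undo of uncommitted transactions entirely.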

The vulnerability is that if any block is corrupted, it is highly likely that the database will have to be recovered from a coherent snapshot, with the archived redo information generated in the interim re-applied to restore the database to the state it was in at the moment of the crash. The uncommitted transactions are then rolled back, typically by reversing or undoing all the block changes made by each one.

Filesystems, of course, generally do not have the luxury of requiring administrators to restore a clean snapshot of the underlying block device on such occasions. And that is why (I understand) the EXT4 designers adopted block image journaling rather than block modification journaling, with its accompanying cost in journal traffic. Many databases, when put into "backup mode", switch to block image journaling for the same reason - the backup software will not necessarily acquire a coherent image of each block.

And of course when such a mode is engaged, redo log traffic skyrockets, giving administrators a healthy incentive to take the database out of backup mode as soon as possible, something made much more practical by the advent of filesystem snapshots - if at the cost of pushing much of the overhead (for a database under heavy write load) onto the volume manager or snapshot-capable filesystem instead.

The Tux3 filesystem returns

Posted Jan 6, 2013 3:29 UTC (Sun) by jeltz (guest, #88600) [Link]

PostgreSQL has full_page_writes on by default which I assume is the setting for controlling "block image journaling". Though I am pretty certain it does not matter for making backups.

The Tux3 filesystem returns

Posted Jan 6, 2013 13:34 UTC (Sun) by andresfreund (subscriber, #69562) [Link]

full_page_writes doesn't mean that full pages are written all the time though - it means that the first time a page is touched after a checkpoint it will be logged entirely. If you have write intensive workloads the system should be configured so that it doesn't checkpoint constantly...
The reason it's on by default is that it's important for consistency in the face of OS/HW crashes. Postgres' pages are normally bigger than the disk's sectors and you could get into problems if e.g. one half of a page was written but the other was not and the WAL record only contains information about one of them. As recovery always starts at a checkpoint it's enough to log the page fully once after each checkpoint.
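The rule described above (a full image the first time a page is touched after a checkpoint, deltas afterwards) can be modelled in a few lines. This is a simplified sketch and not PostgreSQL's actual WAL format or code: the `Wal` class, its method names, and the record tuples are hypothetical.

```python
class Wal:
    """Toy write-ahead log applying a full-page-write rule (hypothetical)."""

    def __init__(self):
        self.records = []
        # Pages already logged in full since the last checkpoint.
        self.logged_since_checkpoint = set()

    def checkpoint(self):
        self.records.append(("checkpoint",))
        self.logged_since_checkpoint.clear()

    def log_change(self, page_no, page_image, delta):
        if page_no not in self.logged_since_checkpoint:
            # First touch after a checkpoint: log the entire page, so that
            # recovery never depends on a possibly torn on-disk copy.
            self.records.append(("full_page", page_no, bytes(page_image)))
            self.logged_since_checkpoint.add(page_no)
        else:
            # Later touches only need to describe the change.
            self.records.append(("delta", page_no, delta))
```

In this model the volume of full-page records depends on how many distinct pages are touched between checkpoints, which is why checkpoint frequency matters for write intensive workloads.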

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds