The Tux3 filesystem returns

Posted Jan 4, 2013 3:26 UTC (Fri) by butlerm (guest, #13312)
In reply to: The Tux3 filesystem returns by daniel
Parent article: The Tux3 filesystem returns

That sounds great. As I mentioned earlier, many databases do something similar, except on a logical-overwrite rather than a copy-on-write basis. If a transaction needs to modify a B-tree index block (for example), a copy of the index block is modified in RAM, a log record describing the block change is created, and the redo log is forced to disk whenever a transaction (or group of transactions) is committed.

Then at some future point the database server process forces a more recent version of the index block back to disk at the same location as the old one, either because it needs to free up pages, is running a checkpoint, or is in the process of shutting down. After a checkpoint is complete the versions of the blocks on disk are up-to-date up through the logical sequence number of the checkpoint.
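As a rough illustration of the sequence described above, here is a minimal Python sketch. Everything in it (the ToyDatabase class, the 16-byte blocks, the tuple layout of a log record) is a hypothetical simplification for this comment and does not reflect any real database's format.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # assumed toy block size


@dataclass
class ToyDatabase:
    disk: dict = field(default_factory=dict)         # block_no -> bytes stored in place "on disk"
    cache: dict = field(default_factory=dict)        # block_no -> modified copy held in RAM
    log_buffer: list = field(default_factory=list)   # redo records not yet forced to disk
    durable_log: list = field(default_factory=list)  # redo records that have been forced
    next_lsn: int = 1
    checkpoint_lsn: int = 0

    def modify(self, block_no, offset, data):
        """Change a copy of the block in RAM and describe the change in a redo record."""
        current = self.cache.get(block_no, self.disk.get(block_no, bytes(BLOCK_SIZE)))
        block = bytearray(current)
        block[offset:offset + len(data)] = data
        self.cache[block_no] = bytes(block)
        self.log_buffer.append((self.next_lsn, block_no, offset, bytes(data)))
        self.next_lsn += 1

    def commit(self):
        """Force the redo log; the data blocks themselves are not written yet."""
        self.durable_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def checkpoint(self):
        """Write cached blocks back over their old locations."""
        self.commit()  # the log must be durable before any block is overwritten
        for block_no, block in self.cache.items():
            self.disk[block_no] = block
        self.cache.clear()
        # Blocks on disk are now current up through this sequence number.
        self.checkpoint_lsn = self.next_lsn - 1
```

After commit() only the log is durable; the block on disk changes only once checkpoint() (or some other write-back) runs.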

The vulnerability is that if any block is corrupted, it is highly likely that the database will have to be recovered from a coherent snapshot. The archived redo information generated in the interim is then re-applied to restore the database to its state at the moment of the crash, and the uncommitted transactions are rolled back, typically by reversing (undoing) all the block changes made by each one.
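That recovery sequence might be sketched in Python as follows. This is a toy under loud assumptions: the function name, the record layout (transaction, block number, old value, new value), and the rule that no two in-flight transactions touch the same block are all invented here for illustration.

```python
def recover(snapshot, archived_redo, committed):
    """Restore a snapshot, redo all logged changes, then undo uncommitted ones.

    snapshot:      dict of block_no -> value from a coherent backup
    archived_redo: list of (txn, block_no, old_value, new_value) in log order
    committed:     set of transaction ids that committed before the crash
    """
    blocks = dict(snapshot)

    # Redo phase: re-apply every change to reach the state at the crash.
    for txn, block_no, old_value, new_value in archived_redo:
        blocks[block_no] = new_value

    # Undo phase: reverse uncommitted changes, newest first.
    # Assumes uncommitted transactions do not share blocks with later writers.
    for txn, block_no, old_value, new_value in reversed(archived_redo):
        if txn not in committed:
            blocks[block_no] = old_value

    return blocks
```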

Filesystems, of course, generally do not have the luxury of requiring administrators to restore a clean snapshot of the underlying block device on such occasions. And that is why (I understand) the ext4 designers adopted block image journaling rather than block modification journaling, with its accompanying cost in journal traffic. Many databases, when put into "backup mode", switch to block image journaling for the same reason: the backup software will not necessarily acquire a coherent image of each block.

And of course when such a mode is engaged, redo log traffic skyrockets, giving administrators a healthy incentive to take the database out of backup mode as soon as possible, something made much more practical by the advent of filesystem snapshots - if at the cost of pushing much of the overhead (for a database under heavy write load) onto the volume manager or snapshot-capable filesystem instead.

The Tux3 filesystem returns

Posted Jan 6, 2013 3:29 UTC (Sun) by jeltz (guest, #88600)

PostgreSQL has full_page_writes on by default which I assume is the setting for controlling "block image journaling". Though I am pretty certain it does not matter for making backups.

The Tux3 filesystem returns

Posted Jan 6, 2013 13:34 UTC (Sun) by andresfreund (subscriber, #69562)

full_page_writes doesn't mean that full pages are written all the time though - it means that the first time a page is touched after a checkpoint it will be logged in its entirety. If you have write-intensive workloads, the system should be configured so that it doesn't checkpoint constantly...
The reason it's on by default is that it's important for consistency in the face of OS/HW crashes. Postgres' pages are normally bigger than the disk's sectors, and you could get into problems if e.g. one half of a page was written but the other was not, and the WAL record only contains information about one of them. As recovery always starts at a checkpoint, it's enough to log the page fully once after each checkpoint.
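To make the "log the whole page once per checkpoint" rule concrete, here is a small Python sketch. It is only a toy model: the ToyWal class, its record tuples, and the 8-byte pages are assumptions made up for this example and are not PostgreSQL's actual WAL format or code.

```python
class ToyWal:
    def __init__(self):
        self.records = []
        self.logged_since_checkpoint = set()  # pages already logged in full

    def checkpoint(self):
        self.records.append(("checkpoint",))
        self.logged_since_checkpoint.clear()

    def log_change(self, page_no, page_after, offset, data):
        """Log a change; page_after is the page content with the change applied."""
        if page_no not in self.logged_since_checkpoint:
            # First touch since the checkpoint: log the entire page image.
            self.records.append(("full_page", page_no, bytes(page_after)))
            self.logged_since_checkpoint.add(page_no)
        else:
            # Later touches only need the bytes that changed.
            self.records.append(("delta", page_no, offset, bytes(data)))

    def recover(self, disk):
        """Replay from the last checkpoint; full images overwrite torn pages."""
        start = max(
            (i for i, rec in enumerate(self.records) if rec[0] == "checkpoint"),
            default=-1,
        )
        for rec in self.records[start + 1:]:
            if rec[0] == "full_page":
                _, page_no, image = rec
                disk[page_no] = bytearray(image)
            else:
                _, page_no, offset, data = rec
                disk[page_no][offset:offset + len(data)] = data
        return disk
```

Because the first record for each page after a checkpoint is a full image, replay never depends on whatever half-written content the page holds on disk.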
