The Tux3 filesystem returns

Posted Jan 2, 2013 21:46 UTC (Wed) by kjp (subscriber, #39639)
Parent article: The Tux3 filesystem returns

This was a nice read... anything involving ACID makes me happy (er, that sounds bad), but I missed a key point somehow. Are you writing data blocks once or twice (once to the redo log, then checkpointed somewhere else much later)?

It does seem an awful lot like an RDBMS. But even postgres has some quirks, like indexes never shrinking, or vacuum sometimes running constantly in the background and thus taking forever to free up space on large tables... so your blurb about 'defragmentation' made me realize that for open source projects, zero-admin-maintenance is certainly a very high goal. Cheers.

The Tux3 filesystem returns

Posted Jan 2, 2013 23:11 UTC (Wed) by moltonel (subscriber, #45207)

It took postgres a lot of time, but vacuum-related annoyances get fixed one by one with each release, to the point that it is now pretty much trouble-free. Hopefully tux3 can progress in a similar fashion. Good luck!

The Tux3 filesystem returns

Posted Jan 3, 2013 0:52 UTC (Thu) by butlerm (guest, #13312)

Most databases have to deal with multiple independent transactions, where most filesystems, thankfully, do not. A couple of decades ago almost every database out there would acquire page locks that would block all other page access until the locking transaction was committed.

You can just imagine if access to a directory stalled because another pending transaction had an uncommitted modification to the same directory. Or if pending additions to a directory had to be invisible to other readers, data writes had to be invisible to other readers until committed and so on. There are ways to solve that problem in database system design, but fortunately filesystems generally speaking don't have to deal with it.

The Tux3 filesystem returns

Posted Jan 3, 2013 8:11 UTC (Thu) by daniel (subscriber, #3181)

Filesystems actually do need to deal with multiple independent transactions because of shared resources like bitmaps and the inode table. We need to ensure that for each delta the bitmaps exactly match the file data and directory entries exactly match the inode table, and do that without holding long duration, performance killing locks.

The Tux3 filesystem returns

Posted Jan 3, 2013 16:33 UTC (Thu) by butlerm (guest, #13312)

A modern database deals with multiple long lived independent transactions that can independently be committed or rolled back, involve independent, isolated views of what the contents of the database look like that evolve as each transaction progresses, row level locking so that no more than one uncommitted version of a row exists at any given time, and multiple version concurrency control so that all other readers only see the previously committed versions of rows.

POSIX filesystem semantics don't have anything like it - in fact they actually forbid it. There is only one view of the filesystem to all user processes, and everything that happens takes effect instantaneously. No isolation, no process specific view of the contents of files or directories, no (user visible) transaction commit or rollback.

Everything a strict POSIX filesystem does with transactions is for recovery purposes, in an attempt to recreate something resembling a consistent snapshot of the system at the time it failed, or at least a working inconsistent one. And certainly one would want to pipeline the assembly and retiring of those transactions, so that the next one can be constructed while previous ones are being retired. I am glad to hear that Tux3 does just that.

The Tux3 filesystem returns

Posted Jan 3, 2013 8:46 UTC (Thu) by daniel (subscriber, #3181)

We write blocks once. The only duplication is, sometimes we will write one or more log records to avoid writing a full block that has only a few bytes changed, and flush the retained dirty block much later. This is a small multiplier in terms of total blocks written, so maybe our average write multiplication factor is 1.01. It is possible to construct cases where it is higher. We will only ever know the actual multiplier by measuring it under typical loads, which we have not done yet. But rounding down, the answer to your question is "once".

For example we might write ten megabyte data extents, plus four blocks to complete an atomic commit, including one log block, giving a write multiplier of 1.0016. Note that this multiplier includes all metadata. In our "rollup" a few more blocks may need to be accounted to the multiplier, maybe pushing it up to 1.003, or maybe not moving the needle at all because those retained blocks may be amortized over many deltas. In other words, we often write dirty metadata blocks zero times, rounding down modestly.
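The arithmetic above can be reproduced with a short sketch. It assumes 4 KiB blocks, which the comment does not state but which matches the quoted 1.0016; the function name and the choice of four extra rollup blocks are made up for illustration.

```python
# Assumption: 4 KiB blocks; the comment does not state the block size.
BLOCK_SIZE = 4096


def write_multiplier(data_bytes, overhead_blocks, block_size=BLOCK_SIZE):
    """Total blocks written divided by data blocks written."""
    data_blocks = -(-data_bytes // block_size)  # ceiling division
    return (data_blocks + overhead_blocks) / data_blocks


TEN_MB = 10 * 1024 * 1024

# Ten megabytes of data plus four commit blocks (one of them a log block).
commit_only = write_multiplier(TEN_MB, 4)

# Hypothetical: four more retained blocks charged to the same delta at rollup.
with_rollup = write_multiplier(TEN_MB, 8)

print(f"{commit_only:.4f} {with_rollup:.4f}")
```

Under these assumptions the commit alone costs 2564 blocks for 2560 blocks of data, and charging a handful of rollup blocks to a single delta lands near the 1.003 mentioned above.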

The Tux3 filesystem returns

Posted Jan 3, 2013 16:45 UTC (Thu) by butlerm (guest, #13312)

I am curious what you use as a key or identifier to a changed metadata block, e.g. one that has at least one consistent prior location, one delta entry in the log somewhere, and one pending rewrite in a future rollup.

Are you pre-allocating the location of the next version, and identifying it that way? Or perhaps referring to the existing version (by location), plus a list of the (accumulated) deltas? Or do you use some other meta-data-block identifier to correlate the different versions of the meta data block and any logged deltas for recovery purposes?

The Tux3 filesystem returns

Posted Jan 3, 2013 23:07 UTC (Thu) by daniel (subscriber, #3181)

We preallocate the location of the next version, your first guess. Let's call this a "retained" metadata block that will be flushed in a future rollup. We say "redirect" when we allocate a new volume position for a read-only block. When we need to modify a clean btree node we redirect it in cache, emit the corresponding balloc log record (thus avoiding writing out the bitmap), modify it, emit the node modification log record, then recursively redirect its parents in the btree access path until we hit a dirty one (usually the immediate parent).
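A rough sketch of the redirect sequence described above, in Python rather than Tux3's C. The class names, the log record tuples, and the trivial counter used as an allocator are all hypothetical; how a dirty parent's child pointer gets recorded is also an assumption, since the comment does not say.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Node:
    block: int                       # current volume position
    parent: Optional["Node"] = None
    dirty: bool = False


class Delta:
    """Hypothetical in-memory state; every name here is illustrative."""

    def __init__(self, first_free_block):
        self._free = count(first_free_block)  # stand-in for the real allocator
        self.log = []                         # log records emitted so far

    def redirect(self, node):
        """Give a clean node a new volume position, in cache only."""
        node.block = next(self._free)
        node.dirty = True
        # Logging the allocation avoids writing out the bitmap block.
        self.log.append(("balloc", node.block))

    def modify(self, node, change):
        moved = not node.dirty
        if moved:
            self.redirect(node)
        self.log.append(("modify", node.block, change))
        # A moved node needs a new pointer in its parent, so clean parents
        # are redirected too, until a dirty ancestor is reached.
        parent = node.parent
        while moved and parent is not None:
            moved = not parent.dirty
            if moved:
                self.redirect(parent)
                self.log.append(("modify", parent.block, "child pointer"))
            # Assumption: a dirty parent is just updated in cache.
            parent = parent.parent


root = Node(block=1, dirty=True)
inner = Node(block=2, parent=root)
leaf = Node(block=3, parent=inner)
delta = Delta(first_free_block=100)
delta.modify(leaf, "insert key")   # redirects leaf and inner, stops at root
delta.modify(leaf, "insert key")   # leaf already dirty: one record only
```

The second modification shows why the recursive cost rounds to zero for local access patterns: once the path is dirty, each further change emits a single log record.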

Though you didn't ask, we can estimate the per-delta writeout cost. We emit an allocation record and a modify record per redirect, which can actually be one record with some modest optimizing. Given a reasonably local btree modification pattern, the number of recursive redirects rounds to zero, so our theoretical cost is roughly one log record per dirty block per rollup. The number of dirty metadata blocks written per rollup, given a localized burst of modifications, is roughly the filesystem tree depth, log_base_196(tree_leaves), a small number that rounds to zero when amortized across many deltas. We can further improve this (rounding closer to zero) by permitting certain dirty metadata to survive rollup, for example, when we need to redirect the root of a file index tree or a high level node of the inode table btree. This is a future optimization. We can round even closer to zero by bumping our average branching factor up from 196 to 213 using b+tree splitting, if anybody cares to add that.
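To put numbers on the tree depth estimate, here is a small sketch that counts levels with integer arithmetic (avoiding floating point logarithms). The function name and the leaf counts are chosen only for illustration.

```python
def tree_depth(leaves, branching):
    """Smallest depth d such that branching**d >= leaves."""
    depth, reach = 0, 1
    while reach < leaves:
        reach *= branching
        depth += 1
    return depth


# Illustrative leaf counts, not measurements from Tux3.
for leaves in (10**6, 8 * 10**6, 10**9):
    print(leaves, tree_depth(leaves, 196), tree_depth(leaves, 213))
```

For most sizes the two branching factors give the same depth; the higher factor only saves a level for leaf counts that fall between powers of 196 and powers of 213, such as the eight million case above.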

The Tux3 filesystem returns

Posted Jan 4, 2013 3:26 UTC (Fri) by butlerm (guest, #13312)

That sounds great. As I mentioned earlier, many databases do something similar, except on a logical overwrite rather than copy-on-write basis. If you have a transaction that needs to modify a b-tree index block (for example), a copy of the index block is modified in RAM, a log record describing the block change is created, and the redo log is forced to disk whenever a transaction (or group of transactions) is committed.

Then at some future point the database server process forces a more recent version of the index block back to disk at the same location as the old one, either because it needs to free up pages, is running a checkpoint, or is in the process of shutting down. After a checkpoint is complete the versions of the blocks on disk are up-to-date up through the logical sequence number of the checkpoint.
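The overwrite-in-place scheme in the last two paragraphs can be sketched as a toy model. This is not any particular database's design: the class, its fields, and the 16-byte page size are invented, and undo of uncommitted changes is left out entirely.

```python
PAGE = 16  # toy page size, chosen only to keep the example small


class OverwriteStore:
    """Toy block store: redo log plus overwrite-in-place checkpoints."""

    def __init__(self):
        self.disk = {}           # block number -> bytes (durable)
        self.durable_log = []    # (lsn, block, offset, data) forced to disk
        self.cache = {}          # modified copies held in RAM
        self.pending = []        # log records not yet forced
        self.lsn = 0
        self.checkpoint_lsn = 0

    def _load(self, block):
        return bytearray(self.disk.get(block, bytes(PAGE)))

    def modify(self, block, offset, data):
        page = self.cache.setdefault(block, self._load(block))
        page[offset:offset + len(data)] = data
        self.lsn += 1
        self.pending.append((self.lsn, block, offset, data))

    def commit(self):
        """Force the redo log; the data blocks themselves stay in RAM."""
        self.durable_log.extend(self.pending)
        self.pending.clear()

    def checkpoint(self):
        """Log first, then write each block back to its old location."""
        self.commit()
        for block, page in self.cache.items():
            self.disk[block] = bytes(page)
        self.cache.clear()
        self.checkpoint_lsn = self.lsn

    def crash_and_recover(self):
        """Drop volatile state, then replay durable records past the checkpoint."""
        self.cache.clear()
        self.pending.clear()
        for lsn, block, offset, data in self.durable_log:
            if lsn > self.checkpoint_lsn:
                page = self._load(block)
                page[offset:offset + len(data)] = data
                self.disk[block] = bytes(page)
                self.checkpoint_lsn = lsn


store = OverwriteStore()
store.modify(1, 0, b"ab")
store.commit()
store.modify(1, 2, b"cd")      # never committed
store.crash_and_recover()      # block 1 now holds "ab" only
```

Note that recovery here depends on the on-disk block being intact before the logged change is reapplied, which is exactly the vulnerability to corrupted or torn blocks that comes up later in the thread.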

The vulnerability is that if any block is corrupted, it is highly likely that the database will have to be recovered from a coherent snapshot, and the archived redo information generated in the interim re-applied to restore the database to the state it was at the moment of the crash, and then the uncommitted transactions rolled back, typically by reversing or undo-ing all the block changes made by each one.

Filesystems, of course, generally do not have the luxury of requiring administrators to restore a clean snapshot of the underlying block device on such occasions. And that is why (I understand) the EXT4 designers adopted block image journaling rather than block modification journaling, with its accompanying cost in journal traffic. Many databases, when put into "backup mode" switch to block image journaling for the same reason - the backup software will not necessarily acquire a coherent image of each block.

And of course when such a mode is engaged, redo log traffic skyrockets, giving administrators a healthy incentive to take the database out of backup mode as soon as possible, something made much more practical by the advent of filesystem snapshots - if at the cost of pushing much of the overhead (for a database under heavy write load) onto the volume manager or snapshot-capable filesystem instead.

The Tux3 filesystem returns

Posted Jan 6, 2013 3:29 UTC (Sun) by jeltz (guest, #88600)

PostgreSQL has full_page_writes on by default which I assume is the setting for controlling "block image journaling". Though I am pretty certain it does not matter for making backups.

The Tux3 filesystem returns

Posted Jan 6, 2013 13:34 UTC (Sun) by andresfreund (subscriber, #69562)

full_page_writes doesn't mean that full pages are written all the time though - it means that the first time a page is touched after a checkpoint it will be logged entirely. If you have write intensive workloads the system should be configured in a way that it doesn't checkpoint constantly...
The reason it's on by default is that it's important for consistency in the face of OS/HW crashes. Postgres' pages are normally bigger than the disk's sectors, and you could get into problems if e.g. one half of a page was written but the other was not and the WAL record only contains information about one of them. As recovery always starts at a checkpoint, it's enough to log the page fully once after each checkpoint.
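The rule described above (a full image on the first touch after a checkpoint, small records afterwards) can be modelled in a few lines. This is a simplified illustration, not PostgreSQL's WAL code; the class and record names are invented, and only the 8 kB default page size comes from PostgreSQL itself.

```python
PAGE_SIZE = 8192  # PostgreSQL's default page size


class ToyWal:
    """Hypothetical model of the full_page_writes rule."""

    def __init__(self):
        self.records = []
        self.imaged = set()   # pages fully logged since the last checkpoint

    def log_change(self, page_no, page, offset, data):
        page[offset:offset + len(data)] = data
        if page_no not in self.imaged:
            # First touch since the checkpoint: log the whole page so that
            # recovery never depends on a possibly torn on-disk copy.
            self.records.append(("full_page", page_no, bytes(page)))
            self.imaged.add(page_no)
        else:
            self.records.append(("delta", page_no, offset, bytes(data)))

    def checkpoint(self):
        # Recovery starts here, so every page needs a fresh image afterwards.
        self.imaged.clear()


wal = ToyWal()
page = bytearray(PAGE_SIZE)
wal.log_change(7, page, 0, b"x")    # full page image
wal.log_change(7, page, 1, b"y")    # small delta record
wal.checkpoint()
wal.log_change(7, page, 2, b"z")    # full page image again
```

This also shows why frequent checkpoints inflate WAL volume under a write-heavy load: each checkpoint resets the set of pages that have already been logged in full.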