Improving ext4: bigalloc, inline data, and metadata checksums
Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums
You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.
In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with larger version number (using wraparound comparison) is "current node", and the other is "potential node".
To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing version number at start and end if the medium is sure to write sequentially.
Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)
That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.
You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.
The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.
The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.
to post comments)