LWN.net Logo

Improving ext4: bigalloc, inline data, and metadata checksums

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:43 UTC (Sun) by tytso (subscriber, #9993)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by alankila
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

Actually, in some cases btrfs (or any copy on write, or CoW file system) may require more metadata blocks to be written than a traditional journalling file system design. That's because even though a CoW file system doesn't have a journal, when you update a metadata block, you have to update all of the metadata blocks that point to it.

So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then you need to write a copy of its grandparent, with a pointer to the new parent node, all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. Which is why CoW file systems tend to be very memory hungry; if you are under a lot of memory pressure because you're running a cloud server, and are trying to keep lots of VM's packed into a server (or are on an EC2 VM where extra memory costs $$$$), good luck to you.

At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.


(Log in to post comments)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:13 UTC (Mon) by jlokier (guest, #52227) [Link]

I don't know if btrfs works as you describe, but it is certainly possible to implement a CoW filesystem without "writing all the way up the tree". Think about how journals work without requiring updates to the superblocks that point to them. If btrfs doesn't use that, it's an optimisation waiting to happen.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 20:56 UTC (Sat) by rich0 (guest, #55509) [Link]

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

I believe Btrfs actually uses a journal, and then updates the tree very 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So, the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds worth of writes.

I believe that the Btfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place you end up with what appears to be atomic writes I think (within a single file only).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier (guest, #52227) [Link]

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with larger version number (using wraparound comparison) is "current node", and the other is "potential node".

To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing version number at start and end if the medium is sure to write sequentially.

Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)

That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.

You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.

The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.

The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted May 29, 2012 8:49 UTC (Tue) by marcH (subscriber, #57642) [Link]

I'm not 100% sure but I think you just meant:

"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds