Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 10:31 UTC (Sun) by alankila (guest, #47141)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by dlang
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

As far as I can tell, the same argument could be made about rotational media. If only there were a way to write things out in atomic chunks that move the FS metadata from one valid state to another, a journal wouldn't be necessary... The journal doesn't even improve performance (or shouldn't), because its contents must be merged with the on-disk data structures at some point anyway.

Anyway, isn't btrfs going to give us journal-less but atomic filesystem modification behavior?



Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:43 UTC (Sun) by tytso (subscriber, #9993) [Link]

Actually, in some cases btrfs (or any copy-on-write, or CoW, file system) may require more metadata blocks to be written than a traditional journalling file system design. That's because even though a CoW file system doesn't have a journal, when you update a metadata block, you also have to update all of the metadata blocks that point to it.

So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then a copy of its grandparent with a pointer to the new parent node, and so on all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. That is why CoW file systems tend to be very memory-hungry; if you are under a lot of memory pressure because you're running a cloud server and are trying to keep lots of VMs packed into it (or are on an EC2 VM where extra memory costs $$$$), good luck to you.
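As a rough illustration of that copy-up cost, here is a minimal sketch (not btrfs code; the Node class and helper are made up for this example) of how updating one leaf in a CoW b-tree forces new copies of every node on the path back to the root:

    # Minimal sketch, not btrfs code: updating one leaf in a CoW b-tree means
    # writing a new copy of every node on the path from that leaf to the root,
    # because each parent must point at its freshly written child.

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys          # sorted separator keys
            self.children = children  # list of child Nodes (interior) or None
            self.values = values      # dict of key -> value (leaf) or None

    def pick_child(keys, key):
        """Index of the child subtree covering `key`."""
        for i, k in enumerate(keys):
            if key < k:
                return i
        return len(keys)

    def cow_update(node, key, value):
        """Return a NEW node reflecting the update; the old node is untouched."""
        if node.children is None:          # leaf: copy it and modify the copy
            new_values = dict(node.values)
            new_values[key] = value
            return Node(node.keys, values=new_values)
        i = pick_child(node.keys, key)
        new_children = list(node.children)
        new_children[i] = cow_update(node.children[i], key, value)
        # This interior node must also be rewritten, since one child pointer
        # changed; the same happens at every level up to and including the root.
        return Node(node.keys, children=new_children)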

At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:13 UTC (Mon) by jlokier (guest, #52227) [Link]

I don't know if btrfs works as you describe, but it is certainly possible to implement a CoW filesystem without "writing all the way up the tree". Think about how journals work without requiring updates to the superblocks that point to them. If btrfs doesn't use that, it's an optimisation waiting to happen.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 20:56 UTC (Sat) by rich0 (guest, #55509) [Link]

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

I believe Btrfs actually uses a journal, and then updates the tree every 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So the tree itself is always in a clean state (if the change propagates to the root, it points to an up-to-date clean tree; if it doesn't propagate to the root, it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds' worth of writes.

I believe that the Btrfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place, you end up with what appear to be atomic writes, I think (within a single file only).
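A toy model may help picture that compromise. This is only a sketch of the general design described above (periodic clean-tree commits plus a small journal that absorbs writes in between), not Btrfs internals; the class, method names, and the 30-second constant are illustrative:

    # Toy model, not Btrfs internals: the CoW tree is committed to a clean
    # root only periodically, with a small log absorbing writes in between.
    # Crash recovery = last committed root + log replay.

    import time

    COMMIT_INTERVAL = 30.0   # seconds, matching the interval described above

    class ToyFs:
        def __init__(self):
            self.committed_root = {}   # last clean tree on "disk"
            self.dirty_tree = {}       # in-memory changes since last commit
            self.log = []              # journal entries since last commit
            self.last_commit = time.monotonic()

        def write(self, path, data):
            self.dirty_tree[path] = data
            self.log.append(("write", path, data))   # cheap, sequential append
            if time.monotonic() - self.last_commit >= COMMIT_INTERVAL:
                self.commit()

        def fsync(self, path):
            # Only the log needs to reach stable storage, not a tree-wide
            # propagation of new pointers up to the root.
            self.flush_log()

        def commit(self):
            # The expensive step: fold changes into a new clean tree/root.
            self.committed_root.update(self.dirty_tree)
            self.dirty_tree.clear()
            self.log.clear()
            self.last_commit = time.monotonic()

        def flush_log(self):
            pass  # stand-in for writing self.log to stable storage

        def recover(self):
            # After a crash: clean (possibly stale) root plus log replay.
            tree = dict(self.committed_root)
            for op, path, data in self.log:
                if op == "write":
                    tree[path] = data
            return tree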

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier (guest, #52227) [Link]

> You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

In fact you can. The simplest illustration: for every tree node in the current design, allocate 2 nodes on storage, and replace every pointer in the current interior-node format with 2 pointers, one to each of the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with the larger version number (using wraparound comparison) is the "current node", and the other is the "potential node".

To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner and overwrite that in place with a version number 1 higher than the existing tree node's. The tree is thus updated. The write is made atomic using the same methods a robust journal needs: relying on the medium writing a single sector atomically if the node fits in one, or using a node checksum, or writing the version number at both the start and the end if the medium is sure to write sequentially.
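Here is a minimal sketch of that scheme, assuming the two-slot, 2-bit-version layout described above; the names and the in-memory representation are purely illustrative, not from any real filesystem:

    # Sketch of the scheme above: each logical node owns two on-disk slots;
    # the slot with the newer 2-bit version (wraparound compare) is live, and
    # an update overwrites the other slot in place, so no parent changes.

    def newer(a, b):
        """Wraparound comparison of 2-bit versions: is a newer than b?"""
        return ((a - b) & 0b11) in (1, 2)   # a is 1 or 2 steps ahead, mod 4

    class SlotPair:
        def __init__(self):
            # (version, payload) for each of the two pre-allocated slots;
            # slot 0 starts out as the "current node".
            self.slots = [(0, None), (3, None)]

        def current_index(self):
            (v0, _), (v1, _) = self.slots
            return 0 if newer(v0, v1) else 1

        def read(self):
            return self.slots[self.current_index()][1]

        def write(self, payload):
            cur = self.current_index()
            cur_version = self.slots[cur][0]
            pot = 1 - cur                      # the "potential node" partner
            # Overwrite the potential slot with version+1 (mod 4); the parent
            # pointers are unchanged, so nothing propagates up the tree.
            self.slots[pot] = ((cur_version + 1) & 0b11, payload)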

Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)

That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.

You can imagine a variety of structures whose space and behaviour sit in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.

The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.

The modern trick of "mounted dirty" bits for large block ranges in some filesystems, used to reduce fsck time, also has a natural analogue: dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and they can be cleaned in a variety of ways, one of which is the preload mentioned above.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted May 29, 2012 8:49 UTC (Tue) by marcH (subscriber, #57642) [Link]

I'm not 100% sure but I think you just meant:

"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 13:11 UTC (Sun) by dlang (subscriber, #313) [Link]

SSDs are inherently COW: they can't modify a block in place, but have to copy it (along with all the other disk blocks that make up the erase block) to a new erase block.

This is OK with large streaming writes, but horrible with many small writes to the same area of disk.

The journal is many small writes to the same area of disk, which is exactly the worst case for an SSD.

Also, with rotational media, writing all the blocks in place requires many seeks before the data can be considered safe, and if you need to write the blocks in a particular order, you may end up seeking back and forth across the disk. With an SSD, the order the blocks are written in doesn't affect how long it takes to write them.

By the way, I'm not the OP who said that all journaling filesystems are bad on SSDs; I'm just pointing out some reasons why this could be the case.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:39 UTC (Sun) by tytso (subscriber, #9993) [Link]

Flash erase blocks are around a megabyte these days. So all modern SSDs use a Flash Translation Layer (FTL) that allows writes smaller than an erase block to get written together in a single erase block. So it's simply not true that if you do a small random write of 16k, the SSD will need to copy all of the other disk blocks that make up the erase block.

This might be the case for cheap MMC or SD cards that are designed for use in digital cameras, but an SSD which is meant for use in a computer will have a much more sophisticated FTL than that.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:38 UTC (Sun) by dlang (subscriber, #313) [Link]

If you have 4K of data that is part of an erase block and you modify it, that erase block no longer contains entirely valid info; since the drive can't overwrite the 4K in place, it will need to write it to a new erase block.

Yes, in theory it could mark that 4K of data as obsolete and only write the new data to a new erase block, but that would lead to fragmentation where the disk could have 256 1M chunks, each with 4K of obsolete data in them, and to regain any space it would then need to re-write 255M of data.

Given the performance impact of stalling for this long on a write (not to mention the problems you would run into if you didn't have that many blank erase blocks available), I would assume that if you re-write a 4K chunk, when the drive writes that data it will re-write the rest of the erase block as well, so that it can free up the old erase block.

The flash translation layer lets it mix logical blocks within erase blocks, and the drives probably do something in between the two extremes I listed above (so they probably track a few holes, but not too many).
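Those two extremes, and the middle ground between them, can be sketched as a very simplified FTL model; none of this reflects any real drive's firmware, and the class, constants, and method names below are made up for illustration:

    # Simplified FTL model, not real firmware: logical 4K pages are remapped
    # into whichever erase block is currently open, the old copy is merely
    # marked stale, and garbage collection later copies the still-valid pages
    # out of a mostly-stale erase block before it can be erased and reused.

    PAGES_PER_ERASE_BLOCK = 256   # 256 * 4K = 1M erase block, as discussed above

    class SimpleFtl:
        def __init__(self, num_erase_blocks):
            self.mapping = {}               # logical page -> (block, page index)
            self.blocks = [[] for _ in range(num_erase_blocks)]    # written pages
            self.stale = [set() for _ in range(num_erase_blocks)]  # stale indices
            self.open_block = 0

        def write_page(self, logical_page):
            # Instead of rewriting the whole erase block, just mark the old
            # physical copy stale and append the new data to the open block.
            if logical_page in self.mapping:
                blk, idx = self.mapping[logical_page]
                self.stale[blk].add(idx)
            if len(self.blocks[self.open_block]) == PAGES_PER_ERASE_BLOCK:
                self.open_block = self._pick_free_block()
            blk = self.open_block
            idx = len(self.blocks[blk])
            self.blocks[blk].append(logical_page)
            self.mapping[logical_page] = (blk, idx)

        def garbage_collect(self, blk):
            # Reclaiming blk means relocating every still-valid page first:
            # the "re-write most of 256M to regain space" cost described above.
            if blk == self.open_block:
                self.open_block = self._pick_free_block()
            for idx, logical_page in enumerate(self.blocks[blk]):
                if idx not in self.stale[blk]:
                    self.write_page(logical_page)   # relocate valid data
            self.blocks[blk] = []
            self.stale[blk] = set()

        def _pick_free_block(self):
            for i, pages in enumerate(self.blocks):
                if not pages:
                    return i
            raise RuntimeError("no empty erase blocks left; more GC needed")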

