Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 20:22 UTC (Wed) by walex (guest, #69836)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by cmccabe
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

No journaled filesystem is good for SSDs.

And this is relevant to ext4... exactly how?

Posted Nov 30, 2011 21:33 UTC (Wed) by khim (subscriber, #9252) [Link] (10 responses)

Take a look here. Note the linux version number...

And this is relevant to ext4... exactly how?

Posted Nov 30, 2011 23:16 UTC (Wed) by Lennie (subscriber, #49641) [Link] (9 responses)

Google stores it's data on ext4 without journal:

http://www.youtube.com/watch?v=Wp5Ehw7ByuU

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 1:01 UTC (Thu) by SLi (subscriber, #53131) [Link] (8 responses)

Then again Google normally has three copies of every piece of important data on different computers, so they're not too concerned about failures due to not journaling.

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 1:59 UTC (Thu) by dlang (guest, #313) [Link] (7 responses)

journaling (as used by default on every distro I know) almost never prevents data loss, at least not directly. All that journaling does is make it so that the filesystem metadata makes sense, the metadata may be pointing at garbage data, but you aren't as likely to get the metadata corrupted in such a way that continues use of the filesystem after a failure will corrupt existing data.

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 3:29 UTC (Thu) by tytso (subscriber, #9993) [Link] (6 responses)

fsync() in combination with a journal will protect against data loss.

But yes, a journal by itself has as its primary feature avoiding long fsck times. One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.

If you are replicating your data anyway because you're using a cluster file system such as Hadoopfs, and you're confident that your data center has appropriate contingencies that mitigate against a simultaneous data-center wide power loss event (i.e., you have bat, and diesel generators, etc., and you test all of this equipment regularly), then it may be that going without a journal makes sense. You really need to know what you are doing though, and it requires careful design both at the hardware level, the data center level, as well as the storage stack above the local disk file system.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 18:55 UTC (Fri) by walex (guest, #69836) [Link] (5 responses)

One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.

That is the case only for fully undamaged filesystems, that is the common case of a periodic filesystem check. I have never seen any reports that the new 'e2fsck' is faster on damaged filesystems too. And since a damaged 1.5TB 'ext3' filesystem was reported take 2 months to 'fsck', even a factor of 10 is not going to help a lot.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 19:10 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

I've had to do fsck on multi-TB filesystems after unclean shutdowns, they can take a long time, but time measured in hours (to a couple days for the larger ones). I suspect that if you are taking months, you have some other bottleneck in place as well.

And this is relevant to ext4... exactly how?

Posted Dec 3, 2011 0:40 UTC (Sat) by walex (guest, #69836) [Link]

An unclean shutdown is usually not that damaged, which can however happen with a particularly bad unclean shutdown (lots of stuff in flight, for example on a wide RAID) or RAM/disk errors. The report I saw was not for a "enterprise" system with battery, ECC and a redundant storage layer.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 21:41 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

This has been wrong for years. As long as your filesystem was built with the uninit_bg option (which it is by default), block groups which have never been used will not need to be fscked either, hugely speeding up passes 2 and 5 (at the very least).

Fill up the fs, even once, and this benefit goes away -- but a *lot* of filesystems sit for years mostly empty. fscking those filesystems is very, very fast these days (I've seen subsecond times for mostly-empty multi-Tb filesystems).

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 22:45 UTC (Fri) by tytso (subscriber, #9993) [Link] (1 responses)

We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table, which would restore the speedups caused by needing to scan the entire inode table. I haven't bothered to do this, but I'll add it to my todo list. (Or someone can send me a patch; it would be trivial to do this at e2fsck, but we could do it in the kernel, too.)

Not all of the improvements in fsck time come from being able to skip reading portions of the inode table. Extent tree blocks are also far more efficient than indirect blocks, and so that contributes to much of the speed improvements of fsck'ing an ext4 filesystem compared to an ext2 or ext3 file system.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 23:35 UTC (Fri) by nix (subscriber, #2304) [Link]

We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table

That seems hard to me. It's easy to tell if you need to increase the high watermark when adding a new file: but when you delete one, how can you tell what to reduce the high watermark to without doing a fairly expensive scan?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:25 UTC (Sun) by alankila (guest, #47141) [Link] (21 responses)

Just to get the argument out in the open, what is the basis for making this claim?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:38 UTC (Sun) by dlang (guest, #313) [Link] (19 responses)

for one thing, SSDs are write limited and have effectively zero seek time.

journaling writes data twice with the idea being that the first one is to a sequential location that is going to be fast and then the following write will be to the random location

with no seek time, you should be able to write the data to it's final location directly and avoid the second write. All you need to do is to enforce the ordering of the writes and you should be just as safe as with a journal, without the extra overhead.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:49 UTC (Sun) by mjg59 (subscriber, #23239) [Link] (9 responses)

Doesn't that assume that you can perform a series of atomic operations that will result in a consistent filesystem? If that's not true then you still need to be able to indicate the beginning of a transaction, the contents of that transaction and the end of it. If all of that hits the journal first then you can play the entire transaction, but if you were doing it directly to the filesystem then a poorly timed crash might hit an inconsistent point in the middle.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 5:05 UTC (Sun) by dlang (guest, #313) [Link] (8 responses)

that's true, but the trade-off is that you avoid writing the data to the journal, and then writing to the journal again to indicate the the transaction is finished.

if what you are writing is metadata, it seems like it shouldn't be that hard, since there isn't that much metadata to be written.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:32 UTC (Sun) by tytso (subscriber, #9993) [Link] (6 responses)

The problem is that many file system operations require you to update more than one metadata block. For example, when you move a file from one directory to another, you need to add a directory entry into one directory, and remove a directory entry from another.

Or when you allocate a disk block, you need to modify the block allocation bitmap (or whatever data structure you use to indicate that the block is in use) and then update the data structures which map a particular inode's logical to physical block map.

Without a journal, you can't do this atomically, which means the state of the file system is undefined after a unclean/unexpected shutdown of the OS.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:02 UTC (Sun) by kleptog (subscriber, #1183) [Link] (5 responses)

Indeed. If there were an efficient way to guarantee consistency without a journal there'd be a significant market for it, namely in databases. Journals are a well understood and effective way of managing integrity of complicated disk structures. There are other ways, but journaling beats the others on a number of fronts.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 0:40 UTC (Tue) by cmccabe (guest, #60281) [Link] (4 responses)

There is an efficient way to guarantee consistency without a journal. Soft updates. See http://en.wikipedia.org/wiki/Soft_updates. The main disadvantage of soft updates is that the code seems to be more complex.

Soft updates would not work for databases, because database operations often need to be logged "logically" rather than "physically." For example, when you encounter an update statement that modifies every row of the table, you just want to add the update statement itself to the journal, not the contents of every row.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 1:24 UTC (Tue) by tytso (subscriber, #9993) [Link] (3 responses)

The problems with Soft Updates are quite adequately summed up here, by Val Aurora (formerly Henson): http://lwn.net/Articles/339337/

My favorite line from that article is "...and then I turn to page 8 and my head explodes."

The *BSD's didn't get advanced features such as Extended Attribute until some 2 or 3 years after Linux. My theory why is that it required someone as smart as Kirk McKusick to be able to modify UFS with Soft Updates to add support for Extended Attributes and ACL's.

Also, note that because of how Soft Update works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates; it is not free. What's worse, it depends on the disk not reordering write requests, which modern disks do to avoid seeks (in some cases a write can not make it onto the platter in the absence of a Cache Flush request for 5-10 seconds or more). If you disable the HDD's write cacheing, your lose a lot of performance on HDD's; if you leave it enabled (which is the default) your data is not safe.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 11, 2011 10:18 UTC (Sun) by vsrinivas (subscriber, #56913) [Link]

FFS w/ soft updates assumes that drives honor write requests in the order they were dispatched. This is not necessarily the case, weakening the guarantees it means to provide. Also FFS doesn't ever issue what linux calls 'barriers' (on BSD known as device cache flushes or BUF_CMD_FLUSH).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 21, 2011 23:09 UTC (Wed) by GalacticDomin8r (guest, #81935) [Link] (1 responses)

> Also, note that because of how Soft Update works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates

Duh. Can you name a file system with integrity features that doesn't introduce a performance penalty? I thought not. The point is that the Soft Updates method is (far) less overhead than most.

> What's worse, it depends on the disk not reordering write requests

Bald faced lie. The only requirement of SU's is that writes reported as done by disk driver are indeed safely landed in the nonvolatile storage.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 22, 2011 11:32 UTC (Thu) by nix (subscriber, #2304) [Link]

A little civility would be appreciated. Unless you're a minor filesystem deity in pseudonymous disguise, it is reasonable to assume that Ted knows a hell of a lot more about filesystems than you (because he knows a hell of a lot more about filesystems than almost anyone). It's also extremely impolite to accuse someone of lying unless you have proof that what they are saying is not only wrong but maliciously meant. That is very unlikely here.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:13 UTC (Sun) by mjg59 (subscriber, #23239) [Link]

The trade-off is that you go from a situation where you can guarantee metadata consistency to one where you can't. SSDs may make the window of inconsistency smaller, but it's still there.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 10:31 UTC (Sun) by alankila (guest, #47141) [Link] (8 responses)

As far as I can tell, the same argument could be made about rotational media. If only there was a way to write things out in atomic chunks that result in FS metadata moving from a valid state to another valid state, journal wouldn't be necessary... The journal doesn't even improve performance (or shouldn't) because its contents must be merged with the on-disk datastructures at some point anyway.

Anyway, isn't btrfs going to give us journal-less but atomic filesystem modification behavior?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:43 UTC (Sun) by tytso (subscriber, #9993) [Link] (4 responses)

Actually, in some cases btrfs (or any copy on write, or CoW file system) may require more metadata blocks to be written than a traditional journalling file system design. That's because even though a CoW file system doesn't have a journal, when you update a metadata block, you have to update all of the metadata blocks that point to it.

So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then you need to write a copy of its grandparent, with a pointer to the new parent node, all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. Which is why CoW file systems tend to be very memory hungry; if you are under a lot of memory pressure because you're running a cloud server, and are trying to keep lots of VM's packed into a server (or are on an EC2 VM where extra memory costs $$$$), good luck to you.

At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:13 UTC (Mon) by jlokier (guest, #52227) [Link] (3 responses)

I don't know if btrfs works as you describe, but it is certainly possible to implement a CoW filesystem without "writing all the way up the tree". Think about how journals work without requiring updates to the superblocks that point to them. If btrfs doesn't use that, it's an optimisation waiting to happen.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 20:56 UTC (Sat) by rich0 (guest, #55509) [Link] (2 responses)

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

I believe Btrfs actually uses a journal, and then updates the tree very 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So, the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds worth of writes.

I believe that the Btfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place you end up with what appears to be atomic writes I think (within a single file only).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier (guest, #52227) [Link] (1 responses)

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with larger version number (using wraparound comparison) is "current node", and the other is "potential node".

To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing version number at start and end if the medium is sure to write sequentially.

Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)

That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.

You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.

The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.

The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted May 29, 2012 8:49 UTC (Tue) by marcH (subscriber, #57642) [Link]

I'm not 100% sure but I think you just meant:

"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 13:11 UTC (Sun) by dlang (guest, #313) [Link] (2 responses)

SSDs are inherently COW, they can't modify a block in place, but have to copy it (and all the other disk blocks that form up the erase block) to a new erase block.

this is ok with large streaming writes, but horrible with many small writes to the same area of disk.

the journal is many small writes to the same area of disk, exactly the worst case for an SSD

also with rotational media, writing all the block in place requires many seeks before the data can be considered safe, and if you need to write the blocks in a particular order, you may end up seeking back and forth across the disk. with a SSD the order the blocks are written in doesn't affect how long it takes to write them.

by the way, i'm not the OP who said that all journaling filesystems are bad on SSDs, I'm just pointing out some reasons why this could be the case.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:39 UTC (Sun) by tytso (subscriber, #9993) [Link] (1 responses)

Flash erase blocks are around a megabyte these days. So all modern SSD's use a Flash Translation Layer (FTL) that allows writes smaller to than an erase block to get written together in a single erase block. So it's simply not true that if you do a small random write of 16k, that the SSD will need to copy all of the other disk blocks that form up the erase block.

This might be the case for cheap MMC or SD cards that are designed for use in digital cameras, but an SSD which is meant for use in a computer will have a much more sophisticated FTL than that.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:38 UTC (Sun) by dlang (guest, #313) [Link]

if you have 4K of data that is part of an eraseblock that you modify, that eraseblock now no longer contains valid info, so since it can't overwrite the 4K it will need to write to a new eraseblock.

yes, in theory it could mark that 4k of data as being obsolete and only write new data to a new eraseblock, but that would lead to fragmentation where the disk could have 256 1M chunks, each with 4K of obsolete data in them, and to regain any space it would then need to re-write 255M of data.

given the performance impact of stalling for this long on a write (not the mention the problems you would run into if you didn't have that many blank eraseblocks available), I would assume that if you re-write a 4k chunk, when it writes that data it will re-write the rest of the eraseblock as well so that it can free up the old eraseblock

the flash translation layer lets it mix the logical blocks in the eraseblocks, and the drives probably do something in between the two extremes I listed above (so they probably track a few holes, but not too many)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 1:34 UTC (Mon) by cmccabe (guest, #60281) [Link]

walex said:
> > No journaled filesystem is good for SSDs

alankila said:
> Just to get the argument out in the open, what is the basis
> for making this claim?

Well, SSDs have a limited number of write cycles. With metadata journaling, you're effectively writing all the metadata changes twice instead of once. That will wear out the flash faster. I think a filesystem based on soft updates might do well on SSDs.

Of course the optimal thing would be if the hardware would just expose an actual MTD interface and let us use NilFS or UBIFS. But so far, that shows no signs of happening. The main reason seems to be that Windows is not able to use raw MTD devices, and most SSDs are sold into the traditional Windows desktop market.

Valerie Aurora also wrote an excellent article about the similarities between SSD block remapping layers and log structured filesystems here: http://lwn.net/Articles/353411/