LWN: Comments on "Improving ext4: bigalloc, inline data, and metadata checksums" https://lwn.net/Articles/469805/ This is a special feed containing comments posted to the individual LWN article titled "Improving ext4: bigalloc, inline data, and metadata checksums". en-us Thu, 18 Sep 2025 10:19:40 +0000 Thu, 18 Sep 2025 10:19:40 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/538945/ https://lwn.net/Articles/538945/ Cato <div class="FormattedComment"> For LVM and write caching setup generally, see <a href="http://serverfault.com/questions/279571/lvm-dangers-and-caveats">http://serverfault.com/questions/279571/lvm-dangers-and-c...</a> <br> <p> You might also like to try ZFS or btrfs - both have enough built-in checksumming that they should detect issues sooner, though in this case Ogg's checksumming is doing that for audio files. With a checksumming FS you could detect whether the corruption is in RAM (seen when writing to file) or on disk (seen when reading from file). 
ZFS also does periodic scrubbing to validate checksums.<br> </div> Tue, 19 Feb 2013 10:23:46 +0000 bigalloc https://lwn.net/Articles/501992/ https://lwn.net/Articles/501992/ Klavs <div class="FormattedComment"> like varnish does :)<br> </div> Thu, 14 Jun 2012 14:19:02 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/499085/ https://lwn.net/Articles/499085/ marcH <div class="FormattedComment"> I'm not 100% sure but I think you just meant:<br> <p> "You can implement a COW tree without writing all the way up the tree if your tree implements versioning".<br> <p> </div> Tue, 29 May 2012 08:49:50 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/492331/ https://lwn.net/Articles/492331/ fragmede <div class="FormattedComment"> I thought *Oracle* was a/the big contributor to btrfs...<br> </div> Fri, 13 Apr 2012 19:34:49 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/492329/ https://lwn.net/Articles/492329/ fragmede <div class="FormattedComment"> OCFS2 does have snapshots though, which is why I use it. :)<br> <p> LVM snapshots are a joke if you have *lots* of snapshots, though I haven't looked at btrfs snapshots since it became production ready.<br> </div> Fri, 13 Apr 2012 19:30:09 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/478026/ https://lwn.net/Articles/478026/ sbergman27 <div class="FormattedComment"> Old thread, I know. But why people are still talking about barriers I'm not sure. Abandoning the use of barriers was agreed upon at the 2010 Linux Filesystem Summit. And they completed their departure in 2.6.37, IIRC. Barriers are no more. They don't matter.
They've been replaced by FUA, etc.<br> </div> Mon, 30 Jan 2012 08:50:33 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/475537/ https://lwn.net/Articles/475537/ sbergman27 <div class="FormattedComment"> Mount with "nodelalloc". I have servers which host quite a few Cobol C/ISAM files. I was uncomfortable with the very idea of delayed allocation. But the EXT4 delayed allocation cheer-leading section, headed by Ted T'So, convinced me that after 2.6.30, it would be OK.<br> <p> The very first time we had a power failure, with a UPS with a bad battery, we experienced corruption in several of those files. Never *ever* *ever* had we experienced such a thing with EXT3. I immediately added nodelalloc as a mount option, and the EXT4 filesystem now seems as resilient as EXT3 ever was. Note that at around the same time as 2.6.30, EXT3 was made less reliable by adding the same 2.6.30 patches to it, and making data=writeback the default journalling mode. So if you do move back to EXT3, make sure to mount with data=journal.<br> <p> BTW, I've not noted any performance differences mounting EXT4 with nodelalloc. Maybe in a side by side benchmark comparison I'd detect something.<br> <p> <p> </div> Sun, 15 Jan 2012 03:45:55 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/474060/ https://lwn.net/Articles/474060/ jsdyson <div class="FormattedComment"> Actually, as the author of earlier forms of the FreeBSD readahead/writebehind, I do know that FreeBSD can be very aggressive with larger reads/writes than just the block size.
One really big advantage of the FreeBSD buffering is that the length of the queues/pending writes is generally planned to be smaller, thereby avoiding that nasty sluggish feeling (or apparent stopping) that occurs with horribly large pending writes.<br> </div> Tue, 03 Jan 2012 17:38:40 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/473607/ https://lwn.net/Articles/473607/ jlokier <blockquote>You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.</blockquote> <p>In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with larger version number (using wraparound comparison) is "current node", and the other is "potential node".</p> <p>To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing version number at start and end if the medium is sure to write sequentially.</p> <p>Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)</p> <p>That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). 
It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.</p> <p>You can imagine a variety of structures there, with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.</p> <p>The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.</p> <p>The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.</p> Sat, 24 Dec 2011 22:17:03 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/473603/ https://lwn.net/Articles/473603/ rich0 <div class="FormattedComment"> You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW.
If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.<br> <p> I believe Btrfs actually uses a journal, and then updates the tree every 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So, the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds' worth of writes.<br> <p> I believe that the Btrfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place you end up with what appears to be atomic writes I think (within a single file only).<br> </div> Sat, 24 Dec 2011 20:56:07 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/473293/ https://lwn.net/Articles/473293/ nix <div class="FormattedComment"> A little civility would be appreciated. Unless you're a minor filesystem deity in pseudonymous disguise, it is reasonable to assume that Ted knows a hell of a lot more about filesystems than you (because he knows a hell of a lot more about filesystems than almost anyone). It's also extremely impolite to accuse someone of lying unless you have proof that what they are saying is not only wrong but maliciously meant. That is very unlikely here.<br> <p> </div> Thu, 22 Dec 2011 11:32:48 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/473236/ https://lwn.net/Articles/473236/ GalacticDomin8r <div class="FormattedComment"> <font class="QuotedText">&gt; Also, note that because of how Soft Update works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates</font><br> <p> Duh. Can you name a file system with integrity features that doesn't introduce a performance penalty?
I thought not. The point is that the Soft Updates method has (far) less overhead than most.<br> <p> <font class="QuotedText">&gt; What's worse, it depends on the disk not reordering write requests</font><br> <p> Bald-faced lie. The only requirement of Soft Updates is that writes reported as done by the disk driver have indeed safely landed in nonvolatile storage.<br> </div> Wed, 21 Dec 2011 23:09:44 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471967/ https://lwn.net/Articles/471967/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;Maybe you got very lucky?</font><br> <p> Maybe I did. Or maybe you got unlucky. Most of the people commenting on it though *never tried*; they just heard something bad via hearsay and parroted it, and that just gets to me.<br> </div> Wed, 14 Dec 2011 12:15:40 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471952/ https://lwn.net/Articles/471952/ andresfreund <div class="FormattedComment"> If you want I can give you the approx calltrace for jbd2 as well, I know it took me some time when I looked it up...<br> </div> Tue, 13 Dec 2011 13:38:59 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471951/ https://lwn.net/Articles/471951/ nix <div class="FormattedComment"> Well, this is clear as mud :) guess I'd better do some code reading and figure out wtf the properties of the system actually are...<br> </div> Tue, 13 Dec 2011 13:35:10 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471830/ https://lwn.net/Articles/471830/ andresfreund <div class="FormattedComment"> Err. Read the code. xfs uses io completion callbacks and only relies on the contents of the journal after the completion returned.
(xlog_sync()-&gt;xlog_bdstrat()-&gt;xfs_buf_iorequest()-&gt;_xfs_buf_ioend()).<br> jbd does something similar but I don't want to look it up unless you're really interested.<br> <p> It worked a little bit more like you describe before 2.6.37 but back then it waited if barriers were disabled.<br> </div> Mon, 12 Dec 2011 18:53:59 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471829/ https://lwn.net/Articles/471829/ dlang <div class="FormattedComment"> issuing barriers is _how_ the filesystem 'waits'<br> <p> it actually doesn't stop processing requests and wait for the confirmation from the disk, it issues a barrier to tell the rest of the storage stack not to reorder around that point and goes on to process the next request and get it in flight.<br> </div> Mon, 12 Dec 2011 18:39:34 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471820/ https://lwn.net/Articles/471820/ andresfreund <div class="FormattedComment"> They do wait for journaled data upon journal commit. Which is the place where barriers are issued anyway.<br> </div> Mon, 12 Dec 2011 18:15:33 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471819/ https://lwn.net/Articles/471819/ dlang <div class="FormattedComment"> no, jlokier is right, barriers are still needed to enforce ordering<br> <p> there is no modern filesystem that waits for the data to be written before proceeding.
Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)<br> <p> when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'<br> </div> Mon, 12 Dec 2011 18:14:11 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471801/ https://lwn.net/Articles/471801/ jimparis <div class="FormattedComment"> In my long-ago experience, reiserfsck --fix-fixable did absolutely nothing to improve a broken filesystem, and --rebuild-tree was the only way to get anything out. Maybe you got very lucky?<br> </div> Mon, 12 Dec 2011 16:46:19 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471795/ https://lwn.net/Articles/471795/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;Did anyone ever fix the ReiserFS tools to the point that you could safely fsck a ReiserFS volume that contained an uncompressed ReiserFS image?</font><br> <p> The existing replies have basically answered this, but just to make it clear:<br> <p> You could always do that.<br> <p> Reiserfs *additionally* came with an *option* designed to make a last-ditch attempt at recovering a totally hosed filesystem by looking for any data on the disk that looked like Reiserfs data structures and making its best guess at rebuilding it based on that.<br> <p> Somehow the FUD brigade latched on to the drawbacks of that feature and conveniently 'forgot' that it was neither the only, nor the default fsck method.<br> </div> Mon, 12 Dec 2011 16:15:31 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471790/ https://lwn.net/Articles/471790/ andresfreund <div class="FormattedComment"> I think these days any sensible fs actually waits for the writes
to reach storage independent of barrier usage. The only difference with barriers on/off is whether a FUA/barrier/whatever is sent to the device to force the device to write out the data.<br> I am rather sure at least ext4 and xfs do it that way.<br> </div> Mon, 12 Dec 2011 15:40:47 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471769/ https://lwn.net/Articles/471769/ jlokier <div class="FormattedComment"> I don't know if btrfs works as you describe, but it is certainly possible to implement a CoW filesystem without "writing all the way up the tree". Think about how journals work without requiring updates to the superblocks that point to them. If btrfs doesn't use that, it's an optimisation waiting to happen.<br> </div> Mon, 12 Dec 2011 12:13:53 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471767/ https://lwn.net/Articles/471767/ jlokier <div class="FormattedComment"> I believe dlang is right. You need to enable barriers even with battery-backed disk write cache. If the storage device has a good implementation, the cache flush requests (used to implement barriers) will be low overhead.<br> <p> Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.<br> <p> Even ignoring ext3's no barrier default, and LVM missing them for ages, there is the kernel I/O queue (elevator) which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, the system recovery will find inconsistent data from the storage unit.
This can happen even after a normal crash such as a kernel panic or hard-reboot, no power loss required.<br> <p> Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)<br> </div> Mon, 12 Dec 2011 12:01:29 +0000 Lossy format conversion https://lwn.net/Articles/471735/ https://lwn.net/Articles/471735/ jimparis <p>You can't replace missing information, but you could still make something that sounds better -- in a subjective sense. For example, maybe the mp3 has harsh artifacts at higher frequencies that the ogg encoder would remove. <p>It could apply to lossy image transformations too. Consider <a href="http://jim.sh/~jim/tmp/lossy/">this sample set of images</a>. An initial image is pixelated (lossy), and that result is then blurred (also lossy). Some might argue that the final result looks better than the intermediate one, even though all it did was throw away more information. <p>But I do agree that this is off-topic, and that such improvement is probably rare in practice. Mon, 12 Dec 2011 02:54:21 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471681/ https://lwn.net/Articles/471681/ vsrinivas <div class="FormattedComment"> FFS w/ soft updates assumes that drives honor write requests in the order they were dispatched. This is not necessarily the case, weakening the guarantees it means to provide. Also FFS doesn't ever issue what linux calls 'barriers' (on BSD known as device cache flushes or BUF_CMD_FLUSH).<br> </div> Sun, 11 Dec 2011 10:18:27 +0000 Lossy format conversion https://lwn.net/Articles/471641/ https://lwn.net/Articles/471641/ corbet Pretty far off-topic, but: it is a rare situation indeed where the removal of information will improve the fidelity of a signal. 
One might not be able to hear the difference, but I have a hard time imagining how conversion between lossy formats could do anything but degrade the quality. You can't put back something that the first lossy encoding took out, but you can certainly remove parts of the signal that the first encoding preserved. Sat, 10 Dec 2011 15:20:13 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471580/ https://lwn.net/Articles/471580/ ibukanov <div class="FormattedComment"> When one approximates another approximation, it is possible the result will be closer to the original than the initial approximation. So in theory one can get a better result with MP3-&gt;OGG conversion. For this reason if tests show that people cannot detect the difference with the *properly* done conversion, then I do not see how one can claim that it can only make the quality worse.<br> </div> Sat, 10 Dec 2011 01:04:26 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471460/ https://lwn.net/Articles/471460/ lopgok <div class="FormattedComment"> I agree. The last 3 motherboards I have bought were for AMD processors. I bought a 3 core phenom II, an asus motherboard, and 4gb of ECC ram for around $200. I have no idea why Intel only supports ECC on their server motherboards. For me, this is a critical feature. In my experience, many Gigabyte motherboards do not support ECC, so check the motherboard manual, or list of supported memory before buying. In fact AMD supports IBM's Chipkill technology which will detect 4 bit errors and correct 3 bit errors.
In addition, my Asus motherboards support memory scrubbing, which can help detect memory errors in a timely fashion.<br> <p> If you buy assembled computers and can't get ECC support without spending big bucks, it is time to switch vendors.<br> <p> It is true that ECC memory is more expensive and less available than non-ECC memory, but the price difference is around 20% or so, and Newegg and others sell a wide variety of ECC memory. Mainstream memory manufacturers, including Kingston sell ECC memory.<br> <p> Of course, virtually all server computers come with ECC memory.<br> </div> Fri, 09 Dec 2011 15:19:43 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471450/ https://lwn.net/Articles/471450/ james <div class="FormattedComment"> AMD processors since the Athlon 64 all support ECC, and most Asus AMD boards (even cheap ones) wire the lines up.<br> <p> Even ECC memory isn't that much more expensive: Crucial do a 2x2GB ECC kit for £27 + VAT ($42 in the US) against £19 ($30).<br> </div> Fri, 09 Dec 2011 13:53:10 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471449/ https://lwn.net/Articles/471449/ nix <div class="FormattedComment"> Oh, agreed. I've seen multiple rounds of friends deciding to save money on a cheap PC, trying to do real work on it, and finding the result a crashy erratic data-corrupting horror that is almost impossible to debug unless you have a second identical machine to swap parts out of... and losing years of working time to these unreliable nightmares. I pay a bit more (well, OK, quite a lot more) and those problems simply don't happen. I don't think this is ECCRAM, though: I think it's simply a matter of tested components with a decent safety margin rather than bargain-basement junk. 
<br> <p> EDAC support for my Nehalem systems landed in mainline a couple of years ago but I'll admit to never having looked into how to get it to tell me what errors may have been corrected, so I have no idea how frequent they might be.<br> <p> (And if it didn't mean dealing with Dell I might consider one of those machines myself...)<br> <p> </div> Fri, 09 Dec 2011 12:40:04 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471413/ https://lwn.net/Articles/471413/ quotemstr <div class="FormattedComment"> <font class="QuotedText">&gt; Dell T3500 Precision Workstation, which supports up to 24GB of ECC or non-ECC memory. </font><br> <p> I have the same machine. Oddly enough, it only supports 12GB of non-ECC memory, at least according to Dell's manual. How does that happen?<br> <p> (Also, Intel's processor datasheet claims that several hundred gigabytes of either ECC or non-ECC memory should be supported using the integrated memory controller. I wonder why Dell's system supports less.)<br> </div> Fri, 09 Dec 2011 07:41:10 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471378/ https://lwn.net/Articles/471378/ tytso <div class="FormattedComment"> What I have under my desk at work (and I'm quite happy with it) is the Dell T3500 Precision Workstation, which supports up to 24GB of ECC or non-ECC memory. It's not a mini-ATX desktop, but it's definitely not a server, either.<br> <p> I really like how quickly I can build kernels on this machine. 
:-)<br> <p> I'll grant it's not "cheap" in absolute terms, but I've always believed that skimping on a craftsman's tools is false economy.<br> <p> </div> Fri, 09 Dec 2011 00:57:16 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471374/ https://lwn.net/Articles/471374/ raven667 <div class="FormattedComment"> I think you are right, I may have misspoken.<br> <p> <p> </div> Thu, 08 Dec 2011 23:24:18 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471283/ https://lwn.net/Articles/471283/ nix <div class="FormattedComment"> Yep. That's why I said it was worthwhile. But 'very cheap'? Not unless 'cheap' means 'costs much more money than other alternatives'. Yes, it has benefits, but immediate financial return is not one of them.<br> <p> (Also, last time I tried you couldn't buy a desktop with ECCRAM for love nor money. Servers, sure, but not desktops. So of course all my work stays on the server with battery-backed hardware RAID and ECCRAM, and I just have to hope the desktop doesn't corrupt it in transit.)<br> <p> </div> Thu, 08 Dec 2011 19:10:46 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471264/ https://lwn.net/Articles/471264/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed</font><br> <p> Surely what you're describing is a cache flush, not a barrier?<br> <p> A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (i.e.
you're getting durability of the first write), with a corresponding performance impact.<br> </div> Thu, 08 Dec 2011 17:54:42 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471263/ https://lwn.net/Articles/471263/ tytso <div class="FormattedComment"> Whether or not it is cheap depends on how much you value your data.<br> <p> It's like people who balk at spending an extra $200 to mirror their data, or to provide a hot spare for their RAID array. How much would you be willing to spend to get back your data after you discover it's been vaporized? What kind of chances are you willing to take against that eventuality happening?<br> <p> It will vary depending on each person, but traditionally people are terrible at figuring out cost/benefit tradeoffs.<br> <p> </div> Thu, 08 Dec 2011 17:47:09 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471245/ https://lwn.net/Articles/471245/ nix <blockquote> It is very cheap insurance. </blockquote> Look at the price differential between the motherboards and CPUs that support ECCRAM and those that do not. Now add in the extra cost of the RAM. <p> ECCRAM is worthwhile, but it is <i>not</i> at all cheap once you factor all that in. Thu, 08 Dec 2011 16:24:45 +0000 Improving ext4: bigalloc, inline data, and metadata checksums https://lwn.net/Articles/471230/ https://lwn.net/Articles/471230/ lopgok <div class="FormattedComment"> You should generate a checksum for each file in your filesystem.<br> I wrote a trivial python script to generate a checksum file for each directory's files. If you run it, and it finds a checksum file, it checks that the files in the directory match the checksum file, and if they don't it reports that.<br> <p> I wrote it when I had a serverworks chipset on my motherboard that corrupted IDE hard drives when DMA was enabled.
However, the utility lets me know there is no bit rot in my files.<br> <p> It can be found at <a rel="nofollow" href="http://jdeifik.com/">http://jdeifik.com/</a> , look for 'md5sum a directory tree'. It is GPL3 code. It works independently from the files being checksummed and independently of the file system. I have found flaky disks that passed every other test with this utility.<br> <p> The other thing that can corrupt files is memory errors. Many new computers do not support ECC memory. If you care about data integrity, you should use ECC memory. Intel has this feature for their server chips (xeons) and AMD has this feature for all of their processors (though not all motherboard makers support it).<br> It is very cheap insurance.<br> <p> </div> Thu, 08 Dec 2011 15:34:52 +0000 Rebuild tree a useful feature with side-effects. https://lwn.net/Articles/471057/ https://lwn.net/Articles/471057/ tytso <div class="FormattedComment"> The fsck for ext2/3/4 doesn't have this feature because it doesn't need it. One of the tradeoffs of using a dynamic inode table (since in reiserfs it is stored as part of the btree) is if you lose the root node of the file system, you have no choice but to search the entire disk looking for nodes that appear to belong to the file system b-tree.<br> <p> With ext 2/3/4, we have a static inode table. This does have some disadvantages, but the advantage is that it's much more robust against file system damage, since the location of the metadata is much more predictable.<br> <p> </div> Thu, 08 Dec 2011 05:35:16 +0000