
Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:05 UTC (Sat) by nix (subscriber, #2304)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by dlang
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

Really? In that case there's an awful lot of documentation out there that needs rewriting. I was fairly sure that the raison d'être of battery backup was 1) to make RAID-[56] work in the presence of power failures without data loss, and 2) to eliminate the need to force-flush to disk to ensure data integrity, ever, unless you think your power will be off for so long that the battery will run out.

If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly, I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?



Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 15:39 UTC (Sat) by dlang (subscriber, #313) [Link]

There are two stages to writing things to a RAID array:

1. writing from the OS to the RAID card

2. writing from the RAID card to the drives

Battery backup on the RAID card makes step 2 reliable. This means that once data has been written to the RAID card, it can be considered about as safe as if it were on the actual drives (not quite that safe, but close enough).

However, without barriers, the data isn't sent from the OS to the RAID card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the RAID controller and other data not making it there if you have an unclean shutdown. If the data is never sent to the RAID controller, then the battery there can't do you any good.

With barriers, the system can enforce that data gets to the RAID controller in a particular order, so the only data lost is whatever was written after the last barrier operation completed.

Note that if you are using software RAID, things are much uglier, as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). This is one of the places where hardware RAID is significantly more robust than software RAID.
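
To make the ordering requirement concrete, here is a minimal userspace sketch of the classic journal-then-commit pattern, with fdatasync() standing in for the barrier/flush; the file name, offsets, and record contents are made up for illustration:

    /*
     * The journal-then-commit pattern: the commit record must not reach
     * stable storage before the journal record it commits. fdatasync()
     * is the ordering point; without it, writeback may push these two
     * writes to the controller in either order.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "journal entry describing an update";
        const char commit[] = "commit";

        /* Step 1: the journal record goes into the OS page cache. */
        if (pwrite(fd, record, sizeof(record), 0) < 0) { perror("pwrite"); return 1; }

        /* Ordering point: everything above must be stable before we go on. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        /* Step 2: only now is the commit record written. */
        if (pwrite(fd, commit, sizeof(commit), 4096) < 0) { perror("pwrite"); return 1; }

        close(fd);
        return 0;
    }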

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:04 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Maybe I'm wrong, but I don't think it works that way. Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed. So I think it already works the way you want; filesystems already manage their writes and caching, AFAIK.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:31 UTC (Sat) by dlang (subscriber, #313) [Link]

I'm not quite sure which part of my statement you are disagreeing with.

Barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order.

With barriers on software RAID, the RAID layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:17 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I guess I was under the impression, incorrect as it may be, that the concept of write barriers was already baked into most responsible filesystems, but that support for working through LVM was recent (within the last five years), and that support for actually issuing the right commands to the storage, and having the storage respect them, was also more recent. Maybe I'm wrong and barriers as a concept are newer.

In any event, there is a bright line between how the kernel handles internal data structures and what the hardware does. For storage with a battery-backed write cache, once an I/O is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only issue is that the kernel must issue the I/O requests in a responsible manner.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:41 UTC (Sun) by dlang (subscriber, #313) [Link]

Barriers as a concept are not new, but your assumption that filesystems support them is the issue.

Per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.

So barriers actually working correctly is relatively new (and very recently, more efficient ways have been found to enforce ordering than the older version of barriers).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:24 UTC (Sun) by tytso (subscriber, #9993) [Link]

JFS still to this day does not issue barriers / cache flushes.

It shouldn't be that hard to add support, but no one is doing any development work on it.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:26 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

JFS has never been the default in Fedora.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:50 UTC (Sun) by dlang (subscriber, #313) [Link]

I didn't think that I ever implied that it was.

Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:41 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

It appears you did:

"JFS does not, for a long time (even after it was the default in Fedora)"

Your claim about the installer is inaccurate as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:22 UTC (Sun) by dlang (subscriber, #313) [Link]

re: JFS, oops; I don't know what I was thinking when I typed that.

re: XFS, I've been using Linux since '94, so XFS support in the installer is very recent :-)

I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:53 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

"Very recent" is relative and not quite so accurate either. All versions of Fedora installer have supported XFS. You just had to pass "xfs" as a installer option. Same with jfs or reiserfs. Atleast Fedora 10 beta onwards supports XFS as a standard option without having to do anything

http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...

That is early 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well, IIRC.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:15 UTC (Mon) by wookey (subscriber, #5501) [Link]

I think dlang meant this:
"..., for a long time (even after it was the default in Fedora), LVM did not"

(I parsed it the way rahulsundaram did too - it's not clear).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:59 UTC (Mon) by dlang (subscriber, #313) [Link]

Yes, now that you say that, it reminds me: I meant that for a long time after LVM became the default on Fedora, it didn't support barriers.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 30, 2012 8:50 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Old thread, I know, but I'm not sure why people are still talking about barriers. Abandoning them was agreed upon at the 2010 Linux Filesystem Summit, and the removal was completed in 2.6.37, IIRC. Barriers are no more. They don't matter. They've been replaced by FUA, etc.
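
For illustration, a sketch of what the FUA-style replacement looks like from userspace: a write on a file opened with O_DIRECT | O_DSYNC does not return until the data is stable, and on devices that support it the kernel can satisfy that with an FUA write rather than a full cache flush. The file name and block size here are assumptions:

    /* Durable direct write via O_DIRECT | O_DSYNC; on FUA-capable devices
     * the kernel can use an FUA write instead of a separate cache flush.
     * Either way, pwrite() does not return until the data is stable. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires an aligned buffer; 4096 covers typical devices. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 'x', 4096);

        if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }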

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 17:54 UTC (Thu) by nye (guest, #51576) [Link]

>Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed

Surely what you're describing is a cache flush, not a barrier?

A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?), but in that case you're getting slightly more than you asked for (i.e. durability of the first write), with a corresponding performance impact.
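
A small sketch of that distinction, assuming a hypothetical file: Linux userspace exposes no ordering-only primitive, so the usual way to order two writes is a flush (fdatasync()), which indeed buys durability of the first write as a side effect:

    /* What we want is ordering (A before B); what userspace can ask for
     * is durability of A, then B. The flush buys strictly more than a
     * pure barrier would: a device round-trip and a cache flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int write_ordered(int fd, const void *a, size_t alen, off_t aoff,
                             const void *b, size_t blen, off_t boff)
    {
        if (pwrite(fd, a, alen, aoff) < 0)
            return -1;
        /* Ordering via flush: also waits for A to reach stable storage. */
        if (fdatasync(fd) < 0)
            return -1;
        return pwrite(fd, b, blen, boff) < 0 ? -1 : 0;
    }

    int main(void)
    {
        int fd = open("ordered.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write_ordered(fd, "A", 1, 0, "B", 1, 4096) < 0)
            perror("write_ordered");
        close(fd);
        return 0;
    }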

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 23:24 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think you are right; I may have misspoken.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:01 UTC (Mon) by jlokier (guest, #52227) [Link]

I believe dlang is right. You need to enable barriers even with a battery-backed disk write cache. If the storage device has a good implementation, the cache flush requests (used to implement barriers) will be low overhead.

Some battery-backed disk write caches can commit the cache RAM to flash storage or similar, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.

Even ignoring ext3's no-barrier default, and LVM missing barriers for ages, there is the kernel I/O queue (the elevator), which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then, after a system crash, recovery will find inconsistent data on the storage device. This can happen even after a normal crash such as a kernel panic or hard reboot; no power loss is required.

Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, XFS, Btrfs, etc. behave in that case. I always use barriers :-)
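
As a concrete example of "using barriers": ext3/ext4 accept a barrier= mount option (ext4 defaults to barriers on; ext3 historically defaulted to off). A sketch via mount(2), with a hypothetical device and mount point, which needs root to run:

    /* Mount an ext4 filesystem with barriers explicitly enabled.
     * /dev/sdb1 and /mnt/data are placeholders. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("/dev/sdb1", "/mnt/data", "ext4", 0, "barrier=1") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }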

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 15:40 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

I think these days any sensible filesystem actually waits for the writes to reach storage independent of barrier usage. The only difference with barriers on/off is whether a FUA/barrier/whatever is sent to the device to force it to write out the data.
I am rather sure that at least ext4 and XFS do it that way.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:14 UTC (Mon) by dlang (subscriber, #313) [Link]

No, jlokier is right; barriers are still needed to enforce ordering.

There is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later).

When the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless barriers have been issued to tell the OS that 'these writes must happen before these other writes'.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:15 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

They do wait for journaled data upon journal commit, which is the place where barriers are issued anyway.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:39 UTC (Mon) by dlang (subscriber, #313) [Link]

Issuing barriers is _how_ the filesystem 'waits'.

It doesn't actually stop processing requests and wait for confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point, then goes on to process the next request and get it in flight.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:53 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Err, read the code. XFS uses I/O completion callbacks and only relies on the contents of the journal after the completion has returned (xlog_sync() -> xlog_bdstrat() -> xfs_buf_iorequest() -> _xfs_buf_ioend()).
jbd does something similar, but I don't want to look it up unless you're really interested.

It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:35 UTC (Tue) by nix (subscriber, #2304) [Link]

Well, this is clear as mud :) I guess I'd better do some code reading and figure out what the properties of the system actually are...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:38 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

If you want, I can give you the approximate call trace for jbd2 as well; I know it took me some time when I looked it up...

