
Improving ext4: bigalloc, inline data, and metadata checksums

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:15 UTC (Fri) by tytso (subscriber, #9993)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by walex
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

One caution about JFS. JFS does not issue cache flush (i.e., barrier) requests, which (a) gives it a speed advantage over file systems that do issue cache flush commands as necessary, and (b) makes JFS unsafe against power failures. Which is most of the point of having a journal...

So benchmarking JFS against file systems that are engineered to be safe against power failures, such as ext4 and XFS, isn't particularly fair. You can disable cache flushes for both ext4 and XFS, but would you really want to run in an unsafe configuration for production servers? And JFS doesn't even have an option for enabling barrier support, so you can't make it run safely without fixing the file system code.
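For concreteness, this is the sort of unsafe configuration being described. A minimal sketch using mount(2); the device and mount point are hypothetical, and it needs root (ext4 of this era took "barrier=0" as a mount option, XFS "nobarrier"):

    /* Sketch: mount ext4 with write barriers (cache flushes) disabled.
     * /dev/sdX1 and /mnt/scratch are hypothetical. Fast, but unsafe
     * against power failures, as discussed above. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("/dev/sdX1", "/mnt/scratch", "ext4", 0, "barrier=0") != 0) {
            perror("mount");        /* needs root privileges */
            return 1;
        }
        puts("ext4 mounted with barriers off");
        return 0;
    }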



Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:56 UTC (Sat) by walex (guest, #69836) [Link]

As to JFS and performance and barriers with XFS and ext4:

  • I mentioned JFS as a "general purpose" filesystem, for example for desktops and random servers, in the sense that it should have been the default instead of ext3 (which acquired barriers a bit late).
  • Anyhow, on production servers I personally regard battery backup as essential, as barriers and/or disabling write caching can both have a huge impact, depending on workload.
  • The speed tests I have done, seen, and trust are with barriers disabled, with either batteries or write caching off, and with O_DIRECT (it is very difficult for me to like any file system test without O_DIRECT; see the sketch after this list). I think these are fair conditions.
  • Part of the reason why barriers were added to ext3 (and at least initially they had horrible performance) and not to JFS is that ext3 was chosen as the default filesystem and thus became community supported, and JFS did not.
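As a concrete illustration of the O_DIRECT access pattern mentioned above, a minimal sketch; the file name and the 4096-byte alignment are assumptions (O_DIRECT requires the buffer, offset, and length to be suitably aligned, typically to the sector size):

    /* Sketch: one O_DIRECT write. The page cache is bypassed, so the
     * timing reflects the device rather than RAM. */
    #define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        /* O_DIRECT needs an aligned buffer; 4096 is a common choice. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 0xab, 4096);

        int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, 4096) != 4096)   /* goes straight to storage */
            perror("write");
        close(fd);
        free(buf);
        return 0;
    }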

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 1:56 UTC (Sat) by dlang (subscriber, #313) [Link]

Battery backup does not make disabling barriers safe. Without barriers, data leaves RAM to be sent to the disk at unpredictable times, so if you lose the contents of RAM (power off, reboot, hang, etc.) you can end up with garbage on your disk as a result, even if you have a battery-backed disk controller.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 3:06 UTC (Sat) by raven667 (subscriber, #5198) [Link]

I'm pretty sure that, in this context, the OP was talking about battery-backed write-cache RAM on the disk controller.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 6:29 UTC (Sat) by dlang (subscriber, #313) [Link]

That's what I think as well, and my point is that having battery-backed RAM on the controller does not make it safe to disable barriers.

It should make barriers very fast, so there isn't a big performance hit from leaving them on; but if you disable barriers and think the battery will save you, you are sadly mistaken.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:05 UTC (Sat) by nix (subscriber, #2304) [Link]

Really? In that case there's an awful lot of documentation out there that needs rewriting. I was fairly sure that the raison d'être of battery backup was (1) to make RAID-[56] work in the presence of power failures without data loss, and (2) to eliminate the need to force-flush to disk to ensure data integrity, ever, unless you think your power will be off for so long that the battery will run out.

If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 15:39 UTC (Sat) by dlang (subscriber, #313) [Link]

There are two stages to writing things to a RAID array:

1. writing from the OS to the RAID card

2. writing from the RAID card to the drives

Battery backup on the RAID card makes step 2 reliable. This means that once data is written to the RAID card, it should be considered about as safe as if it were on the actual drives (it's not quite that safe, but close enough).

However, without barriers, the data isn't sent from the OS to the RAID card in any predictable pattern. It's sent at the whim of the OS cache-flushing algorithm. This can result in some data making it to the RAID controller and other data not making it there if you have an unclean shutdown. If the data is never sent to the RAID controller, then the battery there can't do you any good.

With barriers, the system can enforce that data gets to the RAID controller in a particular order, so the only data that would be lost is the data written since the last barrier operation completed.

Note that if you are using software RAID, things are much uglier, as the OS may have written a stripe to one drive and not to another (barriers only work on a single drive, not across drives). This is one of the places where hardware RAID is significantly more robust than software RAID.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:04 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Maybe I'm wrong, but I don't think it works that way. Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed. So I think it already works the way you want; filesystems already manage their writes and caching, afaik.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:31 UTC (Sat) by dlang (subscriber, #313) [Link]

I'm not quite sure which part of my statement you are disagreeing with.

Barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order.

With barriers on software RAID, the RAID layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:17 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I guess I was under the impression, incorrect as it may be, that the concept of write barriers was already baked into most responsible filesystems, but that support for working through LVM was recent (in the last 5 years), and support for actually issuing the right commands to the storage, and having the storage respect them, was also more recent. Maybe I'm wrong and barriers as a concept are newer.

In any event, there is a bright line between how the kernel handles its internal data structures and what the hardware does; for storage with a battery-backed write cache, once an IO is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only requirement is that the kernel issue the IO requests in a responsible manner.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:41 UTC (Sun) by dlang (subscriber, #313) [Link]

Barriers as a concept are not new, but your assumption that filesystems support them is the issue.

per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.

So barriers actually working correctly is relatively new (and very recently, more efficient ways have been found to enforce ordering than the older style of barriers).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:24 UTC (Sun) by tytso (subscriber, #9993) [Link]

JFS still to this day does not issue barriers / cache flushes.

It shouldn't be that hard to add support, but no one is doing any development work on it.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:26 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

JFS has never been default in Fedora.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:50 UTC (Sun) by dlang (subscriber, #313) [Link]

I didn't think that I ever implied that it was.

Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:41 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

It appears you did:

"JFS does not, for a long time (even after it was the default in Fedora)"

You are inaccurate in your claim about the installer as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:22 UTC (Sun) by dlang (subscriber, #313) [Link]

Re: JFS, oops, I don't know what I was thinking when I typed that.

Re: XFS, I've been using Linux since '94, so XFS support in the installer is very recent :-)

I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:53 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

"Very recent" is relative and not quite so accurate either. All versions of Fedora installer have supported XFS. You just had to pass "xfs" as a installer option. Same with jfs or reiserfs. Atleast Fedora 10 beta onwards supports XFS as a standard option without having to do anything

http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...

That is early 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well, IIRC.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:15 UTC (Mon) by wookey (subscriber, #5501) [Link]

I think dlang meant this:
"..., for a long time (even after it was the default in Fedora), LVM did not"

(I parsed it the way rahulsundaram did too - it's not clear).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:59 UTC (Mon) by dlang (subscriber, #313) [Link]

Yes, now that you say that, it reminds me that I meant that for a long time after LVM was the default on Fedora, it didn't support barriers.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 30, 2012 8:50 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Old thread, I know, but I'm not sure why people are still talking about barriers. Abandoning the use of barriers was agreed upon at the 2010 Linux Filesystem Summit, and the departure was completed in 2.6.37, IIRC. Barriers are no more. They don't matter. They've been replaced by FUA, etc.
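For what it's worth, the FUA idea is visible from user space as well. A minimal sketch, with a hypothetical file name: O_DSYNC asks for data-integrity completion on every write, which on capable devices the kernel can satisfy with an FUA write rather than a full cache flush:

    /* Sketch: per-write durability. With O_DSYNC, write(2) returns
     * only once the data is on stable storage. "commit.dat" is a
     * hypothetical path. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("commit.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char rec[] = "commit record\n";
        if (write(fd, rec, sizeof rec - 1) < 0)   /* durable when it returns */
            perror("write");
        close(fd);
        return 0;
    }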

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 17:54 UTC (Thu) by nye (guest, #51576) [Link]

>Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed

Surely what you're describing is a cache flush, not a barrier?

A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?), but in that case you're getting slightly more than you asked for (i.e. you're getting durability of the first write), with a corresponding performance impact.
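A user-space analogue of that trade-off, as a minimal sketch with hypothetical file names: the only portable way to say "A must be on disk before B" is a flush (fdatasync) between the two writes, which also buys durability of A whether you wanted it or not:

    /* Sketch: ordering two dependent writes with a flush in between. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char body[]   = "transaction body\n";
        const char commit[] = "commit block\n";

        if (write(fd, body, sizeof body - 1) < 0)
            perror("write body");

        /* The "barrier": everything above is durable once this returns.
         * That is more than pure ordering requires, hence the cost. */
        if (fdatasync(fd) != 0)
            perror("fdatasync");

        if (write(fd, commit, sizeof commit - 1) < 0) /* only valid after the flush */
            perror("write commit");
        if (fdatasync(fd) != 0)
            perror("fdatasync");
        close(fd);
        return 0;
    }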

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 23:24 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think you are right; I may have misspoken.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:01 UTC (Mon) by jlokier (guest, #52227) [Link]

I believe dlang is right. You need to enable barriers even with battery-backed disk write cache. If the storage device has a good implementation, the cache flush requests (used to implement barriers) will be low overhead.

Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.

Even ignoring ext3's no-barrier default, and LVM missing barriers for ages, there is the kernel I/O queue (elevator), which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, recovery will find inconsistent data on the storage unit. This can happen even after a normal crash such as a kernel panic or hard reboot; no power loss required.

Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)
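One quick way to see what you are actually running with is to look at the mount options. A minimal sketch; note that which filesystems expose a barrier setting in /proc/mounts (e.g. "barrier=1" or "barrier=0" for ext3/ext4) varies by kernel version:

    /* Sketch: list mounted filesystems that advertise a barrier option. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mounts", "r");
        char line[1024];
        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            if (strstr(line, "barrier"))    /* crude, but illustrative */
                fputs(line, stdout);
        fclose(f);
        return 0;
    }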

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 15:40 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

I think these days any sensible fs actually waits for the writes to reach storage, independent of barrier usage. The only difference with barriers on/off is whether a FUA/barrier/whatever is sent to the device to force it to write out the data.
I am rather sure that at least ext4 and xfs do it that way.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:14 UTC (Mon) by dlang (subscriber, #313) [Link]

No, jlokier is right: barriers are still needed to enforce ordering.

There is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later).

When the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless barriers have been issued to tell the OS 'these writes must happen before these other writes'.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:15 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

They do wait for journaled data upon journal commit, which is the place where barriers are issued anyway.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:39 UTC (Mon) by dlang (subscriber, #313) [Link]

Issuing barriers is _how_ the filesystem 'waits'.

It doesn't actually stop processing requests and wait for confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point, and goes on to process the next request and get it in flight.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:53 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Err, read the code. xfs uses IO completion callbacks and only relies on the contents of the journal after the completion has returned (xlog_sync()->xlog_bdstrat()->xfs_buf_iorequest()->_xfs_buf_ioend()).
jbd does something similar, but I don't want to look it up unless you're really interested.

It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:35 UTC (Tue) by nix (subscriber, #2304) [Link]

Well, this is clear as mud :) I guess I'd better do some code reading and figure out wtf the properties of the system actually are...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:38 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

If you want, I can give you the approximate call trace for jbd2 as well; I know it took me some time when I looked it up...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:00 UTC (Sat) by nix (subscriber, #2304) [Link]

You got that backwards. Filesystems do not become community-supported because they are chosen as a default (though if they are common, community members *are* more likely to show an interest in them). It is more that they are very unlikely ever to be chosen as a default by anyone except their originator unless they are already community-supported.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:06 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Reiserfs3 is an example of that: widely shipped, but unsupported and unsupportable by the community, leading to more stringent support guidelines for future code acceptance.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 20:33 UTC (Sat) by tytso (subscriber, #9993) [Link]

Again, you have this backwards. Ext3 was chosen in part because it was a community-supported file system. From the very beginning, ext2 and ext3 had support from a broad set of developers, at a large number of ***different*** companies. Of the original three major developers of ext2/ext3 (Remy Card, Stephen Tweedie, and myself), only Stephen worked at Red Hat. Remy was a professor at a university in France, and I was working at MIT as the technical lead for Kerberos. And there were many other people submitting contributions to ext3 and choosing to use ext3 in embedded products (including Andrew Morton, when he worked at Digeo between 2001 and 2003).

ext3 was first supported by RHEL as of RHEL 2, which was released May 2003 --- and as you can see from the dates above, we had developers working at a wide range of companies, thus making it a community-supported file system, long before Red Hat supported ext3 in their RHEL product. In contrast, most of the reiserfs developers worked at Namesys (with one or two exceptions, most notably Chris Mason when he was at SuSE), and most of the XFS developers worked at SGI.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:29 UTC (Mon) by wookey (subscriber, #5501) [Link]

I'm very surprised by the assertion that XFS is intended to be safe against power failures, as it directly contradicts my experience. I found it to be a nice filesystem with some cool features (live resizing was really cool back circa 2005/6), but I also found (twice, on different machines) that it was emphatically not designed for systems without a UPS. In both cases a power failure caused significant filesystem corruption (those machines had LVM as well as XFS).

When I managed to repair them I found that many files had big blocks of zeros in them - essentially anything that was in the journal and had not been written. Up to that point I had naively thought that the point of the journal was to keep actual data, not just filesystem metadata. Files that have been 'repaired' by being silently filled with big chunks of zeros did not impress me.

So I now believe that XFS is/was good, but only on properly UPSed servers. Am I wrong about that?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 17:03 UTC (Mon) by dlang (subscriber, #313) [Link]

For a very long time, LVM did not support barriers, which means that _any_ filesystem running on top of LVM could not be safe.

XFS caches more stuff than ext does, so a crash loses more stuff.

So XFS or ext* with barriers disabled is not good to use. For a long time, running these on top of LVM had the side effect of disabling barriers; it's only recently that LVM gained the ability to support them.

JFS is not good to use (as it doesn't have barriers at all).

Note that when XFS is said to be designed to be safe, that doesn't mean it won't lose data, just that the metadata will not be corrupt.

The only way to not lose data in a crash/power failure is to do no buffering at all, and that will absolutely kill your performance (and we are talking hundreds of times slower, not just a few percentage points).
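To make that extreme concrete, a minimal sketch with a hypothetical path: O_SYNC forces every write(2) to wait for stable storage, which is exactly the no-buffering mode described above:

    /* Sketch: fully synchronous appends. Every write waits for stable
     * storage, so nothing is lost in a crash, at a huge throughput cost
     * for small writes. "log.txt" is a hypothetical path. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Each small append now costs a full device round trip (often a
         * cache flush or FUA write) instead of landing in RAM. */
        for (int i = 0; i < 1000; i++) {
            char line[32];
            int n = snprintf(line, sizeof line, "record %d\n", i);
            if (write(fd, line, n) != n) {
                perror("write");
                break;
            }
        }
        close(fd);
        return 0;
    }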

