Improving ext4: bigalloc, inline data, and metadata checksums
Posted Dec 3, 2011 1:56 UTC (Sat) by dlang (guest, #313)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by walex
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums
Posted Dec 3, 2011 3:06 UTC (Sat) by raven667 (subscriber, #5198)
Posted Dec 3, 2011 6:29 UTC (Sat) by dlang (guest, #313)
it should make barriers very fast so there isn't a big performance hit from leaving them on, but if you disable barriers and think the battery will save you, you are sadly mistaken
Posted Dec 3, 2011 11:05 UTC (Sat) by nix (subscriber, #2304)
If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?
Posted Dec 3, 2011 15:39 UTC (Sat) by dlang (guest, #313)
there are two steps involved in getting data safely onto disk:
1. writing from the OS to the raid card
2. writing from the raid card to the drives
battery backup on the raid card makes step 2 reliable. this means that if the data is written to the raid card it should be considered as safe as if it was on the actual drives (it's not quite that safe, but close enough)
However, without barriers, the data isn't sent from the OS to the raid card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the raid controller and other data not making it to the raid controller if you have an unclean shutdown. If the data is never sent to the raid controller, then the battery there can't do you any good.
With barriers, the system can enforce that data gets to the raid controller in a particular order, and so the only data that would be lost is the data written since the last barrier operation completed.
note that if you are using software raid, things are much uglier as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). this is one of the places where hardware raid is significantly more robust than software raid.
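A minimal userspace sketch of step 1 in the list above, assuming an ordinary file that lives on the array (the path and payload are invented for illustration): write() only lands the data in the OS page cache, and it is the explicit fdatasync() that pushes it down to the raid controller, where the battery can finally protect it.

    /* Sketch: data has to actually reach the raid controller before its
     * battery can help.  write() only copies into the OS page cache; the
     * fdatasync() is what forces the data down the stack (step 1). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "important record\n";
        int fd = open("/srv/data/record.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
            perror("write");        /* the data may still be only in RAM here */
            return 1;
        }
        if (fdatasync(fd) < 0) {    /* push it to the controller (and its battery) */
            perror("fdatasync");
            return 1;
        }
        close(fd);
        return 0;
    }

With a battery-backed controller the fdatasync() should return quickly, since the controller can acknowledge the write as soon as it is in its protected cache.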
Posted Dec 3, 2011 18:04 UTC (Sat) by raven667 (subscriber, #5198)
Posted Dec 3, 2011 19:31 UTC (Sat) by dlang (guest, #313)
barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order
with barriers on software raid, the raid layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2
Posted Dec 4, 2011 6:17 UTC (Sun) by raven667 (subscriber, #5198)
In any event, there is a bright line between how the kernel handles its internal data structures and what the hardware does. For storage with a battery-backed write cache, once an IO is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only requirement is that the kernel issue the IO requests in a responsible manner.
Posted Dec 4, 2011 6:41 UTC (Sun) by dlang (guest, #313)
per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.
so barriers actually working correctly is relatively new (and very recently they have found more efficient ways to enforce ordering than the older version of barriers).
Posted Dec 4, 2011 11:24 UTC (Sun) by tytso (subscriber, #9993)
It shouldn't be that hard to add support, but no one is doing any development work on it.
Posted Dec 4, 2011 16:26 UTC (Sun) by rahulsundaram (subscriber, #21946)
Posted Dec 4, 2011 16:50 UTC (Sun) by dlang (guest, #313)
Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.
Posted Dec 4, 2011 17:41 UTC (Sun) by rahulsundaram (subscriber, #21946)
"JFS does not, for a long time (even after it was the default in Fedora)"
You are inaccurate about your claim on the installer as well. XFS is a standard option in Fedora for several releases ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop Ext4). JFS is a non-standard option.
Posted Dec 4, 2011 19:22 UTC (Sun) by dlang (guest, #313)
re: XFS, I've been using linux since '94, so XFS support in the installer is very recent :-)
I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.
Posted Dec 4, 2011 19:53 UTC (Sun) by rahulsundaram (subscriber, #21946)
http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...
That is early 2008. RHEL 6 has XFS support as an add-on subscription and it is supported within the installer as well, IIRC.
Posted Dec 5, 2011 16:15 UTC (Mon) by wookey (guest, #5501)
"..., for a long time (even after it was the default in Fedora), LVM did not"
(I parsed it the way rahulsundaram did too - it's not clear).
Posted Dec 5, 2011 16:59 UTC (Mon) by dlang (guest, #313)
Posted Jan 30, 2012 8:50 UTC (Mon) by sbergman27 (guest, #10767)
Posted Dec 8, 2011 17:54 UTC (Thu) by nye (subscriber, #51576)
Surely what you're describing is a cache flush, not a barrier?
A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (i.e. you're getting durability of the first write), with a corresponding performance impact.
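To make that ordering-versus-durability distinction concrete, here is a small sketch, assuming a plain file and using fdatasync() as the only portable userspace primitive available: the application only needs "A reaches disk before B", but the flush also makes A durable immediately, which is the extra cost described above.

    /* Sketch: getting "write A before write B" from userspace.  There is no
     * pure ordering primitive available to applications, so a full flush is
     * used between the writes -- which also buys durability of A, slightly
     * more than plain ordering requires. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int ordered_writes(int fd, const char *a, const char *b)
    {
        if (write(fd, a, strlen(a)) < 0)
            return -1;
        if (fdatasync(fd) < 0)      /* flush: ordering point, plus durability of A */
            return -1;
        return write(fd, b, strlen(b)) < 0 ? -1 : 0;
    }

    int main(void)
    {
        int fd = open("ordered.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ordered_writes(fd, "record A\n", "record B\n") < 0)
            perror("ordered_writes");
        close(fd);
        return 0;
    }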
Posted Dec 8, 2011 23:24 UTC (Thu) by raven667 (subscriber, #5198)
Posted Dec 12, 2011 12:01 UTC (Mon) by jlokier (guest, #52227)
Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.
Even ignoring ext3's no-barrier default, and LVM missing barriers for ages, there is the kernel I/O queue (the elevator), which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, recovery will find inconsistent data on the storage unit. This can happen even after a normal crash such as a kernel panic or hard reboot, no power loss required.
Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)
Posted Dec 12, 2011 15:40 UTC (Mon) by andresfreund (subscriber, #69562)
I am rather sure at least ext4 and xfs do it that way.
Posted Dec 12, 2011 18:14 UTC (Mon) by dlang (guest, #313)
there is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)
when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'
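The classic case of ordering that the OS cannot guess is a journal commit. Below is a toy sketch of that pattern, assuming an application-level log file (this is not how jbd2 or any real filesystem is coded): the transaction record must be stable before the commit record that refers to it, so a flush sits between the two writes.

    /* Toy journal-commit pattern: the journal entry must reach stable storage
     * before the commit record, otherwise replay after a crash could apply a
     * half-written transaction.  The flush in append_and_flush() is the
     * userspace stand-in for the barrier/flush a filesystem would issue. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int append_and_flush(int fd, const char *buf)
    {
        size_t len = strlen(buf);
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        /* fdatasync() returns only once the data is on stable storage, so any
         * write issued after it cannot beat this one to the disk. */
        return fdatasync(fd);
    }

    int main(void)
    {
        int jfd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (jfd < 0) { perror("open"); return 1; }

        if (append_and_flush(jfd, "TXN 42: update inode 1234\n") < 0 ||
            append_and_flush(jfd, "COMMIT 42\n") < 0)
            perror("journal write");

        close(jfd);
        return 0;
    }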
Posted Dec 12, 2011 18:15 UTC (Mon) by andresfreund (subscriber, #69562)
Posted Dec 12, 2011 18:39 UTC (Mon) by dlang (guest, #313)
it actually doesn't stop processing requests and wait for the confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point and goes on to process the next request and get it in flight.
Posted Dec 12, 2011 18:53 UTC (Thu) by andresfreund (subscriber, #69562)
It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.
jbd does something similar but I don't want to look it up unless you're really interested.
Posted Dec 13, 2011 13:35 UTC (Tue) by nix (subscriber, #2304)
Posted Dec 13, 2011 13:38 UTC (Tue) by andresfreund (subscriber, #69562)
"..., for a long time (even after it was the default in Fedora), LVM did not"