Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 21:01 UTC (Wed) by walex (subscriber, #69836)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by pr1268
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

As to corruption, you might want to read some nice papers by CERN presented at HEPiX on silent corruption. It happens everywhere, and it can have subtle effects. But the attitude of a legendary quote, "as far as we know we never had an undetected error" (a mainframe IT manager interviewed by Datamation many years ago), is a common position. Thanks to 'ogginfo' you have discovered just how important end-to-end arguments are.

But the main issue is not that: by all accounts 'ext4' is quite reliable (when on a properly set-up storage system and properly used by applications).

The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).

Other than that, for new "typical" systems almost only JFS and XFS make sense (and perhaps, in the distant future, BTRFS).

In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse). JFS is still probably (by a significant margin) the best "all-rounder" filesystem (XFS beats it in performance only on very parallel large workloads, and it is way more complex, and JFS has two uncommon but amazingly useful special features).

Sure, it was very convenient to let people (in particular Red Hat customers) upgrade in place from 'ext' to 'ext2' to 'ext3' to 'ext4' (each in-place upgrade keeping existing files unchanged, usually with terrible performance), but given that the Linux installed base was growing rapidly when JFS was introduced, new installations could be expected to outnumber old ones very soon, making that point largely moot.

PS: There are other little-known good filesystems, like OCFS2 (which is pretty good in non-clustered mode) and NILFS2 (probably going to be very useful on SSDs), but JFS is amazingly still very good. Reiser4 was also very promising (it seems little known that the main developer of BTRFS was also the main developer of Reiser4). As a pet peeve of mine, UDF could have been very promising too, as it is quite well suited to RW media like hard disks (and the Linux implementation almost worked in RW mode on an ordinary partition) and also to SSDs.



Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 22:07 UTC (Wed) by yokem_55 (subscriber, #10498) [Link]

I agree that both jfs and disk-based RW udf are way underrated. I use jfs on our laptop as it supposedly tends to have lower CPU usage, and thus is better for reducing power consumption. UDF, if properly supported by the kernel, would make a fantastic fs for accessing data in dual-boot situations, as Windows has pretty good support for it, and it has neither the limitations of vfat nor the need for a nasty, awful, performance-sucking hack like ntfs-3g.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 23:12 UTC (Wed) by Lennie (guest, #49641) [Link]

The main developer of ext234fs is currently a Google employee.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 0:53 UTC (Thu) by SLi (subscriber, #53131) [Link]

I very much disagree about JFS or XFS being the preferable filesystem on normal Linux use. Believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition any failures have tended to be much worse on JFS and XFS than on ext4.

The only filesystem, years back, that could have been said to outperform ext4 on most counts was ReiserFS 4. Unfortunately, on each of the three occasions I stress-tested it I hit different bugs that caused data loss.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 2:03 UTC (Thu) by dlang (subscriber, #313) [Link]

for a lot of people, ext4 is a pretty new filesystem, just now getting to the point where it has enough of a track record to trust data to.

I haven't benchmarked against ext4, but I have done benchmarks with the filesystems prior to it, and I've run into many cases where JFS and XFS are clear winners.

even against ext4, if you have a fileserver situation where you have lots of drives involved, XFS is still likely to be a win; ext4 just doesn't have enough developers/testers with large numbers of disks to work with (this isn't my opinion, it's a statement from Ted Ts'o in response to someone pointing out where ext4 doesn't do as well as XFS with a high-performance disk array)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 18:52 UTC (Fri) by walex (subscriber, #69836) [Link]

«JFS or XFS being the preferable filesystem on normal Linux use. Believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition any failures have tended to be much worse on JFS and XFS than on ext4.»

Most well-done benchmarks I have seen show mostly equivalent performance, with XFS leading the group in scalability, JFS pretty good across the field, and 'ext4', just like the previous 'ext's, being good only on totally freshly loaded filesystems (as it packs newly created files pretty densely) and when there is ample caching (no use of 'O_DIRECT'); both fresh loading and caching mask its fundamental, BSD FFS derived, downsides. It is very, very easy to do meaningless filesystem benchmarks (the vast majority that I see on LWN and most other sites are worthless).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:15 UTC (Fri) by tytso (subscriber, #9993) [Link]

One caution about JFS. JFS does not issue cache flush (i.e., barrier) requests, which (a) gives it a speed advantage over file systems that do issue cache flush commands as necessary, and (b) makes JFS unsafe against power failures. Which is most of the point of having a journal...

So benchmarking JFS against file systems that are engineered to be safe against power failures, such as ext4 and XFS, isn't particularly fair. You can disable cache flushes for both ext4 and XFS, but would you really want to run in an unsafe configuration for production servers? And JFS doesn't even have an option for enabling barrier support, so you can't make it run safely without fixing the file system code.
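
(Purely as an illustrative sketch: the "unsafe configuration" mentioned above is just a mount option away; the device names and mount points here are placeholders.)

$ mount -o barrier=0 /dev/sdb1 /srv/data      # ext4: turn off cache flushes (barriers are on by default)
$ mount -o nobarrier /dev/sdc1 /srv/scratch   # XFS of this era: the equivalent switch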

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:56 UTC (Sat) by walex (subscriber, #69836) [Link]

As to JFS and performance and barriers with XFS and ext4:

  • I mentioned JFS as a "general purpose" filesystem, for example for desktops and random servers, in the sense that it should have been the default instead of ext3 (which acquired barriers a bit late).
  • Anyhow, on production servers I personally regard battery backup as essential, as barriers and/or disabling write caching can both have a huge impact, depending on workload.
  • The speed tests I have done, seen, and trust are with barriers disabled and either battery backup present or write caching turned off, and with O_DIRECT (it is very difficult for me to like any file system test without O_DIRECT; a minimal sketch follows this list). I think these are fair conditions.
  • Part of the reason why barriers were added to ext3 (and at least initially they had horrible performance) and not to JFS is that ext3 was chosen as the default filesystem and thus became community-supported, and JFS did not.
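
A minimal sketch of the sort of O_DIRECT test meant above (the file and mount point are placeholders; a real run would more likely use a tool like fio with --direct=1 than dd):

$ dd if=/dev/zero of=/mnt/test/direct.bin bs=1M count=4096 oflag=direct
$ dd if=/mnt/test/direct.bin of=/dev/null bs=1M iflag=direct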

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 1:56 UTC (Sat) by dlang (subscriber, #313) [Link]

battery backup does not make disabling barriers safe. without barriers, stuff leaves RAM to be sent to the disk at unpredictable times, and so if you lose the contents of RAM (power off, reboot, hang, etc) you can end up with garbage on your disk as a result, even if you have a battery-backed disk controller.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 3:06 UTC (Sat) by raven667 (subscriber, #5198) [Link]

I'm pretty sure, in this context, the OP was talking about battery-backed write-cache RAM on the disk controller.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 6:29 UTC (Sat) by dlang (subscriber, #313) [Link]

that's what I think as well, and my point is that having battery backed ram on the controller does not make it safe to disable barriers.

it should make barriers very fast so there isn't a big performance hit from leaving them on, but if you disable barriers and think the battery will save you, you are sadly mistaken

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:05 UTC (Sat) by nix (subscriber, #2304) [Link]

Really? In that case there's an awful lot of documentation out there that needs rewriting. I was fairly sure that the raison d'etre of battery backup was 1) to make RAID-[56] work in the presence of power failures without data loss, and 2) to eliminate the need to force-flush to disk to ensure data integrity, ever, except if you think your power will be off for so very long that the battery will run out.

If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 15:39 UTC (Sat) by dlang (subscriber, #313) [Link]

there are two stages to writing things to a raid array

1. writing from the OS to the raid card

2. writing from the raid card to the drives

battery backup on the raid card makes step 2 reliable. this means that if the data is written to the raid card it should be considered as safe as if it was on the actual drives (it's not quite that safe, but close enough)

However, without barriers, the data isn't sent from the OS to the raid card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the raid controller and other data not making it to the raid controller if you have an unclean shutdown. If the data is never sent to the raid controller, then the battery there can't do you any good.

With barriers, the system can enforce that data gets to the raid controller in a particular order, and so the only data that would be lost is the data written since the last barrier operation was completed.

note that if you are using software raid, things are much uglier as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). this is one of the places where hardware raid is significantly more robust than software raid.
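
As an aside, and only as a sketch (the device name is a placeholder): the volatile write cache on a plain drive can be inspected and toggled with hdparm, while battery-backed caches on RAID controllers are managed through the vendor's own tools.

$ sudo hdparm -W /dev/sda       # report whether the drive's write cache is enabled
$ sudo hdparm -W0 /dev/sda      # turn it off (at a considerable performance cost)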

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:04 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Maybe I'm wrong but I don't think it works that way. Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed. So I think it already works the way you want; filesystems already manage their writes and caching afaik.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:31 UTC (Sat) by dlang (subscriber, #313) [Link]

I'm not quite sure which part of my statement you are disagreeing with

barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order

with barriers on software raid, the raid layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:17 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I guess I was under the impression, incorrect as it may be, that the concept of write barriers was already baked into most responsible filesystems, but that the support for working through LVM was recent (in the last 5 years), and the support for actually issuing the right commands to the storage and having the storage respect them was also more recent. Maybe I'm wrong and barriers as a concept are newer.

In any event there is a bright line between how the kernel handles internal data structures and what the hardware does, and for storage with a battery-backed write cache, once an IO is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only requirement is that the kernel issue the IO requests in a responsible manner.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:41 UTC (Sun) by dlang (subscriber, #313) [Link]

barriers as a concept are not new, but your assumption that filesystems support them is the issue.

per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.

so barriers actually working correctly is relatively new (and very recently they have found more efficient ways to enforce ordering than the older version of barriers).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:24 UTC (Sun) by tytso (subscriber, #9993) [Link]

JFS still to this day does not issue barriers / cache flushes.

It shouldn't be that hard to add support, but no one is doing any development work on it.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:26 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

JFS has never been default in Fedora.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:50 UTC (Sun) by dlang (subscriber, #313) [Link]

I didn't think that I ever implied that it was.

Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:41 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

It appears you did

"JFS does not, for a long time (even after it was the default in Fedora)"

Your claim about the installer is inaccurate as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:22 UTC (Sun) by dlang (subscriber, #313) [Link]

re: JFS, oops, I don't know what I was thinking when I typed that.

re: XFS, I've been using linux since '94, so XFS support in the installer is very recent :-)

I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:53 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

"Very recent" is relative and not quite so accurate either. All versions of Fedora installer have supported XFS. You just had to pass "xfs" as a installer option. Same with jfs or reiserfs. Atleast Fedora 10 beta onwards supports XFS as a standard option without having to do anything

http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...

That is early 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well, IIRC.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:15 UTC (Mon) by wookey (subscriber, #5501) [Link]

I think dlang meant this:
"..., for a long time (even after it was the default in Fedora), LVM did not"

(I parsed it the way rahulsundaram did too - it's not clear).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:59 UTC (Mon) by dlang (subscriber, #313) [Link]

yes, now that you say that, it reminds me that I meant that for a long time after LVM was the default on Fedora, it didn't support barriers.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 30, 2012 8:50 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Old thread, I know. But why people are still talking about barriers I'm not sure. Abandoning the use of barriers was agreed upon at the 2010 Linux Filesystem Summit. And they completed their departure in 2.6.37, IIRC. Barriers are no more. They don't matter. They've been replaced by FUA, etc.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 17:54 UTC (Thu) by nye (guest, #51576) [Link]

> Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed

Surely what you're describing is a cache flush, not a barrier?

A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (ie. you're getting durability of the first write), with a corresponding performance impact.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 23:24 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think you are right; I may have misspoken.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:01 UTC (Mon) by jlokier (guest, #52227) [Link]

I believe dlang is right. You need to enable barriers even with battery-backed disk write cache. If the storage device has a good implementation, the cache flush requests (used to implement barriers) will be low overhead.

Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.

Even ignoring ext3's no-barrier default, and LVM missing them for ages, there is the kernel I/O queue (elevator), which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, recovery will find inconsistent data on the storage unit. This can happen even after a normal crash such as a kernel panic or hard reboot, no power loss required.

Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)
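
A quick sketch of doing just that on a running system (ext3/ext4 option syntax; the mount point is a placeholder, and XFS of this era spells the same knob 'barrier'/'nobarrier'):

$ mount -o remount,barrier=1 /mnt/data
$ grep /mnt/data /proc/mounts        # see which options the kernel reports as active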

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 15:40 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

I think these days any sensible fs actually waits for the writes to reach storage independent of barrier usage. The only difference with barriers on/off is whether a FUA/barrier/whatever is sent to the device to force the device to write out the data.
I am rather sure at least ext4 and xfs do it that way.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:14 UTC (Mon) by dlang (subscriber, #313) [Link]

no, jlokier is right, barriers are still needed to enforce ordering

there is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)

when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:15 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

They do wait for journaled data upon journal commit, which is the place where barriers are issued anyway.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:39 UTC (Mon) by dlang (subscriber, #313) [Link]

issuing barriers is _how_ the filesystem 'waits'

it doesn't actually stop processing requests and wait for the confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point and goes on to process the next request and get it in flight.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:53 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Err. Read the code. xfs uses io completion callbacks and only relies on the contents of the journal after the completion returned (xlog_sync()->xlog_bdstrat()->xfs_buf_iorequest()->_xfs_buf_ioend()).
jbd does something similar, but I don't want to look it up unless you're really interested.

It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:35 UTC (Tue) by nix (subscriber, #2304) [Link]

Well, this is clear as mud :) guess I'd better do some code reading and figure out wtf the properties of the system actually are...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:38 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

If you want I can give you the approx calltrace for jbd2 as well, I know it took me some time when I looked it up...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:00 UTC (Sat) by nix (subscriber, #2304) [Link]

You got that backwards. Filesystems do not become community-supported because they are chosen as a default (though if they are common, community members *are* more likely to show an interest in them). It is more that they are very unlikely ever to be chosen as a default by anyone except their originator unless they are already community-supported.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:06 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Reiserfs3 being an example of that: it was widely shipped but unsupported and unsupportable by the community, which led to more stringent support guidelines for future code acceptance.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 20:33 UTC (Sat) by tytso (subscriber, #9993) [Link]

Again, you have this backwards. Ext3 was chosen in part because it was a community-supported file system. From the very beginning, ext2 and ext3 had support from a broad set of developers, at a large number of ***different*** companies. Of the original three major developers of ext2/ext3 (Remy Card, Stephen Tweedie, and myself), only Stephen worked at Red Hat. Remy was a professor at a university in France, and I was working at MIT as the technical lead for Kerberos. And there were many other people submitting contributions to ext3 and choosing to use ext3 in embedded products (including Andrew Morton, when he worked at Digeo between 2001 and 2003).

ext3 was first supported by RHEL as of RHEL 2, which was released in May 2003 --- and as you can see from the dates above, we had developers working at a wide range of companies, thus making it a community-supported file system, long before Red Hat supported ext3 in their RHEL product. In contrast, most of the reiserfs developers worked at Namesys (with one or two exceptions, most notably Chris Mason when he was at SuSE), and most of the XFS developers worked at SGI.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:29 UTC (Mon) by wookey (subscriber, #5501) [Link]

I'm very surprised by the assertion that XFS is intended to be safe against power failures, as it directly contradicts my experience. I found it to be a nice filesystem with some cool features (live resizing was really cool back circa 2005/6), but I also found (twice, on different machines) that it was emphatically not designed for systems without a UPS. In both cases a power failure caused significant filesystem corruption (those machines had LVM as well as XFS).

When I managed to repair them I found that many files had big blocks of zeros in them - essentially anything that was in the journal and had not been written. Up to that point I had naively thought that the point of the journal was to keep actual data, not just filesystem metadata. Files that have been 'repaired' by being silently filled with big chunks of zeros did not impress me.

So I now believe that XFS is/was good, but only on properly UPSed servers. Am I wrong about that?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 17:03 UTC (Mon) by dlang (subscriber, #313) [Link]

for a very long time, LVM did not support barriers, which means that _any_ filesystem running on top of LVM could not be safe.

XFS caches more stuff than ext does, so a crash loses more stuff.

so XFS or ext* with barriers disabled is not good to use. For a long time, running these things on top of LVM had the side effect of disabling barriers; it's only recently that LVM gained the ability to support them.

JFS is not good to use (as it doesn't have barriers at all)

note that while XFS is designed to be safe, that doesn't mean that it won't lose data, just that the metadata will not be corrupt.

the only way to not lose data in a crash/power failure is to do no buffering at all, and that will absolutely kill your performance (and we are talking hundreds of times slower, not just a few percentage points)
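
For illustration only (placeholder mount point): the 'sync' mount option is roughly the "no buffering at all" configuration being described, and it demonstrates that kind of slowdown.

$ mount -o remount,sync /mnt/data    # every write becomes synchronous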

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 2:58 UTC (Thu) by tytso (subscriber, #9993) [Link]

The main reason JFS wasn't accepted in the community was that all of the developers worked at IBM. Very few people in the other distributions understood it, which meant that there weren't people who could support it at the distros. One of the things that I've always been very happy about is the fact that developers for ext2/3/4 come from many, many different companies.

JFS was a very good file system, and at the time when it was released, it certainly was better than ext3. But there's a lot more to having a successful open source project beyond having the best technology. The fact that ext2 was well understood, and had a mature set of file system utilities, including tools like "debugfs", is one of the things that make a huge difference towards people accepting the technology.

At this point, though, ext4 has a number of features which JFS lacks, including delayed allocation, fallocate, punch, and TRIM/discard support. These are all features which I'm sure JFS would have developed if it still had a development community, but when IBM decided to defund the project, there were few or no developers who were not IBM'ers, and so the project stalled out.

---

People who upgrade in place from ext3 to ext4 will see roughly half the performance increase compared to doing a backup, reformat to ext4, and restore operation. But they *do* see a performance increase if they do an upgrade-in-place operation. In fact, even if they don't upgrade the file system image, and use ext4 to mount an ext2 file system image, they will see some performance improvement. So this gives them flexibility, which from a system administrator's point of view, is very, very important!
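
For reference, the commonly documented in-place conversion is a sketch like the following (device and mount point are placeholders; note that files written before the switch keep their old indirect-block mapping, which is where the "roughly half" figure above comes from):

$ umount /srv/data
$ tune2fs -O extents,uninit_bg,dir_index /dev/sdb1
$ e2fsck -fD /dev/sdb1
$ mount -t ext4 /dev/sdb1 /srv/data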

---

Finally, I find it interesting that you consider OCFS2 "pretty good" in non-clustered mode. OCFS2 is a fork of the ext3 code base[1] (it even uses fs/jbd and now fs/jbd2) with support added for clustered operation, and with support for extents (which ext4 has as well, of course). It doesn't have delayed allocation. But ext4 will be better than ocfs2 in non-clustered mode, simply because it's been optimized for it. The fact that you seem to think OCFS2 is "pretty good", while you don't seem to think much of ext4, makes me wonder if you have some pretty strong biases against the ext[234] file system family.

[1] Ocfs2progs is also a fork of e2fsprogs. Which they did with my blessing, BTW. I'm glad to see that the code that has come out of the ext[234] project has been useful in so many places. Heck, parts of e2fsprogs (the UUID library, which I relicensed to BSD for Apple's benefit) can be found in Mac OS X! :-)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 20:25 UTC (Thu) by sniper (guest, #13219) [Link]

Small correction.

ocfs2 is not a fork of ext3 and neither is ocfs2-tools a fork of e2fsprogs. But both have benefited a _lot_ from ext3. In some instances, we copied code (non-indexed dir layout). In some instances, we used a different approach because of collective experience (indexed dir). grep ext3 fs/ocfs2/* for more.

The toolset has a lot more similarities to e2fsprogs. It was modeled after it because e2fsprogs is well designed, and also to allow admins to quickly learn it. The tools even use the same parameter names where possible. grep -r e2fsprogs * for more.

BTW, ocfs2 has had bigalloc (aka clusters) since day 1, inline-data since 2.6.24 and metadata checksums since 2.6.29. Yes, it does not have delayed allocations.
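
For comparison, a sketch of how the cluster size is chosen at mkfs time on each (device names are placeholders; ext4's bigalloc requires e2fsprogs 1.42 or newer):

$ mkfs.ocfs2 -b 4K -C 64K /dev/sdc1           # ocfs2: 4K blocks, 64K clusters
$ mkfs.ext4 -O bigalloc -C 65536 /dev/sdd1    # ext4 bigalloc: 64K clusters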

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Apr 13, 2012 19:30 UTC (Fri) by fragmede (guest, #50925) [Link]

OCFS2 does have snapshots though, which is why I use it. :)

LVM snapshots are a joke if you have *lots* of snapshots, though I haven't looked at btrfs snapshots since it became production ready.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 3:22 UTC (Thu) by tytso (subscriber, #9993) [Link]

One other thought. At least at the beginning, ext4's raison d'être (its reason for being) was as a stopgap file system until btrfs could be ready. We started with the ext3 code, which was proven, solid code, and support for delayed allocation, multiblock allocation, and extents had also been in use for quite a while in Clusterfs's Lustre product. So that code wasn't exactly new, either. What I did was integrate Clusterfs's contributions, and then work on stabilizing them so that we would have something better than ext3 ready in the short term.

At the time when I started working on ext4, XFS developers were all mostly still working for SGI, so there was a similar problem with the distributions not having anyone who could support or debug XFS problems. This has changed more recently, as more and more XFS developers have left SGI (voluntarily or involuntarily) and joined companies such as Red Hat. XFS has also improved its small file performance, which was something it didn't do particularly well simply because SGI didn't optimize for that; its sweet spot was and still is really large files on huge RAID arrays.

One of the reasons why I felt it was necessary to work on ext4 was that everyone I talked to who had created a file system before in the industry, whether it was GPFS (IBM's cluster file system), or Digital Unix's advfs, or Sun's ZFS, gave estimates of somewhere between 50 and 200 person-years of effort before the file system was "ready". Even if we assume that open source development practices would make development go twice as fast, and if we ignore the high end of the range because cluster file systems are hard, I was skeptical it would get done in two years (which was the original estimate) given the number of developers it was likely to attract. Given that btrfs started at the beginning of 2007, and here we are almost at 2012, I'd say my fears were justified.

At this point, I'm actually finding that ext4 has found a second life as a server file system in large cloud data centers. It turns out that if you don't need the fancy-schmancy features that copy-on-write file systems give you, they aren't free. In particular, ZFS has a truly prodigious appetite for memory, and one of the things about cloud servers is that in order for them to make economic sense, you try to pack as many jobs or VMs onto them as possible, so they are constantly under memory pressure. We've done some further optimizations so that ext4 performs much better when under memory pressure, and I suspect at this point that in a cloud setting, using a CoW file system may simply not make sense.

Once btrfs is ready for some serious benchmarking, it would be interesting to benchmark it under serious memory pressure, and see how well it performs. Previous CoW file systems, such as BSD's lfs two decades ago, and ZFS more recently, have needed a lot of memory to cache metadata blocks, and it will be interesting to see if btrfs has similar issues.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 19:36 UTC (Thu) by nix (subscriber, #2304) [Link]

I shouldn't respond to this troll-bait, but nonetheless...
> The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Interesting. tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.

I also see that I was making some sort of horrible mistake by installing ext4 on all my newer systems, but you never make clear what that mistake might have been.

> In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse).
Ah, yeah. Because stable kernels, USB support, mentoring newbies, the driver core, -staging... all these things were bad.

I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 19:41 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.
Also, uhm. Didn't he work for Suse?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 11:35 UTC (Fri) by alankila (guest, #47141) [Link]

I dimly recall that the animosity originated from the work on udev and the removal of devfs. Since I personally don't care one bit about this issue, I have a hard time now reconstructing the relevant arguments, but my guess is that some people really hate the idea that a system needs more than just a kernel to be useful.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 18:40 UTC (Fri) by nix (subscriber, #2304) [Link]

udev is prone to inducing frothing at the mouth even in otherwise reasonable people, due to the udev authors' patent lack of concern for backward compatibility. Twice now they've broken existing systems without so much as a by-your-leave: first with the massive migration of all system-provided state out of /etc/udev.d/rules into /lib/udev/rules (what, you customized them? sucks to be you, now you have to customize them before *building* udev), and more recently with the abrupt movement of /sbin/udevd into /lib/udev without even leaving behind a symlink! Oh, you were starting that at bootup and relying on it to be there? Sorry, we just broke your bootup, your own fault for not reading the release notes! Hope you don't need to downgrade!

(Yes, I read the release notes, so didn't fall into these traps, but FFS, at least the latter problem was trivial to work around -- one line in the makefile to drop a symlink in /sbin -- and they just didn't bother.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:40 UTC (Fri) by walex (subscriber, #69836) [Link]

As to udev: some people dislike smarmy shysters who replace well-designed, working subsystems seemingly for the sole reason of making a political land grab, given that the replacement has both more kernel complexity and more userland complexity, and less stability.

The key features of devfs were that it would automatically populate /dev from the kernel with basic device files (major, minor) and then use a very simple userland daemon to add extra aliases as required.

It turns out that, after several attempts to get it to work, the kernel ends up adding to /sys exactly the same information for udev's benefit, so there has been no migration of functionality from kernel to userspace:

$ ls -ld /dev/tty9
crw--w---- 1 root tty 4, 9 2011-11-28 14:03 /dev/tty9
$ cat /sys/class/tty/tty9/dev
4:9

And the userland part is also far more complex and unstable than devfsd ever was (for example devfs did not require cold start).

And udev is just the most shining example of a series of similar poor decisions (which however seem to have been improving a bit with time).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 3:16 UTC (Sat) by raven667 (subscriber, #5198) [Link]

I'm not sure that is an accurate portrayal of what happened, on this planet at least. My recollection from the time is that there were fundamental technical problems with the devfs implementation, which is why it was redone as udev. I think those problems were some inherent race conditions on device add/removal, plus concerns about how much policy about /dev file names, permissions, etc. was hard-coded into the kernel and unmodifiable by an end user or sysadmin. That is just my recollection.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:07 UTC (Sat) by nix (subscriber, #2304) [Link]

The latter is doubly ironic now that udev forbids you from changing the names given to devices by the kernel. (You can introduce new names, but you can't change the kernel's anymore.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 4:04 UTC (Sat) by alankila (guest, #47141) [Link]

To your specific example: obviously the kernel is going to have some kind of (generated) name for a device, and to know the major/minor number pair, which is the very thing that facilitates the communication between userspace and kernel... But udev is still controlling things like permissions and aliases for those devices where necessary.
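
As an illustration only (the rules file name and symlink are made up), a local udev rule in the same spirit as the /sys example above, setting permissions and adding an alias:

$ cat /etc/udev/rules.d/99-local.rules
KERNEL=="tty9", GROUP="tty", MODE="0620", SYMLINK+="myconsole9"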

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:12 UTC (Sat) by walex (subscriber, #69836) [Link]

«tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle. »

Quite irrelevant: a lot of file systems were somebody's hobby file systems, but they did not achieve prominence and instant integration into mainline while still rather alpha, and Red Hat did not spend enormous amounts of resources on quality assurance to make them production-ready either; quality assurance is a pretty vital detail for file systems, as the Namesys people discovered.

Pointing to tytso is just misleading, also because ext4 really was seeded by the Lustre people before tytso became active on it in his role as ext3 curator (and in 2005, which is 5 years after JFS became available).

Similarly for BTRFS: it was initiated by Oracle (who have an ext3 installed base), but its main appeal is still as the next in-place upgrade for the Red Hat installed base (thus the interest in trialing it in Fedora, where EL candidate stuff is mass tested), even if for once it is not just an extension of the ext line but has some interesting new angles.

But considering ext4 on its own is a partial view; one must consider the pre-existing JFS and XFS stability, robustness, and performance. From a technical point of view ext4 is not that interesting (euphemism) and its sole appeal is in-place upgrades, and the widest installed base for that is Red Hat; to a large extent that could have been said of ext3 too.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:52 UTC (Sat) by nix (subscriber, #2304) [Link]

So you're blaming the Lustre people now? You do realise Lustre is not owned by Red Hat, and never was?

And if you're claiming that btrfs is effectively RH-controlled merely because RH customers will benefit, then *everything* that happens to Linux must by your bizarre definition be RH-controlled. That's a hell of a conspiracy: so vague that the coconspirators don't even realise they're conspiring!

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Apr 13, 2012 19:34 UTC (Fri) by fragmede (guest, #50925) [Link]

I thought *Oracle* was a/the big contributor to btrfs...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:45 UTC (Sat) by tytso (subscriber, #9993) [Link]

Sure, and I've always been careful to give the Lustre folk credit for the work that they did between 2003 and 2006 extending ext3 to add support for delayed allocation (which JFS didn't have), multi-block allocation (which JFS didn't have) and extents (OK, JFS had extents).

But you can't have it both ways. If that code had been in use by paying Lustre companies, then it's hardly alpha code, wouldn't you agree?

And why did the Lustre developers at Clusterfs choose ext3? Because the engineers they hired knew ext3, since it was a community-supported file system, whereas JFS was controlled by a core team that was all IBM'ers, and hardly anyone outside of IBM was available who knew JFS really well.

But as others have already pointed out, there was no grand conspiracy to pick ext2/3/4 over its competition. It won partially due to its installed base, and partially because of the availability of developers who understood it (and books written about it, etc., etc., etc.). The way you've been writing, you seem to think there was some secret cabal (at Red Hat?) that made these decisions, and that there was a "mistake" because they didn't choose your favorite file systems.

The reality is that file systems all have trade-offs, and what's good for some people is not so great for others. Take a look at some of the benchmarks at btrfs.boxacle.net; they're a bit old, but they are well done, and they show that across many different workloads at that time (2-3 years ago) there was no one single file system that was the best across all of the different workloads. So anyone who uses only a single workload, or a single hardware configuration, and tries to use that to prove that their favorite file system is the "best" is trying to sell you something, or is a slashdot kiddie with a fan-favorite file system. The reality is a lot more complicated than that, and it's not just about performance. (Truth be told, for many/most use cases, the file system is not the bottleneck.) Issues like the availability of engineers to support the file system in a commercial product, the maturity of the userspace support tools, ease of maintainability, etc. are at least as important if not more so.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 20:43 UTC (Sat) by dlang (subscriber, #313) [Link]

at the time ext3 became the standard, JFS and XFS had little support (single vendor) and were both 'glued on' to linux with heavy compatibility layers.

Add to this the fact that you did not need to reformat your system to use ext3 when upgrading, and the fact that ext3 became the standard (taking over from ext2, which was the prior standard) is a no-brainer, and no conspiracy.

In those days XFS would outperform ext3, but only in benchmarks on massive disk arrays (which were even more out of people's price ranges at that point than they are today)

XFS was scalable to high-end systems, but its low-end performance was mediocre

looking at things nowadays, XFS has had a lot of continuous improvement and integration, both improving its high-end performance and reliability, and improving its low-end performance without losing its scalability. There are also more people, working for more companies, supporting it, making it far less of a risk today, with far more in the way of upsides.

JFS has received very little attention after the initial code dump from IBM, and there is now nobody actively maintaining/improving it, so it really isn't a good choice going forward.

reiserfs had some interesting features and performance, but it suffered from some seriously questionable benchmarking (the one that turned me off to it entirely was a spectacular benchmarking test that reiserfs completed in 20 seconds but that took several minutes on ext*; then we discovered that reiserfs defaulted to a 30-second delay before writing everything to disk, so the entire benchmark was complete before any data started getting written to disk, and after that I didn't trust anything they claimed), and a few major problems (the fsck scrambling is a huge one). It was then abandoned by the developer in favor of the future reiserfs4, with improvements that were submitted being rejected as they were going to be part of the new, incompatible filesystem.

ext4 is in large part a new filesystem whose name just happens to be similar to what people are running, but it has now been out for several years, with developers who are responsive to issues, are a diverse set (no vendor lock-in or dependencies) and are willing to say where the filesystem is not the best choice.

btrfs is still under development (the fact that they don't yet have a fsck tool is telling), is making claims that seem too good to be true, and has already run into several cases of pathological behavior where things had to be modified significantly. I wouldn't trust it for anything other than non-critical personal use for another several years.

as a result, I am currently using XFS for the most part, but once I get a chance to do another round of testing, ext4 will probably join it. I have a number of systems that have significant numbers of disks, so XFS will probably remain in use.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 1:12 UTC (Sun) by nix (subscriber, #2304) [Link]

> ext4 is in large part a new filesystem whose name just happens to be similar to what people are running
ext4 is ext3 with a bunch of new extensions (some incompatible): indeed, initially the ext4 work was going to be done to ext3, until Linus asked for it to be done in a newly-named clone of the code instead. It says a lot for the ext2 code and disk formats that they've been evolvable to this degree.

