
Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)


Posted Apr 26, 2006 0:31 UTC (Wed) by sbergman27 (subscriber, #10767)
In reply to: Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration) by dwheeler
Parent article: Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

What this article leaves out... what most *every* fs benchmark article leaves out is the fact that EXT3 gives you a much greater level of journalling protection than the others by default. By default, EXT3 gives you metadata journalling and ordered writes of metadata and data. The others only give you metadata journalling by default. And, of the others, only reiserfs gives you the option of anything higher.

XFS, JFS, EXT3, and Reiserfs all offer plain metadata journalling. This level basically gives you no more protection than a nonjournalling filesystem beyond the fact that you don't have to run fsck after a power failure. After an unclean shutdown, you are guaranteed that the filesystem structure will be intact, but some of your file data may be garbage. This is the default for XFS, JFS, and Reiserfs.

Reiserfs and EXT3 offer full data journalling. In general, this is the slowest of the journalling levels because you are writing the data twice: once to the journal (which is fast because the journal is always streamed out sequentially) and once to the ultimate location. It gives you the greatest level of protection. In fact, it gives the same level of protection as mounting the filesystem synchronously. If the application was told that the data was written, it's as good as written. After a power failure, the filesystem structure is guaranteed to be intact, and so is the file data. None of the tested filesystems default to this level.

EXT3 offers an intermediate option called "ordered" mode. It is basically metadata journalling, but adds a constraint that the writes to disk will be ordered in such a way that it gives an extra guarantee. Not only is the filesystem structure guaranteed to be intact, but the data is also guaranteed to be intact. It may not be the *latest* data that the app was told was written to disk. But it is guaranteed not to be garbage; i.e. it may be the data that was there before the write request. There is a significant performance penalty, but not as great as for full data journalling. This is (wisely in my opinion) EXT3's default.
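To make the difference between plain metadata journalling and ordered mode concrete, here is a toy Python model of appending to a file. It is only a sketch under simplifying assumptions: the names `Mode`, `steps_for_append` and `file_after_crash` are made up for illustration, an append is reduced to two disk operations, and the "writeback" ordering shown is just the unlucky case, not what any real filesystem always does.

```python
from enum import Enum

GARBAGE = "<stale block contents>"


class Mode(Enum):
    WRITEBACK = "writeback"   # plain metadata journalling
    ORDERED = "ordered"       # data forced out before the metadata commit


def steps_for_append(mode):
    """Return the order in which the two disk operations reach the disk."""
    if mode is Mode.ORDERED:
        return ["write_data", "commit_metadata"]
    # Without the ordering constraint either order is possible;
    # model the unlucky one where metadata lands first.
    return ["commit_metadata", "write_data"]


def file_after_crash(mode, old, new, steps_completed):
    """Contents a reader sees if power fails after `steps_completed` steps."""
    data_on_disk = False
    metadata_committed = False
    for step in steps_for_append(mode)[:steps_completed]:
        if step == "write_data":
            data_on_disk = True
        else:
            metadata_committed = True
    if not metadata_committed:
        return old  # the file still has its old length
    return old + (new if data_on_disk else GARBAGE)


if __name__ == "__main__":
    for mode in Mode:
        for crashed_after in range(3):
            print(mode.value, crashed_after,
                  file_after_crash(mode, "old ", "new", crashed_after))
```

In this model the ordered case only ever shows the old contents or the complete new contents, while the unordered case has one crash point where the file has grown over a block that was never written.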

Another thing that the article misses is the fact that ext3 reserves 5% of space for its fragmentation avoidance algorithms. So that 92.77% partition capacity really should be 97.77%. You can set the reserved space to 0% if you really want. (But keep in mind that the 5% reserve, like the performance penalty for ordered journalling, buys you something.)
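The arithmetic behind that correction, as a tiny Python sketch. The function name is made up, and it follows the simplification above of treating the reserve and the reported capacity as percentages of the same partition, so the result is approximate. (On ext2/ext3 the reserve percentage is normally changed with the `-m` option of `tune2fs`.)

```python
def capacity_without_reserve(reported_percent, reserved_percent=5.0):
    """Add the reserved share back to the capacity a benchmark reported."""
    return reported_percent + reserved_percent


if __name__ == "__main__":
    print(round(capacity_without_reserve(92.77), 2))
```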

I firmly believe that if ext3 had been designed and maintained by people with a little more marketing savvy, and a little less concern with doing what makes sense from a technical standpoint (like reis... *cough*... er), ext3 would have a much better performance reputation. And I'm glad it wasn't.

(David, I know you know all this, but some budding filesystem benchmarker might read this and get it right.)



Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 1:47 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

keep in mind that these benefits you are listing only count if you are using a hard drive that allows you to disable the write cache on it (i.e. no IDE drives qualify). if the drive can tell you that the write is completed while it's in the drive's memory it's still vulnerable to being lost.

since there is a huge performance penalty for doing this it's seldom done even on the drives that support it.

David Lang

P.S. and no, the drives don't use their platter energy to drive the electronics long enough to write their data. as an exercise consider how many seeks could be required to write all the data in a 16M buffer, and how long that could take. then add in the problems of writing to a platter that's slowing down as you write and you start to realize why nobody does that anymore.
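A rough Python sketch of that exercise. The 4 KiB block size and the 10 ms per seek are assumptions picked for illustration, not figures from this thread, and the function name is made up; the point is only the order of magnitude when every cached block needs its own seek.

```python
def worst_case_flush(buffer_bytes, block_bytes, seek_seconds):
    """Return (number of seeks, total seconds) if every block needs a seek."""
    seeks = buffer_bytes // block_bytes
    return seeks, seeks * seek_seconds


if __name__ == "__main__":
    # Assumed values: 16 MiB cache, 4 KiB blocks, 10 ms average seek.
    seeks, seconds = worst_case_flush(16 * 1024 * 1024, 4096, 0.010)
    print(seeks, "seeks, about", round(seconds), "seconds")
```

With those assumed numbers the worst case is thousands of seeks and tens of seconds, far longer than a spinning-down platter could supply power.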

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 2:49 UTC (Wed) by sbergman27 (subscriber, #10767) [Link]

There is a difference between the benefits "not counting" and the guarantees not being absolute.

But you are correct. Hardware write caching does figure in.

You *can* turn write caching off on most IDE drives but, as you say, there is a considerable performance penalty due to the insane limitation on write request size imposed by the ATA standard. So far as I know, SCSI drives are not so hampered.

P.S. Did I say something about the drive using its platter energy for something?

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 7:28 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

no, you didn't say anything about platter energy, but it's a common misconception that people have about drives and the cache on them

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 15:14 UTC (Wed) by sbergman27 (subscriber, #10767) [Link]

I saw references to that misconception twice yesterday. Oddly, the last time I encountered it (that I remember) was in 1979 when my first computer science instructor made reference to the "fact" that the drive did that. I believed it at the time, and then later decided it was urban legend.

I notice in your previous post you indicate that they don't do that "anymore". Did they really use to do that?

Also of note, and sorry, I don't have a ready link, I was once struck by a thread on lkml in which someone was crash testing different filesystems under Linux. He had a filesystem with lots of writes going on and knew what all the md5sums should be or something like that and would then pull the plug at random times and observe what happened. His question to the list was for someone to please point out what he was "doing wrong". You see, all the other filesystems in the test corrupted files with more or less the same frequency. All except for ext3, that is. It seemed never (or very rarely) to corrupt files and he felt that there must be a problem with his methodology. It was explained that ext3 defaults to data=ordered mode and that such behavior was really expected.

My summary of the thread is probably not completely accurate because it's been a while since I read it, but I was struck by the fact that someone observed the difference even though they were not expecting it.

If I can dig up a link, I'll post it.

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 16:15 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

>I notice in your previous post you indicate that they don't do that "anymore". Did they really used to do that?

I don't know for sure, but every urban legend needs to get started somewhere right? :-)

thinking about it from a practical standpoint, at one point drives had very small buffers (ram was too expensive, and too large to put much on a drive) along with a lot of rotating mass, so it would have been possible to flush that buffer immediately on power-loss (and the tolerances of the data on disk were loose enough to accept the slight distortion that would result).

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 6:46 UTC (Wed) by joib (guest, #8541) [Link]

> keep in mind that these benefits you are listing only count if you are using a hard drive that allows you to disable the write cache on it (i.e. no IDE drives qualify).

As was already mentioned, most IDE drives allow you to disable write-back cache. However, many manufacturers consider this operation a warranty-voiding one, since disabling write-back caching causes many more physical writes, which significantly reduces the life of the drive.

> if the drive can tell you that the write is completed while it's in the drive's memory it's still vulnerable to being lost. since there is a huge performance penalty for doing this it's seldom done even on the drives that support it.

Fortunately, you can have your cake and eat it too. The trick is to implement IO barriers using the CACHE FLUSH and/or FUA commands. That way you can have the performance and MTBF benefits of write-back caching while still having a safe fsync() (safe as in doesn't return before data is on the platters).

Also note that the IO barrier rewrite referenced above was included only from 2.6.16+; I don't know how previous kernels did it.
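From the application side, the "safe fsync()" pattern looks roughly like the Python sketch below. The helper name `durable_write` is made up for illustration; `os.fsync` is the standard library wrapper around fsync(). Whether the data has really reached the platters when it returns still depends on the kernel, filesystem and drive behaving as described above.

```python
import os


def durable_write(path, payload):
    """Write payload to path and ask the kernel to push it to the device."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # drain Python's own userspace buffer
        os.fsync(f.fileno())   # ask the kernel to flush the file to the device


if __name__ == "__main__":
    durable_write("example.dat", b"important data\n")
```

Without the `os.fsync` call the write may sit in the page cache for a while, regardless of what the drive's own cache is doing.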

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 7:29 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

I hadn't caught the fact that the IO barriers had made it into the kernel, I knew they were being worked on. prior to that going in the only option the kernel had was to stop all IO to the drive while issuing a full flush to it.

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted May 6, 2006 13:56 UTC (Sat) by anton (subscriber, #25547) [Link]

>hard drive that allows you to disable the write cache on it (i.e. no
>IDE drives qualify).

hdparm -W0 works on any IDE drive I have tried it on (and we run all
our ext3 FSs without disk write caches).

>since there is a huge performance penalty for doing this

Well, I recently tried it:

Disk: SAMSUNG SV1204H, ATA DISK drive
FS: ext3
Task: writing a 4.3GB file
Time: 12 min without write caching
      6 min with write caching

So the penalty was a factor of 2 in this case.

>as an exercise consider how many seeks could be required to write all
>the data in a 16M buffer, and how long that could take.

The drive could have a 16MB spare area for just this case, and dump
the buffer contents there; it could read that area on powerup, and
then proceed as if there had been no power outage (i.e. write the
blocks to their home location). On a modern drive, this would take
0.25s. So it's not impossible, but I don't believe it is being done.
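For reference, the rates implied by these numbers, as a small Python sketch. It assumes decimal megabytes (4.3GB taken as 4300 MB), and the function name is made up for illustration.

```python
def rate_mb_per_s(megabytes, seconds):
    """Average transfer rate in MB/s."""
    return megabytes / seconds


if __name__ == "__main__":
    # Sequential dump of a 16MB buffer in 0.25s.
    print("spare-area dump:", rate_mb_per_s(16, 0.25), "MB/s")
    # The 4.3GB file test above, without and with write caching.
    print("no write cache: ", round(rate_mb_per_s(4300, 12 * 60), 1), "MB/s")
    print("write cache:    ", round(rate_mb_per_s(4300, 6 * 60), 1), "MB/s")
```

So the 0.25s estimate corresponds to a 64 MB/s sequential rate, while the file test above averaged roughly 6 and 12 MB/s through the filesystem.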

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 16:05 UTC (Wed) by Duncan (guest, #6647) [Link]

> What this article leaves out... what most *every*
> fs benchmark article leaves out is the fact that
> EXT3 gives you a much greater level of journalling
> protection than the others by default. By default,
> EXT3 gives you metadata journalling and ordered
> writes of metadata and data. The others only give
> you metadata journalling by default. And, of the
> others, only reiserfs gives you the option of
> anything higher.

> Reiserfs and EXT3 offer full data journalling.

> EXT3 offers an intermediate option called "ordered"
> mode. [...] This is (wisely in my opinion) EXT3's
> default.

Actually, reiserfs offers ordered mode as well. It didn't originally, but
Chris Mason's patch adding the functionality was merged into the
mainline kernel before full data journalling for reiserfs was added. (A
google turns up this changelog for 2.6.6-rc1:
http://lwn.net/Articles/80719/ ) It became the default either at that
time or soon thereafter.

That said, if you didn't know it was there (I knew in part due to LWN
coverage), it would be and remains very hard to notice that it's now using
ordered. The output at filesystem mount doesn't mention the fact, and of
course being the default, there's no indication in fstab. One has to look
quite carefully at the dmesg output for the mount to notice it. It should
have a line something like (from my boot log, md_d1p1 of course indicates
partitioned RAID):

ReiserFS: md_d1p1: using ordered data mode
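If you would rather not eyeball the log, a few lines of Python can pick that message out of saved dmesg output. This is only a sketch: the function name is made up, and the pattern assumes the message reads exactly like the line quoted above, which may differ between kernel versions.

```python
import re

# Pattern based on the log line quoted above; other kernels may word it differently.
ORDERED_LINE = re.compile(r"ReiserFS: (\S+): using ordered data mode")


def devices_in_ordered_mode(dmesg_text):
    """Return the devices that the log reports as mounted in ordered mode."""
    devices = []
    for line in dmesg_text.splitlines():
        match = ORDERED_LINE.search(line)
        if match:
            devices.append(match.group(1))
    return devices


if __name__ == "__main__":
    sample = "ReiserFS: md_d1p1: using ordered data mode\n"
    print(devices_in_ordered_mode(sample))
```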

I remember looking to see that I was using ordered shortly after
installing and booting that kernel, and wondering why it didn't
mention "ordered mode" in the mount output. I certainly would have missed
it too, had I not known it was there. To your credit, you knew about the
later journalled mode reiserfs patches, but you apparently missed the
ordered mode patches, and that just as with ext3, that's now the default
for reiserfs.

Oh, for anyone interested, those patches /did/ make reiserfs far more
stable. I had a bout with some bad memory during which I was crashing
quite frequently -- and usually under high load and disk activity at
that -- and reiserfs came thru with flying colors! =8^) That's far better
than it did when I first started using it, back in the bad old early 2.4
days.

Duncan

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 26, 2006 21:41 UTC (Wed) by nix (subscriber, #2304) [Link]

Yeah, I haven't lost any reiserfs filesystems since then anyway.

(It's just panicked instead. Three times.)

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 27, 2006 9:23 UTC (Thu) by Wol (guest, #4433) [Link]

I believe there's also an experimental filesystem called TuxFS. It guarantees integrity without journaling :-)

Basically, it never overwrites a live file. Any changes, it writes the modified block out in full somewhere else. Then it rewrites the updated file header out in full somewhere else. Then the directory header ...

Until finally it rewrites the root block. The only time there's any danger is if it goes down while writing the root block. At all other times, the root block is pointing at a completely valid filesystem. If the system crashes, all updates since the last root block update are lost because the modified blocks are orphaned. But all previous data is safe, because data is never modified "in situ".
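The idea can be sketched as a toy in Python. This is not how any real filesystem is implemented: the class `CowStore` and its methods are made up, the file header and directory levels are collapsed into a single directory block, and a "crash" is simulated by simply skipping the root update.

```python
class CowStore:
    """Toy store: blocks are never overwritten, only the root pointer moves."""

    def __init__(self):
        self.blocks = {}              # block id -> contents, written once
        self.next_id = 0
        self.root = self._alloc({})   # root block maps file name -> data block id

    def _alloc(self, contents):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def write(self, name, data, crash_before_root_update=False):
        data_block = self._alloc(data)           # new data goes somewhere else
        new_dir = dict(self.blocks[self.root])   # copy, never modify in place
        new_dir[name] = data_block
        new_root = self._alloc(new_dir)
        if crash_before_root_update:
            return   # the new blocks are orphaned; the old root is still valid
        self.root = new_root                     # the single switch-over step

    def read(self, name):
        return self.blocks[self.blocks[self.root][name]]


if __name__ == "__main__":
    store = CowStore()
    store.write("a", "v1")
    store.write("a", "v2", crash_before_root_update=True)
    print(store.read("a"))   # the interrupted update is lost, the old data is intact
```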

Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch (Debian Administration)

Posted Apr 27, 2006 12:01 UTC (Thu) by nix (subscriber, #2304) [Link]

Downside: a minor problem with fragmentation. :)))

(The extra disk overhead can be disregarded as long as you have decent amounts of cache, because disk writes are generally localized in the directory tree anyway. At least they are if atime updates are disabled. Using a filesystem like this with atime updates enabled strikes me as... perhaps unwise.)

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds