Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 19:29 UTC (Sun) by sbergman27 (subscriber, #10767)
In reply to: Developer interview: Eric Sandeen on ext4 implementation by ArbitraryConstant
Parent article: Developer interview: Eric Sandeen on ext4 implementation

JFS has always suffered from severe lack of PR.  If you look at various benchmarks which have
been done over the years, the ones that happen to include JFS tend to show it doing quite well.
Its performance seems quite even and good in various tests, whereas other filesystems reveal
their Achilles' heel in some phase of the tests.  But in the analyses... you'd think JFS was
invisible.  One of the other filesystems always "wins" in the opinion of the reporter.  And
JFS is hardly mentioned.

XFS has a certain mystique to it that comes from its SGI heritage. (Never mind that it's only
fast with hardware and workloads that most of us will never have.  And never mind that it is
the most likely of the Linux filesystems to trash your data.)  ext3 is ext3: Linux's tried and
true workhorse filesystem.  Reiserfs, and its would-be successor, Reiser4, were the spoiled
children of Linux filesystems, lavished with all the benefits that the Namesys PR machine
could spew out... until daddy went to jail.  And JFS has basked in all of the limelight that
IBM cared to give it... which amounts to pretty much nothing.  Ironically, the major portion
of PR attention that it has received has been from The SCO Group claiming that its (allegedly
illegal) inclusion in Linux was responsible for Linux's success.

I'm an ext3/4 fan, myself.  But I have always thought that the lack of excitement around what
is apparently a very solid and well designed filesystem was notably strange. 



Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 10, 2008 16:44 UTC (Mon) by Nelson (subscriber, #21712) [Link]

How do you figure that XFS is the most likely to trash your data?

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 16:58 UTC (Tue) by sbergman27 (subscriber, #10767) [Link]

Are you surprised?  Many people are.  That is part of what I mean about the XFS "mystique".
:-)

There are actually several factors involved.  I will relate the two that I recall off the top
of my head.  There is at least one other, which makes XFS particularly prone to data loss
after a power failure, but I cannot recall what it is.

XFS does metadata-only journaling.  ext3, ext4, and reiser3 can do full data journaling.
They will also do metadata journaling with ordered writes, and, of course, just plain metadata
journaling.  Here is a summary of what guarantees each level of journaling gives you:

metadata only: If you lose power, the filesystem structure is guaranteed to be valid and does
not require an fsck.  Actual data blocks may contain garbage.

metadata with ordered writes: If you lose power, the filesystem structure is guaranteed to be
valid and does not require an fsck.  The data blocks may or may not contain the very *latest*
data.  But they will not be garbage.

Data journaling:  The filesystem structure is guaranteed to be valid, and the data blocks will
contain the latest data.  If the application was told that the write succeeded, the correct
data will be there.
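
The three levels summarized above can be sketched as a toy model. This is only an illustration of the guarantees as described in this comment, not of how any real filesystem behaves: it assumes a single appended block, a fixed (worst-case) step order per mode, and hypothetical names such as `PLANS` and `state_after_crash`.

```python
# Toy model only: one block is appended to a file and power is lost after
# `steps_done` steps. All names here are hypothetical, for illustration.
GARBAGE, OLD, NEW = "garbage", "old", "new"

# Order in which each mode is assumed to perform its steps.
# For "writeback" (metadata only) the worst-case ordering is shown.
PLANS = {
    "writeback": ["commit_metadata", "write_data"],
    "ordered": ["write_data", "commit_metadata"],
    "journal": ["journal_data_and_metadata", "checkpoint_to_final_location"],
}


def state_after_crash(mode, steps_done):
    """Return what a reader would see in the appended block after a crash."""
    done = PLANS[mode][:steps_done]
    if mode == "journal":
        # Once the journal entry is committed, replay restores the new data.
        return NEW if "journal_data_and_metadata" in done else OLD
    if "commit_metadata" not in done:
        # The file still looks as it did before the write started.
        return OLD
    # Metadata points at the block; its contents depend on the data write.
    return NEW if "write_data" in done else GARBAGE


if __name__ == "__main__":
    for mode in PLANS:
        print(mode, [state_after_crash(mode, n) for n in range(3)])
```

In this model only the metadata-only mode can ever expose a garbage block, which is the distinction the summary draws.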

Ext3 defaults to ordered.  Not sure about reiser3.  Reiser4 does not do journaling but all
writes are atomic, which yields the same guarantees as full data journaling.

So XFS (and JFS) can leave garbage in the datablocks after an unplanned shutdown.  But it gets
worse.  Due to a design decision, on remount, XFS actually nulls out any blocks that were
supposed to be written that didn't actually get written.  i.e. if you pull the plug during a
write, you are pretty much guaranteed to suffer data loss.  If random chance does not leave
garbage in a block, the filesystem will thoughtfully zap your data intentionally.   This is
done for security reasons.

There is at least one other factor.  I can't remember exactly what it is.  But I do remember
that SGI made certain design decisions based upon the hardware they expected XFS to run on.
You see, unlike PC hardware, SGI hardware has "hardware power interrupts", so the OS is
informed *instantly* if the power fails, and can do emergency cleanup.  To that end, SGI power
supplies have large capacitors to give the OS just enough time to do its emergency clean up
before the power is really gone.

Design decisions which made perfect sense in that environment, make far less sense on x86
hardware.

If anyone has any more details, I would be interested in being reminded of just what that third
factor actually is.  I think Ted Tso or someone wrote a mailing list post on this topic some
years back.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 17:39 UTC (Tue) by Nelson (subscriber, #21712) [Link]

Actually, I asked because I did extensive testing of XFS, JFS and EXT3 for a Linux-based set-top box a few years back and found exactly the opposite to be true after tens of thousands of power-cycles.

Not once did we have an XFS filesystem whose structure was damaged; it was the only filesystem that behaved that way. What you say about blocks being zeroed during writes is true, and that was advantageous: we could easily determine when and where in our data streams the power failure happened, and didn't have to monkey around guessing whether the data was somehow missing portions. There aren't good semantics defined as to what should happen to the data in a file during a write and a power failure. It's also worth noting that this usually affected only a very small amount of data that was actively being written; you could have dozens of files open for write and, unless they had unsynced buffers, they were fine. I don't have the exact numbers in front of me at the moment, but ext3 rendered itself broken under 1% of the time (I want to say something like 17 times out of 10000 power-cycle tests). Reiserfs 3 was more prone to breaking, at more than 1%. I forget the JFS numbers, but I want to say it was between ext3 and Reiserfs.

I guess it depends on how you measure robustness and reliability. Having one of the few files being actively written to lose a couple of blocks was thought to be better than the whole thing not being mountable and losing all of the other files too.

I wouldn't say it has a mystique around it, it's a very fine piece of software, it's very good at some things and somewhat average at others; unfortunately it's not the best supported piece of code in the Linux kernel and it seems to be prone to kernel panics as some parts of the kernel are updated and that gives me pause. It's unusual how the kernel community at large hasn't really embraced it and that's somewhat fascinating. I'd not exactly encourage folks to run it but that wouldn't be because of reliability, it would be because of support.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 18:01 UTC (Tue) by sbergman27 (subscriber, #10767) [Link]

Some years back, there was a post to lkml from someone doing filesystem reliability testing.
This was before reiserfs had ordered writes.  And the fellow was asking for help to determine
what was wrong with his methodology.  You see, in his tests, in which power was removed at
random times, over and over, reiserfs, jfs, and xfs all performed about the same.  But ext3
never appeared to have dropped anything.  He was quite certain that he must be doing something
wrong.  Turned out it was the ordered writes that made the difference.  I've tried to locate
that thread a few times, over the years, but never been able to find it again.

I've also noted more complaints from desktop users about XFS eating their data than for other
filesystems.

I believe that the reasons the kernel guys are not so hot on XFS are that it:

1. Is big and complex.

2. Has an ugly glue layer to make it interface to the Linux VFS.

And face it, as well designed in the 90s as it might have been... it was still designed in
the 90s.

I'm quite anxious to see how btrfs shapes up!

XFS and NULLS and power, oh my

Posted Mar 12, 2008 13:32 UTC (Wed) by sandeen (subscriber, #42852) [Link]

People bring this up over and over, but rarely get it right.

The XFS page has a FAQ about this.

XFS does not leave garbage in datablocks after an unclean shutdown. Doing so would be a security hole. And, the filesystem is not explicitly zeroing anything - what is going on is that a truncate may happen which removes extents from a file, and then an extending write is done. Because XFS uses delayed allocation, which waits until flush to actually allocate data blocks, it is possible that, depending on when the crash occurs, you have metadata on-disk with a file size but no extents - i.e., a sparse file. XFS is not explicitly zeroing any blocks; at this point in time the file has no blocks. And so you see nulls when you read, just as you would with a sparse file. Some of this has to do with applications (im)properly looking after their data - see for example Stewart Smith's LCA talk.
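
The "file size but no extents" case can be imitated from user space. The following is a minimal sketch, not a reproduction of the XFS crash scenario: it simply sets a file's length without writing data and reads it back. The function name `read_size_only_file` is made up for this example, and it assumes a platform where the extended area reads back as zero bytes (true on Linux; Python documents this as platform-dependent).

```python
import os
import tempfile


def read_size_only_file(size):
    """Create a file that has a length but no written data, then read it."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "demo.bin")
        with open(path, "wb") as f:
            # Extend the file to `size` bytes without writing any data.
            f.truncate(size)
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    data = read_size_only_file(4096)
    # On Linux every byte read back is a null byte.
    print(len(data), data == b"\x00" * len(data))
```

Nothing zeroed this file explicitly; the nulls are just what reading a hole returns.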

Recent changes to XFS have alleviated this problem by forcing a flush on file close when a file has been through these operations.
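
On the application side, one common way of looking after data is to write a temporary file, flush it to disk, and rename it over the original. This is a sketch under assumptions, not anything taken from XFS or from the talk mentioned above: `durable_replace` is a hypothetical helper name, and the directory `fsync` step assumes a POSIX system such as Linux.

```python
import os
import tempfile


def durable_replace(path, data):
    """Replace `path` with `data`, flushing to disk before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to push it to the disk
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry too (POSIX only; an assumption of this sketch).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "settings.conf")
        durable_replace(target, b"first version\n")
        durable_replace(target, b"second version\n")
        with open(target, "rb") as f:
            print(f.read())
```

The intent is that a crash leaves either the old contents or the new ones under the final name, rather than a truncated file that was never flushed.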

Also, this business about flux capacitors or whatnot in SGI hardware is total bunk, as far as I know. I know that this is taken as an article of faith on the interwebs, but I do not think it is correct.

One big hardware issue w.r.t. any journaling filesystem is the write cache in disks, because the filesystem must know when writes are truly safe on disk. XFS has write barriers enabled by default whenever the underlying storage allows it which, despite some performance overhead, is critical to ensure filesystem integrity after a crash. Due to implementation details, XFS journaling may be more susceptible to write cache problems, but in general it will affect any journaling filesystem.

XFS and NULLS and power, oh my

Posted Mar 13, 2008 5:27 UTC (Thu) by sbergman27 (subscriber, #10767) [Link]

"""
Also, this business about flux capacitors or whatnot in SGI hardware is total bunk, as far as
I know. I know that this is taken as an article of faith on the interwebs, but I do not think
it is correct.
"""

Thanks for the info.  But about the capacitors, I was just repeating what Ted Tso has said.
:-)

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 11:51 UTC (Tue) by njd27 (subscriber, #5770) [Link]

Truly it has been said that in most cases you don't want something that is storing all your
useful data to be "exciting".

Boringly reliable is a good goal for a filesystem.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds