
Developer interview: Eric Sandeen on ext4 implementation

Rodrigo Menezes talks with Eric Sandeen about the ext4 implementation in Fedora 9. "How much upstream development does Fedora drive on Ext4? Eric Sandeen: ext4 development has been a joint effort by several entities. A quick look at the linux-ext4 mailing list will show contributions from several companies and individuals, all interested in helping to develop ext4. One of my responsibilities at Red Hat is to do filesystem work for Fedora and RHEL, so I've also been doing what I can to move things along by submitting patches, testing, fixing, etc."


Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 4:45 UTC (Sat) by beoba (guest, #16942) [Link] (1 responses)

The Fedora 9 Alpha installer can already install onto the ext4 filesystem, provided that the proper secret phrase is given at the installer boot prompt.

I hope the phrase is something witty.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 11:24 UTC (Sat) by rahulsundaram (subscriber, #21946) [Link]

The phrase is "iamanext4developer" at the installation boot prompt.


More details at

http://fedoraproject.org/wiki/FedoraExt4
http://fedoraproject.org/wiki/Features/Ext4

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 10:06 UTC (Sat) by jpetso (guest, #36230) [Link]

> One of my responsibilities at Red Hat is to do filesystem work
> for Fedora and RHEL

Heh... *RHEL*
Guess that new policy hasn't found its way to the interview team :P

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 13:15 UTC (Sat) by szh (guest, #23558) [Link] (24 responses)

Is ext4 going to become irrelevant because it lacks metadata checksums? Why is it hard to
add this feature?

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 15:47 UTC (Sat) by ArbitraryConstant (guest, #42725) [Link] (23 responses)

I'm also a bit irritated that they didn't include dynamic inode allocation.

ext2/3 can exhaust their inodes if you're doing something with tons of small files (e.g. a
server with maildir inboxes), and there's no easy way to increase the inode allocation on a
filesystem without resizing the whole thing, which is not always an option. Allocating a
larger number of inodes at mkfs time is possible, but this uses up a lot of space, so it
requires knowledge about how the machine will be used that may not be available ahead of time.
Ext4 exacerbates this problem by doubling the on-disk inode size.
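A quick way to see this in practice: `df -i` reports inode usage separately from block usage, and mke2fs's bytes-per-inode ratio (`-i`) is the knob that trades space for inodes at creation time. The device name below is a placeholder:

```shell
# Check inode usage; IUse% at 100% means "out of space" even when
# plenty of blocks remain free:
df -i /

# The inode count is fixed at mkfs time.  A smaller bytes-per-inode
# ratio yields more inodes, at the cost of space for the inode tables.
# /dev/sdb1 is a hypothetical device:
# mkfs.ext3 -i 4096 /dev/sdb1     # one inode per 4 KiB of capacity
# mkfs.ext3 -T news /dev/sdb1     # preset tuned for many small files
#
# e.g. an 8 GiB filesystem at -i 4096 gets 8589934592 / 4096 = 2097152 inodes.
```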

This is one of the reasons I have historically preferred reiser, but it's pretty much dead at
this point.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 20:20 UTC (Sat) by rjamestaylor (guest, #339) [Link] (2 responses)

> ext2/3 can exhaust their inodes if you're doing something with tons of small files

Yes, it can!

I've had clients use things like phpThumb to create millions (in one case, 6 million+) of
under 1K files in a single directory and then wonder why they are out of disk space and why
IOWait is so high... *sigh* There is nothing so idiot-proof that an educated idiot can't
break.

Second law of Engineering

Posted Mar 9, 2008 20:27 UTC (Sun) by vonbrand (subscriber, #4458) [Link]

I heard it called that...

It is impossible to create anything idiot-proof. As soon as you manage to create something no existing idiot can foul up, the universe makes a bigger idiot.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 15, 2008 14:50 UTC (Sat) by job (guest, #670) [Link]

If they didn't know about the limitation it is a bit harsh to call them idiots. I think it is
a perfectly fair assumption that a modern computer should be able to handle a couple of
million files without hiccups. After all, one would consider a database broken if it slowed
down after six million records. The difference here is historical rather than technical.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 11:16 UTC (Sun) by tzafrir (subscriber, #11501) [Link] (19 responses)

In some cases there are only inodes and no files. This is called a hard link.

In some cases you can have many hard links in a filesystem. One such case is a "backup" scheme
that hard-links the files that remained the same. rsync can do it for you.
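A minimal sketch of that scheme: GNU cp's `-l` hard-links unchanged files instead of copying them, and rsync's `--link-dest` does the same against a previous snapshot (directory names here are made up):

```shell
# Snapshot backups by hard-linking unchanged files: each snapshot is a
# full directory tree, but unchanged files share one inode.  Directories,
# however, cannot be hard-linked, so every snapshot still consumes one
# inode per directory -- which is what eats inodes over time.
mkdir -p snapA && echo "unchanged data" > snapA/file
cp -al snapA snapB             # GNU cp: hard-link instead of copying data
stat -c '%h' snapB/file        # link count is now 2: both snapshots share it

# rsync against a previous snapshot (paths hypothetical):
# rsync -a --link-dest=/backup/prev /home/ /backup/today/
```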

I use similar programs (pdumpfs or faubackup), and this currently keeps me from using ext3.
All of the alternatives at the moment seem to have issues and ext[34] seems to be the one that
gets the most development.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 13:01 UTC (Sun) by danieldk (guest, #27876) [Link] (2 responses)

In some cases there are only inodes and no files. This is called a hard link.

Normally, a hard link is a directory entry (dirents point to inodes), and it increases the link count in the inode being referred to. Adding a hard link creates an additional directory entry in the directory where the link is made.

Maybe you were referring to symbolic links?
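The distinction is easy to verify from a shell: a hard link shares the target's inode (and bumps its link count), while a symlink gets an inode of its own:

```shell
echo hello > original
ln original hardlink                # new dirent, same inode, link count 2
ln -s original symlink              # a symlink is a separate inode
stat -c '%i %h' original hardlink   # same inode number and count for both
stat -c '%i' symlink                # a different inode number
```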

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 13:04 UTC (Sun) by tzafrir (subscriber, #11501) [Link] (1 responses)

Hmm... so I counted wrong. The extra directories take up the extra inodes.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 14:22 UTC (Sun) by ArbitraryConstant (guest, #42725) [Link]

You are correct. I use rsnapshot, which appears to use the same hardlinking technique to
generate its backups.

In this case, the directories on disk use up inodes, and for the purposes of inode consumption
they act very much like small files.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 14:33 UTC (Sun) by ArbitraryConstant (guest, #42725) [Link] (15 responses)

The one bright spot on the horizon seems to be BTRFS. It's going to be a long time before it's
production ready, but it's similar to ZFS in many respects and primary development is
happening on Linux, so it should be able to avoid the oddities that JFS/XFS have
experienced. It's also backed by Oracle, so it won't go away if the lead developer
encounters... legal troubles.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 18:39 UTC (Sun) by jwb (guest, #15467) [Link] (14 responses)

JFS was backed by IBM and it went away.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 18:48 UTC (Sun) by ArbitraryConstant (guest, #42725) [Link] (11 responses)

Went away from what? I was under the impression it was pretty well supported these days.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 19:29 UTC (Sun) by sbergman27 (guest, #10767) [Link] (7 responses)

JFS has always suffered from a severe lack of PR.  If you look at various benchmarks which have
been done over the years, the ones that happen to include JFS tend to show it doing quite well.
Its performance seems quite even and good in various tests, whereas other filesystems reveal
their Achilles' heel in some phase of the tests.  But in the analyses... you'd think JFS was
invisible.  One of the other filesystems always "wins" in the opinion of the reporter.  And
JFS is hardly mentioned.

XFS has a certain mystique to it that comes from its SGI heritage. (Never mind that it's only
fast with hardware and workloads that most of us will never have.  And never mind that it is
the most likely of the Linux filesystems to trash your data.)  ext3 is ext3: Linux's tried and
true workhorse filesystem.  Reiserfs, and its would-be successor, Reiser4, were the spoiled
children of Linux filesystems, lavished with all the benefits that the Namesys PR machine
could spew out... until daddy went to jail.  And JFS has basked in all of the limelight that
IBM cared to give it... which amounts to pretty much nothing.  Ironically, the major portion
of PR attention that it has received has been from The SCO Group claiming that its (allegedly
illegal) inclusion in Linux was responsible for Linux's success.

I'm an ext3/4 fan, myself.  But I have always thought that the lack of excitement around what
is apparently a very solid and well designed filesystem was notably strange. 

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 10, 2008 16:44 UTC (Mon) by Nelson (subscriber, #21712) [Link] (5 responses)

How do you figure that XFS is the most likely to trash your data?

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 16:58 UTC (Tue) by sbergman27 (guest, #10767) [Link] (4 responses)

Are you surprised?  Many people are.  That is part of what I mean about the XFS "mystique".
:-)

There are actually several factors involved.  I will relate the two that I recall off the top
of my head.  There is at least one other, which makes XFS particularly prone to data loss
after a power failure, but I cannot recall what it is.

XFS does metadata-only journaling.  ext3, ext4, and reiser3 can do full data journaling.
They will also do metadata journaling with ordered writes, and, of course, just plain metadata
journaling.  Here is a summary of what guarantees each level of journaling gives you:

metadata only: If you lose power, the filesystem structure is guaranteed to be valid and does
not require an fsck.  Actual data blocks may contain garbage.

metadata with ordered writes: If you lose power, the filesystem structure is guaranteed to be
valid and does not require an fsck.  The data blocks may or may not contain the very *latest*
data, but they will not be garbage.

data journaling: The filesystem structure is guaranteed to be valid, and the data blocks will
contain the latest data.  If the application was told that the write succeeded, the correct
data will be there.

Ext3 defaults to ordered.  Not sure about reiser3.  Reiser4 does not do journaling, but all
writes are atomic, which yields the same guarantees as full data journaling.
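The three levels above map directly onto ext3's data= mount option; choosing one looks something like this (device and mount point are placeholders):

```shell
# ext3's journaling mode is selected per-mount with data=:
#   data=writeback   metadata-only journaling
#   data=ordered     metadata journaling with ordered data writes (default)
#   data=journal     full data journaling
mount -o data=journal /dev/sda2 /var/mail

# or, persistently, in /etc/fstab:
# /dev/sda2  /var/mail  ext3  data=journal  0 2
```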

So XFS (and JFS) can leave garbage in the data blocks after an unplanned shutdown.  But it gets
worse.  Due to a design decision, on remount, XFS actually nulls out any blocks that were
supposed to be written but didn't actually get written.  i.e., if you pull the plug during a
write, you are pretty much guaranteed to suffer data loss.  If random chance does not leave
garbage in a block, the filesystem will thoughtfully zap your data intentionally.  This is
done for security reasons.

There is at least one other factor.  I can't remember exactly what it is.  But I do remember
that SGI made certain design decisions based upon the hardware they expected XFS to run on.
You see, unlike PC hardware, SGI hardware has "hardware power interrupts", so the OS is
informed *instantly* if the power fails, and can do emergency cleanup.  To that end, SGI power
supplies have large capacitors to give the OS just enough time to do its emergency clean up
before the power is really gone.

Design decisions which made perfect sense in that environment, make far less sense on x86
hardware.

If anyone has any more details, I would be interested in being reminded of just what that third
factor actually is.  I think Ted Ts'o or someone wrote a mailing list post on this topic some
years back.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 17:39 UTC (Tue) by Nelson (subscriber, #21712) [Link] (1 responses)

Actually, I asked because I did extensive testing of XFS, JFS and EXT3 for a Linux based set-top box a few years back and found exactly the opposite to be true after 10s of thousands of power-cycles.

Not once did we have an XFS filesystem whose structure was damaged; it was the only filesystem that behaved that way. What you say about blocks being zeroed during writes is true, and that was advantageous: we could easily determine when and where in our data streams the power failure happened, and didn't have to monkey around guessing whether the data was somehow missing portions. There aren't good semantics defined for what should happen to the data in a file during a write and a power failure. It's also worth noting that this was usually a very, very small amount of data that was actually actively being written; you could have dozens of files open for write, and unless they had unsynced buffers, they were fine. I don't have the exact numbers in front of me at the moment, but ext3 rendered itself broken under 1% of the time, I want to say 17 times out of 10,000 power-cycle tests. Reiserfs 3 was more prone to breaking, more than 1%, and I forget the JFS numbers but I want to say it was between ext3 and Reiserfs.

I guess it depends on how you measure robustness and reliability, having one of the few files being actively written to lose a couple blocks was thought to be better than the whole thing not being mountable and losing all of the other files too.

I wouldn't say it has a mystique around it, it's a very fine piece of software, it's very good at some things and somewhat average at others; unfortunately it's not the best supported piece of code in the Linux kernel and it seems to be prone to kernel panics as some parts of the kernel are updated and that gives me pause. It's unusual how the kernel community at large hasn't really embraced it and that's somewhat fascinating. I'd not exactly encourage folks to run it but that wouldn't be because of reliability, it would be because of support.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 18:01 UTC (Tue) by sbergman27 (guest, #10767) [Link]

Some years back, there was a post to lkml from someone doing filesystem reliability testing.
This was before reiserfs had ordered writes.  And the fellow was asking for help to determine
what was wrong with his methodology.  You see, in his tests, in which power was removed at
random times, over and over, reiserfs, jfs, and xfs all performed about the same.  But ext3
never showed to have dropped anything.  He was quite certain that he must be doing something
wrong.  Turned out it was the ordered writes that made the difference.  I've tried to locate
that thread a few times, over the years, but never been able to find it again.

I've also noted more complaints from desktop users about it eating their data than for other
filesystems. 

I believe that the reason that the kernel guys are not so hot on XFS is that it is:

1. Big and complex

2. Has an ugly glue layer to make it interface to the Linux VFS.

And face it, as well designed in the 90s as it might have been... it was still designed in
the 90s.

I'm quite anxious to see how btrfs shapes up!

XFS and NULLS and power, oh my

Posted Mar 12, 2008 13:32 UTC (Wed) by sandeen (guest, #42852) [Link] (1 responses)

People bring this up over and over, but rarely get it right.

The XFS page has a FAQ about this.

XFS does not leave garbage in datablocks after an unclean shutdown. Doing so would be a security hole. And, the filesystem is not explicitly zeroing anything - what is going on is that a truncate may happen which removes extents from a file, and then an extending write is done. Because XFS uses delayed allocation, which waits until flush to actually allocate data blocks, it is possible that, depending on when the crash occurs, you have metadata on-disk with a file size but no extents - i.e., a sparse file. XFS is not explicitly zeroing any blocks; at this point in time the file has no blocks. And so you see nulls when you read, just as you would with a sparse file. Some of this has to do with applications (im)properly looking after their data - see for example Stewart Smith's LCA talk.

Recent changes to XFS have alleviated this problem by forcing a flush on file close when a file has been through these operations.
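The sparse-file effect described above is easy to reproduce directly, no crash required: reading a hole returns zeros because no data blocks back it.

```shell
# A file can have a size but no allocated extents; reads of the "hole"
# return NUL bytes, exactly what you see in the crash scenario above.
truncate -s 4096 sparse.dat       # set the size without writing any data
du -k sparse.dat                  # typically 0: no blocks allocated
head -c 4096 /dev/zero | cmp -s - sparse.dat && echo "all nulls"
```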

Also, this business about flux capacitors or whatnot in SGI hardware is total bunk, as far as I know. I know that this is taken as an article of faith on the interwebs, but I do not think it is correct.

One big hardware issue w.r.t. any journaling filesystem is the write cache in disks, because the filesystem must know when writes are truly safe on disk. XFS has write barriers enabled by default whenever the underlying storage allows it; despite some performance overhead, this is critical to ensuring filesystem integrity after a crash. Due to implementation details, XFS journaling may be more susceptible to write cache problems, but in general this will affect any journaling filesystem.
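For the write-cache issue, the relevant knobs on a typical Linux box of this era are the filesystem's barrier mount option and the drive's cache setting; the device names are placeholders, and the commands need root:

```shell
# ext3 did not enable barriers by default at this point; XFS did.
# Enable them explicitly at mount time (hypothetical device):
# mount -o barrier=1 /dev/sda2 /data     # ext3

# Alternatively, disable the drive's volatile write cache entirely,
# trading performance for safety:
# hdparm -W 0 /dev/sda
```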

XFS and NULLS and power, oh my

Posted Mar 13, 2008 5:27 UTC (Thu) by sbergman27 (guest, #10767) [Link]

"""
Also, this business about flux capacitors or whatnot in SGI hardware is total bunk, as far as
I know. I know that this is taken as an article of faith on the interwebs, but I do not think
it is correct.
"""

Thanks for the info.  But about the capacitors, I was just repeating what Ted Ts'o has said.
:-)

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 11, 2008 11:51 UTC (Tue) by njd27 (subscriber, #5770) [Link]

Truly it has been said that in most cases you don't want something that is storing all your
useful data to be "exciting".

Boringly reliable is a good goal for a filesystem.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 19:53 UTC (Sun) by jwb (guest, #15467) [Link] (2 responses)

I happen to use it, because it's fast, but the former maintainer has stated on l-k that he and
IBM have no interest in maintaining JFS for Linux, and nobody is working on it at all.

Like the below post says, JFS is fast and stable and widely ignored.  I use it for PostgreSQL
data volumes on fast RAIDs and find that it outperforms everything.  It's reputed to have
long-term fragmentation problems but I haven't personally seen those.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 20:25 UTC (Sun) by ArbitraryConstant (guest, #42725) [Link]

Ugh... do you have a link?

Either way, if BTRFS becomes widely used, I think its chances are a lot better even if Oracle
abandons it. JFS is not very widely used.

In the xfs/jfs/reiser3/ext{3,4} generation of filesystems, they were all roughly comparable,
but I don't think there's any plausible competition for BTRFS on Linux, looking at what's
likely to be available in 2009+.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 10, 2008 11:50 UTC (Mon) by csamuel (✭ supporter ✭, #2624) [Link]

To clarify, this is what Dave Kleikamp, the maintainer, wrote in February on the JFS discussion list:

I guess it depends on what he means by "actively". There is not much in the way of active development, but I think I do a reasonably good job of keeping up with changes in the mainline kernel and responding to and fixing bugs. Some hard-to-diagnose bugs do fall through the cracks.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 9, 2008 19:26 UTC (Sun) by beoba (guest, #16942) [Link] (1 responses)

I'm using JFS, haven't ever had problems. Though I also haven't had issues with other
filesystems either. To Grandparent: What sort of oddities does JFS have?

Really, I consider filesystems to be one of those things that if you're getting frequent
updates, there's probably something wrong.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 10, 2008 21:14 UTC (Mon) by alankila (guest, #47141) [Link]

My experience: I used JFS as system disk for a debian testing or unstable distro for a few
years. The disk was always pretty tight on space (80 % or so), so I guess fragmentation took
its toll with the massive volume of updates coming in almost every day. In the end days, the
filesystem had abysmal performance -- I recall that doing anything at all on it basically
stalled the computer into a long-term disk-crunching session.

These days I have switched to ext3. I have figured it doesn't really make any kind of
practical difference, and it's best to use the thing that most others use as well. I haven't
noticed similar performance loss with time in any other filesystem I have used.

Developer interview: Eric Sandeen on ext4 implementation

Posted Mar 8, 2008 16:53 UTC (Sat) by linux_puttur (guest, #50963) [Link]

Good to know that we are bypassing the 16T filesystem limit and moving to 16P; this is what big
storage vendors have wanted for a long time.


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds