never understood why Linux community continues to ignore ZFS

Posted Mar 20, 2007 3:13 UTC (Tue) by qu1j0t3 (subscriber, #25786)
Parent article: The 2007 Linux Storage and File Systems Workshop

You can't ignore it forever... Here are just two reasons:
1) No fsck (always consistent).
2) (Unlike RAID or other "trusted" subsystems) never delivers bad data to applications (end to end checksummed).
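
As a rough illustration of what "end to end checksummed" means, here is a minimal Python sketch. It is not ZFS code and does not reflect its on-disk format; `BlockStore`, `write_block`, `read_block` and `ChecksumError` are hypothetical names, and SHA-256 merely stands in for whatever checksum a real filesystem would use. The idea it shows is that the checksum is kept apart from the data block, so a silently corrupted block is caught on read instead of being handed to the application.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block no longer matches its recorded checksum."""


class BlockStore:
    """Toy block store: data and checksums live in separate tables."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes (the "disk")
        self.checksums = {}  # block id -> hex digest, kept apart from the data

    def write_block(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read_block(self, block_id):
        data = self.blocks[block_id]
        # Verify before returning anything to the caller.
        if hashlib.sha256(data).hexdigest() != self.checksums[block_id]:
            raise ChecksumError(f"block {block_id} failed verification")
        return data
```

Overwriting `store.blocks[1]` behind the store's back and then calling `store.read_block(1)` raises `ChecksumError` rather than returning the damaged bytes.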

Google for all the other reasons why we won't accept anything less than Solaris 10's ZFS, or something like it (actually there's really nothing else like it), soon. Well, trust Apple to take the lead: It will be in OS X 10.5.


never understood why Linux community continues to ignore ZFS

Posted Mar 20, 2007 3:21 UTC (Tue) by rfunk (subscriber, #4054) [Link]

You are aware that ZFS is not licensed under the GPL, and will likely
never be licensed compatibly with the GPLv2 (required for Linux kernel
inclusion), right?

Gee that's tough.

Posted Mar 20, 2007 3:43 UTC (Tue) by qu1j0t3 (subscriber, #25786) [Link]

I guess those who want next-generation capabilities will be running Solaris for a while yet. No great hardship.

But they're also ignoring the ideas of ZFS (those who forget...condemned to repeat...etc). Out of spite?

I see a nice Wikipedia page has sprung up.

Gee that's tough.

Posted Mar 20, 2007 4:21 UTC (Tue) by bronson (subscriber, #4806) [Link]

Are you trolling? I don't understand what you're trying to say.

Gee that's tough.

Posted Mar 20, 2007 4:47 UTC (Tue) by drag (subscriber, #31333) [Link]

Most of the features of ZFS are available on Linux right now.

LVM, MD, DM, and lots of other FS-related features can be combined in different ways to accomplish the vast majority of what ZFS does. All of it has been around for a long time and is proven. It's just in many smaller components instead of one big monolithic package.

The main thing that ZFS brings to the table is that it's simpler to administer. There are a few fancier features that Linux doesn't have, like 128bit-ness, but there are a lot of things that Linux can do that ZFS can't.

Essentially if Linux developers decided to adopt ZFS they'd be replicating existing functionality.

Plus ZFS is copyrighted by Sun and is not available for Linux inclusion due to licensing differences.

Also I expect that Sun has patents on various aspects of ZFS, so unless Linux is able to incorporate code from Sun into the kernel, it's likely that Linux developers would violate obvious patents if they tried to re-implement it.

Just my perspective on the whole 'zfs' issue.

ZFS

Posted Mar 20, 2007 8:06 UTC (Tue) by job (subscriber, #670) [Link]

I don't think that is the main thing ZFS brings. Of course, for Solaris admins, the main thing is that ZFS brings logical volume-type functionality which was badly missing from vanilla Solaris before.

But to me, coming from a Linux background, it brings 1) much better snapshot capabilities (more of them with better performance) due to it being copy-on-write at the file system level, 2) checksummed data integrity (I haven't seen this elsewhere but I suspect this is great for running low end storage systems such as software RAID) and 3) better performing software raid-5 ("raid-z") because of its knowledge of how the files are laid out.

I like ZFS. I haven't used it in production (and probably won't, at my current work), but it is a solid piece of engineering. It seems a bit bastardized at first as it is both a file system and a volume manager, but I think it is well motivated.

Gee that's tough.

Posted Mar 20, 2007 15:18 UTC (Tue) by cajal (guest, #4167) [Link]

"LVM, MD, DM, and lots of other FS-related features can be combined in different ways to accomplish the vast majority of what ZFS does."

No, they can't. They don't give you self-repairing storage with provable data correctness. They don't give you dynamic striping. The LVM still requires you to manually carve up PVs into LVs. ext3 doesn't support arbitrary number of extended attributes. I could go on.
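
To make "self-repairing storage" concrete, here is a small Python sketch of a mirrored read that repairs itself. It is only an assumption-laden toy, not how ZFS is implemented: `read_self_healing` and `digest` are hypothetical helpers, each mirror side is modelled as a plain dict, and SHA-256 stands in for the real checksum. The read returns the first copy that matches the recorded checksum and overwrites any copy that does not.

```python
import hashlib


def digest(data):
    """Checksum helper; SHA-256 is an arbitrary stand-in."""
    return hashlib.sha256(data).hexdigest()


def read_self_healing(copies, block_id, expected):
    """Return verified data for block_id, repairing bad copies in place.

    copies   -- list of dicts, one per mirror side, mapping block id -> bytes
    expected -- checksum recorded when the block was written
    """
    good = None
    for copy in copies:
        if digest(copy[block_id]) == expected:
            good = copy[block_id]
            break
    if good is None:
        raise IOError(f"block {block_id}: no copy matches its checksum")
    # Overwrite every copy that failed verification with the good data.
    for copy in copies:
        if digest(copy[block_id]) != expected:
            copy[block_id] = good
    return good
```

A plain mirror without checksums cannot do this, because when the two sides disagree it has no way of telling which one is correct.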

Further, just because the GPL is not compatible with Sun's implementation of ZFS, is no reason that the Linux kernel community couldn't reimplement ZFS. The on-disk format and man pages are available at http://opensolaris.org/os/community/zfs/docs/ I think this could be a valuable piece of software for the Linux kernel.

Gee that's tough.

Posted Mar 21, 2007 13:40 UTC (Wed) by mennucc1 (subscriber, #14730) [Link]

> Most of the features of ZFS are available on Linux right now.

but still I would like to have snapshots in EXT (maybe in 5 :-) ?)

Gee that's tough.

Posted Mar 21, 2007 16:06 UTC (Wed) by bronson (subscriber, #4806) [Link]

For snapshotting, just run whatever filesystem you want on top of LVM. LVM is a little hard to get used to at first but it's definitely worth the effort. Here are some notes I took when setting it up on my systems: http://wiki.u32.net/LVM

I now put all nontrivial partitions on LVM. Works for me. I'll let others argue whether LVM snapshots are worse than ZFS, or if ZFS is a layering violation. :)

Gee that's tough.

Posted Mar 22, 2007 5:55 UTC (Thu) by snitm (subscriber, #4031) [Link]

LVM2 snapshots are quite bad. For starters they are done at the block level, whereas ZFS provides file-level snapshots (aka redirect on write). LVM2 snapshots don't scale well either: each snapshot imposes its own copy-out penalty, because there isn't a shared exception store (aka LVM snapshot LV) for all snapshots of an origin LV.
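
The scaling argument can be put into numbers with a deliberately simplified Python model. The function names are hypothetical, and the model assumes that all snapshots were taken before the writes happen and that only data copies count (metadata I/O is ignored). With one exception store per snapshot, the first write to an origin chunk has to preserve the old contents once for every snapshot; with a single shared store, one preserved copy would serve them all.

```python
def copy_outs_per_snapshot_store(num_snapshots, chunks_written):
    """Copy-on-write with one exception store per snapshot.

    The first write to an origin chunk must preserve the old contents
    separately for every snapshot, so each distinct chunk costs
    num_snapshots copies.
    """
    return num_snapshots * len(set(chunks_written))


def copy_outs_shared_store(num_snapshots, chunks_written):
    """Copy-on-write with one exception store shared by all snapshots.

    Assumes all snapshots predate these writes, so a single preserved
    copy of each distinct chunk serves every snapshot.
    """
    return len(set(chunks_written)) if num_snapshots > 0 else 0
```

For example, writing to three distinct chunks of an origin with four snapshots costs twelve copy-outs in the first model and three in the second.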

Gee that's tough.

Posted Mar 20, 2007 11:58 UTC (Tue) by nix (subscriber, #2304) [Link]

They're ignoring the ideas of ZFS. Right.

Of course this means that the Val Henson who's a Linux filesystems hacker must be an *entirely different person* from the vhenson@eng.sun.com who was closely involved with ZFS development? (Perhaps she has a secret twin.)

Gee that's tough.

Posted Mar 20, 2007 18:10 UTC (Tue) by zooko (subscriber, #2589) [Link]

The fact that Val Henson worked on ZFS doesn't prove that the Linux filesystem developers are learning lessons from ZFS. It doesn't even prove that Henson has learned lessons from ZFS. Indeed, since Henson is the author of the "Compare-By-Hash" paper critiquing compare-by-hash on intuitively appealing but incorrect grounds, perhaps she dislikes the ZFS design, which features compare-by-hash among many other ideas.

Note: whenever anyone cites that regrettable compare-by-hash paper, they really ought to cite Graydon Hoare's follow-up (disclaimer: I helped Graydon a bit on writing that page), John Black's follow-up, and Henson and Henderson's much better self-followup.

Oh, I see that Jeff Bonwick is given thanks in that last paper. This is evidence that Henson has benefitted from the lessons of ZFS.

Gee that's tough.

Posted Mar 31, 2007 0:57 UTC (Sat) by wmf (subscriber, #33791) [Link]

But ZFS doesn't use compare-by-hash; that's one of its advantages over previous work like Venti/Fossil.

never understood why Linux community continues to ignore ZFS

Posted Mar 20, 2007 6:33 UTC (Tue) by dmarti (subscriber, #11625) [Link]

Try it with FUSE?

fsck

Posted Mar 20, 2007 6:03 UTC (Tue) by Ross (subscriber, #4065) [Link]

It's nice to have fsck even if you don't use it automatically because not all corruption happens in ways that can be repaired by journal replay or unrolling a transaction. Checksums are a great idea, and so is using RAID, but the ability to scan the entire disk for metadata inconsistencies remains valuable despite them.

fsck

Posted Mar 20, 2007 10:46 UTC (Tue) by etienne_lorrain@yahoo.fr (subscriber, #38022) [Link]

> not all corruption happens in ways that can be repaired by journal replay or unrolling a transaction

I had a problem a few days ago and was not impressed by journal replay on ext3.
I think I remember that the FS would not unmount at the end of the shutdown, but there was no other sign of trouble during the few hours of that Linux session.
No funny setup: no LVM, no RAID, simple partition table, no SMP, ia32... HD SMART good even after the crash, no hardware change for a long time.
The result was the loss of the root directory and the main E3FS descriptor, and approximately 95% of the directories (with all the files inside them) were lost after fsck; most directory inodes have lost their names, so there is certainly no way to get /var/log/messages or anything in /boot.
It was a test system with a linux 2.6.21-rc and no really important files on this partition, but sometimes I wonder whether I should just use plain ext2fs: when I had crashes (every 3-4 years) the damage was never so extensive. Maybe there is less chance of ending up with a wrong journal?

Just my £0.02,
Etienne.

fsck

Posted Mar 20, 2007 11:45 UTC (Tue) by drag (subscriber, #31333) [Link]

Ouch I had a similar thing happen with a /home directory and XFS with a power outage (well my cat falling down and yanking out wires from behind my system).

Lost my ~/ directory. About 30 gigs worth of thousands of files and directories. All turned into numbers. No way to find out what file was what unless I tried to open them up individually. Was not a fun experience.

fsck

Posted Mar 20, 2007 12:13 UTC (Tue) by nix (subscriber, #2304) [Link]

So it was a cat-astrophe?

(sorry)

More seriously, if fsck fails or acts in unuseful ways like that, and you can get an image of the disk pre-fsck, tytso might be interested in it...

fsck

Posted Mar 20, 2007 12:53 UTC (Tue) by etienne_lorrain@yahoo.fr (subscriber, #38022) [Link]

An image of the partition before applying the EXT3 journal would certainly be useful, though it would fill quite a few DVDs; but you always think: well, I've lost two or three files, just begin the recovery, it will not be too bad... well, it seems a bit more, give the "always answer yes" option to fsck... well, it begins to feel bad... well, there was nothing that important on the filesystem... and too late to do it right anyway.
It is at those times that you appreciate a separate partition for valuable files, with a simple partition scheme and no optimised filesystem (a choice between FAT and ext2fs).
What I should have done is run an e3fsck after I had a few crashes with the floppy driver on this PC (an already-solved bug that left interrupts disabled, so I had to power off while in X), but that was 6 to 10 sessions earlier (i.e. a few hours of work followed by shutdown/power off), so it should have been handled by the EXT3 journal recovery.
I still should have forced the check from another distribution to be sure - or the FC6 live CD... too late.

Etienne.

fsck

Posted Mar 20, 2007 14:55 UTC (Tue) by nix (subscriber, #2304) [Link]

If you give e2fsck the -y option, IMHO you deserve everything you get. There's a *reason* it's not the default. (And, yes, when I'm lucky enough to have enough accessible storage and a major filesystem goes south without recent backups, I *do* gzip up the filesystem image and back it up before doing a fsck. Perhaps I could use e2image but I've never dared risk it.)

fsck

Posted Mar 20, 2007 16:02 UTC (Tue) by bronson (subscriber, #4806) [Link]

Unless you're an ext3 filesystem engineer, how are you supposed to know what fixes to make?? If fsck asks my mom, "1377 unreferenced nodes, delete? (Y)" (whatever a typical error looks like; it's been a while), what is she supposed to do?

The only two modes that the average Linux user can run fsck in:
- All Y, which you say is a bad idea.
- All N, in which case there's no point.

Maybe fsck could offer an "all trivial" setting, where it would automatically make fixes that it thinks are unlikely to cause data loss. If a bigger problem is found, fsck could bail out saying, "Serious errors found, back up partition before repairing!"

This is the same problem as Windows users splatting "Yes" every time their OS asks, "Do you want to allow a connection to sdlkh.phishing.org" except that the fsck questions are even less understandable!

fsck

Posted Mar 20, 2007 17:02 UTC (Tue) by nix (subscriber, #2304) [Link]

Your points have merit: it's hard to work out which changes are safe. That's why I tend to e2fsck first with -n, and review the list to see if there are a lot of changes or they look intuitively frightening. If they do, it's image-first time, so I can retry with more n answers if fsck goes wrong.

(The `all trivial' option already exists: it's what you get if you run e2fsck with -a. If e2fsck says you must run fsck manually, that means it has nontrivial fixes to make. In my experience this is rare indeed, even in the presence of repeated power failures or qemu kills while doing rm -r.)

This is yet one more place where a CoW device-mapper layer would be useful: instead of doing huge copies, you could just do the e2fsck in a CoWed temporary image (maybe mounted on a loopback filesystem somewhere :) ) and see if that worked...
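
A toy Python model of that idea, for illustration only: it is an in-memory sketch with hypothetical names (`CowOverlay`, `read_block`, `write_block`, `discard`), not the device-mapper interface. Writes made during a trial repair land in an overlay, the original image is never modified, and throwing the overlay away returns you to the starting point without copying the whole image first.

```python
class CowOverlay:
    """Toy copy-on-write view over a read-only base image."""

    def __init__(self, base, block_size=4096):
        self.base = base              # bytes: the original image, never modified
        self.block_size = block_size
        self.changed = {}             # block number -> replacement block

    def read_block(self, n):
        # Prefer the overlay; fall back to the untouched base image.
        if n in self.changed:
            return self.changed[n]
        start = n * self.block_size
        return self.base[start:start + self.block_size]

    def write_block(self, n, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.changed[n] = bytes(data)  # the base stays untouched

    def discard(self):
        """Throw away the trial run, e.g. after a repair that went badly."""
        self.changed.clear()
```

The cost is proportional to the number of blocks the repair actually changes, not to the size of the filesystem.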

fsck / xfs

Posted Mar 20, 2007 13:24 UTC (Tue) by rfunk (subscriber, #4054) [Link]

When SGI first introduced XFS for Irix, they talked about "no need for fsck, ever", just as
the ZFS proponent above does. Later they were forced to admit that there are times
when filesystems need to be checked or repaired, and they introduced tools for doing so.

Unfortunately, those tools probably aren't as useful when things go wrong as they would
be if the need for them had been anticipated from the start. Like you, I've discovered the
hard way that XFS is very bad for situations when the disk might accidentally lose power
or get disconnected unexpectedly. SGI apparently designed it for a stable server-room
situation.

I've never had a problem with ext3, and am slowly migrating my XFS filesystems to ext3.

fsck / xfs

Posted Mar 22, 2007 11:47 UTC (Thu) by wookey (subscriber, #5501) [Link]

I too have found the hard way that yanking the power on XFS (or just hitting reset at a bad time) is a very bad idea. All the files that had pending writes just end up as the correct length of zeros. When this includes your package database, perl binaries and a load of other libs, this is quite bad.

The xfs_repair tool did do a pretty-good repair job (once I fixed it so it ran! http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=414079) but it did take about 5 hours to do it on a pair of 200GB mirrored drives. Then I got to re-install everything to fix the damage.

About 3 days faff in total. Fair dues though - there was no user-data loss and the system was recoverable, but I've never had this trouble with reiser3 on my laptop or ext3 on other boxes. So, yes, XFS is a really nice filesystem (live resizing, nice and fast) but I'd avoid it unless there is a UPS around.

fsck / xfs - versus ZFS

Posted Apr 17, 2007 14:06 UTC (Tue) by qu1j0t3 (subscriber, #25786) [Link]

It would be wrong to assume ZFS has the same failure modes as XFS. See, for instance: Bill Moore's blog.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds