
Chris Mason leaving Oracle

From:  Chris Mason <chris.mason-AT-oracle.com>
To:  linux-btrfs <linux-btrfs-AT-vger.kernel.org>, "linux-fsdevel-AT-vger.kernel.org" <linux-fsdevel-AT-vger.kernel.org>
Subject:  Leaving Oracle
Date:  Wed, 6 Jun 2012 21:04:48 -0400
Message-ID:  <20120607010448.GA26531@shiny>
Cc:  chris.mason-AT-fusionio.com

Hello everyone,

Oracle has been a fantastic place to work, and I really appreciate their
support for my projects.  But, I've decided to take a new position at
Fusion-io.  I will start the new job on Monday, June 11.

From a Btrfs point of view, very little will change.  I'll still
maintain Btrfs and will continue all of my Btrfs development in the
open.  Oracle will still use Btrfs in their Oracle Linux products, and
I'll work with all of the distros using Btrfs in production.

Fusion-io really believes in open source, and I'm excited to help
them shape the future of high performance storage.

chris.mason@oracle.com will probably stop working this Friday June 8th.
chris.mason@fusionio.com will be my new email address.

Just let me know if you have any questions.

-chris





Chris Mason leaving Oracle

Posted Jun 7, 2012 14:10 UTC (Thu) by Snooplop (guest, #84997) [Link]

Or maybe Oracle just pulled the plug on Btrfs.
Would be a very understandable move, because Btrfs is still
as unstable and unreliable as it was years ago. It simply
looks like the project is unable to reach a stable fix point.
Instead it keeps diverging.

Chris Mason leaving Oracle

Posted Jun 7, 2012 14:37 UTC (Thu) by masoncl (subscriber, #47138) [Link]

I'll actually disagree here, but regardless, Btrfs is still my prime focus.

Chris Mason leaving Oracle

Posted Jun 7, 2012 15:31 UTC (Thu) by jone (guest, #62596) [Link]

that's good to hear .. I like Flynn's focus here @fusionio and an integrated (and optimized) storage subsystem/volume mgr/filesystem is a good thing(tm) .. any thoughts on leveraging the DFS (Direct FS) work the princeton guys did to tie in closer with the VSL, bypass buffer cache, and reuse the FTL mappings and controller logs?

Regardless - whatever could be done to provide tighter integration from the target to the filesystem (or block object model) might be well received

Chris Mason leaving Oracle

Posted Jun 7, 2012 14:49 UTC (Thu) by beagnach (guest, #32987) [Link]

Snooplop - Troll

Chris Mason leaving Oracle

Posted Jun 7, 2012 15:37 UTC (Thu) by sciurus (subscriber, #58832) [Link]

That seems unlikely, considering "Btrfs ready for production in new Oracle Linux kernel".

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:12 UTC (Thu) by littlesandra88 (guest, #64017) [Link]

Does Oracle have other btrfs developers?

If not, then they can't really offer enterprise btrfs support.

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:33 UTC (Thu) by kreijack (guest, #43513) [Link]

I don't know if Oracle has other btrfs developers, but I don't think that Oracle (or Red Hat or SUSE...) has one or more developers for every piece of code in Linux. This doesn't prevent them from providing enterprise support for almost all Linux functions.

Chris Mason leaving Oracle

Posted Jun 17, 2012 6:36 UTC (Sun) by ceplm (subscriber, #41334) [Link]

Certainly it was Red Hat's policy to have a key developer for every key component of RHEL, and certainly filesystem qualifies as a key component.

Chris Mason leaving Oracle

Posted Jun 7, 2012 20:37 UTC (Thu) by masoncl (subscriber, #47138) [Link]

Oracle will definitely continue having developers work on Btrfs

-chris

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:09 UTC (Thu) by dowdle (subscriber, #659) [Link]

My (limited) understanding of filesystems is that new ones often take years... and those with a lot of complexity (like btrfs) take even longer. Given its age, if btrfs was already fully production quality and widely deployed, it would be unusual. How long did it take Sun to make ZFS start to finish?

ZFS Stability

Posted Jun 7, 2012 20:47 UTC (Thu) by clugstj (subscriber, #4020) [Link]

The last time I used it (last year) - it wasn't finished!

Chris Mason leaving Oracle

Posted Jun 8, 2012 7:03 UTC (Fri) by skx (subscriber, #14652) [Link]

Pretty much this, not to mention that even when it is declared stable you'll find sysadmins like myself will avoid it for a good few years "just in case".

I still use ext3 on the majority of systems, and it has only been within the past 2-3 years that I've been willing to tolerate the use of xfs. When btrfs is declared ready - and don't get me wrong I know people use it now - I'll avoid it for a long time out of a combination of paranoia and conservatism.

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:29 UTC (Thu) by kreijack (guest, #43513) [Link]

First, my best wishes to Chris for his new job.

> [...] because Btrfs is still
> as unstable and unreliable as it was years ago.[...]

I strongly disagree. I have been a btrfs user (it is my root filesystem) for 2-3 years. The only problem I have encountered with btrfs was caused by an early bug involving a hard link between two subvolumes. Apart from that, it has been rock solid for me.

It is a young filesystem with not-so-good performance in some corner cases (like an upgrade with dpkg), but it is far from "unstable".
Regarding "unreliable", btrfs is one of the few filesystems with checksums, so I don't understand how it is possible to call it "unreliable".

G.Baroncelli
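
To illustrate the checksum point: a filesystem that stores a checksum next to each block can detect silent corruption on read instead of handing back bad data. The sketch below only illustrates the idea and is not btrfs code. It uses Python's zlib.crc32 as a stand-in (btrfs itself defaults to CRC32C), and the 4096-byte block size and the helper names are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def write_blocks(data: bytes) -> list[tuple[bytes, int]]:
    """Split data into blocks and store a checksum next to each one."""
    blocks = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        blocks.append((block, zlib.crc32(block)))
    return blocks


def read_blocks(blocks: list[tuple[bytes, int]]) -> bytes:
    """Return the data, raising IOError if any block fails verification."""
    out = bytearray()
    for index, (block, stored) in enumerate(blocks):
        if zlib.crc32(block) != stored:
            raise IOError(f"checksum mismatch in block {index}")
        out += block
    return bytes(out)
```

A filesystem without such checksums would return the damaged block unnoticed; with them, the damage is at least reported, and it can be repaired if a redundant copy exists.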

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:44 UTC (Thu) by Snooplop (guest, #84997) [Link]

By unreliable I mean that your Btrfs partitions will sooner
or later encounter problems. At which point you must start from
zero again, because the long promised fsck tool still isn't released.

Apropos "not so good performance":
When was the last time you ran "filefrag" on your spam database or
your qemu images? Please try it, it may be enlightening...

Chris Mason leaving Oracle

Posted Jun 7, 2012 19:10 UTC (Thu) by SEJeff (subscriber, #51588) [Link]

>> By unreliable I mean that your Btrfs partitions will sooner
>> or later encounter problems. At which point you must start from
>> zero again, because the long promised fsck tool still isn't released.

However, there is a tool[1] for getting data out of a corrupt btrfs filesystem, which I think Josef Bacik wrote, and it seems to work perfectly well in lieu of a stable btrfs fsck tool.

>> Apropos "not so good performance":
>> When was the last time you ran "filefrag" on your spam database or
>> your qemu images? Please try it, it may be enlightening...

Compared to other Linux filesystems, I believe btrfs is one of the only if not the only one that does online filesystem defrag. As others have mentioned, I think you're just trolling. This isn't /., please don't troll.

[1] http://kernelnewbies.org/Linux_3.4#head-37f67ea2474e9e2aa...
[2] https://btrfs.wiki.kernel.org/index.php/Main_Page#Features

Chris Mason leaving Oracle

Posted Jun 9, 2012 14:44 UTC (Sat) by janfrode (subscriber, #244) [Link]

>> Compared to other Linux filesystems, I believe btrfs is one of the only if not the only one that does online filesystem defrag.

XFS also has an online defrag utility. I believe it was a default cronjob on Irix to do a weekly run of xfs_fsr.

http://linux.die.net/man/8/xfs_fsr

Chris Mason leaving Oracle

Posted Jun 23, 2012 0:32 UTC (Sat) by makomk (guest, #51493) [Link]

Online defrag support for ext4 got merged into the kernel a few releases ago actually. It uses a similar approach to the XFS defragmentation tool - you allocate a scratch file then atomically copy the data over to the scratch file and replace its extents with the new extents from the scratch file. I don't remember seeing much fanfare when it was merged though.
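
The scratch-file approach described above can be modelled in a few lines. This is a toy in-memory model, not the kernel interface: the "disk" is a plain list of blocks, None marks a free block, and the function name is hypothetical. The real implementation does the extent swap atomically inside the kernel.

```python
def defragment(disk: list, extents: list[int]) -> list[int]:
    """Copy a file's blocks into a contiguous free run and return the new extents.

    disk is a list of block contents (None means free); extents lists the
    file's physical block numbers in logical order.
    """
    need = len(extents)
    # 1. "Allocate a scratch file": find a contiguous run of free blocks.
    start = next(
        (i for i in range(len(disk) - need + 1)
         if all(block is None for block in disk[i:i + need])),
        None,
    )
    if start is None:
        return extents  # no contiguous space; leave the file as it is
    scratch = list(range(start, start + need))
    # 2. Copy the data into the scratch blocks.
    for src, dst in zip(extents, scratch):
        disk[dst] = disk[src]
    # 3. Swap extents: the file now points at the scratch blocks, and the
    #    old, scattered blocks are released.
    for src in extents:
        disk[src] = None
    return scratch
```

For example, a three-block file scattered over blocks 0, 2 and 4 of an eight-block disk whose last three blocks are free ends up in blocks 5, 6 and 7.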

Chris Mason leaving Oracle

Posted Jun 7, 2012 21:20 UTC (Thu) by kreijack (guest, #43513) [Link]

> By unreliable I mean that your Btrfs partitions will sooner
> or later encounter problems. At which point you must start from
> zero again, because the long promised fsck tool still isn't released.

Due to a hardware error, I had problems with my BTRFS partitions. But except for the files damaged by the hardware error, I recovered all the other data. I used a simple 'cp -Rfv'.

> Apropos "not so good performance":
> When was the last time you ran "filefrag" on your spam database or
> your qemu images? Please try it, it may be enlightening...

Even if BTRFS is not the fastest file system, I am quite happy with its performance and really appreciate its features (checksumming, supporting *huge* files/filesystems, scrubbing, snapshotting, online defrag/growing/shrinking...). If you don't like it... don't use it :-)

btrfs performance

Posted Jun 7, 2012 21:22 UTC (Thu) by geuder (subscriber, #62854) [Link]

> with not-so-good performance in some corner cases (like an upgrade
> with dpkg), but it is far from "unstable".

Can you elaborate what is so special about dpkg upgrade?

I have a small demo system on a small cheap USB stick. Of course I wouldn't expect to use it for any real work, but hey, on a laptop with 4 GB RAM nearly the whole "disk" can be cached, so the demo does indeed work well.

However, I have noticed that a dpkg upgrade seems extremely slow in comparison to everything else. Like 90 minutes for 40 updates or something like that. The rootfs is btrfs with zlib compression.

Before complaining anywhere I thought I would compare with ext2/ext3/ext4. But now that you mention it, I dare to ask without hard numbers.

Google brought up bug reports about slow fsync from 2010. Is that still an issue with a current 3.2 kernel?

I think MeeGo used btrfs as rootfs and they typically would use flash-based mass memory. But they used rpm. Would that make a difference???

btrfs performance

Posted Jun 7, 2012 22:47 UTC (Thu) by alankila (subscriber, #47141) [Link]

fsync before every rename -- I guess to work around the ext4 issue of getting empty files after rename. On btrfs, the fsync is more costly because -- I guess -- it forces updates in the btree all the way to the root node.

I have seen updates take hours because of this on a very slow class 2 SD card which hosted a linux install I made recently. I use the "eatmydata" package to disable fsync capability from dpkg, after which it performs acceptably.

dpkg also supports the --force-unsafe-io option, but I don't think this one disables the fsync.
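
The fsync-before-rename pattern discussed above looks roughly like this. It is a minimal sketch of the general technique, not dpkg's actual code; the function name and the temporary-file suffix are made up for the example.

```python
import os


def atomic_replace(path: str, data: bytes) -> None:
    """Replace path with data so that a crash leaves the old or the new file."""
    tmp = path + ".tmp-new"  # hypothetical temporary name
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        # Force the new contents to stable storage before the rename, so
        # the name never points at an empty or partial file after a crash.
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp, path)  # atomic rename over the old file on POSIX
```

The os.fsync() call is the expensive step in this thread: a package upgrade repeats the sequence for a very large number of files.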

btrfs performance

Posted Jun 8, 2012 5:58 UTC (Fri) by kreijack (guest, #43513) [Link]

> Can you elaborate what is so special about dpkg upgrade?

By dpkg upgrade I mean an upgrade of a lot of packages (like apt-get dist-upgrade).

DPKG issues a flush very often, and BTRFS is very sensitive to this workload. Because BTRFS writes a lot of data, it needs to group the writes to be efficient. But this is not possible because each flush has to be performed when it arrives.

I solved this issue by avoiding the flush() during a DPKG upgrade. To keep consistency, I protect the *full upgrade process* with a snapshot taken before the upgrade. If something goes wrong I can recover from this snapshot; otherwise I issue a final flush() and remove the snapshot.

Now my upgrade performance is similar to that of the ext* filesystems.

See this bug
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/601299
or this link of mine
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/...
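
The snapshot-protected upgrade described above could be scripted roughly as follows. This is a sketch under assumptions, not the commenter's actual script: the subvolume and snapshot paths are hypothetical, it assumes the eatmydata wrapper mentioned elsewhere in this thread is installed to suppress fsync, and it would have to run as root on a btrfs root filesystem.

```python
import subprocess

ROOT = "/"                           # assumed btrfs subvolume to protect
SNAPSHOT = "/.pre-upgrade-snapshot"  # hypothetical snapshot location


def upgrade_commands(root: str = ROOT, snapshot: str = SNAPSHOT) -> list[list[str]]:
    """Return the commands for a snapshot-protected upgrade, in order."""
    return [
        ["btrfs", "subvolume", "snapshot", root, snapshot],  # safety net
        ["eatmydata", "apt-get", "-y", "dist-upgrade"],      # no fsync
        ["sync"],                                            # one final flush
        ["btrfs", "subvolume", "delete", snapshot],          # drop safety net
    ]


def run_upgrade() -> None:
    """Run each step, stopping at the first failure so the snapshot survives."""
    for cmd in upgrade_commands():
        subprocess.run(cmd, check=True)
```

Because check=True raises on the first failing command, a broken upgrade never reaches the delete step, and the snapshot stays available for recovery.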

btrfs performance

Posted Jun 8, 2012 23:21 UTC (Fri) by engla (guest, #47454) [Link]

FYI, btrfs in 3.5 includes changes focused on making dpkg's use of fsync much faster: https://lkml.org/lkml/2012/6/1/160 (changes by Josef).

btrfs performance

Posted Jun 8, 2012 9:32 UTC (Fri) by geuder (subscriber, #62854) [Link]

Thanks alankila and kreijack for your detailed answers.

Now I vaguely remember having read about the issue and eatmydata quite a while back when I myself didn't use btrfs anywhere yet. My awareness got swapped out in the meantime. Guess I will try eatmydata and hope it won't do it...

So I would be really curious how MeeGo handled this. Is rpm inherently less cautious than dpkg, not issuing that many fsyncs? Or did MeeGo use some kind of eatmydata "just take the risk" trick? Or were updates on MeeGo really that slow??? I haven't used it that much, but I don't remember having observed such a problem.

Chris Mason leaving Oracle

Posted Jun 7, 2012 19:43 UTC (Thu) by slashdot (guest, #22014) [Link]

Is it really unreliable?

For curiosity, when was the last data corrupting bug discovered and fixed?

Oracle has absolutely not pulled the plug on btrfs

Posted Jun 8, 2012 3:36 UTC (Fri) by jamesmorris (subscriber, #82698) [Link]

We are in fact actively hiring btrfs developers to work on mainline development.

Chris Mason leaving Oracle

Posted Jun 7, 2012 15:57 UTC (Thu) by littlesandra88 (guest, #64017) [Link]

I wonder how far Larry Ellison was able to throw the chair.

It is just one of those things we will never know =(

Chris Mason leaving Oracle

Posted Jun 7, 2012 16:30 UTC (Thu) by SEJeff (subscriber, #51588) [Link]

I also wonder if this will cause patent aggression from Oracle. Seeing as ZFS has a ton of patents on things very similar to what Btrfs does, will Oracle try to be an aggressor (as Larry is so good at being) against those who carry on the torch?

With them, you really never know sometimes.

Chris Mason leaving Oracle

Posted Jun 7, 2012 17:45 UTC (Thu) by spaetz (subscriber, #32870) [Link]

> I also wonder if this will cause patent aggression from Oracle.

But with btrfs being in Linux and Oracle being a member of the OIN, how could they exert patent aggression over code they have themselves contributed?

Mmh, nah, better don't answer.

Chris Mason leaving Oracle

Posted Jun 7, 2012 18:06 UTC (Thu) by dowdle (subscriber, #659) [Link]

Being that that is exactly what SCO did... I'm guessing you were being ironical / sarcastic... but not blatantly so. I like it.

Chris Mason leaving Oracle

Posted Jun 7, 2012 19:49 UTC (Thu) by slashdot (guest, #22014) [Link]

If GPLv2 gives an implicit patent license, then I guess both Chris Mason's work on btrfs as an Oracle employee and their releases of Unbreakable Linux would have resulted in the patents being licensed to btrfs.

Maybe we'll get a court case testing that though, since Oracle seems fond of trying their luck in court.

Chris Mason leaving Oracle

Posted Jun 10, 2012 18:38 UTC (Sun) by jospoortvliet (subscriber, #33164) [Link]

I wonder if they're still so eager after their recent encounter with Judge Alsup and his opinions ;-)

And Judge Richard Posner's recent statements don't instill them with confidence either, I bet.

Chris Mason leaving Oracle

Posted Jun 7, 2012 22:55 UTC (Thu) by robert_s (subscriber, #42402) [Link]

"Fusion-io really believes in open source"

Ouch. Was that a sideways swipe at Oracle?

Chris Mason leaving Oracle

Posted Jun 8, 2012 4:50 UTC (Fri) by masoncl (subscriber, #47138) [Link]

Sorry if it read that way, definitely not a swipe at Oracle.

Chris Mason leaving Oracle

Posted Jun 8, 2012 8:27 UTC (Fri) by nhippi (subscriber, #34640) [Link]

Hopefully this will mean native NAND filesystems for high-end SSDs like fusion-io pci-e cards.

It is so backwards that we have SSDs emulating a block device while the filesystem on top tries to guess how the SSD controller is doing wear-leveling and write distribution, and how to optimize writes and reads without knowing the eraseblock sizes etc...

Chris Mason leaving Oracle

Posted Jun 8, 2012 14:19 UTC (Fri) by axboe (subscriber, #904) [Link]

As mentioned higher up, Fusion is working on an open file system that leverages the same mapping structures for the fs and flash translation layer.

Oh, and welcome Chris :-)

Chris Mason leaving Oracle

Posted Jun 8, 2012 23:01 UTC (Fri) by masoncl (subscriber, #47138) [Link]

Finding new ways to use the flash is going to be my favorite part of the new job. There is a lot of overlap in the mapping information we're storing between the ftl and the FS, and this is a great area for research.

Chris Mason leaving Oracle

Posted Jun 8, 2012 17:26 UTC (Fri) by msnitzer (subscriber, #57232) [Link]

It may seem backwards but emulating a block interface really enabled Fusion-io SSDs to become widely used in the market. But now they can more easily take steps to introduce new interfaces (aka lock-in) that further optimize access.

Chris Mason leaving Oracle

Posted Jun 9, 2012 0:36 UTC (Sat) by bronson (subscriber, #4806) [Link]

It's rather hard to lock anybody in when your code is merged to Linus's tree.

Chris Mason leaving Oracle

Posted Jun 10, 2012 16:49 UTC (Sun) by msnitzer (subscriber, #57232) [Link]

Sure, btrfs is in mainline. But I was talking about all the other new interfaces Fusion-io is working on.

Chris Mason leaving Oracle

Posted Jun 8, 2012 22:02 UTC (Fri) by daniel (subscriber, #3181) [Link]

"It is so backwards that we have SSDs emulating a block device while the filesystem on top tries to guess how the SSD controller is doing wear-leveling and write distribution, and how to optimize writes and reads without knowing the eraseblock sizes etc..."

Why is that backwards while guessing where the track boundaries are in a filesystem optimized for rotating media is not?

Chris Mason leaving Oracle

Posted Jun 9, 2012 9:17 UTC (Sat) by alankila (subscriber, #47141) [Link]

To be fair, I don't think anybody has been trying to guess track boundaries for a decade. It's been well known that the number of sectors per track varies because of constant data density and larger surface area at the outer rim of the disc.

Because somebody has to do the write equalization on today's flash, I in fact prefer it to be hardware because there's less chance that changes in software get to break it. If it's all software, you presumably have some kind of data partition exposed over especially rewrite-durable media that tracks the info about flash block write cycles per erase cell, and which must never be overwritten during disk formats for instance. (If not, then I guess you use the same flash chips for wear-leveling information as the filesystem itself, which sounds like another kind of headache. Not sure how these things work.)

It's interesting that modern android phones appear to be using ext4 over hardware remapping layers too. I guess NAND-level chip access sounds better in theory than in practice.

Chris Mason leaving Oracle

Posted Jun 9, 2012 13:49 UTC (Sat) by drag (subscriber, #31333) [Link]

> It's interesting that modern android phones appear to be using ext4 over hardware remapping layers too

File systems for block devices are a proven and very mature technology. FTL firmware and techniques are also widely known and proven, since they have been a basic requirement for generations of consumer flash devices in order to be compatible with Windows.

So I am guessing that phone manufacturers, which are mostly integrators and not developers, are more interested in getting products out the door rather than working with chipset manufacturers to develop interfaces for theoretical large file system design.

So while I doubt that having multiple layers of abstraction between the kernel and the storage device is optimal performance- and reliability-wise, it is certainly the most cost-effective approach for today's consumer electronics.

:)

Chris Mason leaving Oracle

Posted Jun 10, 2012 17:46 UTC (Sun) by linusw (subscriber, #40300) [Link]

> So I am guessing that phone manufacturers, which are mostly integrators and not developers, are more interested in getting products out the door rather than working with chipset manufacturers to develop interfaces for theoretical large file system design.

Oh no we love file systems, but we need to pool resources. Arnd Bergmann is maintaining this flash card survey (eMMC and SD alike):
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/Fla...

From this we are discussing alterations needed to the file system, VFS etc.

In Linaro we had Samsung, SanDisk and Micron visiting us to discuss these issues at the latest Linaro connect meeting in Hong Kong.

Chris Mason leaving Oracle

Posted Jun 11, 2012 8:50 UTC (Mon) by etienne (subscriber, #25256) [Link]

> trying to guess track boundaries for a decade

On a rotating hard disk with two physical heads, there should be two areas that are accessible within a very short time (< 1 ms), while it takes long (> 6 ms) to go from the beginning of one of those areas to its end.
I didn't measure it myself, but some optimisation could probably be done there, even if it means some initial run-time testing to find those areas, and a few failed optimisations when a sector is reallocated.
Maybe that "striping" is already done in hardware, I do not know.

Chris Mason leaving Oracle

Posted Jun 11, 2012 11:07 UTC (Mon) by ttonino (subscriber, #4073) [Link]

When one head is tracking correctly, the other most likely is not, with thermal expansion and what not.

Tracking is highly dynamic. Even writing and reading with the same head needs slightly offset tracking. It is probably better to use just a single head, as the seek behaviour can be more predictable.

Also, sequential throughput is not the limiting factor. Seek time is.

Chris Mason leaving Oracle

Posted Jun 11, 2012 11:59 UTC (Mon) by etienne (subscriber, #25256) [Link]

Yes, seek time is the limiting factor, at least by a factor of 6, so there are cases where it is better to read with another head that is already nearby, instead of seeking the current head to the other side of the disk.
For instance, in a read-mostly situation, if you have files organized in increasing LBA order on one head and the same files (RAID 1) in decreasing LBA order on the second head, you should be able to access all files with half the disk seek time.
Or find a clever way to write the data of the file on one head and its meta-data (file descriptor, inode table, journal) on the other head at approximately the same physical location.
On the other hand, if striping is done by the disk firmware, disk partitions should be aligned to the stripe size.

Chris Mason leaving Oracle

Posted Jun 14, 2012 6:35 UTC (Thu) by nhippi (subscriber, #34640) [Link]

> It's interesting that modern android phones appear to be using ext4 over hardware remapping layers too. I guess NAND-level chip access sounds better in theory than in practice.

It is more a market thing. 32GB eMMC chips (without raw access) are easily available on the market - the interface is standard and shares infrastructure with SD card manufacturing. Meanwhile there is no real standard for a "raw" NAND interface, so finding a NAND chip that is both big and compatible with your SoC is tricky and often more expensive.

Chris Mason leaving Oracle

Posted Jun 15, 2012 11:37 UTC (Fri) by wookey (subscriber, #5501) [Link]

SSD/SDs don't emulate block devices because it's the most efficient interface - they do it because it's the same as the disk drive stuff, so compatibility is automatic (and it allows huge size changes to occur without interface changes, which plagued early flash-device use (e.g. Smartmedia, xD)).

We know for a fact that the write distribution and wear levelling inside SD cards is deeply crappy for many of our use-cases, and that the best thing to do is entirely different for a server ext4 rootfs and for FAT-in-a-camera storing JPEGs. SD cards are optimised for the latter, and even if they weren't we wouldn't know anything about it.

So it would be really nice to get a lower-level interface to the devices defined so that we could get device characteristics and do raw access, making the data-layout and garbage-collection decisions in the filesystem where we have a lot more cache and CPU smarts and knowledge of the OS usage pattern and knobs to twiddle.

It's taken years to get this message over to the flash manufacturers but I think it's getting through now thanks to them sending the right people to Ubuntu and Linaro events, so there is hope of more efficient use of flash in SD and SSD in the future.

Chris Mason leaving Oracle

Posted Jun 8, 2012 17:49 UTC (Fri) by simosx (subscriber, #24338) [Link]

btrfs will be huge, and it's good that a new company will have developers that will focus 100% on the filesystem.

It's all good news.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds