
RAID 5/6 code merged into Btrfs

From:  Chris Mason <chris.mason-AT-fusionio.com>
To:  linux-btrfs <linux-btrfs-AT-vger.kernel.org>
Subject:  experimental raid5/6 code in git
Date:  Sat, 2 Feb 2013 11:02:12 -0500
Message-ID:  <20130202160212.GB4694@shiny>
Archive-link:  Article

Hi everyone,

I've uploaded an experimental release of the raid5/6 support to git, in
branches named raid56-experimental.  This is based on David Woodhouse's
initial implementation (thanks Dave!).

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git raid56-experimental
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git raid56-experimental

These are working well for me, but I'm sure I've missed at least one or
two problems.  Most importantly, the kernel side of things can have
inconsistent parity if you crash or lose power.  I'm adding new code to
fix that right now; it's the big missing piece.
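
(To make the parity problem concrete with a hypothetical two-data-drive
raid5 stripe: parity starts out as P = D0 xor D1.  If D0 is rewritten to
D0' but power is lost before P is updated to D0' xor D1, the on-disk
parity no longer matches the data; if the drive holding D1 then fails,
rebuilding it as D0' xor P produces garbage.)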

But, I wanted to give everyone the chance to test what I have while I'm
finishing off the last few details.  Also missing:

* Support for scrub repairing bad blocks.  This is not difficult, we
just need to make a way for scrub to lock stripes and rewrite the
whole stripe with proper parity.

* Support for discard.  The discard code needs to discard entire
stripes.

* Progs support for parity rebuild.  Missing drives upset the progs
today, but the kernel does rebuild parity properly.

* Planned support for N-way mirroring (triple mirror raid1) isn't
included yet.

With all those warnings out of the way, how does it work?  The
original plan was to base read/modify/write cycles at high levels in the
filesystem, so that we always gave full stripe writes down to raid56
layers.  But this had a few problems, especially when you start thinking
about converting from one stripe size to another.  It doesn't fit with
the delayed allocation model where we pick physical extents for a given
operation as late as we possibly can.

Instead I'm doing read/modify/write when we map bios down to the
individual drives.  This allows blocks from multiple files to share a
stripe, and it allows us to have metadata blocks smaller than a full
stripe.  That's important if you don't want to spin every disk for each
metadata read.
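
(As a rough illustration of that read/modify/write step, here is a small
user-space sketch; it is not the btrfs code, and every name in it is
invented for the example.  A sub-stripe write first reads back the data
blocks it does not cover, then recomputes parity over the whole stripe;
a full stripe write skips the read entirely.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA     3                 /* data members in the stripe */
#define STRIPE_UNIT (64 * 1024)       /* bytes per member per stripe */

struct stripe {
    bool    covered[NR_DATA];         /* blocks this write supplies */
    uint8_t data[NR_DATA][STRIPE_UNIT];
    uint8_t parity[STRIPE_UNIT];
};

/* Stand-in for reading an old data block back from disk. */
static void read_old_block(struct stripe *s, int d)
{
    memset(s->data[d], 0, STRIPE_UNIT);
}

void submit_stripe(struct stripe *s)
{
    /* Partial stripe: pull in the blocks the write didn't cover
     * (the "read" half of read/modify/write).  A full stripe write
     * has every block covered and skips this loop. */
    for (int d = 0; d < NR_DATA; d++)
        if (!s->covered[d])
            read_old_block(s, d);

    /* Recompute parity over the whole stripe before it goes down to
     * the member drives. */
    memset(s->parity, 0, STRIPE_UNIT);
    for (int d = 0; d < NR_DATA; d++)
        for (size_t i = 0; i < STRIPE_UNIT; i++)
            s->parity[i] ^= s->data[d][i];
}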

This does sound quite a lot like MD raid, and that's because it is.  By
doing the raid inside of Btrfs, we're able to use different raid levels
for metadata vs data, and we're able to force parity rebuilds when crcs
don't match.  Also management operations such as restriping and
adding/removing drives are able to hook into the filesystem
transactions.  Longer term we'll be able to skip reads on blocks that
aren't allocated and make other connections between raid56 and the FS
metadata.

I've spent a long time running different performance numbers, but there
are many benchmarks left to run.  The matrix of different configurations
is fairly large, with btrfs-raid56 vs MD-raid56 vs Btrfs-on-MD-raid56,
and then comparing all the basic workloads.  Before I dive into numbers,
I want to describe a few moving pieces.

Stripe cache -- This avoids read/modify/write cycles with an LRU of
recently written stripes.  Picture a database that does adjacent
synchronous 4K writes (say a log record and a commit block).  We want to
make sure we don't repeat read/modify/writes for the commit block after
writing the log block.

In btrfs the stripe cache changes because we're doing COW.  Hopefully we
are able to collect writes from multiple processes into a full stripe
and do fewer read/modify/write cycles.  But, we still need the cache.
The cache in btrfs defaults to 1024 stripes and can't (yet) be tuned.
In MD it can be tuned up to 32768 stripes.

In the btrfs code, the stripe cache is the director in a state machine
that pulls stripes from initial submission to completion.  It
coordinates merging stripes, parity rebuild and handing off the stripe
lock to the next bio.
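
(A minimal sketch of the LRU idea, again user-space with invented names
rather than the actual btrfs structures: a lookup promotes the stripe to
the front of the list, and inserting into a full cache evicts the least
recently used entry.)

#include <stdint.h>
#include <stdlib.h>

#define CACHE_SIZE 1024               /* the btrfs default noted above */

struct cached_stripe {
    uint64_t logical;                 /* stripe start address */
    struct cached_stripe *prev, *next;
};

struct stripe_cache {
    struct cached_stripe *head, *tail;
    int nr;
};

static void unlink_stripe(struct stripe_cache *c, struct cached_stripe *s)
{
    if (s->prev) s->prev->next = s->next; else c->head = s->next;
    if (s->next) s->next->prev = s->prev; else c->tail = s->prev;
    c->nr--;
}

static void push_front(struct stripe_cache *c, struct cached_stripe *s)
{
    s->prev = NULL;
    s->next = c->head;
    if (c->head) c->head->prev = s; else c->tail = s;
    c->head = s;
    c->nr++;
}

/* Return the cached stripe covering 'logical', promoting it on a hit;
 * a miss means the caller has to do the read/modify/write itself. */
struct cached_stripe *cache_lookup(struct stripe_cache *c, uint64_t logical)
{
    for (struct cached_stripe *s = c->head; s; s = s->next) {
        if (s->logical == logical) {
            unlink_stripe(c, s);
            push_front(c, s);
            return s;
        }
    }
    return NULL;
}

/* Insert a freshly written stripe, evicting the LRU entry if needed. */
void cache_insert(struct stripe_cache *c, struct cached_stripe *s)
{
    if (c->nr >= CACHE_SIZE) {
        struct cached_stripe *victim = c->tail;
        unlink_stripe(c, victim);
        free(victim);
    }
    push_front(c, s);
}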

Plugging -- The on-stack plugging code has a slick way for anyone in the
IO stack to participate in plugging.  Btrfs is using this to collect
partial stripe writes in hopes of merging them into full stripes.  When
the kernel code unplugs, we sort, merge and fire off the IOs.  MD has a
plugging callback as well.
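
(Conceptually, what the plugging hook buys is a window in which partial
writes can be parked and merged before they reach the raid code.  The
sketch below is a user-space illustration with invented names, not the
block layer API: parked writes are sorted at unplug time so that writes
landing in the same stripe can be handed down together, ideally as one
full stripe.)

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct pending_write {
    uint64_t stripe;                  /* which full stripe this hits */
    uint64_t offset;                  /* offset within the stripe */
    uint32_t len;
};

/* Stand-in for handing one stripe's group of writes to the raid code. */
static void submit_group(struct pending_write *grp, size_t nr)
{
    (void)grp;
    (void)nr;
}

static int cmp_pending(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    if (x->stripe != y->stripe)
        return x->stripe < y->stripe ? -1 : 1;
    return x->offset < y->offset ? -1 : (x->offset > y->offset);
}

/* Called at unplug time: sort the parked writes so writes to the same
 * stripe are adjacent, then submit each stripe's group in one go. */
void unplug(struct pending_write *parked, size_t nr)
{
    qsort(parked, nr, sizeof(*parked), cmp_pending);

    for (size_t i = 0; i < nr; ) {
        size_t j = i;
        while (j < nr && parked[j].stripe == parked[i].stripe)
            j++;
        submit_group(&parked[i], j - i);
        i = j;
    }
}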

Parity calculations --  For full stripes, Btrfs does P/Q calculations
at IO submission time without handing off to helper threads.  The code
uses the synchronous xor/memcpy/raid6 lib apis.  For sub-stripe writes,
Btrfs kicks the work off to its own helper threads and uses the same
synchronous apis.  I'm definitely open to trying out the ioat code, but
so far I don't see the P/Q math as a real bottleneck.
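
(For anyone curious what the P/Q math amounts to, here is a user-space
sketch with invented names; the kernel uses the optimized raid6 library
routines instead.  P is a plain xor of the data blocks, and Q is a
Reed-Solomon syndrome over GF(2^8) with the usual 0x11d reduction
polynomial, computed below with Horner's rule.)

#include <stddef.h>
#include <stdint.h>

/* Multiply a GF(2^8) element by the generator (i.e. by 2). */
static uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* data[d] points to member d's block; p and q receive the two parity
 * blocks.  Walking the members from highest to lowest gives
 * Q = sum over d of 2^d * D_d without precomputed power tables. */
void gen_syndrome(int nr_data, size_t len, uint8_t **data,
                  uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t pv = 0, qv = 0;
        for (int d = nr_data - 1; d >= 0; d--) {
            pv ^= data[d][i];
            qv = gf_mul2(qv) ^ data[d][i];
        }
        p[i] = pv;
        q[i] = qv;
    }
}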

Everyone who made it this far gets to see benchmarks!  I've run these on
two different systems.

1) A large HP DL380 with two sockets and 4TB of flash.  The
flash is spread over 4 drives and in a raid0 run it can do 5GB/s
streaming writes.  This machine has the IOAT async raid engine.

2) A smaller single socket box with 4 spindles and 2 fusionio drives.
No raid offload here.  This box can do 2.5GB/s streaming writes.

These are all on 3.7.0 with MD created with -c 64 and --assume-clean.
I upped the MD stripe cache to 32768, but didn't include Shaohua's
patches to parallelize the MD parity calculations.  I'll do those runs
after I have the next round of btrfs changes done.

Let's start with an easy benchmark:

machine #2's flash broken up into 8 logical volumes with raid5
created on top (64K stripe size).  A single dd doing streaming full
stripe writes:

dd if=/dev/zero of=/mnt/oo bs=1344K oflag=direct count=4096
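
(Assuming the usual raid5 layout of seven data members plus one rotating
parity member per stripe with a 64K chunk, a full stripe holds
7 x 64K = 448K of data, so bs=1344K is exactly three full stripes per
dd write.)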

Btrfs -- 604MB/s
MD    -- 162MB/s

My guess is the performance difference here is coming from latencies
related to handing off parity to helpers.  Btrfs is doing everything
inline and MD is handing off.

fs/direct-io.c is sending down partial stripes (one IO per 64 pages),
but our plugging callbacks let us collect them.  Neither MD nor Btrfs is
doing any reads here.

Now for something a little bigger:

machine #1 with all 4 drives configured in raid6.  This one is using fio
to do a streaming aio/dio write of large full stripes.  The numbers
below are from blktrace.  Since we're doing raid6 over 4 drives, half
our IO was for parity.  The actual tput seen by fio is 1/2 of this.

The MD runs are going directly to MD, no filesystem involved.

MD -- 800MB/s very little system time
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-rai...
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-rai...

Btrfs -- 3.8GB/s one CPU mostly pegged
http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-...
http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-...

That one CPU is handling interrupts for the flash.

I spent some time trying to figure out why MD was doing reads in this
run, but I wasn't able to nail it down.

Long story short, I spent a long time tuning for streaming writes on
flash.  MD isn't CPU bound in these runs, and latencytop shows it is
waiting for room in its stripe cache.

Ok, but what about read/modify/write?
Machine #2 with fio doing 32K writes onto raid5

Btrfs -- 380MB/s seen by fio
MD    -- 174MB/s seen by fio

http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-...
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-rai...

For the Btrfs run, I filled the disk with 8 files and then deleted one
of them.  The end result made it impossible for btrfs to ever allocate a
full stripe, even when it was doing COW.  So every 32K write triggered a
read/modify/write cycle.  MD was doing rmw on every IO as well.

It's interesting that MD is doing a 1:1 read/write while btrfs is doing
more reads than writes.  Some of that is metadata required for the IO.

How does Btrfs do at 32K sub stripe writes when the FS is empty?

http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-...

COW lets us collect 32K writes from multiple procs into a full stripe,
so we can avoid the rmw cycle some of the time.  It's faster, but only
lasts while the space is free.

Metadata intensive workloads hit the read/modify/write code much harder,
and are even more latency sensitive than O_DIRECT.  To test this, I used
fs_mark, both on spindles and on flash.

The interesting thing is that on flash, MD was within 15% of the Btrfs
number.  The fs_mark run was actually CPU bound creating new files in
Btrfs, so once we used flash the storage wasn't the bottleneck any more.

Spindles looked a little different.  For these runs I tested btrfs on
top of MD vs btrfs raid5.

http://masoncoding.com/mason/benchmark/btrfs-raid5/btrfs-...
http://masoncoding.com/mason/benchmark/btrfs-raid5/btrfs-...

Creating 12 million files on Btrfs raid5 took 226 seconds, vs 485
seconds on MD.  In general MD is doing more reads for the same
workload.  I don't have a great explanation for this yet but the
Btrfs stripe cache may have a bigger window for merging concurrent IOs
into the same stripe.

Ok, that's enough for now, happy testing everyone.

-chris

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 18:24 UTC (Mon) by butlerm (guest, #13312) [Link]

I have a question. Is there a plan to store write intent information somewhere so that parity information can be rebuilt for partially completed stripe writes after a system crash? MD uses a write intent bitmap for this.

In an ordinary filesystem the journal or intent log would be an excellent place for this, but I understand Btrfs doesn't use one.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 18:57 UTC (Mon) by drag (guest, #31333) [Link]

Wouldn't Btrfs's 'COW' design simply mean that partial writes are just discarded? Barring any foolishness in the hardware, the filesystem should always remain in a consistent state regardless of crashes or whatnot. Maybe just have to do a rollback on a commit or two to get a good state.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:57 UTC (Mon) by masoncl (subscriber, #47138) [Link]

Almost. Since the implementation allows us to share stripes between different extents, you might have a shiny new extent going into the same stripe as an old extent.

In this case you need to protect the parity for the old extent just in case we crash while we're rewriting the parity blocks.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 23:51 UTC (Mon) by butlerm (guest, #13312) [Link]

> Wouldn't Btrfs's 'COW' design simply mean that partial writes are just discarded?

That only works if you always write a full stripe, which is generally not the case. ZFS uses variable stripe sizes to achieve this for data writes, but the minimum stripe size tends to be rather large, depending on how many disks you have in your RAID set.

If you spread every filesystem block across all disks the way ZFS does, random read performance suffers dramatically. Every disk has to participate in every uncached data read. Minimum FS block sizes go up with the number of disks, and so on.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:10 UTC (Mon) by Jonno (subscriber, #49613) [Link]

Due to the btrfs design a write intent bitmap isn't necessary. Checksums make it possible to figure out which drive is at fault without one, you just have to do a scrub after a crash.

Additionally, btrfs already keeps track of the last five transactions it committed, so it should be possible to automatically scrub just those, but I don't know if that is planned, or if the devs have something even smarter in mind.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:59 UTC (Mon) by masoncl (subscriber, #47138) [Link]

It's true that crcs allow us to figure out if the data on the drives is correct. But, if you crash while updating the parity and you lose one of the drives (not unusual in a power failure), you need to be able to rebuild the data from parity.

If the parity isn't consistent with the rest of the stripe, the rebuild isn't possible.

-chris

RAID 5/6 code merged into Btrfs

Posted Feb 6, 2013 15:27 UTC (Wed) by Jonno (subscriber, #49613) [Link]

> If the parity isn't consistent with the rest of the stripe, the rebuild isn't possible.
True, but a write-intent bitmap wouldn't help with that either, as all it does is tell you which drive(s), if any, are out of date and need to be rebuilt; that information won't help if you lost a drive (or two for raid6) and can't rebuild anything.

RAID 5/6 code merged into Btrfs

Posted Feb 6, 2013 18:26 UTC (Wed) by butlerm (guest, #13312) [Link]

The purpose of a write intent bitmap is not to recover a failed drive, it is to recover from a lost write. In the event of a power failure or system crash, one or more of the writes may be lost (or partially completed), leaving the stripe parity in an inconsistent state.

Correct parity (sufficient to recover from a subsequent drive failure) can be trivially regenerated using the contents of the write intent bitmap. The data on the blocks actually being written to may still be incomplete of course, but that doesn't matter for the purpose of protecting the data on other blocks in the same stripe.

If a drive fails and the system crashes at the same time a stripe update is in progress, it is entirely possible of course that unrelated parts of the stripe being updated may become unrecoverable, for lack of consistent parity information. You can see the attraction of the ZFS full stripe minimum block size policy.
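
(A minimal, user-space sketch of that mechanism, with invented names
rather than the MD code: one bit covers a fixed-size region of the
array, the bit is persisted before any write into that region, and
crash recovery only resyncs the regions whose bits are still set.)

#include <stdint.h>

#define REGION_SIZE (1024 * 1024)     /* bytes of array per bit */
#define NR_REGIONS  4096

static uint8_t intent[NR_REGIONS / 8];

static void set_intent(uint64_t offset)
{
    uint64_t region = offset / REGION_SIZE;
    intent[region / 8] |= 1 << (region % 8);
    /* A real implementation flushes the bitmap to stable storage here,
     * before the data and parity writes are allowed to proceed. */
}

static void clear_intent(uint64_t offset)
{
    uint64_t region = offset / REGION_SIZE;
    intent[region / 8] &= ~(1 << (region % 8));
}

/* After a crash: recompute parity only for the regions that were marked. */
static void recover(void (*resync_region)(uint64_t region))
{
    for (uint64_t r = 0; r < NR_REGIONS; r++)
        if (intent[r / 8] & (1 << (r % 8)))
            resync_region(r);
}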

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:55 UTC (Mon) by masoncl (subscriber, #47138) [Link]

An intent log is similar to how I'll end up preventing bad parity after a crash. That's the part I'm still hacking on.

If we're doing a full stripe write that came from a COW operation, we don't need the extra logging because none of the blocks in the stripe are fully allocated until after the IO is complete.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:44 UTC (Mon) by dlang (guest, #313) [Link]

> That's important if you don't want to spin every disk for each
> metadata read.

When you have a raid setup, especially a raid5/6 setup, you very seldom have drives spinning down.

Also, if you don't read the entire stripe, how do you check the parity information? Or do you not do that level of validation and only look at the parity information if you get a read failure?

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 20:06 UTC (Mon) by masoncl (subscriber, #47138) [Link]

I probably should have written seek every disk. Metadata reads tend to be more seek intensive, and if every drive in the raid set needs to be involved to read every metadata block, you end up seek bound over the whole array pretty quickly.

We're only checking parity if we have to rebuild a block. The rebuilds only happen if the crc check fails, or if you get an IO error.

-chris

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:51 UTC (Mon) by daglwn (guest, #65432) [Link]

Is there any sort of tool in the works to transition between MD RAID and Btrfs? I'm thinking about a tool to do in-place upgrade from MD/ext4/whatever to Btrfs.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:56 UTC (Mon) by dlang (guest, #313) [Link]

no, you would need to back up your system and reformat.

The only time you can do an in-place transition from one filesystem to another is when they are different generations of the same filesystem.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 20:19 UTC (Mon) by daglwn (guest, #65432) [Link]

That's what I expected.

Thankfully, I have everything in the RAID array backed up over multiple machines via git. :)

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 21:09 UTC (Mon) by man_ls (guest, #15091) [Link]

Using git for backups must be just for the brave! How do you deal with huge incompressible files such as media data?

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 21:12 UTC (Mon) by dlang (guest, #313) [Link]

if they don't change, git based backups aren't going to be any worse than other backups. The big problem with git comes if the large, un-diffable files change.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 21:38 UTC (Mon) by jaa (subscriber, #14170) [Link]

If anybody is already using git for backups, I highly recommend familiarising yourself with bup. It's a really interesting concept for backing up data (incremental backups, deduplication of data, etc.).

From bup's website: bup is "Highly efficient file backup system based on the git packfile format." https://github.com/bup/bup

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 23:24 UTC (Mon) by daglwn (guest, #65432) [Link]

Yes, I've looked at bup. I wish the author would try to get it integrated as part of the git suite (or at least in contrib/). Having to specify things like GIT_DIR before using bup commands seems dangerous. Plenty of git commands keep their own metadata in .git.

I would love to see "git bup!"

The reason I'm sticking with git is that it's easily available anywhere, and right now I mostly just use it as a more convenient rsync. That is, I simply want to make a bunch of files easily available anywhere. As a bonus, replication provides an incremental backup capability.

RAID 5/6 code merged into Btrfs

Posted Feb 5, 2013 2:18 UTC (Tue) by josh (subscriber, #17465) [Link]

RAID 5/6 code merged into Btrfs

Posted Feb 5, 2013 10:27 UTC (Tue) by daglwn (guest, #65432) [Link]

That's pretty close. Unfortunately:

"But, git-annex also extends git's concept of remotes, with these special types of remotes. These can be used just like any normal remote by git-annex. They cannot be used by other git commands though."

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 23:19 UTC (Mon) by daglwn (guest, #65432) [Link]

Any such files I have don't change once created.

I've looked at bup but haven't yet taken the plunge. I'd rather keep a fully git-compatible format until I am forced to change.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 20:30 UTC (Mon) by zuki (subscriber, #41808) [Link]

btrfs-convert converts between different filesystems :)

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 21:30 UTC (Mon) by safrax (guest, #83688) [Link]

Just ext3 and ext4 though.

RAID 5/6 code merged into Btrfs

Posted Feb 8, 2013 17:10 UTC (Fri) by daniel (guest, #3181) [Link]

Hi Chris,

Do you plan to push your generic raid improvements back to md?

Regards,

Daniel

