
The 2006 Linux File Systems Workshop

July 5, 2006

This article was contributed by Valerie Henson

The Linux file systems community met in Portland in June 2006 to discuss the next 5 years of file system development in Linux. Organized by Val Henson, Zach Brown, and Arjan van de Ven, and sponsored by Intel, Google, and Oracle, the Linux File Systems Workshop brought together thirteen Linux file systems developers and experts to share data and brainstorm for three days. Our goal was to discuss the direction of Linux file systems development during the next 5 years, with a focus on disruptive technologies rather than incremental improvements. The goal was not to design one new file system to rule them all, but to come up with several useful new file system architecture ideas (which may or may not reuse existing file system code). To stay focused, we explicitly ruled out discussion of the design of distributed or clustered file systems, except for how they affect local file system design. We came out of the workshop with broad agreement on the problems facing Linux file systems, several exciting file system architecture ideas, and a commitment to working together on the next generation of Linux file systems.

The Problem

Why do we need a Linux file systems workshop, when all seems well in Linux file systems land? Disks purr gently along, larger and fatter than ever before, but still essentially the same. I/O errors are an endangered species, more rumor than fact, and easily corrected with a simple fsck. The "df" command returns a comforting 50% free on most of your file systems. You chuckle gently as you read old file system man pages with directions for tuning inode/block ratios. Sure, that 32-bit file system size limit is looming somewhere over the horizon, but a quick patch to change the size of your block pointers is all you need and you'll be back in business again. After all, file systems are a solved problem, right? Right?

If computer hardware never changed, we kernel developers would have nothing better to do than argue about the optimal scheduling algorithm and flame each others' coding style. Unfortunately, hardware has this terrible habit of changing frequently, drastically, and worst of all, exponentially. File systems are especially vulnerable to changes in hardware because of their long-lived nature. Much of operating systems software can be changed at will given a simple system reboot. But file systems - and their on-disk data layouts - live on and on.

What has changed in hardware that affects file systems? Let's start with some simple, unavoidable facts about the way disks are evolving. Everyone knows that disk capacity is growing exponentially, doubling every 9-18 months. But what about disk bandwidth and seek time? At the last Storage Networking World conference, Seagate presented some details of their hard disk road map for the next 7 years (see page 16 of the slides [PDF]). Their predictions for 3.5 inch hard disks are summarized in the following table.

Parameter             2006    2009    2013    Improvement
Capacity (GB)          500    2000    8000        16x
Bandwidth (Mb/s)      1000    2000    5000         5x
Read seek time (ms)      8     7.2     6.5        1.2x

In summary, over the next 7 years, disk capacity will increase by 16 times, while disk bandwidth will increase only 5 times, and seek time will barely budge! Today it takes a theoretical minimum 4,000 seconds, or about 1 hour to read an entire disk sequentially (in reality, it's longer due to a variety of factors). In 2013, it will take a minimum of 12,800 seconds, or about 3.5 hours, to read an entire disk - an increase of 3 times. Random I/O workloads are even worse, since seek times are nearly flat. A workload that reads, e.g., 10% of the disk non-sequentially will take much longer on our 8TB 2013-era disk than it did on our 500GB 2006-era disk.
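
To make the arithmetic behind those read times explicit, here is a minimal Python sketch, purely illustrative, that derives them from the table's capacity and bandwidth figures (it ignores seeks and every other real-world overhead):

    def full_read_seconds(capacity_gb, bandwidth_mbit_s):
        bits = capacity_gb * 8 * 1000**3           # capacity in bits (decimal GB)
        return bits / (bandwidth_mbit_s * 10**6)   # bandwidth in bits per second

    print(full_read_seconds(500, 1000))    # 2006 disk: 4000 seconds, about 1.1 hours
    print(full_read_seconds(8000, 5000))   # 2013 disk: 12800 seconds, about 3.5 hours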

Another interesting change in hardware is the rate of increase in capacity versus the rate of reduction in I/O errors per bit. In order for a disk to have the same overall number of I/O errors, every time capacity doubles, the per-bit I/O error rate must halve. Needless to say, this isn't happening, so I/O errors are actually more common even though the per-bit error rate has dropped.
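
A toy calculation shows the scaling argument; the per-bit error rates below are invented purely for illustration, and only their ratio matters:

    def expected_errors_per_full_read(capacity_gb, per_bit_error_rate):
        # expected number of unreadable bits when reading the whole disk once
        return capacity_gb * 8 * 1000**3 * per_bit_error_rate

    print(expected_errors_per_full_read(500, 1e-14))     # older disk: ~0.04
    print(expected_errors_per_full_read(8000, 0.5e-14))  # 16x the capacity, rate only halved: ~0.32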

These are only a few of the changes in disk hardware that will occur over the next decade. What do these changes mean for file systems? First, fsck will take a lot longer in absolute terms, because disk capacity is larger, but disk bandwidth is relatively smaller, and seek time is relatively much larger. Fsck on multi-terabyte file systems today can easily take 2 days, and in the future it will take even longer! Second, the increasing number of I/O errors means that fsck is going to happen a lot more often - and journaling won't help. Existing file systems simply weren't designed with this kind of I/O error frequency in mind.

These problems aren't theoretical - they are already affecting systems that you care about. Recently, the main server for Linux kernel source, kernel.org, suffered file system corruption from a failure at the RAID level. It took over a week for fsck to repair the (ext3) file system, when it would have taken far less time to restore from backup.

The workshop

Now that the stage is set, we'll move on to what happened at the 2006 Workshop. The coverage has been split into the following pages:

  • Day 1, devoted mostly to understanding the current state of the art: file system repair, disk errors, lessons learned from existing file systems, and major file system architectures.

  • Days 2 and 3, concerned with the way forward: interesting ideas, near-term needs, and development plans.


The 2006 Linux File Systems Workshop

Posted Jul 5, 2006 23:33 UTC (Wed) by jonabbey (subscriber, #2736) [Link]

Wonderful article.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 0:00 UTC (Thu) by DYN_DaTa (guest, #34072) [Link]

Yes, a nice one :).

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 0:11 UTC (Thu) by cventers (guest, #31465) [Link]

Agreed! Now if we can just turn all the necessary gears to get reiser4
merged :)

File Systems Workshop Attendees

Posted Jul 6, 2006 5:43 UTC (Thu) by Felix.Braun (subscriber, #3032) [Link]

I was kind of wondering why Mr Reiser didn't attend. I have the impression he is quite opinionated when it comes to file system development. Could anybody comment?

File Systems Workshop Attendees

Posted Jul 6, 2006 21:02 UTC (Thu) by khim (subscriber, #9252) [Link]

He is quite opinionated when it comes to File System development.

That's the reason. Mr Reiser is a bright guy and he understands filesystems like no one else - but he is hard to budge. Quite often he's right and everyone else is wrong - but not always. And it takes a herculean effort (typically a month or so) to convince him when he's wrong. This means his attendance is mostly useless: better to read his papers - and agree or disagree.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 9:57 UTC (Thu) by nix (subscriber, #2304) [Link]

Does reiser4 even try to solve any of the nasty problems Val mentions (the fsck crunch, the need for internal integrity checks, et al)?

As far as I can tell it doesn't. It used to have nice support for small files, but that got turned off because it impacted benchmarks and I don't know if it was ever turned on again.

reiser4

Posted Jul 7, 2006 1:48 UTC (Fri) by xoddam (subscriber, #2322) [Link]

One thing mentioned in the article which reiser4 does implement (though
Val doesn't include it in her list of examples) is copy-on-write atomic
storage hierarchy updates -- depending on settings, changing an inode can
result in allocation of new copies of the block containing the inode and,
recursively, every parent block which references the change, right up to
the master block. The old versions are only freed once the root has been
'committed' to its canonical location.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 1:34 UTC (Thu) by mattdm (subscriber, #18) [Link]

Val, thanks for another excellent LWN article. Your contributions make me even more thankful for our subscription.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 2:51 UTC (Thu) by nivola (subscriber, #5662) [Link]

Excellent article, a perfect example of why I subscribe to LWN. Free Software is fortunate in having so many technically capable people working on the code, but those that combine technical ability with a talent for written expression are more rare. From this and her other recent articles it seems Val has joined Greg K-H and our esteemed editor in that list.

The 2006 Linux File Systems Workshop

Posted Jul 10, 2006 0:25 UTC (Mon) by Lovechild (guest, #3592) [Link]

I completely agree, I'm a generally poor guy but I can readily afford the small monthly fee for LWN as it's more than worth it in terms of quality of information. It always brightens my thursday to open up LWN as the first thing and read the kernel section.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 3:43 UTC (Thu) by AdHoc (subscriber, #1115) [Link]

Val is my new favorite LWN contributor (don't worry Jon, I still love ya, man). I'm trying to narrow things down for my master's thesis and she's given me all kinds of directions for further research in specific areas. She's also a very good writer; her articles have been easy to read without being shallow.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 11:13 UTC (Thu) by wingo (subscriber, #26929) [Link]

Agreed! I just dropped into her web site and lost a couple hours. Wonderful stuff.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 3:44 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link]

Interesting set of slides from Seagate, my favourite is slide 11 - "What if automobiles had improved as much?". :-)

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 5:44 UTC (Thu) by Los__D (subscriber, #15263) [Link]

"Another interesting change in hardware is the rate of increase in capacity versus the rate of reduction in I/O errors per bit. In order for a disk to have the same overall number of I/O errors, every time capacity doubles, the per-bit I/O error rate must halve. Needless to say, this isn't happening, so I/O errors are actually more common even though the per-bit error rate has dropped."

I don't understand how these are connected... Isn't an I/O error an artifact of reading or writing, and not a defect in the hardware? If it is, then it hardly matters, as you only care about I/O errors per second, or per bit read/written, not about I/O errors per bit of the total disk size.

Or maybe I misunderstood something?

Dennis

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 7:19 UTC (Thu) by arjan (subscriber, #36785) [Link]

You can express the error rate as a probability like this:

what is the probability that I can read this sector X days in the future

That number seems to be more or less constant (compared to the capacity increase), but with the number of sectors increasing, the

what is the probability that I get an IO error in ANY sector in X days

goes up....
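
The same point as a small Python sketch; the per-sector probability and the sector counts are made-up numbers, chosen only to show how the whole-disk probability grows with capacity:

    p_sector_bad = 1e-9                    # assumed chance a given sector is unreadable after X days
    for sectors in (10**9, 16 * 10**9):    # roughly a 2006-sized vs. a 2013-sized disk
        p_any_bad = 1 - (1 - p_sector_bad) ** sectors
        print(sectors, p_any_bad)          # ~0.63 for the small disk, ~0.9999999 for the large one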

Disk error rates

Posted Jul 8, 2006 16:27 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

The article failed to specify completely the denominators in the error rates mentioned.

The article means to say that the error rate per bit read has gone down, but the error rate per disk per day has gone up. And with the implicit assumption that people have one filesystem per disk and use all the space, then the error rate per filesystem per day has gone up. It obviously assumes a system that, as it keeps more data, accesses more data too.

Oh, and an error is an instance of trying to read back a particular piece of data and not being able to. It usually means permanent data loss.

The reason the article brings it up is that if the error rate per filesystem per day goes up, so does the FSCK rate per filesystem per day, and with the cost per FSCK also going up proportional to the filesystem size, the FSCK cost per day goes up.
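
As a back-of-the-envelope model of that last paragraph (all numbers are invented; only the shape of the argument matters):

    def expected_fsck_hours_per_day(errors_per_day, hours_per_fsck):
        # more errors per day means more fscks per day; each fsck also takes longer
        return errors_per_day * hours_per_fsck

    print(expected_fsck_hours_per_day(0.001, 2))    # smaller filesystem: 0.002 hours/day
    print(expected_fsck_hours_per_day(0.004, 10))   # bigger filesystem: 0.04 hours/day, 20x worse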

The 2006 Linux File Systems Workshop

Posted Oct 19, 2006 0:02 UTC (Thu) by eatsapizza (guest, #41199) [Link]

How do folks resolve the difference between MTBF for a disk and the bit error rates?

I've read of MTBFs of about 1M hours, and bit error rates of about 10^-15 (so the Mean Bits Between Failures is 1E15).

The reliability of a drive is R = exp(-t/MTBF), so the reliability of a drive for a year is about:
R = exp(-8760/1E6) = about 99%

But if you're reading from your disk at even, say, only 1MB/s (about 1 to 3% of its possible read rate), you would read about 2.523E14 bits in a year, and then:
R = exp(-2.523E14/1E15) = 77.7% - a very different result, and that's assuming the drive isn't really that busy.

I know there is RAID 0+1 and RAID5, and RAID5 (6+2), all of which make things better, but how can the single disk result be so much different?

eatsapizza
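
For the curious, the two reliability figures in the comment above can be reproduced with a few lines of Python (the MTBF, bit error rate, and read rate are the commenter's assumed values, not measurements):

    import math

    mtbf_hours = 1e6
    print(math.exp(-8760 / mtbf_hours))        # ~0.991: drive survives one year of power-on time

    bits_per_year = 1e6 * 8 * 3600 * 24 * 365  # reading 1 MB/s all year is ~2.52e14 bits
    print(math.exp(-bits_per_year / 1e15))     # ~0.777: chance of no unrecoverable read error all year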

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 9:32 UTC (Thu) by llloic (subscriber, #5331) [Link]

"Fsck on multi-terabyte file systems today can easily take 2 days, and in the future it will take even longer!"

This is a bit imprecise: fsck of a file system of a given size will be faster in the future than it is today, because disks will have better bandwidth and smaller seek times.

But I agree that fsck will generally take longer, because the average file system size will be greater and bandwidth and seek time won't scale accordingly.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 10:21 UTC (Thu) by arjan (subscriber, #36785) [Link]

Actually, seek times are not really shrinking; they haven't for the last 5 years, and if you look at the data in the article and the Seagate PDF, that won't really change until at least 2013.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 13:08 UTC (Thu) by hmh (subscriber, #3838) [Link]

This is really a wonderful article, and I can't wait to see what effect the workshop will have on ext4.

But while we wait for a production-stable fsck-friendly filesystem, we could teach MD RAID (and dm-raid for that matter) how to background scrub arrays. It is much faster to write, test and deploy such code than a new filesystem :)

Those without high-end RAID adapters and SANs (that do this by themselves) will welcome the feature, I think. I know I would...

It won't catch logical errors, or save you any fsck wait. But it might avoid fscks in the first place, as it helps a great deal with I/O errors due to bitrot.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 14:06 UTC (Thu) by nix (subscriber, #2304) [Link]

md-raid can already do this.

Just stick in cron something like

2 5 15 * * echo check > /sys/block/md-$blah/md/sync_action

to check parity on array md-${blah} (which of course reads every block on every disk in the array).

(OK, so you might want something a bit more elaborate to prevent checking if sync_action for the array is not 'idle' so as not to interrupt a real resync.)

This will proceed in the background and respect other accesses to the block device by slowing down, just like md resyncs always have.
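
A sketch of the "something a bit more elaborate" nix alludes to might look like the following Python snippet, which only starts a check when the array reports itself idle (the device name is a placeholder; this is an illustration, not a tested tool):

    import sys

    def maybe_check(md_dev):
        path = "/sys/block/%s/md/sync_action" % md_dev
        with open(path) as f:
            state = f.read().strip()
        if state == "idle":
            with open(path, "w") as f:
                f.write("check\n")      # kick off a background parity check
        else:
            print("%s is busy (%s), skipping check" % (md_dev, state), file=sys.stderr)

    maybe_check("md0")                  # run this from cron instead of the bare echo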

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 14:29 UTC (Thu) by hmh (subscriber, #3838) [Link]

Thanks! That does half the job already, and will at least find errors that have already happened and attempt to fix them.

It would be nice to have a "scrub" action that actually writes the entire array (all member devices, all sectors) to refresh aged sectors, though. "check" won't help there, and forcing a resync on every member device in turn is a very awkward (not to mention suboptimal) way to do it.

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 15:23 UTC (Thu) by nix (subscriber, #2304) [Link]

Forcing a resync won't rewrite everything, in any case, only the parity stripes: the non-parity stripes will only be read (unless you hit a write error in the parity stripe, of course).

The 2006 Linux File Systems Workshop

Posted Jul 6, 2006 15:54 UTC (Thu) by hmh (subscriber, #3838) [Link]

You have to set the member device to be "refreshed" to faulty, then hot-remove it and hot-add it back. This is, of course, dangerous depending on your array configuration.

As I said, it is very awkward, and thus a scrub function would be welcome.

The 2006 Linux File Systems Workshop

Posted Jul 7, 2006 9:26 UTC (Fri) by ekj (guest, #1524) [Link]

First, the author is stated as "Valerie Henson", whereas the organizer of the workshop is (among others) "Val Henson". I suppose the two are one and the same? (Although then it's a bit strange talking about oneself in the third person.)

Second, the observation that I/O to discs isn't keeping up with the size of the discs is an obvious one. And that will certainly have effects on the design of filesystems.

I do have one question about the fsck-times.

I was under the impression that fsck "only" checks and corrects filesystem metadata, without touching the actual contents of the files. If that's so, wouldn't it be reasonable to expect fsck performance to scale according to the number of elements in the filesystem instead of according to the size of the filesystem?

I'm thinking, the number of elements is growing much more slowly than the size of the filesystem, since the average size of an element is almost certainly growing.

Even for the same sort of data: 5 years ago a picture from my digicam was maybe 300K (3 megapixels, JPEG); today it's still one file, but more like 10MB (8 megapixels, raw). The same is true for that kernel tarball: it's still one file, but it's a lot larger than it used to be.

But stronger than this effect is the effect caused by new types of files being storable (and thus stored) on larger discs. No one stored a small collection of 10 handy DVD ISOs in their home directory ten years ago. Today it's trivial to do so, and basically everyone I know has one or more complete ISOs stored as a single file.

When I first started using Linux (1.2.13 kernel, P-75), even storing the single CD-ROM it was delivered on (Slackware, still divided up into "floppy sets") on the hard disc was unreasonable, as that would have taken a large fraction of the disc.

In short: the capacity of discs will (as projected) go up by a factor of 16 or so by 2013. Isn't it reasonable to expect that the number of files stored will go up by a much smaller factor?

The 2006 Linux File Systems Workshop

Posted Jul 7, 2006 17:07 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

The file size histogram has always had a peak near very small files, with a long tail stretching to the right. Sure, the peak is shifting to the right, but it's still well below the current block size of 4K.

And as for fsck time... Some data structures, such as block bitmaps and inode tables, are sized in proportion to the total filesystem capacity, not the number of files you currently have. So some component of fsck time is proportional to capacity, and some is proportional to space-in-use.
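
A toy model of that split (the per-GB costs are invented; only the capacity-versus-usage distinction matters):

    def fsck_seconds(capacity_gb, used_gb, per_gb_capacity=0.05, per_gb_used=0.5):
        # one component scales with total capacity (bitmaps, inode tables),
        # the other with the amount of data actually in use
        return capacity_gb * per_gb_capacity + used_gb * per_gb_used

    print(fsck_seconds(500, 250))     # 150 seconds with these made-up constants
    print(fsck_seconds(8000, 4000))   # 2400 seconds: 16x longer at the same fill ratio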

The 2006 Linux File Systems Workshop

Posted Jul 7, 2006 19:44 UTC (Fri) by oak (guest, #2786) [Link]

I think that the majority of deployed Linux file systems are still going to stay in the gigabytes range for many years to come. This is because most Linux systems will be embedded ones, and for mobile embedded devices a storage device without moving parts is preferable.

Currently the largest Flash sizes are only a few GB, and they are so expensive that most devices use much smaller Flash chips.

Backups

Posted Jul 8, 2006 22:20 UTC (Sat) by addw (guest, #1771) [Link]

One extra problem to put into the melting pot is "how do we back file systems up"? The technology for this has not really advanced in decades; you basically choose tar, cpio or dump (and I don't like dump 'cos the tape/... format is file system dependent).

It is nice to:

1) get a consistent file system backup
2) do full & incremental backups
3) back up extra metadata (dump does, the others don't)
4) back up sparse files without having to read zillions of NUL bytes

The old DEC AdvFS (on their OSF/1 boxes) had a nice feature where you could get a snapshot of the file system, via a sort of special mount, that allowed you to do (1).

The file system developers should not view this as being someone else's problem -- they can help.

Backups

Posted Jul 9, 2006 4:48 UTC (Sun) by thedevil (guest, #32913) [Link]

On desktops, I think most people nowadays make a full copy of the file system on another disk (maybe a CD). I use an external USB drive. Together with the wonderful rdiff-backup program, which stores unchanged files as links, it's quite an economical solution. For a server where you want to back up an expensive terabyte drive, that's quite a different story...

Backups

Posted Jul 11, 2006 10:11 UTC (Tue) by alspnost (guest, #2763) [Link]

Glad you mentioned rdiff-backup - an excellent tool that I couldn't live without now. I use it to back up crucial parts of my filesystems to an external USB drive. Hopefully it'll still work in 10 years when I have a 16TB array in my machine and a 5TB external drive :-)

Backups

Posted Jul 9, 2006 13:53 UTC (Sun) by khim (subscriber, #9252) [Link]

The file system developers should not view this as being someone else's problem -- they can help.

Hmm... The only thing where the filesystem can help is (4) - and it'll make sparse files waaay too special IMNSHO (today the filesystem can add or remove holes in a file at will). (1) is easily solvable elsewhere, and (2) and (3) don't need any changes in the filesystem...

Backups

Posted Sep 5, 2006 22:09 UTC (Tue) by anton (guest, #25547) [Link]

[snapshotting] is easily solvable elsewhere

To the best of my knowledge, even if you do this at the LVM level, you
need file system support, in order to get a snapshot of a consistent
file system. Also, snapshotting at the file system will often be more
efficient (e.g., no need to back up blocks for new files). So I
believe that the file system is a better place for snapshotting than
LVM.

Backups

Posted Jul 9, 2006 19:03 UTC (Sun) by hein.zelle (guest, #33324) [Link]

I'll heartily agree that making sufficient backups of large filesystems can become a serious problem, but perhaps you should also consider what kind of data will fill up a drive that is so large it actually becomes a problem.

If a disk becomes so large that write times (of e.g. the entire disk) are too slow to do a regular backup to a similar disk, then I think you can assume that the data is not very volatile either: writing the changes would take too much time as well. I think in many cases, huge databases could theoretically be split up in a small volatile part, and a large not-so-volatile part. This makes it possible to backup the non-volatile part at low frequencies, while the volatile part gets backed up more often.

We do something like this at work, where we have several terabytes (I suppose a relatively small dataset compared to others) of which about 500 GB changes often, and the rest is relatively or completely static. We use external (USB) discs to back up the static or slowly changing part about once a month. The volatile part is backed up more often, in this case also using external USB disks but with incremental and full backups.

In our case we've chosen a raid5 main storage system with a hot-spare drive to provide some reliability by itself, apart from the backups. We have not had to fall back on the backups yet, but everything appears to work well and the backup times are not bothersome at all.

I suppose the problem will indeed get worse with the increasing drive sizes, and alternatives like tape may become impossible at some point. However, using a spare drive (in usb enclosure or similar) should remain a viable backup option, I think. If not, then I would seriously wonder if the owner of the disk shouldn't be considering a (or perhaps multiple) more expensive raid system(s) with redundancy to deal with the problem. And there will obviously be exceptions where people actually do store lots and lots of volatile information that must be backed up, but I highly doubt that those exceptions would not consider the more expensive redundant options anyway.

Why not ZFS?

Posted Jul 24, 2006 22:37 UTC (Mon) by wos (guest, #39371) [Link]

As an old Multics hand, I looked at this and saw problems--and
solutions--similar to what we did with NSS in 1976, and later
with the volume scavenger. Of course, disks were about
1/10,000 the size we're talking about, but the problem (of
fsck taking too long) was the same.

However, we've learned a lot about filesystems since then, much
of which seems to be embodied in ZFS: snapshots, live scrubbing,
built-in integrity protection, etc. I haven't used ZFS, or
studied it in any detail, but it sure looks like it's trying to
solve these problems in a much more comprehensive way than just
carving up a gigantic ext3 into chunks.

http://www.opensolaris.org/os/community/zfs/

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds