
LCA: Why filesystems are hard

By Jonathan Corbet
January 20, 2010
The ext4 filesystem is reaching the culmination of a long development process. It has been marked as stable in the mainline kernel for over a year, distributions are installing it by default, and it may start to see more widespread enterprise-level deployment toward the end of this year. At linux.conf.au 2010, ext4 hacker Ted Ts'o talked about the process of stabilizing ext4 and why filesystems take a long time to become ready for production use.

In general, Ted says, people tend to be overly optimistic about how quickly a filesystem can stabilize. It is not a fast process, for a number of fairly clear reasons. There are certain aspects of software which can make it hard to test and debug. These include premature optimization ("the root of all evil"), the presence of large amounts of internal state, and an environment involving a lot of parallelism. Any of these features will make code more difficult to understand and complicate the testing environment.

Filesystems suffer from all of these problems. Users demand that a general-purpose filesystem be heavily optimized for a wide variety of workloads; this optimization work must be done at all levels of the code. The entire job of a filesystem is to store and manage internal state. Among other things, that makes it hard for developers to reproduce problems; specific bugs are quite likely to be associated with the state of a specific filesystem which a user may be unwilling to share even in the absence of the practical difficulties implicit in making hundreds of gigabytes of data available to developers. And parallelism is a core part of the environment for any general-purpose filesystem; there will always be many things going on at once. All of these factors combine to make filesystems difficult to stabilize.

What it comes down to, Ted says, is that filesystems, like fine wines, have to age for a fair period of time before they are ready. But there's an associated problem: the workload-dependent nature of many filesystem problems guarantees that filesystem developers cannot, by themselves, find all of the bugs in their code. There will always be a need for users to test the code and report their experiences. So filesystem developers have a strong incentive to encourage users to use the code, but the more ethical developers (at least) do not want to cause users to lose data. It's a fine line which can be hard to manage.

So what does it take to get a filesystem written and ready for use? As part of the process of seeking funding for Btrfs development, Ted talked to veterans of a number of filesystem development projects over the years. They all estimated that getting a filesystem to a production-ready state would require something between 75 and 100 person-years of effort - or more. That can be a daunting thing to tell corporate executives when one is trying to get a project funded; for Btrfs, Ted settled for suggesting that every company involved should donate two engineers to the cause. Alas, not all of the companies followed through completely; vague problems associated with an economic crisis got in the way.

An associated example: Sun started working on the ZFS filesystem in 2001. The project was only announced in 2005, with the first shipments happening in 2006. But it is really only in the last year or so that system administrators have gained enough confidence in ZFS to start using it in production environments. Over that period of time, the ZFS team - well over a dozen people at its peak - devoted a lot of time to the development of the filesystem.

So where do things stand with ext4? It is, Ted says, an interesting time. It has been shipping in community distributions for a while, with a number of them now installing it by default. With luck, the long term support and enterprise distributions will start shipping it soon; enterprise-level adoption can be expected to follow a year or so after that.

Over the last year or so, there have been somewhere between 60 and 100 ext4 patches in each mainline kernel release. Just under half of those are bug fixes; many of the rest are cleanup patches. There is also still a small amount of new feature and performance-enhancement work. Ted noted that the number of bug fixes has not been going down in recent releases. That, he says, is to be expected; the user community for ext4 is growing rapidly, and more users will find (and report) more bugs.

A certain number of those bugs are denial of service problems; many of those are system crashes in response to a corrupted on-disk filesystem image. A larger share of the problems are race conditions and, especially, deadlocks. There are a few problems associated with synchronization; one does not normally notice these at all unless the system crashes at the wrong time. And there are a few memory leaks, usually associated with poorly-tested error-handling paths.

The areas where the bulk of these bugs can be found are illuminating. There have been problems in the interaction between the block allocator and the online resize functionality - it turns out that people do not resize filesystems often, so this code is not always all that heavily tested. Other bugs have come up in the interaction between block pre-allocation and out-of-space handling. Online defragmentation has had a number of problems, including one nasty security bug; it turned out that nobody had really been testing that code. The FIEMAP ioctl() command, really only used by one utility, had some problems. There were issues associated with disk quotas; this feature, too, is not often used, especially by filesystem developers. And there have been problems with the no-journal mode contributed by Google; the filesystem had a number of "there is always a journal" assumptions inherited from ext3, but, again, few people have tested this feature.
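
For reference, FIEMAP is a small ioctl()-based interface for asking the kernel about a file's extent layout. The minimal user-space sketch below (the fixed 32-extent buffer and command-line file name are assumptions of this illustration) shows roughly how such a query is issued; it is not the utility Ted was referring to.

    /* Minimal FIEMAP query: print the extents backing a file.
     * Illustrative sketch only; error handling is kept to a minimum. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Room for the header plus up to 32 extent records. */
        size_t size = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, size);
        if (!fm)
            return 1;
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("FS_IOC_FIEMAP");
            return 1;
        }

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent %u: logical %llu, physical %llu, length %llu\n", i,
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_physical,
                   (unsigned long long)fm->fm_extents[i].fe_length);

        free(fm);
        close(fd);
        return 0;
    }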

The common theme here should be clear: a lot of the bugs turning up in this stage of the game are associated with little-used features which have not received as much testing as the core filesystem functions. The good news is that, as a result, most of the bugs have not actually affected that many users.

There was one problem in particular which took six months to find; about once a month, it would corrupt a filesystem belonging to a dedicated and long-suffering tester. It turned out that there was a race condition which could corrupt the disk if two processes were writing the same file at the same time. Samba, as it happens, does exactly that, whereas the applications run by most filesystem developers do not. The moral of the story: just because the filesystem developer has not seen problems does not mean that the code is truly safe.
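
For illustration only, here is a hypothetical sketch of the workload shape Ted described: two processes writing to the same region of the same file at the same time, which is what Samba does and what most developers' test loads do not. It is not the actual reproducer for the bug; the file name, sizes, and iteration counts are arbitrary assumptions.

    /* Hypothetical concurrent-writer sketch. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void writer(const char *path, char fill)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            _exit(1);
        }
        char buf[4096];
        memset(buf, fill, sizeof(buf));
        for (int i = 0; i < 10000; i++) {
            /* Both processes hit the same 1MB region of the file. */
            pwrite(fd, buf, sizeof(buf), (i % 256) * (off_t)sizeof(buf));
            fsync(fd);
        }
        close(fd);
    }

    int main(void)
    {
        const char *path = "shared-file";

        if (fork() == 0) {          /* child: one writer... */
            writer(path, 'A');
            _exit(0);
        }
        writer(path, 'B');          /* ...parent: the other */
        wait(NULL);
        return 0;
    }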

Another bug would only strike if the system crashed at just the wrong time; it had been there for a long time before anybody noticed it. How long? The bug was present in the ext3 filesystem as well, but nobody ever reported it.

There have also been a number of performance problems which have been found and fixed. Perhaps the most significant one had to do with performance in the writeback path. According to Ted, the core writeback code in the kernel is fairly badly broken at the moment, with the result that it will not tell the filesystem to write back more than 1024 blocks at a time. That is far too small for large, fast devices. So ext4 contains a hack whereby it will write back much more data than the VFS layer has requested; it is justified, he says, because all of the other filesystems do it too. In general, nobody wants to touch the writeback code, partly because they fear breaking all of the workarounds which have found their way into filesystem-specific code over the years.
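
As a purely illustrative sketch of the workaround pattern described here, the fragment below enlarges a too-small writeback request before acting on it. The structure and all of its names are hypothetical stand-ins; this is not the actual VFS or ext4 code.

    /* Hypothetical illustration of "write back more than was asked for". */
    #include <stdio.h>

    struct fake_writeback_request {
        long nr_to_write;                  /* pages the caller asked for */
    };

    #define FS_MIN_WRITEBACK_PAGES 8192L   /* assumed larger batch size */

    static void fs_writepages(const struct fake_writeback_request *req)
    {
        long batch = req->nr_to_write;

        /* The workaround: silently enlarge requests that are too small
         * to keep a large, fast device busy. */
        if (batch < FS_MIN_WRITEBACK_PAGES)
            batch = FS_MIN_WRITEBACK_PAGES;

        printf("caller asked for %ld pages, writing back %ld\n",
               req->nr_to_write, batch);
        /* ... actual page writeback would happen here ... */
    }

    int main(void)
    {
        struct fake_writeback_request req = { .nr_to_write = 1024 };
        fs_writepages(&req);
        return 0;
    }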

Ted concluded by noting that, in one way, filesystems are easy: the Linux kernel contains a great deal of generic support code which does much of the work. But the truth of the matter is that they are hard. There are lots of workloads to support, the performance demands are strong, and there tend to be lots of processes running in parallel. The creation of a new filesystem is done as a labor of love; it's generally hard to justify from a business perspective. This reality is reflected in the fact that almost nobody is investing in filesystem work currently, with the one high-profile exception being Sun and its ZFS work. But, Ted noted, that work has cost them a lot, and it's not clear that they have gotten a return which justifies that investment. Hopefully the considerable amount of work which has gone into Linux filesystem development will have a more obvious payback.



LCA: Why filesystems are hard

Posted Jan 21, 2010 3:01 UTC (Thu) by yokem_55 (subscriber, #10498) [Link]

> This reality is reflected in the fact that almost nobody is investing in filesystem work currently, with the one high-profile exception being Sun and its ZFS work.

Is Oracle, with btrfs, a low-profile contributor? While probably still a couple of years out from stability, btrfs will have most of the functionality that makes zfs desirable.

LCA: Why filesystems are hard

Posted Jan 21, 2010 3:04 UTC (Thu) by yokem_55 (subscriber, #10498) [Link]

I should have done a better job of RTFA. Mea culpa.

LCA: Why filesystems are hard

Posted Jan 21, 2010 8:15 UTC (Thu) by dwmw2 (subscriber, #2063) [Link]

Ted is absolutely right. File systems are hard and take a long time to properly stabilise.

That's why I find it so strange that people are willing to trust the closed-source file system inside SSDs, which have such a long track record of losing data, and which you can't even fsck and recover when they go wrong.

LCA: Why filesystems are hard

Posted Jan 21, 2010 11:13 UTC (Thu) by epa (subscriber, #39769) [Link]

The 'filesystem' inside an SSD may be buggy and unreliable, but does its expected failure rate exceed the chance of mechanical failure you'd have with a hard disk? Nobody expects an SSD to work as a reliable backup medium, so it may not matter.

LCA: Why filesystems are hard

Posted Jan 22, 2010 1:16 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

the filesystem inside an SSD actually is MUCH simpler than a general-purpose filesystem

it is always working in fixed-size objects
it never sees more than one write to an object at a time (and if it gets conflicting data for a block, the latest version wins)
it also doesn't have to guess at the architecture and performance of the underlying storage
it has a very simple command set (store this block here) rather than there being many different ways to do things.
it has a single command queue, whereas a general filesystem is getting reads and writes in parallel from many processes.
it does not need to be able to run on multiple cpu cores at the same time

all of these things make the resulting code drastically smaller, and therefore easier to check and test.

the complexity of the code in the filesystem climbs significantly faster than the complexity of the problem the filesystem is trying to address, and the chance of there being bugs climbs significantly faster than the complexity of the code in the filesystem

but the bottom line of why people are willing to trust the proprietary filesystems in the SSDs is that so far the SSD vendors have been getting it right (or at least close enough to right) that the resulting failure rate is down in the noise of mechanical and electrical failure rates. definitely not noticeably higher than any other firmware (even firmware that doesn't contain an internal filesystem)

Testing

Posted Jan 21, 2010 11:17 UTC (Thu) by epa (subscriber, #39769) [Link]

Is there a fuzz tester for filesystems that starts a virtual machine image and thrashes it with randomly generated filesystem calls by user processes, online resizing and defragmenting, artificially generated 'disk hardware errors' and so on until it crashes?

Or, perhaps easier, a fuzz tester that corrupts disk images and tries to crash the kernel?
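
As a rough sketch of the second idea (the image name and the number of corrupted bytes below are arbitrary assumptions), something like this could scribble over random bytes in a filesystem image, which would then be loop-mounted or fsck'ed, preferably inside a throwaway virtual machine:

    /* Corrupt a filesystem image by overwriting random bytes.
     * Rough sketch only; run the result against a disposable kernel. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *image = argc > 1 ? argv[1] : "test.img";
        int fd = open(image, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct stat st;
        fstat(fd, &st);
        srand(time(NULL));

        /* Overwrite 100 randomly chosen bytes with random junk. */
        for (int i = 0; i < 100; i++) {
            off_t where = (off_t)(((double)rand() / RAND_MAX) * (st.st_size - 1));
            unsigned char junk = rand() & 0xff;
            pwrite(fd, &junk, 1, where);
        }

        close(fd);
        return 0;
    }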

Testing

Posted Jan 21, 2010 17:36 UTC (Thu) by bcopeland (subscriber, #51750) [Link]

Not sure about VM-based tests (not really necessary) but there are fsfuzzer and various incarnations of fsx to name a couple of popular options.

Testing

Posted Jan 22, 2010 6:57 UTC (Fri) by cpeterso (guest, #305) [Link]

Perhaps the fuzz tests could be used on a filesystem in a loopback device? The host system would be (mostly) protected from corruption of the real filesystem.

Clarification Please!

Posted Jan 22, 2010 10:49 UTC (Fri) by ctg (subscriber, #3459) [Link]

The article talks exclusively about Ext4, except for the sentence where Ted was requesting funding for btrfs. Is this right? Or does it mean ext4? Or was Ted helping out btrfs in his capacity as the then Linux Foundation Technology Director?

Thank you!

Clarification Please!

Posted Jan 22, 2010 18:46 UTC (Fri) by corbet (editor, #1) [Link]

Ted was indeed trying to obtain resources for Btrfs development; it's a project he has always supported.

LCA: Why filesystems are hard

Posted Jan 22, 2010 15:18 UTC (Fri) by ricwheeler (subscriber, #4980) [Link]

I certainly disagree with the comment that no one is investing in file systems.

At Red Hat, we have a team of 15 file system developers who contribute actively to a host of projects - ext2/3/4, btrfs, gfs2, NFS, CIFS, xfs and others. Not to mention the qa & performance teams who turn our development code into hardened, high performance platforms for a huge number of businesses.

Let's not paint an overly bleak picture; I think that the Linux file system community has made substantial and significant gains in the past few years, and we certainly match or exceed the investment in this area by proprietary vendors.

Investing in filesystems

Posted Jan 22, 2010 18:49 UTC (Fri) by corbet (editor, #1) [Link]

I have probably not really conveyed what Ted said clearly here; he was saying that almost nobody else is really investing in filesystems. The investment from the Linux community is large and easily visible.

Investing in filesystems

Posted Jan 28, 2010 13:22 UTC (Thu) by dmaxwell (guest, #14010) [Link]

Well I do see news about DragonflyBSD's HAMMER from time to time.

LCA: Why filesystems are hard

Posted Jan 23, 2010 18:00 UTC (Sat) by marcH (subscriber, #57642) [Link]

Nice article. It would be very interesting to know how ext4 was/is tested by developers (even before being released).

Performances of ext4

Posted Jan 24, 2010 8:11 UTC (Sun) by patrick_g (subscriber, #44470) [Link]

According to benchmarks by Phoronix, the performance of the ext4 file-system goes down with new kernel releases.
Is Ted Ts'o aware of that?

Performances of ext4

Posted Jan 24, 2010 13:47 UTC (Sun) by nix (subscriber, #2304) [Link]

I'd trust those benchmarks to be indicative of problems in ext4 as far as I could throw their designer.

e.g. one really obvious case they haven't checked is whether other filesystems and raw block devices are also affected. If they are, this is due to changes in the block layer (of which there have been many more in the target time period than changes to ext4), and Ted is blameless.

TBH their PostgreSQL results are indicative of a misconfiguration of some kind more than anything else. Do you think no PostgreSQL users anywhere would have noticed a *factor of four slowdown* between .31 and .32 and mentioned it on the kernel list?

Performances of ext4

Posted Jan 24, 2010 14:02 UTC (Sun) by kronos (subscriber, #55879) [Link]

> TBH their PostgreSQL results are indicative of a misconfiguration of some kind more than anything else.
AFAIK the difference is caused by barriers being enabled by default in .32; the change is deliberate.

Performances of ext4

Posted Jan 24, 2010 14:32 UTC (Sun) by nix (subscriber, #2304) [Link]

I can see no sign of that in the ext4 git changelog. AFAIK barriers were always enabled for ext4... but commit 5f3481e9a80c240f169b36ea886e2325b9aeb745 causes an fdatasync() in the middle of an already-allocated file to always flush its blocks out (with a barrier). PostgreSQL would be 'bitten' by this hard (really bitten by the bug it fixes): almost all its writes are in the middle of already-allocated files, and before this change the fdatasync() wouldn't actually have synced anything but the inode, AFAICS.
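
As a hypothetical illustration of the write pattern being discussed (the file name and sizes are assumptions), the sketch below allocates a file by writing it once, then overwrites a block in the middle and calls fdatasync():

    /* Overwrite-in-place followed by fdatasync(), roughly the shape of a
     * database's writes to an already-allocated file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char block[8192];
        int fd = open("dbfile.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Allocate the file up front by writing real data, as the
         * database would have done long before the interesting workload. */
        memset(block, 0, sizeof(block));
        for (int i = 0; i < 1024; i++)
            pwrite(fd, block, sizeof(block), (off_t)i * sizeof(block));
        fsync(fd);

        /* The case in question: rewrite a block in the middle of the
         * already-allocated file and ask for it to be made durable. */
        memset(block, 0x42, sizeof(block));
        pwrite(fd, block, sizeof(block), 512 * (off_t)sizeof(block));
        fdatasync(fd);

        close(fd);
        return 0;
    }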

Performances of ext4

Posted Jan 24, 2010 16:07 UTC (Sun) by patrick_g (subscriber, #44470) [Link]

Perhaps... but going from 1069 transactions per second in pgbench (2.6.31) to only 280 (2.6.32) is a gigantic cost!
See this page.

Who knows

Posted Jan 24, 2010 23:17 UTC (Sun) by man_ls (guest, #15091) [Link]

And who can tell if the change is really worth it after the previous ext4 fiasco?

Who knows

Posted Jan 25, 2010 8:12 UTC (Mon) by nix (subscriber, #2304) [Link]

IMNSHO, anything that fscks as fast as ext4 is worth it, no matter what else has changed :)

Performances of ext4

Posted Jan 26, 2010 8:02 UTC (Tue) by kleptog (subscriber, #1183) [Link]

280 transactions per second sounds about right for a system with spinning disks attached. A transaction is committed when the data hits the log and in general you can do this once per revolution of the disk platter. If there are simultaneous transactions they can commit together.

Anyone who gets thousands of transactions per second either has a battery backed cache on the hard disk controller, or does not have the D in ACID. Or is running on SSD disks.

The fact that disks and operating systems have silently been ignoring fsync requests has gotten people used to completely unrealistic numbers.

Performances of ext4

Posted Jan 26, 2010 15:00 UTC (Tue) by ricwheeler (subscriber, #4980) [Link]

I agree in general with the comment, but have to point out that the transaction rate depends on a lot of things.

You can get a rough idea of how many transactions your storage can do by timing the fsync()s per second of a dirty file. On an S-ATA drive, that number is around 30-40 per second; on an enterprise-class array it can jump up to 700/sec over fibre channel, and with something like a PCI-e SSD device it can go beyond that.
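
As a rough sketch of that measurement (the file name and iteration count are arbitrary assumptions), something like the following times how many fsync()s per second a repeatedly dirtied file can sustain:

    /* Time fsync()s per second on a file that is dirtied before each call. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const int iterations = 200;
        char byte = 0;
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < iterations; i++) {
            pwrite(fd, &byte, 1, 0);   /* dirty the file... */
            fsync(fd);                 /* ...and push it to stable storage */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%.0f fsyncs/sec\n", iterations / secs);

        close(fd);
        unlink("fsync-test.dat");
        return 0;
    }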

Note that you can also try to batch multiple transactions into one commit - ext4 supports fsync batching for multi-threaded writers, for example.

Performances of ext4

Posted Jan 26, 2010 15:30 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

SATA drives can do better than that.

for rotating media, figure the drive can do one fsync per rotation when writing to a sequential file; for 7200 rpm drives this is ~120/sec.

if you are getting thousands of transactions/sec from a database test, you have some buffering going on, and unless that buffering is battery backed, you will lose it in a power outage.

the one exception is that if you have multiple transactions going in parallel, you may be able to have different transactions complete their syncs in the same disk rotation, so you may get # threads * (rpm/60) syncs/sec.

enterprise storage arrays have large battery backed ram buffers, which do wonders for your transaction rate, up until the point where those buffers are filled (although even then they give you a benefit as multiple transactions can be batched and written at once, reducing the number of writes to the drives)

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds