By Jonathan Corbet
January 20, 2010
The ext4 filesystem is reaching the culmination of a long development
process. It has been marked as stable in the mainline kernel for over a
year, distributions are installing it by default, and it may start to see
more widespread enterprise-level deployment toward the end of this year.
At linux.conf.au 2010, ext4 hacker Ted Ts'o talked about the process of
stabilizing ext4 and why filesystems take a long time to become ready for
production use.
In general, Ted says, people tend to be overly optimistic about how quickly
a filesystem can stabilize. It is not a fast process, for a number of
fairly clear reasons. Some aspects of software inherently make it hard to
test and debug. These include premature optimization
("the root of all evil"), the presence of large amounts of internal state,
and an environment involving a lot of parallelism. Any of these features
will make code more difficult to understand and complicate the testing
environment.
Filesystems suffer from all of these problems. Users demand that a
general-purpose filesystem be heavily optimized for a wide variety of
workloads; this optimization work must be done at all levels of the code.
The entire job of a filesystem is to store and manage internal state.
Among other things, that makes it hard for developers to reproduce
problems; specific bugs are quite likely to be associated with the state of
a specific filesystem which a user may be unwilling to share even in the
absence of the practical difficulties implicit in making hundreds of
gigabytes of data available to developers. And parallelism is a core part of the
environment for any general-purpose filesystem; there will always be many
things going on at once. All of these factors combine to make filesystems
difficult to stabilize.
What it comes down to, Ted says, is that filesystems, like fine wines, have
to age for a fair period of time before they are ready. But there's an
associated problem: the workload-dependent nature of many filesystem
problems guarantees that filesystem developers cannot, by themselves, find
all of the bugs in their code. There will always be a need for users to
test the code and report their experiences. So filesystem developers have
a strong incentive to encourage users to use the code, but the more ethical
developers (at least) do not want to cause users to lose data. It's a fine
line which can be hard to manage.
So what does it take to get a filesystem written and ready for use? As
part of the process of seeking funding for Btrfs development, Ted talked to
veterans of a number of filesystem development projects over the years.
They all estimated that getting a filesystem to a production-ready state
would require something between 75 and 100 person-years of effort - or
more. That can be a daunting thing to tell corporate executives when one
is trying to get a project funded; for Btrfs, Ted settled for suggesting
that every company involved should donate two engineers to the cause.
Alas, not all of the companies followed through completely; vague problems
associated with an economic crisis got in the way.
An associated example: Sun started working on the ZFS filesystem in 2001.
The project was only announced in 2005, with the first shipments happening
in 2006. But it is really only in the last year or so that system
administrators have gained enough confidence in ZFS to start using it in
production environments. Over that period of time, the ZFS team - well
over a dozen people at its peak - devoted a lot of time to the development
of the filesystem.
So where do things stand with ext4? It is, Ted says, an interesting time.
It has been shipping in community distributions for a while, with a
number of them now installing it by default. With luck, the long-term
support and enterprise distributions will start shipping it soon;
enterprise-level adoption can be expected to follow a year or so after
that.
Over the last year or so, there have been something between 60 and 100 ext4
patches in each mainline kernel release. Just under half of those are bug
fixes; many of the rest are cleanup patches. There is also a small amount
of new-feature and performance-enhancement work still going on. Ted noted that the
number of bug fixes has not been going down in recent releases. That, he
says, is to be expected; the user community for ext4 is growing rapidly,
and more users will find (and report) more bugs.
A certain number of those bugs are denial-of-service problems; many of
those are system crashes in response to a corrupted on-disk filesystem
image. A larger share of the problems are race conditions and, especially,
deadlocks. There are a few problems associated with synchronization; one
does not normally notice these at all unless the system crashes at the
wrong time. And there are a few memory leaks, usually associated with
poorly-tested error-handling paths.
The areas where the bulk of these bugs can be found are illuminating. There
have been problems in the interaction between the block allocator and the
online resize functionality - it turns out that people do not resize
filesystems often, so this code is not always all that heavily tested.
Other bugs have come up in the interaction between block pre-allocation and
out-of-space handling. Online defragmentation has had a number of
problems, including one nasty security bug; it turned out that nobody had
really been testing that code. The FIEMAP ioctl()
command, really only used by one utility, had some problems. There were
issues associated with disk quotas; this feature, too, is not often used,
especially by filesystem developers. And there have been problems with the
no-journal mode contributed by Google; the filesystem had a number of
"there is always a journal" assumptions inherited from ext3, but, again,
few people have tested this feature.
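For readers unfamiliar with it, FIEMAP maps a file's logical extents to their
physical locations on disk. The sketch below shows, roughly, what such a call
looks like from user space; the path, the extent count, and the abbreviated
error handling are all arbitrary choices for illustration.

    /* Minimal FIEMAP sketch: ask the kernel for a file's extent map.
     * Build on Linux with: cc -o fiemap-demo fiemap-demo.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* FS_IOC_FIEMAP */
    #include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/etc/hostname"; /* example */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Room for up to 32 extents in a single call. */
        size_t size = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, size);
        if (!fm) { perror("calloc"); return 1; }
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;              /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data first */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("FS_IOC_FIEMAP");
            return 1;
        }

        for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent %u: logical %llu  physical %llu  length %llu\n", i,
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_physical,
                   (unsigned long long)fm->fm_extents[i].fe_length);

        free(fm);
        close(fd);
        return 0;
    }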
The common theme here should be clear: a lot of the bugs turning up in this
stage of the game are associated with little-used features which have not
received as much testing as the core filesystem functions. The good news
is that, as a result, most of the bugs have not actually affected that many
users.
There was one problem in particular which took six months to find;
about once a month, it would corrupt a filesystem belonging to a dedicated
and long-suffering tester. It turned out that there was a race condition
which could corrupt the disk if two processes were writing the same file at
the same time. Samba, as it happens, does exactly that, whereas the
applications run by most filesystem developers do not. The moral of the
story: just because the filesystem developer has not seen problems does
not mean that the code is truly safe.
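The workload itself is easy to picture. Purely as an illustration of the
access pattern (and not a reproducer for the actual bug), a hypothetical
stress test with two processes hammering the same file might look like this;
the file name, sizes, and iteration count are all made up:

    /* Hypothetical two-writer stress sketch: two processes repeatedly
     * writing the same file at the same time, the kind of pattern Samba
     * produces but typical developer workloads do not. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static void writer(const char *path, off_t base)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); exit(1); }

        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        /* Rewrite a window of the file over and over. */
        for (int i = 0; i < 100000; i++)
            if (pwrite(fd, buf, sizeof(buf), base + (i % 64) * 4096) < 0) {
                perror("pwrite");
                exit(1);
            }
        close(fd);
        exit(0);
    }

    int main(void)
    {
        const char *path = "shared-test-file";      /* example name */

        if (fork() == 0)
            writer(path, 0);                /* first writer */
        if (fork() == 0)
            writer(path, 128 * 1024);       /* second writer, overlapping range */

        wait(NULL);                         /* parent waits for both children */
        wait(NULL);
        return 0;
    }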
Another bug would only strike if the system crashed at just the wrong time;
it had been there for a long time before anybody noticed it. How long?
The bug was present in the ext3 filesystem as well, but nobody ever
reported it.
There have also been a number of performance problems which have been found
and fixed. Perhaps the most significant one was in the writeback path.
According to Ted, the core writeback code in the
kernel is fairly badly broken at the moment, with the result that it will
not tell the filesystem to write back more than 1024 blocks at a time.
That is far too small for large, fast devices. So ext4 contains a hack
whereby it will write back much more data than the VFS layer has requested; it
is justified, he says, because all of the other filesystems do it too. In
general, nobody wants to touch the writeback code, partly because they fear
breaking all of the workarounds which have found their way into
filesystem-specific code over the years.
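As a rough, user-space model of the kind of workaround involved (the
structure and the numbers below are simplified stand-ins for the kernel's
writeback_control and ext4's real heuristics, not actual kernel code):

    /* Toy model of the workaround Ted described: if the caller asks for a
     * very small writeback chunk, bump it up to something better suited to
     * a large, fast device.  Everything here is illustrative only. */
    #include <stdio.h>

    struct wb_request {
        long nr_to_write;       /* pages the caller asked to be written back */
    };

    static long adjust_writeback(struct wb_request *req)
    {
        const long vfs_cap = 1024;   /* the per-call limit mentioned in the talk */
        const long desired = 32768;  /* arbitrary larger target for a fast device */

        if (req->nr_to_write <= vfs_cap)
            req->nr_to_write = desired;
        return req->nr_to_write;
    }

    int main(void)
    {
        struct wb_request req = { .nr_to_write = 1024 };
        printf("asked for %ld, will write back %ld\n", 1024L,
               adjust_writeback(&req));
        return 0;
    }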
Ted concluded by noting that, in one way, filesystems are easy: the Linux
kernel contains a great deal of generic support code which does much of the
work. But the truth of the matter is that they are hard. There are lots
of workloads to support, the performance demands are strong, and there tend
to be lots of processes running in parallel.
The creation of a new filesystem is done as a labor of love; it's generally
hard to justify from a business perspective. This reality is reflected in
the fact that almost nobody is investing in filesystem work currently, with
the one high-profile exception being Sun and its ZFS work. But, Ted noted,
that work has cost them a lot, and it's not clear that they have gotten a
return which justifies that investment. Hopefully the considerable amount
of work which has gone into Linux filesystem development will have a more
obvious payback.