By Jonathan Corbet
March 17, 2009
One might well think that, at this point, there has been sufficient
discussion of the design decisions built into the ext4 "delayed allocation"
feature and the user-space implications of those decisions. And perhaps
that is true, but there should be room for a summary of the relevant
issues. The key question has little to do with the details of filesystem
design, and a lot to do with the kind of API that the Linux kernel should
present to its user-space processes.
As has been well covered (and discussed) elsewhere, the delayed
allocation feature found in the ext4 filesystem - and most other
contemporary filesystems as well - has some real benefits for system
performance. In many cases, such as short-lived temporary files, delayed
allocation can avoid the allocation of space on the physical medium (along
with the associated I/O) entirely. For longer-lived
data, delayed allocation allows the filesystem to optimize the placement of
the data blocks, making subsequent accesses faster. But delayed allocation
can, should the system crash, lead to the loss of the data for which space
has not yet been allocated. Any filesystem may lose data if the system is
unexpectedly yanked out from underneath it, but the changes in ext4 can
lead to data loss in situations that, with ext3, appeared to be robust.
This change looks much like a regression to many users.
Many electrons have been expended to justify the new, more uncertain ext4
situation. The POSIX specification says that no persistence is guaranteed
for data which has not been explicitly sent to the media with the
fsync() call. Applications which lose data on ext4 are not using
the filesystem correctly and need to be fixed. The real problem is users
running proprietary kernel modules which cause their systems to crash in
the first place. And so on. All of these statements are true, at least to
an extent.
But one might also argue that they are irrelevant.
Your editor recently became aware that Simson Garfinkel's Unix Hater's Handbook
[PDF] is available online. To say that this book is an aggravating
read is an understatement; much of it seems like childish poking at Unix by
somebody who wishes that VMS (or some other proprietary system) had taken
over the world. It's full of text like:
The traditional Unix file system is a grotesque hack that, over the
years, has been enshrined as a "standard" by virtue of its
widespread use. Indeed, after years of indoctrination and
brainwashing, people now accept Unix's flaws as desired
features. It's like a cancer victim's immune system enshrining the
carcinoma cell as ideal because the body is so good at making
them.
But behind the silly rhetoric are some real points that anybody concerned
with the value of Unix-like systems should hear. Among them is the "worse
is better" notion expressed by Richard Gabriel in 1991 - the year the Linux
kernel was born. This charge states that Unix developers will choose
implementation simplicity over correctness at the lower levels, even if it leads to
application complexity (and lack of robustness) at the higher levels. The
ability of a write() system call to succeed partially is given as
an example; it forces every write() call to be coded within a loop
which retries the operation until the kernel gets around to finishing the
job. Developers who cut corners like that are left with an application
which works most of the time, but which can fail silently in unexpected
circumstances. It is far better, these people say, to solve the problem
once at the kernel level so that applications can be simpler and more
robust.
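The retry loop described above might look something like the following
Python sketch. The helper name write_all is hypothetical; os.write() is
Python's thin wrapper around the write() system call and, like it, returns
the number of bytes actually accepted, which may be fewer than requested.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write every byte of data to fd, retrying after partial writes.

    write_all is a hypothetical helper name used for illustration only.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write() may accept only part of the buffer; it returns the
        # number of bytes it actually took.
        written = os.write(fd, view[total:])
        if written == 0:
            # Defensive guard so an unexpected zero-byte write cannot
            # turn into an endless loop.
            raise OSError("write() made no progress")
        total += written
    return total
```

An application which calls os.write() once and ignores the return value
will work until the day a write is cut short; the loop is the extra
complexity that the "worse is better" critique complains about.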
The ext4 situation can be seen as similar: any application developer who
wants to be sure that data has made it to persistent storage must take
extra care to inform the kernel that, yes, that data really does matter.
Developers who skip that step will have applications which work -
almost all the time. One could well argue that, again, the kernel
should take the responsibility of ensuring correctness, freeing
application developers from the need to worry about it.
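As an illustration of that extra care, here is a minimal sketch of one
common pattern for replacing a file's contents so that a crash leaves
either the old version or the new one. The function name replace_durably
and the ".tmp" suffix are assumptions made for this example, not anything
taken from the ext4 discussion, and the directory fsync() assumes a POSIX
system such as Linux.

```python
import os


def replace_durably(path: str, data: bytes) -> None:
    """Replace the contents of path with data, flushing before the rename.

    Hypothetical helper: the name and the ".tmp" suffix are illustrative.
    """
    tmp_path = path + ".tmp"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # Retry after partial writes until the whole buffer is accepted.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the new data to stable storage *before*
        # the rename makes it visible under the final name.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically switch the name over to the fully written file.
    os.replace(tmp_path, path)

    # Flush the directory entry as well, so the rename itself is persistent.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Leaving out the two os.fsync() calls yields code which behaves identically
in normal operation; the difference only shows up after a crash.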
The ext3 filesystem makes no such guarantees, but, due to the way its
features interact, ext3 provides something close to a persistence guarantee
in most situations. An ext3 filesystem running under a default
configuration will normally lose no more than five seconds' worth of work in
a crash, and, importantly, it is not prone to the creation of zero-length
files in common scenarios. The ext4 filesystem withdrew that implicit
guarantee; unpleasant surprises for users followed.
Now the ext4 developers are faced with a choice. They could stand by their
changes, claiming that the loss of robustness is justified by increased
performance and POSIX compliance. They could say that buggy applications need to be
fixed, even if it turns out that very large numbers of applications need
fixing. Or, instead, they could conclude that Linux should provide a higher level of
reliability, regardless of how diligent any specific application developers might
have been and regardless of what the standards say.
It should be said that the first choice is not entirely unreasonable.
POSIX forms a sort of contract between user space and the kernel. When the
kernel fails to provide POSIX-specified behavior, application developers
are the first to complain. So perhaps they should not object when the
kernel insists that they, too, live up to their end of the bargain.
One could argue that applications which have been written according to the
rules should not take a performance hit to make life easier for the rest.
Besides, this is free software; it would not take that long to fix
up the worst offenders.
But fixing this kind of problem is a classic case of whack-a-mole:
application developers will continually reintroduce similar bugs. The kernel
developers have been very clear that they do not feel bound by POSIX when
the standard is seen to make no sense. So POSIX certainly does not compel
them to provide a lower level of filesystem data robustness than
application developers would like to have. There is a case to be made that
this is a situation where the Linux kernel, in the interest of greater
robustness throughout the system, should go beyond POSIX.
The good news, of course, is that this has already happened. There is a
set of patches queued for 2.6.30 which will provide ext3-like behavior in
many of the situations that have created trouble for early ext4 users.
Beyond that, the ext4 developers are considering an "allocate-on-commit"
mount option which would force the completion of delayed allocations when
the associated metadata is committed to disk, thus restoring ext3 semantics
almost completely. Chances are good that
distributors would enable such an option by default. There would be a
performance penalty, but ext4 should still perform better than ext3, and
one should not underestimate the performance costs associated with lost
data.
In summary: the ext4 developers - like Linux developers in general - do
care about their users. They may complain a bit about sloppy application
developers, standards compliance, and proprietary kernel modules, but
they'll do the right thing in the end.
One should also remember that ext4 is still a very young filesystem; it's not
surprising that a few rough edges remain in places. It is unlikely that we
have seen the last of them.
As a related issue, it has been suggested that the real problem is with the
POSIX API, which does not make the expression of atomicity and durability
requirements easy or natural. It is time, some say, to create an extended
(or totally new) API which handles these issues better. That may well be
true, but this is easier said than done. There are, of course, the
difficulties in designing a new API to last for the next few decades; one
assumes that we are up to that challenge. But will anybody use it?
Consider Linus Torvalds's
response to another suggestion for an API extension:
Over the years, we've done lots of nice "extended functionality"
stuff. Nobody ever uses them. The only thing that gets used is the
standard stuff that everybody else does too.
Application developers will be naturally apprehensive about using
Linux-only interfaces. It is not clear that designing a new API which will
gain acceptance beyond Linux is feasible at this time.
Your editor also points out, hesitantly, that Hans Reiser had designed -
and implemented - all kinds of features for the reiser4 filesystem which
were meant to allow applications to use small files in a robust manner.
Interest in
accepting those features was quite low even before Hans left the scene.
There were a lot of reasons for this, including nervousness about a
single-filesystem implementation and nervousness about dealing with Hans,
but the addition of non-POSIX extensions was problematic in its own right
(see this article for
coverage of this discussion in 2004).
The real answer is probably not new APIs. It is probably a matter of
building our filesystems to provide "good enough" robustness as a default,
with much stronger guarantees available to developers who are willing to do
the extra coding work. Such changes may come hard to filesystem hackers
who have worked to create the fastest filesystem possible. But they will
happen anyway; Linux is, in the end, written by and for its users.