September 9, 2009
This article was contributed by Valerie Aurora (formerly Henson)
Sure, programmers (especially operating systems programmers) love
their specifications. Clean, well-defined interfaces are a key
element of scalable software development. But what is it about file
systems, POSIX, and the question of when file data is guaranteed to
hit permanent storage that brings out the POSIX fundamentalist in all
of us? The
recent
fsync()/rename()/O_PONIES
controversy was the most heated in recent memory but not out of
character for
fsync()-related discussions. In this
article, we'll explore the relationship between file systems
developers, the POSIX file I/O standard, and people who just want to
store their data.
In the beginning, there was creat()
Like many practical interfaces (including HTML and TCP/IP), the POSIX file system
interface was implemented first and specified second. UNIX was
written beginning in 1969; the first release of the POSIX
specification for the UNIX file I/O interface (IEEE Standard 1003.1)
was released in 1988. Before UNIX, application access to non-volatile
storage (e.g., a spinning drum) was a decidedly application- and
hardware-specific affair. Record-based file I/O was a common paradigm,
growing naturally out of punch cards, and each kind of file was treated
differently. The new interface was designed by a few guys (Ken
Thompson, Dennis Ritchie, et alia) screwing around with their new
machine, writing an operating system that would make it easier
to, well, write more operating systems.
As we know now, the new I/O interface was a hit. It turned out to be a
portable, versatile, simple paradigm that made modular software
development much easier. It was by no means perfect, of course: a
number of warts revealed themselves over time, not all of which were
removed before the interface was codified into the POSIX
specification. One example is directory hard links, which permit the
creation of a directory cycle - a directory that is a descendant of
itself - and its subsequent detachment from the file system hierarchy,
resulting in allocated but inaccessible directories and files.
Recording the time of the last access - atime - turns every read
into a tiny write. And don't forget the apocryphal quote from Ken
Thompson when asked if he'd do anything differently if he were
designing UNIX today: "If I had to do it over again? Hmm... I guess
I'd spell 'creat' with an 'e'". (That's the creat()
system call to create a new file.) But overall, the UNIX file system
interface is a huge success.
POSIX file I/O today: Ponies and fsync()
Over time, various more-or-less portable additions have accreted
around the standard set of POSIX file I/O interfaces; they have been
occasionally standardized and added to the canon - revelations from
latter-day prophets. Some examples off the top of my head include
pread()/pwrite(), direct I/O, file preallocation, extended attributes,
access control lists (ACLs) of every stripe and color, and a vast
array of mount-time options. While these additions are often debated
and implemented in incompatible forms, in most cases no one opposes
them purely because they are absent from a standard written in 1988.
Similarly, there is relatively little debate about
refusing to conform to some of the more brain-dead POSIX details, such
as the aforementioned directory hard link feature.
Why, then, does the topic of when file system data is guaranteed to be
"on disk" suddenly turn file systems developers into pedantic
POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes
down to this: Waiting for data to actually hit disk before returning
from a system call is a losing game for file system performance. As
the most extreme example, the original synchronous version of the UNIX
file system frequently used only 3-5% of the disk throughput. Nearly
every file system performance improvement since then has been
primarily the result of saving up writes so that we can allocate and
write them out as a group. As file systems developers, we are going
to look for every loophole in fsync() and squirm our way
through it.
Fortunately for the file systems developers, the POSIX specification
is so very minimal that it doesn't even mention the topic of file
system behavior after a system crash. After all, the original
FFS-style file systems (e.g., ext2) can theoretically lose your entire
file system after a crash, and are still POSIX-compliant. Ironically,
as file systems developers, we spend 90% of our brain power coming up
with ways to quickly recover file system consistency after a system
crash! No wonder file systems users are irked when we define file
system metadata as important enough to keep consistent, but not file
data - we take care of our own so well. File systems developers have
magnanimously conceded, though, that on return
from fsync(), and only from fsync(), and
only on a file system with the right mount options, the changes to
that file will be available if the system crashes after that point.
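In concrete terms, collecting that guarantee looks something like the
minimal sketch below (the error handling is abbreviated and the
function name is made up for illustration). Note that the durability
promise attaches only to a successful return from fsync(), so its
return value must actually be checked:

    #include <fcntl.h>
    #include <unistd.h>

    /* Minimal sketch: make one file's new contents durable.  The
     * guarantee attaches to a successful fsync() return - and only
     * on a file system and mount options that honor it. */
    int write_durably(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }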
At the same time, fsync() is often more expensive than it
absolutely needs to be. The easiest way to
implement fsync() is to force out every outstanding write
to the file system, regardless of whether it is a journaling file
system, a COW file system, or a file system with no crash recovery
mechanism whatsoever. This is because it is very difficult to map
backward from a given file to the dirty file system blocks needing to
be written to disk in order to create a consistent file system
containing those changes. For example, the block containing the
bitmap for newly allocated file data blocks may also have been changed
by a later allocation for a different file, which then requires that
we also write out the indirect blocks pointing to the data for that
second file, which changes another bitmap block... When you solve the
problem of tracing specific dependencies of any particular write, you
end up with the complexity
of soft updates. No
surprise, then, that most file systems take the brute-force approach,
with the result that fsync() commonly takes time
proportional to all outstanding writes to the file system.
So, now we have the following situation: fsync() is
required to guarantee that file data is on stable storage, but it may
perform arbitrarily poorly, depending on what other activity is going
on in the file system. Given this situation, application developers
came to rely on what is, on the face of it, a completely reasonable
assumption: rename() of one file over another will result in
either the contents of the old file or the contents of the new file
as of the time of the rename(). This is a subtle
and interesting optimization: rather than asking the file system to
synchronously write the data, it is instead a request to order the
writes to the file system. Ordering writes is far easier for the file
system to do efficiently than synchronous writes.
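The resulting pattern looks roughly like the sketch below (the
temporary file name here is hypothetical, and it must live on the
same file system as the target, since rename() cannot cross file
system boundaries):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch of the replace-via-rename() pattern, with no fsync().
     * Crash safety relies on the file system ordering the data
     * writes ahead of the rename() - an implementation side effect,
     * as discussed below. */
    int replace_file(const char *path, const char *buf, size_t len)
    {
        char tmp[] = "config.tmp.XXXXXX";  /* hypothetical name */
        int fd = mkstemp(tmp);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) < 0) {
            unlink(tmp);
            return -1;
        }
        /* Readers see either the complete old contents or the
         * complete new contents - never a mix. */
        return rename(tmp, path);
    }

Nothing in this sequence asks for durability; it only asks that the
old and new contents never be observed half-mixed.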
However, the ordering effect of rename() turns out to be
a file system specific implementation side effect. It only works when
changes to the file data in the file system are ordered with respect
to changes in the file system metadata. In ext3/4, this is only true
when the file system is mounted with the data=ordered
mount option - a name which hopefully makes more sense now! Up until
recently, data=ordered was the default journal mode for
ext3, which, in turn, was the default file system for Linux; as a result,
ext3 data=ordered was all that
many Linux application developers had any experience with. During the
Great File System Upheaval of 2.6.30, the default journal mode for
ext3 changed to data=writeback, which means that file
data will get written to disk when the file system feels like it, very
likely after the file's metadata specifying where its contents are
located has been written to disk. This not only breaks
the rename() ordering assumption, but also means that the
newly renamed file may contain arbitrary garbage - or a copy
of /etc/shadow, making this a security hole as well as a
data corruption problem.
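For reference, the journal mode is chosen at mount time (the
command-line equivalent is passing -o data=ordered to mount). A
hypothetical sketch using the Linux mount(2) interface, with made-up
device and mount point names:

    #include <sys/mount.h>

    /* Hypothetical sketch: explicitly requesting ordered-mode
     * journaling for an ext3 file system, rather than relying on
     * the default.  The final argument carries file system-specific
     * mount options. */
    int mount_ordered(void)
    {
        return mount("/dev/sda1", "/mnt", "ext3", 0, "data=ordered");
    }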
Which brings us to the present
day fsync()/rename()/O_PONIES
controversy, in which many file systems developers argue that
applications should explicitly call fsync() before
renaming a file if they want the file's data to be on disk before the
rename takes effect - a position which seems bizarre and random until
you understand the individual decisions, each perfectly reasonable,
that piled up to create the current situation. Personally, as a file
systems developer, I think it is counterproductive to replace a
performance-friendly implicit ordering request in the form of
a rename() with an
impossible-to-optimize fsync(). It may not be POSIX, but the
programmer's intent is clear - no one ever, ever wrote
"creat(); write(); close(); rename();" and hoped they
would get an empty file if the system crashed during the next 5
minutes. That's what truncate() is for. A generalized
"O_PONIES do-what-I-want" flag is indeed not possible,
but in this case, it is to the file systems developers' benefit to
extend the semantics of rename() to imply ordering so
that we reduce the number of fsync() calls we have to cope
with. (And, I have to note, I did have a real, live pony when I was a
kid, so I tend to be on the side of giving programmers ponies when
they ask for them.)
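For completeness, the sequence that file systems developers are
asking applications to adopt is the same replace-via-rename() pattern
with an fsync() inserted before the rename() - something like this
sketch (again with a hypothetical temporary file name):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* The fsync()-before-rename() variant: the new contents are
     * forced to stable storage before the rename takes effect, at
     * the cost of a potentially expensive flush. */
    int replace_file_durably(const char *path, const char *buf,
                             size_t len)
    {
        char tmp[] = "config.tmp.XXXXXX";  /* hypothetical name */
        int fd = mkstemp(tmp);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) < 0) {
            unlink(tmp);
            return -1;
        }
        return rename(tmp, path);
    }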
My opinion is that POSIX and most other useful standards are helpful
clarifications of existing practice, but are not sufficient when we
encounter surprising new circumstances. We criticize application
developers for using folk-programming practices ("It seems to work!")
and coming to rely on file system-specific side effects, but the bare
POSIX specification is clearly insufficient to define useful system
behavior. In cases where programmer intent is unambiguous, we should
do the right thing, and put the new behavior on the list for the next
standards session.