[Editor's note: this is the second page of Valerie Henson's report from the
2006 Linux Filesystems Workshop.]
Day One: Data
The first day of the workshop was devoted to reviewing the data about
hardware, existing file system design, and the problems our users are
facing. We began with introductions of the workshop participants:
- Val Henson, Intel - ZFS developer
- Zach Brown, Oracle - OCFS2 developer
- Arjan van de Ven, Intel - former distribution kernel maintainer and all around Linux hacker
- Andreas Dilger, Cluster File Systems - Lustre developer
- James Bottomley, SteelEye - Linux SCSI maintainer
- Chris Mason, SuSE - Reiserfs developer
- Christoph Hellwig, representing himself - XFS, JFS, VxFS, and all-around file system developer
- Mark Fasheh, Oracle - OCFS2 developer
- Ric Wheeler, EMC - Storage system architect
- Theodore Ts'o, IBM - ext2/3 developer
- Mingming Cao, IBM - ext3 developer
- Felix Blyakher, SGI - XFS developer
In addition to the above, Linus Torvalds dropped in on the second day
of the workshop.
File system repair
We began by spelling out the coming fsck time crunch and the
increasing frequency of unrecoverable I/O errors, as described in the
introduction of this article. First, let's review what fsck, or the
File System ChecK program, actually does. Fsck can be divided into
three main tasks: recovery, repair, and refresh of the file system.
Sidebar: How to pronounce fsck
How do you tell where a file systems developer is from? Listen to how
they pronounce fsck! Here are some common pronunciations and their sources.
- "F-S-C-K" (Finland, Florida, California, France (in French),
Germany (in German))
- "fisk" (California, Portland)
- "F-suck" (New Mexico, Michigan)
- "F-sock" (Ireland)
- "F-sack" (Unknown)
- "F-S check" (Berkeley)
- "Arrrgh, again?" (Dave Jones)
- [censored] (attributed to Al Viro)
Recovery: Fsck was originally written to recover from an
unclean unmount of the file system, perhaps caused by a system crash or
loss of connection to the disk. Half-finished in-progress updates
have to be cleaned up in order to make the file system consistent
again. For example, a block may be referenced by an indirect block,
but not marked as allocated in the block bitmap. Journaling file
systems formalized the process of recovering in this case,
occasionally moving the recovery code out of the fsck program itself.
For example, XFS separated fsck into in-kernel journal replay code and
a utility called xfs_repair.
Repair: A natural extension of fsck is to repair file system
inconsistency caused by hardware error, software bugs, or other data
corruption. The errors are less predictable than in the pure recovery
case, but many commonly encountered errors (such as loss of the data
in a particular block or errors in reference counts) can be corrected.
In order to discover and correct these errors, fsck must read all the
metadata in the entire file system.
Refresh: The final use of fsck is to traverse the entire file
system looking for latent errors and attempting to fix them before
they become too bad. This is the purpose of the various ext2/3 fsck
timeouts (by default, fsck runs every time the file system has been
mounted a fixed number of times, or if it has been more than a certain
number of days since it was last checked). This use of fsck is a lot
more acceptable if it can be done in the background.
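As a rough illustration of the refresh policy, the decision to force a periodic check can be sketched as follows. The field names are hypothetical: they mirror the mount-count and elapsed-time triggers described above rather than any real superblock layout, and the thresholds in the usage line are made up.

```python
from dataclasses import dataclass

@dataclass
class CheckState:
    mount_count: int      # mounts since the last full check
    max_mount_count: int  # mount threshold; values <= 0 disable this trigger
    last_check: int       # time of the last full check, in seconds
    check_interval: int   # seconds allowed between checks; 0 disables this trigger

def needs_full_check(state: CheckState, now: int) -> bool:
    """Return True if either the mount-count or the elapsed-time trigger fires."""
    if state.max_mount_count > 0 and state.mount_count >= state.max_mount_count:
        return True
    if state.check_interval > 0 and now - state.last_check >= state.check_interval:
        return True
    return False

# Hypothetical thresholds: check every 30 mounts or every 180 days.
DAY = 24 * 60 * 60
state = CheckState(mount_count=12, max_mount_count=30,
                   last_check=0, check_interval=180 * DAY)
```

With these made-up values, `needs_full_check(state, now=10 * DAY)` is false, and it becomes true once 180 days have passed or the mount count reaches 30.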
Most file systems have concentrated on speeding up the recovery task
of fsck, but the repair and refresh tasks are becoming more and more
important. Today, repairing file system corruption can make all of
the file system data inaccessible for hours or days, and hardware
trends are only increasing fsck time. In many ways, data that you
can't read is as bad as data loss. Which is worse: losing a file and
restoring from backup after a few hours, or corrupting your file
system and being unable to read any of your files for several days?
One workaround, in use today, is to split the disk up into many
smaller file systems, which can be repaired individually.
Unfortunately, this results in greater administrative overhead and
more frequent ENOSPC errors (out of space errors), since files and
directories can't span file systems. Worse, an I/O error on one file
system can sometimes cause the entire kernel to hang or panic,
resulting in all of the data being unavailable anyway.
A frequently suggested workaround is to further optimize the fsck
program itself. Perhaps the bottleneck is in the fsck code itself,
and all we need to do is add some prefetching and block
reordering, and spruce up the algorithms a bit. However, much of the
obvious fsck optimization work has already been done, especially for
ext2 and ext3. For this to be a long-term solution, we would need to
continue improving the performance of fsck by a factor of 2 every year
or so to keep up with the gap between disk capacity, bandwidth, and
seek time - an unsustainable goal.
Our conclusion from this data is that file systems need to support
partial, concurrent, on-line file system repair. File systems also
need to be written to deal with frequent I/O errors without panicking
or hanging the system - and need to be tested this way. One
recommendation was that people should keep bad disks, as there's
nothing like the real thing for testing your error handling.
Disks and errors
The types of errors disks make affect file system design as well. We
are familiar with errors that happen from something physical going
wrong with the media, resulting in bit flips. These days, the disk
has internal checksums that detect this kind of error. When the
checksum doesn't match, the disk retries the I/O itself several
different ways before giving up and returning an error. Generally, it
is pointless for any OS-level software to retry an I/O that resulted
in the error, since by the point the OS sees an error, it is very
likely to be unrecoverable.
However, another class of errors can't be detected or fixed by the
disk's internal checksums. The terms for these kinds of errors are
"high-flying writes," "over-powered seeks," "phantom writes," and
"misdirected reads and writes." Basically, the drive either fails to
write data but claims it did (in a "high-flying write," the head is
too far above the platter to actually flip the bits), or it completes
the I/O, simply on the wrong block ("over-powered seek" means that the
head went too far and is positioned over the wrong track when the I/O
starts). The internal checksums (indeed, any checksum embedded within
the block it is checksumming) can't detect this error, because the
checksum matches the data in that location - it is simply old data or
the wrong data. As a result, the OS needs to have checksums at a
higher level as well. Sun's ZFS takes the approach of putting
checksums in the indirect blocks, which solves most of the problems
with misdirected or phantom I/O's without imposing an additional
performance penalty to read a separate checksum block. Checksums like
this will solve many of the panic/hang problems that come from
trusting corrupted file system metadata.
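The difference between a checksum stored with the block and one stored in the parent can be shown with a toy model. This is only a sketch under stated assumptions: the `Disk` class, its `phantom` flag, and the `parent_pointer` dictionary are all hypothetical stand-ins, and CRC32 is used purely for brevity, not because any particular file system uses it.

```python
import zlib

class Disk:
    """Toy disk: block number -> (payload, checksum embedded next to the payload)."""
    def __init__(self):
        self.blocks = {}

    def write(self, blkno, data, phantom=False):
        if phantom:
            return  # drive reports success, but nothing reaches the platter
        self.blocks[blkno] = (data, zlib.crc32(data))

    def read(self, blkno):
        return self.blocks[blkno]

def embedded_check_ok(disk, blkno):
    """Verify a block against the checksum stored inside that same block."""
    data, stored = disk.read(blkno)
    return zlib.crc32(data) == stored

def parent_check_ok(disk, blkno, expected):
    """Verify a block against the checksum recorded in its parent (indirect) block."""
    data, _ = disk.read(blkno)
    return zlib.crc32(data) == expected

disk = Disk()
disk.write(7, b"old contents")
new = b"new contents"
# The parent records where the child lives and what it should hash to.
parent_pointer = {"blkno": 7, "checksum": zlib.crc32(new)}
disk.write(7, new, phantom=True)  # the write is silently lost
```

After the lost write, `embedded_check_ok(disk, 7)` still passes, because the stale block is internally consistent, while `parent_check_ok(disk, 7, parent_pointer["checksum"])` fails and exposes the problem.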
Another interesting class of errors is the result of failures in an
underlying RAID device or volume manager. Often failures will occur
in stripe sizes - 64KB of data ends up zeroed, for example. Some
systems actually byte stripe data, which results in single bytes of
lost data at regular intervals.
Since I/O errors are common, it makes sense for file systems to
replicate important data - the more important the data is, the more
copies should exist. Many file systems already make multiple copies
of the file system superblock; this is an extension of that concept to
other important data, such as block group summary information.
However, to make copies effective, the data must be "scrubbed" -
regularly read to discover whether a copy has gone bad, and then
rewritten with the good data before the second copy also goes bad. A
talk on data preservation by Mary Baker really brings this home with a
model used to predict mean time to data loss with and without
scrubbing. In the system she modeled, mean time to data loss went from
less than a year without scrubbing to as much as 16 years with
scrubbing, depending on the frequency of scrubbing (see slide 45 in
the presentation). Even with two copies of data, data loss is
surprisingly common due to the birthday paradox, exemplified by the
question "How many people do you need to have a 50% chance that two or
more of them share a birthday?" The answer is only 23, because the
chance of the same date coming up twice goes up approximately
proportional to the square of the number of people. Similarly, it is
surprisingly likely that, given a system that makes two copies of all
data, both copies of some piece of data will go bad. The "birthday
surprise," when several disk drives in the same lot fail within a few
days of each other due to similar defects in manufacturing, also
contributes to surprisingly high data loss even with multiple copies.
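Both ideas in this paragraph are easy to sketch. The first function computes the birthday-paradox probability directly, assuming 365 equally likely birthdays. The second is a minimal, hypothetical scrub pass over in-memory copies of one piece of data; the function name, the use of CRC32, and the assumption that a trusted checksum is available are all illustrative choices, not a description of any real implementation.

```python
import zlib

def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

def scrub(copies, checksum):
    """Read every copy, rewrite bad ones from a good one, return the repair count."""
    good = next((c for c in copies if zlib.crc32(c) == checksum), None)
    if good is None:
        raise OSError("all copies are bad; nothing to repair from")
    repaired = 0
    for i, c in enumerate(copies):
        if zlib.crc32(c) != checksum:
            copies[i] = good  # rewrite before the remaining good copy also fails
            repaired += 1
    return repaired
```

The probability crosses 50% between 22 and 23 people, matching the figure quoted above. The scrub pass only helps if it runs while at least one copy is still good, which is why the frequency of scrubbing matters so much.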
One problem implementing file systems is the wide variation in quality
and features available on different disk drive lines. However, just
as the disk drive industry is consolidating at the corporate level
(think Seagate-Maxtor-Quantum), it is also consolidating at the level
of disk drive lines. Economies of scale encourage production of fewer
different kinds of disks, possibly with minor differences such as
firmware version. This is positive for the low end of disks, since
large enterprise customers are buying low-end disks and insisting on
better quality.
Another change in disk hardware is in caching and reordering of I/O's.
Back when FFS was originally designed, the maximum bandwidth was only
possible when reading every other block on disk, since the disk would
rotate one block underneath the head before the OS could submit
another I/O (hence the superblock field specifying "rotational delay"
in many implementations of FFS). Then disks began supporting multiple
outstanding I/O's, and both disks and OS's did readahead of one or
more blocks, making pure sequential layout the fastest game in town.
Now disks read whole tracks (on the order of 256KB) into the internal
disk memory (on the order of 8MB), which make any I/O's that occur
within the same track or group of nearby tracks behave, practically
speaking, as if they were sequential on-disk. Optimal layout has gone
from every-other-block to sequential to nearby is good enough.
We discussed the effect of having some form of flash or non-volatile
memory available either on the disk or in the main system. This is
part of what makes Network Appliance's file servers so speedy - they
cache writes on flash and return success quickly, instead of having to
wait until the write hits disk. The easiest way to take advantage of
flash is to use it as a write cache for blocks - physical journaling
on a separate device. One suggestion is to use the flash for metadata
and store data on disk, but that makes flash a single point of
failure, the loss of which is almost equivalent to losing the entire
file system, and also constrains the ratio of metadata to data in the
system. Designing a file system that assumes the presence of
OS-visible flash somewhere on the system seems like a risky
proposition, given that flash has been available for many years but
not integrated as a standard system component. Concerns about write
speed and number of write cycles before failure (order of 100,000 to
1,000,000) also limit file system dependency on flash. Overall, a new
file system should be able to take advantage of flash but not require
its presence.
A recurring theme of our discussions was the desire to broaden the
interface between the operating system and the disk drive. Even the
most basic disk drive is a small computer now, let alone hardware RAID
devices or multi-terabyte SAN servers, and can easily handle more
sophisticated requests than "write block 37." This is too broad a
topic to cover here, but the key step we need to take is getting file
systems architects and disk manufacturers together to talk about what
interfaces would be useful.
Lessons learned from existing file systems
We discussed some lessons learned from existing file system
implementations. One of the primary lessons is that file system
repair tools need to reliably distinguish between metadata and things
that simply look like metadata. One advantage of having a fixed
number of inodes in a static location is that it's hard to mistake
other data for an inode, and consequently much easier to repair the
file system. Identifying potential inodes solely by an embedded magic
number is a recipe for failure: consider the case of a file containing
a loopback-created file system image - the file data will look like
inodes if we depend only on embedded magic numbers. Metadata should
be identified by some kind of out-of-band mechanism, such as block
group descriptors or headers of some sort. When it comes to file
system repair, static metadata location has some serious advantages to
offset its disadvantages (occasional wasted space or insufficient
inodes).
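The loopback-image problem can be reproduced in a few lines. In this sketch the magic value, the block contents, and the descriptor format (a list of start/count ranges) are all invented for illustration.

```python
MAGIC = b"INOD"  # hypothetical magic number marking an inode block

def find_by_magic(blocks):
    """Treat every block that starts with the magic number as an inode block."""
    return {n for n, b in enumerate(blocks) if b.startswith(MAGIC)}

def find_by_descriptor(inode_table_ranges):
    """Trust only the out-of-band description of where inode tables live."""
    return {n for start, count in inode_table_ranges
            for n in range(start, start + count)}

blocks = [b"\0" * 8 for _ in range(8)]
blocks[1] = MAGIC + b"real"  # genuine inode table block
blocks[2] = MAGIC + b"real"  # genuine inode table block
blocks[5] = MAGIC + b"copy"  # file data: part of a file system image stored in a file
inode_table_ranges = [(1, 2)]  # hypothetical block group descriptor contents
```

Here `find_by_magic(blocks)` returns blocks 1, 2, and 5, wrongly including the file data, while `find_by_descriptor(inode_table_ranges)` returns only blocks 1 and 2.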
Another lesson learned is that file system generation numbers help
distinguish between metadata and data that was once metadata in a
previous incarnation of this file system on this device. If the
metadata is static, it can be zeroed on file system creation, but this
makes file system creation expensive and slow. One technique to speed
up ext2/3 file system creation time without adding a generation number
is to add a parameter to the block group information describing how
much of the inode table has been initialized, so that fsck knows to
ignore the rest of the inode table. An "initialized up to here"
marker block adds a layer of redundancy and safety to this scheme.
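A minimal sketch of the "initialized up to here" idea follows. The `BlockGroup` class and its `inodes_initialized` field are hypothetical names, and the sketch simplifies allocation by always handing out the next uninitialized slot.

```python
from dataclasses import dataclass

@dataclass
class BlockGroup:
    inode_table: list            # raw slots; not zeroed at file system creation
    inodes_initialized: int = 0  # hypothetical "initialized up to here" field

    def allocate_inode(self, contents):
        """Initialize the next slot lazily and advance the watermark."""
        idx = self.inodes_initialized
        self.inode_table[idx] = contents
        self.inodes_initialized += 1
        return idx

    def inodes_for_fsck(self):
        """fsck only examines slots below the watermark and ignores stale data."""
        return self.inode_table[:self.inodes_initialized]

# Leftover bytes from a previous file system on the same device.
group = BlockGroup(inode_table=[b"stale-A", b"stale-B", b"stale-C"])
first = group.allocate_inode(b"root")
```

After one allocation, `group.inodes_for_fsck()` contains only the new inode, so the stale slots are never mistaken for live metadata even though they were never zeroed.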
Often a major goal of file system design is to reduce the on-disk size
of metadata. However, several studies over the last decade have shown
that many disks are only half-full. Given that redundancy of metadata
helps with file system repair, it seems like a reasonable trade-off to
use more on-disk space for metadata to reduce the overall chance of
data loss.
| File size | % total |
| <= 1 KB | 25-30% |
| <= 2 KB | 40-45% |
| <= 3 KB | 50-60% |
| <= 4 KB | 55-65% |
One exception to the rule of disk-space-is-cheap is small files. Many
file systems are inefficient at storing small files, yet small files
make up the majority of files in many file systems. We heard about a
large government customer that wants to store millions of files on the
order of 1KB in size; another person pointed out that many source code
files are smaller than 4KB. Several of us checked the file sizes on
our laptop file systems and found the results shown in the table
above.
In other words, about one quarter of the files took up less than 1KB,
and more than half were less than 4KB. In every case, we could find
some sort of explanation for why there were so many small files, but
when everything is a special case in the same way, it becomes the
common case. Indeed, current application programming styles seem to
only be increasing the number of small files; examples include Gnome's
habit of storing one configuration item per file. With a file system
block size of 4KB, easily a quarter of the stored files are wasting
more space than they are using. This is similar to what the
implementers of FFS found back in 1984; using a 4KB block size wasted
about 45% of the disk - which is why they implemented 1KB fragments
(which remain unimplemented in ext2 and ext3). New file systems must
be designed to efficiently store small files.
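The arithmetic behind "wasting more space than they are using" is worth spelling out. This sketch assumes a file occupies a whole number of allocation units and ignores metadata overhead; the function names are made up for the example.

```python
def allocated_bytes(size, alloc_unit=4096):
    """Space consumed on disk when files are stored in whole allocation units."""
    return ((size + alloc_unit - 1) // alloc_unit) * alloc_unit

def wasted_bytes(size, alloc_unit=4096):
    """Slack space between the end of the file and the end of its last unit."""
    return allocated_bytes(size, alloc_unit) - size

def wastes_more_than_it_uses(size, alloc_unit=4096):
    return wasted_bytes(size, alloc_unit) > size
```

With 4KB blocks, a 1KB file wastes 3KB, and any non-empty file smaller than 2KB wastes more than it stores. With a 1KB allocation unit, as with the FFS fragments mentioned above, the same 1KB file wastes nothing.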
One existing solution is to pack small files together into one block.
However, one of the problems with this solution is that it ends up
rewriting data belonging to files which would otherwise be read-only,
increasing the likelihood of corruption. It also tends to be
difficult to implement and bug-prone.
Major file system architectures
We briefly reviewed the major file system architectures to summarize
what worked and did not work in each case. Simple FFS-style file
systems such as the original Berkeley FFS and ext2 have the advantages
of simplicity, high performance, and easy repair and recovery of data,
but they require a full fsck every time the system crashes, have no
data consistency guarantees, and no formal defenses against disk
corruption.
The most popular modern file system architecture is the journaling
file system, such as ext3, reiserfs, and many others. Journaling file
systems added a log of recent transactions to the file system, which
are written sequentially to a reserved location on disk. The main
file system is not modified until the complete transaction is written
to the log. This allows fast recovery from an unexpected crash, as we
can replay the log and complete any half-finished operations. The
problems with journaling include the double write problem (each
operation must be written to disk twice, once to the log, once to the
final location), and various performance problems stemming from
limited, contended journal space. Also, journaling makes no
improvement in the disk corruption case.
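The write-twice behavior and the crash recovery path can be sketched with an in-memory model. This is a simplification under stated assumptions: the `JournaledStore` class and its record format are hypothetical, and real journals also checkpoint during normal operation rather than only at replay time.

```python
class JournaledStore:
    def __init__(self):
        self.journal = []  # sequential log in a reserved area
        self.main = {}     # final on-disk locations: block number -> data

    def log(self, updates, commit=True):
        """First write: append the updates, then a commit record if we get that far."""
        for blkno, data in updates.items():
            self.journal.append(("update", blkno, data))
        if commit:
            self.journal.append(("commit",))

    def replay(self):
        """Second write: apply committed transactions and drop any incomplete tail."""
        pending = {}
        for record in self.journal:
            if record[0] == "update":
                pending[record[1]] = record[2]
            else:  # commit record: the transaction is complete
                self.main.update(pending)
                pending = {}
        self.journal.clear()

store = JournaledStore()
store.log({1: b"inode", 2: b"bitmap"})   # complete transaction
store.log({3: b"dirent"}, commit=False)  # crash before the commit record
store.replay()
```

After replay, blocks 1 and 2 hold their new contents and block 3 is untouched, so the file system never sees a half-finished operation. Each committed block was written twice: once to the journal and once to its final location.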
Log-structured file systems caused a great stir in the research file
systems community but never made the leap into major production use.
The insight behind log-structured file systems is first, writing out
updates as a log turns a group of random write I/O's into one large
streaming write, which is much more efficient, and second, once we
have written the transaction to a log, why do we need to write a
second time? The file system is essentially a giant sequential
transaction log, with updates appended to the end. Data is never
overwritten in place - log structured file systems are one kind of
copy-on-write (COW) file system. The main problem with log-structured
file systems is that they require large free segments of disk space,
created by a "cleaner" thread. Allocated blocks must be moved out of
partially free segments and into other segments. The overhead of the
cleaner thread turns out to be quite high, despite many years of
research on cleaner optimization. Also, calculating the amount of
free space needed to complete an update (even a write to a previously
allocated block) turns out to be hard, since a COW file system cannot
free the old copy of a block until the new copy is written, and the
number of block copies required to complete an operation is
unpredictable. Finally, forced reallocation of blocks requires that a
"good" allocation decision be made on every write, whereas
update-in-place file systems need only make a good allocation decision
once.
Soft updates was a refinement to Berkeley FFS which preserved the
on-disk format while removing the need to run fsck on the file system
before it could be mounted after a crash. Soft updates carefully
orders updates to the file system so that if the system crashes at any
time, the file system is consistent with the exception that some
blocks and inodes are "leaked" - marked allocated when they are free.
A background fsck, run on a file system snapshot, finds these
unreferenced blocks and marks them free again. The downside of soft
updates is mainly that it is extremely complex to understand and
implement, and each file system operation requires its own specially
designed update code. To our knowledge, there is only one
implementation of soft updates in existence.
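What the background fsck looks for after a soft updates crash can be expressed as two set differences. The data layout here (a set of allocated block numbers and a map from inode number to block list) is a hypothetical simplification.

```python
def background_check(allocated, inodes):
    """Compare the allocation bitmap against the blocks actually referenced.

    allocated: set of block numbers marked in use in the bitmap
    inodes:    dict mapping inode number -> list of block numbers it references
    """
    referenced = set()
    for block_list in inodes.values():
        referenced.update(block_list)
    leaked = allocated - referenced     # marked in use, but nothing points at them
    dangerous = referenced - allocated  # in use but marked free; ordering should prevent this
    return leaked, dangerous

allocated = {10, 11, 12, 13}
inodes = {2: [10, 11], 5: [12]}
leaked, dangerous = background_check(allocated, inodes)
```

In this example block 13 is leaked and can safely be marked free again, while the `dangerous` set is empty, which is the invariant that the careful update ordering is meant to guarantee.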
The most recent trend in file systems architecture is the
copy-on-write (COW) file system, as typified by WAFL (Write Anywhere
File Layout, Network Appliance's internal file system), and ZFS (the
new Solaris file system). These file systems are constructed as a
tree of blocks. Every time a block is updated, a new block is
allocated and the chain of block pointers pointing to it is updated -
also causing copies of those blocks. When a consistent set of updates
is written to disk, the root block is atomically updated to point to
the new tree of blocks, which includes up-to-date allocation
information. This structure makes snapshots extremely simple to
implement and centralizes file system consistency code. The
disadvantages are similar to some of the disadvantages of
log-structured file systems - forced reallocation on every write and
uncertainty about how much space is needed to complete an update. Also, good
synchronous performance requires an added journal of some sort,
complicating the implementation.
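The tree-of-blocks update can be sketched with path copying over an in-memory tree. This is a toy model, not how WAFL or ZFS is actually implemented: the `Node` class and `cow_update` function are hypothetical, and reassigning a Python variable stands in for the atomic root block update.

```python
class Node:
    def __init__(self, children=None, data=None):
        self.children = dict(children or {})  # name -> child Node
        self.data = data

def cow_update(node, path, data):
    """Return a new root; copy each node along path and share everything else."""
    if not path:
        return Node(node.children if node else None, data)
    head, rest = path[0], path[1:]
    children = dict(node.children) if node else {}
    children[head] = cow_update(children.get(head), rest, data)
    return Node(children, node.data if node else None)

root = cow_update(None, ("a", "x"), b"1")
root = cow_update(root, ("b", "y"), b"2")
snapshot = root                              # a snapshot is just a saved root pointer
root = cow_update(root, ("a", "x"), b"new")  # "atomic" switch to the new tree
```

The snapshot still sees the old contents of `a/x`, the new root sees the update, and the untouched subtree `b` is shared between both trees. The sketch also shows the cost noted above: changing one leaf allocates a new copy of every block on the path to the root.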
Overall, previous file systems focused on solving the problem of
repairing the file system in the case of system crash, and considered
actual data corruption to be rare enough to handle with a full fsck.
File system repair was not a major consideration in the design of the
on-disk layout. Saving on-disk space was a major goal early on, but
is less important now, except for the case of small files, which
still make up a large portion of stored data.
Thus ended the first day of the workshop, in doom and gloom. Only the
hope of the second day of the workshop and the promised brainstorming
session kept our spirits up.
[ Continue on to Page 3 ]