[
Editor's note: this is the third and final page of Valerie Henson's report from 2006 Linux Filesystems
Workshop.]
Day Two: Ideas
The second day of the workshop began with the discovery that the bulb
in the projector had burned out, which was fine with us as it further
discouraged the use of mind-numbing slideware. We attributed part of
the amazing productivity of the workshop to the lack of a projector
and the fact that most of us also lacked any form of network
connectivity from the conference room, forcing us to pay attention to
what was going on instead of reading our email.
The second day of the workshop was devoted to talking about
interesting file system architecture ideas. We were fortunate to have
experts in most of the major Linux file systems (ext2/3, reiserfs,
XFS, JFS) present to share their experiences, as well as people with
expertise in other file systems (OCFS2, Lustre, ZFS), and of course,
our token I/O subsystem representative (SCSI).
First, a quick review of our goals. We want a file system that can
easily detect and correct disk corruption without taking the file
system offline for hours or days. We especially don't want to lose
the entire file system or large segments thereof when a relatively
small part of the disk is corrupted. File system repair should take
less time than restoring the file system from backup; in fact, we
should design our file system layout with fsck performance in mind,
rather than just normal file system performance. These goals can be
summarized as "repair-driven file system design" - designing our file
system to be quickly and easily repaired from the beginning, rather
than bolting it on afterward.
We also want to avoid double writes, reallocating blocks on every
write, and overly complex implementations which are difficult to
understand and prone to bugs. We want to efficiently store small
files but otherwise are willing to trade disk space for greater
reliability. We can take advantage of disk track caches and focus on
reducing seeks instead of laying out data exactly sequentially.
Chunkfs and continuation inodes
Chunkfs is an idea that Arjan van de Ven and Val Henson came up with
in order to combat the coming fsck crunch. It is similar to an idea
for improving Lustre at the distributed file system level currently in
development at Cluster File Systems, Inc. In order to explain it,
we'll start with a version of the idea that didn't work.
In early 2006, Val Henson began working on an idea to reduce fsck time
in ext2. Ext2 is a wonderful file system - fast, reliable, simple -
but it has this irritating requirement of a full fsck after a system
crash. Her initial idea was to mark individual block groups as dirty
or clean, depending on whether it was being modified, and then modify
fsck to not bother checking clean block groups, thereby reducing
average fsck time. Initial measurement showed that the
divide-and-conquer approach to be promising, as only a maximum of 25%
of block groups were dirty at any given time under a heavy file system
load, and most block groups were clean more than 90% of the time
under a normal development laptop workload. However, this scheme
failed because file system references are not contained within block
groups. If fsck wants to know if the block group allocation bitmap is
correct for a particular block, it still has to build the block bitmap
by traversing the entire file system, since any inode in any block
group can reference any block in any block group. Instead, she
implemented a file system-wide dirty bit indicating whether the entire
file system was dirty or not, allowing the system to skip fsck if the
system crashed while the file system was clean. (For details, see the
talk at OLS, reducing
fsck time on ext2.)
Arjan van de Ven got to wondering: what if we divided the file system
into many small file systems (chunks), each with its own dirty bit, so
that fsck time would be more predictable? This quickly evolved into a
scheme for making a collection of small file systems (with any form of
consistency guarantee - dirty bit not required) look like one big file
system to the user, while still being individually repairable. The
basic insight is to allow files and directories to span file systems
by creating "continuation inodes" in other file systems. For example,
if a file grows too large for its chunk, we create a continuation
inode in another chunk and point to it. The new inode has a back
pointer to the original inode. For a directory that has a link to a
file in another chunk, we allocate a continuation inode for the
directory in the new chunk and allocate the directory entry for that
link in the same chunk as the file. This has the downside that the
number of hard links to a file are limited by the free space in the
file's chunk; however, large numbers of hard links are uncommon. This
problem can be
mitigated by reserving space for minimum number of hard links per
inode created. Another option is to allocate a continuation inode for
the file being linked to in the same chunk as the directory, and chain
the various hard links together using a linked list to handle the case
of removing the original link.
The important invariants are that blocks are only referenced by inodes
inside the chunk, and any inodes with cross-chunk references have
explicit back pointers to the parent inode. Fsck only has to check
the local inodes to repair the block bitmap, and the local inodes plus
the inode back pointers to check inode reference counts. A memorable
quote from this discussion was: "If you want to check it, you have to
chunk it."
Ideally, fsck on chunkfs would be implemented on a per-file system,
on-line, on-demand basis. If a chunk suffers an I/O error or the file
system discovers a checksum mismatch, that chunk can be taken offline
individually and checked with less impact on the rest of the file
system (directory hierarchy has to be taken into account here; it may
be reasonable to replicate directories near the root of the file
system or allow independent mounts of files in particular chunks).
Chunkfs has many useful properties. Chunks can be created with
different inode ratios and sizes. Files can be easily segregated into
chunks based on size, reducing fragmentation. One possibility is that
file start out in a small file chunk, and when they grow into larger
files, a continuation inode is allocated in a large-file chunk. This
might be useful for workloads that only access the first few bytes of
a file, such as the "file" command and some image viewers. With
chunkfs, 32-bit block pointers are enough, since it is unlikely we
will want chunks that are in the terabyte size range. Chunkfs puts
few limitations on the implementation of the file system inside each
chunk and may possibly be implemented as a generic layer usable with
any local file system. Growing file systems is trivial - add new
chunks - and shrinking is simplified, since data can be moved one
chunk at a time.
Overall, chunkfs was one of the rare ideas that looked better the more
we thought about it. We will be discussing it more on the Linux
fsdevel list in the near future.
Doublefs
Doublefs is an idea that Theodore Ts'o and Val Henson came up with a
couple of years ago, trying to get the consistency benefits of
copy-on-write file systems without the pain of constantly allocating
new blocks and figuring out space accounting for updates. Since disk
space is cheap, why not pre-allocate two copies of all metadata,
including directory entry blocks? When updating the file system,
overwrite only one copy of the metadata, marking it with a transaction
number. When the full set of updates is on disk, go back and update
the second, older copy of the metadata at your leisure. Half-updated
metadata is tracked by some form of dirty map or list on disk, so if
the system crashes in the middle of an update, we can go back and
clean it up, by copying either the old or the new copy of the metadata
over the second copy, depending on when we crashed. Disagreements
between the copies are settled by checking the transaction id and a
checksum.
One nice property of this system is that the two copies do not need to
be located next to or even near each other, since the writes are
temporally separated. This increases the chance that only one copy is
corrupted by an I/O error. Another nice property is that the second
copy need only be written in order to reduce recovery time on reboot
(and lower the chance of the good copy being corrupted before the
second copy is updated). Zach Brown and Aaron Grier put some effort
into prototyping doublefs.
Inodes in directory entries
Another major idea discussed at the workshop is making the directory
entry the primary repository of file metadata rather than a separate
inode, as suggested by Linus Torvalds. Currently, the directory entry
contains only two important pieces of information: the name of the
link to the file, and the inode number of the file. It might make
sense to include the inode in the directory entry as well. A paper
describing one implementation of this scheme is
Embedded Inodes
and Explicit Grouping: Exploiting Disk Bandwidth for Small Files,
by Greg Ganger, et alia.
One difficulty is what happens when a file is created in one
directory, another hard link is created from another directory, and
the original link is removed. The inode must be then migrated to the
new directory, via a linked list of some sort. This means that a
particular inode number is not located in a static place, complicating
anything that must look up files by inode number. Currently, only NFS
servers need to look up files by inode number, due to the
implementation of NFS file handles. Also, we may want to change the
inode number when we move it to another directory, for internal file
system architecture reasons. One solution is to have a user-visible
inode number that stays the same over the lifetime of the file, and an
internal inode number which can change.
Putting inodes in the directory entry opens the way to easily include a small
amount of file data in the directory entry. Up to a certain
threshold, file data could be packed into the directory entry along
with the inode. This will greatly reduce on-disk fragmentation, due
to partly-used data blocks. Having all the file information readable
with one I/O will improve performance of many workloads, since
normally a cold cache read of file data requires three dependent
I/O's: First the look up of the directory entry, then the read of the
inode, then the read of the file data pointed to by the inode. The
major concern with packing file data is that the implementation tends
to be bug prone and difficult to get right, due to the need to
correctly repack the data if the file size changes. However, we
observe that very small files are usually created and deleted
atomically and rarely modified in place. One rule that might work is
to only pack the file data into the directory entry when it is first
created (using delayed allocation to avoid creating the file until the
data is written). If the file changes or grows, immediately move the
data out of the directory entry - and never move it back. Christoph
Hellwig suggested that this might be easy to prototype in XFS.
The performance implications of including inodes and/or file data
in-line with the direct entry are complex. One fundamental file system
performance trade-off is the performance of readdir() versus
stat() versus
cat on a large number of files. If all you are doing is listing all
the entries in a directory, you want that information to be packed as
tightly as possible, so you don't have to read a bunch of extraneous
data off the disk. However, if you commonly read the directory entry
and then immediately stat the file, reading the inode, then it would
make sense to include the inode in-line with the directory entry, even
though it slows down the performance of the pure readdir workload.
The same argument goes for a stat followed by a read of the file data
- including some amount of file data in-line can speed up the stat
followed by cat workload, at the expense of slowing down the pure stat
workload.
One major point is that the readdir-only workload is actually
extremely uncommon. You may think you are only doing a readdir when
you type "ls", but most systems ship with color ls enabled, which
automatically checks file permissions and the like in order to
visually distinguish different kinds of files, like executables and
hard links. From a performance standpoint, including the inode in-line
with the directory entry looks like a sensible trade-off.
The second trade-off, which you can call stat-vs.-cat, is less
obvious. How often do we stat files and how often do we cat files?
While working on ZFS, we did some performance measurements and found
that increasing the size of inodes enough to include a few hundred
bytes of data severely impacted the performance of the stat workload,
because it decreased the density of the data we cared about and
required us to read many times more data off disk than necessary. In
the end, the most efficient size of inodes depends on the file system
workload, and will have to be carefully researched.
One suggestion that came out of in-line inode discussion is adding
flags to the stat() system call to request only a subset of the data
usually returned in the stat structure. This could allow significant
performance optimizations, since calculating some of the values in the
stat structure can be expensive.
Allocation
We discussed several techniques for improving allocation of file data
on disk. A naive allocation implementation allocates data block by
block as it is written, with no knowledge about how large the file
will eventually be. This worsens disk fragmentation by mixing large
files with small files, and by making it harder to reserve large
contiguous ranges of blocks for larger files. What we need is some
way to help predict how large a file will be before we begin
allocating blocks.
The simplest technique is delayed allocation - waiting some time after
a write occurs to see if more data is written. Another solution, used
in BSD's implementation of FFS for several years now, is block
reallocation. When a file is written to, the entire file may be
moved, including blocks previously allocated elsewhere on disk, to a
better location. Another technique is the POSIX fallocate() function,
which tells the file system to allocate N bytes of disk space for this
file, which is useful for programs like tar and package managers which
know exactly how large a file will be. One of the criticisms of
fallocate() is that it is not as useful for programs that are guessing
at the total length of the file. If the program allocates more space
than it needs, it must explicitly truncate the file. A flag to
fallocate() automatically implementing this would be convenient.
Much of the discussion centered around getting hints from the
application about how files will be created and used. One point of
view is that applications are already giving us hints about file size,
permissions, and other attributes: they are called "file names." If a
file is named "*.log", it's a pretty fair guess it will start out
small, grow slowly, become very large, and be append-only. If a file
has the string "lock" in its name, it is likely to be zero length,
frequently stat'd, and deleted in the future. If it is named "*.mp3",
it is probably going to be written once, read sequentially many times,
and several megabytes long. Other file attributes, such as
permissions, give valuable hints for predicting the future of a file.
Daniel Ellard, et al., wrote several papers on this topic, The
Utility of File names [PDF] and Attribute-based
Prediction of File Properties [PDF]. The second paper describes a
system which automatically deduces correlations between file name
patterns and file attributes and future file activity, given file
system activity traces, so you don't have to manually update your file
name-based optimizations when, for example, a new audio file format becomes
popular.
Many of these allocation policies could be implemented at the VFS layer
and optionally used by file systems, rather than being implemented in
individual file systems. Many people would like to see the ext3
reservation code, written by Mingming Cao, available in some kind of
portable form for use by other file systems.
One change in our assumptions that affects allocation is the fact that
we have to do scrubbing in order to find and repair errors before they
become unrecoverable. Since we are required to pass over the file
system, we may be able to do other things while we are there, such as
continuously defragment the file system. One technique that could
significantly lessen the cost of scrubbing and defragmentation is
called freeblock
scheduling. The idea is that while the disk is servicing
important "foreground" requests, the read head is frequently idle
during seeks or waiting for the disk to rotate to the desired data.
If we have a queue of outstanding "background" requests, we can read
these blocks as we happen to pass over them while servicing foreground
requests, causing little or no degradation in performance of the
foreground requests. One drawback of this system is that it works
best with detailed low-level knowledge of the disk hardware; however
the ideas are still useful for reducing the cost of background disk
maintenance tasks.
Near-term needs for Linux file systems
We devoted part of the afternoon of day two of the workshop to talking
about features Linux file systems need in the next year. High on the
list was any method of mitigating the fact that file system repair is
extremely slow. Reducing the amount of data that needs to be checked
by marking some parts of the metadata as unused or clean would help,
as in the approach to progressively mark ext2/3 inodes in use as
needed. Many implementations of fsck are highly optimized but there
may be more room for improvement. Simply having an estimate of how
long file system repair will take lets administrators decide whether
to attempt repair or simply restore from backup.
Another short-term approach is to strategically add checksums and
error handling to existing file systems. One example is the IRON
ext3 work [PDF], done at University of Wisconsin, Madison. This led to
a discussion of the difficulty of transferring code written by
researchers into the mainline open source development tree. One
suggestion was to get researchers to focus on producing data that can
be used by mainline developers to produce code, since getting code
accepted is generally irrelevant to the average file system
researcher's career.
Next on the list was forced unmount support, the ability to forcibly
unmount a file system, even if it has open files. This is a feature
of several existing file systems and is useful for unmounting file
systems without shutting down the entire system. A large part of
forced unmount can be implemented at the VFS layer (e.g., replacing
VFS ops of open files with a set of VFS ops that always return EIO),
but specific file systems will still need some testing and work to be
able to unmount safely in the face of errors. Forced unmount can
probably implemented with a few months of work and available for
production use in a year. This feature must go hand-in-hand with
better testing with I/O errors, otherwise our effort to isolate file
system errors is not particularly successful.
Another issue is the fact that file system performance is usually only
tested on fairly young, unfragmented file systems. The file systems
development community should work with researchers on better ways of
creating aged file systems quickly. One idea was to insert plenty of
sync() calls during accelerated replay of file system traces, so that
the file system can't use smart temporal allocation policies that
would be unavailable in a truly aged file system. The latest research
on replaying file system traces is by Ningning Zhu, et alia, in a
paper their FAST '05 paper, Scalable and
Accurate Trace Replay for File Server Evaluation [PDF]. A good paper on
aging file systems is File
System Aging - Increasing the Relevance of File System Benchmarks [PDF]
by Keith Smith, at alia.
One of the difficulties with multiple file systems on a system is the
amount of memory used for file system journals, since it is usually
allocated on a per-file system basis without taking into account
global memory usage for file system data. Some method of coordinating
memory use might be helpful. In general, file systems developers tend
to optimize for file system performance without fully taking into
account overall system performance.
Day Three: Plans
The last day of the workshop was devoted to planning the next steps
for making the next generation of Linux file systems a reality - in
other words, we were assigning work. It was also Saturday morning
after two grueling days and nights of non-stop file system
discussion. Miraculously, nearly all the workshop participants
returned for the final day.
First on our to-do list is convincing the Linux community, including
companies, to work together on developing new file systems. We need
to educate the community about the coming fsck time crunch and the
necessity to start developing new file systems now, since file systems
generally take about 5 years to go from concept to production use.
Workshop participants are working on slide decks, collecting
requirements, and writing articles, in the hopes of influencing the
community and companies to devote resources to Linux file system
development.
We need to continue working on existing file systems such as ext2/3,
reiserfs, and XFS so that they can bridge the gap while new file
systems are in development. We are organizing meetings between
hardware vendors, researchers, and file systems developers to share
data and ideas. A follow-up Linux file systems workshop is
tentatively planned to be co-located with the 5th USENIX FAST
conference in San Jose, CA in February 2007. We have nothing
formal planned for Ottawa Linux Symposium,
but you might want to attend Val Henson's talk on reducing
fsck time on ext2, as it will cover other topics in Linux file
systems.
When it comes to code, we have several ideas to prototype:
- Forced unmount support
- Variable inode size and effect on stat vs. cat workloads
- Progressive inode table initialization in ext3
- Chunkfs prototype using existing file systems
- Doublefs prototype
- Inode and file data in the directory entry
- Freeblock scheduling
- Adding flags to stat() to request a subset of the stat data
If you or your organization is interested in working on any of these
ideas or getting involved in file systems development in any other
way, email the
Linux fsdevel list.
The formal workshop was over at noon, but several hardy souls accepted
an invitation to attend the Linux Beer Summit, organized by Kristen
Carlson Accardi (docking
station maintainer) and sponsored by Google. File system
discussions continued well into the evening over several excellent
home brewed beers, and Inaky Perez-Gonzalez (Ultra-wide-band maintainer) supplied
exciting low-speed tractor rides for all.
(
Log in to post comments)