September 5, 2007
This article was contributed by Valerie Henson
When people talk about fsck they not only pronounce it in wildly
different ways, but they also mean wildly different actions. For example, they
might mean "traverse the entire file system looking for obvious
errors," "run a full consistency cross-check of file system metadata,"
"repair corruption from a disk error," "repair half-finished writes
left over from a system crash," "reconstruct a consistent file system
hierarchy starting from the inodes alone," or "I'm so geeky I think
it's funny to say 'fsck' instead of swearing. Is there a new xkcd up yet?" As different as all these
meanings are, every one of them (except the last) has been implemented
by a program referred to as fsck. The question, "Does this file
system require fsck?" then becomes anything from "Does this file
system need to check and repair the entire file system after every
crash before mounting read-write?" to "Can this file system recover
from any disk corruption event while still mounted?" In this article,
we'll review the history and the various meanings of that complicated,
least-beloved of file system utilities, fsck.
fsck tasks
First, what exactly does fsck - the "file system check"
program - do? Many Linux users experience it as that annoying
10-minute delay in booting that happens every 180 days or 30 mounts,
whichever comes first (the default ext3 "paranoia"
fsck parameters). When we do run fsck, most of us
run it in automatic mode. After all, how many of us can out-guess
fsck when it comes to repairing internal file system
structures? Probably the top 10 developers for each file system,
which leaves the other 99.99% of us with the -y switch. But
before we can understand the differences between fsck
implementations, we have to have some idea of what it does.
The most important job of fsck is to find out whether the
file system makes a consistent, correctly formatted whole. This is
not as simple as traversing the entire file system and checking, along
the way, that the metadata is readable.
fsck also has to do more involved cross-checks on the
metadata than simply reading it, and make sure that the parts of the
file system it believes are unused are in fact unused. This is the difference
between having a file system that is consistent enough to read, and
one that is consistent enough to write. A file system that
can be read may be chock-full of reference count bugs and errors which
will only cause trouble when the system attempts to actually change
the file system. A car may be in good enough repair to start and
idle, but then fall apart once it leaves the garage.
During consistency checking, fsck double-checks the metadata
describing which blocks and inodes are free, and which are allocated.
Usually, some sort of allocation bitmap or tree of extents is
maintained to speed up searching for free blocks or inodes -
otherwise, the file system would have to check every file to see if it
used a particular block, very slow going indeed. This bitmap is a
distilled copy of the metadata in individual block pointers or inodes
describing whether a block or inode is in use. The upside of this
second copy is speed (or lack of glacial slowness, more accurately);
the downside is possible inconsistency. If corruption occurs, the two
copies can disagree with each other, leading to further file system
corruption. The kinds of errors fsck looks for here are
double-use (a block with more than one pointer to it), leaked inodes
or blocks (an inode or block is marked as used but nothing refers to
it), and disagreement (a block pointer or directory entry refers to a
block or inode that is marked as free).
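As a rough sketch of how that cross-check can work, consider the toy
model below. It is plain Python with made-up structures - a set of
allocated block numbers standing in for the bitmap, and a per-inode
list of block pointers; nothing here corresponds to any real on-disk
format - and it reports exactly the three kinds of errors just
described:

    # Toy cross-check of a block allocation bitmap against block pointers.
    def cross_check(block_bitmap, inode_blocks):
        """block_bitmap: set of allocated block numbers.
        inode_blocks: dict mapping inode number -> list of block pointers."""
        referenced = {}                     # block -> inodes pointing at it
        for ino, blocks in inode_blocks.items():
            for blk in blocks:
                referenced.setdefault(blk, []).append(ino)
        for blk, owners in referenced.items():
            if len(owners) > 1:
                print(f"double-use: block {blk} claimed by inodes {owners}")
            if blk not in block_bitmap:
                print(f"disagreement: block {blk} referenced but marked free")
        for blk in block_bitmap - referenced.keys():
            print(f"leaked: block {blk} marked allocated but nothing refers to it")

    # Block 7 is doubly claimed, block 9 is leaked, block 3 is used but marked free.
    cross_check(block_bitmap={5, 6, 7, 8, 9},
                inode_blocks={11: [5, 7, 3], 12: [6, 7, 8]})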
Orphan inodes - inodes marked as allocated but not pointed to by any
directory entry - deserve extra discussion. Orphan inodes are
surprisingly common, due to a UNIX convention that allows a file to be
unlinked (removed from the directory tree) but still open. Many
programs create temporary files and unlink them in this way so they
are guaranteed to be deleted even if the program doesn't shut down
properly. The file system has the honor of implementing this
guarantee. Many modern file systems maintain some form of on-disk
delete queue - a list of inodes which need to be deleted when their
reference count drops to zero - so that they can be deleted quickly
after a crash, instead of searching the entire file system for orphan
inodes. Even
journaling file systems must kick-start this deletion after an unclean
unmount, though it is not crucial to using the file system
immediately.
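The unlink-while-open convention is easy to demonstrate from user
space. The sketch below (the path name is arbitrary) shows the pattern
that produces orphan inodes if the machine crashes at the wrong moment:

    import os

    path = "/tmp/scratch-demo"       # arbitrary example path
    f = open(path, "w+")
    os.unlink(path)                  # directory entry gone, link count now zero
    # The inode stays allocated as long as f is open; a crash at this point
    # leaves an orphan inode for fsck (or the delete queue) to reclaim.
    f.write("scratch data")
    f.seek(0)
    print(f.read())                  # still readable through the open descriptor
    f.close()                        # only now does the kernel free the inode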
Free/allocated consistency is particularly hard to check, especially
when it comes to blocks. Most file systems keep no back pointers from
a block to its owner, so the only way to find out whether a block is
really part of a file is to traverse the entire file system.
Detecting duplicate block allocations requires keeping a block
allocation bitmap and checking if a block is already marked before
marking a block as allocated. Fixing a duplicate allocation requires
knowing which inodes point to the block, and recording that for every
block can take a lot of memory; the ext2/3/4 fsck doesn't record this
information until it detects a duplicate block, at which point it
starts over and gathers it.
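The shape of that two-pass strategy is easy to sketch. The code below
illustrates the idea only, not e2fsck's actual data structures:

    # Pass 1 is cheap: it only remembers *that* a block was claimed twice.
    # The expensive inode-to-block map is built only if that ever happens.
    def find_duplicates(inode_blocks):
        seen, dups = set(), set()
        for blocks in inode_blocks.values():
            for blk in blocks:
                if blk in seen:
                    dups.add(blk)
                seen.add(blk)
        if not dups:
            return {}
        owners = {blk: [] for blk in dups}   # second pass, contested blocks only
        for ino, blocks in inode_blocks.items():
            for blk in blocks:
                if blk in dups:
                    owners[blk].append(ino)
        return owners

    print(find_duplicates({11: [5, 7], 12: [7, 8], 13: [9]}))   # {7: [11, 12]}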
UNIX file systems have the wonderful quality of allowing more than one
hard link to an inode (which can be a file or a directory). The inode is
not deleted until all the hard links are gone. Each inode must
maintain a link count, and fsck has to check that the number
of directory entries referencing an inode is exactly the same as the
link count. This is checked by walking the entire directory tree and
recording each link to an inode.
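A toy version of that walk is sketched below; a real fsck also has to
account for the "." and ".." entries that contribute to directory link
counts, which this model ignores, and the structures here are invented
for the example:

    def check_link_counts(dirs, link_counts):
        """dirs: directory inode -> {name: child inode}.
        link_counts: inode -> link count recorded in the inode."""
        found = {}
        for entries in dirs.values():
            for child in entries.values():
                found[child] = found.get(child, 0) + 1
        for ino, expected in link_counts.items():
            if found.get(ino, 0) != expected:
                print(f"inode {ino}: link count {expected}, "
                      f"but {found.get(ino, 0)} directory entries found")

    # Inode 20 is hard-linked from two directories but records a count of 1.
    check_link_counts(dirs={2: {"a": 10, "b": 20}, 10: {"c": 20}},
                      link_counts={10: 1, 20: 1})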
The structure of the directories in a file system has to obey certain
rules. No directory cycles can exist (e.g., directory A -> directory
B -> directory A), and each directory must be reachable from the root
directory of a file system.
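A single walk down from the root, plus a final pass over the directory
inodes, covers both rules; a minimal sketch follows, again with a toy
directory model and the root assumed to be inode 2 purely for the
example. In a correct UNIX file system a directory has exactly one
parent, so reaching a directory twice means either a cycle or an extra
hard link to it, both of which are errors:

    def check_tree(dirs, root=2):
        """dirs: directory inode -> {name: child directory inode}."""
        reachable, stack = set(), [root]
        while stack:
            d = stack.pop()
            if d in reachable:
                print(f"directory cycle (or extra link) at inode {d}")
                continue
            reachable.add(d)
            stack.extend(dirs.get(d, {}).values())
        for d in dirs:
            if d not in reachable:
                print(f"directory inode {d} not reachable from the root")

    # Directory 10 links back to the root (a cycle); 30 is cut off entirely.
    check_tree({2: {"home": 10}, 10: {"up": 2}, 30: {}})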
The above are the most important, generic UNIX rules for file system
consistency, but there are many more things to check. Each file
system then also needs to check the internal structure of its
metadata. For example, if the file system uses extents, the file
system must check that the extents of a file are correctly formatted
and refer to plausible blocks. The superblock and the summaries for
groups of blocks must be checked. Some file systems use B-trees
extensively and must check them for consistency too, and so forth.
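As one concrete example, a per-file extent check boils down to a few
simple invariants: each extent must stay inside the device, have a
positive length, and not overlap its neighbors. The sketch below uses
an invented (start, length) extent representation, not any real file
system's format:

    def check_extents(extents, total_blocks):
        prev_end = -1
        for start, length in sorted(extents):
            if length <= 0 or start < 0 or start + length > total_blocks:
                print(f"bad extent ({start}, {length})")
            elif start <= prev_end:
                print(f"extent ({start}, {length}) overlaps the previous one")
            prev_end = max(prev_end, start + length - 1)

    # One extent overlaps its neighbor, another runs past a 100-block device.
    check_extents([(10, 5), (12, 3), (95, 10)], total_blocks=100)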
One paper that may help with understanding some of the more subtle
issues of file system checking is Fast
Consistency Checking for the Solaris File System [PDF]. The authors
implement a scheme for fast fsck with relatively minor
changes to the Solaris UFS file system, in the process describing the
most difficult tasks in file system consistency checking.
Primordial fsck: check the file system and repair
in-progress updates
For the purposes of UNIX, the first fsck was designed for the
Fast File System. (Original
fsck paper in text gzipped format) As is well known, FFS
had no formal method of maintaining file system consistency if the
file system was not cleanly unmounted. (In fact, in the earliest days,
the operator had to sync the file system by hand before shutting the
system down.) Many write operations require writing more than one
block on disk. If a system crash occurred, some random subset of the
outstanding writes would be on disk, and the rest would not. When the
system booted again, the file system would be in an inconsistent state
and not usable - perhaps an inode had zero links to it, but was still
marked as allocated, and therefore could never be freed. As well,
corruption might occur for other reasons - a bad disk, or a file
system bug - and not be found until the whole file system was checked.
fsck in this earliest incarnation therefore did the following
things: it checked the whole file system for inconsistencies, both
from an unclean unmount and from other sources of corruption, and in the
process attempted to repair any inconsistencies it found. (Repair here
means, as it does in the rest of the article, returning the file
system to a usable consistent state, rather than to some platonic
ideal of what the file system would have been without the corruption.)
The majority of the inconsistencies were the result of an unclean
unmount, and the steps for fixing them were fairly well known. The
first use of fsck meant "check the file system and fix any
in-progress writes that didn't complete so that the file system can be
mounted." This is the use that carried over to the ext2 file system
in Linux.
fsck and journaling file systems
Running fsck after every unclean unmount was an unpleasant,
time-consuming, and dangerous experience. Many a sysadmin has
distinct memories of lines of unintelligible gobbledygook scrolling off
the screen, each ending with "Fix? <y>", and a sore
finger from holding down the enter key (this was before the
-y switch). The new journaling file systems, like XFS, VxFS,
Reiserfs, and ext3, made running fsck after an unclean
unmount unnecessary.
Journaling file systems keep an on-disk log of write operations to the
file system. When the entirety of a write operation is in the log,
the file system begins writing the changes to their final locations
on disk. If the system crashes or something else goes wrong, the
journal entry is still on disk at the next mount, and the file system
will finish replaying the entry, so that the entire self-consistent
set of metadata changes makes it to disk.
fsck no longer had to clean up after half-finished writes,
and the file system only had to replay the journal after an unclean
unmount.
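The replay step itself is simple to sketch. In the toy journal below
(no real journal format is being modeled), data records accumulate and
are applied to the "disk" only once their transaction's commit record
is seen; a transaction whose commit never made it to the log is simply
discarded:

    def replay(journal, disk):
        """journal: list of ('data', block, value) and ('commit', txid) records."""
        pending = []
        for record in journal:
            if record[0] == 'data':
                pending.append(record[1:])
            elif record[0] == 'commit':
                for block, value in pending:   # transaction complete: apply it
                    disk[block] = value
                pending = []
        return disk                            # anything left pending is ignored

    disk = {1: 'old', 2: 'old', 3: 'old'}
    journal = [('data', 1, 'new'), ('data', 2, 'new'), ('commit', 'tx1'),
               ('data', 3, 'new')]             # second transaction never committed
    print(replay(journal, disk))               # block 3 keeps its old contents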
Some file system developers initially took this to mean that no
fsck was needed at all. In part, this was true - the system
no longer needed to repair half-finished writes by scanning the entire
file system; it only had to replay the log. But fixing half-finished
writes was only one part of what fsck did. It also checked
for and repaired corruption caused by disk errors, file system bugs,
administrator error, and any other source. These sources of errors
are less common and can be ignored in development, but become a major
problem in production use. Nobody wanted to repair a journaling file
system by hand any more than any other file system. fsck in
the sense of "repair half-completed writes" is unnecessary for
journaling file systems (or copy-on-write file systems) but it is
still necessary in the sense of "check for and repair file system
corruption when something unexpected goes wrong."
The XFS developers decided to head off the fsck naming
confusion at the pass and created two commands, xfs_check,
which checks the file system for corruption, and xfs_repair,
which repairs corruption. The xfs_check man page immediately
clears up any confusion about when to run it:
xfs_check checks whether an XFS filesystem is consistent. It
is normally run only when there is reason to believe that the
filesystem has a consistency problem.
The Reiser version 3 file system, reiserfs, tried something radical
and new with its file system check and repair program. It had three
major modes: "check," "fix fixable," and "rebuild tree." It divided
file system corruption into two kinds: that which is easily fixable,
and that which is handled by throwing away most of the metadata and
rebuilding the entire file system tree using only the leaves as a
starting point (reiserfs puts all of the file system metadata and data
into one "balanced tree" structure). The file system repair program
only had to deal with a limited set of "easy" corruption repairs.
Anything harder just threw away all the "secondary" metadata that
could be conflicting and then did a brute force search for the
"primary" metadata - the leaves of the tree - and rebuilt a tree out
of them. The downside of this approach is that there is no
out-of-band signal to say which blocks are metadata and which are not,
so the tool used a magic number present in reiserfs metadata to decide
which blocks should be part of the tree. Unfortunately, regular file
data can contain this magic number, and one common use case was to
keep a reiserfs file system image in a file (to mount using the loop
device) on a reiserfs file system. The result was that such file
systems became trivially corrupted during a tree rebuild, since the
metadata leaves inside the loopback image were incorporated into the
parent file system's tree.
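The failure mode is easy to see in miniature. The sketch below scans a
flat "device" for blocks that start with a made-up magic value (the
value and block size are invented for illustration; this is not
reiserfs's format): a leaf stored inside a file system image file is
indistinguishable from a genuine leaf, so both get swept into the
rebuilt tree:

    MAGIC = b'LEAF'        # invented magic value, not reiserfs's real one
    BLOCK_SIZE = 16

    def scan_for_leaves(device):
        leaves = []
        for off in range(0, len(device), BLOCK_SIZE):
            if device[off:off + BLOCK_SIZE].startswith(MAGIC):
                leaves.append(off)             # no way to tell real leaves from
        return leaves                          # copies inside a file's data

    # Block 0 is a genuine leaf; block 2 is ordinary file data that happens
    # to be a leaf belonging to a file system image stored in a file.
    device = (MAGIC + b'real leaf...').ljust(16, b'\0') + \
             b'ordinary data...'.ljust(16, b'\0') + \
             (MAGIC + b'image leaf..').ljust(16, b'\0')
    print(scan_for_leaves(device))             # [0, 32] -- both are rebuilt in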
fsck and soft updates
Soft updates, implemented on FFS for BSD, introduced another meaning
of fsck. Soft updates is a method of recording and ordering
metadata writes to the disk so that if a system crash occurs, the file
system is consistent, with the exception of possible leaked inodes and
blocks. When the system boots after an unclean unmount, fsck
takes a snapshot of the file system (using an interesting file-based
copy-on-write mechanism) and checks it, looking for leaked inodes and blocks.
As soon as the snapshot is taken, the system goes forward with the
normal boot process, mounting the file system read-write. When
fsck finishes, it releases the leaked inodes and blocks it
found and lets go of its snapshot. Soft updates gave immediate access
to the file system after an unclean unmount, without changing the on-disk
format of the original FFS file system. fsck in this case
meant two things: search for and free leaked inodes and blocks, and
repair unexpected corruption.
fsck and copy-on-write file systems
Copy-on-write file systems use an atomic rewrite of the top block in
the file system hierarchy to switch between one consistent file system
state and another. Copy-on-write file systems may have some form of
logging, but it exists to record recent changes to the file system
quickly, not to guarantee consistency as the journal does in a
journaling file system. For example, Write
Anywhere File Layout (WAFL)
keeps a log of recent writes in an NVRAM device, and ZFS keeps an
intent log of recent operations. fsck for copy-on-write file
systems is then restricted to the role of checking for and repairing
unexpected, unlooked-for file system corruption. fsck is
only run as a paranoia check or in response to some sign of
corruption.
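The atomic-top-block idea fits in a few lines. The sketch below is a
generic illustration, not any particular file system's layout: new
copies of changed blocks always go to fresh locations, and a single
pointer update at the end is what switches the "live" tree from one
consistent version to the next:

    class CowStore:
        def __init__(self):
            self.blocks = {}          # block address -> contents
            self.root = None          # the one pointer ever updated in place
            self.next_addr = 0

        def _write(self, contents):   # always write to a fresh address
            addr, self.next_addr = self.next_addr, self.next_addr + 1
            self.blocks[addr] = contents
            return addr

        def commit(self, leaf_data):
            leaf = self._write(leaf_data)
            new_root = self._write({'leaf': leaf})
            self.root = new_root      # the atomic switch between tree versions

    fs = CowStore()
    fs.commit('version 1')
    old_root = fs.root
    fs.commit('version 2')
    # Had a crash occurred before the last line of commit(), root would still
    # point at the fully consistent "version 1" tree.
    print(fs.blocks[fs.blocks[fs.root]['leaf']])      # version 2
    print(fs.blocks[fs.blocks[old_root]['leaf']])     # version 1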
Not much information is available on the file system check and repair
tools for WAFL, other than that they exist. Searching for the file
system check and repair tool for WAFL, wafl_check, only gives
about 100 results from Google. The online consistency check tool is
named wafliron (ha!) and has about 100 results as well.
ZFS's file system check and repair facilities don't follow the usual
interface boundaries. The zdb command, used for debugging
ZFS, has an undocumented option which will cause it to traverse the
entire file system tree, checking checksums as it goes, for a basic
consistency check. (Undocumented, because, as the man
page says, "The zdb command is used by support engineers to
diagnose failures and gather statistics. Since the ZFS file system is
always consistent on disk and is self-repairing, zdb should only be
run under the direction [of] a support engineer.") Checks and fixes for
some problems the developers have observed in the wild are implemented
in-kernel. The best known of these in-kernel repair facilities is the
automatic repair of a damaged block with two copies, replacing the
copy which does not match the block's checksum with the good copy if
available. Since all metadata has at least two copies, this fixes
most metadata corruption (the exceptions include things like in-memory
block corruption). This collection of features definitely qualifies
as file system check and repair, but people will argue whether they
should be called fsck or not.
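The flavor of that in-kernel repair is easy to capture in a sketch.
The code below only models the idea - CRC32 stands in for the
checksum, and every structure is invented; this is not ZFS code: given
two copies of a block and a checksum stored separately in the parent,
a copy that fails the checksum is rewritten from one that passes:

    import zlib

    def read_with_repair(copies, expected_cksum):
        """copies: list of byte strings; expected_cksum: checksum kept by the parent."""
        good = None
        for data in copies:
            if zlib.crc32(data) == expected_cksum:
                good = data
                break
        if good is None:
            raise IOError("all copies fail their checksum; cannot self-repair")
        for i, data in enumerate(copies):
            if zlib.crc32(data) != expected_cksum:
                copies[i] = good               # rewrite the damaged copy
        return good

    block = b"metadata contents"
    copies = [block, b"metadata corrupted"]    # the second copy has been damaged
    print(read_with_repair(copies, zlib.crc32(block)))
    print(copies[1] == block)                  # True: the bad copy was repaired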
Which fsck do you mean?
We've seen fsck in all its infinite glory, everything from a
simple traversal of the file system metadata to groveling through the
entire file system cleaning up after a simple-minded file system.
Sometimes the names of the programs implementing file system check and
repair have improved on unpronounceable fsck
(xfs_repair), and sometimes they are just funny
(wafliron). One thing is for sure: fsck is an
overloaded word, with as many interpretations as there are listeners.
Until the file systems community comes up with new terminology, you'll
be best served by defining exactly what you mean by "fsck" - "file
system consistency check," "file system inconsistency repair," or
other unwieldy descriptions.
(Note to readers: Lots more kinds of fsck exist - for
example, I didn't cover any flash file systems, which tend to be
different in very interesting ways. Please add comments about other
kinds of fsck, or details on the ones described here. And of
course, your fsck war stories. - V.H.)