Rotating magnetic storage technology has served us well for a long time.
It offers high capacities (for an ever-increasing value of "high"),
relatively fast and relatively uniform access times, and relatively good
reliability. It is generally accepted that rotating disks will be part of
our systems for some time yet. For smaller sizes, however, disks are
increasingly being pushed aside by solid state flash memory - and "smaller"
is an ever-increasing value as well. Flash is more compact, requires less
power, and offers truly random access, so it's not surprising to see it
being deployed in more situations.
Flash is not without its drawbacks. Its relatively high cost limits its
applications and it brings its own set of quirks which must be understood
and addressed by filesystem developers. Even so, some special-purpose
laptops rely on flash for their persistent storage needs now, and there are
rumors of more flash-based systems in the near future.
The most significant of the "quirks" mentioned above are:
- Flash storage cannot be simply overwritten like magnetic storage;
instead, a flash block must be explicitly erased and rewritten in two
separate steps. The size of the "erase blocks" may not match the
block size as understood by the operating system; often, the erase
blocks are relatively large.
- There are limits to the number of times a block of flash memory can be
erased and rewritten before it loses the ability to reliably store
data. That limit is generally around 100,000 cycles.
These hardware features have some interesting implications. What, for
example, happens when the operating system decides to rewrite a single
block within a larger flash erase block? A naive implementation would read
the entire erase block, perform the erase operation, then write the data
back with the new portion included. Should the system go down in the
middle of this operation, however, all of the data within the erase block
may be lost forever. If the operating system ignores the block lifetime
issues, it is likely to cycle some erase blocks much more frequently than
others, significantly shortening the overall life of the device. When one
is dealing with a low-duty-cycle device, such as a USB thumb drive, it's
possible to get away with ignoring the limitations that flash has. When a
flash drive is the primary storage device, though, a smarter approach is
Being smarter is usually a matter of using a filesystem which was
explicitly designed to work well with flash hardware. These filesystems
can dispense with the great care that other filesystems must take in how
blocks are laid out - there are no seek time or rotational latency issues
with flash drives. On the other hand, flash-aware filesystems must be
written with erase cycles in mind; they must not risk losing data during
these cycles and they should endeavor to spread these cycles across the
drive to maximize its lifetime.
The end result is that filesystems designed for flash devices take the
log-structured approach. The device is treated like a sort of circular
buffer, with new data always being written to the end. This approach makes
for fast write operations, but the read side can be a more complex story.
One approach taken is to attach some metadata to each erase block describing
which file that block belongs to and its version number. When an erase
block is to be rewritten, a new copy is made at the end with a higher
version number; reading the file is simply a matter of finding the erase
block with the highest version number.
Finding that block requires scanning the disk - something which, most
likely, one does not want to do for every read operation. The in-kernel
JFFS2 filesystem solves this problem by performing a scan when the
filesystem is mounted. It builds an in-memory data structure which speeds
subsequent accesses considerably. There is a cost, though: the initial
scan can make mounting slow, and the in-memory tree can take a considerable
amount of space. Given that flash filesystems are often used on small,
embedded systems - where both boot time and memory are at a premium - these
costs are significant.
Jörn Engel thinks he has a better way in the form of the LogFS filesystem, currently
proposed for inclusion into the mainline. The core idea behind LogFS is
that, rather than building the filesystem tree at mount time, the
filesystem code should store the tree
on the device itself, much like traditional filesystems do. Putting
the tree on the flash device reduces mount times (Jörn says that an
OLPC system goes from 3.3 seconds under JFFS2 to 60ms under LogFS) and
should reduce the runtime memory requirements considerably.
The on-flash tree looks much like the structure used by ext2. There are
some differences in how it is managed, however. The log structure of the
filesystem implies that blocks cannot be rewritten in place; any time a
block is changed it must be moved and written to a new location. If there
are pointers to the moved block (think about the usual indirect blocks used
to store the layout of larger files), the blocks containing the pointers
must also be changed, and thus moved. That, in turn, will require changes
at the next level up in the tree. Thus changes at the bottom of the tree
will propagate upward all the way to the root. This is the "wandering
tree" algorithm. One of the advantages is that the old filesystem
structure remains valid until the root is rewritten - a crash could cause
the loss of the last operation, but it will leave previous data and the
structure of the filesystem intact.
Actually managing the entire directory tree as a wandering tree would be
expensive; beyond that, files with multiple hard links break the tree
structure and make wandering trees much harder to implement. So the actual
tree implemented by LogFS just has two levels. There is an "inode file"
containing the inode structures for every file and directory existing
within the filesystem; each inode then points to the associated blocks
holding the file's data. Directory entries contain a simple integer index
giving the inode offset within the inode file. So changes to an inode only
require writing the inode itself and the inode file; the rest of the
directory structure need not be touched.
To tie it all together, LogFS sets aside a group of blocks as the "anchor
area," where versioned pointers to the root inode can be found. Mounting
the filesystem requires scanning this anchor area to find the current
version of the root inode, at which point the rest of the filesystem
becomes accessible. This mechanism allows the root to be found in constant
time without the need to scan the entire device.
LogFS has been through a couple rounds of review, with significant changes
each time. Barring significant problems, it should be getting close to
ready, perhaps it will be merged in time for 2.6.23.
(See also: Jörn's
LogFS paper from which much of the above was cribbed).
to post comments)