September 18, 2009
This article was contributed by Valerie Aurora (formerly Henson)
When you say "log-structured file system," most storage developers
will immediately think of Ousterhout and Rosenblum's classic paper,
The
Design and Implementation of a Log-structured File System - and
the nearly two decades of subsequent work attempting to solve the
nasty segment cleaner problem (see below) that came with it. Linux
developers might think of JFFS2, NILFS, or LogFS, three of several
modern log-structured file systems specialized for use with solid
state devices (SSDs). Few people, however, will think of SSD firmware. The
flash translation layer in a modern, full-featured SSD resembles a
log-structured file system in several important ways. Extrapolating
from log-structured file systems research lets us predict how to get
the best performance out of an SSD. In particular, full support for
the TRIM command, at both the SSD and file system levels, will be key for
sustaining long-term peak performance for most SSDs.
What is a log-structured file system?
Log-structured file systems, oddly enough, evolved from logging
file systems. A logging (or journaling) file system is a normal
write-in-place file system in the style of ext2 or FFS, just with a
log of write operations bolted on to the side of it. (We'll use the
term "journaling file system" in the rest of the paper to avoid
confusion between "logging" and "log-structured" file systems.) A
journaling file system keeps the on-disk state of the file system
consistent by writing a summary of each write operation to the log,
stored somewhere non-volatile like disk (or NVRAM if you have the
money), before writing the changes directly to their long-term place
in the file system. This summary, or log record, contains enough
information to repeat the entire operation if the direct write to the
file system gets interrupted mid-way through (e.g., by a system
crash). This operation is called replaying the log. So, in short,
every change to the file system gets written to disk twice:
once to the log, and once in the permanent location.
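To make the ordering concrete, here is a minimal sketch of the journaling write path in C. All of the names are invented for illustration, and a real journal (ext3's jbd, for instance) batches many operations into transactions rather than logging one write at a time:

    /* Sketch of the write ordering a journaling file system
     * enforces; hypothetical names throughout. */
    #include <stdio.h>

    struct log_record {
            long block_nr;          /* where the data will live long-term */
            char data[512];         /* the new contents of that block */
    };

    /* Stand-ins for writes to stable storage. */
    static void write_to_journal(const struct log_record *rec)
    {
            printf("journal: block %ld logged\n", rec->block_nr);
    }

    static void flush_journal(void)
    {
            printf("journal: flushed to stable storage\n");
    }

    static void write_in_place(const struct log_record *rec)
    {
            printf("disk: block %ld written in place\n", rec->block_nr);
    }

    /* The invariant: the log record must be durable *before* the
     * in-place write starts, so that a crash in the middle can be
     * repaired by replaying the record. */
    static void journaled_write(const struct log_record *rec)
    {
            write_to_journal(rec);
            flush_journal();        /* barrier: record is now recoverable */
            write_in_place(rec);    /* safe even if we crash part-way */
    }

    int main(void)
    {
            struct log_record rec = { .block_nr = 42 };

            journaled_write(&rec);
            return 0;
    }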
Around 1988, John K. Ousterhout and several collaborators realized
that they could skip the second write entirely if they treated the
entire file system as one enormous log. Instead of writing the
operation to the log and then rewriting the changes in place somewhere
else on the disk, the file system would write the data just once, to
the end of the log (wherever that happened to be), and be done with
it. Writes to existing files and
inodes are copy-on-write - the old version is marked as free space,
and the new version is written at the end of the log. Conceptually,
finding the current state of the file system is a matter of replaying
the log from beginning to end. In practice, a log-structured file
system writes checkpoints to disk periodically; these checkpoints describe the
state of the file system at that point in time without requiring any
log replay. Any changes to the file system after the checkpoint are
recovered by replaying the relatively small number of log entries
following the checkpoint.
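Here is a rough sketch of that recovery scheme, with invented types and a toy inode map; a real log-structured file system checkpoints considerably more state (inode maps, segment usage tables, and so on):

    /* Sketch of log-structured recovery: start from the last
     * checkpoint, then replay only the log entries written after
     * it. Hypothetical structures throughout. */
    #include <stdio.h>

    struct log_entry {
            long seq;               /* increasing sequence number */
            long inode;             /* which inode this entry updates */
            long new_block;         /* where its data now lives in the log */
    };

    struct checkpoint {
            long seq;               /* log position the checkpoint covers */
            long inode_map[4];      /* inode -> current block, as of seq */
    };

    static void recover(struct checkpoint *cp,
                        const struct log_entry *log, int n)
    {
            int i;

            /* Everything up to cp->seq is already reflected in the
             * checkpoint; only newer entries need replaying. */
            for (i = 0; i < n; i++) {
                    if (log[i].seq > cp->seq)
                            cp->inode_map[log[i].inode] = log[i].new_block;
            }
    }

    int main(void)
    {
            struct checkpoint cp = {
                    .seq = 2,
                    .inode_map = { 10, 11, 12, 13 },
            };
            struct log_entry log[] = {
                    { 1, 0, 20 }, { 2, 1, 21 },    /* covered by checkpoint */
                    { 3, 1, 30 }, { 4, 3, 31 },    /* replayed on recovery */
            };

            recover(&cp, log, 4);
            printf("inode 1 -> block %ld (replayed)\n", cp.inode_map[1]);
            return 0;
    }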
One of the interesting benefits of the log-structured file system (LFS)
structure is that most
writes to the file system are sequential. The section describing the
motivation for Sprite LFS, written nearly 20 years ago, demonstrates
how little has changed in the storage world:
Over the last decade CPU speeds have increased dramatically while disk
access times have only improved slowly. This trend is likely to
continue in the future and it will cause more and more applications to
become disk-bound. [...] Log-structured file systems are based on the
assumption that files are cached in main memory and that increasing
memory sizes will make the caches more and more effective at
satisfying read requests. As a result, disk traffic will become
dominated by writes.
But wait, why are we still talking about disk seeks? SSDs have
totally changed the performance characteristics of storage! Disks are
dead! Long live flash!
Surprisingly, log-structured file systems are more relevant than ever
when it comes to SSDs. The founding assumption of log-structured file
systems - that reads are cheap and writes are expensive - is
emphatically true for the bare-metal building blocks of
SSDs, NAND-based
flash. (For the rest of this article, "flash" refers to NAND-based
flash and SSD refers to a NAND-based flash device with a
wear-leveling, write-gathering flash translation layer.) When it comes
to flash, reads may be done at small granularities - a few hundred
bytes - but writes must be done in large contiguous blocks - on the
order of tens of thousands or hundreds of thousands of bytes. A write
to flash takes two steps: First, the entire block is erased, setting
all the bits to the same value (usually 1, counter-intuitively).
Second, individual bits in the block are flipped to 0 until the block
holds the data you want.
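The reason for the two-step dance is that programming can only flip bits in one direction. A toy model in C - erasing sets every bit in the block, programming is effectively a bitwise AND - shows why rewriting data in place is impossible without an erase:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4            /* toy erase block: four bytes */

    static uint8_t block[BLOCK_SIZE];

    static void erase_block(void)
    {
            memset(block, 0xFF, sizeof(block));     /* all bits -> 1 */
    }

    /* Programming can flip 1 -> 0 but never 0 -> 1; any write that
     * needs a 0 -> 1 transition requires erasing the whole block. */
    static int program_byte(int i, uint8_t value)
    {
            if ((block[i] & value) != value)
                    return -1;      /* would need an erase first */
            block[i] &= value;
            return 0;
    }

    int main(void)
    {
            erase_block();
            program_byte(0, 0xA5);          /* fine: only clears bits */
            if (program_byte(0, 0x5A))      /* needs some 0 -> 1 flips */
                    printf("rewrite refused: erase the block first\n");
            return 0;
    }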
Log-structured file systems turn out to be a natural fit for flash.
One of the details of the log-structured design is that the log is
written in large contiguous chunks, called "segments," on the
order of several megabytes in size. To cut down on metadata overhead
and get the best performance, log entries are gathered and written out
sequentially to a completely free segment. Most segments are
partially in use and partially free at any given time, so when the
file system needs a fresh segment, it first
cleans an existing partially-used segment by moving all the
in-use, or live, data elsewhere - basically, it
garbage-collects. Now that everything is arranged properly, the file
system can do one big streaming write to the empty segment. This
system of segments and cleaning is exactly what is needed to
efficiently write to a flash device, given the necessity to erase
large contiguous blocks of flash before writing to them.
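A segment cleaner reduced to its essentials might look like the sketch below (hypothetical structures; a real cleaner also has to decide which segment to clean, using cost-benefit policies like those in Sprite LFS):

    #include <stdio.h>
    #include <string.h>

    #define SEG_BLOCKS 4

    struct segment {
            long blocks[SEG_BLOCKS];
            int  live[SEG_BLOCKS];  /* 1 if the block is still in use */
            int  used;              /* blocks written so far */
    };

    /* Append one block to the segment currently being filled. */
    static void append(struct segment *to, long block)
    {
            to->blocks[to->used] = block;
            to->live[to->used] = 1;
            to->used++;
    }

    /* Move all live data out of 'victim' into the current head of
     * the log, then mark the victim completely free so it can take
     * one big streaming write. */
    static void clean_segment(struct segment *victim, struct segment *head)
    {
            int i;

            for (i = 0; i < victim->used; i++)
                    if (victim->live[i])
                            append(head, victim->blocks[i]);
            memset(victim, 0, sizeof(*victim));
    }

    int main(void)
    {
            struct segment victim = {
                    .blocks = { 100, 101, 102, 103 },
                    .live   = { 1, 0, 1, 0 },   /* half garbage */
                    .used   = 4,
            };
            struct segment head = { .used = 0 };

            clean_segment(&victim, &head);
            printf("moved %d live blocks; victim now empty\n", head.used);
            return 0;
    }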
The match between log-structured file systems and flash is obvious
when you look at file systems written for the bare flash programming
interface - that is, for devices without built-in wear-leveling or
write-gathering. File systems that know about and have to manage
erase blocks and other details of the flash hardware are almost
invariably log-structured in design. The most widely used such file
system for Linux is JFFS2, used in many embedded devices, such as
ticket machines and seatback airline entertainment systems. More than
once, I've boarded a plane and seen a JFFS2 error message reporting
flash corruption on a hung seatback entertainment system. (Sadly, many
thousands of people probably now associate the Tux penguin bootup logo
with the inability to watch TV on long distance flights.)
For SSDs that export a disk-style block interface - most
consumer-grade SSDs these days - the operating system uses a regular
file system to talk to the SSD via the block interface (that is, read
block #37 into this buffer, write this buffer into block #42, etc.).
However, this system still contains the logical equivalent of a
log-structured file system; it's just hidden inside the SSD. The
firmware that implements wear-leveling, write-gathering, and any other
features has to solve the same problems as a log-structured file
system.
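What might solving those problems look like? Here is a speculative sketch of the minimal bookkeeping - a logical-to-physical map in which every overwrite is redirected to a fresh flash page, leaving the old page behind as garbage for a cleaner to reclaim, just as a log-structured file system leaves dead data behind in old segments:

    #include <stdio.h>

    #define NUM_LBAS  8
    #define NUM_PAGES 16

    enum { PAGE_FREE, PAGE_LIVE, PAGE_GARBAGE };

    static int l2p[NUM_LBAS];       /* logical block -> physical page */
    static int page_state[NUM_PAGES];
    static int next_free;

    static void ftl_write(int lba)
    {
            if (l2p[lba] >= 0)
                    page_state[l2p[lba]] = PAGE_GARBAGE;
            l2p[lba] = next_free;   /* redirect to a fresh page */
            page_state[next_free] = PAGE_LIVE;
            next_free++;            /* a real FTL allocates pages
                                     * within erase blocks and must
                                     * eventually garbage-collect */
    }

    int main(void)
    {
            int i;

            for (i = 0; i < NUM_LBAS; i++)
                    l2p[i] = -1;

            ftl_write(3);
            ftl_write(3);           /* overwrite: old page is garbage */
            printf("lba 3 -> page %d; page 0 is %s\n", l2p[3],
                   page_state[0] == PAGE_GARBAGE ? "garbage" : "live");
            return 0;
    }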
Most SSD manufacturers refuse to reveal any details of their internal
firmware, but we can be fairly confident that it has a lot in common
with log-structured file systems. First, the only way to implement
efficient random writes is to buffer them and write them out to a
single erase block together. This requires clearing an erase block,
moving all the in-use blocks to another area, and keeping a mapping
between the logical location of blocks and their physical locations -
exactly what a log-structured file system does. Second, when we do
get SSD implementation details
from research
publications, they look like log-structured file systems. Third,
when we look at long-term performance testing of SSDs, we see the same
pattern of performance degradation over time that we do with
log-structured file systems. We'll talk about this in detail in the
next section.
Log-structured file system performance
Log-structured file systems are a natural fit for flash-based storage
today, but back in 1990, they appeared to have great potential for
disk-based file systems as well. Yet, as we all know, we're not using
log-structured file systems on our disk-based laptops and servers.
What happened?
In short, log-structured file systems performed relatively well as
long as most of the segment cleaning - movement of live data out of a
segment so it can be re-used - could be done in the background when
the file system wasn't busy with "real" work.
The first
major follow-up paper on LFS [PDF] found that LFS performance degraded by
up to 40% from the best case at real-world levels of disk utilization,
memory-to-disk ratio, and file system traffic. In short, in the
steady state the file system was spending a significant amount of disk
access time cleaning segments - moving old data out of a segment so it
could be used for new writes. This segment cleaning problem
was the subject of active research for at least another decade, but
none of the solutions could consistently beat state-of-the-art
write-in-place file systems at practical levels of disk utilization.
It's a little like comparing garbage collection to explicit
reference counting for memory management; when memory usage is low and
the occasional high-latency hit is okay, the convenience of garbage
collection outweighs the performance benefits of explicit
management. But at "high" levels
of disk utilization - as little as 50% - the cleaning cost and
periodic high latencies waiting for space to be freed up become a
problem.
As the
first LFS
paper showed, the key to good performance in a log-structured file
system is to place data such that nearly empty segments are created
about as quickly as they are used. The file system write bandwidth is
limited by the rate at which it can produce clean segments. The worst
case happens when, in a file system that is X% full, every segment is also X%
full. Producing one clean segment requires collecting the live data
from:
N = ceiling(1/(1 - X))
segments (where X is expressed as a fraction) and writing out the
live data to N - 1 of those segments. For a disk
utilization of 80%, we get:
N = ceiling(1/(1 - 0.80)) = ceiling(1/0.20) = 5
segments to clean.
If segments were 1MB in size, we'd have to read
5 * 800KB = 4MB
of data seekily and write 4MB sequentially before we could
write 1MB of new data. (Note to pedants: I'm using MB/KB in powers of
10, not 2).
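The arithmetic generalizes easily. A small C helper (assuming the worst case above, where every segment is exactly as full as the file system; integer math avoids floating-point ceiling surprises) shows how quickly the cleaning cost grows with utilization:

    #include <stdio.h>

    /* How many segments, each pct% full of live data, must be
     * cleaned to free one whole segment, and how much live data
     * gets moved in the process. */
    static void cleaning_cost(int pct, int seg_kb)
    {
            int free_pct = 100 - pct;
            int n = (100 + free_pct - 1) / free_pct;  /* ceiling */
            int moved_kb = n * pct * seg_kb / 100;

            printf("%d%% full: clean %d segments, move %d KB "
                   "to free %d KB\n", pct, n, moved_kb, seg_kb);
    }

    int main(void)
    {
            cleaning_cost(50, 1000);    /* 2 segments, 1000 KB moved */
            cleaning_cost(80, 1000);    /* 5 segments, 4000 KB moved */
            cleaning_cost(90, 1000);    /* 10 segments, 9000 KB moved */
            return 0;
    }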
The best case, instead, is a file system with two kinds of
segments, completely full and completely empty. The best case write
pattern is one that changes all of the metadata and data in a single
segment, so that when the new versions are written out, the old
versions are freed and the entire segment becomes free again. Reality
lies somewhere between these two cases. The goal for a log-structured
file system is to create a bimodal segment usage distribution: Most
segments are either very full or very empty, and full segments tend to
be unchanged. This turns out to be difficult to achieve.
SSDs have an extra interesting constraint: wear-leveling. Even in the
best case in which most segments are 100% full and no writes ever
change the data in them, the SSD must still move those segments around
occasionally because it has to spread writes out over every available
flash block. This adds an extra segment move in some cases and makes
achieving good performance even harder than in a disk-based
log-structured file system.
Lessons - learned?
It's great that SSD manufacturers can learn from two decades of prior
work on log-structured file systems. What's not clear is whether they
are doing so. Most manufacturers take a very closed approach to SSD
firmware development - it's the secret sauce that turns cheap
commodity flash with very low margins into extremely expensive,
reliable, high-performance storage devices with high margins. Some
manufacturers
are
clearly
better at this task than others. Currently, manufacturers are
taking the trade secret strategy for maintaining their competitive
advantage - apply for patents on individual elements of the design,
but keep the overall implementation a secret. The message to file
systems developers is "Just trust us" and "Don't worry your pretty
little systems programmers' heads about it" whenever we ask for more
information on SSD implementation. You can't particularly argue with
this strategy at present, but it tends to come from (and reinforce) the
mindset that not only refuses to share information with the outside, but
also ignores information from the outside, such as previously
published academic work.
One of the greatest missed opportunities for optimization based on
lessons learned from log-structured file systems is the slow adoption
of TRIM
support for SSDs. TRIM is a command to a block device informing it
that a certain range of blocks is no longer in use by the file system
- basically a free() call for blocks. As described
earlier, the best performance comes when empty segments are created as
a side effect of ongoing writes. As a simple example, imagine a
segment that contains only a single inode and all of its file data.
If the next set of writes to the file system overwrites all of the
file data (and the inode as a side effect), then that segment becomes
completely free and the file system doesn't have to move any live data
around before it uses that segment again. The equivalent action for
an SSD is to write to a block that has already been written in the
past. Internally, the SSD knows that the old copy of that block is
now free, and it can reuse it without copying its data elsewhere.
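In terms of the speculative FTL sketch from earlier, a TRIM lets the device mark a page as garbage without any new write arriving - the same information a log-structured file system gets for free when a file is deleted:

    #include <stdio.h>

    #define NUM_LBAS  8
    #define NUM_PAGES 16

    enum { PAGE_FREE, PAGE_LIVE, PAGE_GARBAGE };

    static int l2p[NUM_LBAS];
    static int page_state[NUM_PAGES];

    /* Without this call, the device has no way to learn that the
     * block's contents are dead, and must copy them around forever. */
    static void ftl_trim(int lba)
    {
            if (l2p[lba] >= 0) {
                    page_state[l2p[lba]] = PAGE_GARBAGE;
                    l2p[lba] = -1;
            }
    }

    int main(void)
    {
            int i;

            for (i = 0; i < NUM_LBAS; i++)
                    l2p[i] = -1;

            l2p[3] = 0;             /* pretend lba 3 lives in page 0 */
            page_state[0] = PAGE_LIVE;

            ftl_trim(3);            /* file deleted: tell the device */
            printf("page 0 is %s\n",
                   page_state[0] == PAGE_GARBAGE ? "garbage" : "live");
            return 0;
    }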
But log-structured file systems have a distinct advantage over
pre-TRIM SSDs (basically all commercially available SSDs as of now,
September 2009). Log-structured file systems know when on-disk data
has been freed even when it isn't overwritten. Consider the case of
deleting the one-segment file: the entire segment is freed, but no
overwrite occurred. A log-structured file system knows that this
happened and now has a free segment to work with. All the SSD sees is
a couple of tiny writes to other blocks on the disk. As far as it's
concerned, the blocks used by the now-deleted file are still precious
data in use by the file system, and it must continue to move that data
around forever. Once every block in the device has been written at
least once, the SSD is doomed to a worst case performance state in
which its spare blocks are at a minimum and data must be moved each
time a new block is rotated into use.
As we've seen, the key to good performance in a log-structured file
system is the availability of free or nearly-free segments. An SSD
without TRIM support only learns that blocks are free when they are
overwritten; it accrues an immense performance disadvantage, which
makes it somewhat
shocking that any SSD ever shipped without the TRIM feature. My guess
is that SSDs were initially performance tested only with
write-in-place file systems (cough, cough, NTFS) and low total file
system usage (say, 70% or less).
Unfortunately, TRIM in its current form is both designed and implemented to
perform incredibly poorly: TRIM commands aren't tagged and at least one
SSD takes hundreds of milliseconds to process a TRIM command.
Kernel developers have debated exactly how to implement TRIM support
at the Linux Plumbers
Conference, at
the Linux
Storage and File System Workshop, and on mailing lists: what the
performance cost of each TRIM is, what granularity TRIMs should have,
how often they should be issued, and whether it's okay to forget or miss
TRIM commands. In my opinion, the in-use/free state of a block on a
TRIM-enabled device should be tracked as carefully as that of a page
of memory. The file system implementation can take the form of
explicit synchronous alloc()/free() calls, or else
asynchronous garbage collection (during a file system check or
scrubbing run), but we shouldn't "leak" in-use blocks for all the same
reasons we don't leak memory.
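As a sketch of that discipline - with an entirely invented interface - freeing a block and telling the device about it would be inseparable, the same way free() both updates the allocator's state and makes the memory reusable:

    #include <stdio.h>

    #define NUM_BLOCKS 8

    static int in_use[NUM_BLOCKS];

    static void device_trim(long block)
    {
            /* stand-in for issuing a real TRIM command */
            printf("TRIM block %ld\n", block);
    }

    static long block_alloc(void)
    {
            long b;

            for (b = 0; b < NUM_BLOCKS; b++) {
                    if (!in_use[b]) {
                            in_use[b] = 1;
                            return b;
                    }
            }
            return -1;              /* no space */
    }

    static void block_free(long b)
    {
            in_use[b] = 0;
            device_trim(b);         /* freeing without trimming "leaks"
                                     * the block from the device's
                                     * point of view */
    }

    int main(void)
    {
            long b = block_alloc();

            if (b >= 0)
                    block_free(b);
            return 0;
    }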
Additionally, in an ideal world, TRIM would be redesigned or replaced by a
command that is a full-featured, well-designed first-class citizen in the
ATA spec, rather than a hack bolted on after the fact.
Of course, all this is speculation in the absence of implementation
details from the SSD manufacturers. Perhaps some SSD firmware
programmers have come up with entirely new algorithms for remapping
and write-gathering that don't resemble log-structured file systems at
all, and the performance characteristics and optimizations we have
seen so far just happen to match those for log-structured file
systems. However, so far it appears that treating an SSD as though it
were backed by a log-structured file system is a good rule of thumb
for getting good performance. Full TRIM support by both SSDs and file
systems will be key to long-term good performance.