By Jonathan Corbet
May 14, 2012
Flash-based solid-state storage devices (SSDs) have a lot to recommend
them; in particular, they can be quite fast even when faced with highly
non-sequential I/O patterns. But SSDs are also relatively small and
expensive; for all their virtues, they will not fully replace
rotating storage devices for a long time. It would be nice to
have a storage device that provided the best features of both SSDs and
rotating devices—the speed of flash combined with the cheap storage
capacity of traditional drives. Such a device could simultaneously reduce
the performance pain that comes with rotating storage and the financial
pain associated with solid-state storage.
The classic computer science response to such a problem is to add another
level of indirection in the form of another layer of caching. In
this case, a large array of drives could be hidden behind a much smaller
SSD-based cache that provides quick access to frequently-accessed data and
turns random access patterns into something closer to sequential access.
Hybrid drives and high-end storage arrays have provided this kind of
feature for some time, but Linux does not currently have the ability to
construct such two-level drives from independent components. That
situation could change, though, if the bcache patch set finds its way into the
mainline.
LWN last looked at bcache almost two years
ago. Since then, the project has been relatively quiet, but development
has continued. With the current v13 patch set, bcache creator Kent Overstreet
says:
Bcache is solid, production ready code. There are still bugs being
found that affect specific configurations, but there haven't been
any major issues found in awhile - it's well past time I started
working on getting it into mainline.
The idea behind bcache is relatively straightforward: given an SSD and one
or more storage devices, bcache will interpose the SSD between the kernel
and those devices, using the SSD to speed I/O operations to and from the
underlying "backing store" devices. If a read request can be satisfied
from the SSD, the backing store need not be involved at all. Depending on
its configuration, bcache can also buffer write operations; in this mode,
it serves as a sort of extended I/O scheduler, reordering operations so
that they can be sent to the backing device in a more seek-friendly manner.
Once one gets into the details, though, the problem starts to become more
complex than one might imagine.
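To make the idea concrete, the read path can be sketched in a few lines of
user-space C. This is purely an illustration of the caching concept; the
names, sizes, and data structures below are invented for the example and
bear no resemblance to the actual bcache code:

    #include <stdio.h>
    #include <string.h>

    /* A toy model: the "SSD" caches a handful of 16-byte blocks by number. */
    #define CACHE_SLOTS 4
    struct cache_entry { long block; char data[16]; int valid; };
    static struct cache_entry ssd[CACHE_SLOTS];

    /* Stand-in for the slow backing device: just synthesize block contents. */
    static void backing_read(long block, char *out)
    {
        snprintf(out, 16, "block-%ld", block);
    }

    /* Conceptual read path: try the SSD first; on a miss, go to the backing
     * device and populate the cache so the next read is fast. */
    static void cached_read(long block, char *out)
    {
        struct cache_entry *slot = &ssd[block % CACHE_SLOTS];

        if (slot->valid && slot->block == block) {  /* hit: backing store untouched */
            memcpy(out, slot->data, 16);
            return;
        }
        backing_read(block, out);                   /* miss: the slow path */
        slot->block = block;
        memcpy(slot->data, out, 16);
        slot->valid = 1;
    }

    int main(void)
    {
        char buf[16];

        cached_read(42, buf);   /* miss: hits the backing device */
        cached_read(42, buf);   /* hit: served entirely from the "SSD" */
        printf("%s\n", buf);
        return 0;
    }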
Consider the buffering and reordering of write operations, for example.
Some users may be uncomfortable with anything that delays the arrival of
data on the backing device; for such situations, bcache can be run in a
write-through caching mode. When write-through behavior is selected, no
write operation is considered to be complete until it has made it to the
backing device. Clearly, in this case, the SSD cache is not going to
improve write performance at all, though it may still improve performance
overall if that data is read while it remains in the cache.
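In the same toy spirit (invented names again, nothing resembling bcache's
internals), write-through behavior comes down to copying the data into the
cache for the benefit of later reads while withholding the completion until
the slow device has it:

    #include <stdio.h>

    #define NBLOCKS 8

    static char ssd[NBLOCKS][16];      /* the cache device */
    static char backing[NBLOCKS][16];  /* the slow backing device */

    /* Write-through: the cache gets a copy, but the write is not reported
     * complete until the backing device has the data, so the SSD never
     * holds the only copy. */
    static void write_through(int nr, const char *data)
    {
        snprintf(ssd[nr], sizeof(ssd[nr]), "%s", data);
        snprintf(backing[nr], sizeof(backing[nr]), "%s", data);
        /* only at this point would the I/O be completed to the caller */
    }

    int main(void)
    {
        write_through(3, "metadata");
        printf("ssd: %s, backing: %s\n", ssd[3], backing[3]);
        return 0;
    }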
If, instead, writeback caching is enabled, bcache will signal the completion
of writes once they make it to the SSD. It can then flush those dirty blocks
out to the backing device at its leisure. Writeback caching can allow the
system to coalesce multiple writes to the same blocks and to achieve better
on-disk locality when the writes are eventually flushed out; both of those
should improve performance. Obviously, writeback caching also carries the
risk of losing data if the system is struck by a meteorite before the
writeback operation is complete. Bcache includes a fair amount of code
meant to address this concern; the SSD contains an index as well as the
cached data, so dirty blocks can be located and written back after the
system comes back up. Providing meteorite-proof drives is beyond the scope
of the bcache patch set, though.
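A similarly hypothetical sketch shows where writeback's performance win and
its risk both come from: the write completes as soon as the SSD has the
data, and a dirty index (kept, in real bcache, on the SSD itself) makes
later flushing possible:

    #include <stdio.h>
    #include <stdbool.h>

    #define NBLOCKS 8

    static char ssd[NBLOCKS][16];      /* the cache device */
    static bool dirty[NBLOCKS];        /* which cached blocks still need writeback */
    static char backing[NBLOCKS][16];  /* the slow backing device */

    /* Writeback: the write completes as soon as the SSD has the data; the
     * block is marked dirty so it can be found and flushed later. */
    static void write_back(int nr, const char *data)
    {
        snprintf(ssd[nr], sizeof(ssd[nr]), "%s", data);
        dirty[nr] = true;
        /* the I/O is reported complete here, before the backing device is touched */
    }

    /* Flush dirty blocks out at leisure, ideally sorted into a seek-friendly
     * order (trivially so in this toy model). */
    static void flush_dirty(void)
    {
        for (int nr = 0; nr < NBLOCKS; nr++)
            if (dirty[nr]) {
                snprintf(backing[nr], sizeof(backing[nr]), "%s", ssd[nr]);
                dirty[nr] = false;
            }
    }

    int main(void)
    {
        write_back(2, "wb-data");
        printf("before flush, backing holds \"%s\"\n", backing[2]);
        flush_dirty();
        printf("after flush, backing holds \"%s\"\n", backing[2]);
        return 0;
    }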
Of course, maintaining this index on the SSD has some performance costs of
its own, especially since bcache takes pains to only write full erase
blocks at a time. One write operation from the kernel can turn into
several operations at the SSD level to ensure that the on-SSD data
structures are consistent at all times. To mitigate this cost, bcache
provides an optional journaling layer that can speed up operations at the
SSD level.
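The benefit of such a journal can be shown with one more contrived example:
rather than rewriting the index (and thus whole erase blocks) for every
change, updates are appended to a small log and folded back in batches.
This is a sketch of the general technique only, not of how bcache's journal
actually works:

    #include <stdio.h>
    #include <stdbool.h>

    #define NBLOCKS 16
    #define JOURNAL_SLOTS 4

    /* The "index": expensive to update, since changing it means rewriting
     * full erase blocks on the SSD. */
    static bool index_dirty[NBLOCKS];
    static int erase_block_rewrites;

    /* A small append-only journal of pending index changes. */
    static struct { int block; bool dirty; } journal[JOURNAL_SLOTS];
    static int journal_len;

    /* Record an index change cheaply; only fold the journal back into the
     * index (paying the erase-block cost) when the journal fills up. */
    static void index_update(int block, bool dirty)
    {
        journal[journal_len].block = block;
        journal[journal_len].dirty = dirty;
        if (++journal_len == JOURNAL_SLOTS) {
            for (int i = 0; i < journal_len; i++)
                index_dirty[journal[i].block] = journal[i].dirty;
            erase_block_rewrites++;   /* stands in for the expensive writes */
            journal_len = 0;
        }
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)
            index_update(i, true);
        /* Eight index changes, but only two expensive rewrites. */
        printf("erase-block rewrites: %d\n", erase_block_rewrites);
        return 0;
    }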
Another interesting problem that comes with writeback caching is the
implementation of barrier operations. Filesystems use barriers
(implemented as synchronous "force to media" operations in contemporary
kernels) to ensure that the on-disk filesystem structure is consistent at
all times. If bcache does not recognize and implement those barriers, it
runs the risk of wrecking the filesystem's careful ordering of operations
and corrupting things on the backing device. Unfortunately,
bcache does indeed lack such support at the moment, leading to a strong
recommendation to mount filesystems with barriers disabled for now.
Multi-layer solutions like bcache must face another hazard: what happens if
somebody accesses the underlying backing device directly, routing around
bcache? Such access could result in filesystem corruption. Bcache handles
this possibility by requiring exclusive access to the backing device. That
device is formatted with a special marker, and its leading blocks are
hidden when accessing the device by way of bcache. Thus, the beginning of
the device under bcache is not the same as the beginning when the device is
accessed directly. That means that a filesystem created through bcache
will not be recognized by the filesystem code if an attempt is made to
mount the backing device directly. Simple attempts to shoot one's own feet
should be defeated by this mechanism; as always, there is little point in
doing more to protect those who are really determined to injure themselves.
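The offset trick can be illustrated with one last invented structure; the
real bcache superblock is rather more involved, but the principle is simply
that requests arriving through bcache are shifted past the marker at the
start of the raw device:

    #include <stdio.h>

    /* Illustrative only: a marker at the start of the backing device, with
     * the cached data beginning some number of sectors further in. */
    struct fake_super {
        char magic[16];
        unsigned long long data_offset;  /* sectors hidden from bcache users */
    };

    /* A request addressed to sector 0 of the visible bcache device lands
     * this far into the raw backing device. */
    static unsigned long long to_backing_sector(const struct fake_super *sb,
                                                unsigned long long sector)
    {
        return sector + sb->data_offset;
    }

    int main(void)
    {
        struct fake_super sb = { .magic = "fake-bcache-sb", .data_offset = 16 };

        /* A filesystem created through bcache thus starts at an offset and
         * will not be recognized if the raw device is mounted directly. */
        printf("marker \"%s\": bcache sector 0 -> raw sector %llu\n",
               sb.magic, to_backing_sector(&sb, 0));
        return 0;
    }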
There seems to be a reasonable level of consensus that bcache would be a
useful addition to the kernel. There are some obstacles to
overcome before this code can be merged, though. One of those is that
bcache adds its own management interface involving a set of dedicated tools
and a complex sysfs structure. There is resistance to adding another API
for block device management, so Kent has been encouraged to integrate
bcache into the device mapper code. Nobody seems to be working on that
project at the moment, but Dan Williams has posted a set of patches integrating bcache into the
MD RAID layer. With these patches, a simple mdadm command is
sufficient to set up an array with SSD caching added on top. Once that
code gets into shape, presumably the user-space interface concerns will be
somewhat lessened.
A harder problem to get around may be the simple fact that the bcache patch
set is large, adding over 15,000 lines of code to the kernel. Included
therein is a fair amount of tricky data structure work such as a complex
btree implementation and "closures," described as "asynchronous refcounty
things based on workqueues." The complexity of the code will make
it hard to review, but, given the potential for trouble when adding a new
stage to the block I/O path, developers will want this code to be well
reviewed indeed. Getting enough eyeballs directed toward this code could
be a challenge, but the benefit, in the form of faster storage devices,
could well be worth the trouble.