A bcache update
The classic computer science response to such a problem is to add another level of indirection in the form of another layer of caching. In this case, a large array of drives could be hidden behind a much smaller SSD-based cache that provides quick access to frequently-accessed data and turns random access patterns into something closer to sequential access. Hybrid drives and high-end storage arrays have provided this kind of feature for some time, but Linux does not currently have the ability to construct such two-level drives from independent components. That situation could change, though, if the bcache patch set finds its way into the mainline.
LWN last looked at bcache almost two years ago. Since then, the project has been relatively quiet, but development has continued. With the current v13 patch set, bcache creator Kent Overstreet says:
The idea behind bcache is relatively straightforward: given an SSD and one or more storage devices, bcache will interpose the SSD between the kernel and those devices, using the SSD to speed I/O operations to and from the underlying "backing store" devices. If a read request can be satisfied from the SSD, the backing store need not be involved at all. Depending on its configuration, bcache can also buffer write operations; in this mode, it serves as a sort of extended I/O scheduler, reordering operations so that they can be sent to the backing device in a more seek-friendly manner. Once one gets into the details, though, the problem starts to become more complex than one might imagine.
Consider the buffering and reordering of write operations, for example. Some users may be uncomfortable with anything that delays the arrival of data on the backing device; for such situations, bcache can be run in a write-through caching mode. When write-through behavior is selected, no write operation is considered to be complete until it has made it to the backing device. Clearly, in this case, the SSD cache is not going to improve write performance at all, though it may still improve performance overall if that data is read while it remains in the cache.
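The write-through contract described above can be sketched with a toy in-memory model (the class and its fields are invented for illustration; this is not bcache's actual code):

```python
# Write-through caching: a write is reported complete only after the
# backing store has it; the SSD copy exists purely to speed up reads.
class WriteThroughCache:
    def __init__(self):
        self.ssd = {}       # block -> data (the cache)
        self.backing = {}   # block -> data (the authoritative store)

    def write(self, block, data):
        self.backing[block] = data   # must land on the backing store first
        self.ssd[block] = data       # then populate the cache
        return True                  # completion is reported only now

    def read(self, block):
        if block in self.ssd:        # cache hit: backing store untouched
            return self.ssd[block]
        data = self.backing[block]   # miss: read from backing store
        self.ssd[block] = data       # and promote into the cache
        return data
```

As the article notes, `write()` gains nothing here over the raw device; the benefit is that a subsequent `read()` of the same block never touches the backing store.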
If, instead, writeback caching is enabled, bcache will mark the completion of writes once they make it to the SSD. It can then flush those dirty blocks out to the backing device at its leisure. Writeback caching can allow the system to coalesce multiple writes to the same blocks and to achieve better on-disk locality when the writes are eventually flushed out; both of those should improve performance. Obviously, writeback caching also carries the risk of losing data if the system is struck by a meteorite before the writeback operation is complete. Bcache includes a fair amount of code meant to address this concern; the SSD contains an index as well as the cached data, so dirty blocks can be located and written back after the system comes back up. Providing meteorite-proof drives is beyond the scope of the bcache patch set, though.
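A similarly hypothetical sketch of writeback mode shows why the index must live on the SSD alongside the data: it is what recovery consults after a crash to find the dirty blocks (all names invented; real bcache keeps a btree, not a Python set):

```python
# Writeback caching: writes complete once on the SSD; a dirty-block
# index, itself stored on the SSD, lets recovery locate unflushed data.
class WritebackCache:
    def __init__(self):
        self.ssd = {}        # block -> data, on the SSD
        self.dirty = set()   # on-SSD index of blocks not yet written back
        self.backing = {}

    def write(self, block, data):
        self.ssd[block] = data
        self.dirty.add(block)   # index updated before completion is reported
        return True

    def flush(self):
        # Flush in sorted block order so the backing device sees a
        # seek-friendly, near-sequential access pattern.
        for block in sorted(self.dirty):
            self.backing[block] = self.ssd[block]
        self.dirty.clear()

    def recover(self):
        # After a crash, the surviving on-SSD index says exactly which
        # blocks still need writing back.
        return sorted(self.dirty)
```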
Of course, maintaining this index on the SSD has some performance costs of its own, especially since bcache takes pains to only write full erase blocks at a time. One write operation from the kernel can turn into several operations at the SSD level to ensure that the on-SSD data structures are consistent at all times. To mitigate this cost, bcache provides an optional journaling layer that can speed up operations at the SSD level.
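The cost being mitigated can be illustrated with a toy journal that batches index updates into erase-block-sized writes; the batch size and structure here are invented, not bcache's real geometry:

```python
ERASE_BLOCK = 8  # index updates per erase block (illustrative number)

class Journal:
    def __init__(self):
        self.pending = []    # journalled index updates not yet on the SSD
        self.ssd_writes = 0  # count of full erase-block writes issued

    def log(self, update):
        # Without a journal, each index update could force its own
        # SSD-level write; with one, updates accumulate until a full
        # erase block can be written in a single operation.
        self.pending.append(update)
        if len(self.pending) == ERASE_BLOCK:
            self.ssd_writes += 1
            self.pending.clear()
```

Twenty logged updates cost only two erase-block writes in this model, rather than twenty.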
Another interesting problem that comes with writeback caching is the implementation of barrier operations. Filesystems use barriers (implemented as synchronous "force to media" operations in contemporary kernels) to ensure that the on-disk filesystem structure is consistent at all times. If bcache does not recognize and implement those barriers, it runs the risk of wrecking the filesystem's careful ordering of operations and corrupting things on the backing device. Unfortunately, bcache does indeed lack such support at the moment, leading to a strong recommendation to mount filesystems with barriers disabled for now.
Multi-layer solutions like bcache must face another hazard: what happens if somebody accesses the underlying backing device directly, routing around bcache? Such access could result in filesystem corruption. Bcache handles this possibility by requiring exclusive access to the backing device. That device is formatted with a special marker, and its leading blocks are hidden when accessing the device by way of bcache. Thus, the beginning of the device under bcache is not the same as the beginning when the device is accessed directly. That means that a filesystem created through bcache will not be recognized by the filesystem code if an attempt is made to mount the backing device directly. Simple attempts to shoot one's own feet should be defeated by this mechanism; as always, there is little point in doing more to protect those who are really determined to injure themselves.
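The hidden-leading-blocks trick amounts to an offset translation, sketched below with an invented offset and magic string (real bcache's superblock layout differs):

```python
# Through the cache device, offset 0 maps past an on-disk marker, so a
# filesystem created via bcache is invisible when the raw backing
# device is read directly.
DATA_OFFSET = 8192                  # bytes reserved for the marker (hypothetical)
raw_device = bytearray(64 * 1024)   # stand-in for the backing disk

def cached_write(offset, data):
    # What the filesystem sees as offset 0 is really DATA_OFFSET on disk.
    start = DATA_OFFSET + offset
    raw_device[start:start + len(data)] = data

cached_write(0, b"FSMAGIC")          # filesystem writes its superblock "at 0"
assert raw_device[:7] != b"FSMAGIC"  # direct access to block 0 finds no magic
```

A direct mount attempt reads the raw device's block 0, finds no filesystem magic there, and fails — which is exactly the foot-protection intended.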
There seems to be a reasonable level of consensus that bcache would be a useful functionality to add to the kernel. There are some obstacles to overcome before this code can be merged, though. One of those is that bcache adds its own management interface involving a set of dedicated tools and a complex sysfs structure. There is resistance to adding another API for block device management, so Kent has been encouraged to integrate bcache into the device mapper code. Nobody seems to be working on that project at the moment, but Dan Williams has posted a set of patches integrating bcache into the MD RAID layer. With these patches, a simple mdadm command is sufficient to set up an array with SSD caching added on top. Once that code gets into shape, presumably the user-space interface concerns will be somewhat lessened.
A harder problem to get around may be the simple fact that the bcache patch set is large, adding over 15,000 lines of code to the kernel. Included therein is a fair amount of tricky data structure work, such as a complex btree implementation and "closures," described as "asynchronous refcounty things based on workqueues". The complexity of the code will make it hard to review but, given the potential for trouble when adding a new stage to the block I/O path, developers will want this code to be well reviewed indeed. Getting enough eyeballs directed toward this code could be a challenge, but the benefit, in the form of faster storage devices, could well be worth the trouble.
Index entries for this article
Kernel: Block layer/Caching
Posted May 14, 2012 20:07 UTC (Mon)
by blitzkrieg3 (guest, #57873)
It isn't clear to me why this is true. SSDs are persistent storage and the data is still in the SSD, so why can't this be persistent? The only way this could be a problem is if the mapping is in memory and not written out to the SSD ever.
Posted May 14, 2012 20:10 UTC (Mon)
by corbet (editor, #1)
Yes, exactly...as I tried to explain in that same paragraph. The data does exist on SSD, but it's only useful if the post-meteorite kernel knows what data is there. So the index and such have to be saved to the SSD along with the data.
Posted May 14, 2012 20:49 UTC (Mon)
by Lennie (subscriber, #49641)
Eventually bcache is supposed to get support for mirroring the dirty data, so your dirty data will be stored on two SSDs before it is written to your already redundant backing store (like a RAID1 of HDDs).
Any data that is only a cached copy of data already written to the backing store will be stored just once, on one of the SSDs.
When that has been added, it should take away most of the concerns people might have about their data.
Posted May 14, 2012 23:28 UTC (Mon)
by russell (guest, #10458)
Posted May 15, 2012 7:56 UTC (Tue)
by Tobu (subscriber, #24111)
I think the SSD will be a single point of failure in writeback mode, because the underlying filesystem would have megabytes of metadata or journal blocks not written in the right order, which is bad corruption. I don't know how SSDs tend to fail; if they fail into a read-only state, the writes would still be recoverable in this case, as long as bcache can replay from a read-only SSD. Maybe a filesystem that handles SSD caching itself could avoid that risk.
Posted May 15, 2012 9:05 UTC (Tue)
by Lennie (subscriber, #49641)
About how bcache will support more than one SSD in the future and how it will save 2 copies of your precious data on different SSDs instead of one:
Posted May 15, 2012 19:13 UTC (Tue)
by dwmw2 (subscriber, #2063)
"SSDs are persistent storage and the data is still in the SSD, so why can't this be persistent? The only way this could be a problem is if the mapping is in memory and not written out to the SSD ever."

Or if your SSD is like almost all SSDs ever seen, where the internal translation is a "black box" which you can't trust, which is known for its unreliability especially in the face of unexpected power failure, and which you can't debug or diagnose when it goes wrong.
Posted May 14, 2012 20:51 UTC (Mon)
by alankila (guest, #47141)
Maybe it's somehow critically far worse, but then again, perhaps it's true that perfect is the enemy of good. We'll probably know when we can benchmark these technologies against each other.
Posted May 14, 2012 21:18 UTC (Mon)
by Beolach (guest, #77384)
The bcache wiki has a Performance section, which includes a link to just such a comparison.
Posted May 14, 2012 21:37 UTC (Mon)
by alankila (guest, #47141)
Posted May 14, 2012 22:40 UTC (Mon)
by ballombe (subscriber, #9523)
> Such a device could simultaneously increase the performance ecstasy that comes with solid-state storage and the overwhelming financial joy associated with rotating storage.
(just kidding.)
Posted May 15, 2012 4:15 UTC (Tue)
by jzbiciak (guest, #5246)
Posted May 14, 2012 23:25 UTC (Mon)
by butlerm (subscriber, #13312)
In addition, clearing out the first block of the backing device doesn't seem like such a great idea on simple configurations. If the SSD fails, the remaining filesystem would practically be guaranteed to be in a seriously inconsistent state upon recovery. I would be concerned about using this in a sufficiently critical setup without the ability to mirror the cache devices.
It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.
If you want write performance improvements, the ideal thing to do is not to synchronously flush dirty data in the cache to the backing store on receipt of a write barrier, but rather to journalize all the writes themselves on the cache device, so that they can be applied in the correct order on system recovery. Otherwise you get a milder version of the problem where synchronous writes are held up by asynchronous ones, which can dramatically affect the time required for an fsync operation, a database commit, and other similar operations, if there are other asynchronous writes pending.
On the other hand, a device like this could make it possible for the raid / lvm layers to handle write barriers properly, i.e. with context specific barriers, that only apply to one logical volume, or subset of the writes from a specific filesystem, without running the risk of serious inconsistency problems due to the normal inability to write synchronously to a group of backing store devices. If you have a reliable, low overhead write intent log, you can avoid that problem, and do it right, degrading to flush the world mode when a persistent intent log is not available.
Posted May 14, 2012 23:54 UTC (Mon)
by koverstreet (✭ supporter ✭, #4296)
> […] of the cache across restarts? That would be a major advantage in avoiding cache warm up time in many applications, particularly large databases.

Yes. That's been priority #1 from the start - it's particularly critical if you want to be able to use writeback... at all.

> In addition, clearing out the first block of the backing device doesn't seem like such a great idea on simple configurations. If the SSD fails, the remaining filesystem would practically be guaranteed to be in a seriously inconsistent state upon recovery. I would be concerned about using this in a sufficiently critical setup without the ability to mirror the cache devices.

Yeah. For now, you can just use a raid1 of SSDs. Eventually, I'll finish multiple cache device support - you'll be able to have multiple SSDs in a cache set, and only dirty data and metadata will be mirrored.

> It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.

I think if you're using bcache for writeback caching on top of your raid5/6, things will work out pretty well without needing any tight integration; bcache's writeback tries hard to gather up big sequential IOs, and raid5/6 will handle those just fine.

> If you want write performance improvements, the ideal thing to do is not to synchronously flush dirty data in the cache to the backing store on receipt of a write barrier, but rather to journalize all the writes themselves on the cache device, so that they can be applied in the correct order on system recovery. Otherwise you get a milder version of the problem where synchronous writes are held up by asynchronous ones, which can dramatically affect the time required for an fsync operation, a database commit, and other similar operations, if there are other asynchronous writes pending.

Yes. But all that's really required for that is writeback caching (things get complicated if you only want to use writeback caching for random writes, but I'll handwave that away for now).

> On the other hand, a device like this could make it possible for the raid / lvm layers to handle write barriers properly, i.e. with context specific barriers, that only apply to one logical volume, or subset of the writes from a specific filesystem, without running the risk of serious inconsistency problems due to the normal inability to write synchronously to a group of backing store devices. If you have a reliable, low overhead write intent log, you can avoid that problem, and do it right, degrading to flush the world mode when a persistent intent log is not available.

Well, old style write barriers have gone away - now there's just cache flushes (and FUA), which are much saner to handle.
Posted May 15, 2012 0:55 UTC (Tue)
by dlang (guest, #313)
> I think if you're using bcache for writeback caching on top of your raid5/6, things will work out pretty well without needing any tight integration; bcache's writeback tries hard to gather up big sequential IOs, and raid5/6 will handle those just fine.

Where tight integration would be nice would be if bcache can align the writeback to the stripe size and alignment, the same way that it tries to align the SSD writes to the eraseblock size and alignment.
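The alignment arithmetic being suggested might look like this (the stripe size is hypothetical, and bcache does not currently do this):

```python
# Round a dirty byte range inward to raid stripe boundaries, so that
# the raid5/6 layer sees full-stripe writes and can skip the
# read-modify-write cycle needed for partial stripes.
STRIPE = 512 * 1024   # raid stripe size in bytes (invented for the example)

def full_stripes(start, end):
    """Return the largest stripe-aligned subrange of [start, end), or None."""
    a = -(-start // STRIPE) * STRIPE   # round start up to a stripe boundary
    b = (end // STRIPE) * STRIPE       # round end down to a stripe boundary
    return (a, b) if a < b else None
```

The leftover unaligned head and tail would still need ordinary (partial-stripe) writeback.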
Posted May 15, 2012 3:43 UTC (Tue)
by koverstreet (✭ supporter ✭, #4296)
But for that to be useful we'd have to have writeback specifically pick bigger sequential chunks of dirty data and skip smaller ones, and I'm not sure how useful that actually is. Right now it just flushes dirty data in sorted order, which makes things really easy and works quite well in practice - in particular for regular hard drives, even if your dirty data isn't purely sequential you're still minimizing seek distance. And if you let gigabytes of dirty data buffer up (via writeback_percent) - the dirty data's going to be about as sequential as it's gonna get.
But yeah, there's all kinds of interesting tricks we could do.
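The "flush in sorted order" point is easy to demonstrate with toy block numbers: sorting the dirty set never increases, and usually shrinks, the total head travel on a rotating disk.

```python
# Total seek distance for visiting a list of block addresses in order.
def total_seek(blocks):
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))

dirty = [900, 10, 500, 20, 880]            # arrival order of dirty blocks
assert total_seek(sorted(dirty)) <= total_seek(dirty)
```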
Posted May 15, 2012 22:34 UTC (Tue)
by intgr (subscriber, #39733)
That's certainly an important optimization for mixed random/sequential write workloads whose working set is larger than the SSD. To make best use of both kinds of disks, random writes should persist on the SSD as long as possible, whereas longer sequential writes should be pushed out quickly to make more room for random writes.
Posted May 16, 2012 1:52 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
If you can show me a workload that'd benefit though - I'm not opposed to the idea, it's just a question of priorities.
Posted May 16, 2012 7:35 UTC (Wed)
by intgr (subscriber, #39733)
Oh, I didn't realize that. That should take care of most of it. There's probably still some benefit to sorting writeback by size, but I'm not sure whether it's worth the complexity.
Posted May 18, 2012 13:54 UTC (Fri)
by Spudd86 (subscriber, #51683)
Posted May 14, 2012 23:56 UTC (Mon)
by koverstreet (✭ supporter ✭, #4296)
> Another interesting problem that comes with writeback caching is the implementation of barrier operations. Filesystems use barriers (implemented as synchronous "force to media" operations in contemporary kernels) to ensure that the on-disk filesystem structure is consistent at all times. If bcache does not recognize and implement those barriers, it runs the risk of wrecking the filesystem's careful ordering of operations and corrupting things on the backing device. Unfortunately, bcache does indeed lack such support at the moment, leading to a strong recommendation to mount filesystems with barriers disabled for now.

I need to delete that from the wiki, it's out of date - it's only old style (pre 2.6.38) barriers that were a problem, because they implied much stricter ordering. Cache flushes/FUA are much easier, and are handled just fine.
Posted May 15, 2012 4:16 UTC (Tue)
by alonz (subscriber, #815)
Do you mean the advice to mount filesystems with “-o nobarrier” is outdated?
Posted May 15, 2012 7:13 UTC (Tue)
by koverstreet (✭ supporter ✭, #4296)
Posted May 15, 2012 2:42 UTC (Tue)
by omar (guest, #18331)
Posted May 15, 2012 12:29 UTC (Tue)
by arekm (guest, #4846)
This would mean that I can't add SSD bcache to an existing device/filesystem, right? It would be nice to be able to add and remove bcache when needed for an existing dev/fs.
Posted Jul 17, 2012 16:30 UTC (Tue)
by Lennie (subscriber, #49641)
I believe if you've formatted your device once for bcache, you can still always add/remove a cache device. In other words, it should run just fine without any cache device.
Posted May 15, 2012 16:35 UTC (Tue)
by aorth (subscriber, #55260)
http://rapiddisk.org/index.php?title=RapidCache
RapidCache uses RAM, so you can't cache AS MUCH, but it's probably faster... probably some scenarios where you'd want to use one or the other depending on your data set, applications, etc.
Posted May 16, 2012 6:01 UTC (Wed)
by Rudd-O (guest, #61155)
Posted May 16, 2012 10:42 UTC (Wed)
by aorth (subscriber, #55260)
Ok, it's a bit over my head, but I'd imagine there must be some benefit to RapidCache. Surely the developer had a legitimate reason to warrant developing it?
Posted May 16, 2012 5:59 UTC (Wed)
by Rudd-O (guest, #61155)
Compared to the ZIL and the L2ARC (both in ZFS and available now in-kernel for Linux), bcache is not even *remotely* close.
Posted May 16, 2012 9:02 UTC (Wed)
by abacus (guest, #49001)
Has someone already published a comparison of the in-kernel ZFS on Linux and Btrfs? Some people claim that ZFS is better.
Posted May 20, 2012 0:25 UTC (Sun)
by jospoortvliet (guest, #33164)
Posted May 17, 2012 1:27 UTC (Thu)
by koverstreet (✭ supporter ✭, #4296)
Posted May 16, 2012 18:14 UTC (Wed)
by butlerm (subscriber, #13312)
Assuming the cache is reliable and persistent, that is exactly what you want in most situations. If a higher level cache flush command actually forces a cache flush of a persistent write back cache, write back mode will be approximately useless. You might as well just leave it in RAM.
Posted May 17, 2012 1:29 UTC (Thu)
by koverstreet (✭ supporter ✭, #4296)
If a mode where all writes were writeback cached and cache flushes were never sent to the backing device would be useful, it'd be trivial to add.
Posted May 17, 2012 6:30 UTC (Thu)
by butlerm (subscriber, #13312)
In a typical journalling filesystem, it is usually only the journal entries that need to be synchronously flushed at all. Most writes can be applied independently. If a mounted filesystem had two or more I/O "threads" (or flows), one for journal and synchronous data writes, and one for ordinary data writes, an intelligent lower layer could handle a barrier on one thread by flushing a small amount of journal data, while the other one takes its own sweet time - with a persistent cache, even across a system restart if necessary.
Otherwise, the larger the write cache, the larger the delay when a journal commit comes along. Call it a block layer version of buffer bloat. As with networking, other than making the buffer smaller, the typical solution is to use multiple class or flow based queues. If you don't have flow based queuing, you really don't want much of a buffer at all, because it causes latency to skyrocket.
As a consequence, I don't see how write back caching can help very much here, unless all writes (or at least meta data for all out of order writes) are queued in the cache, so that write ordering is not broken. Am I wrong?
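The flow-based-queuing idea in the comment above can be sketched with two queues; this is a toy model of the proposal, not an existing kernel mechanism, and all names are invented:

```python
# Separate queues ("flows") for journal writes and bulk data writes:
# a commit need only flush the small journal queue, not everything
# buffered, avoiding the block-layer "bufferbloat" described above.
from collections import deque

queues = {"journal": deque(), "bulk": deque()}

def submit(flow, write):
    queues[flow].append(write)

def barrier(flow):
    """Flush only the named flow; writes in other queues are untouched."""
    flushed = list(queues[flow])
    queues[flow].clear()
    return flushed

for i in range(1000):
    submit("bulk", ("data", i))        # lots of pending asynchronous writes
submit("journal", ("commit-record", 1))

assert len(barrier("journal")) == 1    # the commit waits on one write...
assert len(queues["bulk"]) == 1000     # ...not on the 1000 buffered ones
```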
Posted May 24, 2012 11:58 UTC (Thu)
by ViralMehta (guest, #80756)
Posted May 27, 2012 18:22 UTC (Sun)
by dev (guest, #34359)