LWN: Comments on "A bcache update" https://lwn.net/Articles/497024/ This is a special feed containing comments posted to the individual LWN article titled "A bcache update". en-us Thu, 25 Sep 2025 16:13:08 +0000 Thu, 25 Sep 2025 16:13:08 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net A bcache update https://lwn.net/Articles/507081/ https://lwn.net/Articles/507081/ Lennie <div class="FormattedComment"> "This would mean that I can't add ssd bcache to existing device/filesystem, right? Would be nice to be able to add and remove bcache when needed for an existing dev/fs."<br> <p> I believe if you've formatted your device once for bcache you can still always add/remove a cache device. In other words, it should run just fine without any cache device.<br> </div> Tue, 17 Jul 2012 16:30:34 +0000 A bcache update https://lwn.net/Articles/498965/ https://lwn.net/Articles/498965/ dev <div class="FormattedComment"> I guess it's dangerous until it supports barriers, and so can't be used for databases. Even a standard FS can get corrupted on power failure...<br> <p> </div> Sun, 27 May 2012 18:22:36 +0000 A bcache update https://lwn.net/Articles/498551/ https://lwn.net/Articles/498551/ ViralMehta <div class="FormattedComment"> It looks close to what Windows does when we attach a USB drive, doesn't it? It uses the USB drive's memory as extended RAM.... <br> </div> Thu, 24 May 2012 11:58:06 +0000 ZFS on Linux https://lwn.net/Articles/497909/ https://lwn.net/Articles/497909/ jospoortvliet <div class="FormattedComment"> no s* sherlock... ZFS is what, 5 years older? Yeah, it is probably still ahead in many scenarios...<br> </div> Sun, 20 May 2012 00:25:42 +0000 A bcache update https://lwn.net/Articles/497729/ https://lwn.net/Articles/497729/ Spudd86 <div class="FormattedComment"> Another solution if you're using btrfs is to just set up a bcache for each btrfs device you want to use (which is the only way it would work if you're using btrfs internal RAID, I suppose). Then your data is always redundant...<br> </div> Fri, 18 May 2012 13:54:40 +0000 A bcache update https://lwn.net/Articles/497477/ https://lwn.net/Articles/497477/ butlerm <div class="FormattedComment"> This is why the block layer really ought to support write "threads" with thread specific write barriers. It is trivial to convert a barrier to a cache flush where it is impractical to do something more intelligent, but thread specific barriers are ideal for fast commits of journal entries and the like.<br> <p> In a typical journalling filesystem, it is usually only the journal entries that need to be synchronously flushed at all. Most writes can be applied independently. If a mounted filesystem had two or more I/O "threads" (or flows), one for journal and synchronous data writes, and one for ordinary data writes, an intelligent lower layer could handle a barrier on one thread by flushing a small amount of journal data, while the other one takes its own sweet time - with a persistent cache, even across a system restart if necessary.<br> <p> Otherwise, the larger the write cache, the larger the delay when a journal commit comes along. Call it a block layer version of buffer bloat. As with networking, other than making the buffer smaller, the typical solution is to use multiple class or flow based queues.
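<br> <p> A toy userspace model of that buffer bloat analogy (this is not kernel code, and the 100 MB/s figure is just an assumption): with a single shared queue the commit waits behind everything already buffered, while with per-flow queues it waits only for its own flow.<br> <p> <pre>
/* Toy model only: one shared write buffer vs. per-flow queues.
 * DISK_MBPS is an assumed backing-store throughput, not a measurement. */
#include <stdio.h>

#define DISK_MBPS 100.0

/* time (in ms) to drain queued_mb megabytes at DISK_MBPS */
static double drain_ms(double queued_mb)
{
    return queued_mb / DISK_MBPS * 1000.0;
}

int main(void)
{
    double bulk_mb    = 512.0;  /* asynchronous data writes already queued */
    double journal_mb = 0.5;    /* small synchronous journal commit        */

    /* shared queue: the commit is ordered behind all the bulk data */
    printf("single shared queue: commit waits %.0f ms\n",
           drain_ms(bulk_mb + journal_mb));

    /* per-flow queues: a barrier on the journal flow drains only that flow */
    printf("per-flow queues:     commit waits %.0f ms\n",
           drain_ms(journal_mb));
    return 0;
}
</pre><br> <p>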
If you don't have flow based queuing, you really don't want much of a buffer at all, because it causes latency to skyrocket.<br> <p> As a consequence, I don't see how write back caching can help very much here, unless all writes (or at least meta data for all out of order writes) are queued in the cache, so that write ordering is not broken. Am I wrong?<br> </div> Thu, 17 May 2012 06:30:33 +0000 A bcache update https://lwn.net/Articles/497455/ https://lwn.net/Articles/497455/ koverstreet <div class="FormattedComment"> The problem is that if you haven't been writeback caching _everything_ - i.e. your sequential writes have been bypassing the cache - you still need that cache flush to the backing device.<br> <p> If a mode where all writes were writeback cached and cache flushes were never sent to the backing device would be useful, it'd be trivial to add.<br> </div> Thu, 17 May 2012 01:29:40 +0000 A bcache update https://lwn.net/Articles/497454/ https://lwn.net/Articles/497454/ koverstreet <div class="FormattedComment"> I would like to see something to back that up.<br> </div> Thu, 17 May 2012 01:27:53 +0000 A bcache update https://lwn.net/Articles/497387/ https://lwn.net/Articles/497387/ butlerm <div class="FormattedComment"> Next question. Considering that there is a write back cache on persistent media, does that mean when the block layer issues a write cache flush (force unit access) command, that bcache synchronously forces the blocks to SSD, but _not_ synchronously to the backing store device?<br> <p> Assuming the cache is reliable and persistent, that is exactly what you want in most situations. If a higher level cache flush command actually forces a cache flush of a persistent write back cache, write back mode will be approximately useless. You might as well just leave it in RAM.<br> </div> Wed, 16 May 2012 18:14:09 +0000 Similar/related project https://lwn.net/Articles/497322/ https://lwn.net/Articles/497322/ aorth Ok, it's a bit over my head, but I'd imagine there must be <em>some</em> benefit to RapidCache. Surely the developer had a legitimate reason to warrant developing it? Wed, 16 May 2012 10:42:07 +0000 ZFS on Linux https://lwn.net/Articles/497308/ https://lwn.net/Articles/497308/ abacus Has someone already published a comparison of the in-kernel <a href="http://zfsonlinux.org/">ZFS on Linux</a> and BTRFS ? Some people claim that <a href="http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs">ZFS is better</a>. Wed, 16 May 2012 09:02:58 +0000 A bcache update https://lwn.net/Articles/497297/ https://lwn.net/Articles/497297/ intgr <div class="FormattedComment"> <font class="QuotedText">&gt; Well, you want the sequential writes to bypass the cache, which is what bcache does</font><br> <p> Oh, I didn't realize that. That should take care of most of it. There's probably still some benefit to sorting writeback by size, but I'm not sure whether it's worth the complexity.<br> <p> </div> Wed, 16 May 2012 07:35:38 +0000 Similar/related project https://lwn.net/Articles/497291/ https://lwn.net/Articles/497291/ Rudd-O <div class="FormattedComment"> RapidCache does exactly what the LRU block cache already does. 
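<br> <p> One way to see that existing cache at work is to time the same large read twice - a minimal sketch (the path below is only an example; point it at any big file that hasn't been read recently):<br> <p> <pre>
/* Reads a file twice and times both passes.  The second pass is served
 * from the kernel's page cache (ordinary RAM, managed LRU-style), with no
 * extra caching layer configured.  The path is just an example. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double read_seconds(const char *path)
{
    char buf[1 << 16];
    struct timespec t0, t1;
    int fd = open(path, O_RDONLY);

    if (fd < 0) { perror("open"); exit(1); }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (read(fd, buf, sizeof(buf)) > 0)
        ;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    const char *path = "/var/tmp/testfile";   /* example path only */

    printf("cold read: %.3f s\n", read_seconds(path));
    printf("warm read: %.3f s\n", read_seconds(path));
    return 0;
}
</pre><br> <p>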
There is no gain in fixing and setting aside an amount of RAM to do RapidCache.<br> </div> Wed, 16 May 2012 06:01:00 +0000 A bcache update https://lwn.net/Articles/497290/ https://lwn.net/Articles/497290/ Rudd-O <div class="FormattedComment"> Seriously,<br> <p> Compared to the ZIL and the L2ARC (both in ZFS and available now in-kernel for Linux), bcache is not even *remotely* close.<br> </div> Wed, 16 May 2012 05:59:28 +0000 A bcache update https://lwn.net/Articles/497262/ https://lwn.net/Articles/497262/ koverstreet <div class="FormattedComment"> Well, you want the sequential writes to bypass the cache, which is what bcache does.<br> <p> If you can show me a workload that'd benefit though - I'm not opposed to the idea, it's just a question of priorities.<br> </div> Wed, 16 May 2012 01:52:22 +0000 A bcache update https://lwn.net/Articles/497238/ https://lwn.net/Articles/497238/ intgr <div class="FormattedComment"> <font class="QuotedText">&gt; But for that to be useful we'd have to have writeback specifically pick bigger sequential chunks of dirty data and skip smaller ones, and I'm not sure how useful that actually is</font><br> <p> That's certainly an important optimization for mixed random/sequential write workloads whose working set is larger than the SSD. To make best use of both kinds of disks, random writes should persist on the SSD as long as possible, whereas longer sequential writes should be pushed out quickly to make more room for random writes.<br> <p> </div> Tue, 15 May 2012 22:34:43 +0000 A bcache update https://lwn.net/Articles/497219/ https://lwn.net/Articles/497219/ dwmw2 <blockquote><i>"SSDs are persistent storage and the data is still in the SSD, so why can't this be persistent? The only way this could be a problem is if the mapping is in memory and not written out to the SSD ever."</i></blockquote> Or if your SSD is like almost all SSDs ever seen, where the internal translation is a "black box" which you can't trust, which is known for its unreliability especially in the face of unexpected power failure, and which you can't debug or diagnose when it goes wrong. Tue, 15 May 2012 19:13:25 +0000 Similar/related project https://lwn.net/Articles/497208/ https://lwn.net/Articles/497208/ aorth <div class="FormattedComment"> Cool! I like the idea of using faster, less-reliable storage as a layer above slow, more-reliable storage. On that note, another interesting read/write through cache is RapidDisk's RapidCache. Check it out:<br> <p> <a href="http://rapiddisk.org/index.php?title=RapidCache">http://rapiddisk.org/index.php?title=RapidCache</a><br> <p> RapidCache uses RAM, so you can't cache AS MUCH, but it's probably faster... probably some scenarios where you'd want to use one or the other depending on your data set, applications, etc.<br> </div> Tue, 15 May 2012 16:35:12 +0000 A bcache update https://lwn.net/Articles/497172/ https://lwn.net/Articles/497172/ arekm <div class="FormattedComment"> "That device is formatted with a special marker, and its leading blocks are hidden when accessing the device by way of bcache."<br> <p> This would mean that I can't add ssd bcache to existing device/filesystem, right? 
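<br> <p> If I'm reading that right, it comes down to an offset: the device you'd actually use (/dev/bcache0) starts past the bcache superblock on the raw disk, so an existing filesystem's blocks would all be in the wrong place. A toy illustration only - the 16-sector offset below is a made-up number, the real value is recorded in the superblock:<br> <p> <pre>
/* Toy illustration, not bcache's real on-disk format: data on the cached
 * device is shifted past a leading superblock area. */
#include <inttypes.h>
#include <stdio.h>

#define DATA_OFFSET_SECTORS 16   /* hypothetical size of the hidden area */

static uint64_t backing_sector(uint64_t bcache_sector)
{
    /* sector N of /dev/bcache0 lives at sector N + offset of the raw disk */
    return bcache_sector + DATA_OFFSET_SECTORS;
}

int main(void)
{
    printf("bcache sector 0 -> backing sector %" PRIu64 "\n",
           backing_sector(0));
    return 0;
}
</pre><br> <p>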
Would be nice to be able to add and remove bcache when needed for an existing dev/fs.<br> </div> Tue, 15 May 2012 12:29:58 +0000 A bcache update https://lwn.net/Articles/497161/ https://lwn.net/Articles/497161/ Lennie <div class="FormattedComment"> Did you see my other comment?<br> <p> About how bcache will support more than one SSD in the future and how it will save 2 copies of your precious data on different SSDs instead of one:<br> <p> <a href="http://lwn.net/Articles/497126/">http://lwn.net/Articles/497126/</a><br> </div> Tue, 15 May 2012 09:05:56 +0000 A bcache update https://lwn.net/Articles/497157/ https://lwn.net/Articles/497157/ Tobu I think the SSD will be a single point of failure in writeback mode, because the underlying filesystem would have megabytes of metadata or journal blocks not written in the right order, which is bad corruption. I don't know how SSDs tend to fail; if they fail into a read-only mode, the writes would still be recoverable in this case, as long as bcache can replay from a read-only SSD. Maybe a filesystem that handles SSD caching itself could avoid that risk. Tue, 15 May 2012 07:56:32 +0000 A bcache update https://lwn.net/Articles/497156/ https://lwn.net/Articles/497156/ koverstreet <div class="FormattedComment"> Yep. I took it out of the wiki, no need anymore.<br> </div> Tue, 15 May 2012 07:13:04 +0000 A bcache update https://lwn.net/Articles/497151/ https://lwn.net/Articles/497151/ alonz Do you mean the advice to mount filesystems with &ldquo;<tt>-o nobarrier</tt>&rdquo; is outdated? Tue, 15 May 2012 04:16:30 +0000 A bcache update https://lwn.net/Articles/497152/ https://lwn.net/Articles/497152/ jzbiciak <div class="FormattedComment"> That reads like poorly translated Japanese or Chinese documentation...<br> </div> Tue, 15 May 2012 04:15:50 +0000 A bcache update https://lwn.net/Articles/497150/ https://lwn.net/Articles/497150/ koverstreet <div class="FormattedComment"> For that we wouldn't really need tight integration - you could conceivably just tell bcache the stripe size of the backing device, and it wouldn't consider partial stripes sequential with full stripes.<br> <p> But for that to be useful we'd have to have writeback specifically pick bigger sequential chunks of dirty data and skip smaller ones, and I'm not sure how useful that actually is. Right now it just flushes dirty data in sorted order, which makes things really easy and works quite well in practice - in particular for regular hard drives, even if your dirty data isn't purely sequential you're still minimizing seek distance.
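<br> <p> Roughly speaking, "sorted order" here just means issuing writeback in ascending sector order - a userspace sketch of the idea, not bcache's actual code (the extents below are made-up example data):<br> <p> <pre>
/* Sketch only: flushing dirty extents in ascending sector order keeps the
 * disk head sweeping one way, so even non-contiguous dirty data costs only
 * short seeks. */
#include <stdio.h>
#include <stdlib.h>

struct dirty_extent {
    unsigned long long sector;
    unsigned int       sectors;
};

static int by_sector(const void *a, const void *b)
{
    const struct dirty_extent *x = a, *y = b;

    return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void)
{
    /* dirty data in the order it happened to be written */
    struct dirty_extent dirty[] = {
        { 900000, 8 }, { 1024, 64 }, { 500000, 16 }, { 2048, 32 },
    };
    size_t i, n = sizeof(dirty) / sizeof(dirty[0]);

    qsort(dirty, n, sizeof(dirty[0]), by_sector);
    for (i = 0; i < n; i++)
        printf("write back %u sectors at %llu\n",
               dirty[i].sectors, dirty[i].sector);
    return 0;
}
</pre><br> <p>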
And if you let gigabytes of dirty data buffer up (via writeback_percent) - the dirty data's going to be about as sequential as it's gonna get.<br> <p> But yeah, there's all kinds of interesting tricks we could do.<br> </div> Tue, 15 May 2012 03:43:36 +0000 A bcache update https://lwn.net/Articles/497147/ https://lwn.net/Articles/497147/ omar <div class="FormattedComment"> Are there any plans to turn this into a module which can be used by something like a RHEL 2.6.32 kernel?<br> </div> Tue, 15 May 2012 02:42:28 +0000 A bcache update https://lwn.net/Articles/497145/ https://lwn.net/Articles/497145/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.</font><br> <p> <font class="QuotedText">&gt;I think if you're using bcache for writeback caching on top of your raid5/6,</font><br> things will work out pretty well without needing any tight integration;<br> bcache's writeback tries hard to gather up big sequential IOs, and raid5/6 will handle those just fine.<br> <p> Where tight integration would be nice would be if bcache can align the writeback to the stripe size and alignment the same way that it tries to align the SSD writes to the eraseblock size and alignment.<br> </div> Tue, 15 May 2012 00:55:20 +0000 A bcache update https://lwn.net/Articles/497143/ https://lwn.net/Articles/497143/ koverstreet <div class="FormattedComment"> <font class="QuotedText">&gt; Another interesting problem that comes with writeback caching is the</font><br> <font class="QuotedText">&gt; implementation of barrier operations. Filesystems use barriers (implemented as</font><br> <font class="QuotedText">&gt; synchronous "force to media" operations in contemporary kernels) to ensure that</font><br> <font class="QuotedText">&gt; the on-disk filesystem structure is consistent at all times. If bcache does not</font><br> <font class="QuotedText">&gt; recognize and implement those barriers, it runs the risk of wrecking the</font><br> <font class="QuotedText">&gt; filesystem's careful ordering of operations and corrupting things on the</font><br> <font class="QuotedText">&gt; backing device. Unfortunately, bcache does indeed lack such support at the</font><br> <font class="QuotedText">&gt; moment, leading to a strong recommendation to mount filesystems with barriers</font><br> <font class="QuotedText">&gt; disabled for now.</font><br> <p> I need to delete that from the wiki, it's out of date - it's only old style <br> (pre 2.6.38) barriers that were a problem, because they implied much stricter <br> ordering. Cache flushes/FUA are much easier, and are handled just fine.<br> </div> Mon, 14 May 2012 23:56:29 +0000 A bcache update https://lwn.net/Articles/497140/ https://lwn.net/Articles/497140/ koverstreet <div class="FormattedComment"> <font class="QuotedText">&gt; My number one question is does bcache preserve the (safely written) contents</font><br> <font class="QuotedText">&gt; of the cache across restarts? 
That would be a major advantage in avoiding cache</font><br> <font class="QuotedText">&gt; warm up time in many applications, particularly large databases.</font><br> <p> Yes. That's been priority #1 from the start - it's particularly critical if you <br> want to be able to use writeback... at all.<br> <p> <font class="QuotedText">&gt; In addition, clearing out the first block of the backing device doesn't seem</font><br> <font class="QuotedText">&gt; like such a great idea on simple configurations. If the SSD fails, the</font><br> <font class="QuotedText">&gt; remaining filesystem would practically be guaranteed to be in a seriously</font><br> <font class="QuotedText">&gt; inconsistent state upon recovery. I would be concerned about using this in a</font><br> <font class="QuotedText">&gt; sufficiently critical setup without the ability to mirror the cache devices.</font><br> <p> Yeah. For now, you can just use a raid1 of SSDs. Eventually, I'll finish <br> multiple cache device support - you'll be able to have multiple SSDs in a cache<br> set, and only dirty data and metadata will be mirrored.<br> <p> <font class="QuotedText">&gt; It would also seem that the best way to run such a cache would be with a</font><br> <font class="QuotedText">&gt; relatively tight integration with the raid layer, such that backing store</font><br> <font class="QuotedText">&gt; devices are marked with metadata indicating the association with the cache</font><br> <font class="QuotedText">&gt; device, so that the group of them can properly be reassembled after a hardware</font><br> <font class="QuotedText">&gt; failure, possibly on a different system. The raid layer could then potentially</font><br> <font class="QuotedText">&gt; use the cache as a write intent log as well, which is a big deal for raid 5/6</font><br> <font class="QuotedText">&gt; setups. </font><br> <p> I think if you're using bcache for writeback caching on top of your raid5/6, <br> things will work out pretty well without needing any tight integration;<br> bcache's writeback tries hard to gather up big sequential IOs, and raid5/6 will<br> handle those just fine.<br> <p> <font class="QuotedText">&gt; If you want write performance improvements, the ideal thing to do is not to </font><br> <font class="QuotedText">&gt; synchronously flush dirty data in the cache to the backing store on receipt of</font><br> <font class="QuotedText">&gt; a write barrier, but rather to journalize all the writes themselves on the</font><br> <font class="QuotedText">&gt; cache device, so that they can be applied in the correct order on system</font><br> <font class="QuotedText">&gt; recovery. Otherwise you get a milder version of problem where synchronous</font><br> <font class="QuotedText">&gt; writes are held up by asynchronous ones, which can dramatically affect the time</font><br> <font class="QuotedText">&gt; required for an fsync operation, a database commit, and other similar</font><br> <font class="QuotedText">&gt; operations, if there are other asynchronous writes pending. </font><br> <p> Yes. But all that's really required for that is writeback caching (things get <br> complicated if you only want to use writeback caching for random writes, but <br> I'll handwave that away for now).<br> <p> <font class="QuotedText">&gt; On the other hand, a device like this could make it possible for the raid /</font><br> <font class="QuotedText">&gt; lvm layers to handle write barriers properly, i.e. 
with context specific</font><br> <font class="QuotedText">&gt; barriers, that only apply to one logical volume, or subset of the writes from a</font><br> <font class="QuotedText">&gt; specific filesystem, without running the risk of serious inconsistency problems</font><br> <font class="QuotedText">&gt; due to the normal inability to write synchronously to a group of backing store</font><br> <font class="QuotedText">&gt; devices. If you have a reliable, low overhead write intent log, you can avoid</font><br> <font class="QuotedText">&gt; that problem, and do it right, degrading to flush the world mode when a </font><br> <font class="QuotedText">&gt; persistent intent log is not available. </font><br> <p> Well, old style write barriers have gone away - now there's just cache flushes <br> (and FUA), which are much saner to handle.<br> </div> Mon, 14 May 2012 23:54:21 +0000 A bcache update https://lwn.net/Articles/497138/ https://lwn.net/Articles/497138/ russell <div class="FormattedComment"> could it be that the SSD is a single point of failure in front of a redundant set of disks. So writing to the SSD is probably no better than keeping it in RAM. Power supply failure vs SSD failure.<br> </div> Mon, 14 May 2012 23:28:29 +0000 A bcache update https://lwn.net/Articles/497137/ https://lwn.net/Articles/497137/ butlerm <div class="FormattedComment"> My number one question is does bcache preserve the (safely written) contents of the cache across restarts? That would be a major advantage in avoiding cache warm up time in many applications, particularly large databases.<br> <p> In addition, clearing out the first block of the backing device doesn't seem like such a great idea on simple configurations. If the SSD fails, the remaining filesystem would practically be guaranteed to be in a seriously inconsistent state upon recovery. I would be concerned about using this in a sufficiently critical setup without the ability to mirror the cache devices.<br> <p> It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.<br> <p> If you want write performance improvements, the ideal thing to do is not to synchronously flush dirty data in the cache to the backing store on receipt of a write barrier, but rather to journalize all the writes themselves on the cache device, so that they can be applied in the correct order on system recovery. Otherwise you get a milder version of problem where synchronous writes are held up by asynchronous ones, which can dramatically affect the time required for an fsync operation, a database commit, and other similar operations, if there are other asynchronous writes pending.<br> <p> On the other hand, a device like this could make it possible for the raid / lvm layers to handle write barriers properly, i.e. with context specific barriers, that only apply to one logical volume, or subset of the writes from a specific filesystem, without running the risk of serious inconsistency problems due to the normal inability to write synchronously to a group of backing store devices. 
If you have a reliable, low overhead write intent log, you can avoid that problem, and do it right, degrading to flush the world mode when a persistent intent log is not available.<br> </div> Mon, 14 May 2012 23:25:41 +0000 A bcache update https://lwn.net/Articles/497134/ https://lwn.net/Articles/497134/ ballombe <div class="FormattedComment"> Why such a negative style, Jon? You could just as well have written<br> <p> <font class="QuotedText">&gt; Such a device could simultaneously increase the performance ecstasy that comes with solid-state storage and the overwhelming financial joy associated with rotating storage.</font><br> <p> (just kidding.)<br> </div> Mon, 14 May 2012 22:40:22 +0000 A bcache update https://lwn.net/Articles/497131/ https://lwn.net/Articles/497131/ alankila <div class="FormattedComment"> Hmh. Looks like a garbage comparison. They use XFS, which was changed a lot between 2.6.32 (which is their flashcache kernel) and 3.1 (for bcache).<br> </div> Mon, 14 May 2012 21:37:48 +0000 A bcache update https://lwn.net/Articles/497130/ https://lwn.net/Articles/497130/ Beolach The bcache wiki has a <a href="http://bcache.evilpiepirate.org/#Performance">Performance</a> section, which includes <a href="http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/">a link</a> to just such a comparison. Mon, 14 May 2012 21:18:43 +0000 A bcache update https://lwn.net/Articles/497127/ https://lwn.net/Articles/497127/ alankila <div class="FormattedComment"> There's also the Facebook flashcache, which does something similar. I've been running it experimentally as a writeback cache on a test server and so far it seems to do what it promises. It can be experimented with today, contains only about 6400 lines of code by a quick count, and can be insmod'd into the kernel rather than patched in.<br> <p> Maybe it's somehow critically far worse, but then again, perhaps it's true that perfect is the enemy of good. We'll probably know when we can benchmark these technologies against each other.<br> </div> Mon, 14 May 2012 20:51:05 +0000 bcache cache-sets https://lwn.net/Articles/497126/ https://lwn.net/Articles/497126/ Lennie <div class="FormattedComment"> Also as briefly mentioned in the older article: bcache has cache-sets, so you can assign several SSDs to a backing store.<br> <p> Eventually bcache is supposed to get support for mirroring the dirty data, so your dirty data will be stored on two SSDs before it is written to your already redundant backing store (like a RAID1 of HDDs).<br> <p> Any data that is only a cached copy of data already written to the backing store will only be stored once, on one of the SSDs.<br> <p> When that has been added, it should take away most concerns people might have about their data.<br> </div> Mon, 14 May 2012 20:49:06 +0000 A bcache update https://lwn.net/Articles/497123/ https://lwn.net/Articles/497123/ corbet Yes, exactly...as I tried to explain in that same paragraph. The data does exist on SSD, but it's only useful if the post-meteorite kernel knows what data is there. So the index and such have to be saved to the SSD along with the data. Mon, 14 May 2012 20:10:01 +0000 A bcache update https://lwn.net/Articles/497122/ https://lwn.net/Articles/497122/ blitzkrieg3 <div class="FormattedComment"> <font class="QuotedText">&gt; Obviously, writeback caching also carries the risk of losing data if the system is struck by a meteorite before the writeback operation is complete. </font><br> <p> It isn't clear to me why this is true.
SSDs are persistent storage and the data is still in the SSD, so why can't this be persistent? The only way this could be a problem is if the mapping is in memory and not written out to the SSD ever.<br> </div> Mon, 14 May 2012 20:07:30 +0000
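<p> A toy model of the point in corbet's reply above (nothing like bcache's real on-disk format): cached blocks that reached the SSD are only recoverable after a crash if the index entries describing them were written to the SSD as well.<br> <p> <pre>
/* Toy model only.  ssd_data and ssd_index stand in for what survives a
 * crash; ram_index is lost.  Recovery can only use cached blocks whose
 * index entries also made it to the SSD. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 4

static char ssd_data[NBLOCKS][16];   /* "persistent": cached data blocks    */
static long ssd_index[NBLOCKS];      /* "persistent": backing-device sector */
static long ram_index[NBLOCKS];      /* volatile: gone after the crash      */

int main(void)
{
    int i;

    for (i = 0; i < NBLOCKS; i++) {
        snprintf(ssd_data[i], sizeof(ssd_data[i]), "block %d", i);
        ram_index[i] = 1000 + i * 8;                 /* where the block belongs */
        ssd_index[i] = (i < 2) ? ram_index[i] : -1;  /* only two entries persisted */
    }

    memset(ram_index, 0, sizeof(ram_index));  /* crash: RAM contents lost */

    for (i = 0; i < NBLOCKS; i++) {
        if (ssd_index[i] >= 0)
            printf("recovered \"%s\" for sector %ld\n",
                   ssd_data[i], ssd_index[i]);
        else
            printf("\"%s\" is on the SSD but unusable: no index entry\n",
                   ssd_data[i]);
    }
    return 0;
}
</pre>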