A bcache update
Posted May 14, 2012 23:25 UTC (Mon) by butlerm (subscriber, #13312)
In addition, clearing out the first block of the backing device doesn't seem like such a great idea on simple configurations. If the SSD fails, the remaining filesystem would practically be guaranteed to be in a seriously inconsistent state upon recovery. I would be concerned about using this in a sufficiently critical setup without the ability to mirror the cache devices.
It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.
If you want write performance improvements, the ideal thing to do is not to synchronously flush dirty data in the cache to the backing store on receipt of a write barrier, but rather to journal all the writes themselves on the cache device, so that they can be applied in the correct order on system recovery. Otherwise you get a milder version of the problem where synchronous writes are held up by asynchronous ones, which can dramatically affect the time required for an fsync operation, a database commit, and other similar operations, if there are other asynchronous writes pending.
On the other hand, a device like this could make it possible for the raid / lvm layers to handle write barriers properly, i.e. with context-specific barriers that apply only to one logical volume, or to a subset of the writes from a specific filesystem, without running the risk of serious inconsistency problems due to the normal inability to write synchronously to a group of backing store devices. If you have a reliable, low-overhead write intent log, you can avoid that problem and do it right, degrading to flush-the-world mode when a persistent intent log is not available.
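The "journal the writes instead of flushing synchronously" idea above can be sketched in a few lines. This is a toy Python model, not bcache's actual on-disk format; all names here are invented for illustration. The point it shows is that an fsync only has to wait for its own journal append to become durable, while recovery replays the log in sequence so the backing device still ends up consistent.

```python
# Toy model of journaling writes on the cache device: every write is
# appended to an ordered log, and recovery replays the log in sequence
# so the backing store sees the same write ordering the application
# issued. Purely illustrative; not bcache's real format.

class WriteJournal:
    def __init__(self):
        self.log = []          # ordered (seq, offset, data) records
        self.next_seq = 0

    def append(self, offset, data):
        """Persist one write record. A sync operation only waits for
        this append, not for every earlier asynchronous write to reach
        the backing device."""
        rec = (self.next_seq, offset, data)
        self.log.append(rec)
        self.next_seq += 1
        return rec

    def replay(self, backing):
        """On recovery, apply records in sequence order."""
        for _, offset, data in sorted(self.log):
            backing[offset] = data

journal = WriteJournal()
journal.append(4096, b"commit record")
journal.append(0, b"superblock update")

backing = {}
journal.replay(backing)
# backing now reflects both writes, applied in issue order
```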
Posted May 14, 2012 23:54 UTC (Mon)
by koverstreet (✭ supporter ✭, #4296)
[Link] (6 responses)

> of the cache across restarts? That would be a major advantage in avoiding cache warm up time in many applications, particularly large databases.

Yes. That's been priority #1 from the start - it's particularly critical if you want to be able to use writeback... at all.

> In addition, clearing out the first block of the backing device doesn't seem like such a great idea on simple configurations. If the SSD fails, the remaining filesystem would practically be guaranteed to be in a seriously inconsistent state upon recovery. I would be concerned about using this in a sufficiently critical setup without the ability to mirror the cache devices.

Yeah. For now, you can just use a raid1 of SSDs. Eventually, I'll finish multiple cache device support - you'll be able to have multiple SSDs in a cache set, and only dirty data and metadata will be mirrored.

> It would also seem that the best way to run such a cache would be with a relatively tight integration with the raid layer, such that backing store devices are marked with metadata indicating the association with the cache device, so that the group of them can properly be reassembled after a hardware failure, possibly on a different system. The raid layer could then potentially use the cache as a write intent log as well, which is a big deal for raid 5/6 setups.

I think if you're using bcache for writeback caching on top of your raid5/6, things will work out pretty well without needing any tight integration; bcache's writeback tries hard to gather up big sequential IOs, and raid5/6 will handle those just fine.

> If you want write performance improvements, the ideal thing to do is not to synchronously flush dirty data in the cache to the backing store on receipt of a write barrier, but rather to journal all the writes themselves on the cache device, so that they can be applied in the correct order on system recovery. Otherwise you get a milder version of the problem where synchronous writes are held up by asynchronous ones, which can dramatically affect the time required for an fsync operation, a database commit, and other similar operations, if there are other asynchronous writes pending.

Yes. But all that's really required for that is writeback caching (things get complicated if you only want to use writeback caching for random writes, but I'll handwave that away for now).

> On the other hand, a device like this could make it possible for the raid / lvm layers to handle write barriers properly, i.e. with context-specific barriers that apply only to one logical volume, or to a subset of the writes from a specific filesystem, without running the risk of serious inconsistency problems due to the normal inability to write synchronously to a group of backing store devices. If you have a reliable, low-overhead write intent log, you can avoid that problem and do it right, degrading to flush-the-world mode when a persistent intent log is not available.

Well, old style write barriers have gone away - now there's just cache flushes (and FUA), which are much saner to handle.
Posted May 15, 2012 0:55 UTC (Tue)
by dlang (guest, #313)
[Link] (4 responses)
>I think if you're using bcache for writeback caching on top of your raid5/6,
Where tight integration would be nice is if bcache could align the writeback to the raid stripe size and alignment, the same way it tries to align the SSD writes to the erase-block size and alignment.
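The alignment dlang is asking for amounts to rounding a dirty extent out to full-stripe boundaries before writeback, so raid5/6 can write whole stripes without a read-modify-write cycle. A minimal sketch, with an invented stripe geometry (none of these numbers come from bcache or md):

```python
# Hypothetical illustration: expand a dirty byte extent to the enclosing
# raid stripe-aligned extent, so writeback can issue full-stripe writes.
# Stripe geometry here is an arbitrary example.

STRIPE_SIZE = 4 * 64 * 1024   # e.g. 4 data disks x 64 KiB chunk

def stripe_aligned(offset, length, stripe=STRIPE_SIZE):
    """Return (start, length) of the smallest stripe-aligned extent
    covering [offset, offset + length)."""
    start = (offset // stripe) * stripe
    end = -(-(offset + length) // stripe) * stripe   # round up
    return start, end - start

# A 10 KiB dirty extent in the middle of a stripe becomes one full stripe:
print(stripe_aligned(300 * 1024, 10 * 1024))  # (262144, 262144)
```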
Posted May 15, 2012 3:43 UTC (Tue)
by koverstreet (✭ supporter ✭, #4296)
[Link] (3 responses)
But for that to be useful we'd have to have writeback specifically pick bigger sequential chunks of dirty data and skip smaller ones, and I'm not sure how useful that actually is. Right now it just flushes dirty data in sorted order, which makes things really easy and works quite well in practice - in particular for regular hard drives, even if your dirty data isn't purely sequential you're still minimizing seek distance. And if you let gigabytes of dirty data buffer up (via writeback_percent) - the dirty data's going to be about as sequential as it's gonna get.
But yeah, there's all kinds of interesting tricks we could do.
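The flush-in-sorted-order behavior described above is simple enough to sketch directly. The extent representation below is made up for the example; the point is that visiting dirty data in ascending offset order keeps the backing disk's head moving mostly in one direction even when the data isn't contiguous:

```python
# Sketch of bcache's writeback ordering as described: dirty extents are
# flushed in sorted (offset) order, minimizing seek distance on a
# rotating backing device. Data layout here is illustrative only.

dirty = {8_000_000: b"c", 4096: b"a", 1_000_000: b"b"}  # offset -> data

def flush_in_sorted_order(dirty):
    """Yield (offset, data) in ascending offset order, like a single
    elevator pass over the backing device."""
    for offset in sorted(dirty):
        yield offset, dirty[offset]

order = [off for off, _ in flush_in_sorted_order(dirty)]
print(order)  # [4096, 1000000, 8000000]
```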
Posted May 15, 2012 22:34 UTC (Tue)
by intgr (subscriber, #39733)
[Link] (2 responses)
That's certainly an important optimization for mixed random/sequential write workloads whose working set is larger than the SSD. To make best use of both kinds of disks, random writes should persist on the SSD as long as possible, whereas longer sequential writes should be pushed out quickly to make more room for random writes.
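The policy being proposed could look something like the following sketch. Everything here is invented for illustration (the threshold, the extent representation, the function name); it just makes the stated preference concrete: when the cache needs room, push large sequential runs out first and keep small random writes on the SSD as long as possible.

```python
# Purely illustrative policy sketch: order writeback candidates so that
# big sequential runs are flushed first and small random writes stay
# cached longest. Threshold and names are made up for this example.

def pick_writeback_order(extents, seq_threshold=1 << 20):
    """extents: list of (offset, length) pairs. Returns them in a
    suggested writeback order: sequential-sized runs first (largest
    first), small random writes last."""
    sequential = [e for e in extents if e[1] >= seq_threshold]
    random_writes = [e for e in extents if e[1] < seq_threshold]
    return sorted(sequential, key=lambda e: -e[1]) + random_writes

demo = [(0, 2 << 20), (4096, 4096), (10 << 20, 1 << 20)]
print(pick_writeback_order(demo))
```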
Posted May 16, 2012 1:52 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link] (1 responses)
If you can show me a workload that'd benefit though - I'm not opposed to the idea, it's just a question of priorities.
Posted May 16, 2012 7:35 UTC (Wed)
by intgr (subscriber, #39733)
[Link]
Oh, I didn't realize that. That should take care of most of it. There's probably still some benefit to sorting writeback by size, but I'm not sure whether it's worth the complexity.
Posted May 18, 2012 13:54 UTC (Fri)
by Spudd86 (subscriber, #51683)
[Link]