
Bcache: Caching beyond just RAM

July 2, 2010

By William Stearns and Kent Overstreet

Kent Overstreet has been working on bcache, which is a Linux kernel module intended to improve the performance of block devices. Instead of using just memory to cache hard drives, he proposes to use one or more solid-state storage devices (SSDs) to cache block data (hence bcache, a block device cache).

The code is largely filesystem agnostic as long as the filesystem has an embedded UUID (which includes the standard Linux filesystems and swap devices). When data is read from the hard drive, a copy is saved to the SSD. Later, when one of those sectors needs to be retrieved again, the kernel checks to see if it's still in page cache. If so, the read comes from RAM just like it always has on Linux. If it's not in RAM but it is on the SSD, it's read from there. It's like we've added 64GB or more of - slower - RAM to the system and devoted it to caching.
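The lookup order can be pictured with a small user-space sketch. This is purely illustrative (the class and method names are invented, and the real code works on BIO structures inside the kernel), but it shows the tiering: RAM first, then the SSD, then the backing disk, with disk reads promoted into the SSD cache:

```python
# Conceptual sketch (not bcache's real code): the read path checks each
# tier in order and promotes disk reads into the SSD cache.
class TieredRead:
    def __init__(self):
        self.page_cache = {}   # sector -> data, in RAM
        self.ssd_cache = {}    # sector -> data, on the SSD
        self.disk = {}         # sector -> data, backing device

    def read(self, sector):
        if sector in self.page_cache:
            return self.page_cache[sector], "ram"
        if sector in self.ssd_cache:
            data = self.ssd_cache[sector]
            self.page_cache[sector] = data   # populate the page cache as usual
            return data, "ssd"
        data = self.disk[sector]
        self.ssd_cache[sector] = data        # save a copy for next time
        self.page_cache[sector] = data
        return data, "disk"
```

The first read of a sector comes from the disk and lands in the SSD cache; after the page cache is dropped (memory pressure, reboot), the next read is served from the SSD instead of the spinning disk.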

The design of bcache allows the use of more than one SSD to perform caching. It is also possible to cache more than one existing filesystem, or choose instead to just cache a small number of performance-critical filesystems. It would be perfectly reasonable to cache a partition used by a database manager, but not cache a large filesystem holding archives of old projects. The standard Linux page cache can be wiped out by copying a few large (near the size of your system RAM) files. Large file copies on that project archive partition won't wipe out an SSD-based cache using bcache.

Another potential use is using local media to cache remote disks. You could use an existing partition/drive/loopback device to cache any of the following: AOE (ATA-over-Ethernet) drive, SAN LUN, DRBD or NBD remote drives, iSCSI drives, local CD, or local (slow) USB thumb drives. The local cache wouldn't have to be an SSD; it would just have to be faster than the media you're caching.

Note that this type of caching is only for block devices (anything that shows up as a block device in /dev/). It isn't for network filesystems like NFS, CIFS, and so on (see the FS-cache module for the ability to cache individual files on an NFS or AFS client).

Implementation

To intercept filesystem operations, bcache hooks into the top of the block layer, in __generic_make_request(). It thus works entirely in terms of BIO structures. By hooking into the sole function through which all disk requests pass, bcache doesn't need to make any changes to block device naming or filesystem mounting. If /dev/md5 was originally mounted on /usr/, it continues to show up as /dev/md5 mounted on /usr/ after bcache is enabled for it. Because the caching is transparent, there are no changes to the boot process; in fact, bcache could be turned on long after the system is up and running. This approach of intercepting bio requests in the background allows us to start and stop caching on the fly, to add and remove cache devices, and to boot with or without bcache.

bcache's design focuses on avoiding random writes and playing to the strengths of SSDs. Roughly, a cache device is divided up into buckets, which are intended to match the physical disk's erase blocks. Each bucket has an eight-bit generation number, which is maintained in a separate array on the SSD just past the superblock. Pointers (both to btree buckets and to cached data) contain the generation number of the bucket they point to; thus, to free and reuse a bucket, it is sufficient to increment the generation number.

This mechanism allows bcache to keep the cache device completely full; when it wants to write some new data, it just picks a bucket and increments its generation number, invalidating all the existing pointers to it. Garbage collection will remove the actual pointers eventually; there is no need for backpointers or any other infrastructure.

For each bucket, bcache remembers the generation number from the last time a full garbage collection was performed. Once the difference between the current generation number and the remembered number reaches 64, it's time for another garbage collection. Since garbage collection thus runs long before the counter can wrap, an 8-bit generation number is sufficient.
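A toy model of the generation scheme (names invented for illustration; the real code keeps the generation array on the SSD and does far more during garbage collection) shows how one increment invalidates every pointer into a bucket, and how the 64-generation gap triggers collection:

```python
# Illustrative sketch of the generation-number scheme described above.
GC_THRESHOLD = 64

class Cache:
    def __init__(self, nbuckets):
        self.gens = [0] * nbuckets        # current generation per bucket
        self.gc_gens = [0] * nbuckets     # generation at last full GC

    def pointer(self, bucket):
        """A pointer records the bucket's generation at creation time."""
        return (bucket, self.gens[bucket])

    def valid(self, ptr):
        bucket, gen = ptr
        return self.gens[bucket] == gen

    def reuse(self, bucket):
        """Invalidate every pointer into the bucket with one increment."""
        self.gens[bucket] = (self.gens[bucket] + 1) % 256   # 8-bit counter
        if (self.gens[bucket] - self.gc_gens[bucket]) % 256 >= GC_THRESHOLD:
            self.garbage_collect()

    def garbage_collect(self):
        # A real GC would also drop the now-stale pointers from the btree.
        self.gc_gens = list(self.gens)
```

Because collection always happens once a bucket's generation has moved 64 past its last-collected value, the counter can never catch up to a live pointer by wrapping.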

Unlike standard btrees, bcache's btrees aren't kept fully sorted, so if you want to insert a key you don't have to rebalance the whole thing. Rather, they're kept sorted according to when they were written out; if the first ten pages are already on disk, bcache will insert into the 11th page, in sorted order, until it's full. During garbage collection (and, in the future, during insertion if there are too many sets) it'll re-sort the whole bucket. This means bcache doesn't have much of the index pinned in memory, but it also doesn't have to do much work to keep the index written out. Compare that to a hash table of ten million or so entries and the advantages are obvious.
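The "sorted sets" idea can be sketched in a few lines. This is a simplified model, not bcache's actual data structures (keys here are bare integers and "pages" are Python lists): a node is a sequence of independently sorted runs, only the last, still-in-memory run accepts inserts, and a lookup binary-searches each run:

```python
import bisect

class Node:
    def __init__(self):
        self.sets = [[]]          # each inner list is kept sorted

    def insert(self, key):
        # insert into the last run only - no global rebalance needed
        bisect.insort(self.sets[-1], key)

    def write_out(self):
        # once a run is written to the SSD it becomes immutable;
        # new inserts start a fresh run
        self.sets.append([])

    def lookup(self, key):
        for s in self.sets:       # binary search each run in turn
            i = bisect.bisect_left(s, key)
            if i < len(s) and s[i] == key:
                return True
        return False

    def resort(self):
        # done during garbage collection: merge all runs into one
        merged = sorted(k for s in self.sets for k in s)
        self.sets = [merged]
```

Each insert touches only the in-memory run, so writing the index out is cheap; the cost is a handful of binary searches per lookup until `resort()` merges the runs.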

State of the code

Currently, the code is looking fairly stable; it's survived overnight torture tests. Production is still a ways away, and there are some corner cases and IO error handling to flesh out, but more testing would be very welcome at this point.

There's a long list of planned features:

IO tracking: By keeping a hash of the most recent IOs, it's possible to track sequential IO and bypass the cache - large file copies, backups, and raid resyncs will all bypass the cache. This one's mostly implemented.
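One plausible shape for such tracking (the threshold, table size, and names below are invented for illustration, not taken from the bcache code) is a small table keyed by where recent IOs ended: if a new IO picks up exactly where an old one left off, its running total grows, and past a size threshold the IO bypasses the cache:

```python
from collections import OrderedDict

BYPASS_BYTES = 1 << 20          # bypass the cache past 1 MB sequential
TABLE_SIZE = 128                # remember this many recent IO streams

class IOTracker:
    def __init__(self):
        self.recent = OrderedDict()   # end offset -> sequential bytes so far

    def should_bypass(self, offset, length):
        # continue an existing stream if one ended at this offset
        run = self.recent.pop(offset, 0) + length
        self.recent[offset + length] = run
        if len(self.recent) > TABLE_SIZE:
            self.recent.popitem(last=False)   # evict the oldest entry
        return run >= BYPASS_BYTES
```

A large file copy issues back-to-back IOs, so its running total quickly crosses the threshold and it stops polluting the cache, while scattered random IOs never accumulate a run.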

Write behind caching: Synchronous writes are becoming more and more of a problem for many workloads, but with bcache random writes become sequential writes. If you've got the ability to buffer 50 or 100GB of writes, many might never hit your RAID array before being rewritten.

The initial write behind caching implementation isn't far off, but at first it'll work by flushing out new btree pages to disk quicker, before they fill up - journaling won't be required because the only metadata that must be written out in order is a single index. Since we have garbage collection, we can mark buckets as in-use before we use them, and leaking free space is a non-issue. (This is analogous to soft updates - only much more practical.) However, journaling would still be advantageous so that all new keys can be flushed out sequentially; then updates to the btree can happen as pages fill up, rather than as many smaller writes, so that synchronous writes can be completed quickly. Since bcache's btree is very flat, this won't be much of an issue for most workloads, but should still be worthwhile.

Multiple cache devices have been planned from the start, and mostly implemented. Suppose you had multiple SSDs to use - you could stripe them, but then you have no redundancy, which is a problem for writeback caching. Or you could mirror them, but then you're pointlessly duplicating data that's present elsewhere. Bcache will be able to mirror only the dirty data, and then drop one of the copies when it's flushed out.

Checksumming is a ways off, but definitely planned; it'll keep checksums of all the cached data in the btree, analogous to what Btrfs does. If a checksum doesn't match, that data can be simply tossed, the error logged, and the data read from the backing device or redundant copy.
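Since the feature is only planned, the following is just a hedged sketch of the verify-on-read idea (invented helper names, CRC32 standing in for whatever checksum bcache ends up using): keep a checksum alongside each cached extent, and fall back to the backing device when it doesn't match:

```python
import zlib

def store(cache, sector, data):
    # remember a checksum alongside the cached data
    cache[sector] = (zlib.crc32(data), data)

def read(cache, backing, sector):
    if sector in cache:
        csum, data = cache[sector]
        if zlib.crc32(data) == csum:
            return data
        del cache[sector]         # toss the corrupt copy (and log the error)
    return backing[sector]        # reread from the backing device
```

The cache can afford to be ruthless here: unlike a filesystem, it always has another copy of the data to fall back on, so a mismatch costs only one extra read.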

There's also a lot of room for experimentation and potential improvement in the various heuristics. Right now the cache functions in a least-recently-used (LRU) mode, but it's flexible enough to allow for other schemes. Potentially, we can retain data based on how much real time it saves the backing device, calculated from both the seek time and bandwidth.

Sample performance numbers

Of course, performance is the only reason to use bcache, so benchmarks matter. Unfortunately, there's still an odd bug affecting buffered IO, so the current benchmarks don't yet fully reflect bcache's potential; they are more a measure of current progress. Bonnie isn't particularly indicative of real-world performance, but it has the advantage of familiarity and being easy to interpret; here is the bonnie output:

Uncached: SATA 2 TB Western Digital Green hard drive

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms

And now cached with a 64 GB Corsair Nova:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us

In these numbers, the per-character columns are mostly irrelevant for our purposes, as they're affected by other parts of the kernel. The write and rewrite numbers are only interesting in that they don't go down, since bcache isn't doing write behind caching yet. The sequential input is reading data bonnie previously wrote, and thus should all be coming from the SSD. That's where bcache is lacking; the SSD is capable of about 235 MB/sec. The random IO numbers are actually about 90% reads, 10% writes of 4k each; without write behind caching, bonnie is actually bottlenecked on the writes hitting the spinning metal disk, and bcache isn't that far off from the theoretical maximum.

For more information

The bcache wiki holds more details about the software, more formal benchmark numbers, and sample commands for getting started.

The git repository for the kernel code is available at git://evilpiepirate.org/~kent/linux-bcache.git. The userspace tools are in a separate repository: git://evilpiepirate.org/~kent/bcache-tools.git. Both are viewable with a web browser at the gitweb site.



Bcache: Caching beyond just RAM

Posted Jul 8, 2010 3:09 UTC (Thu) by NightMonkey (subscriber, #23051) [Link]

Could this be used to make caches out of magnetic platter drives, or tmpfs, or any block device?

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 3:23 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

Aside from tmpfs not being a block device, yes. I don't know that using hard drives as cache would ever be worthwhile - it's written with the assumption that random reads are fast. If you were to try that, you'd want to have it cache decent sized chunks of contiguous data (on cache miss, always read 64 kb or so in at a time). But it would certainly work.

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 3:35 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

a local magnetic drive may be faster than a network drive

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 11:13 UTC (Thu) by ewan (subscriber, #5533) [Link]

And a small fast disk system (e.g. striped 15k SAS disks) may be a useful cache over a big slow one (e.g. a huge RAID of 5400rpm SATA disks).

I wonder if bcache can be stacked? So you could have an SSD over some SAS disks over a SATA RAID?

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 11:24 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

There's no way to do that now - all the cache devices in the system are pooled together and used symmetrically - but it is something I've been considering down the road. It might not be all that much more work, I'm just curious if the performance would justify it. Most people's working set just isn't all that huge, and also the big problem with your big slow raid6 is random writes, which bcache should be able to mostly eliminate. But we'll see.

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 3:42 UTC (Thu) by NightMonkey (subscriber, #23051) [Link]

I see. It might be interesting to use with a hard drive (or RAID 0 set) as a write-behind cache, once that is implemented. Or perhaps in front of a high-latency DRBD device. Some interesting possibilities! :)

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 15:40 UTC (Thu) by martinfick (subscriber, #4455) [Link]

Why would you want it in front of a DRBD device? If you really want to sacrifice reliability for performance with a DRBD device, simply use protocol A or B; it will likely be more reliable.

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 3:43 UTC (Thu) by wstearns (✭ supporter ✭, #4102) [Link]

To add to Kent's answer, you could use a traditional (non-tmpfs) ramdisk as it's a full block device. In a sense you're overriding the cache strategy by devoting the ramdisk's memory to caching one or more performance critical filesystems. It might be useful in some situations even though the system as a whole has less efficient use of memory. Using a hard drive as cache might actually be useful if the block device/filesystem you're caching is over a slow link or significantly slower than the cache device. If your local hard drive is at least as large as the remote drive, you also have the option of using raid 1 and setting the remote drive to "write-mostly" (so that all reads come from the local drive except in case of drive failure; see "man mdadm").

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 14:20 UTC (Thu) by butlerm (subscriber, #13312) [Link]

"In a sense you're overriding the cache strategy by devoting the ramdisk's memory to caching one or more performance critical filesystems."

I imagine that Bcache caches O_DIRECT reads and writes, which could give a caching advantage even where an application has elected to bypass the ordinary buffer cache.

Bcache seems ideal for use in NFS servers, and iSCSI and FCoE targets. NFS servers are supposed to do synchronous writes, right? (Reliable, crash resistant) write behind caching would be outstanding.

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 14:57 UTC (Thu) by etienne (subscriber, #25256) [Link]

> I imagine that Bcache caches O_DIRECT reads and writes, which could give a caching advantage even where an application has elected to bypass the ordinary buffer cache.

An application like a boot-loader installation/upgrade which tries to tell which disk sectors to load from the MBR?
I hope there is a sync() to write to (the real destination) disk.

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 23:19 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

Bootloaders are a special situation - the most practical thing to do would be to just not enable write behind caching for /boot. For everything else, sync happens when you switch off write behind caching, and sync without switching it off doesn't really make any sense, the cached device is going to be inconsistent as long as write behind caching is on.

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 23:26 UTC (Thu) by jlayton (subscriber, #31672) [Link]

> NFS servers are supposed to do synchronous writes, right? (Reliable, crash resistant) write behind caching would be outstanding.

Not since NFSv3 was implemented. That gave the ability to do safe asynchronous writes. The client sends a bunch of WRITEs to the server and then issues a COMMIT. If the server crashes, the client will reissue uncommitted writes.

With NFSv2 however, you're correct that the server is supposed to do sync writes (and most do these days, at least by default).

Ramdisks and hard drives as cache devices

Posted Jul 20, 2010 11:38 UTC (Tue) by neilbrown (subscriber, #359) [Link]

Yes, NFSv3 helps with writing large files. But metadata operations (create, unlink, chmod, mv) still need to be synchronous and some workloads can be very metadata-heavy (untar is a good example, 'make' on a big project tends to delete and create lots of relatively small files too).
So a low-latency cache can definitely improve the performance of an NFS server.

Ramdisks and hard drives as cache devices

Posted Jul 22, 2010 20:59 UTC (Thu) by nix (subscriber, #2304) [Link]

Unfortunately if you want to spot cache invalidations, the lack of leases on NFSv3 is a killer, because you still have to roundtrip to the server to see if your cache is stale, and roundtrips are the slow part :( fs-cache slows NFS *down* quite considerably in my experience, for exactly this reason.

Ramdisks and hard drives as cache devices

Posted Jul 8, 2010 23:32 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

> (Reliable, crash resistant)

That's what I've been working towards :)

Safe write behind caching really is a large step up in terms of the guarantees the cache has to be able to make - with writethrough caching, you just have to make sure you return good data, or nothing. And btree code is hard enough, so I've been working on getting the easy cases rock solid first - but safe write behind caching is absolutely happening. I've got some of the preliminary stuff out of the way and I'm hoping to have a rough initial version before too long, depending on how debugging the new stuff goes.

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 14:05 UTC (Thu) by cesarb (subscriber, #6266) [Link]

> as long as the filesystem has an embedded UUID

This sounds problematic to me. There is no guarantee that the UUID is unique. It might have been unique when it was *created*, but as soon as you duplicate a partition (with dd to another disk, a RAID 1 which is later split, or anything else), it ceases being unique. You could change the UUID of the copy, but who will remember to do that?

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 14:52 UTC (Thu) by sync (guest, #39669) [Link]

XFS refuses to mount a filesystem with a duplicate UUID.
I think other filesystems should follow.

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 14:57 UTC (Thu) by faramir (subscriber, #2327) [Link]

Given that many distributions now use UUIDs rather then device names in /etc/fstab, not having unique UUIDs is already a recipe for disaster.

Unique UUIDs

Posted Jul 8, 2010 16:19 UTC (Thu) by wstearns (✭ supporter ✭, #4102) [Link]

You're correct; at least in the case of RAID 1 mdadm, the md device and the underlying partitions all share the exact same UUID. When I use bcache on a raid array, I don't cache the underlying partitions, I cache the raid device itself. That lets the raid code handle a missing component partition, raid resync, and any other raid issues internally.
If you copy a filesystem with dd or raid split as you mention, you still have to tell bcache both what UUID and what block device to cache, letting you specify the original partition or the new copy, whichever you want cached. So bcache shouldn't introduce new problems; a non-bcache linux system will have problems already if you have 2 different filesystems with the same UUID as others have mentioned.

Bcache and filesystem recovery

Posted Jul 8, 2010 16:54 UTC (Thu) by abacus (guest, #49001) [Link]

An important issue has not been discussed in the above article, and that is how file system recovery works when using Bcache. Does Bcache honor the write ordering requirements communicated by filesystems such that e.g. journal-based filesystem recovery still works ?

Bcache and filesystem recovery

Posted Jul 9, 2010 2:15 UTC (Fri) by koverstreet (subscriber, #4296) [Link]

It doesn't handle barriers just yet. That's coming soon, it's needed for write behind caching.

This isn't a correctness issue; if you don't have barriers, you just have to wait for writes to complete before starting another to guarantee ordering. Barriers have historically taken a while to add to things like raid, lvm and drbd; bcache is nothing special.

There is a bug in that if you're running bcache on something that supports barriers, bcache won't indicate that they're no longer supported... I haven't worried much about that since you can disable them in the filesystem, and it's only until I get barriers implemented in bcache.

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 17:34 UTC (Thu) by sflintham (subscriber, #47422) [Link]

> This approach of intercepting bio requests in the background allows us to start and stop caching on the fly, to add and remove cache devices, and to boot with or without bcache

Am I right in thinking this will only be possible once the checksumming feature mentioned at the bottom of the article is in place? Otherwise, what would stop the cache returning an out-of-date copy of a block modified while it was offline?

Bcache: Caching beyond just RAM

Posted Jul 8, 2010 23:16 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

Checksumming won't save you here - what's it supposed to compare the checksum against? If you fill up your cache, turn off caching, write some data, and then turn caching back on without telling bcache about it... well, computers can't always save you from rm -rf either.

I am looking into invalidating the cache contents if a filesystem was already opened read write, or perhaps checking if the first page or so matches what it previously had - this would catch it provided the filesystem superblock changed. But the real performance gains are to be had with write behind caching, and none of this really matters there, since the cached device is now inconsistent if you use it without the cache. By the time bcache saw the cache was out of date it'd be too late to do anything.

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 0:08 UTC (Fri) by akumria (subscriber, #7773) [Link]

Perhaps I'm not understanding everything.

But it seems to me, that you could move the bcache functionality into the filesystem.

i.e. Why can't btrfs store all writes on that faster medium and then replicate them back to the slower one?

Also, isn't this akin to storing the journal of a filesystem externally (I've never done it). i.e. Why not point the ext3/ext4 journal at the SSD?

Would you get the same (or better) benefit?

It seems bcache is useful if the underlying filesystem doesn't do it, but in the above two cases, is there much benefit?

Insight appreciated.

Thanks,
Anand

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 1:21 UTC (Fri) by koverstreet (subscriber, #4296) [Link]

You could stuff everything into a filesystem, sure, but what would be the point? Caching really is doing something else, it wouldn't really be able to share any filesystem code, and it'd be tied to that one filesystem.

The main thing is the allocation strategy you want for caching is _completely_ different than for filesystems. Fragmentation isn't a real problem in the cache, since we can free fixed sized chunks regardless of what's in them. This means we're free to write data to the cache however we want to get the best performance. A filesystem has to retain data for an arbitrary amount of time, and thus needs to pay a lot of attention to making sure free space doesn't fragment too much.

Putting an external journal on an SSD gets you a bit of what bcache is after, but it'll only help with writes, and not to the extent bcache can. How would you effectively use an 80 gb journal? With bcache, you'll be able to fill up your caches with almost all dirty writes, and then write them out to your RAID6 with no restrictions on ordering - potentially turning a huge portion of random writes into mostly sequential ones, and even more will get overwritten in the cache before the raid ever sees them.

bcache and cluster filesystems?

Posted Jul 9, 2010 2:56 UTC (Fri) by skissane (subscriber, #38675) [Link]

How would this interact with a cluster file system such as OCFS?
I assume it wouldn't work, unless the cluster file system was aware of the
bcache, and had some way to do cache invalidation of the blocks which had
been changed by other nodes in the cluster....

bcache and cluster filesystems?

Posted Jul 9, 2010 3:28 UTC (Fri) by koverstreet (subscriber, #4296) [Link]

Yeah, it'd quickly explode. Probably be fun to watch, though.

bcache and cluster filesystems?

Posted Jul 10, 2010 11:20 UTC (Sat) by skissane (subscriber, #38675) [Link]

it would be interesting to work out, what extra features would bcache need to provide so that a clustered filesystem could use it without issue... i assume just some API to tell bcache to invalidate some block...

bcache and cluster filesystems?

Posted Jul 11, 2010 3:22 UTC (Sun) by koverstreet (subscriber, #4296) [Link]

Yeah, nothing more complicated than that. Make a new bio flag to invalidate a region of cache, and make sure everything that isn't a cache ignores it. Making bcache handle it would be trivial, it'd just be a matter of making sure ocfs2 always invalidates what it needs to.

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 3:36 UTC (Fri) by mwalls (guest, #6268) [Link]

Isn't this substantially duplicating Facebook's flashcache? Are the two related?

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 3:52 UTC (Fri) by koverstreet (subscriber, #4296) [Link]

Completely different design and completely unrelated. I started bcache a couple months before flashcache was announced. Flashcache is simpler and being used in production now. Bcache has, in my opinion, a lot more potential than flashcache.

Bcache: Caching beyond just RAM

Posted Jul 17, 2010 3:55 UTC (Sat) by bill_mcgonigle (guest, #60249) [Link]

I was wondering the same thing. As I understand it, the main difference is that with Flashcache you tell it, "here, you're in front of this device". And then you mount the flashcache device.

bcache uses a kernel feature that lets it hook in 'from the side'. So, you still mount the regular device. In theory you can insert/pull/insert bcache as much as you want and your apps would be fine to keep on running. I think that makes it ultimately more powerful.

You could even imagine a workload where you have a bazillion disks that receive sporadic heavy access (maybe remote disks where old data isn't useful), and you could have some sort of monitor that would keep a few SSD's busy by inserting them as bcaches on-demand.

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 8:03 UTC (Fri) by MisterIO (guest, #36192) [Link]

Is it safe in the presence of kernel bugs (or similar problems) for writes? I mean, we've seen things like ext4 delaying writes for 30 secs, but that had obvious reliability problems. Let's say that bcache successfully caches a certain amount of writes (that is, they are complete on the cache), but then the system freezes, so the cache isn't written back to the actual hd. At the next boot, does bcache remember to perform those writes?

Bcache: Caching beyond just RAM

Posted Jul 10, 2010 16:18 UTC (Sat) by MisterIO (guest, #36192) [Link]

Ooouuuh, it's so cold being alone in the shadow in winter!

Was my previous question very retarded? I noticed it's the only comment in the thread which was completely ignored. So I thought that knowing to have said something really stupid is still more useful than nothing. Thus, this post was born.

Bcache: Caching beyond just RAM

Posted Jul 10, 2010 20:32 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

if the data isn't written to the bcache HD and you do not have battery backed cache on the system, the data will be lost. This is the same as with any storage/filesystem.

however (once the write behind features are in bcache) once the data is written to the bcache storage it should push those changes back to the real device after the next boot. If it doesn't the write-behind feature can't be considered done and useable ;-)

Bcache: Caching beyond just RAM

Posted Jul 11, 2010 3:20 UTC (Sun) by koverstreet (subscriber, #4296) [Link]

Yeah, that's all correct :)

Kernel bugs are a different matter entirely though - if a program doesn't do what you think it does, it could be doing anything. If it's buggy it could be overwriting all your important data with Rick Astley songs, or opening the door to the velociraptor cages. There's just no way to tell at that point.

You test everything as best you can, but software's complicated, there's always something lurking and no complete guarantees.

Bcache: Caching beyond just RAM

Posted Jul 12, 2010 0:14 UTC (Mon) by MisterIO (guest, #36192) [Link]

Well, with kernel bugs I meant kernel bugs not affecting this code and which by no means start writing random songs of some unknown guy(yeah, sorry, I don't know him) over your data, but still freeze the system. At that point basically the only real problem you could have is losing data on your hd(either just the data that didn't make it to the hd or the whole file you were writing at the time of the failure. The second case happened to me frequently with ext4 and I lost many normal data files because of it, while I was able instead to save some sources I was coding simply because I used git and I committed quite often).

Bcache: Caching beyond just RAM

Posted Jul 9, 2010 12:27 UTC (Fri) by eludias (subscriber, #4058) [Link]

How will O_DIRECT be handled? I need this to circumvent any caching layer to keep my hardware healthy.

Background: I've also got one of those nasty WD Caviar Green (1Tb) drives, and there is a feature in the firmware of the drives which auto-parks the heads after 8 seconds of inactivity. This interacts quite badly with an OS which saves its data once per 30 seconds, resulting in a drive worn down in about 6 months.

Now the way to circumvent this behaviour is to read something from the drive, say once per 5 seconds. And the most efficient way to do so is to read from the cache of the drive, so we re-read sector 0 over and over again. With O_DIRECT we can bypass the disk cache of Linux traditionally, but will this also be the case when using bcache?

O_DIRECT read of sector 0

Posted Jul 9, 2010 15:37 UTC (Fri) by wstearns (✭ supporter ✭, #4102) [Link]

Since bcache can only cache individual filesystems and I assume your actual filesystems are on partitions (as opposed to a single filesystem covering the entire drive), bcache won't be caching sector 0 as that's not part of a filesystem.

O_DIRECT read of sector 0

Posted Jul 10, 2010 10:45 UTC (Sat) by eludias (subscriber, #4058) [Link]

I was under the impression that bcache cached block devices (hence the name ), and not filesystems. But you have a good point: if I would cache the partition and not the whole device, I might be able to circumvent it by reading outside the partition.

O_DIRECT read of sector 0

Posted Jul 10, 2010 21:13 UTC (Sat) by wstearns (✭ supporter ✭, #4102) [Link]

You're correct; bcache does cache block devices. I phrased it poorly; the idea I failed to get across was that the filesystems we'd normally submit to bcache to cache are on individual partitions (which linux treats as block devices) as opposed to the entire drive, allowing you to bypass bcache on sector 0 and still cache the partitions.

Bcache: Caching beyond just RAM

Posted Jul 12, 2010 9:16 UTC (Mon) by etienne (subscriber, #25256) [Link]

> there is a feature in the firmware of the drives which auto-parks the heads after 8 seconds of inactivity

Probably a maximum power saving mode; you can usually see/adjust it (and save the setup) using hdparm.

Bcache: Caching beyond just RAM

Posted Jul 10, 2010 3:20 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Is it persistent across reboots? I.e. will my frequently accessed data still be available from the SSD after a reboot?

Bcache: Caching beyond just RAM

Posted Jul 10, 2010 8:59 UTC (Sat) by koverstreet (subscriber, #4296) [Link]

It is. Currently, if you shut down uncleanly it'll invalidate the cache, as it'll be inconsistent, but handling that is part of safe write behind caching, which I'm working on now.

Bcache: Caching beyond just RAM

Posted Jul 11, 2010 14:39 UTC (Sun) by alankila (subscriber, #47141) [Link]

Excellent news. I look forward to using this technology.

Bcache: Caching beyond just RAM

Posted Jul 12, 2010 19:50 UTC (Mon) by rilder (guest, #59804) [Link]

Interesting project. However, I wonder how this will work with something like cleancache (http://lwn.net/Articles/389873/). Cleancache works in the realm of the VM. So is bcache supposed to complement cleancache, or is it a replacement for it?

The transparency of bcache is quite nice. Is bcache available as a module (so that it can be built out of tree and used)?

Bcache: Caching beyond just RAM

Posted Jul 12, 2010 20:30 UTC (Mon) by rilder (guest, #59804) [Link]

I recently came across L2ARC caching used in Solaris with ZFS + SSD. Is there any improvement or difference between that and bcache ?

Bcache: Caching beyond just RAM

Posted Jul 15, 2010 22:59 UTC (Thu) by bill_mcgonigle (guest, #60249) [Link]

There's quite a bit of similarity. L2ARC is read-only, for wide random accesses. bcache aims to be read/write. ZFS also separates out its ZIL for writes. If I'm building a ZFS box I'd use a big MLC drive for the L2ARC and a smaller SLC drive for the ZIL as the workloads differ. You could probably set bcache to be read-only and put a filesystem's journal on a different drive if you wanted a more-like-ZFS segregation. bcache has the nice attribute of just being able to pull it and keep running - ZFS isn't usually set up that way. The L2ARC and ZIL have the checksumming today whereas bcache will get to that. Of course, bcache is much more general and useful in situations where ZFS has no relevance. It's good to have chisels and screwdrivers.

Bcache: Caching beyond just RAM

Posted Jul 17, 2010 16:10 UTC (Sat) by intgr (subscriber, #39733) [Link]

How are you planning to protect file systems from being mounted if they have a write-back cache that is not activated or not available? It sounds like there is no infrastructure for this currently, but it could badly corrupt data if not recognized.

Generation number

Posted May 17, 2013 8:55 UTC (Fri) by wiza (guest, #91017) [Link]

I don't quite understand the effect of maintaining the generation number. Since FTL has used some algorithm to avoid wearing out, what good is the generation number?

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds