Kent Overstreet has been working on bcache, which is a Linux kernel module intended to improve the performance of block devices. Instead of using just memory to cache hard drives, he proposes to use one or more solid-state storage devices (SSDs) to cache block data (hence bcache, a block device cache).
The code is largely filesystem-agnostic as long as the filesystem has an embedded UUID (which includes the standard Linux filesystems and swap devices). When data is read from the hard drive, a copy is saved to the SSD. Later, when one of those sectors needs to be retrieved again, the kernel checks to see if it's still in the page cache. If so, the read comes from RAM just like it always has on Linux. If it's not in RAM but it is on the SSD, it's read from there. It's like adding 64GB or more of slower RAM to the system and devoting it to caching.
The design of bcache allows the use of more than one SSD to perform caching. It is also possible to cache more than one existing filesystem, or choose instead to just cache a small number of performance-critical filesystems. It would be perfectly reasonable to cache a partition used by a database manager, but not cache a large filesystem holding archives of old projects. The standard Linux page cache can be wiped out by copying a few large (near the size of your system RAM) files. Large file copies on that project archive partition won't wipe out an SSD-based cache using bcache.
Another potential use is caching remote disks with local media. You could use an existing partition, drive, or loopback device to cache any of the following: an AOE (ATA-over-Ethernet) drive, a SAN LUN, DRBD or NBD remote drives, iSCSI drives, a local CD, or slow local USB thumb drives. The local cache wouldn't have to be an SSD; it would just have to be faster than the media being cached.
Note that this type of caching is only for block devices (anything that shows up as a block device in /dev/). It isn't for network filesystems like NFS, CIFS, and so on (see the FS-cache module for the ability to cache individual files on an NFS or AFS client).
To intercept filesystem operations, bcache hooks into the top of the block layer, in __generic_make_request(). It thus works entirely in terms of BIO structures. By hooking into the sole function through which all disk requests pass, bcache doesn't need to make any changes to block device naming or filesystem mounting. If /dev/md5 was originally mounted on /usr/, it continues to show up as /dev/md5 mounted on /usr/ after bcache is enabled for it. Because the caching is transparent, there are no changes to the boot process; in fact, bcache could be turned on long after the system is up and running. This approach of intercepting bio requests in the background allows us to start and stop caching on the fly, to add and remove cache devices, and to boot with or without bcache.
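To make that read path concrete, here is a minimal userspace model of the decision a bio-level cache has to make for each read. The names here (struct request, cache_lookup()) are invented for the illustration; the real bcache operates on struct bio inside the kernel, which is why the device naming and mounting above it never change.

    /*
     * Minimal userspace model of the read-path decision a bio-level cache
     * makes.  struct request and cache_lookup() are invented for the
     * illustration; the real bcache operates on struct bio in the kernel.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct request {
        uint64_t sector;      /* start sector on the backing device */
        uint32_t nr_sectors;  /* length of the request */
    };

    /* Stub index lookup: does the SSD already hold this extent? */
    static bool cache_lookup(const struct request *req)
    {
        (void)req;
        return false;  /* a real lookup would walk the btree */
    }

    static void submit_read(const struct request *req)
    {
        if (cache_lookup(req)) {
            printf("hit:  %u sectors at %llu served from the SSD\n",
                   req->nr_sectors, (unsigned long long)req->sector);
        } else {
            printf("miss: %u sectors at %llu read from the backing device\n",
                   req->nr_sectors, (unsigned long long)req->sector);
            /* on completion, the data would be copied into a cache bucket
             * and a key pointing at it inserted into the index */
        }
    }

    int main(void)
    {
        struct request r = { .sector = 123456, .nr_sectors = 8 };
        submit_read(&r);
        return 0;
    }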
bcache's design focuses on avoiding random writes and playing to SSDs' strengths. Roughly, a cache device is divided up into buckets, which are intended to match the SSD's erase blocks. Each bucket has an eight-bit generation number, which is maintained in a separate array on the SSD just past the superblock. Pointers (both to btree buckets and to cached data) contain the generation number of the bucket they point to; thus, to free and reuse a bucket, it is sufficient to increment the generation number.
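As a rough illustration of that scheme, here is a compilable toy model. The names (bucket_gen[], struct bkey, ptr_stale()) are stand-ins chosen for the example, not the on-disk format; the point is only that bumping a bucket's generation invalidates every pointer into it at once.

    /*
     * Toy model of bucket generations.  Names are invented for the
     * illustration.  A pointer embeds the generation of the bucket it
     * points into, so incrementing the bucket's generation invalidates
     * every existing pointer without touching any of them.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_BUCKETS 1024

    static uint8_t bucket_gen[NR_BUCKETS]; /* per-bucket generation array */

    struct bkey {                 /* hypothetical cached-data pointer */
        uint32_t bucket;          /* which bucket the data lives in */
        uint8_t  gen;             /* bucket generation when the pointer was made */
        uint64_t offset;          /* offset of the data within the bucket */
    };

    static int ptr_stale(const struct bkey *k)
    {
        return bucket_gen[k->bucket] != k->gen;
    }

    /* Reusing a bucket is just an increment; no pointers need updating. */
    static void invalidate_bucket(uint32_t b)
    {
        bucket_gen[b]++;
    }

    int main(void)
    {
        struct bkey k = { .bucket = 7, .gen = bucket_gen[7], .offset = 0 };

        printf("stale before invalidate: %d\n", ptr_stale(&k)); /* 0 */
        invalidate_bucket(7);
        printf("stale after invalidate:  %d\n", ptr_stale(&k)); /* 1 */
        return 0;
    }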
This mechanism allows bcache to keep the cache device completely full; when it wants to write some new data, it just picks a bucket and increments its generation number, invalidating all the existing pointers to it. Garbage collection will remove the actual pointers eventually; there is no need for backpointers or any other infrastructure.
For each bucket, bcache remembers the generation number from the last time a full garbage collection was performed. Once the difference between the current generation number and the remembered number reaches 64, it's time for another garbage collection. Because collection is forced well before the difference can grow large, the generation number never gets a chance to wrap ambiguously, so eight bits are sufficient.
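The arithmetic behind that claim fits in a few lines: with unsigned eight-bit values, the difference current - last_gc is computed modulo 256, so it stays meaningful across a wrap as long as collection is forced at 64. The threshold constant below is named for the example.

    /*
     * Sketch of why eight bits are enough: unsigned subtraction of two
     * uint8_t generations gives the distance even across a wrap, and
     * garbage collection is forced once that distance reaches 64, long
     * before it could become ambiguous.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define GC_THRESHOLD 64

    static uint8_t gen_after(uint8_t current, uint8_t last_gc)
    {
        return (uint8_t)(current - last_gc); /* modulo-256 distance */
    }

    static int needs_gc(uint8_t current, uint8_t last_gc)
    {
        return gen_after(current, last_gc) >= GC_THRESHOLD;
    }

    int main(void)
    {
        /* 250 -> 8 wraps around, but the distance is still a sane 14 */
        printf("distance(8, 250) = %d\n", gen_after(8, 250));
        printf("needs_gc(8, 250) = %d\n", needs_gc(8, 250));
        printf("needs_gc(100, 20) = %d\n", needs_gc(100, 20));
        return 0;
    }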
Unlike standard btrees, bcache's btrees aren't kept fully sorted, so if you want to insert a key you don't have to rebalance the whole thing. Rather, keys are kept in sets, sorted according to when they were written out; if the first ten pages of a node are already on disk, bcache will insert into the 11th page, in sorted order, until it's full. During garbage collection (and, in the future, during insertion if there are too many sets) it'll re-sort the whole bucket. This means bcache doesn't need to keep much of the index pinned in memory, but it also doesn't have to do much work to keep the index written out. Compare that to a hash table of ten million or so entries and the advantages are obvious.
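Here is a small userspace sketch of that layout, with invented names and sizes: a node holds several independently sorted key sets, new keys go only into the last (still open) set, and a lookup searches each set in turn. The merge/re-sort pass done at garbage collection time is omitted, as are bounds checks.

    /*
     * Userspace sketch of the "sorted sets" idea.  Structure names and
     * sizes are made up for the illustration.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_SETS 8
    #define SET_KEYS 64

    struct keyset {
        int      nr;
        uint64_t keys[SET_KEYS];
    };

    struct node {
        int           nr_sets;   /* sets already written out, plus the open one */
        struct keyset sets[MAX_SETS];
    };

    /* Insert into the last set only; earlier sets are already on disk. */
    static void node_insert(struct node *n, uint64_t key)
    {
        struct keyset *s = &n->sets[n->nr_sets - 1];
        int i = s->nr++;

        while (i > 0 && s->keys[i - 1] > key) {  /* keep the open set sorted */
            s->keys[i] = s->keys[i - 1];
            i--;
        }
        s->keys[i] = key;
    }

    static int node_contains(const struct node *n, uint64_t key)
    {
        for (int i = 0; i < n->nr_sets; i++) {   /* one search per set */
            const struct keyset *s = &n->sets[i];
            int lo = 0, hi = s->nr - 1;

            while (lo <= hi) {                   /* binary search */
                int mid = (lo + hi) / 2;
                if (s->keys[mid] == key)
                    return 1;
                if (s->keys[mid] < key)
                    lo = mid + 1;
                else
                    hi = mid - 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        struct node n = { .nr_sets = 1 };

        node_insert(&n, 40);
        node_insert(&n, 10);
        n.nr_sets++;            /* pretend the first set was written out */
        node_insert(&n, 25);

        printf("25 present: %d, 99 present: %d\n",
               node_contains(&n, 25), node_contains(&n, 99));
        return 0;
    }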
Currently, the code is looking fairly stable; it's survived overnight torture tests. Production is still a ways away, and there are some corner cases and IO error handling to flesh out, but more testing would be very welcome at this point.
There's a long list of planned features:
IO tracking: By keeping a hash of the most recent IOs, it's possible to detect sequential IO and bypass the cache; large file copies, backups, and RAID resyncs will all go straight to the backing device. This one's mostly implemented.
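A minimal sketch of that idea, with an arbitrary table size and hash function: remember where recent IOs ended, and treat a new IO as sequential if it starts exactly where an earlier one left off. The real heuristic is presumably more elaborate (for example, only bypassing once a stream has grown past some size), but the core looks roughly like this.

    /*
     * Sketch of sequential-IO detection.  A small hash table remembers
     * where recent IOs ended; a new IO that starts exactly where one left
     * off is treated as part of a sequential stream and bypasses the
     * cache.  Table size and hash are arbitrary choices for the example.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NR_RECENT 64

    static uint64_t recent_end[NR_RECENT];  /* end sector of recent IOs */

    static unsigned hash_sector(uint64_t sector)
    {
        return (unsigned)(sector * 2654435761u) % NR_RECENT;
    }

    /* Returns true if this IO continues one we saw recently. */
    static bool io_should_bypass(uint64_t sector, uint32_t nr_sectors)
    {
        bool sequential = (recent_end[hash_sector(sector)] == sector);

        /* remember where this IO ends, hashed by its end sector */
        recent_end[hash_sector(sector + nr_sectors)] = sector + nr_sectors;
        return sequential;
    }

    int main(void)
    {
        memset(recent_end, 0xff, sizeof(recent_end)); /* no recent IOs yet */

        printf("first 1MB chunk bypass: %d\n", io_should_bypass(0, 2048));
        printf("next contiguous chunk : %d\n", io_should_bypass(2048, 2048));
        printf("random 4KB read       : %d\n", io_should_bypass(999999, 8));
        return 0;
    }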
Write behind caching: Synchronous writes are becoming more and more of a problem for many workloads, but with bcache random writes become sequential writes. If you've got the ability to buffer 50 or 100GB of writes, many of them may be overwritten in the cache and never hit your RAID array at all.
The initial write behind caching implementation isn't far off, but at first it'll work by flushing new btree pages out to disk more quickly, before they fill up; journaling won't be required, because the only metadata that must be written out in order is a single index. Since we have garbage collection, buckets can be marked as in use before they're used, and leaking free space is a non-issue. (This is analogous to soft updates, only much more practical.) However, journaling would still be advantageous: all new keys could be flushed out sequentially, so synchronous writes complete quickly, and the btree pages could then be updated as they fill up rather than with many smaller writes. Since bcache's btree is very flat, this won't be much of an issue for most workloads, but it should still be worthwhile.
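The payoff is easiest to see in a toy model: dirty extents accumulate in the cache in whatever order they arrive, and the writeback pass walks them in backing-device order, so what reached the cache as random writes leaves it as a mostly sequential stream. The structures here are invented for the illustration.

    /*
     * Sketch of the write-behind payoff: buffer dirty extents, then flush
     * them sorted by backing-device offset.  Structures are invented for
     * the example.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct dirty_extent {
        uint64_t sector;      /* where it belongs on the backing device */
        uint32_t nr_sectors;
    };

    static int cmp_by_sector(const void *a, const void *b)
    {
        const struct dirty_extent *x = a, *y = b;
        return (x->sector > y->sector) - (x->sector < y->sector);
    }

    static void writeback(struct dirty_extent *d, int n)
    {
        qsort(d, n, sizeof(*d), cmp_by_sector); /* flush in disk order */
        for (int i = 0; i < n; i++)
            printf("write %u sectors at %llu\n",
                   d[i].nr_sectors, (unsigned long long)d[i].sector);
    }

    int main(void)
    {
        /* random-looking writes, in the order they arrived in the cache */
        struct dirty_extent dirty[] = {
            { 900000, 8 }, { 16, 8 }, { 512000, 8 }, { 24, 8 },
        };

        writeback(dirty, sizeof(dirty) / sizeof(dirty[0]));
        return 0;
    }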
Multiple cache devices have been planned from the start, and are mostly implemented. Suppose you had multiple SSDs to use: you could stripe them, but then you have no redundancy, which is a problem for writeback caching. Or you could mirror them, but then you're pointlessly duplicating data that's already present elsewhere. Bcache will be able to mirror only the dirty data, then drop one of the copies once it has been flushed out.
Checksumming is a ways off, but definitely planned; it'll keep checksums of all the cached data in the btree, analogous to what Btrfs does. If a checksum doesn't match, that data can be simply tossed, the error logged, and the data read from the backing device or redundant copy.
There's also a lot of room for experimentation and potential improvement in the various heuristics. Right now the cache functions in a least-recently-used (LRU) mode, but it's flexible enough to allow for other schemes. Potentially, we can retain data based on how much real time it saves the backing device, calculated from both the seek time and bandwidth.
Of course, performance is the only reason to use bcache, so benchmarks matter. Unfortunately, there's still an odd bug affecting buffered IO, so the current benchmarks don't yet fully reflect bcache's potential; they are more a measure of current progress. Bonnie isn't particularly indicative of real-world performance, but it has the advantage of familiarity and being easy to interpret; here is the bonnie output:
Uncached: SATA 2 TB Western Digital Green hard drive
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
    Latency             14400us    2014ms   12486ms   18666us     549ms     460ms
And now cached with a 64GB Corsair Nova:
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
    Latency             14773us    1826ms    3153ms    3918us    2212us   12480us
In these numbers, the per-character columns are mostly irrelevant for our purposes, as they're affected by other parts of the kernel. The write and rewrite numbers are only interesting in that they don't go down, since bcache isn't doing write behind caching yet. The sequential input is reading data bonnie previously wrote, and thus should all be coming from the SSD. That's where bcache is currently lacking: the SSD is capable of about 235MB/sec, while the cached run measured around 180MB/sec. The random IO numbers are actually about 90% reads and 10% writes, of 4KB each; without write behind caching, bonnie is bottlenecked on the writes hitting the spinning disk, and bcache isn't far off the theoretical maximum.
The bcache wiki holds more details about the software, more formal benchmark numbers, and sample commands for getting started.
The git repository for the kernel code is available at git://evilpiepirate.org/~kent/linux-bcache.git. The userspace tools are in a separate repository: git://evilpiepirate.org/~kent/bcache-tools.git. Both are viewable with a web browser at the gitweb site.
Bcache: Caching beyond just RAM
Posted Jul 8, 2010 11:13 UTC (Thu) by ewan (subscriber, #5533) [Link]
I wonder if bcache can be stacked? So you could have an SSD over some SAS disks over a SATA RAID?
Ramdisks and hard drives as cache devices
Posted Jul 8, 2010 14:20 UTC (Thu) by butlerm (guest, #13312) [Link]
I imagine that Bcache caches O_DIRECT reads and writes, which could give a caching advantage even where an application has elected to bypass the ordinary buffer cache.
Bcache seems ideal for use in NFS servers, and iSCSI and FCoE targets. NFS servers are supposed to do synchronous writes, right? (Reliable, crash resistant) write behind caching would be outstanding.
Ramdisks and hard drives as cache devices
Posted Jul 8, 2010 14:57 UTC (Thu) by etienne (guest, #25256) [Link]
An application like a boot-loader installation/upgrade which tries to tell which disk sectors to load from the MBR?
I hope there is a sync() to write to (the real destination) disk.
Ramdisks and hard drives as cache devices
Posted Jul 8, 2010 23:26 UTC (Thu) by jlayton (subscriber, #31672) [Link]
Not since NFSv3 was implemented. That gave the ability to do safe asynchronous writes. The client sends a bunch of WRITEs to the server and then issues a COMMIT. If the server crashes, the client will reissue uncommitted writes.
With NFSv2 however, you're correct that the server is supposed to do sync writes (and most do these days, at least by default).
Ramdisks and hard drives as cache devices
Posted Jul 8, 2010 23:32 UTC (Thu) by koverstreet (subscriber, #4296) [Link]
That's what I've been working towards :)
Safe write behind caching really is a large step up in terms of the guarantees the cache has to be able to make - with writethrough caching, you just have to make sure you return good data, or nothing. And btree code is hard enough, so I've been working on getting the easy cases rock solid first - but safe write behind caching is absolutely happening. I've got some of the preliminary stuff out of the way and I'm hoping to have a rough initial version before too long, depending on how debugging the new stuff goes.
Bcache: Caching beyond just RAM
Posted Jul 8, 2010 14:05 UTC (Thu) by cesarb (subscriber, #6266) [Link]
This sounds problematic to me. There is no guarantee that the UUID is unique. It might have been unique when it was *created*, but as soon as you duplicate a partition (with dd to another disk, a RAID 1 which is later split, or anything else), it ceases being unique. You could change the UUID of the copy, but who will remember to do that?
Bcache and filesystem recovery
Posted Jul 9, 2010 2:15 UTC (Fri) by koverstreet (subscriber, #4296) [Link]
This isn't a correctness issue; if you don't have barriers, you just have to wait for writes to complete before starting another to guarantee ordering. Barriers have historically taken a while to add to things like RAID, LVM, and DRBD; bcache is nothing special.
There is a bug in that if you're running bcache on something that supports barriers, bcache won't indicate that they're no longer supported... I haven't worried much about that since you can disable them in the filesystem, and it's only until I get barriers implemented in bcache.
Bcache: Caching beyond just RAM
Posted Jul 8, 2010 17:34 UTC (Thu) by sflintham (guest, #47422) [Link]
"This approach of intercepting bio requests in the background allows us to start and stop caching on the fly, to add and remove cache devices, and to boot with or without bcache."
Am I right in thinking this will only be possible once the checksumming feature mentioned at the bottom of the article is in place? Otherwise, what would stop the cache returning an out-of-date copy of a block modified while it was offline?
Bcache: Caching beyond just RAM
Posted Jul 8, 2010 23:16 UTC (Thu) by koverstreet (subscriber, #4296) [Link]
I am looking into invalidating the cache contents if a filesystem was already opened read write, or perhaps checking if the first page or so matches what it previously had - this would catch it provided the filesystem superblock changed. But the real performance gains are to be had with write behind caching, and none of this really matters there, since the cached device is now inconsistent if you use it without the cache. By the time bcache saw the cache was out of date it'd be too late to do anything.
Bcache: Caching beyond just RAM
Posted Jul 9, 2010 0:08 UTC (Fri) by akumria (subscriber, #7773) [Link]
Perhaps I'm not understanding everything.
But it seems to me, that you could move the bcache functionality into the filesystem.
i.e. Why can't btrfs store all writes on that faster medium and then replicate them back to the slower one?
Also, isn't this akin to storing the journal of a filesystem externally (I've never done it). i.e. Why not point the ext3/ext4 journal at the SSD?
Would you get the same (or better) benefit?
It seems bcache is useful if the underlying filesystem doesn't do it, but in the above two cases, is there much benefit?
Insight appreciated.
Thanks,
Anand
Bcache: Caching beyond just RAM
Posted Jul 9, 2010 1:21 UTC (Fri) by koverstreet (subscriber, #4296) [Link]
The main thing is the allocation strategy you want for caching is _completely_ different than for filesystems. Fragmentation isn't a real problem in the cache, since we can free fixed sized chunks regardless of what's in them. This means we're free to write data to the cache however we want to get the best performance. A filesystem has to retain data for an arbitrary amount of time, and thus needs to pay a lot of attention to making sure free space doesn't fragment too much.
Putting an external journal on an SSD gets you a bit of what bcache is after, but it'll only help with writes, and not to the extent bcache can. How would you effectively use an 80 gb journal? With bcache, you'll be able to fill up your caches with almost all dirty writes, and then write them out to your RAID6 with no restrictions on ordering - potentially turning a huge portion of random writes into mostly sequential ones, and even more will get overwritten in the cache before the raid ever sees them.
Bcache: Caching beyond just RAM
Posted Jul 17, 2010 3:55 UTC (Sat) by bill_mcgonigle (guest, #60249) [Link]
bcache uses a kernel feature that lets it hook in 'from the side'. So, you still mount the regular device. In theory you can insert/pull/insert bcache as much as you want and your apps would be fine to keep on running. I think that makes it ultimately more powerful.
You could even imagine a workload where you have a bazillion disks that receive sporadic heavy access (maybe remote disks where old data isn't useful), and you could have some sort of monitor that would keep a few SSD's busy by inserting them as bcaches on-demand.
Bcache: Caching beyond just RAM
Posted Jul 10, 2010 16:18 UTC (Sat) by MisterIO (guest, #36192) [Link]
Was my previous question very retarded? I noticed it's the only comment in the thread which was completely ignored. So I thought that knowing to have said something really stupid is still more useful than nothing. Thus, this post was born.
Bcache: Caching beyond just RAM
Posted Jul 10, 2010 20:32 UTC (Sat) by dlang (guest, #313) [Link]
However (once the write behind features are in bcache), once the data is written to the bcache storage, it should push those changes back to the real device after the next boot. If it doesn't, the write-behind feature can't be considered done and usable ;-)
Bcache: Caching beyond just RAM
Posted Jul 11, 2010 3:20 UTC (Sun) by koverstreet (subscriber, #4296) [Link]
Kernel bugs are a different matter entirely though - if a program doesn't do what you think it does, it could be doing anything. If it's buggy, it could be overwriting all your important data with Rick Astley songs, or opening the door to the velociraptor cages. There's just no way to tell at that point.
You test everything as best you can, but software's complicated, there's always something lurking and no complete guarantees.
Bcache: Caching beyond just RAM
Posted Jul 9, 2010 12:27 UTC (Fri) by eludias (guest, #4058) [Link]
Background: I've also got one of those nasty WD Caviar Green (1Tb) drives, and there is a feature in the firmware of the drives which auto-parks the heads after 8 seconds of inactivity. This interacts quite badly with an OS which saves its data once per 30 seconds, resulting in a drive worn down in about 6 months.
Now the way to circumvent this behaviour is to read something from the drive, say once per 5 seconds. And the most efficient way to do so is to read from the cache of the drive, so we re-read sector 0 over and over again. With O_DIRECT we can bypass the disk cache of Linux traditionally, but will this also be the case when using bcache?
Bcache: Caching beyond just RAM
Posted Jul 12, 2010 9:16 UTC (Mon) by etienne (guest, #25256) [Link]
Probably maximum power saving mode; you can usually see/adjust (and save the setup) using hdparm.
Bcache: Caching beyond just RAM
Posted Jul 10, 2010 3:20 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
Is it persistent across reboots? I.e. will my frequently accessed data still be available from the SSD after a reboot?
Bcache: Caching beyond just RAM
Posted Jul 12, 2010 19:50 UTC (Mon) by rilder (guest, #59804) [Link]
The transparency of bcache is quite nice. Is bcache available as a module (so that it can be built out of tree and used)?
Bcache: Caching beyond just RAM
Posted Jul 15, 2010 22:59 UTC (Thu) by bill_mcgonigle (guest, #60249) [Link]
There's quite a bit of similarity. L2ARC is read-only, for wide random accesses. bcache aims to be read/write. ZFS also separates out its ZIL for writes. If I'm building a ZFS box I'd use a big MLC drive for the L2ARC and a smaller SLC drive for the ZIL as the workloads differ. You could probably set bcache to be read-only and put a filesystem's journal on a different drive if you wanted a more-like-ZFS segregation. bcache has the nice attribute of just being able to pull it and keep running - ZFS isn't usually set up that way. The L2ARC and ZIL have the checksumming today whereas bcache will get to that. Of course, bcache is much more general and useful in situations where ZFS has no relevance. It's good to have chisels and screwdrivers.