
The bcachefs filesystem

By Jonathan Corbet
August 25, 2015
The Linux kernel does not lack for filesystem support; many dozens of filesystem implementations are available for one use case or another. But, after all these years, Linux arguably lacks an established "next-generation" filesystem with advanced features and a design suited to contemporary hardware. That situation holds despite the existence of a number of competitors for that title; Btrfs remains at the top of the list, but others, such as tux3 and (still!) reiser4, are out there as well. In each case, it has taken rather longer than expected for the code to reach the required level of maturity. The list of putative next-generation filesystems has just gotten longer with the recent announcement of the "bcachefs" filesystem.

Bcachefs is an extension of bcache, which first appeared in LWN in 2010. Bcache was designed as a caching layer that improves block I/O performance by using a fast solid-state drive as a cache for a (slower, larger) underlying storage device. Bcache has been steadily developed over the last five years; it was merged into the mainline kernel during the 3.10 development cycle in 2013.

Mainline bcache is not a filesystem; instead, it looks like a special kind of block device. It manages the movement of blocks of data between fast and slow storage, working to ensure that the most frequently used data is kept on the faster device. This task is complex; bcache must manage data in a way that yields high performance while ensuring that no data is ever lost, even in the face of an unclean shutdown. Even so, at its interface to the rest of the system, bcache looks like a simple block device: give it numbered blocks of data, and it will store (and retrieve) them.
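To make the "simple block device with a cache behind it" idea concrete, here is a minimal sketch in Python. It is not bcache's actual design: the class name, the write-through policy, and the least-recently-used eviction rule are all assumptions chosen for brevity, and plain dictionaries stand in for the fast and slow devices.

```python
from collections import OrderedDict


class CachedBlockDevice:
    """Toy block device: a small fast cache in front of a large slow store.

    Hypothetical illustration only; none of these names come from bcache.
    """

    def __init__(self, backing, cache_blocks):
        self.backing = backing            # block number -> bytes (the "slow" device)
        self.cache = OrderedDict()        # block number -> bytes (the "fast" device)
        self.cache_blocks = cache_blocks  # capacity of the fast device, in blocks
        self.hits = 0
        self.misses = 0

    def _remember(self, block, data):
        # Insert or refresh the entry, then evict the least recently used
        # block if the cache has grown past its capacity.
        self.cache[block] = data
        self.cache.move_to_end(block)
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)

    def write(self, block, data):
        # Write-through (an assumption): the slow device is updated first,
        # so losing the cache contents never loses data.
        self.backing[block] = data
        self._remember(block, data)

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            self.cache.move_to_end(block)
            return self.cache[block]
        self.misses += 1
        data = self.backing[block]
        self._remember(block, data)
        return data


dev = CachedBlockDevice(backing={}, cache_blocks=2)
dev.write(0, b"a")
dev.write(1, b"b")
dev.write(2, b"c")     # cache holds two blocks, so block 0 is evicted
first = dev.read(0)    # miss: served from the slow device, then cached
second = dev.read(0)   # hit: served from the fast device
```

The caller only ever sees `read()` and `write()` on numbered blocks; where each block currently lives is the device's business, which is the property the article describes.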

Users typically want something a bit higher-level than that; they want to be able to organize blocks into files, and files into directory hierarchies. That task is handled by a filesystem like ext4 or Btrfs. Thus, on current systems, bcache will be used in conjunction with a filesystem layer to provide a complete solution.

It seems that, over time, bcache has developed the potential to provide filesystem functionality on its own. In the bcachefs announcement, Kent Overstreet said:

Well, years ago (going back to when I was still at Google), I and the other people working on bcache realized that what we were working on was, almost by accident, a good chunk of the functionality of a full blown filesystem - and there was a really clean and elegant design to be had there if we took it and ran with it.

The actual running with this idea appears to have happened relatively recently, with the first publicly visible version of the bcachefs code being committed to the bcache repository in May 2015. Since then, it has seen a steady stream of commits from Kent; it was announced on the bcache mailing list in mid-July, and on linux-kernel just over a month later.

With the bcachefs code added, bcache has gained the namespace and file-management features that, until now, had to be supplied by a separate filesystem layer. Like Btrfs, it is a copy-on-write filesystem, meaning that data is never overwritten. Instead, a block that is overwritten moves to a new location, with the older version persisting as long as any references to it remain. Copy-on-write works well on solid-state storage devices and makes a number of advanced features relatively easy to implement.
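The copy-on-write rule can be sketched in a few lines of Python. This is a toy model rather than anything taken from the bcachefs code: `CowStore`, its reference counts, and the `snapshot()` helper are hypothetical names, and a "file" is reduced to a dictionary mapping block indexes to addresses.

```python
class CowStore:
    """Toy copy-on-write block store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}    # address -> data
        self.refs = {}      # address -> number of tables pointing at it
        self.next_addr = 0

    def _alloc(self, data):
        # Every write lands at a fresh address; nothing is updated in place.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        self.refs[addr] = 1
        return addr

    def _release(self, addr):
        # Drop one reference; free the block once nobody points at it.
        self.refs[addr] -= 1
        if self.refs[addr] == 0:
            del self.refs[addr]
            del self.blocks[addr]

    def write(self, table, index, data):
        # Point table[index] at a newly allocated block, then release
        # the block it used to reference (if any).
        old = table.get(index)
        table[index] = self._alloc(data)
        if old is not None:
            self._release(old)

    def snapshot(self, table):
        # A snapshot is a copy of the pointers, not of the data.
        for addr in table.values():
            self.refs[addr] += 1
        return dict(table)

    def read(self, table, index):
        return self.blocks[table[index]]


store = CowStore()
live = {}
store.write(live, 0, b"v1")
snap = store.snapshot(live)
store.write(live, 0, b"v2")   # the snapshot keeps the old block alive
store.write(live, 0, b"v3")   # the b"v2" block has no references left: freed
```

After these calls, `store.read(live, 0)` returns `b"v3"` while `store.read(snap, 0)` still returns `b"v1"`. The intermediate `b"v2"` block is gone because nothing referenced it any longer. Snapshots fall out of this scheme almost for free, which is one reason copy-on-write designs make such features comparatively easy to add.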

Since the original bcache was a block-device management layer, bcachefs has some strong features in this area. Naturally, it offers multi-tier hybrid caching of data, and is able to integrate multiple physical devices into a single logical volume. Bcachefs does not appear to have any sort of higher-level RAID capability at this time, though; a basic replication mechanism is "like 80% done". Features like data checksumming and compression are supported.

The plans for the future include filesystem features like snapshots — an important Btrfs feature that is not yet available in bcachefs. Kent listed erasure coding as well, presumably as an alternative to higher-level RAID support. Native support for shingled magnetic recording drives is on the list, as is support for working with raw flash storage directly.

But none of those features are present in bcachefs now; work has been focused on getting the basic filesystem working in a reliable manner. Performance tuning has not been a priority thus far, but the filesystem claims reasonable performance numbers already — though, as Kent admitted, it suffers from the common (to copy-on-write filesystems) problem of "filling up" well before the underlying storage is actually filled with data. Importantly, the on-disk filesystem format has not yet been finalized — a clear sign that a filesystem is not yet ready for real-world use.

Another important (though unlisted) missing feature is a filesystem integrity checker ("fsck") utility.

Bcachefs looks like a promising filesystem, even if many of the intended features have not yet been implemented. But those who have watched filesystem development for any period of time will know what comes next: a surprisingly long wait while the code matures to the point that it can actually be trusted for production workloads. This process, it seems, cannot be hurried beyond a certain point; that is why other next-generation filesystem efforts are seemingly never quite ready. The low-level device-management code in bcachefs is tested and production-quality, but the filesystem code lacks that pedigree. Kent says that it "won't be done in a month (or a year)", but the truth is that it may not be done for several years yet; that is how filesystem development tends to go.

How many years depends, of course, on how many people test the filesystem and how much development effort it gets. Currently it has a development community of one — Kent — and he has noted that his full-time attention is "only going to last as long as my interest and my savings account hold out". If bcachefs acquires both a commercial sponsor and a wider development community, it may yet develop into that mature next-generation filesystem that we seem to never quite get (though Btrfs is there by some accounts). Until that happens, it should probably be looked at as an interesting idea with some advanced proof-of-concept code.




The bcachefs filesystem

Posted Aug 27, 2015 14:09 UTC (Thu) by koverstreet (✭ supporter ✭, #4296)

To clarify, by erasure coding I mainly mean Reed-Solomon, i.e. RAID5/6.

What happened to Btrfs?

Posted Aug 28, 2015 7:21 UTC (Fri) by nirbheek (subscriber, #54111) (1 response)

On reading this article, I can't help but wonder whatever happened to Btrfs. There seem to have been no progress updates on it ever since a lot of Btrfs developers joined Facebook. I've heard about good work being done internally, though.

bcachefs seems to be at the exact same place that Btrfs was in 2008. A somewhat-working filesystem with great potential that aims to be the next-gen FS with features that ZFS has had for almost a decade now. It would be a tragedy if we had to wait till 2020 only to hear that this filesystem ran out of steam too.

What happened to Btrfs?

Posted Aug 29, 2015 0:29 UTC (Sat) by orodeh (guest, #4219)

BTRFS is in its stabilization phase. The authors are hard at work improving it and the tools around it. See the filesystem site (https://btrfs.wiki.kernel.org/index.php/Main_Page) for recent progress.

Of blocks and files

Posted Aug 31, 2015 8:48 UTC (Mon) by rvfh (guest, #31018) (3 responses)

I am probably talking BS here, but it seems there are two layers: blocks and files.
Could we not separate the two somehow? The block management part could be like bcache, and could be chosen differently based on HDD, SSD, or whatever is coming next. The file management could be based on ext4 and be more generic.

Of blocks and files

Posted Sep 3, 2015 9:51 UTC (Thu) by eternaleye (guest, #67051) (1 response)

You may find the idea of "Nameless Writes" interesting: https://www.usenix.org/legacy/event/fast12/tech/full_pape...

The idea is that they allow decoupling the *extent allocation policy* from the filesystem, without going all the way to complicated object-storage schemes.

They are, largely, best-suited for COW - however, given a single location that accepts in-place writes, one can build an entire filesystem using nothing but nameless writes and updating that location to point to the most-recent root extent.

Of blocks and files

Posted Sep 3, 2015 9:54 UTC (Thu) by eternaleye (guest, #67051)

Probably the most succinct summary of the behavior of "nameless writes" I've found is "malloc-with-data" - you have some data, and you say "Store this somewhere, and then give me a pointer to it."

From there, it's an exercise in persistent data structures.
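As a rough illustration of the "malloc-with-data" idea, the following Python sketch builds a tiny one-directory filesystem from nothing but nameless writes plus one in-place root pointer. Everything here is hypothetical: `NamelessDevice`, `put_file()`, and `get_file()` are invented names, a list stands in for the device, and the directory is stored as JSON purely for convenience. It does not model the interface from the paper linked above.

```python
import json


class NamelessDevice:
    """Toy device: the caller supplies data, the device picks the location."""

    def __init__(self):
        self.extents = []   # storage, laid out however the device likes
        self.root = None    # the single location that accepts in-place writes

    def nameless_write(self, data):
        # "Store this somewhere, and then give me a pointer to it."
        self.extents.append(data)
        return len(self.extents) - 1

    def read(self, ptr):
        return self.extents[ptr]


def put_file(dev, name, content):
    # Load the current directory (name -> pointer), if there is one.
    if dev.root is None:
        directory = {}
    else:
        directory = json.loads(dev.read(dev.root).decode())
    # Write the file data and a new directory, both namelessly.
    directory[name] = dev.nameless_write(content)
    new_root = dev.nameless_write(json.dumps(directory).encode())
    # The only in-place update: repoint the root at the newest directory.
    dev.root = new_root


def get_file(dev, name, root=None):
    # Passing an older root pointer reads the filesystem as it was then.
    root = dev.root if root is None else root
    directory = json.loads(dev.read(root).decode())
    return dev.read(directory[name])


dev = NamelessDevice()
put_file(dev, "a", b"one")
old_root = dev.root
put_file(dev, "a", b"two")
current = get_file(dev, "a")                   # b"two"
previous = get_file(dev, "a", root=old_root)   # b"one"
```

Because old extents are never modified, any previously saved root pointer still describes a complete, consistent tree. That is the persistent-data-structure property in miniature.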

Of blocks and files

Posted Sep 3, 2015 14:28 UTC (Thu) by zdzichu (subscriber, #17118)

Slightly related, but that is how layering in ZFS works. You have a lower layer caring about block allocation, checksumming, duplication, etc. On top of that you can plug different upper layers. There is one giving you a filesystem interface (the ZFS POSIX Layer). There is another giving you block devices suitable for swap and for running mkfs to create other filesystems.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds