New tricks for XFS
The XFS filesystem has been in the kernel for fifteen years and was used in production on IRIX systems for five years before that. But it might just be time to teach that "old dog" of a filesystem some new tricks, Dave Chinner said, at the beginning of his linux.conf.au 2018 presentation. There are a number of features that XFS lacks when compared to more modern filesystems, such as snapshots and subvolumes; but he has been thinking—and writing code—on a path to get them into XFS.
Some background
XFS is the "original B-tree filesystem", since everything that the filesystem stores is organized in B-trees. They are not actually traditional B-trees; rather, they are a form of B* tree in which each node has a sibling pointer, which allows horizontal traversal of the tree. That kind of traversal is important when looking at features like copy on write (CoW).
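For readers who want a mental picture, a simplified sketch of such a node might look like the following. This is illustrative C only, not XFS's actual on-disk format; the field names are made up for the example.

```c
/*
 * A simplified sketch of a B-tree node with sibling pointers, in the
 * spirit of (but not identical to) XFS's on-disk B-tree blocks.  The
 * left/right sibling links let a reader walk every node at one level
 * of the tree without going back up through the parents.
 */
#include <stdint.h>

#define KEYS_PER_BLOCK 16
#define NULLBLOCK ((uint64_t)-1)

struct btree_block {
    uint16_t level;                  /* 0 for leaves, increasing toward the root */
    uint16_t numrecs;                /* records actually used in this block */
    uint64_t leftsib;                /* block number of the left sibling, or NULLBLOCK */
    uint64_t rightsib;               /* block number of the right sibling, or NULLBLOCK */
    uint64_t keys[KEYS_PER_BLOCK];   /* keys (e.g. file offsets or block numbers) */
    uint64_t ptrs[KEYS_PER_BLOCK];   /* child block numbers, or records in a leaf */
};
```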
An XFS filesystem is split into allocation groups, "which are like mini-filesystems"; they have their own free-space index B-trees, inode B-trees, reverse-mapping B-trees, and so on. File data is referenced by extents, with the help of B-trees. "Directories and attributes are more B-trees"; the directory B-tree is the most complex as it is a "virtually mapped, multiple index B-tree with all sorts of hashing" for scalability.
XFS uses write-ahead journaling for crash resistance. Its journaling is checkpoint-based, which is meant to reduce the write amplification that can result from changing blocks that are already in the journal.
He followed that with a quick overview of CoW filesystems. When a CoW filesystem writes to a block of data or metadata, it first makes a copy of it; in doing so, it needs to update the index tree entries to point to the new block. That leads to modifying the block that holds those entries, which necessitates another copy, thus a modification to the parent index entry, and so on, all the way up to the root of the filesystem. All of those updates can be written together anywhere in the filesystem, which allows lots of optimizations to be done. It also provides consistent on-disk images, since the entire update can be written prior to making an atomic change to the root-level index.
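To make that "copy all the way up to the root" behavior concrete, here is a minimal in-memory sketch; it is illustrative only (error handling omitted) and is not code from any real filesystem. Modifying a leaf produces a new copy of every node on the path to the root, and the update commits by publishing the new root.

```c
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct node {
    struct node *child[FANOUT];  /* NULL in leaves */
    char data[64];               /* payload for leaves, ignored in interior nodes */
};

/* Return a modified copy of a node; the original is left untouched. */
static struct node *cow_copy(const struct node *old)
{
    struct node *new = malloc(sizeof(*new));
    memcpy(new, old, sizeof(*new));
    return new;
}

/*
 * Copy-on-write update along one path: path[0] is the root, path[depth-1]
 * is the leaf being modified, and slot[i] is which child of path[i] leads
 * to path[i+1].  Returns the new root; the old tree remains intact, which
 * is what makes snapshots and atomic "switch the root" commits possible.
 */
struct node *cow_update(struct node **path, int *slot, int depth,
                        const char *newdata)
{
    struct node *child = NULL;

    /* Walk from the leaf back up to the root, copying as we go. */
    for (int i = depth - 1; i >= 0; i--) {
        struct node *copy = cow_copy(path[i]);
        if (i == depth - 1) {
            strncpy(copy->data, newdata, sizeof(copy->data) - 1);
            copy->data[sizeof(copy->data) - 1] = '\0';
        } else {
            copy->child[slot[i]] = child;  /* point at the new child copy */
        }
        child = copy;
    }
    return child;  /* new root; commit by atomically publishing this pointer */
}
```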
All of that is great for crash recovery, he said, but the downside is that it requires that space be allocated for these on-disk updates. That allocation process requires metadata updates, which means a metadata tree update, thus more space needs to be allocated for that. That leads to the problem that the filesystem does not know exactly how much space is going to be needed for a given CoW operation. "That leads to other problems in the future."
These index tree updates are what provide many of the features that are associated with CoW filesystems, Chinner said, "sharing, snapshots, subvolumes, and so on". They are all a natural extension of having an index tree structure that reference-counts objects; that allows multiple indexes to point to the same object by just increasing the reference count on it. Snapshots are simply keeping around an index tree that has been superseded; that can be done by taking a reference to that tree. Replication is done by creating a copy of the tree and all of its objects, which is a complicated process, but "does give us the send-receive-style replication" that users are familiar with.
CoW in XFS is different. Because of the B* trees, it cannot do the leaf-to-tip update that CoW filesystems do; it would require updating laterally as well, which in the worst case means updating the entire filesystem. So CoW in XFS is data-only.
Data-only CoW limits the functionality that XFS can provide; features like deduplication and file cloning are possible, but others are not. The features it does provide are useful for projects like overlayfs and NFS, Chinner said. The advantage of data-only CoW is that there is no impact on non-shared data or metadata. In addition, XFS can always calculate how much space is needed for a CoW operation because only the data is copied; the metadata is updated in place.
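Those data-only CoW features are already reachable from user space through generic VFS ioctls. As a minimal example, cloning one file's extents into another (which is what cp --reflink=always does under the hood) looks roughly like this:

```c
/* Minimal example: clone src into dst using the VFS FICLONE ioctl,
 * which shares data extents on filesystems (such as XFS with reflink
 * enabled) that support it.  No file data is copied. */
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* Share all of src's data extents with dst. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("FICLONE");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}
```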
But, since the metadata updates are not done with CoW, crash resiliency is a bit more difficult; it is not a matter of simply writing a new tree branch and then switching to it atomically. XFS has implemented "deferred operations", which are a kind of "intent logging mechanism", Chinner said. Deferred operations were used for freeing extents in the past, but have been extended to do reference-count and reverse-mapping B-tree updates. That allows replaying CoW updates as part of recovery.
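As a purely illustrative sketch of the intent-logging pattern (this is not XFS's actual log format): the filesystem logs a record saying "I intend to do X", does the work, then logs a matching "done" record; at recovery time, an operation is replayed only if its intent reached the log but its "done" record did not.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: the shape of an intent/done pair in a write-ahead log. */
enum log_rec_type { LOG_INTENT, LOG_DONE };

struct log_rec {
    enum log_rec_type type;
    uint64_t id;        /* matches an intent with its done record */
    uint64_t op;        /* e.g. "free extent", "update refcount", "update rmap" */
    uint64_t args[4];   /* operation-specific payload */
};

/* During recovery: an operation must be replayed if its intent was
 * committed to the log but no matching done record ever made it. */
bool needs_replay(const struct log_rec *log, int nrecs, uint64_t id)
{
    bool intent_seen = false, done_seen = false;

    for (int i = 0; i < nrecs; i++) {
        if (log[i].id != id)
            continue;
        if (log[i].type == LOG_INTENT)
            intent_seen = true;
        else if (log[i].type == LOG_DONE)
            done_seen = true;
    }
    return intent_seen && !done_seen;
}
```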
What is a subvolume?
Thinking about all of that led Chinner to a number of questions about what can be done with data-only CoW. Everyone seems to want subvolume snapshots, but that seems to require CoW operations for metadata. How can the problem be repackaged so that there is a way to implement the same functionality? That is the ultimate goal, of course. He wondered how much of a filesystem was actually needed to implement a subvolume. There are other implementations to look at, so we can learn from them, he said. "What should we avoid? What do they do right?" The good ideas can be stolen—copied—"because that's the easy way".
Going back to first principles, he asked: "what is a subvolume? What does it provide?" From what he can tell, there are three attributes that define a subvolume. It has flexible capacity, so it can grow or shrink without any impact. A subvolume is also a fully functioning filesystem that allows operations like punching holes in files or cloning. The main attribute, though, is that a subvolume is the unit of granularity for snapshots. Everything else is built on top of those three attributes.
He asked: could subvolumes be implemented as a namespace construct that sits atop the filesystem? Bind mounts and mount namespaces already exist in the VFS; he wondered whether those could be used to create something that "looks like and smells like a subvolume". If you add a directory-hierarchy quota on top of a bind mount, the result is a kind of flexible space management. If you "squint hard enough", that is something like a subvolume, he said.
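As a sketch of how those existing pieces fit together (the paths and project ID below are made up, and the actual block limit would still be set separately with xfs_quota or quotactl()):

```c
/*
 * Sketch: make a directory subtree look subvolume-ish with existing pieces.
 *  1. bind-mount the subtree so it appears as its own mount point;
 *  2. tag it with an XFS project ID (with the inherit flag) so a
 *     directory-tree ("project") quota limit can be applied to it.
 * The paths and project ID are hypothetical; the quota limit itself is
 * set elsewhere, and enforcement needs the prjquota mount option.
 */
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *subtree = "/data/projects/alpha";   /* hypothetical */
    const char *mntpoint = "/srv/alpha";            /* hypothetical */

    /* Step 1: bind mount, so the subtree looks like its own filesystem. */
    if (mount(subtree, mntpoint, NULL, MS_BIND, NULL) < 0) {
        perror("mount --bind");
        return 1;
    }

    /* Step 2: assign a project ID that new children will inherit. */
    int fd = open(subtree, O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fsxattr fsx;
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSGETXATTR");
        return 1;
    }
    fsx.fsx_projid = 42;                     /* hypothetical project ID */
    fsx.fsx_xflags |= FS_XFLAG_PROJINHERIT;  /* new children inherit it */
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }

    close(fd);
    return 0;
}
```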
Similarly, a recursive copy operation using --reflink=always can create a kind of snapshot. It still replicates the metadata, but the vast majority of the structure has been cloned without copying the data. Replication can be done with rsync and tar; "sure, it's slow", but there are tools to do that sort of thing. It doesn't really resemble a Btrfs subvolume, for example, but it can still provide the same functionality, Chinner said. In addition, overlayfs copies data and replicates metadata, so it shows that you can provide "something that looks like a subvolume using data-only copy on write".
Another idea might be to implement the subvolume below the filesystem with a device construct of some sort. In fact, we already have that, he said. A filesystem image can be stored in a sparse file, then loopback mounted. That image file can be cloned with data-only CoW, which allows for fast snapshots. The space management is "somewhat flexible", but is limited by what the block layer provides and what filesystems implement. Replication is a simple file copy.
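For reference, the loopback building block amounts to the sequence below, shown as a hedged C sketch with hypothetical paths; in practice mount -o loop does the same thing.

```c
/*
 * Sketch of the "filesystem image in a sparse file" building block:
 * attach an image file to a free loop device and mount it.
 */
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *image = "/srv/volumes/subvol1.img";  /* hypothetical sparse image */
    const char *target = "/mnt/subvol1";             /* hypothetical mount point */

    int ctrl = open("/dev/loop-control", O_RDWR);
    if (ctrl < 0) { perror("loop-control"); return 1; }

    int devnr = ioctl(ctrl, LOOP_CTL_GET_FREE);      /* first unused loop device */
    if (devnr < 0) { perror("LOOP_CTL_GET_FREE"); return 1; }

    char loopdev[32];
    snprintf(loopdev, sizeof(loopdev), "/dev/loop%d", devnr);

    int loopfd = open(loopdev, O_RDWR);
    int imgfd = open(image, O_RDWR);
    if (loopfd < 0 || imgfd < 0) { perror("open"); return 1; }

    if (ioctl(loopfd, LOOP_SET_FD, imgfd) < 0) {     /* back the loop device with the image */
        perror("LOOP_SET_FD");
        return 1;
    }

    if (mount(loopdev, target, "xfs", 0, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```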
What this shows "is that what we think of as a subvolume, we're already using", Chinner said. The building blocks are there, they are just being used in ways that do not make people think of subvolumes.
The loopback filesystem solution suffers from the classic ENOSPC problem, however. If the filesystem that holds the image file runs out of space, it will communicate that by returning ENOSPC, but the filesystem inside the image will not be prepared to handle that failure and things break horribly: "blammo!". This is the same problem that thin provisioning has. It is worse than the CoW filesystem ENOSPC problem, because you can't predict when it will happen and you can't recover when it does, he said.
He returned to the idea of learning from others at that point. Overlayfs and, to a lesser extent, Btrfs have taught us that specifying subvolumes via mount options is "really, really clunky", Chinner said. Btrfs subvolumes share the same superblock, which can cause some subtle issues about how they are treated by various tools like find or backup programs. A subvolume needs to be implemented as an independent VFS entity and not just act like one. "There's only so much you can hide by lying."
The ENOSPC problem is important to solve. The root of the problem is that upper and lower volumes (however defined) have a different view of free-space availability and those two layers do not communicate about it. This problem has been talked about many times at LSFMM (for example, in 2017 and in 2016) without making any real progress. But a while back, Christoph Hellwig came up with a file layout interface for the Parallel NFS (pNFS) server running on top of XFS; it allowed the pNFS client to remotely map files from the server and to allocate blocks from the server. The actual data lives elsewhere and the client does its reads and writes to those locations; so the client is doing its filesystem allocation on the server and then doing the I/O to somewhere else. This provides a model for a cross-layer communication of space accounting and management that is "very instructive".
A new kind of subvolume
He has been factoring all of this into his thinking on a new type of subvolume; one that acts the same as the subvolumes CoW filesystems have, but is implemented quite differently. The kernel could be changed so that it can directly mount image files (rather than via the loopback device) and a device space-management API could be added. If a filesystem implements both sides of that API, image files of the same filesystem type can be used as subvolumes. The API can be used to get the mapping information, which will allow the subvolume to do its I/O directly to the host filesystem's block device. This breaks the longstanding requirement that filesystems must use block devices; with his changes, they can now use files directly.
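Chinner's API has not been posted yet, so the following is only a guess at its shape, intended to make the division of labor concrete; none of these names or signatures come from his patches. The host filesystem would export allocation, mapping, and free-space operations to the client filesystem that was mounted from an image file, and the client would use the returned mappings to submit its I/O directly to the host's block device.

```c
/*
 * Purely hypothetical sketch of a "device space management" interface
 * between a host filesystem (holding the image file) and a client
 * filesystem mounted directly from that file.
 */
#include <stdint.h>

struct space_extent {
    uint64_t file_offset;   /* offset within the image file */
    uint64_t dev_block;     /* where that range really lives on the host's device */
    uint64_t length;        /* bytes */
};

struct space_mgmt_ops {
    /* Client asks the host to allocate backing space for a file range. */
    int (*alloc)(void *host, uint64_t file_offset, uint64_t length,
                 struct space_extent *out);

    /* Client asks for the device mapping of an already-allocated range,
     * so it can submit I/O straight to the host's block device. */
    int (*map)(void *host, uint64_t file_offset, uint64_t length,
               struct space_extent *out);

    /* Client returns space it no longer needs (e.g. after a delete),
     * keeping the host's free-space accounting accurate. */
    int (*free)(void *host, uint64_t file_offset, uint64_t length);

    /* Host tells the client how much space is really available, so the
     * client can return ENOSPC *before* it has dirtied its metadata. */
    int (*query_free)(void *host, uint64_t *bytes_free);
};
```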
But this mechanism will still work for block devices, which will make it useful for thin provisioning as well. The thin-provisioned block device (such as dm-thin) can implement the host side of the space-management API; the filesystem can then use the client-side API for space accounting and I/O mapping. That way the underlying block device will report ENOSPC before the filesystem has modified its structures and issued I/O. That is something of a bonus, he said, but if his idea solves two problems at once, that gives him reason to think he is on the right track.
Snapshots are "really easy in this model". The subvolume is frozen and the image file is cloned. It is fast and efficient. In effect, the subvolume gets CoW metadata even though its filesystem does not implement it; the data-only CoW of the filesystem below (where the image file resides) provides the metadata CoW.
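In user-space terms, that sequence is roughly "freeze the subvolume, reflink its image, thaw". A sketch using the existing FIFREEZE/FITHAW and FICLONE ioctls (hypothetical paths, abbreviated error handling) might look like this:

```c
/*
 * Sketch of "snapshot = freeze the subvolume, clone its image file":
 * FIFREEZE/FITHAW quiesce the mounted subvolume while FICLONE makes a
 * cheap, shared-extent copy of the image on the host filesystem.
 */
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int mnt = open("/mnt/subvol1", O_RDONLY | O_DIRECTORY);       /* mounted subvolume */
    int img = open("/srv/volumes/subvol1.img", O_RDONLY);         /* its image file */
    int snap = open("/srv/volumes/subvol1.snap.img",
                    O_WRONLY | O_CREAT | O_EXCL, 0600);           /* the snapshot */
    if (mnt < 0 || img < 0 || snap < 0) { perror("open"); return 1; }

    if (ioctl(mnt, FIFREEZE, 0) < 0) { perror("FIFREEZE"); return 1; }

    int ret = ioctl(snap, FICLONE, img);    /* share extents; no data copied */
    if (ret < 0)
        perror("FICLONE");

    ioctl(mnt, FITHAW, 0);                  /* always thaw, even on failure */
    return ret < 0;
}
```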
Replication could be done by copying the image files, but there are better ways to do it. Two image files can be compared to determine which blocks have changed between two snapshots. It is quite simple to do and does not require any knowledge of what is in the files being replicated. He implemented a prototype, with XFS filesystems on loopback devices, in 200 lines of shell script driving xfs_io. "It's basically a delta copy" that is independent of what is in the filesystem image; if you had two snapshots of ext4 filesystems, the same code would work, he said.
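A toy version of that delta detection can be written against the generic FIEMAP ioctl: any range whose physical address differs between two reflinked images must have been rewritten since the snapshot was taken, so only those ranges need to be sent. This only illustrates the idea, at a fixed 64KB granularity with abbreviated error handling; the actual prototype used xfs_io and worked on whole extents.

```c
/*
 * Content-agnostic delta detection between two reflinked snapshot images:
 * ranges still backed by the same physical location are shared (unchanged);
 * ranges that differ were rewritten and need to be copied.
 */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Physical byte address backing logical offset 'off' in fd, or 0 for a hole. */
static unsigned long long phys_of(int fd, unsigned long long off)
{
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    unsigned long long phys = 0;

    fm->fm_start = off;
    fm->fm_length = 1;            /* just the extent covering 'off' */
    fm->fm_extent_count = 1;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1)
        phys = fm->fm_extents[0].fe_physical + (off - fm->fm_extents[0].fe_logical);
    free(fm);
    return phys;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <old.img> <new.img>\n", argv[0]);
        return 1;
    }
    int a = open(argv[1], O_RDONLY);
    int b = open(argv[2], O_RDONLY);
    if (a < 0 || b < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(b, &st);
    const unsigned long long step = 65536;   /* comparison granularity */

    for (unsigned long long off = 0; off < (unsigned long long)st.st_size; off += step) {
        /* Ranges still mapped to the same physical blocks are shared
         * between the two reflinked images, i.e. unchanged. */
        if (phys_of(a, off) != phys_of(b, off))
            printf("changed: %llu bytes at offset %llu\n", step, off);
    }
    return 0;
}
```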
There are features that people are asking for that the current CoW filesystems (e.g. Btrfs, ZFS) cannot provide, but this new scheme could. Right now, there is a lot of data shared between files on disk that is not shared once it gets to the page cache. If you have 500 containers based on the same golden image, you can have multiple snapshots being used but each container has its own version of the same file in the cache. "So you have 500 copies of /bin/bash in memory", he said. Overlayfs does this the right way since it shares the one cached version of the unmodified Bash between all of the containers.
His goal is to get that behavior for this new scheme as well. That requires sharing the data in shared extents in the page cache. It is a complex and difficult problem, Chinner said, because the page cache is indexed by file and offset, whereas the only information available for the shared extents is their physical location in the filesystem (i.e. the block number). Instead of doing an exhaustive search in the page cache to see if a shared extent is cached, he is proposing adding a buffer cache that is indexed by block number. XFS already has a buffer cache, but it doesn't have a way to share pages between multiple files. Chinner indicated that Matthew Wilcox was working on solving that particular problem; that solution would be coming "maybe next week", he said with a grin.
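A conceptual sketch of that indexing mismatch (illustrative only, not kernel code): the page cache looks pages up per file, while extent sharing is only visible at the physical-block level, so a second index keyed by block number is needed to find an already-cached copy of shared data.

```c
#include <stdint.h>

struct pagecache_key {        /* how the page cache finds a page today */
    uint64_t inode_number;    /* which file */
    uint64_t page_offset;     /* where in that file */
};

struct bufcache_key {         /* how a shared-extent cache would find it */
    uint32_t device_id;       /* which block device */
    uint64_t block_number;    /* physical location of the shared extent */
};
```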
For a long time people have been saying that you don't need encryption for subvolumes because containers are isolated, but then came Meltdown and Spectre, which broke all that isolation. He thinks that may lead some to want more layers of defense to make it harder to steal their data when that isolation breaks down. Adding the generic VFS file-encryption API to XFS will allow encrypting the image files and/or individual files within a subvolume. There might be something to be gained by adding key management into the space-management API as well.
It is looking like XFS could offer "encrypted, snapshottable, cloned subvolumes with these mechanisms", Chinner said. There is still a lot of work to do to get there, of course; it is still in its early stages.
The management interface that will be presented to users is not nailed down yet; he has been concentrating on getting the technology working before worrying about policy management. How subvolumes are represented, what the host volume looks like to users, and whether everything is a subvolume are all things that need to be worked out. There is also a need to integrate this work with tools like Anaconda and Docker.
None of the code has had any review yet; it all resides on his laptop and servers. Once it gets posted, there will be lots of discussion about the pieces he will need to push into the kernel as well as the XFS-specific parts. There will probably be "a few flame wars around that, a bit of shouting, all the usual melodrama that goes along with doing controversial things". He recommended popcorn.
He then gave a demo (starting around 36:56 in the YouTube video of the talk) of what he had gotten working so far. It is a fairly typical early stage demo, but managed to avoid living up to the names of the subvolume and snapshot, which were "blammo" and "kaboom".
After the demo, Chinner summarized the talk (and the work). He started out by looking at how to get the same functionality as subvolumes, but without implementing copy on write for metadata. The "underlying revelation" was to use files as subvolumes and to treat subvolumes as filesystems. That gives the same functionality as a CoW filesystem for that old dog XFS.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/XFS |
| Conference | linux.conf.au/2018 |
Posted Feb 22, 2018 8:55 UTC (Thu)
by jag (subscriber, #3766)
[Link]
Posted Feb 22, 2018 9:16 UTC (Thu)
by TomH (subscriber, #56149)
[Link] (2 responses)
Posted Feb 22, 2018 23:51 UTC (Thu)
by simcop2387 (subscriber, #101710)
[Link] (1 responses)
Posted Feb 23, 2018 8:44 UTC (Fri)
by TomH (subscriber, #56149)
[Link]
Posted Feb 22, 2018 10:11 UTC (Thu)
by nix (subscriber, #2304)
[Link] (22 responses)
The hard part of all this does indeed appear to be the usual space management and policy palaver. If you're storing snapshots in image files, where do those files reside? On the XFS filesystem that the snapshot is being mounted on, probably, but where? .snapshot above the initial mount point? /.snapshot? But the space-management problem is worse.
You can run out of space in a subvolume when there is plenty of space on the containing fs, because the image file is too small. One could kludge around this a bit by routinely creating subvolumes as sparse files as big as the containing fs -- but since these can obviously only grow, and the inner fs has no idea that every time it writes to a block full of zeroes it's actually doing irreversible space allocation, that then means you can steal space from the containing fs until it runs out of room, even if the subvolume is still rather small. Also, if you expand the containing fs, you probably want to expand all the subvolume files as well... the space management looks *painful* and frankly these things are starting to look not very much like conventional files at all, but like a sort of super-sparse file where the inner and outer filesystems should communicate (at least if they're the same fs) to share free space between all subvolumes and the containing fs. From the POSIX API perspective these would probably look like sparse files that could shrink again, and routinely did, as the contained fs notified the containing fs that it had deallocated a bunch of blocks.
(Oh, also privilege problems emerge, though to my eye they are minor compared to the space management problems. Unprivileged snapshots appear... hard in this model, unless you want unprivileged mount of arbitrary XFS filesystems, which is probably still a bit risky.)
Posted Feb 22, 2018 13:07 UTC (Thu)
by dgc (subscriber, #6611)
[Link] (20 responses)
> but like a sort of super-sparse file
Well, yes. The underlying principle is that subvolumes need to be abstracted from physical storage space; e.g. this is all based on patches to XFS that dissociate the physical size of the filesystem from the amount of space a subvolume presents to the user. This can't be done without sparse storage, be it a file or a thin block device, and the initial patchset is here:
https://marc.info/?l=linux-xfs&m=150900682026621&w=2
> where the inner and outer filesystems should communicate
> (at least if they're the same fs) to share free space between all subvolumes and the containing fs
Yup, this is precisely the mechanism I've implemented to handle the ENOSPC problem. More info in this thread:
https://marc.info/?l=linux-xfs&m=151722047128077&w=2
-Dave.
Posted Feb 22, 2018 13:59 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Feb 22, 2018 14:28 UTC (Thu)
by doublez13 (guest, #122213)
[Link] (2 responses)
Posted Feb 23, 2018 21:46 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
Posted Feb 24, 2018 1:30 UTC (Sat)
by rbrito (guest, #66188)
[Link]
Posted Feb 22, 2018 20:03 UTC (Thu)
by ebiederm (subscriber, #35028)
[Link] (1 responses)
This would require something to not allow access to the subvolume file except to mount it as a subvolume.
Posted Feb 23, 2018 21:52 UTC (Fri)
by dgc (subscriber, #6611)
[Link]
I don't see any fundamental problem with doing this - we're going to want to enable root inside a container to snapshot their own subvolumes, so we're going to need some kind of mechanism to allow non-init namespace access to the management interfaces. We haven't decided on how we're going to present snapshot volume management to userspace yet - this is on my list of requirements for when we start on that aspect of the problem....
Posted Feb 22, 2018 20:50 UTC (Thu)
by clump (subscriber, #27801)
[Link] (6 responses)
Posted Feb 23, 2018 17:05 UTC (Fri)
by wazoox (subscriber, #69624)
[Link] (5 responses)
Posted Feb 23, 2018 19:45 UTC (Fri)
by clump (subscriber, #27801)
[Link] (4 responses)
I also wish there was something akin to an ext5 with an upstream COW implementation.
Posted Feb 23, 2018 21:15 UTC (Fri)
by wazoox (subscriber, #69624)
[Link]
Posted Feb 23, 2018 21:23 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
So old filesystems mostly continue to use older codepaths, or aspects of newer codepaths which are heavily tested by virtue of the weight of existing filesystems out there. Only people willing to take the significant plunge of re-mkfsing are affected by the potential bug load (and, obviously, don't even do this with test filesystems on important machines until you're confident it won't oops :) ).
(The unusual thing about this set of features is that it doesn't seem to need new xfs features at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's *very* impressive to me.)
Posted Feb 23, 2018 22:01 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
> (The unusual thing about this set of features is that it doesn't seem to need new xfs features
> at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's
> *very* impressive to me.)

No, that's not the case. I think I showed the list of experimental features that the demo used in the talk. Sure, reflink and rmapbt are no longer experimental (as of 4.16-rc1), but I added a new "thinspace" feature for thin provisioning awareness and another for "subvol" support, because that also requires special flags in inodes and other such things to mark the filesystem as a subvol.
Overall, there's remarkably little change for the subvol feature - about 1400 lines of new kernel code and ~300 lines of userspace code were needed for the demo I gave during the talk. Compare that to the data-COW feature in XFS that it relies on - we added about 20,000 lines of code for that between kernel and userspace....
Posted Mar 3, 2018 13:58 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Feb 22, 2018 21:42 UTC (Thu)
by nix (subscriber, #2304)
[Link] (6 responses)
I saw the thin provisioning-aware stuff go by, but didn't realise it would have *this* consequence. I mean, thin provisioning is all very well but, unless you're running a data centre or testing filesystems, it isn't really important in your day-to-day life. But everyone likes snapshots :)
Of course this also obsoletes LVM: instead of LVM, you have a single giant XFS filesystem with each "partition" consisting of a sparse disk image containing whatever filesystem you like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.
There isn't even more metadata than there would otherwise be, despite the existence of two filesystems where before you had one, since the outermost one has minimal metadata tracking the space allocation of the contained fs and that's just about all. As lots of subvolumes are created, the extents in that fs tracking the increasingly fragmented thin file containing the subvolumes will grow ever smaller (the same problem CoW filesystems have traditionally had with crazy fragmentation), but defragmenting single files with XFS is something it's been very good at for a long time, and it's a heck of a lot easier than defragmenting a filesystem with CoW metadata would be: the metadata in the contained fs, of course, is completely unchanged by any of this, rather than having to go through rewrite-and-change hell.
I'll admit it. I want all this cool stuff *now* and I'm kicking myself that I re-mkfsed my big fileserver only eight months ago and can't convince myself to redo it like this instantly. :)
Posted Feb 22, 2018 22:12 UTC (Thu)
by rodgerd (guest, #58896)
[Link] (3 responses)
1. Convert big fileserver into two Gluster servers.
2. Reformat them with Cool Stuff one by one.
3. Beg the Gluster maintainers to integrate their tooling with the new XFS features.
Posted Feb 22, 2018 22:44 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
I am vaguely tempted to do something kinda similar with my new desktop, which will likely come with an MMC far too small to hold the fs I want to put on it, but *with* nice fast 10GbE (imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays). I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :) however, it's quite possible that NFSv4 onto a big RAID array over the 10GbE would be *faster*, particularly for writes. In practice I suspect I'll be happy with just nfsroot and ignoring the MMC entirely. (Which means I'll have to thunk to the fileserver to do snapshots etc in any case... though if I go to NFSv4.2 I will at least be able to do reflinks on the client.)
Posted Feb 23, 2018 4:52 UTC (Fri)
by rodgerd (guest, #58896)
[Link] (1 responses)
I find "being a pack rat with more cases and motherboards lying about than is reasonable" seems to work acceptably well ;). The odd number of disks is a bit trickier.
> I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :)
Good heavens. That's a dedication to complexity! I'm moving my performance-sensitive Gluster FSes to use tiering, so I'm in no position to criticise, though.
> imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays
The plunging price of network equipment continues to astonish me. They days where just management was a prohibitively expensive option are long gone, and I'm loving it.
Posted Feb 24, 2018 20:06 UTC (Sat)
by naptastic (guest, #60139)
[Link]
I've learned that the only way to know for sure that you have enough is to have too much.
Posted Feb 23, 2018 22:08 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
> with each "partition" consisting of a sparse disk image containing whatever filesystem you
> like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing
> fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.

No, it doesn't obsolete LVM. You still need something to manage your physical storage that the base XFS filesystem is placed on. What it allows is much more flexible use and management of the space within that XFS filesystem, but it doesn't change how you manage the underlying devices. Especially if you want to grow the base filesystem in future....
Posted Mar 3, 2018 14:00 UTC (Sat)
by nix (subscriber, #2304)
[Link]

> You still need something to manage your physical storage that the base XFS filesystem is placed on.

I'm not sure why one giant filesystem completely filling the physical storage needs management -- but I suppose if it's sitting on a RAID array or something, that array is more growable than if it were a simple disk of that size, so you might still need some layer between the physical storage and the top-level XFS.
Posted Mar 8, 2018 9:32 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link]
Posted Feb 22, 2018 20:19 UTC (Thu)
by dcg (subscriber, #9198)
[Link] (4 responses)
Also, what are your next plans for XFS - add some kind of internal raid capabilities, and then fully become a ZFS-like filesystem? :P
Posted Feb 22, 2018 22:12 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
I am bubbling over thinking of things you can do with this. It's so powerful!
Posted Feb 23, 2018 0:19 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Feb 23, 2018 15:47 UTC (Fri)
by dcg (subscriber, #9198)
[Link] (1 responses)
Now let's imagine that you make a large file in the last subvolume. It will require allocating blocks for its image, which resides in the previous filesystem. If the allocation is big enough, it may also require allocating blocks for that previous subvolume, and blocks for the one below that too - all the way down to the host filesystem. Allocation performance will suffer heavily. In short, XFS needs to replicate space-management structures for each embedded subvolume (and potentially needs to update *all* of them in some extreme cases). In ZFS/btrfs, space management is shared by all subvolumes so it does not have these problems.
This is what leads me to believe that excessive subvolume embedding would be a scalability issue for this model, but may be I'm not understanding it right
Posted Feb 23, 2018 21:26 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(I'll admit that what I want isn't subvolumes at all -- it's snapshotting on a directory granularity level. No subvolumes, no notion of "this is like a smaller subfilesystem", just each subdir can be snapshotted and the snapshots can be forked freely. That's still, I think, something not that hard with a CoW tree fs but incredibly clunky with Dave's scheme. One filesystem image per directory? I don't think that'll scale too well... a complete log and fs structures per directory, uh, no. :) )
Posted Feb 23, 2018 20:51 UTC (Fri)
by tarvin (guest, #4412)
[Link]
Posted Mar 25, 2019 9:45 UTC (Mon)
by ShalokShalom (guest, #111022)
[Link] (1 responses)
Is there any news?
Posted Sep 26, 2021 12:11 UTC (Sun)
by bluss (guest, #47454)
[Link]