A mapping layer for filesystems
In a plenary session on the second day of the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Dave Chinner described his ideas for a virtual block address-space layer. It would allow "space accounting to be shared and managed at various layers in the storage stack". One of the targets for this work is for filesystems on thin-provisioned devices, where the filesystem is larger than the storage devices holding it (and administrators are expected to add storage as needed); in current systems, running out of space causes huge problems for filesystems and users because the filesystem cannot communicate that error in a usable fashion.
His talk is not about block devices, he said; it is about a layer that provides a managed logical-block address (LBA) space. It will allow user space to make fallocate() calls that truly reserve the space requested. Currently, a filesystem will tell a caller that the space was reserved even though the underlying block device may not actually have that space (or won't when user space goes to use it), as in a thin-provisioned scenario. He also said that he would not be talking about his ideas for a snapshottable subvolume for XFS that was the subject of his talk at linux.conf.au 2018.
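For reference, the call in question is the ordinary fallocate() system call; a minimal user-space sketch of the request that today can "succeed" without the space really existing on a thin-provisioned device looks like this:

```c
/* Minimal sketch: ask the filesystem to reserve 1GiB for a file.
 * On a thin-provisioned device this can succeed today even though
 * the underlying storage may not actually have the space. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int fd = open("datafile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }
        /* mode 0: allocate (and zero) the range, extending the file if needed */
        if (fallocate(fd, 0, 0, 1024 * 1024 * 1024) != 0)
                perror("fallocate");
        return 0;
}
```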
The new layer will provide the address space, which is a representation of an LBA range. There will be a set of interfaces to manage the backend storage for that range. A filesystem will usually be the client of the interface, while a block device or a separate filesystem can be the supplier of the storage for the layer.
![Dave Chinner](https://static.lwn.net/images/2018/lsf-chinner-sm.jpg)
The filesystem does not treat the virtual block address layer any differently than it does a block device from a space-management perspective. The supplier provides allocation and space reservation; it could also provide copy-on-write (CoW) to the upper layer, which would allow for snapshots at that level. In order to read and write data, however, a mapping must be done to turn the virtual LBA into a real LBA and block device for the I/O. It is similar to the export blocks feature of Parallel NFS (pNFS).
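Since no code has been posted yet, the following is only a guess at the shape such an interface might take; the type and operation names (vba_supplier_ops, reserve(), map(), and so on) are invented for illustration and are not Chinner's actual API. The idea is simply that the supplier exports a small table of operations that the client filesystem calls to reserve space and to translate virtual LBAs:

```c
/* Purely illustrative sketch of a supplier interface; all names and
 * signatures here are invented, not the proposed kernel API. */
#include <stdint.h>

struct vba_mapping {                    /* result of mapping a virtual range */
        int      target_dev;            /* which underlying device to use    */
        uint64_t physical_lba;          /* starting block on that device     */
        uint64_t nr_blocks;             /* length of the contiguous mapping  */
};

struct vba_reservation;                 /* opaque cookie handed to the client */

struct vba_supplier_ops {
        /* reserve space up front; a NULL return would mean ENOSPC */
        struct vba_reservation *(*reserve)(void *supplier, uint64_t nr_blocks);
        /* drop a reservation, returning any unused space to the backend */
        void (*release)(void *supplier, struct vba_reservation *resv);
        /* translate a virtual LBA range into a real device and LBA,
         * allocating (or copying-on-write) blocks if necessary */
        int (*map)(void *supplier, uint64_t virt_lba, uint64_t nr_blocks,
                   struct vba_mapping *out);
};

struct vba_client {                     /* a filesystem's handle on its supplier */
        void *supplier;
        const struct vba_supplier_ops *ops;
};
```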
When the client wants to do I/O, it first maps the virtual LBA, then does the operation directly to the block device where the data is stored. Jan Kara asked if it is simply a remapping layer for filesystems; Chinner agreed that it was. He was looking at adding this ability to XFS but realized it was more widely applicable. It is similar to what is done for loopback devices, but he has chopped some layers out of that; instead of going through the block device interface, it is going through the remapping layer.
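As a rough user-space model of what the remapping step amounts to (a toy example, not the kernel code), the translation can be thought of as a lookup in an extent table that turns a virtual block number into a device plus physical block number:

```c
/* Toy model of the remapping step: translate a virtual LBA into a
 * (device, physical LBA) pair by searching an extent table.
 * Purely illustrative; the real layer would live in the kernel. */
#include <stdint.h>
#include <stdio.h>

struct extent {
        uint64_t virt_start;    /* first virtual block of the extent    */
        uint64_t len;           /* number of blocks in the extent       */
        int      dev;           /* index of the underlying block device */
        uint64_t phys_start;    /* first physical block on that device  */
};

/* Returns 0 and fills *dev/*phys on success, -1 if the block is unmapped. */
static int remap_block(const struct extent *table, size_t n,
                       uint64_t virt, int *dev, uint64_t *phys)
{
        for (size_t i = 0; i < n; i++) {
                if (virt >= table[i].virt_start &&
                    virt <  table[i].virt_start + table[i].len) {
                        *dev  = table[i].dev;
                        *phys = table[i].phys_start + (virt - table[i].virt_start);
                        return 0;
                }
        }
        return -1;      /* hole: the supplier has not allocated this block yet */
}

int main(void)
{
        struct extent table[] = {
                { .virt_start = 0,   .len = 100, .dev = 0, .phys_start = 5000 },
                { .virt_start = 100, .len = 50,  .dev = 1, .phys_start = 0    },
        };
        int dev;
        uint64_t phys;

        if (remap_block(table, 2, 120, &dev, &phys) == 0)
                printf("virtual block 120 -> device %d, block %llu\n",
                       dev, (unsigned long long)phys);
        return 0;
}
```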
One of the problems with space reservation is that there may be a delay between the write of data and its associated metadata. But it is important that space reserved for that metadata does not disappear when it comes time to write the metadata. The upper layer filesystem needs to be able to ensure that a later writeback does not get an ENOSPC error for something that it believes it can write.
Under this new scheme, the filesystem can ask the supplier for a reservation, which will result in an opaque cookie that the filesystem can use to indicate portions of the reservation. Every object modification has the cookie associated with it; when all of those modifications are done, the reference count on the cookie drops to zero and any extra reservation goes back to the backend.
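The cookie is described as opaque, so anything about its internals is guesswork; as a sketch, though, its lifecycle could be modeled as a reference-counted object that hands any unused reservation back to the supplier when the last modification referencing it completes:

```c
/* Illustrative model of a reservation cookie: each in-flight object
 * modification takes a reference; when the last one drops, whatever
 * part of the reservation was not consumed goes back to the supplier.
 * backend_return_blocks() is a hypothetical supplier call. */
#include <stdint.h>
#include <stdlib.h>

struct backend;                                 /* the supplier, details omitted */
void backend_return_blocks(struct backend *b, uint64_t nr);     /* hypothetical */

struct resv_cookie {
        struct backend *backend;
        uint64_t        reserved;       /* blocks reserved up front      */
        uint64_t        used;           /* blocks actually consumed      */
        unsigned int    refcount;       /* one per pending modification  */
};

static void resv_get(struct resv_cookie *c)
{
        c->refcount++;
}

static void resv_put(struct resv_cookie *c)
{
        if (--c->refcount == 0) {
                /* all modifications done: hand back the surplus */
                if (c->reserved > c->used)
                        backend_return_blocks(c->backend,
                                              c->reserved - c->used);
                free(c);
        }
}
```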
This allows allocation based on the I/O that the filesystem is building. It can also allow for write combining that is optimal for thin-provisioned devices. Overall, it allows for optimal I/O for the underlying structures, he said.
The client does not know anything about what the underlying backing store actually does. Similarly, the supplier does not know what the client is doing; it is just allocating and mapping. The idea is just to create an abstraction that allows two different layers in the stack to manage blocks in a way that can report errors properly.
When the BIO is formed for a read operation, the filesystem does everything it does now, but it also calls out to the mapping layer to find out which block device to do the read on. It will issue I/O directly to the underlying device, taking a shortcut around all of the layers that a loopback device would use, he said.
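In heavily simplified pseudocode (reusing the invented types from the sketches above, with submit_read() standing in for building and submitting the BIO), the read side might look like this:

```c
/* Hypothetical read path: map first, then issue the I/O straight to the
 * underlying device, bypassing the stacked layers a loop device would
 * traverse. submit_read() is a stand-in for BIO construction/submission. */
static int vba_read(struct vba_client *client, uint64_t virt_lba,
                    uint64_t nr_blocks, void *buf)
{
        struct vba_mapping map;
        int ret;

        /* ask the supplier which device and physical LBA back this range */
        ret = client->ops->map(client->supplier, virt_lba, nr_blocks, &map);
        if (ret < 0)
                return ret;             /* e.g. the range is an unmapped hole */

        /* build the BIO as usual, but aim it at the real device directly */
        return submit_read(map.target_dev, map.physical_lba, nr_blocks, buf);
}
```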
A write operation would use a two-phase write that is similar to what XFS uses for direct I/O. It would get the block device and LBA from the mapping layer and it would also attach any needed reservation cookies to the BIO. If the target area is a hole, the system first allocates for those blocks; if it is a CoW supplier, it allocates new blocks and returns the mapping and reservation for those. All of that behavior would be hidden in the lower layers. The BIOs are built and sent down to the block device; when the write completes, the supplier must run its completion routines first, then the client runs its completions to finish its two-phase write.
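The write side, again as an invented sketch built on the same toy types, would pin the reservation cookie, let the supplier allocate (or copy-on-write) blocks while mapping, and attach the cookie to the I/O so the completion path can settle the accounting:

```c
/* Hypothetical two-phase write path. If the target range is a hole (or the
 * supplier is copy-on-write), the supplier allocates new blocks while mapping
 * and returns them; the reservation cookie rides along with the I/O so the
 * completions can account for the space actually consumed. All names are
 * illustrative; submit_write() stands in for BIO construction/submission. */
static int vba_write(struct vba_client *client, struct resv_cookie *resv,
                     uint64_t virt_lba, uint64_t nr_blocks, const void *buf)
{
        struct vba_mapping map;
        int ret;

        resv_get(resv);                 /* phase one: pin the reservation */

        /* supplier allocates (or CoWs) blocks as needed and maps them */
        ret = client->ops->map(client->supplier, virt_lba, nr_blocks, &map);
        if (ret < 0) {
                resv_put(resv);
                return ret;
        }

        /* phase two: issue the I/O to the real device; on completion the
         * supplier's completion runs first, then the client's, which drops
         * the reservation reference via resv_put(). */
        return submit_write(map.target_dev, map.physical_lba,
                            nr_blocks, buf, resv);
}
```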
At no time does the client know anything about what the underlying backing store actually does, Chinner reiterated. Similarly, the supplier does not know what the client is actually doing; it simply handles allocation and mapping. Anything that can provide a 64-bit address space can be used as a supplier; a file could be used, for example.
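To make the "a file could be used" point concrete, here is a toy user-space illustration (not part of the proposal) in which an ordinary file's 64-bit offset space serves as the backing store and virtual blocks simply become pwrite() offsets:

```c
/* Toy user-space supplier backed by an ordinary file: the file's 64-bit
 * offset space plays the role of the backing store. Illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

static int file_supplier_write(int backing_fd, uint64_t virt_lba,
                               const void *buf, uint64_t nr_blocks)
{
        off_t off = (off_t)virt_lba * BLOCK_SIZE;
        ssize_t n = pwrite(backing_fd, buf, nr_blocks * BLOCK_SIZE, off);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        char block[BLOCK_SIZE] = "hello from virtual block 42";
        int fd = open("backing.img", O_RDWR | O_CREAT, 0644);

        if (fd < 0 || file_supplier_write(fd, 42, block, 1) != 0) {
                perror("file supplier");
                return 1;
        }
        printf("wrote one block at virtual LBA 42\n");
        return 0;
}
```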
It is an abstract interface, he said, that is not specific to any filesystem or block device. It could be ext4 as a client with XFS as a supplier, or vice versa if ext4 implements the supplier interface. Ted Ts'o said that he originally thought this was all simply targeting thin provisioning, but having filesystems as the supplier "becomes interesting"; "that's neat". Chinner said his actual motivation was for XFS subvolumes, not thin provisioning.
The problem has turned out to be fairly simple to solve. It is about 1700 lines of code right now and he thinks it will grow to 3000 or so once he gets it cleaned up and ready for posting. He does think it will be interesting for other filesystems. Kara said that it resembled some things that Btrfs does; Chinner agreed that he is not really doing anything new, but is simply "repackaging and reimagining" ideas that are already out there.
One of the reasons he likes this approach is that it reuses the infrastructure already available in the filesystem layer. It can turn snapshots into regular files, for example. Chris Mason said that he uses loopback devices for some containers, but that this mechanism would be better. Chinner acknowledged that and noted that he has some "wild plans" for page-cache sharing that will make it even better. There are lots of use cases, he said, so he will get his act together and post patches soon.
| Index entries for this article | |
|---|---|
| Kernel | Block layer |
| Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
Posted May 9, 2018 17:06 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
We already have LVM. So just use it as a substrate for a filesystem, monitor its free space and extend the filesystem as needed. ext4 and xfs both support online expansion.
If free space/volumes in the LVM array are running low, simply add more. No need for any new kernel-level features; this can be implemented entirely in user space.
Posted May 10, 2018 0:31 UTC (Thu) by Paf (subscriber, #91811)
Are you talking about some more specific instance of thin provisioning?
Posted May 10, 2018 0:56 UTC (Thu) by ringerc (subscriber, #3071)
"Just add more storage when you need it" is easy when you're managing individual servers and you're responsible for the storage and the server.
It quickly becomes unfeasible when you're running an infrastructure where people can request (often virtual) servers and run them for $WHATEVER. The server is their problem. They may have one, they may be part of a team running hundreds or thousands of them. Having numerous small sysadmin teams doesn't work well and leads to low server utilisation, excessive divergence in tooling and SOE between teams, overlapping software licensing spends, and more. So your organisation has moved toward centralising the underlying server infrastructure with virtualisation and SAN storage to improve utilisation, reduce hardware and power costs, and make staffing more efficient.
As a result you have this big SAN with, say, 50TB of storage. You don't want to upgrade to 200TB and allocate 4x or more the needed storage for all your servers just for slush free space, in case they might need it. Most of them won't. But you don't want server admins coming to you many times a day asking for another small space increase and doing constant maintenance work to resize file systems just-in-time either.
So you set the SAN up to lie about the real size of the LUNs it hands out. Sysadmin asks for 1TB for each of 10 servers? Sure, you can have 10TB, but you need to give me an estimate of likely actual growth for my capacity planning, because most of that storage won't exist when I give it to you. I'll add it progressively as it's actually needed. You don't have to resize your file systems all the time, I don't have to spend tons of power and money on empty disks. Their servers just have a FC HBA, 10G ethernet iSCSI or whatever to talk to the SAN.
Over larger farms of servers growth tends to even out and you get fairly predictable, plannable capacity needs and plannable costs. Things managers just love. SAN vendors love it too, because they can ship you powered down racks of extra disks and charge you to turn them on - no need for a site visit.
The support for thin provisioning on Linux can be used to implement similar schemes for things like a VM host that carries thousands or tens of thousands of VMs, most of which don't need much space, but where you can't easily predict which ones will, and when.
But everything's an estimate. If you fail to capacity plan properly or monitor properly you can run out of thin provisioned space entirely. That's quite bad. Most systems offer safeguards like reserved space pools for subsets of critical servers, so it won't bring down everything. But centralising storage does centralise failure.
That failure should still be limited to "oops, we can't write data to this volume" though. The problem is that right now, it isn't. Due to the fsync issues discussed on lwn recently and due to FS layer limitations, thin provisioning failures are not handled very gracefully, and write failures may turn into data loss.
Posted May 10, 2018 9:31 UTC (Thu) by dgm (subscriber, #49227)
That said, I fail to see how thin provisioning is better than allowing users to freely take more space as they need it.
By your explanation, it seems that it is used like some kind of pre-granted quota that you can use without explicit permission, but with the undesirable side effect of unexpected errors when resources that are supposedly there fail to materialize. That in turn forces complexity on the applications, because you cannot make assumptions.
If you want to have quotas, why don't you use quotas? Or is there some other reason?
Posted May 10, 2018 10:38 UTC (Thu) by farnz (subscriber, #17727)
Thin provisioning moves the management overhead from the thousands or tens of thousands of VMs using the storage system, to the single storage system.
Instead of having to ask for space in small chunks (say 10 GiB at most), and having to expand regularly to cope on each of the VMs, you can give the VMs virtual disks that appear big enough for (say) 5 years worth of predicted use. The server admin only has to check in and ask for more space when they have new projections showing that they need much more space than they have currently been assigned.
In the meantime, though, you now have tens of thousands of machines that have enough space for the worst case 5 year projection; they're not going to need that up front, so you really don't want to buy that much disk space today, only to leave hundreds of disks empty. Instead, you thin-provision; each virtual disk is only allocated as much space on the real drives as it really needs today, and you set up monitoring so that you know when you're going to need more real disks and can order them in before you need them.
This means that instead of tens of thousands of server admins allocating 10 GiB chunks every couple of days, you have a small number of storage admins buying and bringing online disks every few weeks or months as needed for your growth. Your servers think they have 500 PiB of space between them, but you've only bought 200 TiB so far, and you bring another 10 TiB online whenever free space runs low.
If it all works as designed, thin provisioning is completely transparent to the clients - you always have more real disk than you're using, so the impact of thin provisioning is that you neither buy a huge amount of otherwise idle disk nor have to keep allocating space to the servers that need it. You just wait for the storage array to run low, then add another chunk.
Posted May 10, 2018 18:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
This way you don't need to have untested failure paths that can happen at any moment.
Posted May 11, 2018 0:14 UTC (Fri) by dgc (subscriber, #6611)
Read the article again - that's essentially what the infrastructure I was talking about provides us with.
i.e. it provides us with the ability for the filesystem to request the amount of space it needs from the underlying storage before it starts doing operations that depend on that space being available. Hence if the device does not have space available, then the filesystem can back out and report ENOSPC to the application before getting into a state it can't sanely recover from if it receives device-based ENOSPC errors during IO....
-Dave.
Posted May 12, 2018 17:41 UTC (Sat) by fyrchik (guest, #124371)
> your size?" request to the iSCSI/whatever block devices
That's why the first thing you should do after getting such fake space is dd if=/dev/urandom of=/file. This would give you some guarantee that the space you need is actually there, and it should avoid unpleasant surprises in the future.