
Blurred boundaries in the storage stack


March 24, 2016

This article was contributed by Neil Brown

It has been said that an important part of a maintainer's role is to say "no". Just how this "no" is said can define the style and effectiveness of a maintainer. Linus Torvalds recently displayed just how effective his style can be when saying "no" to a pair of fairly innocuous patches to add a new ioctl() command for block devices — patches in their fifth revision that had already received "Reviewed-by" tags from Christoph Hellwig:

NAK, just based on annoyance with the randomness of this interface

It became clear that Torvalds only had a fairly general understanding of the underlying functionality and didn't much care about it anyway. What he cared about, as he said, was the interface. It seemed both "too specific" and too generic; "too 'future-proofing'".

These complaints led to a wide-ranging discussion that brought out a number of underlying issues, drew parallels between disparate parts of the storage stack, and resulted in a new interface proposal that gives quite a different flavor to the same basic operations.

The heart of the matter

Modern storage devices can do a lot more with stored data than simply read or write arbitrary blocks. Of the other operations, the best known is doubtless "discard". This operation, named TRIM in the ATA protocol and UNMAP in SCSI, tells the storage device that the data in some blocks is no longer needed. It is well known because it is both valuable and problematic: some SSDs work better if unused space is regularly trimmed, but trim implementations differ between devices in both efficiency and effectiveness. This variation means that users often need to know precise details of their hardware to achieve the best performance.

There is an operation that is the inverse of discard that is important for thin-provisioned devices. Thin provisioning allows a storage array to appear to be extremely large, while only having physical capacity for a much smaller amount of storage. As data is written, the available storage is allocated to the target addresses. As the free space shrinks, the device administrator is alerted and action can be taken, which could include acquiring extra physical capacity.

A particularly useful operation when using a thin-provisioned device is to request that storage space be allocated before actually writing data to it. This makes it possible to report allocation problems earlier and to avoid unpleasant surprises. The SCSI spec refers to these unwritten allocations as ANCHORED blocks, and supports anchoring with the WRITE SAME SCSI command, which writes a particular block of data (often zeros) to multiple locations over a given range of addresses.

The Linux block layer has an interface, blkdev_issue_zeroout(), that combines both the de-allocation of discard and the pre-allocation of WRITE SAME with the more generic goal of zeroing out a range of blocks on a device. Depending on the capabilities of the device and on the "discard" flag that is passed to the function as a hint, it will issue a discard request (i.e. TRIM or UNMAP), a WRITE SAME request, or write a zeroed page of memory to every block in the range. Future reads are guaranteed to return zeros, but pre-allocation or de-allocation happens on a best-effort basis.

The "discard" hint flag and the possible issuing of a discard request is a relatively recent addition and is, importantly, different from the similar blkdev_issue_discard() interface. The latter will issue a discard even if the result might be that subsequent reads return random data. blkdev_issue_zeroout() will only issue a discard if future reads will reliably return zeros.

Simple patches for a simple problem

The pair of patches that Darrick Wong posted do two things. Primarily, they add a new ioctl() command so that the "discard" flag can be set from user space; the existing BLKZEROOUT ioctl() calls blkdev_issue_zeroout() but always sets the "discard" flag to zero. Hoping not to have to create yet another command if more functionality is ever added to blkdev_issue_zeroout(), Wong defined the new BLKZEROOUT2 with room for expansion: 32 flags, of which only one was used, and even some "padding" fields that must be zero now but could be given meaning later.

The other effect of these patches is to purge parts of the page cache for the block device when blocks are zeroed. Normal reads and writes on a block device (e.g. /dev/sda) are cached in the page cache. An O_DIRECT write is instead sent directly to the device, which could make it inconsistent with the page cache. To avoid such inconsistency, the corresponding pages of the page cache are removed when an O_DIRECT write happens. BLKZEROOUT is much like an O_DIRECT write, so, with the patches applied, both it and BLKZEROOUT2 will purge the page cache.

Torvalds's response seems to be based on an intuitive "it doesn't feel right" rather than clear logical reasoning. One flaw he identified was not actually present in the code; it boiled down to "I absolutely detest code that tries to be overly forward-thinking", which is a little surprising given the problems there have been with system calls not having a suitable flags argument. Most of the rest is summed up by his comment: "So the whole patch looks pointless." He did approve of purging the page cache, though.

As the discussion progressed and requirements were more explicitly stated, the source of Torvalds's discomfort became clearer. The operations of interest deserved to be thought about at a much higher level than just ioctl() commands for a block device. They are much more like operations on a file — to allocate and de-allocate backing store.

The Linux fallocate() system call has a flag, FALLOC_FL_PUNCH_HOLE, that is a lot like TRIM, particularly the style of TRIM that causes future reads to return zeros. fallocate() also has a FALLOC_FL_ZERO_RANGE flag, which is a good match for WRITE SAME or writing zeros. Rather than providing an ioctl() command focused on matching the low-level functionality of certain hardware, using fallocate() would reuse an existing high-level interface that is described in terms of the needs of applications. Existing fallocate() implementations already purge the page cache as appropriate; had this approach been used instead of the initial BLKZEROOUT ioctl() command, those implementations would likely have served as a guide, and we would not have the current situation where zeros can be written without any purge.

Wong provided a new patch set that added fallocate() support for block devices; this received much warmer support from Torvalds. He found a few little nits, but admitted that "on the whole, I like it". This was a fitting close to a maintainership interaction done really well: Torvalds followed his intuition and complained about things that bothered him, despite not having a full picture of the problem space. Wong responded directly, called Torvalds out where he was clearly wrong, and attempted to justify other choices with extra details. A more complete picture was formed, against which preferences could be explained more coherently. Finally a resolution was found, implemented, and approved — apparently to everyone's satisfaction. This is a model worth following.

An enlightening tangent

While the conclusion to the main thread of discussion was that treating block devices a bit more like files could make it easier to work with new hardware, there was a sub-thread that seemed to head in a complementary direction.

There appear to be a number of user-space file servers — Ceph was given as an example — that use a local filesystem to store data, but aren't really interested in many of the traditional semantics of a filesystem. A good example of this is the O_NOMTIME flag that was discussed last year. These file servers really just want space to store data and want reads and writes to that space to be passed down to the device with minimal friction from the filesystem.

In much the same way as described earlier for thin provisioning, these file servers need to be able to allocate space and write to it later. While they wouldn't object to that space being filled with zeros, they really don't care about the contents of the space, but they do care about the allocation and subsequent writes being fast.

Filesystems do support pre-allocating space with fallocate(), but they typically do so by recording which blocks have been written and which have only been anchored. This means that each subsequent write needs to spend time updating metadata: extra work that brings no value to the file server.

At the beginning of the sub-thread, Ted Ts'o mentioned in passing that he had out-of-tree patches providing a flag, FALLOC_FL_NO_HIDE_STALE, that would do exactly what the file servers want: allocate space so that future writes happen with no further metadata updates. In general, this is a security issue, since reads from unwritten parts of such ranges could return stale, potentially sensitive data belonging to some other user.

Ts'o's patches restrict this operation to a single privileged group ID. There were suggestions that a mount option should be used instead of, or maybe as well as, a special group ID. There were also observations that using the flag in containers could lead to unexpected information leaks. Possibly the most vocal critic was Dave Chinner who was blunt: "it is dangerous and compromises system security. As such, it does not belong in upstream kernels." An example he gave of possible information leaks was automated backups. While the application that pre-allocated space may be trusted to never look at the stale data, once it leaks out in backups it seems to be more exposed.

Torvalds wasn't convinced by Chinner's fears; his only requirement was that it not be too easy to do something dangerous. He has always been in favor of providing functionality that people will actually use, so the fact that Ts'o's out-of-tree patch is widely used within Google carries real weight. It was also noted that these performance issues have already caused Ceph developers to give up on using a local filesystem and instead start using block devices directly, so the issues are clearly real. If the performance benefits can be clearly demonstrated and application developers affirm that they would use the functionality, the remaining barriers are unlikely to stand for long.

If we step back for a moment to take in the big picture, what we see here is the cluster filesystem using a local filesystem much as it would use a logical volume manager. It wants storage space of arbitrary size, with the ability to expand it later. It doesn't care about any metadata except the size, and doesn't care about the initial contents, which in practice could be stale data. This sounds exactly like the logical volumes that LVM2 can provide, though, by being embedded in a filesystem, they would be much easier to manage than LVM2 volumes. In a mirror image of the decision to treat block devices more like files so as to meet the needs of low-level hardware, it seems we might want to treat files more like block devices so as to meet the needs of high-level filesystems.

As Chinner himself noted, there are synergies here with the "splitting filesystems in two" idea that he floated at the Linux Storage, Filesystem, and Memory Management Summit in 2014. While nothing appears to have come of that yet, it is valuable food for thought and something may yet arise as needs and options become clearer. The distinction that Chinner made between "names" and "storage" certainly seems stronger than the distinction between "files" and "block devices", which is showing its weakness. If the old lines are going to blur, it might be useful to have new lines to focus our thoughts on a clearer overall picture. That way, we might not need to depend so much on the intuition of experienced maintainers.



Blurred boundaries in the storage stack

Posted Mar 25, 2016 10:07 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

On a related note, it would be nice to have lseek(SEEK_HOLE) and lseek(SEEK_DATA) on block devices map to the SCSI command "GET LBA STATUS"!

Blurred boundaries in the storage stack

Posted Mar 25, 2016 11:41 UTC (Fri) by gebi (subscriber, #59940) [Link]

"splitting filesystems in two"

Something reminds me about ZFS and its split between the zpool and the ZFS POSIX layer, where the zpool layer provides "partitioning", checksums, encryption, compression, and a transactional interface to implement a POSIX FS on top.

Wouldn't that be an abstraction worth exploring?

Blurred boundaries in the storage stack

Posted Mar 26, 2016 1:09 UTC (Sat) by oshepherd (guest, #90163) [Link]

ZFS is actually three layers: the zpool volume manager, the ZFS object store, and the ZFS POSIX Layer which uses the object store to provide a POSIX-like FS.

Blurred boundaries in the storage stack

Posted Mar 25, 2016 13:27 UTC (Fri) by lmb (subscriber, #39048) [Link]

I see the Ceph pain (obviously), but potentially exposing stale data on the disk ... just, like, no. With a Ceph cluster hosting multiple tenants, this can go wrong really fast. Such leaks need to be prevented as low as possible in the stack.

Blurred boundaries in the storage stack

Posted Mar 25, 2016 22:55 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Ceph won't expose stale data. It uses the file system as a cache, providing all the logic in the upper layers.

Blurred boundaries in the storage stack

Posted Mar 25, 2016 17:23 UTC (Fri) by karkhaz (subscriber, #99844) [Link]

> This was a fitting close to a maintainership interaction done really well: Torvalds followed his intuition and complained about things that bothered him, despite not having a full picture of the problem space. Wong responded directly, called Torvalds out where he was clearly wrong, and attempted to justify other choices with extra details. A more complete picture was formed, against which preferences could be explained more coherently. Finally a resolution was found, implemented, and approved — apparently to everyone's satisfaction. This is a model worth following.

Cheers for highlighting this. Amidst complaints that interaction on the LKML and other venues is uncivil and hostile, it is nice to have examples of interactions that went well, and *why* they turned out well. There was a BOF during last year's DebConf where the issue of recognising pleasant and effective communication was raised [link to video]:

http://saimei.acc.umu.se/pub/debian-meetings/2015/debconf...

Blurred boundaries in the storage stack

Posted Mar 25, 2016 18:01 UTC (Fri) by bronson (subscriber, #4806) [Link]

Agree 100%, the LKML is nowhere near as bad as Hacker News would like to believe. Great article.

That said, I'd like to hear a little more about how Wong called Linus out where he was clearly wrong. :)

Blurred boundaries in the storage stack

Posted Mar 26, 2016 2:06 UTC (Sat) by firasha (subscriber, #4230) [Link]

The whole thread is worth reading, but I believe Neil was referring to this post.

Blurred boundaries in the storage stack

Posted Mar 26, 2016 6:11 UTC (Sat) by bronson (subscriber, #4806) [Link]

A nice gentle call-out. Thanks, that was a great thread.

Blurred boundaries in the storage stack

Posted Mar 28, 2016 9:34 UTC (Mon) by paulj (subscriber, #341) [Link]

This article states Torvalds "only had a fairly general understanding of the underlying functionality" and later "Torvalds followed his intuition and complained about things that bothered him, despite not having a full picture of the problem space. Wong responded directly, called Torvalds out where he was clearly wrong, and attempted to justify other choices with extra details".

Some readers may be tempted to interpret that as Torvalds not being qualified to object to the patch (he was "wrong" and Wong was right to "call him out" on that). I'd like to give a different interpretation, as someone who helps maintain another free software project:

The patch contributor failed to give, in their commit message, the information required to understand the patch set: both explaining the abstract motivation in terms of the use-cases (starting at a high level) and explaining how the implementation addresses that motivation.

I see this a *lot* in contributions to the project I work on. We get patches with commit messages that fail to communicate *why* the patch exists to begin with, and then *how* it addresses the motivation. Further, the "how" part should also mention alternative possible approaches and discuss the pro/cons of the chosen approach and why it is the most suitable - this is almost *never* done.

Contributors need to understand their commit message is the "map" the reviewer/maintainer will use to mentally make their way around the patch and judge it. If no real "map" is provided, the reviewer/maintainer will have to reconstruct it for themselves, if that's even possible. At a minimum they'll have expended more effort and may be grumpier as a result. As a contributor you really want to lower the friction for someone else to review your patch. So provide a commit message with structure that addresses the motivation and implementation of the patch, and its effects. I.e.:

- Why?: What was the original problem or use-case? (in the abstract, independent of this patch).

- How?: How does this patch address the problem or use-case? What other approaches were considered and why was this approach chosen? What limitations does this approach have? What limitations does this implementation have?

- What?: What was the result? What testing was done (and how can it be tested)? What effect was there on performance?

Discussing the "why" is needed to help convince a maintainer that a patch is actually needed. Maintainers see lots of patches, and they need to decide whether the benefit of a patch is worth the churn to the code-base. A patch whose motivation is unclear or not well explained risks being dismissed as useless churn!

Discussing the "How" is useful in convincing the maintainer that you've thought through the patch, and considered other alternatives - especially for any non-trivial patches. For longer patches, providing a detailed walk-through can help in convincing the maintainer that you've at least self-reviewed the patch before hitting send.

Discussing the "What" is useful to give the maintainer confidence in the patch. Note that if you've done /no/ testing, you should say so, and it is much better to say it up front. Others will at least have confidence in /you/, knowing that you will be clear when a patch of yours has had no testing.

tl;dr: If a maintainer didn't understand your patch, then the problem very likely was that your commit message sucked. Write good commit messages!

Blurred boundaries in the storage stack

Posted Mar 31, 2016 5:14 UTC (Thu) by ashkulz (subscriber, #102382) [Link]

That's a fantastic writeup, I'm going to bookmark it.

High watermark?

Posted Apr 7, 2016 13:43 UTC (Thu) by Wol (guest, #4433) [Link]

I guess this must have cost implications, but would it not be possible to have two counters? Disks tend to write and allocate space from the start, so you do the same thing for files here :-)

If I want, say, 1TB as a virtual disk, there's nothing stopping the system allocating that space. It then sets a "high water mark" at 0. The guarantee is that any attempt to write beyond the high water mark will fill the space between the old high water mark and where I'm writing with 0s before resetting the high water mark. Any attempt to read beyond the high water mark will return zeros. That way, I can't read any disk that I haven't written.

Actually, that sounds very like the current setup with holes in files, which are supposed to return 0s if a read hits a hole, aren't they? But this pushes the bulk of the cost of the security onto writes, which are slow anyway, so would that be a reasonable trade-off?

Cheers,
Wol

High watermark?

Posted Apr 8, 2016 0:10 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> Disks tend to write and allocate space from the start,

Do they? I don't think disks allocate space at all - they provide space.
File systems allocate space. Some, like FAT, may allocate from the start. Many "modern" filesystems have allocation groups sprinkled across the address space and allocate from different allocation groups based on different criteria.
I don't think a high water mark would help a lot.


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds