Ideas for supporting shingled magnetic recording (SMR)
At the 2014 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Dave Chinner and Ted Ts'o jointly led a session that ended up spanning two slots over two days. The topic was, broadly, whether the filesystem or the block layer was the right interface for supporting shingled magnetic recording (SMR) devices. In the end, it ranged a bit more broadly than that.
Zone information API and cache
Ts'o began with a description of a kernel-level C interface to the zone information returned by SMR devices that he has been working on. SMR devices will report the zones that are present on the drive, their characteristics (size, sequential-only, ...), and the location of the write pointer for each sequential-only zone. Ts'o's idea is to cache that information in a compact form in the kernel so that multiple "report zones" commands do not need to be sent to the device. Instead, interested kernel subsystems can query for the sizes of zones and the position of the write pointer in each zone, for example.
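The caching idea can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not Ts'o's interface: the names `Zone`, `ZoneCache`, `zone_size()`, and `write_pointer()` are hypothetical, and the `report_zones` callable stands in for whatever command the device actually answers.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Zone:
    # Hypothetical record for one zone; field names are assumptions.
    start: int                            # first LBA of the zone
    length: int                           # zone size in blocks
    sequential: bool                      # True for sequential-only zones
    write_pointer: Optional[int] = None   # None for non-sequential zones


class ZoneCache:
    """Fetch the zone report once, then answer queries from memory."""

    def __init__(self, report_zones: Callable[[], List[Zone]]):
        self._report_zones = report_zones  # stand-in for the device command
        self._zones: List[Zone] = []
        self._starts: List[int] = []
        self._loaded = False

    def _load(self) -> None:
        # Only the first query reaches the device.
        if not self._loaded:
            self._zones = sorted(self._report_zones(), key=lambda z: z.start)
            self._starts = [z.start for z in self._zones]
            self._loaded = True

    def zone_for(self, lba: int) -> Zone:
        self._load()
        # Binary search for the last zone starting at or before lba.
        i = bisect.bisect_right(self._starts, lba) - 1
        if i < 0 or lba >= self._zones[i].start + self._zones[i].length:
            raise ValueError(f"LBA {lba} is not inside any reported zone")
        return self._zones[i]

    def zone_size(self, lba: int) -> int:
        return self.zone_for(lba).length

    def write_pointer(self, lba: int) -> Optional[int]:
        return self.zone_for(lba).write_pointer
```

A subsystem would construct one `ZoneCache` per device and call `zone_size()` or `write_pointer()` as often as it likes; only the first call triggers a report. The sketch deliberately ignores the harder problem of keeping cached write pointers in step with the device.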
The interface for user space would be ioctl(), Ts'o said, though James Bottomley thought a sysfs interface made more sense. Chinner was concerned about having thousands of entries in sysfs, and Ric Wheeler noted that there could actually be tens of thousands of zones in a given device.
The data structures he is using assume that zones are mostly grouped into regions of same-sized zones, Ts'o said. He is "optimizing for sanity", but the interface would support other device layouts. Zach Brown wondered why the kernel needed to cache the information, since that might require snooping the SCSI bus, looking for reset write pointer commands. No one thought snooping the bus was viable, but some thought disallowing raw SCSI access was plausible. Bottomley dumped cold water on that with a reminder that the SCSI generic (sg) layer would bypass Ts'o's cache.
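To see why grouping zones into regions keeps the layout compact, consider the following sketch. It is not Ts'o's data structure; `Region`, `RegionMap`, and `locate()` are hypothetical names, and the assumption is simply that each region holds a run of equally sized zones.

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    # Hypothetical description of a run of same-sized zones.
    start: int        # first LBA of the region
    zone_size: int    # size of every zone in the region, in blocks
    zone_count: int   # number of zones in the region
    sequential: bool  # True if the zones are sequential-only

    @property
    def end(self) -> int:
        return self.start + self.zone_size * self.zone_count


class RegionMap:
    """Describe the zone layout with one entry per region, not per zone."""

    def __init__(self, regions: List[Region]):
        self.regions = sorted(regions, key=lambda r: r.start)
        self._starts = [r.start for r in self.regions]
        # Global index of the first zone in each region.
        self._first_index: List[int] = []
        total = 0
        for r in self.regions:
            self._first_index.append(total)
            total += r.zone_count
        self.zone_count = total

    def locate(self, lba: int) -> Tuple[int, int, Region]:
        """Return (global zone index, zone start LBA, region) for an LBA."""
        i = bisect.bisect_right(self._starts, lba) - 1
        if i < 0 or lba >= self.regions[i].end:
            raise ValueError(f"LBA {lba} is outside every region")
        r = self.regions[i]
        # Zones in a region are equal-sized, so the zone is found by division.
        offset = (lba - r.start) // r.zone_size
        return self._first_index[i] + offset, r.start + offset * r.zone_size, r
```

With this layout, a device with tens of thousands of zones can be described by a handful of `Region` entries. Per-zone state such as write pointers would still need one slot per sequential zone; the sketch leaves that out.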
The question of how to handle host-managed devices (where the host must ensure that all writes to sequential zones are sequential) then came up. Ts'o said he has seen terrible one-second latencies in host-aware devices, where the host can make mistakes and a translation layer will remap the non-sequential writes, which can lead to garbage collection. That means users will want Linux to support host-managed behavior, which should avoid these latencies even on host-aware devices.
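The difference between the two device types can be modeled with a toy sequential zone. This is a sketch, not a description of real firmware: `SequentialZone` and its fields are hypothetical, and counting remapped writes is only a stand-in for the translation-layer work that was said to cause the latencies.

```python
class SequentialZone:
    """Toy model of one sequential zone on an SMR drive (an assumption)."""

    def __init__(self, start: int, length: int, host_managed: bool):
        self.start = start
        self.length = length
        self.host_managed = host_managed
        self.write_pointer = start   # next LBA that may be written in order
        self.remapped_writes = 0     # out-of-order writes absorbed (host-aware)

    def write(self, lba: int, nblocks: int) -> str:
        if lba < self.start or lba + nblocks > self.start + self.length:
            raise ValueError("write falls outside the zone")
        if lba == self.write_pointer:
            # In-order write: both device types handle this the cheap way.
            self.write_pointer += nblocks
            return "sequential"
        if self.host_managed:
            # Host-managed: the host is responsible, so the write is refused.
            raise OSError("non-sequential write rejected by host-managed zone")
        # Host-aware: the write is accepted but must be remapped, which is
        # the path that can later lead to garbage collection.
        self.remapped_writes += 1
        return "remapped"
```

A host that always writes at the write pointer never reaches the "remapped" path, which is why host-managed behavior in the kernel would also help on host-aware drives.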
But, as Chinner pointed out, there are things that have fixed layouts in user space that cannot be changed. For example, mkfs zeroes out the end of the partition, and SMR drives have to be able to work with that, he said. He is "highly skeptical" that host-managed devices will work at all with Linux. Nothing that Linux has today can run on host-managed SMR devices, he said. But those devices will likely be cheaper to produce, so they will be available and users will want support for them. An informal poll of the device makers in the room about the host-managed vs. host-aware question was largely inconclusive.
Ts'o suggested using the device mapper to create a translation layer in the kernel that would support host-managed devices. "We can fix bugs quicker than vendors can push firmware." But, as Chris Mason pointed out, any new device mapper layer won't be available to users for something like three years, while there is a need to support both types of SMR devices "tomorrow". The first session expired at that point, without much in the way of real conclusions.
When it picked up again, Ts'o had shifted gears a bit. There are a number of situations where the block device is "doing magic behind the scenes", for example SMR and thin provisioning with dm-thin. What filesystems have been doing to try to optimize their layout for basic, spinning drives is not sensible in other scenarios. For SSDs, the translation layer and the drives themselves are so fast that filesystems don't need to care about the translation layer and other magic happening in the drive firmware. For SMR and other situations, that may not be true, so there is a need to rethink the filesystem layer somewhat.
Splitting filesystems in two
That was an entrée to Chinner's thoughts about filesystems. He cautioned that he had just started to write things down, and is open to other suggestions and ideas, but he wanted to get feedback on his thinking. A filesystem really consists of two separate layers, Chinner said: a namespace layer and a block allocation layer. Linux filesystems have done a lot of work to optimize the block allocations for spinning devices, but there are other classes of device, SMR and persistent memory for example, where those optimizations fall down.
So, in order to optimize block allocation for all of these different kinds of devices, it would make sense to split out block allocation from namespace handling in filesystems. The namespace portion of filesystems would remain unchanged, and all of the allocation smarts would move to a "smart block device" that would know the characteristics of the underlying device and be able to allocate blocks accordingly.
The filesystem namespace layer would know things like the fact that it would like a set of allocations to be contiguous, but the block allocator could override those decisions based on its knowledge. If it were allocating blocks on an SMR device and recognized that it couldn't put the data in a contiguous location, it would return "nearby" blocks. For spinning media, it would return contiguous blocks, but for persistent memory, "we don't care", so it could just return some convenient blocks. Any of the existing filesystems that do not support copy-on-write (COW) cannot really be optimized for SMR, he said, because you can't overwrite data in sequential zones. That would mean adding COW to ext4 and XFS, Chinner said.
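The split Chinner described can be sketched as one allocator interface with a device-specific implementation behind it. Everything here is hypothetical: the class names, the `want_contiguous` hint, and the extent format are assumptions made for illustration, and the allocation policies are far simpler than anything a real filesystem would use.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Extent = Tuple[int, int]  # (start block, length); format is an assumption


class OutOfSpace(Exception):
    pass


class BlockAllocator(ABC):
    """What the namespace layer sees: a hint goes in, extents come out."""

    @abstractmethod
    def allocate(self, nblocks: int, want_contiguous: bool = True) -> List[Extent]:
        ...


class SpinningAllocator(BlockAllocator):
    """Spinning media: honor the hint with a single contiguous extent."""

    def __init__(self, total_blocks: int):
        self.total = total_blocks
        self.next_free = 0

    def allocate(self, nblocks, want_contiguous=True):
        if self.next_free + nblocks > self.total:
            raise OutOfSpace()
        extent = (self.next_free, nblocks)
        self.next_free += nblocks
        return [extent]


class SmrAllocator(BlockAllocator):
    """SMR: allocate at zone write pointers, spilling into a nearby zone."""

    def __init__(self, zone_size: int, zone_count: int):
        self.zone_size = zone_size
        self.write_pointers = [z * zone_size for z in range(zone_count)]

    def allocate(self, nblocks, want_contiguous=True):
        free = sum((z + 1) * self.zone_size - wp
                   for z, wp in enumerate(self.write_pointers))
        if nblocks > free:
            raise OutOfSpace()  # checked first so no state changes on failure
        extents, remaining = [], nblocks
        for z, wp in enumerate(self.write_pointers):
            if remaining == 0:
                break
            take = min((z + 1) * self.zone_size - wp, remaining)
            if take:
                extents.append((wp, take))
                self.write_pointers[z] = wp + take
                remaining -= take
        return extents


class PmemAllocator(BlockAllocator):
    """Persistent memory: placement does not matter, so ignore the hint."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))

    def allocate(self, nblocks, want_contiguous=True):
        if nblocks > len(self.free):
            raise OutOfSpace()
        return [(self.free.pop(), 1) for _ in range(nblocks)]
```

The namespace layer would call `allocate()` the same way regardless of the device; only the extents it gets back differ. Note that the SMR sketch only ever appends, which is consistent with the point that overwrite-in-place filesystems would need COW to use it.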
But splitting the filesystem into two pieces means that the on-disk format can change, he said. All the namespace layer cares about is that the metadata it carries is consistent. But Ts'o brought up something that was obviously on the minds of many in the room: how is it different from object-based storage, which was going to start taking over fifteen years ago but hasn't?
Chinner said that he had no plans to move things like files and inodes down into the block allocation layer, as object-based storage does; there would just be a layer that would allocate and release blocks. He asked: Why do the optimization of block allocation for different types of devices in each filesystem?
Another difference between Chinner's idea and object-based storage is that the metadata stays with the filesystem, unlike moving it down to the device as it is in the object-based model, Bottomley said. Chinner said that he is not looking to allocate an object that he can attach attributes to, just creating allocators that are optimized for a particular type of device. Once that happens, it would make sense to share those allocators with multiple filesystems.
Mason noted that what Chinner was describing was a lot like the FusionIO filesystem DirectFS. Chinner said that he was not surprised that others have come up with these ideas in the past, though he had looked and did not find much documentation on DirectFS. It is not necessarily new, but he is looking at it as a way to solve some of the problems that have cropped up.
Bottomley asked how to get to "something we can test". Chinner thought it would take six months of work, but there is still lots to do before that work could start. "Should we take this approach?", he asked. Wheeler thought the idea showed promise; it avoids redundancy and takes advantage of the properties of new devices. Others were similarly positive, though they wanted Chinner to keep the reasons that object-based storage failed firmly in mind as he worked on it. Chinner thought a proof-of-concept should be appearing in six to twelve months' time.
[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]
| Index entries for this article | |
|---|---|
| Conference | Storage, Filesystem, and Memory-Management Summit/2014 |
