
Ideas for supporting shingled magnetic recording (SMR)


By Jake Edge
April 2, 2014
2014 LSFMM Summit

At the 2014 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Dave Chinner and Ted Ts'o jointly led a session that ended up spanning two slots over two days. The topic was, broadly, whether the filesystem or the block layer was the right interface for supporting shingled magnetic recording (SMR) devices. In the end, it ranged a bit more broadly than that.

Zone information API and cache

Ts'o began with a description of a kernel-level C interface to the zone information returned by SMR devices that he has been working on. SMR devices will report the zones that are present on the drive, their characteristics (size, sequential-only, ...), and the location of the write pointer for each sequential-only zone. Ts'o's idea is to cache that information in a compact form in the kernel so that multiple "report zones" commands do not need to be sent to the device. Instead, interested kernel subsystems can query for the sizes of zones and the position of the write pointer in each zone, for example.
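
Ts'o did not present code, but the shape of such a cache is easy to imagine. The sketch below is purely illustrative, with invented names that are not the interface he described, showing how runs of same-sized zones might be described compactly alongside a per-zone write pointer:

    /* Illustrative sketch only; these structures are invented for this
     * article and are not the interface Ts'o described.  Zones with the
     * same size and type are grouped into runs so that tens of thousands
     * of zones stay cheap to describe. */
    #include <stdint.h>

    enum zone_type {
        ZONE_CONVENTIONAL,      /* random writes allowed */
        ZONE_SEQUENTIAL_ONLY,   /* writes must land at the write pointer */
    };

    struct zone_run {
        uint64_t start_lba;     /* first LBA covered by this run */
        uint64_t zone_len;      /* length of each zone, in sectors */
        uint32_t nr_zones;      /* number of identical zones in the run */
        enum zone_type type;
    };

    /* Per-zone state that cannot be shared across a run: the current
     * write pointer of each sequential-only zone, kept up to date as
     * writes and reset-write-pointer commands go by. */
    struct zone_state {
        uint64_t write_pointer;
    };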

The interface for user space would be ioctl(), Ts'o said, though James Bottomley thought a sysfs interface made more sense. Chinner was concerned about having thousands of entries in sysfs, and Ric Wheeler noted that there could actually be tens of thousands of zones in a given device.

The data structures he is using assume that zones are mostly grouped into regions of same-sized zones, Ts'o said. He is "optimizing for sanity", but the interface would support other device layouts. Zach Brown wondered why the kernel needed to cache the information, since that might require snooping the SCSI bus, looking for reset write pointer commands. No one thought snooping the bus was viable, but some thought disallowing raw SCSI access was plausible. Bottomley dumped cold water on that with a reminder that the SCSI generic (sg) layer would bypass Ts'o's cache.

The question of how to handle host-managed devices (where the host must ensure that all writes to sequential zones are sequential) then came up. Ts'o said he has seen terrible one-second latencies in host-aware devices (where the host can make mistakes and a translation layer will remap the non-sequential writes, which can lead to garbage collection and long stalls), which means that users will want Linux to support host-managed behavior. That should avoid these latencies even on host-aware devices.

But, as Chinner pointed out, there are things that have fixed layouts in user space that cannot be changed. For example, mkfs zeroes out the end of the partition, and SMR drives have to be able to work with that, he said. He is "highly skeptical" that host-managed devices will work at all with Linux. Nothing that Linux has today can run on host-managed SMR devices, he said. But those devices will likely be cheaper to produce, so they will be available and users will want support for them. An informal poll of the device makers in the room about the host-managed vs. host-aware question was largely inconclusive.

Ts'o suggested using the device mapper to create a translation layer in the kernel that would support host-managed devices. "We can fix bugs quicker than vendors can push firmware." But, as Chris Mason pointed out, any new device mapper layer won't be available to users for something like three years, but there is a need to support both types of SMR devices "tomorrow". The first session expired at that point, without much in the way of real conclusions.
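
To make the translation-layer idea concrete, here is a minimal user-space sketch, with invented names and no relation to any actual device-mapper target, of the core job such a layer has to do: turn random writes into sequential appends within a zone, and remember where each logical block ended up so that reads can be redirected later.

    /* Hypothetical sketch only; not device-mapper code. */
    #include <stdint.h>

    #define ZONE_SECTORS (256ULL * 2048)   /* e.g. 256 MiB zones of 512-byte sectors */

    struct smr_map {
        uint64_t *lba_to_phys;   /* logical sector -> physical sector */
        uint64_t  zone_start;    /* first sector of the currently open zone */
        uint64_t  write_pointer; /* next free sector in that zone */
    };

    /* Remap one logical write: append it at the zone's write pointer and
     * record the new location.  A real implementation would open a new
     * zone when this one fills up and eventually garbage-collect stale
     * blocks, which is where the latencies discussed above come from. */
    static uint64_t smr_remap_write(struct smr_map *m, uint64_t lba)
    {
        if (m->write_pointer >= m->zone_start + ZONE_SECTORS)
            return UINT64_MAX;   /* zone full; caller must switch zones */

        uint64_t phys = m->write_pointer++;
        m->lba_to_phys[lba] = phys;
        return phys;
    }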

When it picked up again, Ts'o had shifted gears a bit. There are a number of situations where the block device is "doing magic behind the scenes", for example SMR and thin provisioning with dm-thin. What filesystems have been doing to try to optimize their layout for basic spinning drives is not sensible in those scenarios. For SSDs, the translation layer and the drives themselves are fast enough that filesystems don't need to care about the magic happening in the drive firmware. For SMR and other situations, that may not be true, so there is a need to rethink the filesystem layer somewhat.

Splitting filesystems in two

That was an entrée to Chinner's thoughts about filesystems. He cautioned that he had just started to write things down, and is open to other suggestions and ideas, but he wanted to get feedback on his thinking. A filesystem really consists of two separate layers, Chinner said: a namespace layer and a block allocation layer. Linux filesystems have done a lot of work to optimize the block allocations for spinning devices, but there are other classes of device, SMR and persistent memory for example, where those optimizations fall down.

So, in order to optimize block allocation for all of these different kinds of devices, it would make sense to split out block allocation from namespace handling in filesystems. The namespace portion of filesystems would remain unchanged, and all of the allocation smarts would move to a "smart block device" that would know the characteristics of the underlying device and be able to allocate blocks accordingly.

The filesystem namespace layer would know things like the fact that it would like a set of allocations to be contiguous, but the block allocator could override those decisions based on its knowledge. If it were allocating blocks on an SMR device and recognized that it couldn't put the data in a contiguous location, it would return "nearby" blocks. For spinning media, it would return contiguous blocks, but for persistent memory, "we don't care", so it could just return some convenient blocks. Any of the existing filesystems that do not support copy-on-write (COW) cannot really be optimized for SMR, he said, because you can't overwrite data in sequential zones. That would mean adding COW to ext4 and XFS, Chinner said.
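
As a rough illustration of that split (the names below are invented, not anything Chinner proposed), the namespace layer would pass a hint and the device-specific allocator would be free to honor or override it:

    /* Hypothetical sketch of a "smart block device" allocation interface;
     * none of these names come from actual kernel code. */
    #include <stdint.h>

    enum alloc_hint {
        ALLOC_CONTIGUOUS,   /* spinning media: try hard for one extent */
        ALLOC_NEARBY,       /* SMR: sequential zones, "close enough" will do */
        ALLOC_DONT_CARE,    /* persistent memory: location is irrelevant */
    };

    struct block_extent {
        uint64_t start;     /* first allocated block */
        uint64_t len;       /* number of blocks actually granted */
    };

    /* Implemented once per device class and shared by filesystems; the
     * namespace layer just calls alloc_blocks() and records the result. */
    struct smart_bdev_ops {
        int (*alloc_blocks)(void *dev, uint64_t count, enum alloc_hint hint,
                            struct block_extent *out);
        int (*free_blocks)(void *dev, const struct block_extent *ext);
    };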

But splitting the filesystem into two pieces means that the on-disk format can change, he said. All the namespace layer cares about is that the metadata it carries is consistent. But Ts'o brought up something that was obviously on the minds of many in the room: how is this different from object-based storage, which was going to start taking over fifteen years ago but never did?

Chinner said that he had no plans to move things like files and inodes down into the block allocation layer, as object-based storage does; there would just be a layer that would allocate and release blocks. He asked: Why do the optimization of block allocation for different types of devices in each filesystem?

Another difference between Chinner's idea and object-based storage is that the metadata stays with the filesystem, unlike moving it down to the device as it is in the object-based model, Bottomley said. Chinner said that he is not looking to allocate an object that he can attach attributes to, just creating allocators that are optimized for a particular type of device. Once that happens, it would make sense to share those allocators with multiple filesystems.

Mason noted that what Chinner was describing was a lot like the FusionIO filesystem DirectFS. Chinner said that he was not surprised; he had looked but did not find much documentation on DirectFS, and others have come up with these ideas in the past. It is not necessarily new, but he is looking at it as a way to solve some of the problems that have cropped up.

Bottomley asked how to get to "something we can test". Chinner thought it would take six months of work, but there is still lots to do before that work could start. "Should we take this approach?", he asked. Wheeler thought the idea showed promise; it avoids redundancy and takes advantage of the properties of new devices. Others were similarly positive, though they wanted Chinner to keep the reasons that object-based storage failed firmly in mind as he worked on it. Chinner thought a proof of concept should be appearing in six to twelve months' time.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]


Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 3, 2014 5:30 UTC (Thu) by koverstreet (subscriber, #4296) [Link]

As I noted at the conference, bcache is now at the point where I should be able to add support for SMR drives to it quite easily - provided you're using it in conjunction with another drive (i.e. an SSD for caching).

ETA will just depend on how much time I can devote to it; my plate is kept pretty full.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 3, 2014 8:59 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

> A filesystem really consists of two separate layers, Chinner said: a namespace layer and a block allocation layer. [...] it would make sense to split out block allocation from namespace handling in filesystems.

That's exactly what the ZFS creators did 10 years ago. There is 1) the DMU (Data Management Unit), the allocator, and 2) the ZFS POSIX Layer, which handles names.
The allocator layer can be used by things other than namespace layers, allowing swap, logical volumes, etc. to be carved out of a ZFS pool.

See also: http://www.fh-wedel.de/~si/seminare/ws08/Ausarbeitung/02....

So it's in ZFS..?

Posted Apr 3, 2014 12:29 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

Does anyone know off the top of their head (dear Lazyweb) whether this separation was obvious to practitioners faced with this problem in their field of expertise, but patented by Sun at the time?

K3n.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 7, 2014 21:44 UTC (Mon) by dgc (subscriber, #6611) [Link]

Right, that's exactly what most people fail to recognise with ZFS - the management interface combined filesystem and device management, but the internal implementation was still very strongly layered in the traditional namespace/block/device layers.

i.e. the Linux storage stack design doesn't need to be turned inside out to provide ZFS-like management functionality, and that's a mistake we made 6-7 years ago by taking the btrfs path. Most of us thought btrfs would be stable and widespread years ago, but in hindsight, that was rather optimistic....

-Dave.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 8, 2014 1:53 UTC (Tue) by dlang (guest, #313) [Link]

There were people pointing this out then, but code talks, and writing a new filesystem is sexy; writing a new management tool to use the existing infrastructure isn't.

As for the new filesystem being deployed everywhere, people are _extremely_ conservative about their storage. Even if btrfs had been finished on the original (insanely optimistic) schedule, it still would just be getting traction now.

Look how reluctant people are to move off of ext3 to ext4/xfs, both of which have much longer track records.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 8, 2014 14:54 UTC (Tue) by Jonno (subscriber, #49613) [Link]

> Right, That's exactly what most people fail to recognise with ZFS - the management interface combined filesystem and device management, but the internal implementation was still very strongly layered in the traditional namespace/block/device layers.

> [...] and that's a mistake we made 6-7 years ago by taking the btrfs path.

BTRFS actually has a strong block device layer internally; it just doesn't provide the same interface as the classic Linux block device layer, just like ZFS has a strong block device layer internally which isn't the same as the classic Solaris block device layer.

The reason for this is quite simple: the classic block device interfaces do not provide everything BTRFS and ZFS need. While it certainly would have been possible to rewrite the old block device code to provide the extended functionality needed, you would have a hard time convincing enterprise users to trust their data to new and unproven block device code when the old and proven version is perfectly capable of hosting their ext3 file system...

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 8, 2014 21:45 UTC (Tue) by dgc (subscriber, #6611) [Link]

> The reason for this is quite simple: the classic block device
> interfaces does not provide everything BTRFS and ZFS need.

That's exactly the point that I made in the presentation that the article was talking about. i.e. the "classic block device" doesn't have the functionality needed to handle SMR properly, and neither do "traditional filesystems". The whole point of the discussion we had was to understand if we could extend the "classic block device" to provide the functionality we need to support SMR devices....

IOWS, we're now in a situation where we need a richer, smarter block device. However, instead of just plugging into the infrastructure we should have put into place for btrfs years ago, we don't have anything we can use. That's the architectural mistake we made with btrfs years ago. Hence now we have to provide a generic "smart block device" despite the fact that btrfs has that functionality internally. We can't use any of the btrfs functionality precisely because it's not generic and is deeply tied to the internal BTRFS structure.

> you would have a hard time convincing enterprise users to trust
> their data to the new and unproven block device code,
> when the old and proven version can is perfectly capable to host
> their ext3 file system...

ext3 will *not work at all* on the host-aware/host-managed SMR drives that vendors will ship. Users will have no choice but to use something other than ext3 on their host-aware/managed SMR drives.

-Dave.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 3, 2014 17:26 UTC (Thu) by ttonino (guest, #4073) [Link]

We need what an SSD does internally: it writes data into 'erase blocks' and, if much of the data in an erase block is deleted, the remaining data is copied and compacted into another block, leaving free space at the end.

This is exactly the problem that needs solving for SMR drives. So the methods should be there (they cannot all be patented).

There is a slight difference: SSDs consist only of erase blocks, but an erase block does not need to be written sequentially. SMR drives, meanwhile, are not fully shingled, but have some random-access sections. The shingled parts, though, need to be written from beginning to end.

Still, there is enough similarity to make me think: solve this on the block layer first.
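
The compaction described here is, in essence, greedy garbage collection: pick the erase block (or zone) holding the least live data, copy what is still live somewhere else, and reclaim the whole block. A minimal sketch, with structures invented purely for illustration:

    /* Sketch of greedy garbage-collection victim selection; the structures
     * are invented for illustration, not taken from any driver or
     * filesystem. */
    #include <stddef.h>
    #include <stdint.h>

    struct erase_block {
        uint32_t live_pages;    /* pages still referenced by the mapping */
        uint32_t total_pages;   /* capacity of the block */
    };

    /* The block with the least live data is the cheapest to copy out and
     * frees a whole block for sequential rewriting. */
    static size_t pick_gc_victim(const struct erase_block *blk, size_t nr)
    {
        size_t victim = 0;

        for (size_t i = 1; i < nr; i++)
            if (blk[i].live_pages < blk[victim].live_pages)
                victim = i;
        return victim;
    }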

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 4, 2014 1:15 UTC (Fri) by PaulWay (guest, #45600) [Link]

It sounds to me like this is actually going to require a completely separate file system:

* One "random access" section (could be small) which contains all the metadata.

* Multiple 'log' sections where data is copied on write.

Remember that you can overwrite a part of a shingled section as long as you rewrite all the data after where you start writing. You can also resume writing from the place you left off with no penalty. It sounds like a drive might have many hundreds of these sections, so picking a 'log' section to write your latest update to your file or block isn't a huge penalty. And you can random-access read from anywhere, shingled or standard.

So each time you want to rewrite something, you create a new entry in the log section, then point to that in the random access section which holds your metadata. File systems can be rebuilt by replaying each of these log sections internally.
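
A rough sketch of the write path that implies, using structures invented to follow this description rather than any real filesystem: append the new copy of the data to an open log zone, then repoint the entry in the random-access metadata section.

    /* Hypothetical sketch of the copy-on-write path described above;
     * nothing here corresponds to real filesystem code. */
    #include <stdint.h>

    struct log_zone {
        uint64_t start;          /* first sector of the zone */
        uint64_t write_pointer;  /* next sector to append to */
    };

    struct metadata_entry {
        uint64_t data_sector;    /* where the current copy of the data lives */
    };

    /* Rewrite one piece of data: it is appended at the zone's write
     * pointer, and the metadata entry in the random-access section is
     * updated to point at the new copy.  After a crash, the log zones can
     * be replayed in order to rebuild the metadata from the newest copies. */
    static void cow_rewrite(struct log_zone *zone, struct metadata_entry *meta,
                            uint64_t sectors_written)
    {
        meta->data_sector = zone->write_pointer;
        zone->write_pointer += sectors_written;
    }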

It seems to me that trying to make some kind of 'interface layer' which allows an SMR drive to be treated as a normal drive - whether it's in the kernel or on the drive - is a great way to kill performance. The methods that are going to work are methods that use the drive's native capabilities to their advantage.

(Likewise, I can imagine a standard extent-based system having a 'compatibility layer' underneath them on the device that maps virtual sectors into real ones, to make a host-managed device seem like it's not shingled. But again I think it's going to immediately start costing a lot of performance.)

Have fun,

Paul

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 12, 2014 2:00 UTC (Sat) by weue (guest, #96562) [Link]

Won't this be handled transparently by a translation layer anyway, just like SSDs?

I mean, the drives have to work without drivers on existing installations of Windows and OSX if they plan to actually sell them.

Ideas for supporting shingled magnetic recording (SMR)

Posted Apr 12, 2014 11:03 UTC (Sat) by khim (subscriber, #9252) [Link]

They only “have to work without drivers on existing installations” if they are to be sold as a replacement for traditional drives. If they are sold as some kind of enterprise tool then drivers are just fine, and the vendors can even provide megabytes of them on the “normal” side (AFAIK the current plan is to provide a small slice of “normal-style” access and a large slice of “dense” access on the same platter).

Ideas for supporting shingled magnetic recording (SMR)

Posted Jun 6, 2014 3:15 UTC (Fri) by Duncan (guest, #6647) [Link]

And... "enterprise only" might work when first introduced, but I don't expect it to remain there long. Consider: the only thing that has kept spinning rust in the game against SSD is constantly increasing capacity and constantly lowering price, in order to keep ahead of SSD both in price-per-gig and raw capacity.

As pointed out elsewhere, current spinning-rust storage density technology is headed toward a plateau where constant increases in density aren't available any longer, certainly not without reversing the declining price-per-gig trend, and shingled technology is where the entire spinning-rust industry is headed to keep that going for a few more generations.

Thus the technology will not stay exclusive to the enterprise: since it's predicted to be a primary driver of further density increases, and thus of continuing lower price-per-gig and higher capacity, shingled storage is likely to fall pretty fast toward the consumer end of the price range and to stay there, crowding out other technologies, which will then be limited to the higher end.

That said, it's still quite possible that such drives will require drivers at the consumer level, with those drivers first supplied by the device manufacturers until they can be included by default in the latest update of the computer platform du jour, just as has been the case for the latest technology du jour for decades now. "Have to work without drivers on existing installations" is thus as much a false flag now as the same requirement was at the 128 MiB, 2 GiB, and 2 TiB boundaries, to pick some relevant (storage) technology examples. The case of SSD FTLs conveniently sidestepping that chicken-and-egg problem is the exception, not the rule, and it was only acceptable because of the much higher speeds of the technology involved.

/That/ said, new "rules" must start /somewhere/ as initial exceptions, and given the precedent and the fact that even slow embedded CPU speeds are generally /so/ much faster than spinning-media transfer speeds these days, it's still possible that some such embedded translation layer will become the standard implementation here, just as it did for SSDs. And with SSDs as the primary competition already requiring such embedded translation layers, it's perhaps even more likely, although the SSD solution's order-of-magnitude speed advantage allowed some tolerance for teething problems as the FTL matured, a tolerance that shingled media may well not get given its much more incremental advance over current technology.

So we will see, and as always it'll be interesting to watch how developments play out. But regardless, I just can't see this remaining restricted to the enterprise realm for long. One way or another, it'll drop to consumer level, as that's what will make or break it, and with it, very likely the continued fortunes of spinning rust technology.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds