
Filesystem support for SMR devices

By Jake Edge
March 18, 2015

LSFMM 2015

Two back-to-back sessions at the 2015 Linux Storage, Filesystem, and Memory Management Summit looked at different attempts to support Linux filesystems on shingled magnetic recording (SMR) devices. In the first, Hannes Reinecke gave a status report on some prototyping he has done to support SMR in Linux. The second was led by Adrian Palmer of Seagate and covered a project to port the ext4 filesystem to host-managed SMR devices.

[Hannes Reinecke]

Reinecke described some prototyping he has done in the block layer to support SMR. Those devices have a number of interesting attributes that require supporting code in the kernel. For example, SMR devices have multiple zones, some of which are normal random-access disk zones, while others must be written to sequentially. He has been looking specifically at supporting host-managed SMR devices, which require that the host never violate the sequential-write restriction in those types of zones.

SMR drives disallow I/O that spans zones, Reinecke said, which means that I/O operations need to be split at those boundaries. The zone layout could have a different size for each of the different zones, though none of the drives currently does that. To support that possibility, though, he used a red-black tree to track all of the zones. The current SMR specification allows for deferred lookup of some of the zone information, so the tree could just be partially filled for devices with lots of irregular zones.
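To make that concrete, a rough userspace model of the bookkeeping involved might look like the following. This is an illustrative sketch, not Reinecke's prototype: the names are invented, and a sorted array with binary search stands in for the kernel's red-black tree.

    /* Illustrative userspace model of SMR zone bookkeeping; the kernel
     * prototype uses a red-black tree, modeled here with a sorted array
     * and a binary search.  All names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQ_WRITE_REQUIRED };

    struct zone {
        uint64_t start;         /* first LBA of the zone */
        uint64_t len;           /* zone length in LBAs */
        enum zone_type type;
        uint64_t write_pointer; /* next writable LBA in a sequential zone */
    };

    /* Zones sorted by start LBA; sizes need not be uniform. */
    static struct zone zones[] = {
        { 0,      65536, ZONE_CONVENTIONAL,        0 },
        { 65536,  65536, ZONE_SEQ_WRITE_REQUIRED,  65536 },
        { 131072, 65536, ZONE_SEQ_WRITE_REQUIRED, 131072 },
    };
    static const int nr_zones = sizeof(zones) / sizeof(zones[0]);

    /* Find the zone containing an LBA (stand-in for the rb-tree lookup). */
    static struct zone *zone_lookup(uint64_t lba)
    {
        int lo = 0, hi = nr_zones - 1;

        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (lba < zones[mid].start)
                hi = mid - 1;
            else if (lba >= zones[mid].start + zones[mid].len)
                lo = mid + 1;
            else
                return &zones[mid];
        }
        return NULL;
    }

    /* SMR drives reject I/O that crosses a zone boundary, so a request is
     * clamped to the end of the zone it starts in; the caller reissues
     * the remainder as a separate request. */
    static uint64_t clamp_to_zone(uint64_t lba, uint64_t nr_blocks)
    {
        struct zone *z = zone_lookup(lba);
        uint64_t room = z ? z->start + z->len - lba : 0;

        return nr_blocks < room ? nr_blocks : room;
    }

    int main(void)
    {
        /* A 1000-block write starting 100 blocks before a zone boundary
         * gets split: 100 blocks now, the rest reissued separately. */
        printf("first chunk: %llu blocks\n",
               (unsigned long long)clamp_to_zone(65436, 1000));
        return 0;
    }

Splitting at the boundary in this way simply means the upper layer reissues the rest of the request against the next zone.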

Ted Ts'o suggested that support for "insane drives" with a variety of zone sizes might belong in a different data structure. That way, the majority of drives, which have a straightforward layout, could have all of that information available in kernel memory. He was concerned that there might be I/O performance degradation from issuing the "report zones" command once the device has been mounted.

There is also a question about "open zones" and the maximum number of open zones. Reinecke said that it is a topic that is still under discussion among the drive makers. From the LSFMM discussion it seems clear that there is no agreement on what an open zone is. Some believe that any partially filled zone qualifies, while to others it means zones that are simultaneously available to write to. In addition, the maximum may range from the four to eight that Martin Petersen has heard to the 128 that the drive makers have proposed.

In fact, someone from one of the storage vendors asked what the kernel developers would like the maximum to be. The reply was, not surprisingly, "all of them". Reinecke said that he is lobbying that "zone control" (maximum number of open zones) be optional and that any I/O that violates the maximum open zones should be allowed, possibly with a performance penalty. Ts'o agreed with that, saying that writing to one more zone than is allowed must not cause an I/O error, though adding some extra latency would be acceptable. Reinecke said that he had hoped to avoid the whole topic of open zones "because it is horrible".

Reinecke then moved back to his prototype work. He noted that sequential writes must be guaranteed. Each sequential zone has its own write pointer, which is where the next write for that zone must land. That "sort of works" using the noop I/O scheduler, since it just merges adjacent writes. If out-of-order writes from multiple tasks are encountered, they can be requeued at the tail of the queue. The queue size must be monitored, he said, since if it never gets smaller, the I/O is making no progress, which should cause an I/O error.
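The requeue-at-the-tail scheme he described might be modeled along these lines. Again, this is an illustrative sketch rather than the prototype code; the requeue limit and helper names are made up.

    /* Illustrative sketch of enforcing per-zone sequential writes by
     * requeueing out-of-order requests at the tail; the names and the
     * requeue limit are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>
    #include <errno.h>

    #define MAX_REQUEUES 16   /* invented "no forward progress" threshold */

    struct seq_zone {
        uint64_t start, len;
        uint64_t write_pointer;   /* next LBA that may be written */
    };

    struct write_req {
        uint64_t lba, nr_blocks;
        int requeues;
    };

    /* Returns 0 if issued, -EAGAIN if the request must go back to the
     * queue tail, -EIO if it has been spinning without progress. */
    static int try_issue(struct seq_zone *z, struct write_req *req)
    {
        if (req->lba != z->write_pointer) {
            if (++req->requeues > MAX_REQUEUES)
                return -EIO;      /* queue never drains: real I/O error */
            return -EAGAIN;       /* requeue at the tail, try again later */
        }
        z->write_pointer += req->nr_blocks;   /* in order: issue and advance */
        req->requeues = 0;
        return 0;
    }

    int main(void)
    {
        struct seq_zone z = { .start = 0, .len = 65536, .write_pointer = 0 };
        struct write_req a = { .lba = 0,  .nr_blocks = 8 };
        struct write_req b = { .lba = 16, .nr_blocks = 8 };   /* arrives early */
        struct write_req c = { .lba = 8,  .nr_blocks = 8 };

        printf("b first: %d (expect -EAGAIN)\n", try_issue(&z, &b));
        printf("a:       %d\n", try_issue(&z, &a));
        printf("c:       %d\n", try_issue(&z, &c));
        printf("b again: %d\n", try_issue(&z, &b));
        return 0;
    }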

But Dave Chinner said that once a filesystem has allocated blocks to different tasks, it must then guarantee an ordering of those writes "all the way down". The only way to do that is to serialize the I/O to the zone once the allocation has been done. Reinecke said that requeueing at the tail can solve that problem, but Chinner said that in a preemptible kernel that won't work. "Sequential I/O is basically synchronous I/O", he said.

There is a philosophical question about whether it makes sense to try to put a regular filesystem on SMR devices, Ts'o said. Chinner said that SMR is really a firmware problem. Actually solving the problems of SMR at the filesystem level is not really possible, he said.

Reinecke wondered if the host-managed SMR drives would actually sell. Petersen piled on, noting that the flash-device makers had made lots of requests for extra code to support their devices, but that eventually all of those requests disappeared when those types of devices didn't sell. Reinecke's conclusion was that it may not make a lot of sense to try to make an existing filesystem work for host-managed SMR drives.

Ext4 on host-managed SMR

On the other hand, though, Palmer is quite interested in doing just that. He works on host-managed drives and is trying to get ext4 working on them.

He started by looking at block groups as a way to track the zones, but ran into a problem with that idea. Zones are 256MB in length, but a 4KB block bitmap only has enough bits to track 128MB worth of blocks, so he would need to use 8KB blocks, which is a sizable change. He also noted that O_DIRECT I/O was going to be a problem for host-managed SMR, without really going into any details.
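The arithmetic behind that limit is straightforward: an ext4 block group's block bitmap fits in a single filesystem block, so with 4KB blocks a block group can span at most 128MB, half a 256MB zone, while 8KB blocks raise the ceiling to 512MB. A back-of-the-envelope check (this is just the arithmetic, not ext4 code):

    /* Back-of-the-envelope check of block-group coverage vs. zone size. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long zone_bytes = 256UL << 20;          /* 256MB SMR zone */
        unsigned long block_sizes[] = { 4096, 8192 };

        for (int i = 0; i < 2; i++) {
            unsigned long bs = block_sizes[i];
            /* one bitmap block: bs bytes * 8 bits, each bit maps one block */
            unsigned long coverage = bs * 8 * bs;
            printf("%luKB blocks: bitmap covers %luMB (zone is %luMB)\n",
                   bs >> 10, coverage >> 20, zone_bytes >> 20);
        }
        return 0;
    }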

As Reinecke said earlier, the order of writes to the disk is critical for host-managed drives. Out-of-order writes may not be written at all. Palmer looked at putting the code to keep write operations sequential into either the I/O scheduler or the block device. For now, the block device seems to be the right place.

Ts'o said that he is mentoring a student who is working on making the ext4 journal writes more SMR-friendly. But Chinner is worried about fsck. A corrupt block in the middle of a sequential zone may need to be rewritten, but it can't be overwritten in place. Ts'o suggested a 256MB read-modify-write with a chuckle.
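The read-modify-write he was joking about would amount to something like the following sketch: read the whole zone into memory, patch the bad block, rewind the zone's write pointer, and stream everything back out in order. This is purely illustrative; reset_write_pointer() is a placeholder for whatever zone-management command the drive actually provides, and offsets here are in bytes.

    /* Very rough sketch of repairing one block inside a 256MB sequential
     * zone by read-modify-write; reset_write_pointer() is a placeholder
     * for the real zone-management command. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ZONE_SIZE   (256UL << 20)
    #define BLOCK_SIZE  4096UL

    static int reset_write_pointer(int fd, off_t zone_start)
    {
        /* Placeholder: a real implementation would issue the drive's
         * zone-reset command here. */
        (void)fd; (void)zone_start;
        return 0;
    }

    int repair_block(int fd, off_t zone_start, off_t bad_block,
                     const void *good_data)
    {
        char *buf = malloc(ZONE_SIZE);
        if (!buf)
            return -1;

        /* 1. Read the whole zone. */
        if (pread(fd, buf, ZONE_SIZE, zone_start) != (ssize_t)ZONE_SIZE)
            goto fail;

        /* 2. Patch the corrupt block in memory. */
        memcpy(buf + (bad_block - zone_start), good_data, BLOCK_SIZE);

        /* 3. Rewind the zone and rewrite it sequentially from the start. */
        if (reset_write_pointer(fd, zone_start) ||
            pwrite(fd, buf, ZONE_SIZE, zone_start) != (ssize_t)ZONE_SIZE)
            goto fail;

        free(buf);
        return 0;
    fail:
        free(buf);
        return -1;
    }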

One attendee noted that the drive makers want to start with host-aware drives (which will perform better with mostly sequential writes to those zones, but will not fail out-of-order writes) to get them working. That will allow the companies to learn from the market how much conventional space (zones without the sequential requirement) and overprovisioning is required.

Chinner suggested that some of that conventional space might be used for metadata sections. Another attendee cautioned that SSD makers are also looking at zoned block devices, so it may be more than just SMR drives that need this kind of support. But Chinner said that the kernel developers had "more than enough" on their hands rewriting filesystems for use on SMR.

Another way to approach the problem, Chinner said, might be to have a new kind of write command for disks (perhaps "write allocate") that would return the logical block address (LBA) where the data was written, rather than getting the LBA from the filesystem or block layers with the write. That way, the drive would decide where to place the data and return that to the operating system. One attendee said that the drive vendors would probably welcome a discussion about what the API to these drives would look like.
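No such command exists today, but as a thought experiment, the interface being floated might look something like the purely hypothetical sketch below, in which the "device" is just a bump allocator that hands back the location it chose; every name and type here is made up for illustration.

    /* Purely hypothetical "write allocate" interface: the device chooses
     * where data lands and reports the LBA back.  Nothing here corresponds
     * to a real command or kernel API; the "device" is a bump allocator. */
    #include <stdio.h>
    #include <stdint.h>

    struct fake_dev {
        uint64_t write_pointer;   /* next free LBA in the current zone */
    };

    /* The caller supplies data; the device picks the location and returns
     * it, so the filesystem records the LBA after the fact instead of
     * choosing it up front. */
    static uint64_t write_allocate(struct fake_dev *dev,
                                   const void *buf, uint32_t nr_blocks)
    {
        uint64_t lba = dev->write_pointer;

        (void)buf;                        /* data would be written here */
        dev->write_pointer += nr_blocks;  /* device-side sequential placement */
        return lba;
    }

    int main(void)
    {
        struct fake_dev dev = { .write_pointer = 65536 };
        char data[4096] = { 0 };

        /* The filesystem learns the location only after the write completes. */
        printf("data landed at LBA %llu\n",
               (unsigned long long)write_allocate(&dev, data, 8));
        printf("next write at LBA %llu\n",
               (unsigned long long)write_allocate(&dev, data, 8));
        return 0;
    }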

There was some discussion on how to proceed with a new command, which would (eventually) need to be handled by the T10 committee (for SCSI interface standards). Petersen (who represents Linux on T10) noted that it is difficult to change the standard. An attendee from one of the drive makers thought it might be possible to prototype the idea to try it out completely separate from the standards process.

That is where the conversation trailed off, but the "write allocate" idea seemed to generate some interest. Whether that translates into action (or standards) remains to be seen.

After the summit, on March 16, Dave Chinner posted a pointer to a design document on supporting XFS on host-aware SMR drives.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]

Index entries for this article
Kernel: Shingled magnetic recording
Conference: Storage, Filesystem, and Memory-Management Summit/2015



Filesystem support for SMR devices

Posted Mar 19, 2015 9:05 UTC (Thu) by ptman (subscriber, #57271) [Link]

There was a recent USENIX paper on drive managed SMR: https://www.usenix.org/system/files/conference/fast15/fas...

Filesystem support for SMR devices

Posted Mar 19, 2015 17:31 UTC (Thu) by zev (subscriber, #88455) [Link]

The proposed "write allocate" command sounds a lot like the "nameless writes" for flash SSDs described in this FAST'12 paper: http://research.cs.wisc.edu/adsl/Publications/namelesswri...

SMRs better as OSDs than block devices?

Posted Mar 21, 2015 7:33 UTC (Sat) by CChittleborough (subscriber, #60775) [Link]

If SMR drives are hard for existing filesystems to handle, maybe they should use a higher-level interface than block I/O? The SCSI object-storage device standard (OSD; see http://lwn.net/Articles/305740/) might do the job, or at least provide a starting point.

In other words, maybe SMR drives should be device-managed instead of host-managed?

Filesystem support for SMR devices

Posted Mar 23, 2015 2:00 UTC (Mon) by naptastic (guest, #60139) [Link]

What about something like logfs, where everything is always sequential anyway? Only use open zones to track pointer updates, and periodically garbage-collect space that's no longer actually in use.

We could write a utility to display garbage collection progress on the screen with little blue and green boxes, like the old defrag tools...

Filesystem support for SMR devices

Posted Mar 29, 2015 21:04 UTC (Sun) by toyotabedzrock (guest, #88005) [Link]

Just a thought: the devs here have a choice that will affect the future of how much power the NSA has.

If you will recall, it was exposed that the NSA is able to replace drive firmware. The power of hard-disk controllers has to go up if the host system doesn't manage these tasks, and once the vendors go that route, everyone will be cut off from seeing their firmware source.

Yeah it sucks but we all know how this will play out.

Filesystem support for SMR devices

Posted May 16, 2015 16:19 UTC (Sat) by jhhaller (guest, #56103) [Link]

Btrfs looks like it would be a good starting point for a native SMR filesystem. It's already log-based, and with the metadata separate from the actual data, it can write to multiple open regions. It's not a perfect match, as reusing zones requires rewriting all of the data in the zone. This would require some changes to the defragmentation algorithms, which are focused on defragmenting tracks, and not necessarily regions, but this is likely an ordering problem. Storing the superblocks is the other challenge, as they are the root of the tree and change frequently. Scanning the disk for the latest superblock on mount is probably not time-effective, although since it's in the metadata, it may not require scanning too much data. Maintaining multiple superblocks is the other challenge, as they need to be in different regions.

The other issue with SMR is that files which are changed in the middle tend to become fragmented with log-based filesystems, as chunks of the file need to be written elsewhere, causing either much head movement to read the file or frequent defragmentation of the file. The latter is quite expensive for large files. Concurrent activity in multiple files also either needs multiple active regions, one per file or group of files, or the files will be interleaved and fragmented.

MARS could help here (for some parts)

Posted Mar 13, 2016 8:01 UTC (Sun) by schoebel (guest, #107651) [Link] (1 response)

Some of the performance problems of SMR drives could be helped by MARS.

MARS turns many random writes into (near-)sequential ones by its memory buffer architecture.

Please look at the appendix slide titled "MARS Light Data Flow Principle" at https://github.com/schoebel/mars/blob/master/docu/MARS_GU... . Also look into mars-manual.pdf in the same directory.

Not only are the transaction logfiles written sequentially; the writeback into the underlying disk is also re-ordered by ascending sector numbers. This type of access pattern comes _close_ to what SMR drives probably want.

Notice that MARS can also be used standalone (without replication of the transaction logfiles), just for the sake of performance.

Example: assume we have 10 MARS resources on a storage server in a datacenter. Then you will get 10 near-sequential logfile write streams, and 1 near-sequential writeback stream which alternates / cycles through the resources. Thus, by design, only about 11 zones will be open.

The actual situation is complicated a little by the following effects. These are only relevant when the /mars filesystem is placed on SMR drives (which can be avoided in datacenter setups):

1) During the MARS log-rotate operation, two transaction logfiles may be open during a short transitional period. Thus the number of open zones could double in the worst case. The number of open zones could be reduced by running these operations more sequentially, resource by resource, instead of in parallel.

2) The current MARS implementation assumes _hardware_ RAID controllers with BBUs. This provides a very fast RAM cache for frequent inode updates caused by fsync(). This pain point should be resolved, either by cooperation with hardware RAID vendors, and/or via modification of MARS (e.g. use of fallocate() for pre-allocation of larger logfile chunks could be feasible in the future).

3) Currently all metadata information is stored in symlinks at the /mars filesystem. Happily their update is not performance-critical. The symlinks will be replaced anyway for kernel upstream submission of MARS. Details are to be discussed.

In order to get the best out of MARS as quickly as possible, I would suggest the following:

A) First of all, don't try to offload performance characteristics into higher layers of the operating system. I support the opinions of Hannes and Dave 100%. Don't introduce new interfaces just for the sake of performance. Performance issues should be solved inside of black boxes whenever possible. Please _emulate_ a classical block-layer interface (preferably in the drive firmware).

B) On that basis, start with a /mars filesystem on dedicated _conventional_ spindles, attached to a hardware RAID controller with BBU. Typically less than 1 TB is sufficient for /mars. Only the _mass_ data (in the range of several hundred TB) should be placed on SMR drives.

C) Cooperate with hardware RAID vendors to optimize RAID parameters like RAID stripe sizes etc for best performance of the internal writeback strategy from the BBU cache to the SMR media. In particular, delayed writeback (which is typically already implemented) could be _tuned_ for better coalescing of SMR zone updates.

D) Hint: look at blkreplay.org for real-life workloads (recorded via blktrace at 1&1 datacenters). They differ vastly from artificial benchmarks; otherwise you might be trapped by wrong assumptions about real workload behaviour. Compare MARS vs non-MARS setups via such real-life workloads.

E) After that, talk with me for improvements of MARS.

Cheers, Thomas

MARS could help here (for some parts)

Posted Mar 19, 2016 6:59 UTC (Sat) by schoebel (guest, #107651) [Link]

I think there are different main trunks to follow:

1) Short term: SMR drives need to be established in their market niche (regardless of how big this niche might be at the beginning).

If SMR drives cannot gain their own market niche, the following trunks would be pointless. So I think this is prio #1.

Here MARS could help at the block layer, needing only minimal modifications itself (quick win).

Altering higher layers will need some time anyway (e.g. major developments at the FS layer are measured in decades nowadays).

2) Medium term: adapt block layer components (not limited to MARS) to the specifics of SMR.

This is easier than 3) because of lower complexity.

Please talk with me, I have some ideas about this.

Best would be a discussion at some Linux conference with all block layer people, changing the title to "Block Layer Support for SMR devices".

3) Long term: FS layer adaptations, as already started the discussion here.

Best done after SMR drives have really established their niche. That would be a much better motivation for long-lasting, tedious work ;)


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds