Preparing for large-sector drives

By Jonathan Corbet
January 29, 2014
Back in the distant past (2010), kernel developers were working on supporting drives with 4KB physical sectors in Linux. That work is long since done, and 4KB-sector drives work seamlessly. Now, though, the demands on the hard drive industry are pushing manufacturers toward the use of sectors larger than 4KB. A recent discussion ahead of the upcoming (late March) Linux Storage, Filesystem and Memory Management Summit suggests that getting Linux to work on such devices may be a rather larger challenge requiring fundamental kernel changes — unless it isn't.

Ric Wheeler started the discussion by proposing that large-sector drives could be a topic of discussion at the Summit. The initial question — when such drives might actually become reality — did not get a definitive answer; drive manufacturers, it seems, are not ready to go public with their plans. Clarity increased when Ted Ts'o revealed a bit of information that he was able to share on the topic:

In the opinion of at least one drive vendor, the pressure for 64k sectors will start increasing (roughly paraphrasing that vendor's engineer, "it's a matter of physics"), and it might not be surprising that in 2 or 3 years, we might start seeing drives with 64k sectors.

Larger sectors would clearly bring some inconvenience to kernel developers, but, since they can help drive manufacturers offer more capacity at lower cost, they seem almost certain to show up at some point.

Do (almost) nothing

One possible response, espoused by James Bottomley, is to do very little in anticipation of these drives. He pointed out that much of the work done to support 4KB-sector drives was not strictly necessary; the drive manufacturers said that 512-byte transfers would not work on such drives, but the reality has turned out to be different. Not all operating systems were able to adapt to the 4KB size, so drives have read-modify-write (RMW) logic built into their firmware to handle smaller transfers properly. So Linux would have worked anyway, albeit with some performance impact.
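
As a rough illustration of what that emulation involves, here is a minimal user-space sketch of servicing a small write on a device that can only transfer whole physical sectors; the 64KB sector size and the device_read()/device_write() helpers are assumptions made purely for illustration, not anything a real drive exposes:

    /* Minimal sketch of read-modify-write emulation: service a write of
     * arbitrary size and alignment on a device that can only transfer whole
     * 64KB physical sectors.  PHYS_SECTOR and the device_read()/device_write()
     * helpers are hypothetical. */
    #include <stdint.h>
    #include <string.h>

    #define PHYS_SECTOR  (64 * 1024)

    int device_read(uint64_t sector, void *buf);        /* reads PHYS_SECTOR bytes  */
    int device_write(uint64_t sector, const void *buf); /* writes PHYS_SECTOR bytes */

    /* Write 'len' bytes at byte offset 'pos'; neither needs to be aligned. */
    int rmw_write(uint64_t pos, const void *data, size_t len)
    {
        static uint8_t sector_buf[PHYS_SECTOR];
        const uint8_t *src = data;

        while (len > 0) {
            uint64_t sector = pos / PHYS_SECTOR;
            size_t   offset = pos % PHYS_SECTOR;
            size_t   chunk  = PHYS_SECTOR - offset;
            if (chunk > len)
                chunk = len;

            if (offset == 0 && chunk == PHYS_SECTOR) {
                /* Fully aligned: no read needed. */
                if (device_write(sector, src))
                    return -1;
            } else {
                /* Partial sector: read the whole sector, patch it, write it back. */
                if (device_read(sector, sector_buf))
                    return -1;
                memcpy(sector_buf + offset, src, chunk);
                if (device_write(sector, sector_buf))
                    return -1;
            }
            pos += chunk;
            src += chunk;
            len -= chunk;
        }
        return 0;
    }

Every partial-sector write costs an extra full-sector read, which is where the performance impact comes from.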

James's point is that the same story is likely to play out with larger sector sizes; even if manufacturers swear that only full-sector transfers will be supported, those drives will still, in the end, have to work with popular operating systems. To do that, they will have to support smaller transfers with RMW. So it comes down to what's needed to perform adequately on those drives. Large transfers will naturally include a number of full-sector chunks, so they will mostly work already; the only partial-sector transfers would be the pieces at either end. Some minor tweaks to align those transfers to the hardware sector boundary would improve the situation, and a bit of higher-level logic could cause most transfers to be sized to match the underlying sector size. So, James said:

I'm asking what can we do with what we currently have? Increasing the transfer size is a way of mitigating the problem with no FS support whatever. Adding alignment to the FS layout algorithm is another. When you've done both of those, I think you're already at the 99% aligned case, which is "do we need to bother any more" territory for me.
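
To make that concrete, any byte-range transfer decomposes into at most three pieces, and only the two at the ends need RMW treatment; a small sketch, again assuming a 64KB hardware sector purely for illustration:

    /* Sketch: decompose a byte-range I/O into the pieces James describes: an
     * unaligned head, a run of whole hardware sectors, and an unaligned tail.
     * The 64KB sector size is an assumption for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define HW_SECTOR (64 * 1024)

    static void split_request(uint64_t pos, uint64_t len)
    {
        uint64_t end           = pos + len;
        uint64_t aligned_start = (pos + HW_SECTOR - 1) / HW_SECTOR * HW_SECTOR;
        uint64_t aligned_end   = end / HW_SECTOR * HW_SECTOR;

        if (aligned_start >= aligned_end) {
            /* The whole request fits inside one hardware sector: all RMW. */
            printf("head (RMW): %llu+%llu\n",
                   (unsigned long long)pos, (unsigned long long)len);
            return;
        }
        if (pos < aligned_start)          /* unaligned head */
            printf("head (RMW): %llu+%llu\n",
                   (unsigned long long)pos,
                   (unsigned long long)(aligned_start - pos));
        printf("middle (full sectors): %llu+%llu\n",   /* no RMW needed */
               (unsigned long long)aligned_start,
               (unsigned long long)(aligned_end - aligned_start));
        if (end > aligned_end)            /* unaligned tail */
            printf("tail (RMW): %llu+%llu\n",
                   (unsigned long long)aligned_end,
                   (unsigned long long)(end - aligned_end));
    }

    int main(void)
    {
        /* A 1MB write starting 4KB into the device: only 60KB at the front
         * and 4KB at the back would need read-modify-write. */
        split_request(4096, 1024 * 1024);
        return 0;
    }

Everything in the middle is already whole sectors; aligning the layout so that the head and tail pieces rarely occur is what gets to the "99% aligned case."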

But Martin Petersen, arguably the developer most on top of what manufacturers are actually doing with their drives, claimed that, while consumer-level drives all support small-sector emulation with RMW, enterprise-grade drives often do not. If the same holds true for larger-sector drives, the 99% solution may not be good enough and more will need to be done.
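
What a given drive claims about itself can be checked from user space; the following minimal sketch uses the BLKSSZGET (logical block size) and BLKPBSZGET (physical block size) ioctls, so a 512-byte-emulated 4KB drive reports 512 and 4096 while a native 4KB drive reports 4096 for both (the /dev/sda default is just an example):

    /* Minimal sketch: print the logical and physical block sizes a block
     * device reports.  /dev/sda is only an example path. */
    #include <fcntl.h>
    #include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";
        int logical = 0;
        unsigned int physical = 0;

        int fd = open(dev, O_RDONLY);
        if (fd < 0) {
            perror(dev);
            return 1;
        }
        if (ioctl(fd, BLKSSZGET, &logical) || ioctl(fd, BLKPBSZGET, &physical)) {
            perror("ioctl");
            close(fd);
            return 1;
        }
        printf("%s: logical %d bytes, physical %u bytes\n", dev, logical, physical);
        close(fd);
        return 0;
    }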

Larger blocks in the kernel

There are many ways in which large sector support could be implemented in the kernel. One possibility, mentioned by Chris Mason, would be to create a mapping layer in the device mapper that would hide the larger sector sizes from the rest of the kernel. This option just moves the RMW work into a low-level kernel layer, though, and does nothing to address the performance issues associated with that extra work.

Avoiding the RMW overhead requires that filesystems know about the larger sector size and use a block size that matches. Most filesystems are nearly ready to do that now; they are generally written with the idea that one filesystem's block size may differ from another. The challenges are, thus, not really at the filesystem level; where things get interesting is with the memory management (MM) subsystem.

The MM code deals with memory in units of pages. On most (but not all) architectures supported by Linux, a page is 4KB of memory. The MM code charged with managing the page cache (which occupies a substantial portion of a system's RAM) assumes that individual pages can easily be moved to and from the filesystems that provide their backing store. So a page fault may just bring in a single 4KB page, without regard for the fact that said page may be embedded within a larger sector on the storage device. If the 4KB page cannot be read independently, the filesystem code must read the whole sector, then copy the desired page into its destination in the page cache. Similarly, the MM code will write pages back to persistent store with no understanding of the other pages that may share the same hardware sector; that could force the filesystem code to reassemble sectors and create surprising results by writing out pages that were not, yet, meant to be written.
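
The read side of that mismatch might look something like the following sketch, which assumes a 64KB hardware sector, a 4KB page, and a hypothetical read_hw_sector() helper; it is meant only to show where the extra transfer and copy come from, not how any particular filesystem would implement it:

    /* Sketch: filling a single 4KB page-cache page from a device with 64KB
     * hardware sectors.  The sizes and read_hw_sector() are assumptions made
     * for illustration. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SZ    4096
    #define SECTOR_SZ  (64 * 1024)

    int read_hw_sector(uint64_t sector, void *buf);   /* hypothetical whole-sector read */

    /* Fill the 4KB page holding byte offset 'pos' with data from the device. */
    int fill_page(uint64_t pos, uint8_t page[PAGE_SZ])
    {
        static uint8_t bounce[SECTOR_SZ];             /* whole-sector bounce buffer */
        uint64_t sector     = pos / SECTOR_SZ;
        size_t   page_index = (pos % SECTOR_SZ) / PAGE_SZ;

        /* The device cannot hand over just one page: read the whole 64KB... */
        if (read_hw_sector(sector, bounce))
            return -1;

        /* ...then copy out the one page that was actually asked for. */
        memcpy(page, bounce + page_index * PAGE_SZ, PAGE_SZ);
        return 0;
    }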

Avoiding these problems almost certainly means teaching the MM code to manage pages in larger chunks. There have been some attempts to do so over the years; consider, for example, Christoph Lameter's large block patch set that was covered here back in 2007. This patch set enabled variable-sized chunks in the page cache, with anything larger than the native page size being stored in compound pages. And that is where it ran into trouble.

Compound pages are created by grouping together a suitable number of physically contiguous pages. These "higher-order" pages have always been risky for any kernel subsystem to rely on; the normal operation of the system tends to fragment memory over time, making such pages hard to find. Any code that allocates higher-order pages must be prepared for those allocations to fail; reducing the reliability of the page cache in this way was not seen as desirable. So this patch set never was seriously considered for merging.
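
The shape of the problem is visible in the allocation interface itself. In this kernel-style sketch (the policy around it is hypothetical), a 64KB page-cache unit on a system with 4KB base pages is an order-4 allocation, and the caller has to be prepared for it to fail:

    /* Kernel-style sketch: a physically contiguous 64KB compound page on a
     * 4KB-page system is an order-4 allocation, and such allocations are
     * allowed to fail when memory is fragmented.  The fallback policy shown
     * here is hypothetical, not anything the page cache does today. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define LARGE_BLOCK_ORDER 4    /* 2^4 * 4KB = 64KB */

    static struct page *alloc_large_block(void)
    {
        struct page *page;

        /* __GFP_COMP makes the result a compound page; __GFP_NORETRY keeps
         * the allocator from trying too hard when memory is fragmented. */
        page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NORETRY,
                           LARGE_BLOCK_ORDER);
        if (page)
            return page;

        /*
         * No contiguous 64KB chunk is available.  A real implementation would
         * have to fall back to something: single pages plus RMW, dropping
         * clean cache, or simply failing, because nothing guarantees that
         * reclaim can ever produce a higher-order page.
         */
        return NULL;
    }

    static void free_large_block(struct page *page)
    {
        __free_pages(page, LARGE_BLOCK_ORDER);
    }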

Nick Piggin's fsblock work, also started in 2007, had a different goal: the elimination of the "buffer head" structure. It also enabled the use of larger blocks when passing requests to filesystems, but at a significant cost: all filesystems would have had to be modified to use an entirely different API. Fsblock also needed higher-order pages, and the patch set was, in general, large and intimidating. So it didn't get very far, even before Nick disappeared from the development community.

One might argue that these approaches should be revisited now. The introduction of transparent huge pages, memory compaction, and more, along with larger memory sizes in general, has made higher-order allocations much more reliable than they once were. But, as Mel Gorman explained, relying on higher-order allocations for critical parts of the kernel is still problematic. If the system is entirely out of memory, it can push some pages out to disk or, if really desperate, start killing processes; that work is guaranteed to make a number of single pages available. But there is nothing the kernel can do to guarantee that it can free up a higher-order page. Any kernel functionality that depends on obtaining such pages could be put out of service indefinitely by the wrong workload.

Avoiding higher-order allocations

Most Linux users, if asked, would not place "page cache plagued by out-of-memory errors" near the top of their list of desired kernel features, even if it comes with support for large-sector drives. So it would seem that any scheme based on being able to allocate physically contiguous chunks of memory larger than the base allocation size used by the MM code is not going to get very far. The alternatives, though, are not without their difficulties.

One possibility would be to move to the use of virtually contiguous pages in the page cache. These large pages would still be composed of a multitude of 4KB pages, but those pages could be spread out in memory; page-table entries would then be used to make them look contiguous to the rest of the kernel. This approach has special challenges on 32-bit systems, where there is little address space available for this kind of mapping, but 64-bit systems would not have that problem. All systems, though, would have the problem that these virtual pages are still groups of small pages behind the mapping. So there would still be a fair amount of overhead involved in setting up the page tables, creating scatter/gather lists for I/O operations, and more. The consensus seems to be that the approach could be workable, but that the extra costs would reduce any performance benefits considerably.
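
A kernel-style sketch of that approach (the policy around it is purely hypothetical; only vmap() and vunmap() are real interfaces): allocate sixteen scattered 4KB pages, then stitch them into one contiguous kernel virtual range:

    /* Kernel-style sketch: build a virtually contiguous 64KB "page" out of
     * sixteen scattered 4KB pages. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    #define NR_SUBPAGES 16   /* 16 * 4KB = 64KB */

    static void *map_virtual_block(struct page *pages[NR_SUBPAGES])
    {
        void *addr;
        int i;

        /* Single 4KB allocations: these can always be satisfied by reclaim. */
        for (i = 0; i < NR_SUBPAGES; i++) {
            pages[i] = alloc_page(GFP_KERNEL);
            if (!pages[i])
                goto fail;
        }

        /* Stitch the scattered pages into one contiguous virtual range.  This
         * is where the extra cost lives: page-table setup here, plus
         * per-4KB-page scatter/gather entries later when doing I/O. */
        addr = vmap(pages, NR_SUBPAGES, VM_MAP, PAGE_KERNEL);
        if (!addr)
            goto fail;
        return addr;

    fail:
        while (i--)
            __free_page(pages[i]);
        return NULL;
    }

    static void unmap_virtual_block(void *addr, struct page *pages[NR_SUBPAGES])
    {
        int i;

        vunmap(addr);
        for (i = 0; i < NR_SUBPAGES; i++)
            __free_page(pages[i]);
    }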

Another possibility is to increase the size of the base unit of memory allocation in the MM layer. In the early days, when a well-provisioned Linux system had 4MB of memory, the page size was 4KB. Now that memory sizes have grown by three orders of magnitude — or more — the page size is still 4KB. So Linux systems are managing far more pages than they used to, with a corresponding increase in overhead. Memory sizes continue to increase, so this overhead will increase too. And, as Ted pointed out in a different discussion late last year, persistent memory technologies on the horizon have the potential to expand memory sizes even more.
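
To put rough numbers on that overhead (taking struct page to be on the order of 64 bytes, a figure that varies with configuration): a machine with 16GB of RAM and 4KB pages manages about 4.2 million pages, or roughly 256MB of struct page metadata to allocate, initialize, and traverse; with a 64KB base page the same machine would have about 260,000 pages and something like 16MB of metadata.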

So there are good reasons to increase the base page size in Linux even in the absence of large-sector drives. As Mel put it, "It would get more than just the storage gains though. Some of the scalability problems that deal with massive amount of struct pages may magically go away if the base unit of allocation and management changes." There is only one tiny little problem with this solution: implementing it would be a huge and painful exercise. There have been attempts to implement "page clustering" in the kernel in the past, but none have gotten close to being ready to merge. Linus has also been somewhat hostile to the concept of increasing the base page size in the past, fearing the memory waste caused by internal fragmentation.

A number of unpleasant options

In the end, Mel described the available options in this way:

So far on the table is
  1. major filesystem overhaul
  2. major vm overhaul
  3. use compound pages as they are today and hope it does not go completely to hell, reboot when it does

With that set of alternatives to choose from, it is not surprising that none have, thus far, developed an enthusiastic following. It seems likely that all of this could lead to a most interesting discussion at the Summit in March. Even if large-sector drives could be supported without taking any of the above options, chances are that, sooner or later, the "major VM overhaul" option is going to require serious consideration. It may mostly be a matter of when somebody feels the pain badly enough to be willing to try to push through a solution.



Preparing for large-sector drives

Posted Jan 30, 2014 3:23 UTC (Thu) by dlang (subscriber, #313) [Link]

did anyone ever identify a specific drive model that doesn't support 512 byte sectors?

Preparing for large-sector drives

Posted Jan 30, 2014 4:32 UTC (Thu) by magila (subscriber, #49627) [Link]

Even if there aren't any in the wild right now they are definitely coming in the not-too-distant future. Enterprise customers are hungry for high density storage but they also don't like the overhead of 512 byte sector emulation. Because of this a lot of enterprise drives are still using 512 byte physical sectors, but that's not going to last with the push for 6TB+ drives. Given the choice between 512 byte emulation and 4K native at least some tier 1 customers are going to go with the latter.

Preparing for large-sector drives

Posted Jan 30, 2014 4:42 UTC (Thu) by dlang (subscriber, #313) [Link]

the drive vendors claimed that back in 2010 as well, but it didn't happen in practice then either.

I think it makes a huge difference if someone has actually shipped such a drive long enough to verify that people are actually using them (as opposed to something just introduced that may end up getting pulled if there are massive numbers of returns)

I expect that the vast majority of enterprise disk I/O is RMW anyway, at the raid stripe level.

is there a list of RAID controllers that support 4K sectors in their firmware? (and therefore I would presume do the RMW in the card firmware)

Preparing for large-sector drives

Posted Feb 2, 2014 3:38 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

Enterprise customers are hungry for high density storage but they also don't like the overhead of 512 byte sector emulation.

What is the overhead of 512-byte sector emulation where the client does only 4K-aligned read and write commands?

Preparing for large-sector drives

Posted Feb 2, 2014 16:06 UTC (Sun) by magila (subscriber, #49627) [Link]

The overhead for aligned access is negligible but they still don't like the idea that performance drops off a cliff if they do unaligned access for whatever reason. When you're paying $500+ a drive and buying them by the pallet you get to be picky about this sort of thing.

Preparing for large-sector drives

Posted Feb 2, 2014 18:18 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

OK, but I believe the question was whether we should expect to see drives that are incapable of doing 512-byte granularity. So far, I don't see any upside for a drive manufacturer in producing such a drive. It's hard to imagine any user preferring that Linux refuse to use the drive at all over having the drive go really slowly when the user neglects to align.

The reason the question is relevant is that the article points out some circumstances in which a drive that has decent performance only on 4K-aligned transfers, but still works with unaligned transfers, would be acceptable. And would work with existing Linux.

Preparing for large-sector drives

Posted Feb 2, 2014 18:35 UTC (Sun) by magila (subscriber, #49627) [Link]

The alternative to 512 byte sector emulation is native 4K sectors. That is, the sector size on the host interface is 4K. That way it is absolutely impossible to read or write less than 4K. In theory the host controller could still emulate 512 byte sectors, but to my knowledge nobody does this. The OS on such systems needs to handle not being able to generate commands at a granularity less than 4K.

Preparing for large-sector drives

Posted Feb 2, 2014 18:45 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

Right, and the question is, should we expect to see drives that have a 4K sector size on the host interface.

Some commenters have said they have seen such a drive, though couldn't identify the model. Others have said there is no reason for such a drive to exist, so doubt that it will be common.

Preparing for large-sector drives

Posted Feb 6, 2014 6:11 UTC (Thu) by kevinm (guest, #69913) [Link]

If you want transfers of unaligned access to fail rather than succeed slowly you could accomplish that entirely in software at the block layer.

Preparing for large-sector drives

Posted Feb 6, 2014 18:18 UTC (Thu) by giraffedata (subscriber, #1954) [Link]

If you want transfers of unaligned access to fail rather than succeed slowly you could accomplish that entirely in software at the block layer.

But not without changing your software. The decisions about what function to put in a new disk drive product are based on the concept that it is easier to change hardware than to change software. Seagate wants to be able to ship a disk drive to a customer, have the customer take it out of the box and insert it somewhere, and have the benefits of the new technology. If the customer must also (or instead) update his OS, that is a major impediment to deployment.

So if, hypothetically, users want their programs that do unaligned access to fail instead of run slowly, there is significant value in making that happen by having the drive reject a command instead of by having the kernel refuse to send it.

Incidentally, one of these hypothetical 4K-only disk drives wouldn't actually fail on unaligned access. The article doesn't say what the failure mode would be, but it seems to imply that Linux would refuse to use the drive at all. The way the client-drive protocol works is the drive tells the client its logical block size and all disk addresses in the protocol are in terms of that unit. It sounds like Linux today expects disk addresses to be in 512-byte units, so it would refuse to use the drive once it finds out the drive expects different units.

(I have seen recent Linux refuse to use a 520-byte-block drive, so it seems reasonable to guess that it would reject 4K the same way).

Preparing for large-sector drives

Posted Feb 12, 2014 12:03 UTC (Wed) by etienne (guest, #25256) [Link]

I think you can use 2048 bytes/sectors on Linux without problem, I once created a VFAT filesystem on a DVD-Ram with read/write access.
Booting from that was more of a problem: the DVD-Ram drive would not autodetect which disk was inserted and reply to an ATAPI read_sector transparently; you have to tell the DVD drive which kind of DVD is inserted... Does anyone have the name of a DVD-Ram drive which autodetects at boot which disk is present and autoconfigures itself?
Also, under Linux, blocks are 512 bytes long even on such a 2048 bytes/sector device.

Preparing for large-sector drives

Posted Feb 12, 2014 18:17 UTC (Wed) by giraffedata (subscriber, #1954) [Link]

I think you can use 2048 bytes/sectors on Linux without problem, I once created a VFAT filesystem on a DVD-Ram with read/write access.

Isn't that because Linux used the SCSI CD-ROM (etc) driver to drive your device, and that driver is OK with 2048 bytes/sector and knows how to convert it into 512 as seen by the rest of Linux? As even the first CD-ROM devices had 2048 byte sectors, that's not surprising.

The present discussion, on the other hand, is about devices for which Linux uses the SCSI disk drive driver, which until recently wouldn't have had any reason to tolerate sector sizes other than 512.

Preparing for large-sector drives

Posted Feb 13, 2014 10:45 UTC (Thu) by etienne (guest, #25256) [Link]

Maybe, but I think fdisk and most standard tools were reporting 2048 bytes/sectors (so stuff above the SCSI driver was able to handle > 512 bytes/sectors).
Obviously, if you cannot fit a whole number of sectors into the x86 4 Kbyte pages you will have problems; the page cache cannot work (a problem with 520 bytes/sector drives when the extra 8 bytes are not used as a CRC by the driver itself and hidden from Linux, or with audio 2352 bytes/sector CD-ROMs).
Note that VFAT is one of the few filesystems made without assuming 512 bytes/sector; for ext*fs the filesystem descriptor is at an offset of 1024 for now - either sector No 2 or in the middle of a 4 Kbyte sector...

Preparing for large-sector drives

Posted Feb 1, 2014 7:46 UTC (Sat) by ricwheeler (subscriber, #4980) [Link]

There are SAS drives - mostly used in enterprise storage - that only do 4K sectors.

The kernel works just fine with them as does UEFI firmware and boot loaders.

I don't recall which ones we have tested directly at Red Hat, but they do exist.

In fact, SMR drives have been shipping for more than a year as well, but they are the ones that use an FTL-like approach to hide the drive's nature from users.

Premature implementation

Posted Jan 30, 2014 8:09 UTC (Thu) by Felix.Braun (subscriber, #3032) [Link]

Isn't there an accepted corollary of ye olde adage "thou shalt not optimise prematurely" that extends this sentiment to premature implementations?

Premature implementation

Posted Jan 30, 2014 9:31 UTC (Thu) by cwillu (guest, #67268) [Link]

Not really: "Premature optimization is the root of all evil" is a quote taken out of context, and when it's clear that a particular direction is being taken by the hardware, I tend to trust the usual suspects when they say it's time to start thinking about the problems that direction may introduce.

Preparing for large-sector drives

Posted Jan 30, 2014 18:07 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Implementing read-modify-write in a low level of the kernel does have a performance advantage over letting the drive do it, if the kernel caches the surrounding blocks. That may mean that the next 4K write doesn't require another read, just a modify-rewrite of the new and surrounding unchanged data.

Won't help for truly random 4K writes to a disk device, but will help in common cases where writes are clustered (data sequentially to extents, updates to heavily accessed index areas, etc.)

Of course a drive can also do such caching, except it's probably got a lot less RAM to play with than the actual system does.

Preparing for large-sector drives

Posted Jan 30, 2014 21:09 UTC (Thu) by marcH (subscriber, #57642) [Link]

Is there any overlap with this problem?

Optimizing Linux with cheap flash drives
http://lwn.net/Articles/428584/

Preparing for large-sector drives

Posted Jan 31, 2014 17:29 UTC (Fri) by Jonno (subscriber, #49613) [Link]

All modern CPU architectures support hugepages of one sort or another, and the core kernel already supports compile-time selection of page size (to work with different architectures using different page sizes), so wouldn't it be simpler to just use the smallest hardware-supported hugepage as the kernel page size?

This would require some changes to the arch-specific code, but should require no changes to the core kernel, mm subsystem, vfs subsystem or file system code, so I would think it would be comparatively easy while "fixing" several problems at once (>4k disk block size, memory fragmentation, struct page array size, etc).

Using huge pages

Posted Jan 31, 2014 18:03 UTC (Fri) by corbet (editor, #1) [Link]

The smallest huge page size is usually 2MB or 4MB. That's awfully big to use as the base page size, even in contemporary systems.

Using huge pages

Posted Jan 31, 2014 18:17 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

AFAIR the TLB is also separate for hugepages, so you'd throw a fair bit of performance-relevant cache out of the window.

Using huge pages

Posted Feb 1, 2014 21:49 UTC (Sat) by james (subscriber, #1325) [Link]

This depends on the processor: some have TLBs that can store either huge pages or normal pages in each entry, while others (as you say) have only a few huge page TLB entries.

Obviously, each huge page TLB entry covers a lot more memory, so performance in certain workloads could improve.

Using huge pages

Posted Feb 7, 2014 15:44 UTC (Fri) by quanstro (guest, #77996) [Link]

Well, that's x86-specific. Many architectures, such as POWER, give more sensible options like 64k; the large x86 sizes are an unfortunate side-effect of making the page tables efficient.

When evaluating just this problem for a different operating system, two different strategies were tried for x86: 2M pages everywhere, and 4K, 2M and 1G pages as required and available. The first approach used way too much memory; imagine needing to clear a 2M stack segment. The second approach was pretty good for user space, using a scheme much like direct, indirect and double-indirect blocks: it doesn't waste too much memory on small processes, and doesn't waste too many page structures on large ones. That system did not use page-based allocation in the kernel, but since Linux does, 64k virtual pages might be the path of least resistance.

Preparing for large-sector drives

Posted Feb 4, 2014 9:50 UTC (Tue) by ikm (subscriber, #493) [Link]

> [..] drives have read-modify-write (RMW) logic built into their firmware [..]. So Linux would have worked anyway, albeit with some performance impact.

Once upon a time I bought a couple of those green 4k drives, and the performance was truly atrocious - a Debian install would take twice as long as it would on normal drives. I triple-checked the alignment of my partitions, and everything was by the book. The setup was an encrypted soft RAID1. I never managed to understand what the problem was exactly, and just went back to the store to exchange those drives for ones with normal 512-byte sectors. Since then I avoid those 4k drives like the plague. The problem is, since this whole RMW thing happens completely transparently, it's quite hard to debug performance problems - you can't just disable RMW and see where it all breaks. Instead, you just get horrible performance with no clue on how to improve it.

4K-native hard drives would not have this problem. However, Wikipedia claims that there are still no 4K-native hard drives on the market (as of October 2013). Hopefully a day will come when they are out there and BIOSes, bootloaders and kernels all support them. Until then, I'm staying away from this whole 4K phenomenon.

Preparing for large-sector drives

Posted Feb 6, 2014 18:42 UTC (Thu) by stevem (subscriber, #1512) [Link]

Ummm. In my server at home right now I have several drives that *claim* to be native 4K (see the 3 at the bottom):

# for i in a b c d e f g h i j k; do sg_inq /dev/sd$i | grep "Product identification" ; sg_readcap /dev/sd$i | grep -e "Logical block length" -e "Logical blocks per" ; done
Product identification: WDC WD5000AAKS-2
Logical block length=512 bytes
Product identification: Hitachi HDS5C302
Logical block length=512 bytes
Product identification: ST2000DM001-9YN1
Logical block length=512 bytes
Product identification: ST2000DM001-9YN1
Logical block length=512 bytes
Product identification: SAMSUNG HD103UJ
Logical block length=512 bytes
Product identification: Hitachi HDP72505
Logical block length=512 bytes
Product identification: SAMSUNG HD103UJ
Logical block length=512 bytes
Product identification: ST3500641AS
Logical block length=512 bytes
Product identification: HGST HDS724040AL
Logical block length=512 bytes
Logical blocks per physical block exponent=3
Product identification: WDC WD40EFRX-68W
Logical block length=512 bytes
Logical blocks per physical block exponent=3
Product identification: WDC WD40EFRX-68W
Logical block length=512 bytes
Logical blocks per physical block exponent=3

Preparing for large-sector drives

Posted Feb 7, 2014 2:23 UTC (Fri) by magila (subscriber, #49627) [Link]

That is not 4k native. 4K native would be if the Logical block length was 4096 bytes.

