Linux and 4K disk sectors

March 11, 2009

This article was contributed by Goldwyn Rodrigues

As storage devices become bigger and bigger in capacity, the areal density (number of bits packed per physical square inch) increases; hard drives are now hitting the limits. Hard drive manufacturers are now pushing to increase the basic unit of data transfer in hard drives - physical sector size - from 512 bytes to 4096 bytes (or 4KB) to improve storage efficiency and performance. However, there are a lot of subsystems affected by this change that are currently not ready to accept a 4K sector size.

The first hard drive, the RAMAC, was shipped on September 13, 1956. It weighed 2,140 pounds and held a total of 5 megabytes (MB) of data on fifty 24-inch platters. It was available for lease for $35,000 USD, the equivalent of approximately $300,000 in today's dollars.

We have come a long way since then. Hard drive capacities are now measured in terabytes, but some legacy parameters, such as the sector size, have remained unchanged. The sector size is wired into a lot of data structures in the kernel, for example, the i_blocks field of struct inode stores the number of 512-byte physical blocks it occupies on the media. Even though the core kernel deals with 512-byte sectors, the block layer is capable of handling hardware with different length sector sizes.

Why the Change?

Any sort of data communication must contend with noise. This noise is also present during the data transfer from the magnetic surface of the physical hard drive platter to the head of the hard drive. Noise can be introduced by physical defects on the hard drive platter. Noise such as this is measured with respect to the signal strength, more commonly known as Signal to Noise Ratio (SNR). As disk drive areal density increases, the signal to noise ratio decreases, thereby creating increased sensitivity to defects.

Hard Disk Drives have special reserved bits in addition to the packed data, called the Error-Correcting Code (ECC) bits. Each physical data byte sector block is followed by, besides other bytes, the ECC bytes on the physical medium. ECC is responsible for the reliability of the data transferred. Usually the Reed-Solomon Algorithm is used to compute the ECC bits; to detect and to a certain extent, correct the errors read; it is an efficient algorithm to correct errors which come in bursts. The ECC bits are placed immediately after the data bytes (as shown in the diagram below), so the error, if any, can be corrected as the disk spins. Besides the ECC, the disk also has bits reserved before the data bits, for the preamble, data sync mark; and the Inter Sector Gap (ISG) after the ECC bits.

With the increase in areal density, more bits are packed in a square inch of physical surface. A physical defect of, say 100 nanometers, would require more ECC bits to correct than is needed at lower densities. The physical defect induces more noise than signal hence the SNR decreases. This requires more bytes packed in ECC fields of the sector to compensate for the decrease in SNR and ensure the reliability of the data stored on the disk. For example: on disks with a density of 215 kbpi (kilo bytes per square inch), a 512-byte data sector requires 24 bytes of ECC; a format efficiency (number of user data bytes vs total number of bytes on disk) of 92%. With an increase of areal density to 750 kbpi, each 512-byte sector requires 40 bytes per sector to achieve the same level of disk reliability. The format efficiency of such a drive is 87%.

A sector size of 4096 bytes requires 100 bytes for ECC to maintain the same level of reliability at an areal density of 750kbpi; that yields a format efficiency of 96%. As areal densities in disk drives continue to increase, the physical size of each sector on the surface of the disk become smaller. If the mean size and number of disk defects and scratches does not scale at the same rate, then we expect more sectors to be corrupted, and we expect the resulting burst errors to more easily exceed the error correction capability of each sector. Having larger sectors, would enable such burst errors to be detected for larger sectors, hence decreasing the total ECC overhead. Besides the ECC, the disk also has bits reserved before the data bits, for the preamble, data sync mark, and the Inter Sector Gap (ISG). Increasing the sector size to 4K from 512 bytes, would decrease the occurrences of these fields, thus improving the format efficiency further.

For all of these reasons, the storage industry wants to move to larger sector sizes. The IDEMA International Disk Drive Equipment and Materials Association (IDEMA) was formed to increase co-operation among competing hard drive brands. IDEMA is responsible for the smooth transition of sector size from 512 bytes to 4Kbytes. Also, bigsector.org was set up to maintain documentation of the transition. The documentation section of bigsector.org contains more information about the transition.

Transition

This change affects a lot of areas in the storage system chain: from the drive interface, the host interface, BIOS, OS to applications such as partition managers. A change affecting so many subsystems might not be readily acceptable to the market. To make a smooth transition, the following stages are planned:

512 byte logical with 512 byte physical. This is the current state of hard drives
512-byte logical with 4096-byte physical sector size. This would facilitate a smooth transition from 512-byte to 4096-byte sector sizes.
4096-byte logical with 4096-byte physical sectors. This would be done once all hardware and software would be aware of the underlying change and geometry with respect to sector size. This change would first be seen in SCSI devices and later in ATA devices.

During the transition phase (step 2), drives are planned to use 512 byte emulation, known as read-modify write (RMW). Read-modify-write is a technique used to emulate 512-byte sector size over a 4K physical sector size. Written data which does not correspond to full 4K sectors would result in the drive first reading the existing 4K sector, modifying the part of data which changed, and writing the 4K sector data back to the drive. More information on RMW and its implementation can be found in this set of slides. Needless to say, RMW decreases the throughput of the device, though the shorter ECC will compensate by giving an overall better performance (hopefully). Such drives are expected to be commercially available in the first quarter of 2011.

Matthew Wilcox recently posted a patch to support 4K sectors according to the ATA-8 standard (PDF). The patch adds an interface function by the name sector_size_supported(). Individual drivers are required to implement this function and return the sector size used by the hardware. The size returned is stored in the sect_size field of the ata_device structure. This function returns 512 if the device does not recognize the ATA-8 command, or the driver does not implement the interface. The sect_size is used instead of ATA_SECT_SIZE when the data transfer is a multiple of 512-byte sectors.

The partitioning system and the bootloader will also require changes because they rely on the fact that partitions start from the 63rd sector of the drive, which is misaligned with the 4K sector boundary. This problem will be solved, in the short term, by using the 4K physical - 512 byte logical drives. The 512-byte sectors are aligned in a way that the 1st logical sector starts from the 1st octant of the physical 1st 4K sector, as shown below.

This scheme to coincide the logical and physical sectors to optimize data storage and transfer is known as odd-aligned physical/logical sectors. It can lead to other problems though: odd-aligned sectors might misalign the data with respect to filesystem blocks. Assuming a 4K page size, a random read would require two 4K sector reads. This is the reason, applications such as bootloaders and partitioning systems should be ready for 4K sector size hard drives (step 3), for overall throughput efficiency.

An increased sector size is required by hard drives to break the current limits of hard drive capacity while minimizing the overhead of error checking data. However, a smooth transition will decide the acceptability of these drives in the market. The previous transition, which broke the 8.4GB limit using Large Block Access (LBA), was easily accepted. However, with so many drives in use currently, the transition would be determined by the co-operation of various subsystems of the data supply chain, such as filesystems and applications dealing with hard drives.

Index entries for this article
Kernel	Block layer/Large physical sectors
Kernel	Device drivers/Block drivers
GuestArticles	Rodrigues, Goldwyn

Linux and 4K disk sectors

Posted Mar 12, 2009 1:55 UTC (Thu) by jimparis (guest, #38647) [Link] (3 responses)

The odd alignment trick only works if the partition tables are laid out that way -- for a while now, Windows has been starting the partition table at a 1MB boundary instead. H. Peter Anvin says "this is a disaster". Check out that whole thread from last month for more info.

Linux and 4K disk sectors

Posted Mar 14, 2009 0:30 UTC (Sat) by giraffedata (guest, #1954) [Link] (2 responses)

And in big systems, disks often don't use DOS-partitioning, with some modern space manager taking the whole disk and allocating it at normal alignments. Most of the disks I use are that way. I guess those would fare poorly with odd alignment too.

Will the user get to choose between odd alignment and natural alignment? Or is the whole world going to be tuned for personal computers?

Linux and 4K disk sectors

Posted Mar 15, 2009 14:54 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

We're going to expose to userspace what the alignment is of this particular drive. Programs which lay out partitions (or partition-like things) will have to become aware of this problems. Filesystems should also care, so they don't cause the drive to do RMW.

Your average user-space program (be it 'cat' or openoffice) should let the filesystem do what it does best and ignore the underlying drive issues.

Linux and 4K disk sectors

Posted Mar 15, 2009 17:52 UTC (Sun) by giraffedata (guest, #1954) [Link]

Your average user-space program (be it 'cat' or openoffice) should let the filesystem do what it does best and ignore the underlying drive issues.

That's good, though the filesystem driver should also let the device driver do what it does best and ignore the underlying drive implementation. We learned a long time ago that there's value in having a generic block device interface, e.g. such that a filesystem driver doesn't concern itself with tracks and cylinders.

But I'm inferring from your silence that there will not be a way for the user to choose the alignment in the drive. That seems disastrous, since it means that at the very least everyone will need new device drivers to use the new drives with good performance (or maybe the RMW won't really be that noticeable?).

It would be a whole lot easier if the user could just get a special program to set the drive to the alignment his application requires and then use the drive with existing systems. I'd even say jumper-selectable, but I hear jumpers cost a fortune.

4K, why not also 64K?

Posted Mar 12, 2009 13:44 UTC (Thu) by zmi (guest, #4829) [Link] (8 responses)

Why only 4K? Seeing the speed at which disk sizes grow, they should define
a 64K block size already. Even if it's not used in the beginning, that
should make it clear to software developers that it's time to rethink
interfaces. 64k blocks are used in RAIDs since years, and would only be a
natural thing to do in hardware. Maybe not now, but in 5 years, when there
are 12TB hard disks, we might be happy about it. Especially as size grows
quicker than speed of disks.

I always wondered why bigger sectors have never been used, where "always"
is the time when the 2K sector CD-Roms arrived. That's like "forever" in IT
anyway.

zmi

4K, why not also 64K?

Posted Mar 12, 2009 15:30 UTC (Thu) by BenHutchings (subscriber, #37955) [Link] (7 responses)

The x86 basic page size is fixed at 4K, and paging to disks with a larger block size is problematic. However, the ATA-8 standard does allow for larger block sizes, and Linux will probably adapt to this at some point.

4K, why not also 64K?

Posted Mar 12, 2009 15:40 UTC (Thu) by clugstj (subscriber, #4020) [Link] (6 responses)

So, x86 brain damage will constrain us for at least another generation?

4K, why not also 64K?

Posted Mar 12, 2009 17:49 UTC (Thu) by james (subscriber, #1325) [Link] (5 responses)

4K page sizes are not necessarily brain damage. They're a tradeoff: with 64K pages, you may get sixteen times more memory in the same size TLBs, but you lose a lot of memory if you're dealing with a lot of small datastructures that have to be page-aligned -- mmap for example.

Linus Torvalds has a classic rant on the subject at realworldtech.com (the rant starts a page into his post):

I've actually done the math. Even 64kB pages is totally useless for a lot of file system access stuff: you need to do memory management granularity on a smaller basic size, because otherwise you just waste all your memory on unused left-over-space.

and

So reasonable page sizes range from 4kB to 16kB (and quite frankly, 16kB is pushing it - exactly because it has fragmentation issues that blow up memory usage by a huge amount on some loads). Anything bigger than that is no longer useful for general-purpose file access through mmap, for example.

and

For a particular project I care about, if I were to use a cache granularity of 4kB, I get about 20% lossage due to memory fragmentation as compared to using a 1kB allocation size, but hey, that's still ok. For 8kB blocks, it's another 21% memory fragmentation cost on top of the 4kB case. For 16kB, about half the memory is wasted on fragmentation. For 64kB block sizes, the project that takes 280MB to cache in 4kB blocks, now takes 1.4GB!

James.

Linus Torvalds on realworldtech

Posted Mar 12, 2009 19:02 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

Linus Torvalds has a classic rant on the subject at realworldtech.com

Is the Linus Torvalds on realworldtech.com the same person as the well-known author of Linux?

Linus Torvalds on realworldtech

Posted Mar 12, 2009 21:12 UTC (Thu) by james (subscriber, #1325) [Link] (2 responses)

It seems highly likely, yes.

He's got the same use of English (it sounds like Linus), the same technical knowledge, the same opinions on various processors (especially Itanium), the same enjoyment of a good flame war, posts using the torvalds-at-linux-foundation.org email address (although the site doesn't necessarily validate those), and some of these posts have been the subject of news reports on mainstream IT websites (so there's a fair chance that Linus would hear about those posts).

Andi Kleen posts there too (or, again, someone calling himself Andi Kleen), and some of the other posters are also very knowledgeable. So there's a good chance that any slips would be noticed.

If it is an imposter, he's managed to keep a lot of people fooled for a long time over a lot of arguments.

So we can't be as certain as we can that, for example, Alan Cox isn't really a whole load of little Welsh gnomes hiding down disused Welsh coal-mines, but it's pretty likely.

Linus Torvalds on realworldtech

Posted Mar 13, 2009 0:35 UTC (Fri) by njs (subscriber, #40338) [Link]

>So we can't be as certain as we can that, for example, Alan Cox isn't really a whole load of little Welsh gnomes hiding down disused Welsh coal-mines

...and how do we know that?

Linus Torvalds on realworldtech

Posted Mar 13, 2009 23:23 UTC (Fri) by jd (guest, #26381) [Link]

My information is that they're Cornish Pixies hiding down Welsh Coal Mines. Freshly picked Cornish Pixies.

4K, why not also 64K?

Posted Mar 12, 2009 19:11 UTC (Thu) by zmi (guest, #4829) [Link]

> I've actually done the math. Even 64kB pages is totally useless for
> a lot of file system access stuff: you need to do memory management
> granularity on a smaller basic size, because otherwise you just
> waste all your memory on unused left-over-space.

Depends on the design of the FS. Example: ReiserFS already combined the endings of several files in a single 4K disk block and thus saved a *lot* of disk space. And for a company using the server to store their documents, there won't be too many files anymore with <64KiB (yes I know with 65KiB you still loose 63KiB), but those should be fast. Speed starts to be a limitation, while disk space is not (hey, just put another terabyte disk into the RAID).

Regarding memory page size: I don't understand why that limits a FS block size, there's scatter/gather I/O and a 64KiB block from disk doesn't need to be linear in memory. I'm not a coder, but I think that limitation should be resolvable.

zmi

Linux and 4K disk sectors

Posted Mar 12, 2009 20:23 UTC (Thu) by Thue (guest, #14277) [Link] (4 responses)

Why bother with a new block-level abstraction, instead of just skipping a generation and implementing a object storage interface?

Linux and 4K disk sectors

Posted Mar 12, 2009 22:18 UTC (Thu) by willy (subscriber, #9762) [Link]

Object storage is, so far, an unproven, interesting, exciting paradigm. I wouldn't like to trust my data to it today. The 4k sector drives are coming *soon* (I don't think I can say exactly when). OSDs have been in development for at least ten years, and don't show any signs of getting traction any time soon.

Linux and 4K disk sectors

Posted Mar 12, 2009 23:22 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (2 responses)

Object storage? Meh. What makes you believe that disk firmware can do a better job of allocating blocks than a filesystem can?

Linux and 4K disk sectors

Posted Mar 13, 2009 10:47 UTC (Fri) by gnb (subscriber, #5132) [Link]

It does have the advantage of more information: it knows the real geometry of
the disk, as well as whether or not a given block is actually remapped to
somewhere completely different. So *in principle* I can imagine it
managing to allocate for better performance.

Linux and 4K disk sectors

Posted Mar 13, 2009 20:37 UTC (Fri) by Thue (guest, #14277) [Link]

It would also give the hardware manufacturers total freedom to reimplement the hardware, which is generally a good feature of an interface.

Linux and 4K disk sectors

Posted Mar 12, 2009 22:13 UTC (Thu) by willy (subscriber, #9762) [Link] (1 responses)

Thanks for noticing my ATA patches. Martin Petersen has actually been doing more work than I have on supporting 4k sectors in the kernel, Peter Anvin has been working on some of the boot loader issues, and I know Matt Domsch has been putting in some effort too. I just did some of the easy bits ;-)

I have some hopefully constructive criticism for this article:

LBA stands for Logical Block Address, not Large Block Access. LBA was actually a move away from CHS (Cylinder-Head-Sector) addressing. I'm not sure whether LBA coincided with the move from 28-bit commands to 48-bit commands or not. Adding 48-bit commands was a much smaller undertaking than changing the sector size; there are many more intricate dependencies on sector size than there were on maximum drive size.

I don't know if I've explained myself terribly well if an article like this contains the line "The sect_size is used instead of ATA_SECT_SIZE when the data transfer is a multiple of 512-byte sectors." Let me see if I can do better:

Not all ATA commands that transfer data do so in multiples of the sector size. DOWNLOAD MICROCODE is a great example of this. It sends data to the drive in multiples of 512 bytes, regardless of the sector size. So when we issue a command, we have to decide what transfer size to use; it could be the sector size, or it could be 512 bytes.

My patchset implements a pair of lookup tables for this; one to say "This command transfers in units of sector size" and another to say "We know what this command is". If it's in the first table, we know to use sect_size. If it's not in the second table, we don't know what size to use, so we print an error, assume 512 bytes and add it to the second table. My rationale for this design is that new commands are less likely to be read/write data commands than they are to be some kind of management command.

Hope these clarifications are useful.

Linux and 4K disk sectors

Posted Mar 13, 2009 12:59 UTC (Fri) by ranmachan (guest, #21283) [Link]

FWIW, I'm pretty sure the move to LBA came first and only later on LBA48 addressing was introduced after we already had LBA28.