Sunsetting buffer heads

Posted May 23, 2023 9:55 UTC (Tue) by farnz (subscriber, #17727)
In reply to: Sunsetting buffer heads by Fowl
Parent article: Sunsetting buffer heads

Flash doesn't do read-modify-write cycles; it has a translation layer that maintains a mapping from LBAs to where on the physical flash the data is actually stored, and can thus do small writes in a log-structured fashion. SMR drives can operate in one of two modes: one where the device exposes the write limitations to the host, and you have to comply, and one where it either log-structures things itself or does read-modify-write (whichever is faster).
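To make the log-structured idea concrete, here is a toy sketch of a translation layer. Everything in it (the `ToyFTL` name, the slot model, the lack of garbage collection) is an assumption for illustration, not how any particular drive's firmware works; the point is only that a small write appends to the log and repoints the LBA rather than rewriting data in place.

```python
class ToyFTL:
    """Toy log-structured translation layer (hypothetical, for illustration)."""

    def __init__(self, num_slots):
        self.mapping = {}                 # LBA -> slot currently holding its data
        self.flash = [None] * num_slots   # stand-in for physical flash locations
        self.next_free = 0                # append point of the log

    def write(self, lba, data):
        # Never overwrite in place: append to the log and repoint the LBA.
        slot = self.next_free
        if slot >= len(self.flash):
            raise RuntimeError("log full; a real FTL would garbage-collect here")
        self.flash[slot] = data
        self.mapping[lba] = slot          # the old slot (if any) is now stale
        self.next_free += 1
        return slot

    def read(self, lba):
        return self.flash[self.mapping[lba]]


ftl = ToyFTL(num_slots=8)
first = ftl.write(5, b"old")
second = ftl.write(5, b"new")   # same LBA, lands in a new slot
print(first, second, ftl.read(5))
```

Rewriting LBA 5 never touches the slot holding the old data; it just becomes stale, which is why no read-modify-write is needed for the small write itself.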

But note that even if your device does do read-modify-write, working in larger blocks isn't guaranteed to be faster; it may well still be faster to read 512 bytes, modify 16 or 128 bytes (a single partition table entry in MBR or GPT respectively), then write 512 bytes than to read 1024 KiB, do the modification, and write 1024 KiB back. While the 1024 KiB write may take the same time on the medium as a 512 byte write (or even a little less), the operation as a whole will be dominated by the bus transfer time involved in transferring all that data to and from the host.
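As a rough illustration of the bus cost, the sketch below counts only the bytes that cross the bus in a host-side read-modify-write. The helper name is made up, and it ignores command overhead and media time, which is a deliberate simplification.

```python
def host_rmw_bus_bytes(block_size):
    """Bytes crossing the bus for a host-side read-modify-write of one block."""
    return 2 * block_size  # one read into the host, one write back out


small = host_rmw_bus_bytes(512)           # 512 byte block
large = host_rmw_bus_bytes(1024 * 1024)   # 1024 KiB block
print(small, large, large // small)
```

Under those assumptions the large-block cycle moves 2048 times as much data over the bus to change the same 16 or 128 bytes.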

And note that if the device has a decent amount of cache on it, it may well service the 512 byte read request by transferring a full physical sector into its cache in anticipation of a 512 byte write - so the read cost of the read-modify-write cycle has already been paid by the time you transfer 512 bytes back to write, and your write is as cheap as a bigger write.

This is not to be confused with the issues with partition alignment on early "advanced format" drives, where the underlying hardware had 4096 byte physical sectors but presented 512 byte logical sectors to the host; the issue there was that you could have a situation where the host was making 4096 byte writes (optimal size for the hardware), but was misaligned with the physical sectors, so every 4096 byte write turned into a read-modify-write. The same could happen with 1 GiB writes, misaligned by 512 bytes, where a consequence is that what should have been a single large operation becomes a read-modify-write.
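The alignment problem can be sketched as a count of physical sectors that a write only partly covers, since those are the ones the device has to read before it can write them. This is a simplified model with a hypothetical helper, assuming 4096 byte physical sectors and a write length greater than zero.

```python
PHYSICAL_SECTOR = 4096  # assumed physical sector size in bytes


def partial_sectors(offset, length, sector=PHYSICAL_SECTOR):
    """Count physical sectors that the write covers only partly (length > 0)."""
    end = offset + length
    partial = set()
    if offset % sector:
        partial.add(offset // sector)      # write starts mid-sector
    if end % sector:
        partial.add((end - 1) // sector)   # write ends mid-sector
    return len(partial)


aligned = partial_sectors(offset=8192, length=4096)
misaligned = partial_sectors(offset=8192 + 512, length=4096)
huge = partial_sectors(offset=512, length=1 << 30)
print(aligned, misaligned, huge)
```

In this model the aligned 4096 byte write touches no partial sectors, while the same write shifted by 512 bytes straddles two physical sectors and needs a read for each. The 1 GiB write misaligned by 512 bytes also ends up with two partly covered sectors, one at each end.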

Sunsetting buffer heads

Posted May 24, 2023 18:22 UTC (Wed) by willy (subscriber, #9762)

You're not wrong, but you're not aware of a more recent problem that flash drives have run into: it takes a lot of memory to maintain the translation layer. If you double the number of LBAs, you double the amount of RAM needed to hold the translation layer. Unless that takes you past the 32->64 bit threshold for the table entries, in which case it quadruples. Anyway.

To shrink the size of this table (saving money and hopefully resulting in a cheaper drive), some drives track block mapping on a 4kB or larger boundary instead of the 512 byte LBA boundary. That shrinks the table by a factor of 8! Downside ... we're back to a read-modify-write cycle for single-block writes. So even an NVMe drive may prefer 4kB aligned writes.
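The table-size arithmetic can be sketched as follows. The flat one-entry-per-unit table, the 4 byte entry size, and the 1 TiB capacity are assumptions picked for illustration (real FTLs vary), and "4kB" is taken to mean 4096 bytes.

```python
def map_table_bytes(capacity_bytes, unit_bytes, entry_bytes):
    """Size of a flat mapping table with one entry per mapped unit."""
    return (capacity_bytes // unit_bytes) * entry_bytes


TIB = 1 << 40
GIB = 1 << 30

fine = map_table_bytes(TIB, 512, 4)      # one entry per 512 byte LBA
coarse = map_table_bytes(TIB, 4096, 4)   # one entry per 4096 byte unit
print(fine // GIB, coarse // GIB, fine // coarse)

# Doubling the capacity while also widening entries from 4 to 8 bytes
# quadruples the table rather than doubling it.
widened = map_table_bytes(2 * TIB, 512, 8)
print(widened // fine)
```

With these numbers the table drops from 8 GiB to 1 GiB when the mapping unit grows from 512 to 4096 bytes, which is the factor of 8 mentioned above.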

Sunsetting buffer heads

Posted May 25, 2023 10:32 UTC (Thu) by farnz (subscriber, #17727)

That doesn't affect what I said at all, though - the difference between a host read-modify-write and a device read-modify-write is unlikely to be in the host's favour, since both involve the same media accesses, but one also involves a bus transfer, and the other doesn't.

And there's other weirdness with FTLs out there - one I've encountered tracked the entire logical device in large chunks, where a chunk could either be a pointer to flash locations, or a pointer to a 32k split. Each 32k split could, in turn, point to a flash location, or to a 512 byte split. And both sizes of split were limited resources, statically allocated; if you ran out of splits of the required size to satisfy a write, the FTL would delay the write while it did the defragmentation needed to free up a split of the required size. In the worst case, a 512 byte write would force you to defrag a 32k split into a large chunk, and then you could use the newly freed 32k split to defrag a 512 byte split, and then use the newly freed 512 byte split to handle the write. But this process is quicker than having the host read a large chunk, modify 512 bytes, and write a large chunk - since in the read and then write case, the FTL has to identify the 32k and 512 byte splits that are affected by this large write, and mark them as free for reuse.