Solid-state storage devices and the block layer
Posted Oct 5, 2010 17:29 UTC (Tue) by strappe (guest, #53440)
I can easily imagine that flash will displace hard drives in most laptops and desktops, but server farms are still going to need massive amounts of cheap storage. Rotating media still has a huge lead in $/bit (100X), so I don't think it will be displaced there any time soon.
Posted Oct 5, 2010 18:04 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
I was thinking more in terms of treating flash specifically as less like an "I/O" device and more like a slow memory. I have no doubt that spinning rust will be around for a while--a decade or more at least. It just seems like wrapping the flash behind a "disk drive" abstraction in hardware puts some artificial upper limits on how well it can perform. It's acceptable with spinning rust because the electronics are so much faster. When you go all solid-state, it just feels like a bottleneck.
Imagine what would happen if the immense creativity of the kernel crowd were unleashed on the problem of load balancing writes, erases, and reads across a parallel array of raw flash modules.
Approaches such as UBI/UBIFS sound rather promising. I generally like the idea of owning the problem in kernel space, where it seems like we ought to be able to do much more deliberate and proactive scheduling.
Posted Oct 5, 2010 18:36 UTC (Tue) by dlang (✭ supporter ✭, #313)
The requirement to do bulk deletes makes it far more like spinning disks than RAM.
Posted Oct 5, 2010 19:27 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
It certainly is random access. I can generally send a command for address X followed by a command for address Y to the same chip, where the response time is not a function of the distance between X and Y, except when they overlap. Instead, the performance is most strongly determined by what commands I sent[*]. Reads are much faster than writes, and both are much, much faster than sector erase.
The opposite is generally true of disks. There, the cost of an operation is more strongly determined by whether it triggered a seek (and how far the seek went) than by whether the operation was a read or a write. Both reads and writes require getting the head to a particular position on the platter, ignoring any cache that might be built into the drive. Also, under normal operation, spinning-rust drives don't really have an analog to "sector erase." (Yes, there's the old "low-level format" commands, but those aren't generally used during normal filesystem operation.)
[*] Ok, so that's not 100% true, but essentially true in the current context. NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes à la DRAM without cache, you'll see noticeably slower performance if the two reads are in different pages. But, if you're doing block transfers, such as 512-byte sector reads, you're reading the whole page. Hopping between any two sectors always costs about the same. Here, read a data sheet! For this particular flash, a random sector read is 10us, sector write is 250us, and page erase is 2ms. The whole page-open/page-close architecture makes it look much more like modern SDRAM than disk.
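The two cost regimes being contrasted here can be sketched as a toy model. All names and disk parameters below are illustrative assumptions; the flash figures are the data-sheet numbers quoted above (10us read, 250us write, 2ms sector erase):

```python
# Toy cost model: flash cost depends on the operation; disk cost depends
# mostly on seek distance. All disk parameters are made-up assumptions.

FLASH_COST_US = {"read": 10, "write": 250, "erase": 2000}

def flash_cost(op, from_sector, to_sector):
    """Flash: cost is a function of the command sent, never of the
    'distance' between the previous and current sector."""
    return FLASH_COST_US[op]

def disk_cost(op, from_track, to_track,
              seek_us_per_track=50, rotational_latency_us=4000):
    """Disk: cost is dominated by how far the head must seek plus
    rotational latency; read vs. write barely matters."""
    return abs(to_track - from_track) * seek_us_per_track + rotational_latency_us

# Hopping between distant sectors costs the same on flash...
print(flash_cost("read", 0, 1), flash_cost("read", 0, 500_000))
# ...but grows with distance on a disk.
print(disk_cost("read", 0, 1), disk_cost("read", 0, 500_000))
```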
Posted Oct 5, 2010 19:42 UTC (Tue) by dlang (✭ supporter ✭, #313)
Posted Oct 5, 2010 20:38 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
You can do random writes to random empty sectors. Again, that's nothing like how a hard disk works. I'm still strenuously disagreeing with your earlier statement that flash's properties make it more like a disk than like RAM. It's really an entirely different beast worthy of separate consideration, which is why I think wrapping it up in an SSD limits its potential.
With flash, you need entirely new strategies that apply neither to disks nor RAM to get the full benefit from the technology. Much of the effort spent on disks revolves (no pun intended) around eliminating seeks. No such effort is required with RAM or with flash. Flash does require you to think about how you pool your free sectors, though, and how you schedule writing versus erasing. I won't deny that. Rather, I say it only further invalidates your original conjecture that it makes flash more like disks. (I will agree it makes it less like RAM though.)
Because seeks are "free", I could totally see load balancing algorithms of the form "write this block to the youngest free sector on the first available flash device", so that a new write doesn't get held up by devices busy with block erases. That looks nothing like what you'd want to do with a disk. It takes advantage of the "free seek" property of the flash while helping to hide the block erase penalty it imposes. Neither property is a property of a disk drive. Of course, neither property is a property of RAM, either.
Am I splitting hairs over semantics here? Let me step back and summarize, and see if you agree: Raw flash's random access capability and relatively low access time can make it much more like RAM than disk, especially in terms of bandwidth and latency. Raw flash's limitations on writes, however, require the OS to have flash-specific write strategies. They prevent the OS from treating flash identically to RAM, and will require careful thought to be handled correctly. This is similar to how we had to put careful thought into disk scheduling algorithms, even if flash requires entirely different algorithms to address its unique properties.
Posted Oct 9, 2010 14:10 UTC (Sat) by joern (subscriber, #22392)
Intriguing. Can you elaborate a bit? What difference does it make vs. the naïve approach of erasing before writing?
Posted Oct 9, 2010 14:55 UTC (Sat) by dlang (✭ supporter ✭, #313)
You also have the problem that erasing takes a significant amount of time and power, so you don't want to wait until you need to erase before doing so, and you don't want to erase unnecessarily while on battery.
Posted Oct 9, 2010 15:03 UTC (Sat) by jzbiciak (✭ supporter ✭, #5246)
Note: I'm not an expert. Please do not mistake me for one. :-) Here are my observations, though, along with things I've read elsewhere.
Flash requires wear leveling in order to maximize its life. For the greatest effect, you want to wear level across the entire device, which means picking up and moving otherwise quiescent data so that each sector sees approximately the same number of erasures. That's one aspect.
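The bookkeeping behind device-wide (static) wear leveling can be illustrated with a toy example. This is an assumption about how one might do it, not a description of any real flash translation layer; the threshold is arbitrary:

```python
# Toy static wear leveling: when erase counts drift too far apart, relocate
# the quiescent data sitting in the least-erased block so that block can
# start absorbing erases too.

def needs_leveling(erase_counts, threshold=100):
    """True when the most- and least-erased blocks are too far apart."""
    return max(erase_counts) - min(erase_counts) > threshold

def pick_victim(erase_counts):
    """The least-erased block likely holds cold data; move its contents."""
    return min(range(len(erase_counts)), key=erase_counts.__getitem__)

counts = [500, 12, 480, 495]        # block 1 holds cold, quiescent data
if needs_leveling(counts):
    print("relocate data out of block", pick_victim(counts))
```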
Another aspect is that erase blocks are generally much larger than write sectors. So, when you do erase, you end up erasing quite a lot. Furthermore, erasure is about an order of magnitude slower than writing, and writing is about an order of magnitude slower than reading. For a random flash device whose data sheet I just pulled up, a random read takes 25us, page program takes 300us, and block erase takes 2ms. Pages are 2K bytes, whereas erase blocks are 128K bytes.
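The arithmetic from those data-sheet figures shows why out-of-place (log-style) writes beat naive in-place rewrites, which would mean copying out the live pages, erasing the whole block, and reprogramming everything:

```python
# Rough cost arithmetic using the figures quoted above: 25us read,
# 300us program, 2ms block erase, 2KB pages, 128KB erase blocks.

PAGE = 2 * 1024
BLOCK = 128 * 1024
PAGES_PER_BLOCK = BLOCK // PAGE          # 64 pages per erase block

READ_US, PROG_US, ERASE_US = 25, 300, 2000

# Naive in-place rewrite of one page: read the other 63 live pages,
# erase the block, then reprogram all 64 pages.
in_place = (PAGES_PER_BLOCK - 1) * READ_US + ERASE_US + PAGES_PER_BLOCK * PROG_US

# Out-of-place write: just program one already-erased page elsewhere.
out_of_place = PROG_US

print(in_place, "vs", out_of_place, "microseconds")  # 22775 vs 300
```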
(Warning: This is where I get speculative!) And finally, if you have multiple flash devices (or multiple independent zones on the same flash device), you can take advantage of that fact and the fact that "seeks are free" by redirecting writes to idle flash units if others are busy. That's probably the most interesting area to explore algorithmically, IMO. Given that an erase operation can take a device out of commission for 2ms, picking which device to start an erase operation on and when to do it can have a pretty big impact on performance. If you can do background erase on idle devices, for example, then you can hide the cost.
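Continuing that speculation, the "erase on idle devices" decision could be as simple as comparing each device's expected idle window against the erase time. The function and its parameters are hypothetical:

```python
# Speculative sketch: start a pending block erase only on a device whose
# expected idle window can absorb the full ~2ms erase, so foreground
# writes on the other devices are never stalled behind it.

def pick_erase_device(expected_idle_us, erase_us=2000):
    """Return the index of the device best able to hide an erase, or None
    if no device is expected to stay idle long enough (defer the erase)."""
    best = max(range(len(expected_idle_us)), key=expected_idle_us.__getitem__)
    return best if expected_idle_us[best] >= erase_us else None

print(pick_erase_device([500, 3000, 1200]))   # device 1 can hide the erase
print(pick_erase_device([500, 800, 1200]))    # no one can; defer it
```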
Posted Oct 7, 2010 12:38 UTC (Thu) by nix (subscriber, #2304)
NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes à la DRAM without cache, you'll see noticeably slower performance if the two reads are in different pages.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds