LWN.net Logo

SSD

SSD

Posted Oct 2, 2009 9:31 UTC (Fri) by dwmw2 (subscriber, #2063)
Parent article: LinuxCon: Kernel roundtable covers more than just bloat

"The flash hardware itself is better placed to know about and handle failures of its cells, so that is likely to be the place where it is done, he said."
I was biting my tongue when he said that, so I didn't get up and heckle.

I think it's the wrong approach. It was all very well letting "intelligent" drives remap individual sectors underneath us so that we didn't have to worry about bad sectors or C-H-S and interleaving. But what the flash drives have to do to present a "disk" interface is much more than that; it's wrong to think that the same lessons apply here.

What the SSD does internally is a file system all of its own, commonly called a "translation layer". We then end up putting our own file system (ext4, btrfs, etc.) on top of that underlying file system.

Do you want to trust your data to a closed source file system implementation which you can't debug, can't improve and — most scarily — can't even fsck when it goes wrong, because you don't have direct access to the underlying medium?

I don't, certainly. The last two times I tried to install Linux to a SATA SSD, the disk was corrupted by the time I booted into the new system for the first time. The 'black box' model meant that there was no chance to recover — all I could do with the dead devices was throw them away, along with their entire contents.

File systems take a long time to get to maturity. And these translation layers aren't any different. We've been seeing for a long time that they are completely unreliable, although newer models are supposed to be somewhat better. But still, shipping them in a black box with no way for users to fix them or recover lost data is a bad idea.

That's just the reliability angle; there are also efficiency concerns with the filesystem-on-filesystem model. Flash is divided into "eraseblocks" of typically 128KiB or so. And getting larger as devices get larger. You can write in smaller chunks (typically 512 bytes or 2KiB, but also getting larger), but you can't just overwrite things as you desire. Each eraseblock is a bit like an Etch-A-Sketch. Once you've done your drawing, you can't just change bits of it; you have to wipe the whole block.

Our flash will fill up as we use it, and some of the data on the flash will be still relevant. Other parts will have been rendered obsolete; replaced by other data or just deleted files that aren't relevant any more. Before our flash fills up completely, we need to recover some of the space taken by obsolete data. We pick an eraseblock, write out new copies of the data which are still valid, and then we can erase the selected and re-use it. This process is called garbage collection.

One of the biggest disadvantages of the "pretend to be disk" approach is addressed by the recent TRIM work. The problem was that the disk didn't even know that certain data blocks were obsolete and could just be discarded. So it was faithfully copying those sectors around from eraseblock to eraseblock during its garbage collection, even though the contents of those sectors were not at all relevant — according to the file system, they were free space!

Once TRIM gets deployed for real, that'll help a lot. But there are other ways in which the model is suboptimal.

The ideal case for garbage collection is that we'll find an eraseblock which contains only obsolete data, and in that case we can just erase it without having to copy anything at all. Rather than mixing volatile, short-term data in with the stable, long-term data we actually want to keep them apart, in separate eraseblocks. But in the SSD model, the underlying "disk" can't easily tell which data is which — the real OS file system code can do a much better job.

And when we're doing this garbage collection, it's an ideal time for the OS file system to optimise its storage — to defragment or do whatever else it wants (combining data extents, recompressing, data de-duplication, etc.). It can even play tricks like writing new data out in a suboptimal but fast fashion, and then only optimising it later when it gets garbage collected. But when the "disk" is doing this for us behind our back in its own internal file system, we doesn't get the opportunity to do so.

I don't think Ted is right that the flash hardware is in the best place to handle "failures of its cells". In the SSD model, the flash hardware doesn't do that anyway — it's done by the file system on the embedded microcontroller sitting next next to the flash.

I am certain that we can do better than that in our own file system code. All we need is a small amount of information from the flash. Telling us about ECC corrections is a first step, of course — when we had to correct a bunch of flipped bits using ECC, it's getting on for time to GC the eraseblock in question, writing out a clean copy of the data elsewhere. And there are technical reasons why we'll also want the flash to be able to say "please can you GC eraseblock #XX soon".

But I see absolutely no reason why we should put up with the "hardware" actually doing that kind of thing for us, behind our back. And badly.

Admittedly, the need to support legacy environments like DOS and to provide INT 13h "DISK BIOS" calls or at least a "block device" driver will never really go away. But that's not a problem. There are plenty of examples of translation layers done in software, where the OS really does have access to the real flash but still presents a block device driver to the OS. Linux has about 5 of them already. The corresponding "dumb" devices (like the M-Systems DiskOnChip which used to be extremely popular) are great for Linux, because we can use real file systems on them directly.

At the very least, we want the "intelligent" SSD devices to have a pass-through mode, so that we can talk directly to the underlying flash medium. That would also allow us to try to recover our data when the internal "file system" screws up, as well as allowing us to do things properly from our own OS file system code.


(Log in to post comments)

SSD

Posted Oct 3, 2009 7:45 UTC (Sat) by job (guest, #670) [Link]

Did Ted not read Val's articles on LWN, or does he just not agree?

SSD

Posted Oct 5, 2009 13:09 UTC (Mon) by i3839 (guest, #31386) [Link]

I agree with you, but I understand the other side too.

> But I see absolutely no reason why we should put up with the "hardware"
> actually doing that kind of thing for us, behind our back. And badly.

Three reasons:

- Interoperability without losing flexibility.
For a "dumb" block driver to work the translation table must be fixed.
And when running in legacy mode the drive will be very slow. There are
a heap of problems waiting when it gives write access too.
It's very hard to let people switch filesystems. They mostly use the
default one from their OS, or FAT. As long as disks are sold as units
and not as part of a system this will be true.

- Performance.
Having dedicated hardware doing everything is faster and uses less power.
Not talking about an ARM mc, but a custom asic. (Just compare Intel's
SSD idle/active power usage to others.)

- Fast development.
Currently the flash chip interface is standardized with ONFI and the
other end with SATA. All the interesting development happens in-between.
Because it's wedged between stable interfaces it can change a lot without
impacting anything else (except when it's done badly ;-).

So short term the situation is quite hopeless for direct hardware access.
The best hope is probably SSDs with free specifications and open source
firmware.

Long term I think we should get rid of the notion of a disk and go more
to a model resembling RAM. You could buy arrays of flash and plug them in,
instead of whole disks (they could look like "disks" to the user, but
that's another matter). This decouples the flash from the controller
and makes data recovery easier in case the controller dies.

What is needed is a flash specific interface which replaces SATA and
implements the features that all good flash file systems need: Things
like multi-chip support, ECC handling, background erases and scatter
gather DMA. Perhaps throw in enough support to implement RAID fast
in software, if it makes enough sense. Basically a standardized small,
simple and fast flash controller, preferably accessible via some kind of
host memory interface (PCIe on x86). Make sure to make it flexible enough
to handle things like MRAM/PRAM/FeRAM too.

Maybe AHCI is good enough after a few adaptations, but it can be probably
a lot better. Or perhaps some other interface from the embedded world fits
the description.

This will make embedding the controller on a SoC easier too, without the
need for special support in the OS. SATA is really redundant in such cases.

I'm pretty sure most flash controllers already have such chip, but don't
expose it. Instead they hide it behind an embedded microcontroller which
handles SATA and implements the FTL. They should standardize it and get
rid of the power hungry, complexity adding bloat.

I also think it's crucial to get optimal performance, flash gets faster
and faster. It seems pointless to get faster and faster SATA when PCIe
is already fast enough. It's also silly to fake SATA by just implementing
a AHCI controller with direct flash access. Keep SATA around for real
external storage, not something that doesn't take much space anyway.

People that look at SSDs and see them just as disks and don't think about
the future will think it's best if the hardware does as much as possible.
But if you forget the classic disk model and look at what's really going
on it seems obvious that the classic disk model isn't that simple anyway
and doesn't fit flash or how the hardware looks like and could be used.

SSD

Posted Oct 6, 2009 6:23 UTC (Tue) by dwmw2 (subscriber, #2063) [Link]

We should probably take this discussion elsewhere. Your input would be welcome on the MTD list, where I've started a thread about what we want the hardware to look like, if we could have it our way.

But just a brief response...

"- Interoperability without losing flexibility."
This is still possible with a more flexible hardware design — you just implement the translation layer inside your driver, for legacy systems. M-Systems were doing this years ago with the DiskOnChip. More recently, take a look at the Moorestown NAND flash driver. You can happily use FAT on top of those. But of course you do have the opportunity to do a whole lot better, too. And also you have the opportunity to fix the translation layer if/when it goes wrong. And to recover your data.

"- Performance"
But this isn't being done in hardware. It's being done in software, on an extra microcontroller.

Yes, we do need to look carefully at the interface we ask for, and make sure it can perform well. But there's no performance-based reason for the SSD model.

"- Fast development"
You jest, surely? We had TRIM support for FTL in Linux last year, developed in order to test the core TRIM code. When do we get it on "real hardware"? This year? Next?

Being "wedged between stable interfaces" isn't a boon, in this case. Because it's wedged under an inappropriate stable interface, we are severely hampered in what we can do with it.

"People that look at SSDs and see them just as disks and don't think about the future will think it's best if the hardware does as much as possible. But if you forget the classic disk model and look at what's really going on it seems obvious that the classic disk model isn't that simple anyway and doesn't fit flash or how the hardware looks like and could be used."
Agreed. I think it's OK for the hardware to do the same kind of thing that disk hardware does for us — ECC, and some block remapping to hide bad blocks. But that's all; we don't want it implementing a whole file system of its own just so it can pretend to be spinning rust. In particular, perpetuating the myth of 512-byte sectors is just silly.

SSD

Posted Oct 29, 2009 18:11 UTC (Thu) by wookey (subscriber, #5501) [Link]

Well said David. I thought Linus was talking rot there too. It's bad enough all the manufacturers thinking the black-box approach is a sensible one - we really don't want kernel people who don't know any better getting that idea as well.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds