Log-structured file systems: There's one in every SSD

Posted Sep 19, 2009 6:57 UTC (Sat) by butlerm (subscriber, #13312)
In reply to: Log-structured file systems: There's one in every SSD by flewellyn
Parent article: Log-structured file systems: There's one in every SSD

If you are making an open SSD solution, the last thing you want to do is
make the basic hardware device a "disk", or contain a controller, a cpu,
wear leveling algorithms, intelligence, or any standard disk interface.

What you want is something like a PCIe or perhaps Firewire interface that
lets software running on the central CPU (or perhaps a peripheral card)
read and write the flash with software that is open, customizable, and
upgradeable.

That would make Solid State "Disk" storage much cheaper, much more
reliable, and much more customizable at a cost in hardware compatibility of
course. SATA is all dead weight - other than the serial interface, it
seems a gigantic step backward in the state of storage I/O technology.

SAS/SCSI is similarly over burdened, if not quite so backward as SATA.
SATA is one of those "make the simple things simple, and the hard things
impossible" sort of technologies".

Log-structured file systems: There's one in every SSD

Posted Sep 19, 2009 7:14 UTC (Sat) by dlang (guest, #313) [Link] (21 responses)

one huge advantage of the SATA interface is that it works in any machine, it also doesn't require any special drivers for the OS.

it's also surprisingly complicated to make a PCI interface correctly. in many ways it's far easier to make a SATA device.

Log-structured file systems: There's one in every SSD

Posted Sep 19, 2009 15:23 UTC (Sat) by butlerm (subscriber, #13312) [Link] (2 responses)

The problem is that the SATA interface is technically defective in a number
of respects, one of which is particularly bad for SSDs. The other problems
are well described in the original article - a black box interface to obscure
what is definitely not a black box. Switch manufacturers, and you could have
entirely different performance characteristics in a manner that could take to
months to evaluate.

The only way SATA/SAS physical interface would work well for SSDs is to
develop an entirely new command set that allowed the off-loading of virtually
all the intelligence to the host, i.e. presenting a flash memory interface
rather than a disk interface over the serial bus. At that point you really
couldn't call it a SATA "disk" any more, it would be more like a large
capacity SATA "memory stick". That is the way it should be.

Log-structured file systems: There's one in every SSD

Posted Sep 19, 2009 19:55 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

the only problem I have heard of is the fact that trim is not a NCQ command. what are the other problems?

the black box argument doesn't apply to a open DIY product like was being proposed.

I don't see why the host needs to address the raw flash. the problem is that there is currently zero visability to how the flash is being managed. if you had access to the source running on the device, why would you have to push all those details back to the OS?

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 6:55 UTC (Sun) by butlerm (subscriber, #13312) [Link]

The other major problem is effective control over when blocks physically hit
the platter, and which ones. There are no write barriers, the command to
flush the cache can't be queued, and Force Unit Access write commands
generally flush all other dirty data as well, making them slow to the degree
that no one uses them. All this stuff is important for reliable operation of
modern filesystems and databases, especially in portable devices.

I don't think do-it-yourself has anything to do with it - the question is
whether the person(s) concerned want to re-implement what other companies are
already doing with no obvious advantage, or do something that could
potentially run circles around current devices, if only due to the
flexibility and performance characteristics of the interface. A single level
filesystem ought to outperform one filesystem on top of another filesystem
every time.

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 8:48 UTC (Sun) by nhippi (subscriber, #34640) [Link] (17 responses)

It is just incredibly stupid to program flash memory with a very low-level interface designed for controlling spinning disks. You could extend ATA enough to give a raw flash (erase this block, then write with this data) or semi-raw (distribute write this block to nand chips 1,3,5,7, ...), but it still doesn't change the fact that the bus and protocol are engineered around spinning disks and overcoming its latency problems.

Other problems with sata (and pci) is that they are incredibly power-hungry busses.

While we have some interesting NAND flash-optimized code in linux kernel (UBIFS), microsofts dominant position in desktop market means they only appear in embedded systems. Instead of using flash-optimized free software filesystems, we attach a cpu on the flash chips to emulate the behaviour of a hard drive - just to keep microsoft operating systmems happy.

How much more evidence do people need that microsofts monopoly stiffles innovation?

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 9:19 UTC (Sun) by dlang (guest, #313) [Link] (16 responses)

keep in mind that the same hardware standardization that you are blaming microsoft for is the same thing that let linux run on standard hardware.

think about the problems that we have in the cases where hardware doesn't use standard interfaces, and the only code that is around to make things work is the closed-source windows drivers.

examples of this are the wifi and video drivers right now (just about everything else has standardized, and I believe that wifi is in the process)

another example s sound hardware, some of it is open, but some of it is not.

if you want other historic examples, go take a look at the nightmare that was CD interfaces before they standardized on IDE.

it's not just microsoft that benifits from this standardization, it's other opensource operating systems (eventually including whatever is going to end up replacing linux someday ;-)

as for your comments about the horrible storage interface designed for rotating disk. drives export their rotating disks as if they were a single logical array of sectors. they didn't use to do this, but nowdays that's what's needed. the cylinder/head/sector counts that they report are pure fiction for the convenience of the OS. what the interface does is lets you store and retrieve blocks of data. it doesn't really matter of those blocks are stored on a single drive, a raid array, ram, flash, or anything else.

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 18:59 UTC (Sun) by drag (guest, #31333) [Link] (13 responses)

The problem is that memory devices are NOT block devices and you can't treat them like block devices. Raw flash has "Erase Blocks", but that is just about were the similarities ends.

And I suspect, even though I have very little knowledge of SATA, that the ATA protocols used are going to be a very poor match for dealing with non-block devices.

Also on top of that people are talking about handling things that are now currently reserved to drive firmware... For example current harddrives reserve sectors to replace back sectors, right? Well in flash your going to have to record drive memory usage someplace on the device so that if you plug the drive into another OS or reinstall you don't loose your wear leveling information.

--------------------------------

So probably the best thing to do is forgo the entire SATA interface and create a PCI Express card so you can create a new interface.

This should allow a much quicker and low-level interface to bypass any and all "black boxes".

The trick is if you want to have low-level access you need to be able to map portions of your flash memory to the machine's memory map so you can access the flash _as_memory_ and not as block devices.

And obviously if your dealing with a 250GB drive you probably don't want to map the entire drive to your system's memory (which I doubt is even possible to have a PCI Express thingy do that and still be able to take advantage of things like DMA). So you'll need some sort of sliding window mechanism in PCI configuration space so that the kernel can say "I want to look at memory addresses 0xX to memory 0xY relative to the flash drive"

Then your probably going to want to have multiple "sliding windows".. for example on a 4-lane PCI E device I would think its possible to be reading/writing to 4 different portions of the flash memory simultaneously. Sort of like hyperthreading, but with flash memory.

So for example if you can write to the drive at 200MB/s then having a 8-lane PCI Express memory device means that you can have a total write performance of 1600MB/s.

Then on top of that you'll want to have a special area of flash memory, with very small erase blocks, that you can store statistical information about each erase block of flash memory as well as real-to-virtual block mappings for wear leveling algorithms to use, and then extra space for block flags for future proofing.. probably in more expensive SLC flash memory so you don't have to worry about wear leveling and whatnot compared to the regular MLC-style flash that you'll use for your actual mass storage.

------------------

And of course you'll have to realize that if you want low-level access you can't use existing file systems for Linux.

So no BTRFS, no Ext3, no Ext4 or anything like that. The flash memory file systems for Linux are pretty worthless for large amounts of storage. You could use software memory-to-block translation to run BTRFS on top of it, but you lose the advantage of "one file system", although you still retain the ability to actually know what is going on and be able to do layer violations so that BTRFS's behavior can be modified to optimize flash memory access.

And then on top of that you have no Windows compatibility. So no formatting it as vfat or anything like that. Sure you could still use Linux's memory-to-block translation stuff, but that won't be compatible for Windows.

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 19:43 UTC (Sun) by cmccabe (guest, #60281) [Link]

The problem is that once something becomes a standard, it's very hard to change it. It usually requires the major players to hold long meetings in standards organizations. A lot of horse trading can go on in these meetings. Sometimes companies even try to sabotage the process because they don't think that the proposed standard would help them.

For example, Intel is feeling good about NAND-over-SATA right now because they have one of the most advanced block emulation layers. They have a competitive advantage and they want to make the most of it. I would be surprised if they made any moves at all towards exposing raw flash to the operating system. It would not be in their best interest.

The big sticking point with any raw-flash interface is Windows support. Pretty much all Windows users use NTFS, which works on traditional block devices. Any company hoping to replace NAND-over-SATA would have to supply a filesystem for Windows to use with their product of equivalent quality. Filesystems can take years to become truly stable. In the meantime all those angry users will be banging on *your* door.

Microsoft might create a log-structured filesystem for Windows UltraVista Aquamarine Edition (or whatever their next release will be called), but I doubt it. I just don't see what's in it for them. It's more likely that we'll see another one of those awkward compromises that the PC industry is famous for. Probably the next version of SATA will include some improvements to make NAND-over-SATA more bearable. And filesystem developers will just have to learn to live with having another layer of abstraction between them and the real hardware. Perhaps we'll finally get bigger blocks (512 is way too small for flash).

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 23:36 UTC (Sun) by dlang (guest, #313) [Link] (10 responses)

it is very expensive to run all the wires to connect things via a parallel bus, that is why drive interfaces have moved to serial busses

trying to map your flash storage into your address space is going to be expensive, and it also isnt very portable from system to system.

it's already hard to get enough slots for everything needed in a server, dedicating one to flash is a hard decision, and low-end systems have even fewer slots.

there is one commercial company making PCI-E based flash drives, their cost per drive is an order of magnitude higher than the companies making SATA based devices, they also haven't been able to get their device drivers upstream into the kernel so users are forced into binary-only drivers in their main kernel. this is significantly worse than the SATA interface version because now mistakes in the driver can clobber the entire system, not just the storage device.

sorry, I don't buy into the 'just connect the raw flash and everything will be good' reasoning.

Log-structured file systems: There's one in every SSD

Posted Sep 21, 2009 0:38 UTC (Mon) by drag (guest, #31333) [Link] (2 responses)

Ya. I was just trying to show what it would be like to try to access a large amount of flash memory in a 'raw mode'.

People are kinda confused, I think, about the whole block vs flash thing.

The way I see it Flash memory, in a lot of ways, is much more like memory then block devices. Instead of thinking it as block devices with no seek time... think of it more like memory with pecular requirements that makes writes very expensive.

Log-structured file systems: There's one in every SSD

Posted Sep 21, 2009 1:03 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

the thing is that even main memory access isn't really 'memory like' anymore.

on modern systems, it's impossible to directly access a particular byte of ram. the memory chips actually act more like tape drives, it takes a significant amount of time to get to the start position, then it's cheap to do sequential read/writes from that point forward.

your cpu uses this to treat your ram as if it was a tape device with blocks the size of your cpu cache lines (64-256 bytes each)

In addition, if you want to have checksums to detect problems you need to define what size the chunks of data are that you are doing the checksum over.

Log-structured file systems: There's one in every SSD

Posted Sep 21, 2009 9:10 UTC (Mon) by nix (subscriber, #2304) [Link]

on modern systems, it's impossible to directly access a particular byte of ram. the memory chips actually act more like tape drives, it takes a significant amount of time to get to the start position, then it's cheap to do sequential read/writes from that point forward. your cpu uses this to treat your ram as if it was a tape device with blocks the size of your cpu cache lines (64-256 bytes each)

That's almost entirely inaccurate, I'm afraid. Ulrich Drepper's article on memory puts it better, in section 2.2.1.

The memory is necessarily read in units of cachelines, and it takes a significant amount of time to load uncached data from main memory, and of course it takes time to latch RAS and CAS, but main memory itself has a jagged access pattern, with additional delays from precharging and so on whenever RAS has to change.

But that doesn't make it like a tape drive, it's still random-access: it takes the same time to jump one row forwards as to jump fifty backwards. It's just that the units of this random access are very strange, given that they're dependent on the physical layout of memory in the machine (not machine words and possibly not cachelines), and are shielded from you by multiple layers of caching.

Log-structured file systems: There's one in every SSD

Posted Sep 21, 2009 8:30 UTC (Mon) by butlerm (subscriber, #13312) [Link]

PCI-E really isn't a parallel bus - it is multi-lane differential serial bus.
You can use a single lane if you want to - 500 MB/sec per lane today, 1 GB/s
soon.

The good thing about PCI-E for this application is that you can use simple
external cables, so you can easily locate your flash units in a different
chassis than the CPU. External PCI-E connections are being used for external
disk arrays already. *Much* faster than SAS with more than one lane.

Flash is not exactly a byte addressable memory technology, btw, so you still
need to DMA to and from host memory.

Log-structured file systems: There's one in every SSD

Posted Sep 25, 2009 22:32 UTC (Fri) by giraffedata (guest, #1954) [Link] (5 responses)

it is very expensive to run all the wires to connect things via a parallel bus, that is why drive interfaces have moved to serial busses

That's not why drive interfaces (and every other data communication interface in existence) have moved to serial. They did it to get faster data transfer.

But I'm confused as to the context anyway, because the ancestor posts don't mention parallel busses.

Log-structured file systems: There's one in every SSD

Posted Sep 25, 2009 22:57 UTC (Fri) by dlang (guest, #313) [Link] (4 responses)

they described handling flash as if it was just ram with special write requirements.

that (at least to me) implied the need for a full memory bus (thus the lots of wires)

by the way, parallel buses are inherently faster than serial buses, all else being equal.

if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH.

the problem with parallel buses at high speeds is that we have gotten fast enough that the timing has gotten short enough that the variation in the length of the wires (and therefor the speed-of-light time for signals to get to the other end) and the speed of individual transistors varies enough to run up against the timing limits.

Log-structured file systems: There's one in every SSD

Posted Sep 28, 2009 15:31 UTC (Mon) by giraffedata (guest, #1954) [Link]

They described handling flash as if it was just ram with special write requirements.
That (at least to me) implied the need for a full memory bus (thus the lots of wires)

But the post described doing that with a PCI Express card. PCI-E is as serial as SATA.

by the way, parallel buses are inherently faster than serial buses, all else being equal.
if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH.

It sounds like you consider any gathering of multiple wires to be a parallel bus. That's not how people normally use the word; for example, when you run a trunk of 8 Ethernet cables between two switches, that's not a parallel bus. A parallel bus is where the bits of a byte travel on separate wires at the same time, as opposed to one wire at different times. Skew is an inherent part of it.

crosstalk, not wire length

Posted Oct 2, 2009 0:21 UTC (Fri) by gus3 (guest, #61103) [Link] (2 responses)

> if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH.

That is true, when the bus clock speed is slow enough to allow voltages and crosstalk between the wires to settle. However, as clock speeds approach 1GHz, crosstalk becomes a big problem.

> the problem with parallel buses at high speeds is that we have gotten fast enough that the timing has gotten short enough that the variation in the length of the wires ... and the speed of individual transistors varies enough to run up against the timing limits.

Wire length on a matched set of wires (whether it's cat5 with RJ-45 plugs, or a USB cable, IDE, SCSI, or even a VGA cable) has nothing to do with it. The switching speed on the transmission end can accomplish only so much, but there has to be some delay to allow the signal to settle onto the line. The culprit is the impedance present in even a single wire, that resists changes in current. The more wires there are in a bundle, the longer it takes the transmitted signal to settle across all the wires. By reducing the number of wires, the settling time goes down as well.

Related anecdote/urban legend: On the first day of a new incoming class, RAdm Grace Hopper would hold up a length of wire and ask how long it was. Most of the students would say "one foot", but the correct answer was "one nanosecond."

crosstalk, not wire length

Posted Oct 2, 2009 17:15 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

That's good information about transmitting signals on electrical wires, but it doesn't distinguish between parallel and serial protocols.

Crosstalk is a phenomenon on bundled wires, which exist in serial protocols too: each wire carries one serial stream. This wire configuration is common and affords faster total data transmission than parallel with the same number of wires.

Signals having to settle onto the line also happens in serial protocols as well as parallel.

Is there some way that crosstalk and settling affect skew between wires but not the basic signal rate capacity of each individual wire?

crosstalk, not wire length

Posted Oct 7, 2009 5:47 UTC (Wed) by gus3 (guest, #61103) [Link]

I had to think about the crosstalk vs. skew issue for a bit, but I think I can explain it. (N.B.: IANAEE; I Am Not An Electronics Engineer. But I did work with one for a couple years, and he explained this behavior to me.)

Take an old 40-conductor IDE cable, for example. Typically, it's flat; maybe it's bundled. Each type creates its own issues.

A flat cable, with full 40-bit skew, basically means that the bit transmitted on pin 1, can't be considered valid until the bit on pin 40 is transmitted, AND its signal settles. Or, with an 8-bit skew, bits 1, 9, 17, 25, and 33 aren't valid until bits 8, 16, 24, 32, and 40 are transmitted.

(IIRC, an 80-conductor cable compensated for this, using differential signaling, transmitting opposite signals on a pin pair, using lower voltages to do so. This permitted less crosstalk between bits, while speeding the signal detection at the other end. But I could be wrong on this.)

A bundled 40-conductor cable is a little better. Think about an ideal compaction: 1 wire in the center, 6 wires around it, 12 around those, 18 around those, and 3 more strung along somewhere. From an engineering view, this could mean bit 1, then bits 2-7 plus settling time, then bits 8-19, plus settling time, then bits 20-37 plus settling time, then bits 38-40 plus settling time. (This from an iterative programmer's mind-set. A scrambled bundle might be better, if an EE person takes up the puzzle.)

Now, consider a SATA bus. Eight wires: ground, 2 differential for data out, ground, 2 differential for data in, ground, and reference notch. Three ground lines, with the center ground isolating the input and output lines. Add to this the mirror-image polarity between input and output; the positive wires are most isolated from each other, while the negative wires are each next to the middle ground wire. The crosstalk between the positive input and positive output drops to a negligible level, and the negative lines, near the center ground, serve primarily for error checking (per my best guess).

I hope my visualization efforts have paid off for you. Corrections are welcome from anyone. Remember, IANAEE, and my info is worth what you pay for it. This stuff has been a hobby of mine for over 30 years now, but alas, it's only a hobby.

Log-structured file systems: There's one in every SSD

Posted Sep 22, 2009 18:30 UTC (Tue) by ttonino (guest, #4073) [Link]

I's rather see the intelligence and the file system in the drive exploited to make an object level store out of the drive. In essence it would present a (possibly very limited) file system to build real file systems on top of.

The real file system layer can then more easily handle things like striping and mirroring, which would involve writing the same block with the same identifier to multiple drives.

Maybe the object level store could support the use of directories. These could be applied in lieu of partitions.

Deleting an object would obviously free the used flash.

One advantage could be that the size of each object is preallocated, and that data that belongs in an object can be kept together by the drive. The current situation is that a large write gets spread over multiple free logical areas, and the drive may have problems to guess that these will be later deleted as a single entity.

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 20:48 UTC (Sun) by dwmw2 (subscriber, #2063) [Link] (1 responses)

"keep in mind that the same hardware standardization that you are blaming microsoft for is the same thing that let linux run on standard hardware."

Nonsense. Disk controllers aren't standardised — we have a multitude of different SCSI/IDE/ATA controllers, and we need drivers for each of them. You can't boot Linux on your 'standard' hardware unless you have a driver for its disk controller, and that's one of the main reasons why ancient kernels can't be used on today's hardware. Everything else would limp along just fine.

The disks, or at least the block device interface, might be fairly standard but that makes no real difference to whether you can run Linux on the system or not. It does help you share file systems between Linux and other operating systems, perhaps — but that's a long way from what you were saying.

NAND flash is fairly standard, although as with disks we have a multitude of different controllers to provide access to it. And even the bizarre proprietary things like the M-Systems DiskOnChip devices, with their "speshul" formats to pretend to be disks and provide INT 13h services to DOS, can be used under Linux quite happily. You don't need to implement their translation layer unless you want to dual-boot (although we have implemented it). You can use the raw flash just fine with flash file systems like JFFS2.

Log-structured file systems: There's one in every SSD

Posted Sep 20, 2009 23:28 UTC (Sun) by dlang (guest, #313) [Link]

_now_ linux has support for a huge number of devices.

but think about when linux started. at that point in time it started out with ST506 MFM support and vga video support.

that wasn't enough to drive the hardware optimally, but it was enough to talk to it.

similarly with the IDE/SATA controllers, most of them work with generic settings, but to get the most out of them you want to do the per-controller tweaks.

even in video, the nvidia cards can be used as simple VGA cards and get a display.

the harder you make it to get any functionality the harder it is to get to the point where the system works well enough to be used and start being tuned.