LWN: Comments on "Log-structured file systems: There's one in every SSD" https://lwn.net/Articles/353411/ This is a special feed containing comments posted to the individual LWN article titled "Log-structured file systems: There's one in every SSD". en-us Wed, 05 Nov 2025 19:40:00 +0000 Wed, 05 Nov 2025 19:40:00 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355809/ https://lwn.net/Articles/355809/ joern <div class="FormattedComment"> Of course I did. Thanks.<br> </div> Wed, 07 Oct 2009 11:27:09 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355794/ https://lwn.net/Articles/355794/ mcortese <em><blockquote> Plus, the programming of PCM is asynchronous. </blockquote></em> I guess you meant asymmetric? Wed, 07 Oct 2009 07:17:07 +0000 crosstalk, not wire length https://lwn.net/Articles/355771/ https://lwn.net/Articles/355771/ gus3 I had to think about the crosstalk vs. skew issue for a bit, but I think I can explain it. (N.B.: IANAEE; I Am Not An Electronics Engineer. But I did work with one for a couple years, and he explained this behavior to me.)<br><br> Take an old 40-conductor <a rel="nofollow" href="http://en.wikipedia.org/wiki/Parallel_ATA">IDE cable</a>, for example. Typically, it's flat; maybe it's bundled. Each type creates its own issues.<br><br> A flat cable, with full 40-bit skew, basically means that the bit transmitted on pin 1, can't be considered valid until the bit on pin 40 is transmitted, AND its signal settles. Or, with an 8-bit skew, bits 1, 9, 17, 25, and 33 aren't valid until bits 8, 16, 24, 32, and 40 are transmitted.<br><br> (IIRC, an 80-conductor cable compensated for this, using differential signaling, transmitting opposite signals on a pin pair, using lower voltages to do so. This permitted less crosstalk between bits, while speeding the signal detection at the other end. 
But I could be wrong on this.)<br><br> A bundled 40-conductor cable is a little better. Think about an ideal compaction: 1 wire in the center, 6 wires around it, 12 around those, 18 around those, and 3 more strung along somewhere. From an engineering view, this could mean bit 1, then bits 2-7 plus settling time, then bits 8-19, plus settling time, then bits 20-37 plus settling time, then bits 38-40 plus settling time. (This from an iterative programmer's mind-set. A scrambled bundle might be better, if an EE person takes up the puzzle.)<br><br> Now, consider a <a rel="nofollow" href="http://en.wikipedia.org/wiki/Serial_ATA">SATA bus</a>. Eight wires: ground, 2 differential for data out, ground, 2 differential for data in, ground, and reference notch. Three ground lines, with the center ground isolating the input and output lines. Add to this the mirror-image polarity between input and output; the positive wires are most isolated from each other, while the negative wires are each next to the middle ground wire. The crosstalk between the positive input and positive output drops to a negligible level, and the negative lines, near the center ground, serve primarily for error checking (per my best guess).<br><br> I hope my visualization efforts have paid off for you. Corrections are welcome from anyone. Remember, IANAEE, and my info is worth what you pay for it. This stuff has been a hobby of mine for over 30 years now, but alas, it's only a hobby. Wed, 07 Oct 2009 05:47:37 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355731/ https://lwn.net/Articles/355731/ ikm <div class="FormattedComment"> True. But it still might be better than the current TRIM implementation -- if not performance-wise, then at least compatibility-wise. 
The blocks could be zeroed in the background when the I/O is otherwise idle.<br> </div> Tue, 06 Oct 2009 19:59:42 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355667/ https://lwn.net/Articles/355667/ sethml <div class="FormattedComment"> The problem with this approach is that transferring zillions of zeros over the disk interface is <br> slow. Imagine if deleting a 10GB file took several minutes - that would be rather annoying. <br> Too bad SATA doesn't have a "write a million zeros" command. Of course, that's effectively <br> what a proper TRIM would do. <br> </div> Tue, 06 Oct 2009 16:45:39 +0000 Flash Translation Layer https://lwn.net/Articles/355559/ https://lwn.net/Articles/355559/ nix <div class="FormattedComment"> Your "use our product" might have worked better if you were providing raw <br> flash for the semi-mass-market (i.e. LWN readers, who are willing to pay <br> over the odds, but not hugely).<br> <p> However, you're providing... an apparently closed-source FTL. The very <br> thing that the article you followed up to is (rightly) criticising.
(Of <br> course if it *is* open source, well, that's nicer but we still need raw <br> flash to use it with.)<br> <p> Pardon me for thinking that you didn't read it very carefully.<br> <p> <p> </div> Tue, 06 Oct 2009 00:13:56 +0000 Flash Translation Layer https://lwn.net/Articles/355555/ https://lwn.net/Articles/355555/ flash-translation-layer <div class="FormattedComment"> I was developing SSDs and a flash-based file system, so I am very interested in this article.<br> <p> SSD is currently a small but growing industry; many flash-based devices (SSDs, SD cards, USB drives), including their IC controllers, are designed &amp; manufactured in China.<br> <p> Therefore I want readers of this article to know about the flash-based file system usage situation.<br> </div> Mon, 05 Oct 2009 23:39:02 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355541/ https://lwn.net/Articles/355541/ joern <div class="FormattedComment"> When talking to hardware people who want to market PCM, you may notice that it suffers from a similar problem to flash SSDs. Since there is no PCM software stack, they want to plug into an existing software stack by pretending to be something else. And from what I've heard so far, that something else will be NOR flash.<br> <p> Which isn't too bad an idea, honestly. 100M write cycles may seem big, but if you are ignorant enough to write your filesystem superblock on every sync, you can have that worn out in just 24h. So you still need some amount of wear leveling.<br> <p> Plus, the programming of PCM is asynchronous. Flipping a bit one direction is about 6x slower than flipping it the other way.
Which means that by treating your random-writeable PCM as block-eraseable flash you gain a speedup that can more than counter the slowdowns from garbage collection under fairly realistic conditions.<br> </div> Mon, 05 Oct 2009 20:48:10 +0000 Flash Translation Layer https://lwn.net/Articles/355445/ https://lwn.net/Articles/355445/ tmassey <div class="FormattedComment"> Did someone just create an account to spam LWN with? I don't think I've ever seen this before. Given the name of the account and the content of the message, it sure seems so...<br> <p> Wow, that's dedication to marketing your product! At least it's on-topic--better than V1@GR@ blog spamming! :)<br> </div> Mon, 05 Oct 2009 14:10:51 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/355245/ https://lwn.net/Articles/355245/ djcapelis <div class="FormattedComment"> IIRC some of the next generation SSDs based on phase-change memory and some other things don't require any of this madness.<br> <p> Samsung is doing phase-change memory at scale now. The capacity is smaller than we'd like, but it's actually here and in production finally.<br> <p> (To be fair, write cycles on PCM aren't infinite, but at 100mil cycles and writes on the bit level instead of the block level, PCM is a good deal that is likely to make SSDs less annoying in the future. PCM isn't the only type of new SSD that's coming out with this stuff.)<br> </div> Fri, 02 Oct 2009 17:58:07 +0000 crosstalk, not wire length https://lwn.net/Articles/355234/ https://lwn.net/Articles/355234/ giraffedata <p>That's good information about transmitting signals on electrical wires, but it doesn't distinguish between parallel and serial protocols. <p> Crosstalk is a phenomenon on bundled wires, which exist in serial protocols too: each wire carries one serial stream. This wire configuration is common and affords faster total data transmission than parallel with the same number of wires.
<p> Signals having to settle onto the line also happens in serial protocols as well as parallel. <p> Is there some way that crosstalk and settling affect skew between wires but not the basic signal rate capacity of each individual wire? Fri, 02 Oct 2009 17:15:02 +0000 crosstalk, not wire length https://lwn.net/Articles/355118/ https://lwn.net/Articles/355118/ gus3 > if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH.<br><br> That is true, when the bus clock speed is slow enough to allow voltages and crosstalk between the wires to settle. However, as clock speeds approach 1GHz, crosstalk becomes a big problem.<br><br> > the problem with parallel buses at high speeds is that we have gotten fast enough that the timing has gotten short enough that the variation in the length of the wires ... and the speed of individual transistors varies enough to run up against the timing limits.<br><br> Wire length on a matched set of wires (whether it's cat5 with RJ-45 plugs, or a USB cable, IDE, SCSI, or even a VGA cable) has nothing to do with it. The switching speed on the transmission end can accomplish only so much, but there has to be some delay to allow the signal to settle onto the line. The culprit is the impedance present in even a single wire, that resists changes in current. The more wires there are in a bundle, the longer it takes the transmitted signal to settle across all the wires. By reducing the number of wires, the settling time goes down as well.<br><br> Related anecdote/urban legend: On the first day of a new incoming class, RAdm Grace Hopper would hold up a length of wire and ask how long it was. Most of the students would say "one foot", but the correct answer was "one nanosecond." 
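As a quick sanity check of the "one nanosecond" anecdote above, here is a short sketch; the figures for the speed of light and the foot are exact, while the 0.66 cable velocity factor is an assumed typical value, not from the thread:

```python
# Back-of-the-envelope check of the "one foot = one nanosecond" wire anecdote.
C = 299_792_458   # speed of light in vacuum, m/s
FOOT = 0.3048     # metres per foot

vacuum_ns = FOOT / C * 1e9
print(f"1 ft at c:     {vacuum_ns:.2f} ns")   # roughly one nanosecond

# A signal in a real cable travels slower; 0.66 is an assumed typical
# velocity factor (it varies with cable construction).
cable_ns = FOOT / (0.66 * C) * 1e9
print(f"1 ft in cable: {cable_ns:.2f} ns")
```

So in vacuum a foot is very nearly a nanosecond, and in a real cable somewhat more, which is exactly the intuition the anecdote was meant to build.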
Fri, 02 Oct 2009 00:21:53 +0000 Flash Translation Layer https://lwn.net/Articles/355102/ https://lwn.net/Articles/355102/ flash-translation-layer The Zeeis <a rel="nofollow" href="http://www.zeeis.com/flash-translation-layer/" title="Flash Translation Layer"><strong>Flash Translation Layer</strong></a> has been integrated into over 160 million flash-based devices (SSDs, TransFlash cards, SD cards, CF cards, USB flash drives, MP3 players and mobile phones), with over 62% market share in China in 2008. Thu, 01 Oct 2009 21:50:56 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/354994/ https://lwn.net/Articles/354994/ victusfate <div class="FormattedComment"> So I joined LWN.net just so I could comment on this post. First off, as a <br> newcomer to the hardware end of things, this post was not only informative <br> but accessible to new readers. I'm more curious now about memory handling <br> than ever before.<br> <p> As a long-time coder, my memory concerns ended with mallocs/frees/ <br> news/deletes, and eventually I just forgot them inside of other higher-level data <br> structures that cleaned themselves up when out of scope.<br> <p> Do you think it would be possible to write something similar up for NTFS or <br> other formats? Is this article strictly Unix-centric, so that Windows SSD <br> formats have their own particulars? I'd love to enhance a goofy squidoo <br> lens I wrote up about SSDs a while back with REAL detailed information.
<br> (here's the link for the curious <br> <a rel="nofollow" href="http://www.squidoo.com/KingstonSSDNowSolidStateHardDrive">http://www.squidoo.com/KingstonSSDNowSolidStateHardDrive</a>)<br> <p> I keep thinking about the guy that had his OS running on two dozen or so <br> SSD raid 0 (youtube video: <a rel="nofollow" href="http://www.youtube.com/watch?v=96dWOEa4Djs">http://www.youtube.com/watch?v=96dWOEa4Djs</a>)<br> <p> <p> </div> Thu, 01 Oct 2009 11:44:35 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/354621/ https://lwn.net/Articles/354621/ giraffedata <blockquote> They described handling flash as if it was just ram with special write requirements. <p> That (at least to me) implied the need for a full memory bus (thus the lots of wires) </blockquote> <p> But the post described doing that with a PCI Express card. PCI-E is as serial as SATA. <blockquote> by the way, parallel buses are inherently faster than serial buses, all else being equal. <p> if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH. </blockquote> <p> It sounds like you consider any gathering of multiple wires to be a parallel bus. That's not how people normally use the word; for example, when you run a trunk of 8 Ethernet cables between two switches, that's not a parallel bus. A parallel bus is where the bits of a byte travel on separate wires at the same time, as opposed to one wire at different times. Skew is an inherent part of it. 
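The skew point above can be made concrete with a toy calculation; the per-wire delays below are invented numbers purely for illustration, not measurements of any real bus:

```python
# Hypothetical per-wire arrival delays (ns) on an 8-bit parallel bus.
wire_delays_ns = [1.00, 1.10, 0.90, 1.30, 1.05, 1.20, 0.95, 1.25]

# The byte is valid only once the slowest bit has arrived and settled.
word_valid_ns = max(wire_delays_ns)
skew_ns = max(wire_delays_ns) - min(wire_delays_ns)

# The same fixed skew is harmless at low clock rates but consumes a
# large fraction of the bit period as the clock approaches 1 GHz.
for clock_hz in (10e6, 100e6, 1e9):
    period_ns = 1e9 / clock_hz
    print(f"{clock_hz / 1e6:6.0f} MHz: bit period {period_ns:7.1f} ns, "
          f"skew uses {100 * skew_ns / period_ns:5.1f}% of it")
```

With these made-up delays, 0.4 ns of skew is 0.4% of a 100 ns bit period at 10 MHz, but 40% of a 1 ns period at 1 GHz — which is why skew only becomes a limiting factor on fast parallel buses.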
Mon, 28 Sep 2009 15:31:45 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/354402/ https://lwn.net/Articles/354402/ dlang <div class="FormattedComment"> they described handling flash as if it was just ram with special write requirements.<br> <p> that (at least to me) implied the need for a full memory bus (thus the lots of wires)<br> <p> by the way, parallel buses are inherently faster than serial buses, all else being equal.<br> <p> if 1 wire lets you transmit data at speed H, N wires will let you transmit data at a speed of NxH.<br> <p> the problem with parallel buses at high speeds is that we have gotten fast enough that the timing has gotten short enough that the variation in the length of the wires (and therefore the speed-of-light time for signals to get to the other end) and the speed of individual transistors varies enough to run up against the timing limits.<br> </div> Fri, 25 Sep 2009 22:57:05 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/354394/ https://lwn.net/Articles/354394/ giraffedata <blockquote> it is very expensive to run all the wires to connect things via a parallel bus, that is why drive interfaces have moved to serial busses </blockquote> <p> That's not why drive interfaces (and every other data communication interface in existence) have moved to serial. They did it to get faster data transfer. <p> But I'm confused as to the context anyway, because the ancestor posts don't mention parallel busses.
Fri, 25 Sep 2009 22:32:27 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/354311/ https://lwn.net/Articles/354311/ knweiss <div class="FormattedComment"> FWIW they have separate read and write caches (both SSD).<br> <p> </div> Fri, 25 Sep 2009 14:38:49 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353839/ https://lwn.net/Articles/353839/ dwmw2 Not really.<P> The 'read and then reprogram elsewhere from internal buffer' facility is all very well in theory, but your ECC is <em>off-chip</em>. So if you want to be able to detect and correct ECC errors as you're moving the data, rather than allowing them to propagate, then you need to do a proper read and write instead.<P> Linux has never bothered to use the 'copy page' operation on NAND chips which support it. Wed, 23 Sep 2009 00:48:06 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353793/ https://lwn.net/Articles/353793/ ttonino <div class="FormattedComment"> I'd rather see the intelligence and the file system in the drive exploited to make an object-level store out of the drive. In essence it would present a (possibly very limited) file system to build real file systems on top of.<br> <p> The real file system layer can then more easily handle things like striping and mirroring, which would involve writing the same block with the same identifier to multiple drives.<br> <p> Maybe the object-level store could support the use of directories. These could be applied in lieu of partitions.<br> <p> Deleting an object would obviously free the used flash.<br> <p> One advantage could be that the size of each object is preallocated, and that data that belongs in an object can be kept together by the drive.
The current situation is that a large write gets spread over multiple free logical areas, and the drive may have trouble guessing that these will later be deleted as a single entity.<br> </div> Tue, 22 Sep 2009 18:30:04 +0000 Is there a better way to use flash memory? https://lwn.net/Articles/353753/ https://lwn.net/Articles/353753/ jzbiciak <P>Oh, and I forgot to mention: The width of a row could actually be pretty wide for a large storage array in a single chip. One embedded microcontroller I use has a row width of 192 bytes. (It's a 24-bit wide memory, hence the weird number.) I could imagine the row being much, much wider in a higher density flash such as what SSDs are made of.</P> <P>Still, redundant row compression seems like an interesting idea to me. I'm not sure what you'd store the reverse map in, though, to make it effective, since that too needs to be stored somewhere non-volatile. This is where having a multi-tiered storage setup (volatile RAM + non-volatile RAM + flash) could be really interesting as compared to just having a PC + flash.</P> Tue, 22 Sep 2009 16:00:34 +0000 Is there a better way to use flash memory? https://lwn.net/Articles/353720/ https://lwn.net/Articles/353720/ jzbiciak <P>My understanding and experience is that flash is rather similar to EPROM. You erase the entire erase block, sending it to all 1s. This is an indivisible operation&mdash;the whole block gets clobbered, and there's no way to clobber only a section of it. Then, over whatever period of time is convenient to you, you fill in sections of that erase block with live data. The size of the section you have to fill in at a time is governed by the width of the memory, since a programming pulse has to be applied for all of the bits across the width of the memory, but you only have to program one row.
So, erasure erases a group of rows, and then you can fill the rows in at your leisure.</P> <P>If your ECC lives within the same row as your data, then your ECC encoding doesn't really matter. Since row writes are atomic, the fact that ECC bits toggle back and forth as you monotonically clear 1s to 0s in your data bits doesn't matter. You have to present your data and ECC in parallel when you write the row. Typical ECCs such as <A HREF="http://en.wikipedia.org/wiki/Reed_Solomon">Reed-Solomon</A> are built around this block principle.</P> <P>(Now here's where I don't know how similar EPROMs and flash are: You could keep reprogramming the same row as long as you only flip 1s to 0s, which is where your initial idea becomes relevant. At least one flash-based embedded device I've used tells me to never program a row more than twice without an intervening erase, which suggests there may be an issue with storing too much charge on the floating gate, which in turn could physically damage the gate. That charge is what makes a 1 turn into a 0. Old school EPROMs were a bit more durable in this regard. But, then, you also blast them with bright UV for 15-30 minutes to erase them.)</P> <P>If the rows are fine enough granularity, you could in theory encode the data, a version number and an ECC in that row, and do some sort of delta-update. If only a few bytes in a block changed, there's no reason to store an entire new copy of the whole block. Only store the changed rows. This would provide great compression for certain types of updates, such as appending to a file or doing filesystem metadata updates (i.e. ext2 block-bitmap updates, where only some of the bits in the bitmap flip).</P> <P>If you also included an internal map that hashed all the data rows into a reverse map database, you could use that to quickly collapse all of the identical rows across the entire drive into a single row.
That is, whenever you decide to go store a particular row of data, find out if that row already exists on the physical media and instead point to that. For typical storage patterns (i.e. lots of similar text across many files due to duplicated files, lots of end-of-block empty fill, etc.), this could result in a huge on-disk savings. That savings would then directly translate to a larger erase block pool for the same apparent loading vs. advertised capacity.</P> Tue, 22 Sep 2009 15:55:52 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353713/ https://lwn.net/Articles/353713/ k8to <div class="FormattedComment"> I think the idea is they are physically at 100% capacity, even if not logically so. That is, they've probably hit full at some point. <br> <p> Well, that's my guess, since it makes the most sense.<br> </div> Tue, 22 Sep 2009 15:15:16 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353677/ https://lwn.net/Articles/353677/ nye <div class="FormattedComment"> This is indeed obvious now that I'm awake (somewhat embarrassingly so :P).<br> I think the bit that I was somehow missing was that 'reading' from an unmapped block should just return all zeroes.<br> <p> But thanks to both for making it explicit.<br> </div> Tue, 22 Sep 2009 11:08:58 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353665/ https://lwn.net/Articles/353665/ butlerm <div class="FormattedComment"> Apparently Micron's flash chips have the ability to internally move data <br> around without having it leave the chip. No doubt very useful in this <br> application.<br> </div> Tue, 22 Sep 2009 05:22:21 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353637/ https://lwn.net/Articles/353637/ ikm <div class="FormattedComment"> <font class="QuotedText">&gt; I wondered about this, but how would it know that you didn't really want to store all zeroes in that block? 
</font><br> <p> Because it's the same thing. There's no point in actually storing zeroes -- you could just mark the block as unused instead, so if later read, it would read back as zeroes.<br> <p> <font class="QuotedText">&gt; When you read back from it you might find that it's been re-used in the meantime and filled with something else</font><br> <p> The SSD maintains a mapping between logical disk blocks and physical flash blocks. If a logical block is marked as unmapped, it just doesn't use any physical space at all. The SSD can't reuse your logical block since it doesn't "use" logical blocks -- it uses physical flash blocks.<br> <p> <font class="QuotedText">&gt; the SSD firmware would complain that the (logical) block you are requesting doesn't currently map to any (physical) block.</font><br> <p> It would return zeroes, since the block is empty (unmapped). And that's the intent of it.<br> </div> Mon, 21 Sep 2009 20:22:34 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353604/ https://lwn.net/Articles/353604/ farnz <p>It takes a little intelligence on the part of the SSD designer. Reads from a logical block not currently mapped to any physical flash must be defined as returning all zeros; once you've done that, you can treat writing a logical block with all zeros as a "free" operation, not a write. The "free" operation just marks that logical block as not mapped to any physical flash. <p>You then know you don't need to hang onto the data that used to be important to that logical block; initially, it's still there, but when you garbage collect the physical block that used to contain the bytes from that logical block, you don't bother copying the data.
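The zero-write-as-free scheme described in this thread can be sketched as a toy model; the class and names below are invented for illustration and stand in for real firmware's logical-to-physical mapping:

```python
BLOCK_SIZE = 4096
ZERO_BLOCK = bytes(BLOCK_SIZE)

class ToyFTL:
    """Toy flash translation layer: an all-zero write is treated as a
    "free" that unmaps the logical block, and reads of unmapped blocks
    are defined to return all zeros."""

    def __init__(self):
        # logical block number -> contents (stands in for a physical block)
        self.mapping = {}

    def write(self, lbn, data):
        assert len(data) == BLOCK_SIZE
        if data == ZERO_BLOCK:
            # A "free", not a write: drop the mapping so the old physical
            # block can later be garbage collected without copying its data.
            self.mapping.pop(lbn, None)
        else:
            self.mapping[lbn] = data

    def read(self, lbn):
        # Unmapped logical blocks read back as all zeros.
        return self.mapping.get(lbn, ZERO_BLOCK)

ftl = ToyFTL()
ftl.write(7, b"x" * BLOCK_SIZE)
ftl.write(7, ZERO_BLOCK)   # frees block 7 instead of storing zeros
assert ftl.read(7) == ZERO_BLOCK and 7 not in ftl.mapping
```

The key design point, as the comments note, is the first line of the contract: reads of unmapped blocks must be defined as returning zeros, which is what makes the all-zero write indistinguishable from a stored block of zeros.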
Mon, 21 Sep 2009 16:35:36 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353594/ https://lwn.net/Articles/353594/ nye <div class="FormattedComment"> I wondered about this, but how would it know that you didn't really want to store all zeroes in that block? When you read back from it you might find that it's been re-used in the meantime and filled with something else, or more likely the SSD firmware would complain that the (logical) block you are requesting doesn't currently map to any (physical) block.<br> <p> I'm getting pretty tired this afternoon - am I missing something obvious?<br> </div> Mon, 21 Sep 2009 16:14:06 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353568/ https://lwn.net/Articles/353568/ nix <blockquote> on modern systems, it's impossible to directly access a particular byte of ram. the memory chips actually act more like tape drives, it takes a significant amount of time to get to the start position, then it's cheap to do sequential read/writes from that point forward. your cpu uses this to treat your ram as if it was a tape device with blocks the size of your cpu cache lines (64-256 bytes each) </blockquote> That's almost entirely inaccurate, I'm afraid. Ulrich Drepper's article on memory puts it better, in <a href="http://lwn.net/Articles/250967/">section 2.2.1</a>. <p> The memory <i>is</i> necessarily read in units of cachelines, and it takes a significant amount of time to load uncached data from main memory, and of course it takes time to latch RAS and CAS, but main memory itself has a jagged access pattern, with additional delays from precharging and so on whenever RAS has to change. <p> But that doesn't make it like a tape drive, it's still random-access: it takes the same time to jump one row forwards as to jump fifty backwards. 
It's just that the units of this random access are very strange, given that they're dependent on the physical layout of memory in the machine (not machine words and possibly not cachelines), and are shielded from you by multiple layers of caching. Mon, 21 Sep 2009 09:10:08 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353569/ https://lwn.net/Articles/353569/ mjthayer <div class="FormattedComment"> And if a DIY device could be made to work reasonably well, then some company would be sure to start using the software in a commercial device, which would have a price advantage on the market. Then just make sure that some of the critical components are GPLv3, and you have a load of devices that will let you update their firmware for further experiments.<br> </div> Mon, 21 Sep 2009 09:09:04 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353567/ https://lwn.net/Articles/353567/ butlerm <div class="FormattedComment"> PCI-E really isn't a parallel bus - it is a multi-lane differential serial bus. <br> You can use a single lane if you want to - 500 MB/sec per lane today, 1 GB/s <br> soon. <br> <p> The good thing about PCI-E for this application is that you can use simple <br> external cables, so you can easily locate your flash units in a different <br> chassis than the CPU. External PCI-E connections are being used for external <br> disk arrays already. *Much* faster than SAS with more than one lane.<br> <p> Flash is not exactly a byte-addressable memory technology, btw, so you still <br> need to DMA to and from host memory.<br> </div> Mon, 21 Sep 2009 08:30:18 +0000 Is there a better way to use flash memory? https://lwn.net/Articles/353564/ https://lwn.net/Articles/353564/ dlang <div class="FormattedComment"> I've had (and posted) thoughts along the same lines.
the last time I posted them one person responded that checksums and ECC codes could prevent writing to small portions of flash<br> <p> I think it's an idea worth investigating (it could trade some space for reduced erase cycles, being especially effective in metadata), but it would require either a smart drive (that doesn't move a block if it can be modified in place from the current to the desired value) or raw access to the flash.<br> </div> Mon, 21 Sep 2009 06:40:41 +0000 Is there a better way to use flash memory? https://lwn.net/Articles/353558/ https://lwn.net/Articles/353558/ PaulWay <div class="FormattedComment"> One thing I've wondered with flash memory is whether there's a better way of using it than 'blank the entire block when you write it'. If one imagines every bit in a byte being represented by the parity on a byte of flash - so that 11111111 has parity 0, 11111110 has parity 1, etc. - then toggling that bit is equivalent to zeroing out one more bit from the flash byte. There are probably other, much more intelligent algorithms for spreading one source byte out over multiple target bytes such that a change of value in the source is simply a process of zeroing out (or one-ing out) certain bits in the target.<br> <p> Another approach might be to treat each block as a miniature log - a megabyte block on flash equals 64K (call it a 'chunk') of filesystem. Each time you write to that 64K chunk you copy the chunk into the next available 64K of space on the block. When the block is full, you flush it and rewrite the chunk at the start of the block. In this way, sixteen writes to the chunk can occur before you have to rewrite the whole block, reducing the wear on that block (and thus the chance that that chunk of filesystem will fail).
An optimal size might be 1KB chunks per 1MB block, where the first 1KB is used in the 'parity' fashion above to notate where the current 1KB is in that block, giving you 1023 possible rewrites before you have to flush the block and rewrite it from scratch.<br> <p> As many people in this article's comments and elsewhere have said, flash memory is incredibly cheap. I think there's a body of people who would pay more for solid state disk space for it to behave with exactly the same MTBF characteristics as a regular spinning-rust hard disk.<br> <p> Have fun,<br> <p> Paul<br> </div> Mon, 21 Sep 2009 06:19:45 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353551/ https://lwn.net/Articles/353551/ dlang <div class="FormattedComment"> the thing is that even main memory access isn't really 'memory like' anymore.<br> <p> on modern systems, it's impossible to directly access a particular byte of ram. the memory chips actually act more like tape drives, it takes a significant amount of time to get to the start position, then it's cheap to do sequential read/writes from that point forward.<br> <p> your cpu uses this to treat your ram as if it was a tape device with blocks the size of your cpu cache lines (64-256 bytes each)<br> <p> <p> In addition, if you want to have checksums to detect problems you need to define what size the chunks of data are that you are doing the checksum over.<br> </div> Mon, 21 Sep 2009 01:03:41 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353550/ https://lwn.net/Articles/353550/ drag <div class="FormattedComment"> Ya. I was just trying to show what it would be like to try to access a large amount of flash memory in a 'raw mode'. <br> <p> People are kinda confused, I think, about the whole block vs flash thing. <br> <p> The way I see it Flash memory, in a lot of ways, is much more like memory than block devices. Instead of thinking of it as block devices with no seek time... 
think of it more like memory with peculiar requirements that make writes very expensive.<br> </div> Mon, 21 Sep 2009 00:38:58 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353549/ https://lwn.net/Articles/353549/ dlang <div class="FormattedComment"> we know that they do include more flash than they report, but it still takes a significant amount of time to erase it, so the problem doesn't quite hit the worst-case scenario, but it's still not pretty.<br> </div> Sun, 20 Sep 2009 23:37:32 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353548/ https://lwn.net/Articles/353548/ dlang <div class="FormattedComment"> it is very expensive to run all the wires to connect things via a parallel bus, that is why drive interfaces have moved to serial busses<br> <p> trying to map your flash storage into your address space is going to be expensive, and it also isn't very portable from system to system.<br> <p> it's already hard to get enough slots for everything needed in a server, dedicating one to flash is a hard decision, and low-end systems have even fewer slots.<br> <p> there is one commercial company making PCI-E based flash drives, their cost per drive is an order of magnitude higher than the companies making SATA based devices, they also haven't been able to get their device drivers upstream into the kernel so users are forced into binary-only drivers in their main kernel. 
this is significantly worse than the SATA interface version because now mistakes in the driver can clobber the entire system, not just the storage device.<br> <p> sorry, I don't buy into the 'just connect the raw flash and everything will be good' reasoning.<br> </div> Sun, 20 Sep 2009 23:36:13 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353547/ https://lwn.net/Articles/353547/ dlang <div class="FormattedComment"> _now_ linux has support for a huge number of devices.<br> <p> but think about when linux started: it started out with ST506 MFM support and vga video support. <br> <p> that wasn't enough to drive the hardware optimally, but it was enough to talk to it.<br> <p> similarly with the IDE/SATA controllers: most of them work with generic settings, but to get the most out of them you want to do the per-controller tweaks.<br> <p> even in video, the nvidia cards can be used as simple VGA cards and get a display.<br> <p> the harder you make it to get any functionality, the harder it is to get to the point where the system works well enough to be used and start being tuned.<br> </div> Sun, 20 Sep 2009 23:28:32 +0000 Log-structured file systems: There's one in every SSD https://lwn.net/Articles/353546/ https://lwn.net/Articles/353546/ dwmw2 <BLOCKQUOTE><I>"keep in mind that the same hardware standardization that you are blaming microsoft for is the same thing that let linux run on standard hardware."</I></BLOCKQUOTE> Nonsense. Disk controllers aren't standardised &mdash; we have a multitude of different SCSI/IDE/ATA controllers, and we need drivers for each of them. You can't boot Linux on your 'standard' hardware unless you have a driver for its disk controller, and that's one of the main reasons why ancient kernels can't be used on today's hardware. Everything else would limp along just fine.
<P> The <em>disks</em>, or at least the block device interface, might be fairly standard, but that makes no real difference to whether you can run Linux on the system or not. It does help you share file systems between Linux and other operating systems, perhaps &mdash; but that's a long way from what you were saying. <P> NAND flash <em>is</em> fairly standard, although as with disks we have a multitude of different controllers to provide access to it. And even the bizarre proprietary things like the M-Systems DiskOnChip devices, with their "speshul" formats to pretend to be disks and provide INT 13h services to DOS, can be used under Linux quite happily. You don't need to implement their translation layer unless you want to dual-boot (although we <em>have</em> implemented it). You can use the raw flash just fine with flash file systems like JFFS2. Sun, 20 Sep 2009 20:48:41 +0000
Any company hoping to replace NAND-over-SATA would have to supply a Windows filesystem of equivalent quality to use with their product. Filesystems can take years to become truly stable. In the meantime, all those angry users will be banging on *your* door.<br> <p> Microsoft might create a log-structured filesystem for Windows UltraVista Aquamarine Edition (or whatever their next release will be called), but I doubt it. I just don't see what's in it for them. It's more likely that we'll see another one of those awkward compromises that the PC industry is famous for. Probably the next version of SATA will include some improvements to make NAND-over-SATA more bearable. And filesystem developers will just have to learn to live with having another layer of abstraction between them and the real hardware. Perhaps we'll finally get bigger blocks (512 bytes is way too small for flash).<br> <p> C.<br> </div> Sun, 20 Sep 2009 19:43:13 +0000
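As an aside on the 1KB-chunk-per-1MB-block marker scheme suggested in Paul's comment above: a minimal C sketch, assuming idealized NAND semantics in which programming can only clear bits (1 to 0) and only a whole-block erase sets them back to 1. The layout, the function names, and the one-marker-byte-per-rewrite encoding are all hypothetical illustrations, not anything from a real flash translation layer:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE (1024 * 1024)            /* 1 MB erase block */
#define CHUNK_SIZE 1024                     /* 1 KB chunks */
#define NCHUNKS    (BLOCK_SIZE / CHUNK_SIZE)  /* 1024 */
/* Chunk 0 holds marker bytes; chunks 1..1023 hold data, so the
 * block survives 1023 rewrites before it must be erased. */

/* Count cleared marker bytes to find which data chunk is current.
 * An erased block is all 0xFF; marker byte i programmed to 0x00
 * means "data chunk i+1 has been written". */
static int current_chunk(const uint8_t *block)
{
    int n = 0;
    while (n < NCHUNKS - 1 && block[n] == 0x00)
        n++;
    return n;   /* 0 = block still empty, else data lives in chunk n */
}

/* Rewrite: program the next free data chunk, then clear its marker
 * byte (bits only go 1 -> 0, so no erase cycle is needed).
 * Returns the chunk used, or -1 when the block must be erased. */
static int rewrite(uint8_t *block, const uint8_t *data)
{
    int cur = current_chunk(block);
    if (cur >= NCHUNKS - 1)
        return -1;                             /* out of spare chunks */
    memcpy(block + (size_t)(cur + 1) * CHUNK_SIZE, data, CHUNK_SIZE);
    block[cur] = 0x00;                         /* advance the marker */
    return cur + 1;
}
```

Finding the current chunk is a linear scan here; a real implementation would presumably binary-search the marker region or cache the position in RAM.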