Samsung's F2FS filesystem

Posted Oct 10, 2012 12:02 UTC (Wed) by cgrey8 (guest, #87131)
In reply to: Samsung's F2FS filesystem by jpfrancois
Parent article: Samsung's F2FS filesystem

This new FS is interesting to me because I work for a company that is currently developing a new product line that will be based on DDR3 & some kind of NAND flash. Historically, our products have always been NOR flash and battery-backed SRAM or SDRAM using an RTOS. But we recognize how limiting this is and how liberating it will be to function in a Linux environment. But this whole paradigm is new to us, and everything we've researched about raw NAND is, well, scary, making eMMC very compelling. What we've heard is:
1. Raw NAND changes a lot, requiring near-constant requalification of parts as vendors change their manufacturing processes.

2. Raw NAND requires both hardware ECC correction in your microcontroller (not a problem for the ARM Cortex-based TI AM335x we plan to use) and a filesystem to handle the wear leveling. While UBIFS and others are probably decent at that, doing the latter in software has been shown to take up considerable processing power on the processor we've selected.

In contrast, eMMC abstracts many details away from us and our processor, so we don't have to deal quite so directly with them or take the performance hit. But we aren't so naive as to believe that the issues are gone. We are still in the very early stages of R&D on this project, so we have the opportunity to architect our firmware in a way that works best with the eMMC. What we don't have a solid handle on is exactly what that will require of us.

We have identified that we have data that doesn't change often (U-Boot, the Linux kernel, most of the filesystem), but other aspects of what we do will write data far more often than they read it, for trend-logging purposes. This would suggest breaking the eMMC into two partitions: one dedicated to the stuff that doesn't change often, and the other with LOTS of free space and configured for high durability (pseudo-SLC mode). But this is relatively high level. We'd like to know more about the best practices AND worst practices when working with eMMC in an embedded Linux environment.
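
For what it's worth, here's the rough sizing arithmetic we've been sketching, as a small Python snippet. Every number in it, including the assumption that pseudo-SLC mode costs 2x the raw MLC capacity, is an illustrative placeholder rather than a vendor figure:

    # Rough sizing sketch for a two-partition eMMC layout.
    # All numbers are placeholders, not vendor specifications.

    GIB = 1024 ** 3

    emmc_capacity      = 16 * GIB   # advertised MLC capacity
    static_partition   = 2 * GIB    # U-Boot, kernel, root filesystem
    slc_capacity_ratio = 2          # assume pseudo-SLC costs 2x raw MLC space

    # Raw MLC space left after the static partition, and how much of it is
    # usable if it is configured as a pseudo-SLC area for trend logging.
    raw_left      = emmc_capacity - static_partition
    log_partition = raw_left // slc_capacity_ratio

    print("static partition: %.1f GiB" % (static_partition / GIB))
    print("log partition   : %.1f GiB usable in pseudo-SLC mode" % (log_partition / GIB))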

TI's Arago Linux is up to 3.2.x and will likely be at 3.4.x by the time we actually deploy the product to market, so I would assume that getting things like TRIM is a given. Is that a bad assumption? If so, we will need to look into reconfiguring the kernel build to include it, if TRIM is a feature commonly excluded from embedded Linux kernel builds.
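
As a quick sanity check, something like the following sketch would at least tell us whether the kernel and device are advertising discard support before we rely on it. The mmcblk0 device name and the exact sysfs attributes are assumptions about a typical eMMC setup, not anything Arago-specific:

    # Check whether a block device advertises discard (TRIM) support.
    # The device name mmcblk0 is an assumption; adjust for your board.

    from pathlib import Path

    dev = "mmcblk0"
    for attr in ("discard_granularity", "discard_max_bytes"):
        path = Path("/sys/block") / dev / "queue" / attr
        try:
            value = int(path.read_text().strip())
        except OSError:
            print("%s: attribute missing (old kernel or no such device)" % attr)
            continue
        state = "supported" if value > 0 else "not supported"
        print("%s = %d (%s)" % (attr, value, state))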

But more to the point of this thread, the fact that this new FS from Samsung was designed by makers of eMMC to be "flash friendly" is inviting. But before we jump on a bandwagon, I, like others on this site, would like more details on exactly what this new FS does differently from others (e.g. ext4) to improve performance and reduce the write amplification of random writes to eMMC, including what "tunable" parameters are available to us. And how would we know when we need to tinker with them?



Samsung's F2FS filesystem

Posted Oct 10, 2012 14:49 UTC (Wed) by dwmw2 (subscriber, #2063)

"What we've heard is:

1. Raw NAND changes a lot requiring near constant requalifying of parts as vendors change their manufacturing processes.

There's a certain amount of that, but not nearly as much as with eMMC devices. As I mentioned elsewhere, those have been known to change the whole of their internals — both the FTL implementation and all of the hardware, microcontroller and all — without even changing the model number. It's just a completely different device, with completely different performance and reliability characteristics, from one batch to the next. But because it's still a black box which pretends to be a 16GiB 'disk', they don't see the need to change the model number.

At least with NAND devices, the issue is mostly just that your board manufacturer finds alternative suppliers for the "same" device so you need to requalify timings, etc. Unless you have a particularly recalcitrant ODM, you should at least get some notice of that kind of thing rather than finding it out post-build-run.

I know there are some people claiming that NAND technology changes too fast, and it's impossible for software to keep up. I'm not sure if that's what you're referring to above, but I'll bring it up so that I can point out that it isn't really true; the people who say this are not speaking from a position of experience. Erase block and page sizes have grown over the years, and the number of writes you can make to a given page has decreased from about 10 writes to a 512-byte page in the 1990s, to a single write (or less than that with paired pages) now. And you need stronger ECC as the underlying technology gets less reliable (especially with MLC). But fundamentally, it's pretty much the same. NAND support was retrofitted to JFFS2 after it had been designed to work with only NOR flash, and the same NAND support is just as valid on today's flash as it was then.

"2. Raw NAND required both hardware ECC correction in your microcontroller (not a problem for the ARM Cortex based TI AM335x we plan to use), but also requires a FS to handle the wear leveling. While UBIFS and others are probably decent at that, the act of doing that in software has been shown to take up considerable processing power on the processor we've selected."
Most decent SoCs have hardware ECC support on the NAND controller so that shouldn't suck CPU time. And with DMA support, even garbage collection shouldn't take time. If you have specific examples of UBIFS taking "considerable processing power" it'd be interesting to analyse them and see where it's being spent. Note that it'll be garbage collection, not wear levelling per se, which takes the time. Wear levelling is mostly just a matter of choosing which victim blocks to garbage collect from; in itself it isn't time-consuming. Occasionally it means you choose a completely clean block as a victim, and move its entire contents to another physical block. But that should be rare.
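
To make that distinction concrete, here's a toy Python sketch of wear levelling folded into garbage-collection victim selection. It's just the shape of the idea, not UBIFS's actual algorithm, and the numbers are arbitrary:

    # Toy illustration: wear levelling as a tweak to GC victim selection.
    # Not UBIFS's real algorithm; just the shape of the idea.

    import random

    class Block:
        def __init__(self, num):
            self.num = num
            self.erase_count = random.randint(0, 2000)
            self.valid_pages = random.randint(0, 64)    # out of 64 pages

    def pick_victim(blocks, wear_delta=1000):
        # Normal case: pick the block with the least valid data, so that
        # garbage collection has to copy as little as possible.
        cheapest = min(blocks, key=lambda b: b.valid_pages)

        # Wear levelling: if some block has been erased far less often than
        # the most-worn block, pick it instead, even if it is full of valid
        # data. Moving its contents frees a "young" block for future writes.
        youngest = min(blocks, key=lambda b: b.erase_count)
        oldest   = max(blocks, key=lambda b: b.erase_count)
        if oldest.erase_count - youngest.erase_count > wear_delta:
            return youngest
        return cheapest

    victim = pick_victim([Block(n) for n in range(1024)])
    print("GC victim: block %d (%d valid pages, %d erases)"
          % (victim.num, victim.valid_pages, victim.erase_count))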

Samsung's F2FS filesystem

Posted Oct 10, 2012 15:50 UTC (Wed) by cgrey8 (guest, #87131)

Some of these questions and comments are getting off topic from Samsung's F2FS, but they're related enough that I'll post them anyway. If they need to be broken out into their own threads, let me know.

We don't have a lot of information on this at all. In fact, all we have is from the TI website. Here's an entry from their wiki that indicates UBIFS has a significant performance impact on the processor, but doesn't elaborate:
http://processors.wiki.ti.com/index.php/AM335x-PSP_04.06....

Lower down, the same wiki article discusses MMC using ext2 and shows an increase in write bandwidth as well as an order of magnitude less processor load compared to raw NAND:
http://processors.wiki.ti.com/index.php/AM335x-PSP_04.06....

The wiki doesn't specifically mention a difference between eMMC and external SD/MMC, although I believe they are actually two different peripherals in the controller (don't quote me on that). The point is, this is where we got our information from.

It is interesting that you mention paired pages. That's new to me. I was aware that the partial page writes common in SLC had gone the way of the dodo with MLC, but I wasn't aware of manufacturers taking it a step further with paired pages. Forgive my ignorance, but what's the difference between paired 4kByte pages and a single 8kByte page size?

It's also useful for us to be aware that eMMC parts can change even within the same part number. Do you care to share who the notorious offenders are? Feel free to email me if you'd rather not share publicly (cgrey AT automatedlogic DOT com).

We plan on grossly over-sizing the flash parts relative to the space we plan to consume, so we can make heavy use of wear leveling to get the durability we need. But what you are suggesting is that we could settle on a part that proves to work well for us because it can handle, say, 50k write/erase cycles, and then a year later the same part number only gets us 20k write/erase cycles. I would hope that even if the physical NAND suffers a decrease of this magnitude, the manufacturer would, behind the scenes, add more physical flash to wear-level across, even though they only expose the advertised capacity. Even if they did this, I can see how that could change the realized durability of the part. It would be nice if they notified customers of these types of changes and marked or versioned their parts, so that identified problems could be bound to a specific batch for recall purposes with our customers.
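
To put numbers on that worry, here's the kind of back-of-the-envelope estimate we're making. The partition size, daily write volume, and write amplification factor below are purely illustrative assumptions:

    # Back-of-the-envelope endurance estimate. All inputs are illustrative
    # assumptions, not measurements of any particular part.

    GIB = 1024 ** 3

    def lifetime_years(capacity_bytes, pe_cycles, host_writes_per_day, write_amp):
        total_endurance = capacity_bytes * pe_cycles        # bytes the NAND can absorb
        nand_writes_per_day = host_writes_per_day * write_amp
        return total_endurance / nand_writes_per_day / 365.0

    log_partition = 8 * GIB     # space dedicated to trend logging
    daily_writes  = 16 * GIB    # application writes per day
    write_amp     = 3.0         # assumed write amplification factor

    for cycles in (50000, 20000):
        print("%5d P/E cycles -> roughly %.0f years of service"
              % (cycles, lifetime_years(log_partition, cycles, daily_writes, write_amp)))

The same part dropping from 50k to 20k cycles cuts the estimated lifetime proportionally, which is exactly why silent changes within a part number worry us.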

There's at least one eMMC vendor out there doing something rather interesting in their parts. They have a cache area used specifically for random write accesses. The cache consists of high-durability true SLC, and the internal flash manager presumably uses it like a journaling area to avoid write amplification, since partial page writes are possible throughout the cache area. If we were to go with that vendor, would having that internal architecture play into which FS would work best? Or do these kinds of architectures fall into the category of gimmicks that look great in sales pitches but don't really help much in practice?

Based only on what I know so far, I have to believe that any improvements made to create a "flash friendly" file system relate to compacting small files that originally got written to multiple pages down to a single page when the part wear-levels. That doesn't help with the initial write amplification, but it can at least reduce or mitigate the amplification of future writes performed due to wear leveling. Or is this already done in other FSs?

Samsung's F2FS filesystem

Posted Oct 10, 2012 17:57 UTC (Wed) by dwmw2 (subscriber, #2063)

Perhaps we should continue this in email...

The benchmarks you link to seem to indicate that DMA isn't being used; it's doing PIO instead. It's not clear why they'd do that. Is this a legacy platform? I'm not familiar with AM335x... what NAND driver does it use? I'm not surprised it uses excessive CPU time if it's doing everything with PIO. If you run your MMC controller in PIO mode it'll be slow too.

I don't have specific examples of changing devices without changing model numbers, but I've heard it repeatedly over the years from people analysing such devices. Personally, I've mostly focused on real flash and left MMC to other people.

Paired pages... when they put 2 bits per flash cell in MLC, you'd have hoped that those two bits would be in the *same* logical page, right? So they get programmed at the same time? But no. They are in *different* logical pages, just to make life fun for you. And they aren't even *adjacent* pages; they can often pair page 0 and 3, page 1 and 2 or something like that. $DEITY knows why.... actually ISTR it's for speed; programming both bits at once would be slower in the microbenchmarks.
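
If it helps, here's a toy Python sketch of why that pairing matters more than a bigger page size would, using a made-up 0&3 / 1&2 style pattern; real parts document their own ordering:

    # Toy model of MLC paired pages: two logical pages share the same
    # physical cells. The pairing pattern is made up for illustration.

    def paired_page(page):
        group, offset = divmod(page, 4)
        return group * 4 + {0: 3, 1: 2, 2: 1, 3: 0}[offset]

    # The hazard: by the time you program the "upper" page of a pair, its
    # partner may already hold committed data, and losing power during that
    # program can corrupt the partner as well. With a single larger page,
    # both bits would have been written in one operation.
    committed = {0, 1}           # pages already written and flushed
    for page in (2, 3):          # the filesystem now programs these pages
        partner = paired_page(page)
        if partner in committed:
            print("programming page %d risks already-committed page %d"
                  % (page, partner))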

Samsung's F2FS filesystem

Posted Oct 10, 2012 18:13 UTC (Wed) by cgrey8 (guest, #87131)

Send me an email, and I'll gladly fill you in on what I know. I don't seem to have a way to get your email address via your user name.

Samsung's F2FS filesystem

Posted Oct 10, 2012 18:41 UTC (Wed) by dwmw2 (subscriber, #2063)

https://www.google.com/search?q=dwmw2 should get you close ☺

